SMR

So I’ve been going through the tests making them work again.

I’m currently working on SMR.

Having not touched the SMR code for some time I come back to it fresh – and I have perceived what I think is a design flaw. Not a bug, it works, but, well, maybe it doesn’t work very well when load is high.

The core issue with SMR is knowing when it’s safe to advance the generation counter. Each SMR instantiation maintains a generation counter, which begins at 0, and when an allocation is submitted to SMR (to be returned when it is safe to reuse or free, i.e. when no thread could possibly access it) it is assigned the value at that time of the generation counter.

When the generation counter has advanced by *two* more than the value in an allocation (I can’t remember why two, offhand – there is a vital reason for it, or there was at any rate, as I initially had it set to one, and that was a bug; I realised why and changed it to two) then it is safe to be reused.

Now, the design flaw is this : to advance the generation counter, the user calls an SMR function, which checks for certain criteria (I’ll come to that in just a mo) and if they are satisfied, the generation counter is advanced.

Remember here we’re checking for the possibility of a thread accessing an element which has already been by one thread submitted to SMR for reuse – i.e. that thread has emerged from a read section holding the element, having removed it probably from a data structure, and wants it to be re-used. The problem SMR is solving is that other threads might be engaged at that very moment in lock-free work and hold pointers to that element. We need all threads to be known to have exited their read sections (the code which does lock-free work) or to have been idle.

The critera are that since the last call to this function, all threads have either been idle (which is to say, not entered a read section – i.e. have done no lock-free work), or have entered and then exited a read section. If however any thread is currently IN a read section, then it could be it holds a pointer to this allocation.

So the problem is this – if the data structure is very busy, and there are many threads, some threads will always or almost always be in a read section. We never get to advance the generation counter!

A solution which comes to mind is this : each thread in its per-thread SMR state keeps track of the number of tiems it has entered and exited a read section. It also keeps a copy of the values of those counters from when the user last checked to see if the generation counter could be advanced.

This way even if we have lots of busy threads, we can still advance the generation counter, becuse we can see *how often each thread has exited a read section*. So even if many threads are currently IN a read-section, we can still know it’s safe to advance the ccounter.

So now I have more work to do – in the sense that I need to make the current SMR pass it tests again (which is itself already quite improved since the benchmark app was working), get the benchmark app running again, benchmark, then make these changes, then benchmark again.