SMR Redux

So, my previous post about a design flaw in SMR, was incorrect.

Where I’d not touched that code for a while, I was not fully understanding what was going on. In fact, the code knows that a thread has at any time in the past (since the most recent generation advance) exited a read section, and so is not dependent on checking at a moment when no threads are in read sections.

So, I’ve been working on the tests, getting them to work, and I’ve finally come to a test which is really using SMR in anger, and of course this reveals to you what it’s like to actually use the API – where that API has changed, and is now no longer called every now and then when entering/exiting a read section or submitting an element, but explicitly and manually by the user.

One inherent aspect of SMR, which is visible and awkward for the developer, is that it is never possible to guarantee that an attempt to advance the generation counter will work. Another thread which is inside a read section blocks the advance of the generation counter beyond its current value. Of course, read sections are by design intended to be extremely brief, so in practise it’s not an issue – but it does mean no guarantee.

One thing which has become clear is that the function call which attempts to advance the generation counter has to be multi-threaded – it is ugly and onerous to put upon the user the burden of ensuring this is only called by a single thread at a time.

SMR

So I’ve been going through the tests making them work again.

I’m currently working on SMR.

Having not touched the SMR code for some time I come back to it fresh – and I have perceived what I think is a design flaw. Not a bug, it works, but, well, maybe it doesn’t work very well when load is high.

The core issue with SMR is knowing when it’s safe to advance the generation counter. Each SMR instantiation maintains a generation counter, which begins at 0, and when an allocation is submitted to SMR (to be returned when it is safe to reuse or free, i.e. when no thread could possibly access it) it is assigned the value at that time of the generation counter.

When the generation counter has advanced by *two* more than the value in an allocation (I can’t remember why two, offhand – there is a vital reason for it, or there was at any rate, as I initially had it set to one, and that was a bug; I realised why and changed it to two) then it is safe to be reused.

Now, the design flaw is this : to advance the generation counter, the user calls an SMR function, which checks for certain criteria (I’ll come to that in just a mo) and if they are satisfied, the generation counter is advanced.

Remember here we’re checking for the possibility of a thread accessing an element which has already been by one thread submitted to SMR for reuse – i.e. that thread has emerged from a read section holding the element, having removed it probably from a data structure, and wants it to be re-used. The problem SMR is solving is that other threads might be engaged at that very moment in lock-free work and hold pointers to that element. We need all threads to be known to have exited their read sections (the code which does lock-free work) or to have been idle.

The critera are that since the last call to this function, all threads have either been idle (which is to say, not entered a read section – i.e. have done no lock-free work), or have entered and then exited a read section. If however any thread is currently IN a read section, then it could be it holds a pointer to this allocation.

So the problem is this – if the data structure is very busy, and there are many threads, some threads will always or almost always be in a read section. We never get to advance the generation counter!

A solution which comes to mind is this : each thread in its per-thread SMR state keeps track of the number of tiems it has entered and exited a read section. It also keeps a copy of the values of those counters from when the user last checked to see if the generation counter could be advanced.

This way even if we have lots of busy threads, we can still advance the generation counter, becuse we can see *how often each thread has exited a read section*. So even if many threads are currently IN a read-section, we can still know it’s safe to advance the ccounter.

So now I have more work to do – in the sense that I need to make the current SMR pass it tests again (which is itself already quite improved since the benchmark app was working), get the benchmark app running again, benchmark, then make these changes, then benchmark again.

Progress

Ha. My post has the same name as a Russian spaceship.

A weekend ago I made liblfds compile again.

I’m now working on the test suite. It’s a lot of work. I’m going to release without the benchmark programme – it’ll just take too long.

So, current goal is to get test compiling.

I’d like to get it done this weekend, but we’ll see.

Managed to really bang a little toe into a metal bar, yesterday, so I’m stuck indoors today (and prolly tomorrow). There’s a Statue of Liberty kayak trip this Sunday, which by my trip up to WDC Thu/Fri I completely forgot to book in time – uber bleh 🙁 like the third time I’ve missed it, and it only runs every two weeks.

Status

Have made the gross code changes to liblfds, now need to slap it around until it compiles.

Then need to rewrite all the tests, and also the benchmark app too.

This thing has one hell of a tail. I need an automatic code adjuster, something which understands C and adheres to my code style; but that won’t help with test or benchmark.