SMR design flaw and improvements

So, the freelist rapid push/pop test, using SMR, revealed an SMR bug, which in turn revealed the changes I made to the SMR function for advancing the generation counter, such that that function became multi-threaded (rather than being a critical section bounded by CAS flag) were broken.

What I’ve come up with now instead is the idea that there is a “setting” phase, where threads set their SMR state flags, and then when a thread sees that the generation counter can be advanced, we then have a “clearing” phase, where threads calling the generation-advance function act not to check to see if the generation counter can be advanced, but rather they check to see if all the per-thread flags have been cleared, so that we’re then in the correct state for threads to start the work again of indicating the SMR state so we can again see if the generation counter can be advanced.

The whole point of all this is to ensure any single thread entering the check function can make forward progress – before, this was not the case. I think the design is sound, although I’ve not yet implemented, because…

…I’ve realised in the course of this there is a design flaw, and I’ve not yet resolved it. The handle idle threads, I have it so that when a thread enters a lock-free section, it sets a flag – “LOCK_FREE_IN_PROGRESS” – and lowers that flag on exit. Design flaw is that the code only uses a store barrier, so there’s no guaranteee that flag is visible to other threads (who have issued a load barrier) until an atomic operation is performed – which typically occurs in the act of *performing* the lock-free work being done, i.e. *after* we’ve read in and are using sensitive memory addresses.

Any real SMR has to support idle threads, so I have to think of a solution.

Two steps forward, one step backwards, the mantra of all lock-free design work 🙂

Some real actual work

Well wadda ya know – I’ve actually done some real actual more work today – about six or so hours. I’ve been working on the first real SMR test, making it work and getting a feel for how the new SMR API is to use. Found a design flaw in how SMR data structure cleanup was arranged and fixed it.

The basic use principles of the new SMR API seem to simpify out to something like; think only about your current thread, take it as read that anything you put into SMR won’t come out till you call SMR release processing, and remember that any given call to SMR release processing may release zero elements, but that sooner or later, calling SMR release processing will release elements, so you can often find the way to write code is to have your operation in a loop, which exits when the operation is successful, where in the loop you call SMR release processing if the operation failed.

Google! evil AND awful

I’ve had a little trouble recently emailing via my own server, so I had cause to try to resend one or two emails – I have a Gmail account, titled “Review Account”, which I use to publish reviews on Google Maps, and I thought to use this.

I logged into Gmail, then after literally ten minutes of hunting, including very very nearly being led into creating a G+ account without knowing it, I in my by then desperate clicking on anything at all stumbled across the method to change the account name and did so. (Along the way I discovered Google had added a new advertising option, which was set to “yes”, where my profile would be used to endorse products).

I then sent my emails. I then discovered they were still sent as “Review Account”. I presume it is in fact necessary to log out and then log back in, to have the name changed in the Gmail session.

All in all – “Google! evil AND awful”.

SMR design

So, I actually did a bit of Googling to see the literature for SMR. I wanted to see if I was missing something obvious. Probably I still am 🙂

However, it looks like I might actually have something minutely novel, so I figured I’d write it up a bit here.

As an aside, I’ve also now make the generation-advancing function safely multi-threaded, so callers don’t need to worry about that anymore, and you’ll still get forward progress, etc.

So, SMR design.

It’s epoch based.

There’s a main SMR state, and there’s one state per thread.

A thread registers its state with the main state, which has an atomic add-only list of thread states (which can be in the states active, available, retired, etc – it’s add-only, so callers can try to re-use an available state in the list; each thread state knows its NUMA node number, since we care a lot on the per-thread level about NUMA appropriateness).

Each thread state has a single state variable, which is used as a bitmask, so we can atomically (single CAS) modify all the state information in one operation.

The main state has a generation counter which begins at 0.

So, we begin operations – make a main state, each thread makes a per-thread state and registers with the main state.

All the thread begin idle.

Now a thread comes to perform an operation using lock-free operations; it calls a macro, “BEGIN_LOCK_FREE”, which non-atomically sets a bit in the per-thread state which indicates a read section is in progress, and issues a store barrier.

The thread then finishes the lock-free operations, and so calls a second macro, “END_LOCK_FREE”. This non-atomically clears the “read section in progress” bit, and sets a second bit, “exited read section”.

When a thread has removed an allocation from the (say) lock-free data structure and wants to reuse it, it places the allocation in a single-threaded list which is in the per-thread state, noting the generation count in the main state.

So, now we’re humming along – threads are entering and exiting read sections, threads are submitting elements for reuse.

Now what?

Now we come to the point where the user calls the function to try to advance the generation counter in the main state.

We iterate over the list of thread states in the main state. Now, each thread has two status bits we care about – one is a flag, which is raised when the thread enters a lockfree section and lowered when it leaves, the other is a flag which is raised when a thread exit a lockfree section; the threads themselves never lower this flag – it is lowered, and in every thread state, by this function we’re in now, which is trying to advance the generation counter, if and only if the generation counter is advanced.

What we need, to advance the generation counter, is that every thread has exited a lock-free section (which means all elements queued for reuse are safe) or has been idle (hasn’t entered a lock-free section at all, since the last scan, and so all elements queued for reuse are safe).

So; if we see the exited flag is set, we know a thread has been in and has exited a read section – we’re good.

If we see the active flag is set, and the exited flag is set, we’re still good.

However, if we see the active flag is set, and the exited flag is *not* set, then we’re screwed – we can’t advance the counter.

So, if we can’t advance the counter, we don’t, that’s that. We return.

If we can advance the counter, we do so – but now comes one final vital point.

So – we have threads queuing up elements for reuse. The generation counter begins at 0, so these elements have a generation count of 0. When we check to advance the generation counter, we need to know that every thread has been completely idle, or has exited a read section *after the element was submitted for reuse*. However, when we do get round to scanning the thread states, all we can see is that the exit bit has been set… so we know all the threads HAVE exited a read section, but we don’t know WHEN. It could have been (say) right after only the very first element was submitted – and then it might be one of the threads has been stuck in a long running read section all the time since then – which would mean only the first element was actually safe for reuse.

How do we handle this?

The answer is that when a thread comes to scan its release candidate list, comparing each elements generation counter with the current main state generation counter, elements are only released when the difference is greater than TWO, not ONE.

In other words – having done this first scan (and finding all threads have been idle or have exited), we do advance the generation counter *and we set the exited bit in each thread state to lowered* (this is vital, remember it) – but that does not mean the previous generation (0 in this case) can now be released. It cannot – for we do not know when each thread exited, so we cannot know which elements are safe to release. However, having lowered the exit bit, when we come to scan again, if we see *AGAIN* all threads have been idle or have exited a read section, THEN ALL THREADS MUST HAVE EXITED **AFTER** THE FINAL GENERATION 0 ELEMENT WAS SUBMITTED – which means generaton zero is now safe to release.

So we’re always a generation behind on releasing.

I’m thinking I must have missed an obvious way to simplify this – but I can’t see it. We have to know if a thread had entered a lockfree section (to detect idle threads). We have to know when all the threads have exited (so we can know to advance the generation counter). I mean, basically, the design is that threads indicate they’ve exited a lockfree section, and we notice this only whenever the user calls the generation counter advance function, so we can’t know when they exited, only that they have, so we have to have two rounds of every thread having exited to know the generation before last is clear and safe to release.

Inverted SMR

So, yeah, thought about it a bit.

What stops the generation counter from advancing past a given generation? a thread which is in a read section. So when a thread enters a read section, it posts in its per-thread state the current main state generation value, and then clears that when it exits the read section. When we come to release reuse candidates, we scan the per-thread states, pick up any busy threads’ posted main state generation value, and the lowest value of them is how far up to we can release reuse candidates.

Idle threads are always permissive – no need to have any extra house-keeping to detect them – and what’s nice is because we’re now reversed, it doesn’t matter if threads when posting read an older version of the main state counter – it just makes us less efficient, rather than breaking the system.

However, it still means the main state counter has to increment every time a thread enters a read section (doing only on reuse is no good – I think you end up needing a period where no threads are in read sections, to be able to release reuse candidates, otherwise on a busy system you end up always seeing threads are in read sections) and these increments do need to be atomic – if we lost a write, a thread would think it is in an earlier generation than it really is, so we could reuse elements not yet safe to reuse.

Basically, I’m barking up the wrong tree – for performance, all information which lets a scan to advance the generation counter has to be stored and maintained in the per-thread state, with only read access to the main state. Only the scan to advance the generation counter can write the main state.

This is how the current mechanism works.

New SMR design

So! I don’t think this idea works, because to make it work the performance would be too poor, but…

You have a main state, and a per-thread state.

Each per-thread state registers with the main state.

The main state holds a counter, the “current counter”, which begins at 0. Every time a thread adds an element to its reuse candidate list, it stores in the reuse candidate the value of the main counter, and atomically increments the main counter. (On a busy system, the contention on that counter would drive performance into the ground – however, there is a scalable counter design…)

There is another counter in the main state, the “safe counter”, which also begins at 0 – it begin safe to reuse elements up to this value.

Every time a thread exits a read section, it stores the current value of that main counter in its thread state.

When we come to check to see if we can advance the safe counter, we iterate over the thread states, looking at their counter (the value of the main state current counter when they last exited a read section), and find the lowest value of them all (ignore idle threads for a moment). We then advance the safe counter to this lowest value – i.e. this is the point after which not all threads have exited a read section, and so the reuse candidates cannot yet be reused.

To deal with idle threads, we keep a flag in the thread state, which is raised when the thread enters a read section and lowered when we check to advance the safe counter. If the flag is lowered, then the thread has been idle.

In conceptual terms, ignoring actual practial performance, the design has high fidelity with regard to knowing which elements can be reused. As each element has its own generation count, and we know to which generation count we’re safe to re-use elements, we reuse every possible element we can.

What strikes me now though is that perhaps another way is to invert the paradym; rather than assuming we can’t advance, and then checking to see how far it is safe to advance (and so having trouble with idle threads, since they are silent), what about assuming we can advance (and so idle threads naturally say the right thing), with blockers in the way for the points beyond which we cannot advance?

Have to think about this, might just be crazy in the first instance, will see.

SMR Redux

So, my previous post about a design flaw in SMR, was incorrect.

Where I’d not touched that code for a while, I was not fully understanding what was going on. In fact, the code knows that a thread has at any time in the past (since the most recent generation advance) exited a read section, and so is not dependent on checking at a moment when no threads are in read sections.

So, I’ve been working on the tests, getting them to work, and I’ve finally come to a test which is really using SMR in anger, and of course this reveals to you what it’s like to actually use the API – where that API has changed, and is now no longer called every now and then when entering/exiting a read section or submitting an element, but explicitly and manually by the user.

One inherent aspect of SMR, which is visible and awkward for the developer, is that it is never possible to guarantee that an attempt to advance the generation counter will work. Another thread which is inside a read section blocks the advance of the generation counter beyond its current value. Of course, read sections are by design intended to be extremely brief, so in practise it’s not an issue – but it does mean no guarantee.

One thing which has become clear is that the function call which attempts to advance the generation counter has to be multi-threaded – it is ugly and onerous to put upon the user the burden of ensuring this is only called by a single thread at a time.


So I’ve been going through the tests making them work again.

I’m currently working on SMR.

Having not touched the SMR code for some time I come back to it fresh – and I have perceived what I think is a design flaw. Not a bug, it works, but, well, maybe it doesn’t work very well when load is high.

The core issue with SMR is knowing when it’s safe to advance the generation counter. Each SMR instantiation maintains a generation counter, which begins at 0, and when an allocation is submitted to SMR (to be returned when it is safe to reuse or free, i.e. when no thread could possibly access it) it is assigned the value at that time of the generation counter.

When the generation counter has advanced by *two* more than the value in an allocation (I can’t remember why two, offhand – there is a vital reason for it, or there was at any rate, as I initially had it set to one, and that was a bug; I realised why and changed it to two) then it is safe to be reused.

Now, the design flaw is this : to advance the generation counter, the user calls an SMR function, which checks for certain criteria (I’ll come to that in just a mo) and if they are satisfied, the generation counter is advanced.

Remember here we’re checking for the possibility of a thread accessing an element which has already been by one thread submitted to SMR for reuse – i.e. that thread has emerged from a read section holding the element, having removed it probably from a data structure, and wants it to be re-used. The problem SMR is solving is that other threads might be engaged at that very moment in lock-free work and hold pointers to that element. We need all threads to be known to have exited their read sections (the code which does lock-free work) or to have been idle.

The critera are that since the last call to this function, all threads have either been idle (which is to say, not entered a read section – i.e. have done no lock-free work), or have entered and then exited a read section. If however any thread is currently IN a read section, then it could be it holds a pointer to this allocation.

So the problem is this – if the data structure is very busy, and there are many threads, some threads will always or almost always be in a read section. We never get to advance the generation counter!

A solution which comes to mind is this : each thread in its per-thread SMR state keeps track of the number of tiems it has entered and exited a read section. It also keeps a copy of the values of those counters from when the user last checked to see if the generation counter could be advanced.

This way even if we have lots of busy threads, we can still advance the generation counter, becuse we can see *how often each thread has exited a read section*. So even if many threads are currently IN a read-section, we can still know it’s safe to advance the ccounter.

So now I have more work to do – in the sense that I need to make the current SMR pass it tests again (which is itself already quite improved since the benchmark app was working), get the benchmark app running again, benchmark, then make these changes, then benchmark again.


Ha. My post has the same name as a Russian spaceship.

A weekend ago I made liblfds compile again.

I’m now working on the test suite. It’s a lot of work. I’m going to release without the benchmark programme – it’ll just take too long.

So, current goal is to get test compiling.

I’d like to get it done this weekend, but we’ll see.

Managed to really bang a little toe into a metal bar, yesterday, so I’m stuck indoors today (and prolly tomorrow). There’s a Statue of Liberty kayak trip this Sunday, which by my trip up to WDC Thu/Fri I completely forgot to book in time – uber bleh 🙁 like the third time I’ve missed it, and it only runs every two weeks.

The Hubris of Apple

Well, it took about eight, nine hours, but I finally have command line loading of music files into an iPod Shuffle 4th gen.

iTunes is as much a vastly bloated monstrousity as I remembered it to be from about six, seven years ago. The hubris here is beyond the pale – to load music into your device when you plug it in, Apple has arranged to have no less than three seperate processes running *permanently* on your PC. The iTunes installer – for the software for what is at the core of it copying an mp3 file from one device to another – 149mb. It actually installed no less than *six* seperate programmes.

I’m pleased with the audio quality of the 4th gen. I took a risk buying it. I was going to go for a new-in-box (but old, obviously) first gen, but the problem is the batteries age even without being used – the battery on it was probably shot, or not far from it. It is possible to buy a replacement battery and motherboard (they’re soldered together) but it’s quite difficult to dismantle the first gen for the swap.