After much thought, I’ve been integrating all three possible stacks (DCAS+freelist, CAS+SMR, CAS+freelist) into a single unified API and set of source files.
It’s actually going well. I had misgivings – I still do, to an extent – but it’s felt right enough as I’ve gone along that I’m feeling quite positive about it.
I realised something just now though – back on the question of having a single global SMR or one per data structure instance.
Having one SMR per data structure instance scales better, but not in the way you’d think. A single global SMR is only a problem when it comes to data structure instance deletion – which, unfortunately, impacts normal operation. During normal operation (apart from the delete problem) it’s actually better than one per data structure, because the thread state lists are basically read-only and are shared.
The delete problem is this: if on element release from SMR we need to perform operations which require the data structure state (one example being SMR+freelist, where on release we need to push the element back to the freelist), then on data structure instance delete we cannot delete the instance state until we know all elements have emerged from SMR, since they will access that state. For many data structure types (not all), the only way we can know that is to track the number of elements currently in SMR – which means every thread using the data structure is doing atomic inc/dec on a single shared counter. This does not scale. It’s bad enough that the stack and freelist work like this anyway – to have it *again* only makes things worse.
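To make the delete problem concrete, here’s a minimal sketch of what that shared counter looks like. All names here are invented for illustration (this is not the actual API): a per-instance atomic count of elements currently in SMR, bumped by every thread on entry and exit, which instance delete must wait to drain.

```c
#include <stdatomic.h>

/* Hypothetical per-instance state: every element entering or leaving SMR
   touches this one counter, so all threads contend on the same cache line. */
struct dsi_state
{
  atomic_size_t elements_in_smr;
};

static void element_entered_smr( struct dsi_state *dsi )
{
  atomic_fetch_add_explicit( &dsi->elements_in_smr, 1, memory_order_relaxed );
}

static void element_released_from_smr( struct dsi_state *dsi )
{
  /* this is where e.g. the push back to the freelist happens,
     which is why the instance state must still exist */
  atomic_fetch_sub_explicit( &dsi->elements_in_smr, 1, memory_order_release );
}

static void dsi_delete( struct dsi_state *dsi )
{
  /* cannot free the instance state until every element has emerged from SMR */
  while( atomic_load_explicit(&dsi->elements_in_smr, memory_order_acquire) != 0 )
    ;  /* spin / yield */

  /* now safe to free dsi */
}
```

The inc/dec pair is exactly the scalability problem described above: one cache line, invalidated by every thread on every release.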
So, the alternative, one SMR per data structure instance, has the problem that a thread cannot register once and then have its state tracked by every data structure instance (just not practical), so we have to do thread instance lookup. This means a call to get the thread ID (how slow is that? if it uses thread-local storage, it could be very slow – TLS seems in general to be poorly implemented) and then a hash lookup, and given that we’ll need a TLB lookup for the hash state, another for the element, and more as we walk down the list, you’re looking at *tons* of memory accesses. But – and here’s the thing – it’s all *uncontended* memory access. It’s not tons of threads invalidating each other’s cache lines on a single atomic counter.
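For comparison, here’s a sketch of the per-instance lookup path, again with invented names. The point is the memory-access pattern: bucket array, then a chain walk – several potential cache/TLB misses, but every line touched is read-only or thread-private, so no cross-thread invalidation.

```c
#include <stddef.h>
#include <stdint.h>

#define SMR_HASH_BUCKETS 64

/* Hypothetical per-thread SMR state, chained within a hash bucket.
   The real state (hazard pointers, epoch, etc.) would live in here too. */
struct smr_thread_state
{
  uintptr_t thread_id;
  struct smr_thread_state *next;
};

struct smr_state
{
  struct smr_thread_state *buckets[ SMR_HASH_BUCKETS ];
};

static struct smr_thread_state *smr_lookup( struct smr_state *smr, uintptr_t thread_id )
{
  /* one access for the bucket, one per node down the chain -
     lots of memory access, but all of it uncontended */
  struct smr_thread_state *ts = smr->buckets[ thread_id % SMR_HASH_BUCKETS ];

  while( ts != NULL && ts->thread_id != thread_id )
    ts = ts->next;

  return ts;
}
```

Getting `thread_id` in the first place is the part whose cost depends on the platform (TLS, a syscall, or a CPU register), which is the uncertainty mentioned above.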
One other problem though (mentioned in previous posts) with one SMR per DSI (data structure instance – I’m fed up of typing it) is thread retirement. When a thread goes away, you don’t know it’s happened. That’s awkward.