Drawback of SMR

So, SMR basically consists of two halves.

Firstly, a thread will (every now and then, for amortization) check to see if the generation counter can be advanced

Secondly, each thread will (every now and then) check to see if the generation counter has advanced and release its release candidates.

What this means is that there exists certain lengths of time during which element pile up inside the release candidate lists.

In particular we can imagine an SMR state which has a -lot- of threads and so the generation counter check is very slow, while other threads – which continue to work – are performing tight loops where an element or elements are being added to the release candidate lists.

Now, with data structures which malloc elements on each add and which simply free on release, this is fine. We simply absorb memory, then release it.

But for data structures which re-use elements, e.g. the data structure has a fixed number of elements (freelist or ringbuffer), such that the release of an element is actually to return that element to the data structure, we can see the exhaustion of elements in the data structure.

(Where, for example, if we used a pointer-counter pair to address ABA, this could not happen, as the return of each element to the data structure would be in-line with the threads processing).

A solution to this is that in the “get” API of the data structure, it must if it finds element exhaustion allocate a new element, such that it will continue to function and also increase the number of elements to the level required to operate continiously given the number of CPUs, load and useage.

The problem is that this requires a call to malloc() and we are a data structure which is supposed to be fast, by dint of pre-allocating its elements.

(As it is though, SMR currently requires a malloc per release candidate, to store details about the release candidate. I need at least to amortize that into a single malloc per n candidates).

For data structures which re-use elements, pointer-counter is a simpler and more appropriate design. However, this is only available on Intel and ARM (since you need contigious double-word CAS).

Comments are closed.