Monthly Archives: December 2011

Status

Set up Ubuntu (64 bit and 32 bit) under VirtualBox. Spent two hours trying to make shared folders work. Eventually gave it up as a bad, bad job and used an FTP server.

Library now compiles under 64 bit Linux. I was going to use the new gcc C1X atomic instrincs, except they do not appear to be present in my latest-version Ubuntu (stdatomic.h is not present).

Added ssf_queue, which required removing from the M&S its built-in ABA handling. This turned out to be straightforward.

Then added ssf_ringbuffer. I’m not sure this makes sense now though. One of the points of a ringbuffer is element pre-allocation. Using ssf_queue (which ssf_ringbuffer does, internally) means we have per-element alloc occurring. This is wrong, but to fix it rather implies making the use of malloc optional – having the API simply have there a function call for allocation, which may in fact for example call to a freelist.

So I need to figure out where I’m going with that.

Next other thing is back to ssp_slist. I’ve implemented the unlink function and I think it’s right, but of course it probably isn’t, and that as it is took some time to figure out. I have now to add the link function. Once the library compiles I can then make ssp_hash (which is saf_hash, using ssp_slist, plus adding delete functionality).

Then I’ll need to debug ssp_slist fully before making the small changes necessary to produce dsf_slist, which in turn permits dsf_hash.

Then it’s sort out benchmarking and write unit tests for everything and its dog; then sort out the Windows kernel port, check the Linux 64 and 32 bit ports work okay and then having a long, slow maturation period, where I look over the code, fix stuff up, make it nice. I’ve a to-do list of about twenty items already :-)

Oh and then I need to write and update documentation, which will take a solid week.

ssp_list

Tim Harris’ singly linked list turned out to be more complicated than I had realised – it has helping. If during a list traversal you bump into a logically deleted element, you help; you perform the physical delete.

I think I’m going for the first time to add a dummy head element; it allows easy use of functions in the main behaviour loops, whereas otherwise you end up writing everything in one big lump (because list_state->head is a special case, where you find the first element has its delete bit set, as you need to change list_state->head rather than list_element->next).

Status

saf_hash compiles.

Next, either ssp_slist and ssp_hash, or adding atomic_swap to libabstraction and writing dsf_slist and dsf_hash.

Probably first the former.

Once they’re both done, leaves ssf_queue (which will take some thought, since dff_queue, the original source, has a built-in ABA scheme) and ssf_ringbuffer, where ssf_ringbuffer is a no brainer.

And then we’re done… except for actually making it all work, writing unit tests and benchmarks, making it compile on all target platforms (six of them) and documentation.

The main question is how to go about with unit tests. I think I’ve come to the end of the road with the ad hoc method I’ve used so far of just putting together sensible tests. There’s just too much code. I’m considering a fuzzy testing engine – after all, with lock-free code, you can execute any API call at any time. Why not have every thread randomly executing every API call, for say sixty seconds at a time? problem is, you might not well uncover timing issues – with a tailored test, you have all n threads hammering away in the same little area at the same time. With a fuzzy test, you have less control; you could limit the number of elements in the data structure to try to force proximity, but it’s rather vague. OTOH, what’s nice about it is a fuzzy test engine works for ALL APIs, with almost zero effort – you just populate an array with function pointers and let it go.

Status

saf_btree compiles.

saf_hash now (since it will use saf_btrees rather than saf_lists, to improve performance).

Status

Add-only linked list complete.

Included capability to add in-order.

Removed capable to logically delete – it makes everything much more complicated (and so probably buggy).

Add-only hash now.

Status

Create ssf_freelist (single word CAS, SMR, patent-free).

Renamed RCU to smf_epoch (single CAS, memory handler, patent-free).

Changed the semantics of the guaranteed_pop/push functions for freelist and stack; they will first try to perform a normal pop/push and then only if that fails will they malloc a new element. (I had assumed this would be the normal use for this function, but then I realised it was about the only normal use for this function, so I moved it into the API).

Adding saf_hash (single word CAS, add-only, patent-free). This will in fact reasonably nicely support logical delete; any logically delete element of course remains in the data structure (as it is add-only) but it will be re-used by the next insert which passes by that element.

Probably will write saf_btree next, just to get it out of the way.

Still need to write unit tests and benchmarks for all these new data structures, which will take quite some time.

In fact, the next release will simply be a full re-write of all the test code, into something properly modular and decent.

Kenney’s RCU patent

I this morning read Kenney’s patent.

Epoch based SMR (each thread having flags indicating whether it’s entered/exited from a quiescent state) is nothing like Kenney’s method; the patent talks explicitly about a token being passed around between threads.

Currently I have RCU in a separate library which is LGPL; I’m integrating this now into liblfds proper and renaming, because I’m not doing Read-Copy-Update at all (for one thing, there’s no copy! I simply remove the element from the data structure).

RCU works

Two solid days of debugging.

Most of today was spent thinking over RCU until I realised when checking the thread-local list of release candidates, I can only delete thoses which are *two* or more generations older than the current generation count for the thread (which in turn has just been updated from the master RCU state). I was instead only checking that the release candidate generation was *smaller* than the current generation count for the thread.

After that, there was a memory leak – in the release candidate callback, I was omitting to free the release candidate structure itself.

So RCU and SMR stack unit tests now happy.

Going to check the patent for RCU and see if epoch based SMR avoids the claims.

Hope it does because then I won’t have to have RCU in a separate, LGPL, library (librcu); it instead will be part of liblfds proper.

Need to add atomic_swap() to the abstractions, both for RCU and for contigious double-word CAS singly linked list.

Debugging SMR

Single threaded RCU works. Yayness =-)

Avoiding Tim Harris’ delete bit patent

AFAIK there are two known lock-free methods for implementing a lock-free linked list; Valois’ from 1995 using dummy nodes between objects (with bugs later corrected by M&S of the queue fame; it’s a problematic solution) and Harris from 2000, with the delete bit, which is the viable solution.

Harris (I phoned him – this is from him) invented the method while an intern at Sun. (Pretty bloody impressive :-) it means Sun owns the patent – which means Oracle now own the patent.

Interestingly, the patent was filed for in Nov 2000 but only granted in 2006. A friend of mine, a retired British patent agent, tells me that’s a long time; that they may have had trouble getting it past the examiner.

Now, important fact; the patent for this method exists and only exists in the USA. Patents have not been applied for (and remember – the original was filed twelve years ago now) in other domains.

The reason for this almost certainly is because only the USA permits software patents. The rest of the world wouldn’t allow this as a patent. (I made no comment on whether this is right or wrong; it is an observation).

So the first thing to consider is this; if you’re not using the method in the USA, you’re legally free to use it.

Now the biggie; I may be wrong – I may WELL be wrong (but let us see) – but I think there is a way to avoid the patent – so you can use the basic method (two stage delete, logical then physical) even in the USA.

(Note : I have written a British patent, with proof reading assistance from my retired patent agent friend. It’s only one, but it took bloody ages and was a LOT of work and it means I have some idea of what patents are about, how they’re structured internally and the legal significance of each section and what you have to do to avoid a patent).

The drawback is you need contigious double-word CAS, so this approach only works for x86/x64 and ARM.

Now, I’m assuming you’re familiar with the method described in Tim’s patent.

First, I will describe the technical alternative and then second I will explain why I think this avoids the patent.

So, the technical details;

The patented method uses single-word CAS and so steals a low-order bit from the pointer as the logical delete flag and must use CAS to raise that flag, since the pointer could change at any time.

Instead, with contigious double-word CAS, we can use the second word to hold the delete bit (wasteful, but hey :-) and as such we can use an atomic swap to raise the delete flag.

We can always blindly write, because it doesn’t matter if the delete flag is already raised.

Note I’m using garbage collection (RCU/epoch based) so I know the node won’t go away.

That’s it.

Now, the legal details;

To avoid a patent you basically have to avoid the claims. Claims are typically (as they are in this case) formed in chains – there is a root claim and then a series of claims which depend on that root. Avoid the root and you avoid the chain. In this patent, there are three roots, claims #1, #16 and #25.

The key issue that I see is the use throughout the claims, of the phrase “the update being conditional on the examination”.

Claim 1 : A non-blocking concurrent shared object representation encoded in a computer-readable medium, comprising: a linked-list of nodes encoding of a group of zero or more values; and linearizable operations defined to implement semantics of at least insert and remove operations on the group, wherein concurrent execution of the linearizable operations is mediated using a first synchronization primitive to encode a marked node indication signifying logical deletion of a corresponding one of the values from the group, wherein the first synchronization primitive atomically examines and updates a single target, the updating being conditional on the examination.

So – we have a linked list, it’s linearizable, we have a two stage delete – logical then physical – where first synchronization primitive atomically examines and updates a single target, the updating being
conditional on the examination
.

The patent uses single-word CAS and steals a low-order bit, so it has to use CAS to raise the logical delete flag.

Using contiguous double-word CAS means we can hold the logical delete bit in the second word and use a swap to raise the logical delete flag; the updating is not conditional on the examination. Claim 1 is avoided.

Claim 16 : A method of managing access to a linked-list of nodes susceptible to concurrent operations on a group encoded therein, the method comprising: separating deletion of a value from the group into at least two functional sequences; the first functional sequence performing a logical deletion of the value using a synchronization primitive to mark a corresponding one of the nodes; and the second functional sequence excising the marked node from the linked-list, wherein the synchronization primitive atomically examines and updates a single target, the updating being conditional on the examination.

Same here; for the first stage, the logical delete, there is no update conditional on examination. There is for the second, the physical excising, but it doesn’t matter – the claim requires update conditional on examination for both.

Claim 25 : A computer program product encoded in at least one computer readable medium, the computer program product comprising: at least two functional sequences providing non-blocking access to a concurrent shared object, the concurrent shared object instantiable as a linked-list of nodes
encoding of a group of zero or more values; a first of the functional sequences defined to implement semantics of an insert operation on the group; and a second of the functional sequences defined to implement semantics of a remove operation on the group, wherein instances of the functional sequences are linearizable and concurrent execution thereof by plural processors of a multiprocessor is mediated using a synchronization primitive to encode a marked node indication signifying logical deletion of a corresponding one of the values from the group with separate physical excision of the corresponding node, wherein the synchronization primitive atomically examines and updates a single target, the updating being conditional on the examination.

And exactly the same here. The claim requires the use of CAS (update being conditional on examination) for both stages and we avoid it for the first stage.