build system progress

Happy joy!

Written platforms.json and targets.json, wrote the Python to export from SVN, make a tar.bz2, sftp to a remote host and then run make; found my first bug, too! “-mcx16” is an x86-specific GCC flag, and GCC on ARM32 throws an unrecognized-option error for it.

Once the build system is up, I then need to get back to building every version of GCC on every platform, so I can compile with the lot.

After that, clang.

Actually, the more fun bit will be using Vagrant to run VirtualBox VMs, and configuring Windows to accept SSH/SFTP.

Build system OTW


Finally – writing a build system for liblfds.

I know there are plenty of build systems out there, but I normally find that figuring out how to make an existing build system work for your own source code is more work than writing your own from scratch.

It’s written in Python. There’s a “platforms.json” in the build system itself which describes all the platforms available – which is to say, the compilers they offer, which processor type, IP address, etc.

There’s a “targets.json” in each entity which is built, which describes the platforms each build directory supports (there’s one build directory for each distinct set of build files – i.e. Makefile for GCC, Makefile for MSVC, WDK 7.1 build files, etc). So this would indicate the compilers which can be used, the processors, hosted or unhosted platforms, makefile targets, etc.
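As a hedged illustration only – the field names here are invented for the sake of the example, not the actual schema – a platforms.json entry might look something like this:

```json
{
  "platforms": [
    {
      "name": "raspberry_pi_2",
      "processor": "arm32",
      "ip_address": "192.168.1.50",
      "compilers": [ "gcc" ]
    }
  ]
}
```

A targets.json in a build directory would then name which of these platforms (compilers, processors, hosted or unhosted, makefile targets) that directory's build files support.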

It’s assumed everything has a gnumake makefile (even if it’s just a courtesy makefile, which calls a different build system) and that I can use SSH/SFTP to get files in/out and issue commands.

DWCAS bug fix – “new” 7.1.1 released

If you downloaded 7.1.1 prior to 21:50 GMT+1 22nd Feb, please download it again.

One of the bug fixes was for DWCAS with GCC >= 4.1.2 and less than (I can’t type the less than symbol – wordpress in its awesome majesty doesn’t escape it, and it totally breaks all following formatting…!) 4.7.3. This platform is still not available for testing – a **lot** of work has gone into learning how to build GCC and glibc, to get access to all GCC versions, but that has not yet been successful.

There is a script which converts liblfds version “n.n.n” to “m.m.m”. This script was run, *and then the DWCAS macro was added in*… but with the 7.1.0 version numbering, and this was not noticed.

Result – the macro doesn’t exist, so the arguments end up being a comma separated list of variable names, and of course this doesn’t work as expected!

Rather than making a whole new release for this, since the original has only been out for two days, 7.1.1 has been re-released.

This whole thing totally sucks. Obviously it’s well understood that if something isn’t tested, it doesn’t work; but there’s no obvious way to get hold of that test platform.

The work to get GCC and glibc built has to continue – but this is a nightmare task, truly one of the very worst things you could imagine, which is why after six weeks of non-stop work it still isn’t complete.

If you need to check which “version” of 7.1.1 you have, look at the file “liblfds7.1.1/inc/liblfds711/lfds711_porting_abstraction_layer_compiler”. Look at line 322. If you see “LFDS710_PAL_ATOMIC_DWCAS” you have the broken version. If you see “LFDS711_PAL_ATOMIC_DWCAS” you have the fixed version.

7.1.1 is out

As subject.

If you use the unbounded/many/many queue from 7.1.0, upgrade for a bugfix. See previous post for an explanation of the bug and why it was not found by the test suite.

Bug Report : 7.1.0 unbounded, many producer, many consumer queue

    Bug Report


There is a serious bug in the 7.1.0 unbounded, many producer, many consumer queue and users absolutely should upgrade to 7.1.1, which will be released as soon as possible (the next day or two).

The bug is *not* in the lock-free code, but rather in the conventional code around it which handles backoff. This code was “improved” in 7.1.0 to make it more graceful and easier to read. This introduced the bug. All other versions of liblfds are unaffected. All other data structures are unaffected.

The bug can in principle occur under any workload, but the heavier the workload, the more likely it becomes to manifest. In practice, to be sure it occurs, a maximum-load use case is required, i.e. threads which are busy-looping on the queue, with the queue under enough load to be continually shifting elements in and out.

    The Bug

The queue contains a dummy element. Even when the queue is empty, there is one element in it; this element contains no valid data. If the queue has, say, ten elements enqueued to it, there are *eleven* elements in the queue. When you dequeue, the key and value of the first enqueued element are copied into the dummy element, the dummy element is removed from the queue and given to the user, and that first element becomes the new dummy element.

The bug is that if the queue was empty at the time of a dequeue attempt, or became empty during a dequeue attempt, the dummy element would have written into its key and value either NULLs, or the key and value of an element which had been in the queue (but was dequeued before this dequeue attempt completed).

Normally this is (although completely broken and wrong) harmless, because the dummy element carries no valid data.

However, a race condition can occur, whereby *just* before the incorrect key and value are written into the dummy element, the dummy element is *REUSED*, i.e. it has a REAL key and value put into it and it enqueued into a queue.

It then has its key and value overwritten – and this of course is deadly.

    Test Suite Failure

The relevant tests in the test suite are;

1. libtest_tests_queue_unbounded_manyproducer_manyconsumer_dequeuing

This test has a pre-populated queue, where one thread per logical core concurrently dequeues. The elements, when dequeued, are just returned to a freelist, not reused. Each thread checks the value obtained from the queue, but only dequeues once on empty (a return of 0 from the dequeue function causes the thread to stop working), so with that tiny window it’s almost impossible to actually induce the bug – and even if it were induced, it would do no harm, as the queue elements are not being reused.


2. libtest_tests_queue_unbounded_manyproducer_manyconsumer_enqueuing_and_dequeuing

This test has one queue and one thread per logical core. The queue begins empty, and each thread has one queue element. The threads run a tight loop, where they enqueue and then immediately dequeue. This test fails to find the problem because the queue is never empty.

    The Fix

The libtest_tests_queue_unbounded_manyproducer_manyconsumer_enqueuing_and_dequeuing test has been improved to provoke the bug. The code for handling backoff has been fixed, and it is seen that the bug no longer manifests.

Release 7.1.1 will be published as soon as humanly possible.

singly linked-list design

I can figure out how to make a practical *ordered* singly-linked list, but not a practical *unordered* singly-linked list. Bit of a surprise!

The key problem is iterating over the list.

Imagine you have a list, and you want to iterate over it.

You have a cursor; it points at the element you’re currently looking at, and prevents it from being deallocated. It does not – cannot – prevent it from being *unlinked*.

Now, when you want to move to the next element, you get the next pointer of your current element (which is to say, you point a hazard pointer at it, and then get hold of the value of the hazard pointer); this means the next element also can’t be deallocated – but it still can be unlinked.

So the problem is this : your cursor is on an element, any element, that’s fine (just assume we got there – it doesn’t change anything – we’re setting up a scenario which illustrates the problem). We get our next pointer, that’s fine. Before we move to the next element, though, someone else unlinks it.

Well, huh. So at this point, the next element we’re about to move to is unlinked – but it still points at what was *its* next element, so we’re still okay… (and we know the element has been unlinked, because its unlink warning bit has been set by the thread which unlinked it).

But what now happens if the element *after* our next element is unlinked *and* deallocated – which it can be, as we have no hazard pointer on it.

Because our next element has been unlinked, its next pointer is *not* updated by the unlink of its next element. So now the cursor is pointing at an element which has a next pointer pointing at deallocated store.

So what we see is that if the unlink bit is set, we have to abort the iteration.

Now, if we’ve aborted the iteration, well, then what? Users will often want to iterate over the whole list and look at each element just once. So we need to restart the iteration, at the point we left off.

But in an unordered list, we can have no idea where that point is, because the entire list could have been deallocated and replaced in the meantime.

We can only know where that point is if the list is *ordered*. Then the key held by the unlinked element which derailed us – which we still hold a hazard pointer to, and so which cannot be deallocated – lets us know how far we have to fast-forward over the list to get back to the point where we can recommence the iteration.

singly-linked list update

So, it’s all very interesting.

You need a cursor to iterate over the list – because you need a pair of hazard pointers to iterate.

There’s a problem here though, I think – when we deregister an SMR hazard pointer per-thread state, there can be allocations still in its list of allocations pending reclamation. An unused thread state with lingering elements is always checked by the other threads; so either we can’t re-use it until it’s clear (which is awkward as hell for the user – remember, we will not look to dynamic allocation, as the library performs no allocations) or, when we re-use it, we rely upon the now-owning thread to run scans. Problem is, that now-owning thread is expecting to be the sole owner – he has no idea he might have gigabytes of store sitting in his pending reclamation list! So I have to find a nice solution to this.

Another problem I’m currently grappling with is that although the hazard pointers prevent any element you’re using from being deallocated, they don’t stop it from being unlinked.

If your element is unlinked, it could be that the element after it is unlinked *and deallocated* before you try to get_next with the cursor – the problem then is that the next pointer in your element is now pointing at a deallocated allocation, because of course it wasn’t updated when that next element was unlinked…!

I’m vaguely thinking there needs to be a “deleted bit” *IN* an element, which is an immediate logical delete, which is followed then by the usual delete bit in the predecessor element, as part of the two-step unlink (which I think is still needed – but not certain).

However, this is because I’ve not got the way it all works clear yet in my head, so it could be complete garbage.


Have implemented the freelist and queue with hazard pointers. Will do the stack once the freelist tests are written and pass.

Now implementing – finally, after all these years! – a lock-free singly-linked list, with support for real delete =-)

Once that’s done, I’ll think about how complex it is to support delete in the unbalanced tree.

English. Trying speaking it. Words have meanings, that sort of thing.

I’m joining – trying to join, as ever you have to struggle and fight to get past all the incompetence and mistakes every company makes in its desperate effort to commit suicide – Revolut, a FinTech.

They ask for four photos for proof of ID – front and back of passport, pic with yourself and passport next to your face, and then a pic of yourself.

I photo the front and the back, open the passport and hold it next to my face for a pic, and then take a pic of me.

I send this off (after checking their SSL cipher suite is okay).

“When we said front and back we did not mean the front and back of the passport; we meant the two pages (author’s note – on the inside!) which have the identity information.”

“Yes, we do have other people sending us the front and back!”

Using lock-free data structures in a “single-threaded mode”

I have recently implemented hazard pointers.

As with epoch-based safe memory reclamation, each thread maintains a list of allocations which are pending reclamation.

This list must support delete, and where it’s per-thread, it can be single-threaded, so no problems there.

One solution then is to provide in the library a single-threaded list.

However, there is another possible solution, which also provides useful functionality and avoids the need for this singleton single-threaded data structure: allow the lock-free list to be operated in a single-threaded mode.

The user would need to know, a priori, that at some point the list would only be operated on by a single thread. That thread would enter single-threaded mode with a load barrier, then Do Stuff, and then finish with a store barrier and a forced write to memory. Other threads wishing to use the list again in normal multi-threaded mode can, I think (not yet checked the code), do so without special effort, since they get load barriers anyway – but if not, then any thread wishing to use that list again must issue a macro (which turns into a load barrier).

Similar functionality would be useful for all of the data structures, I think; I can imagine times when they are operated on in a single-threaded mode, and then you simply do not want the overhead and costs of the atomic operations – but you DO still want to operate on the SAME data structure you will then use multi-threaded.

There are two efficiency concerns.

The first is that the data structure C-language structures often have alignment constraints on some of their members. This is fine for CAS platforms, but not necessarily for ERG platforms – the worst case on ARM is 2048 bytes, so you end up with fat structures. You would avoid this if you had a single-threaded *only* data structure, because then you don’t care about alignment. Still – if we are arguing the use case where we want single-threaded operations on a data structure which we either before or after want to use multi-threaded, then there’s no avoiding it.

The second is to do with the type qualifier volatile. Volatile is necessary to prevent compiler optimization. The compiler can in theory put a value into a register and keep it there forever – load barriers, which could cause the cache line holding that value to change, do not cause the register to be updated, and so the compiler would continue forever to use an old value. This will break most data structures, or at least prevent forward progress :-)

One common solution is to use a non-volatile object, take a pointer to it, cast that pointer to be volatile-qualified, and access through it.

The Standard does not support this behaviour; by the language of the Standard, the compiler decides what to do with regard to volatile based on the type qualifier of the *object*, not the pointer.

However, GCC has supported this behaviour for a long time. I’m not sure what MSVC does, or other compilers.

What this means is that in single-threaded mode, the lock-free data structures must retain their volatile qualifiers, which makes them less efficient. We can avoid atomic operations, which is great, and we can perform operations which otherwise we could not (deletes, for example) – but we can’t safely avoid volatile.

The problem is that if we did remove volatile, and used casts when we wanted volatile access, *but the compiler does not support it*, it would lead to a fault which would be intermittent and hard to diagnose.

The issue itself is difficult to explain to users – if it were for example exposed via the abstraction layer, by making volatile a define – and it is equally difficult for users to actually find out how their version of their compiler deals with volatile.

Correctness is the first attribute of software. You cannot ever put it at risk; software which does not do as it was intended has failed.

Given that it is problematic to determine whether a compiler will honour the type qualifier of the pointer, it has to be assumed it will not.