build system progress

Happy joy!

Written platforms.json and targets.json, wrote the Python to export from SVN, make a tar.bz2, sftp to a remote host and then make; found my first bug, too! “-mcx16”, which is an Intel specific flag for GCC, throws an error as an unrecognized option when using GCC on ARM32.

Once the build system is up, I then need to get back to building every version of GCC on every platform, so I can compile with the lot.

After that, clang.

Actually the more-fun bit will be using vagrant to run VirtualBox VMs, and configuring Windows to accept SSH/SFTP.

Build system OTW

Yay!

Finally – writing a build system for liblfds.

I know there are plenty of build systems out there. I normally find trying to figure out how to make an existing build system work for your own source code is more work than writing your own build system from scratch.

It’s written in Python. There’s a “platforms.json” in the build system itself which describes all the platforms available – which is to say, the compilers they offer, which processor type, IP address, etc.

There’s a “targets.json” in each entity which is built, which describes the platforms each build directory supports (there’s one build directory for each distinct set of build files – i.e. Makefile for GCC, Makefile for MSVC, WDK 7.1 build files, etc). So this would indicate the compilers which can be used, the processors, hosted or unhosted platforms, makefile targets, etc.

It’s assumed everything has a gnumake makefile (even if it’s just a curtsey makefile, which calls a different build system) and that I can use SSH/SFTP to get files in/out and issue commands.

DWCAS bug fix – “new” 7.1.1 released

If you downloaded 7.1.1 prior to 21:50 GMT+1 22nd Feb, please download it again.

One of the bug fixes was for DWCAS with GCC >= 4.1.2 and less than (I can’t type the less than symbol – wordpress in its awesome majesty doesn’t escape it, and it totally breaks all following formatting…!) 4.7.3. This platform is still not available for testing – a **lot** of work has gone into learning how to build GCC and glibc, to get access to all GCC versions, but that has not yet been successful.

There is a script which converts liblfds version “n.n.n” to “m.m.m”. This script was run, *and then the DWCAS macro was added in*… but with the 7.1.0 verson numbering, and this was not noticed.

Result – the macro doesn’t exist, so the arguments end up being a comma separated list of variable names, and of course this doesn’t work as expected!

Rather than making a whole new release for this, since the original has only been out for two days, 7.1.1 has been re-released.

This whole thing totally sucks. Obviously it’s well understood that if something isn’t tested, it doesn’t work; but there’s no obvious way to get hold of that test platform.

The work to get GCC and glibc built has to continue – but this is a nightmare task, truly one of the very worst things you could imagine, which is why after six weeks of non-stop work it still isn’t complete.

If you need to check which “version” of 7.1.1 you have, look at the file “liblfds7.1.1/inc/liblfds711/lfds711_porting_abstraction_layer_compiler”. Look at line 322. If you see “LFDS710_PAL_ATOMIC_DWCAS” you have the broken version. If you see “LFDS711_PAL_ATOMIC_DWCAS” you have the fixed version.

7.1.1 is out

As subject.

If you use the unbounded/many/many queue from 7.1.0, upgrade for a bugfix. See previous post for an explanation of the bug and why it was not found by the test suite.

Bug Report : 7.1.0 unbounded, many producer, many consumer queue

    Bug Report

    Introduction

There is a serious bug in the 7.1.0 unbounded, many producer, many consumer queue and users absolutely should upgrade to 7.1.1 as soon as possible, which itself will be released as soon as possible (the next day or two).

The bug is *not* in the lock-free code, but rather in the conventional code around it which handles backoff. This code was “improved” in 7.1.0 to make it more graceful and easier to read. This introduced the bug. All other versions of liblfds are unaffected. All other data structures are unaffected.

The bug can in principle occur under any workload but the heavier the workload, the more likely it becomes to manifest. In practise, to be sure to occur, a maximum-load use case is required, i.e. threads which are busy-looping on the queue, and the queue has enough load to be continually shifting elements in and out.

    The Bug

The queue contains a dummy element. Even when it is empty, there is one element in the queue; this element contains no valid data. If the queue has say ten elements enqueued to it, there are *eleven* elements in the queue. When you dequeue, the key and value placed in the first enqueued value are placed in the dummy element, the dummy element is removed from the queue and given to the user, and the element after that becomes the new dummy element.

The bug is that if the queue was empty a the time of a dequeue attempt, or became empty during a dequeue attempt, the dummy element would have written into its key and value either NULLs, or the key and value of an element which had been in the queue (but was dequeued before this dequeue attempt completed).

Normally this is (although completely broken and wrong) harmless, because the dummy element carries no valid data.

However, a race condition can occur, whereby *just* before the incorrect key and value are written into the dummy element, the dummy element is *REUSED*, i.e. it has a REAL key and value put into it and it enqueued into a queue.

It then has its key and value overwritten – and this of course is deadly.

    Test Suite Failure

The relevant tests are in the test suite are;

1. libtest_tests_queue_unbounded_manyproducer_manyconsumer_dequeuing

This test has a pre-populated queue, where one thread per logical core concurrently dequeues. The elements when dequeued are just returned to a freelist, not reused. Each thread checks the value obtained from the queue, but only dequeues once on empty (getting a return of 0 from the dequeue function causes the thread to stop working) so it’s almost impossible with that tiny window to actually induce the bug, and even if it was induced, it would do no harm, as the queue elements are not being reused.

2.
libtest_tests_queue_unbounded_manyproducer_manyconsumer_enqueuing_and_dequeuing

This test has one queue and one thread per logical core. The queue begins empty, and each thread has one queue element. The threads run a tight loop, where they enqueue and then immediately dequeue. This test fails to find the problem because the queue is never empty.

    The Fix

The libtest_tests_queue_unbounded_manyproducer_manyconsumer_enqueuing_and_dequeuing test has been improved to provoke the bug. The code for handling backoff has been fixed, and it is seen that the bug no longer manifests.

Release 7.1.1 will be published as soon as humanly possible.