Bug Report : 7.1.0 unbounded, many producer, many consumer queue

    Bug Report

    Introduction

There is a serious bug in the 7.1.0 unbounded, many producer, many consumer queue and users absolutely should upgrade to 7.1.1 as soon as possible, which itself will be released as soon as possible (the next day or two).

The bug is *not* in the lock-free code, but rather in the conventional code around it which handles backoff. This code was “improved” in 7.1.0 to make it more graceful and easier to read. This introduced the bug. All other versions of liblfds are unaffected. All other data structures are unaffected.

The bug can in principle occur under any workload but the heavier the workload, the more likely it becomes to manifest. In practise, to be sure to occur, a maximum-load use case is required, i.e. threads which are busy-looping on the queue, and the queue has enough load to be continually shifting elements in and out.

    The Bug

The queue contains a dummy element. Even when it is empty, there is one element in the queue; this element contains no valid data. If the queue has say ten elements enqueued to it, there are *eleven* elements in the queue. When you dequeue, the key and value placed in the first enqueued value are placed in the dummy element, the dummy element is removed from the queue and given to the user, and the element after that becomes the new dummy element.

The bug is that if the queue was empty a the time of a dequeue attempt, or became empty during a dequeue attempt, the dummy element would have written into its key and value either NULLs, or the key and value of an element which had been in the queue (but was dequeued before this dequeue attempt completed).

Normally this is (although completely broken and wrong) harmless, because the dummy element carries no valid data.

However, a race condition can occur, whereby *just* before the incorrect key and value are written into the dummy element, the dummy element is *REUSED*, i.e. it has a REAL key and value put into it and it enqueued into a queue.

It then has its key and value overwritten – and this of course is deadly.

    Test Suite Failure

The relevant tests are in the test suite are;

1. libtest_tests_queue_unbounded_manyproducer_manyconsumer_dequeuing

This test has a pre-populated queue, where one thread per logical core concurrently dequeues. The elements when dequeued are just returned to a freelist, not reused. Each thread checks the value obtained from the queue, but only dequeues once on empty (getting a return of 0 from the dequeue function causes the thread to stop working) so it’s almost impossible with that tiny window to actually induce the bug, and even if it was induced, it would do no harm, as the queue elements are not being reused.

2.
libtest_tests_queue_unbounded_manyproducer_manyconsumer_enqueuing_and_dequeuing

This test has one queue and one thread per logical core. The queue begins empty, and each thread has one queue element. The threads run a tight loop, where they enqueue and then immediately dequeue. This test fails to find the problem because the queue is never empty.

    The Fix

The libtest_tests_queue_unbounded_manyproducer_manyconsumer_enqueuing_and_dequeuing test has been improved to provoke the bug. The code for handling backoff has been fixed, and it is seen that the bug no longer manifests.

Release 7.1.1 will be published as soon as humanly possible.