r6.1.0:Testing and Benchmarking Guide
Introduction
A command-line test and benchmark program is provided with liblfds. This program, which has its own small abstraction layer which must be implemented, links against liblfds and can test and/or benchmark the abstraction layer and data structures.
For porting instructions, please see the porting guide (test).
For building instructions, please see the building guide (test).
The current command line is;
test [test|benchmark] [iterations] test : run the test suite benchmark : run the benchmark suite iterations : optional, only applies to tests, default is 1
When testing, all tests will be run; when benchmarking, all benchmarks will be run.
Tests
Introduction
Testing lock-free data structures is problematic. Some of the problems the tests try to detect are straightforward, but some are caused by race conditions. The more sensitive a test is, the more able it is to notice a race condition has occurred, the more work it has to do and so the less likely the race condition is to occur.
As such, there tend to be two classes of test; those which do extra work and those which don't. Those which do will tell you exactly what went wrong; those which don't won't tell you much, but at least you know there's a problem. In a few cases, it's possible to arrange a test so that you get the best of both worlds.
The test program initially was unsophisticated and consisted only of tests which simply exercised the data structures as hard as possible and if the test completed (without crashing or hanging) then it was assumed to have passed.
There is now effort in progress to qualitatively improve the tests and reconstruct them so that they provide a pass or fail result. This work has been done for the freelist and the queue tests and the new abstraction layer tests are written in this way. It has not yet been done for the other data structures.
As such, the abstraction layer, freelist, queue and ringbuffer tests issue a pass/fail result. The other tests simply execute (and display some data about how many iterations each thread executed).
Abstraction Layer (liblfds)
- Atomic Increment
This test attempts to determine if the implementation of lfds610_abstraction_increment is valid. The test first runs one thread per CPU, where each thread loops 1,000,000 times, incrementing a single shared integer using the normal "++" operator (which is non-atomic). The test notes the value achieved (which typically will be less than (CPU count*1,000,000), due to race conditions). The test then does repeats, but the second time, using lfds610_abstraction_increment. If the non-atomic count is less than (CPU count*1,000,000) and the atomic count is equal to (CPU count*1,000,000), the tests has passed. If both counts are (CPU count*1,000,000), the test is indeterminate; it is not possible to see if the lfds610_abstraction_incremement has been correctly implemented. Finally, if the atomic count is less than (CPU count*1,000,000), the test has failed. - Atomic DCAS
This test attempts to determine if the implementation of lfds610_abstraction_dcas is valid. The test runs one thread per CPU, where each test runs for ten seconds, using lfds610_abstraction_dcas to increment a shared integer; every time a thread successfully increments the shared counter, a thread local counter is incremented. After the threads have completed, the value of the shared counter is compared to the combined value of the local counters; the test passes if the two value match.
Freelist Tests
- Popping
A source freelist with 1,000,000 elements is created, where each element holds as its user data it's number (e.g. from 0 to 999,999). One thread is created per CPU, where each thread pops are rapidly as possible from the source freelist. These threads push the popped elements onto a local freelist. After the threads have completed, the per-thread freelists are examined to ensure all the elements in the original freelist are present and present exactly once. - Pushing
One thread is created per CPU, where each thread is given a freelist populated with 100,000 elements. Each element is a small structure, which contains the thread number (0 to CPU count-1) and the element number (0 to 99,999). A single empty main freelist is created. Each thread then pushes as rapidly as possible from its local populated freelist onto the main empty freelist. When the threads are complete, the main freelist is validated, firstly by using lfds610_freelist_query and LFDS610_FREELIST_QUERY_VALIDATE and secondly, where each element is popped and it is checked that the element numbers increment on a per-thread basis and that the expected number of elements are present. - Popping and pushing
Two threads are created per CPU. Half the threads begin with a populated freelist with 100,000 elements. The other half begin with an empty freelist. A single empty main freelist is created. The threads with populated freelists loop, where they rapidly as possible fully pop their freelist, pushing each element onto the main freelist, then, when their freelist is empty, popping from the main freelist and pushing onto their local freelist. The threads with empty freelists do the same, but they begin by popping from the main freelist. (When threads pop from the main freelists, they pop 100,000 elements). The test runs for ten seconds. Once the test is over, each thread empties anything remaining in its local freelist to the main freelist and the main freelist is then validated, using lfds610_freelist_query and LFDS610_FREELIST_QUERY_VALIDATE. If the freelist is declared valid (e.g. no loops, missing or additional elements), then the test passes. - Rapid popping and pushing
A small main freelist is created. One thread is created per CPU. Each thread busy-works for ten seconds, simply popping an element from the freelist and then immediately pushing it. After the test is complete, the freelist is validated using lfds610_freelist_query and LFDS610_FREELIST_QUERY_VALIDATE. If the freelist is valid, the test passes. This test is, empirically, the most sensitive to detecting ABA (which indicates an incorrect lfds610_abstraction_dcas implementation).
Queue Tests
- Enqueuing
An empty queue with a one million element freelist is created. The test then runs one thread per CPU. Each thread busy-works, enqueuing elements, until there are no more elements. The user data in each element is composed of the thread number and a counter. When the threads have completed, first the queue is validated using lfds610_queue_query and LFDS610_QUEUE_QUERY_VALIDATE and then the queue is dequeued and the user data are checked to ensure the counters increment on a per-thread basis. - Dequeuing
An empty queue with a one million element freelist is created. We then fully enqueue, where the user data is a counter. One thread is then started per CPU. Each thread busy-works, dequeuing. The threads check that the user data value they dequeue is always greater than the value previously dequeued. - Enqueuing and dequeuing
An empty queue is created with one element per CPU. The test then runs one thread per CPU, for ten seconds. Each thread busy-works, enqueuing and then immediately dequeuing. Each queued element user data is composed of the thread number and a counter. On dequeuing, the thread examines the user data it has obtained; it keeps track of the counter on a per-thread basis and checks to see the counter is larger than the counter value it most recently saw for that thread. After the threads complete, the queue is checked, using LFDS610_QUEUE_QUERY_VALIDATE (it should be empty). - Rapid enqueuing and dequeuing
An empty queue is created with one hundred thousand. A single thread queues half those elements (with dummy data - these initially queued elements will all have their data overwtitten during the course of the test, so their data doesn't matter). The test then runs one thread per CPU, for ten seconds. Each thread busy-works, enqueuing and then immediately dequeuing. Each queued element user data is composed of the thread number and a counter. Unlike the normal enqueuing and dequeuing test, the thread does not check the user data; it simply loops, enqueuing and immediately dequeuing. After the threads complete, the queue is validated using LFDS610_QUEUE_QUERY_VALIDATE and then the queue is fully dequeued and the user-data is checked to ensure counters increment on a per-thread basis.
Ringbuffer Tests
- Reading
A ringbuffer with 1,000,000 elements is created. The ringbuffer is fully populated, where the user data is an incrementing counter. One thread is created per CPU and the threads busy-work reading elements until the ringbuffer is empty. As the threads read, the keep track of the counter they read and throw an error if they read a counter which is equal to or smaller than the previously read counter. - Writing
A ringbuffer with 100,000 elements is created. One thread is created per CPU and each thread busy-works for ten seconds, writing to the ringbuffer. The user data written is a combination of the thread number and an incrementing counter. When the threads have completed, the ringbuffer is validated using lfds610_ringbuffer_query and LFDS610_RINGBUFFER_QUERY_VALIDATE and then the user data in the ringbuffer elements is checked to ensure the counters increment on a per-thread basis. - Reading and Writing
A ringbuffer with 100,000 elements is created. One thread is created per CPU and each thread busy-works for ten seconds; the work done is to write an element to ringbuffer and then to immediately read an element from the ringbuffer. The user data written is a combination of the thread number and an incrementing counter. As elements are read, each thread keeps track of the counters it sees on a per-thread basis and throws an error if a counter is seen which is equal to or smaller than the previously read counter for that thread.
SList Tests
- New, delete, get
The test creates a single slist and two threads per core. Each thread is given the single slist. The first thread per core alternates calling new_head() and new_next(). The second thread iterates, calling get_next() (get_head() if get_next() returns NULL) and then deletes the element it has obtained. The first thread keeps count of the number of elements it creates, the second thread keeps track of the number of elements it deletes. Threads run for ten seconds. Validation occurs by subtracting the total number of deleted elements and the number of elements remaining in the list from the total number of created elements; the result should be 0. - Get/set user data
This test creates a list with number of elements equal to the CPU core count times ten. There is then one thread per core, each thread runs for ten seconds, iterating over the list, setting the user data to its thread number (shift to the top byte of the user data) binary ORed with an incrementing counter. Validation is scanning the list afterwards; on a per-thread basis the counter should increment, with a single drop being acceptable (this occurs where a thread was halfway through the list when the test ended). - Delete all
Single-threaded test. This test creates a 100,000 element list and then simply calls delete_all_elements().
Stack Tests
- Popping
A source stack with 1,000,000 pushed elements is created, where each element holds as its user data it's number (e.g. from 0 to 999,999). One thread is created per CPU, where each thread pops are rapidly as possible from the source freelist. These threads push the popped elements onto a local stack. After the threads have completed, the per-thread stacks are examined to ensure all the elements in the original freelist are present and present exactly once. - Pushing
One thread is created per CPU, where each thread is given a stack populated with 100,000 elements. Each element is a small structure, which contains the thread number (0 to CPU count-1) and the element number (0 to 99,999). A single empty main stack is created. Each thread then pushes as rapidly as possible from its local populated stack onto the main empty stack. When the threads are complete, the main freelist is validated, firstly by using lfds610_stack_query and LFDS610_STACK_QUERY_VALIDATE and secondly, where each element is popped and it is checked that the element numbers decrement on a per-thread basis and that the expected number of elements are present. - Popping and pushing
Two threads are created per CPU. Half the threads begin with a populated stack with 100,000 elements. The other half begin with an empty stack. A single empty main stack is created. The threads with populated stacks loop, where they rapidly as possible fully pop their stack, pushing each element onto the main stack, then, when their stack is empty, popping from the main stack and pushing onto their local stack. The threads with empty stacks do the same, but they begin by popping from the main stack. (When threads pop from the main stacks, they pop 100,000 elements). The test runs for ten seconds. Once the test is over, each thread empties anything remaining in its local stack to the main freelist and the main stack is then validated, using lfds610_stack_query and LFDS610_STACK_QUERY_VALIDATE. If the stack is declared valid (e.g. no loops, missing or additional elements), then the test passes. - Rapid popping and pushing
A small main stack is created. One thread is created per CPU. Each thread busy-works for ten seconds, simply popping an element from the stack and then immediately pushing it. After the test is complete, the stack is validated using lfds610_stack_query and LFDS610_STACK_QUERY_VALIDATE. If the stack is valid, the test passes. This test is, empirically, the most sensitive to detecting ABA (which indicates an incorrect lfds610_abstraction_dcas implementation).
Benchmarks
Introduction
Benchmarks are executed many times, starting with one thread, where the number of threads increases by one each time, until the number of threads equals the number of CPUs. This is to test scalability.
Output is CSV format to stdout. The first line is the test ID, the second line are the column headers, the third and successive lines are data, e.g.;
Release 6.1.0 Freelist Benchmark #1 CPUs,total ops,mean ops/sec per CPU,standard deviation,scalability 1,216393136,21639314,0,1.00 2,181827358,9091368,40879,0.42 3,60543702,2018123,463660,0.09 4,39903510,997588,314877,0.05
The columns so far are identical for all tests.
Freelist
- Create a small freelist; each thread receives the freelist and then busy-works for ten seconds, where the thread simply loops, performing a pop and then a push. An operation is a pop or a push, so each iteration of the loop in a thread is two operations, since the test pops and then pushes.
Queue
- Create a small queue; each thread receives the queue and then busy-works for ten seconds, where the thread simply loops, performing a dequeue and then an enqueue. An operation is an enqueue or a dequeue, so each iteration of the loop in a thread is two operations, since the test dequeues and then enqueues.
Ringbuffer
- Create a small ringbuffer; each thread receives the ringbuffer and then busy-works for ten seconds, where the thread simply loops, performing a write and then a read. A read and a write both require a get and then a put, but a get and a put are never seperate, so each iteration of the loop in a thread is two operations, one for the indivisible pair of read operations and one for the indivisible pair of write operations.
Stack
- Create a small stack; each thread receives the stack and then busy-works for ten seconds, where the thread simply loops, performing a push and then a pop. An operation is a push or a pop, so each iteration of the loop in a thread is two operations, since the test pushes and then pops.
See the the benchmark database for existing benchmark data.