r6.0.1:Testing and Benchmarking Guide

From liblfds.org
Jump to: navigation, search

Introduction

A command-line test and benchmark program is provided with liblfds. This program, which has its own small abstraction layer which must be implemented, links against liblfds and can test and/or benchmark the abstraction layer and data structures.

For porting instructions, please see the porting guide.

For building instructions, please see the building guide.

The current command line is;

test [test|benchmark] [iterations]
  test       : run the test suite
  benchmark  : run the benchmark suite
  iterations : optional, only applies to tests, default is 1

When testing, all tests will be run; when benchmarking, all benchmarks will be run.

Tests

Introduction

Testing lock-free data structures is problematic. Some of the problems the tests try to detect are straightforward, but some are caused by race conditions. The more sensitive a test is, the more able it is to notice a race condition has occurred, the more work it has to do and so the less likely the race condition is to occur.

As such, there tend to be two classes of test; those which do extra work and those which don't. Those which do will tell you exactly what went wrong; those which don't won't tell you much, but at least you know there's a problem. In a few cases, it's possible to arrange a test so that you get the best of both worlds.

The test program initially was unsophisticated and consisted only of tests which simply exercised the data structures as hard as possible and if the test completed (without crashing or hanging) then it was assumed to have passed.

There is now effort in progress to qualitatively improve the tests and reconstruct them so that they provide a pass or fail result. This work has been done for the freelist and the queue tests and the new abstraction layer tests are written in this way. It has not yet been done for the other data structures.

As such, the abstraction layer, freelist, queue and ringbuffer tests issue a pass/fail result. The other tests simply execute (and display some data about how many iterations each thread executed).

Abstraction Layer (liblfds)

  1. Atomic Increment
    This test attempts to determine if the implementation of lfds601_abstraction_increment is valid. The test first runs one thread per CPU, where each thread loops 1,000,000 times, incrementing a single shared integer using the normal "++" operator (which is non-atomic). The test notes the value achieved (which typically will be less than (CPU count*1,000,000), due to race conditions). The test then does repeats, but the second time, using lfds601_abstraction_increment. If the non-atomic count is less than (CPU count*1,000,000) and the atomic count is equal to (CPU count*1,000,000), the tests has passed. If both counts are (CPU count*1,000,000), the test is indeterminate; it is not possible to see if the lfds601_abstraction_incremement has been correctly implemented. Finally, if the atomic count is less than (CPU count*1,000,000), the test has failed.
  2. Atomic DCAS
    This test attempts to determine if the implementation of lfds601_abstraction_dcas is valid. The test runs one thread per CPU, where each test runs for ten seconds, using lfds601_abstraction_dcas to increment a shared integer; every time a thread successfully increments the shared counter, a thread local counter is incremented. After the threads have completed, the value of the shared counter is compared to the combined value of the local counters; the test passes if the two value match.

Freelist Tests

  1. Popping
    A source freelist with 1,000,000 elements is created, where each element holds as its user data it's number (e.g. from 0 to 999,999). One thread is created per CPU, where each thread pops are rapidly as possible from the source freelist. These threads push the popped elements onto a local freelist. After the threads have completed, the per-thread freelists are examined to ensure all the elements in the original freelist are present and present exactly once.
  2. Pushing
    One thread is created per CPU, where each thread is given a freelist populated with (1,000,000/CPU count) elements. Each element is a small structure, which contains the thread number (0 to CPU count-1) and the element number (0 to 999,999). A single empty main freelist is created. Each thread then pushes as rapidly as possible from its local populated freelist onto the main empty freelist. When the threads are complete, the main freelist is validated, firstly by using lfds601_freelist_query and LFDS601_FREELIST_QUERY_VALIDATE and secondly, where each element is popped and it is checked that the element numbers increment on a per-thread basis and that the expected number of elements are present.
  3. Popping and pushing
    Two threads are created per CPU. Half the threads begin with a populated freelist with 100,000 elements. The other half begin with an empty freelist. A single empty main freelist is created. The threads with populated freelists loop, where they rapidly as possible fully pop their freelist, pushing each element onto the main freelist, then, when their freelist is empty, popping from the main freelist and pushing onto their local freelist. The threads with empty freelists do the same, but they begin by popping from the main freelist. (When threads pop from the main freelists, they pop 100,000 elements). The test runs for ten seconds. Once the test is over, each thread empties anything remaining in its local freelist to the main freelist and the main freelist is then validated, using lfds601_freelist_query and LFDS601_FREELIST_QUERY_VALIDATE. If the freelist is declared valid (e.g. no loops, missing or additional elements), then the test passes.
  4. Rapid popping and pushing
    A small main freelist is created. One thread is created per CPU. Each thread busy-works for ten seconds, simply popping an element from the freelist and then immediately pushing it. After the test is complete, the freelist is validated using lfds601_freelist_query and LFDS601_FREELIST_QUERY_VALIDATE. If the freelist is valid, the test passes. This test is, empirically, the most sensitive to detecting ABA (which indicates an incorrect lfds601_abstraction_dcas implementation).

Queue Tests

  1. Enqueuing
    An empty queue with a one million element freelist is created. The test then runs one thread per CPU. Each thread busy-works, enqueuing elements, until there are no more elements. The user data in each element is composed of the thread number and a counter. When the threads have completed, first the queue is validated using lfds601_queue_query and LFDS601_QUEUE_QUERY_VALIDATE and then the queue is dequeued and the user data are checked to ensure the counters increment on a per-thread basis.
  2. Dequeuing
    An empty queue with a one million element freelist is created. We then fully enqueue, where the user data is a counter. One thread is then started per CPU. Each thread busy-works, dequeuing. The threads check that the user data value they dequeue is always greater than the value previously dequeued.
  3. Enqueuing and dequeuing
    An empty queue is created with one element per CPU. The test then runs one thread per CPU, for ten seconds. Each thread busy-works, enqueuing and then immediately dequeuing. Each queued element user data is composed of the thread number and a counter. On dequeuing, the thread examines the user data it has obtained; it keeps track of the counter on a per-thread basis and checks to see the counter is larger than the counter value it most recently saw for that thread. After the threads complete, the queue is checked, using LFDS601_QUEUE_QUERY_VALIDATE (it should be empty).
  4. Rapid enqueuing and dequeuing
    An empty queue is created with one hundred thousand. A single thread queues half those elements (with dummy data - these initially queued elements will all have their data overwtitten during the course of the test, so their data doesn't matter). The test then runs one thread per CPU, for ten seconds. Each thread busy-works, enqueuing and then immediately dequeuing. Each queued element user data is composed of the thread number and a counter. Unlike the normal enqueuing and dequeuing test, the thread does not check the user data; it simply loops, enqueuing and immediately dequeuing. After the threads complete, the queue is validated using LFDS601_QUEUE_QUERY_VALIDATE and then the queue is fully dequeued and the user-data is checked to ensure counters increment on a per-thread basis.

Ringbuffer Tests

  1. Reading
    A ringbuffer with 1,000,000 elements is created. The ringbuffer is fully populated, where the user data is an incrementing counter. One thread is created per CPU and the threads busy-work reading elements until the ringbuffer is empty. As the threads read, the keep track of the counter they read and throw an error if they read a counter which is equal to or smaller than the previously read counter.
  2. Writing
    A ringbuffer with 100,000 elements is created. One thread is created per CPU and each thread busy-works for ten seconds, writing to the ringbuffer. The user data written is a combination of the thread number and an incrementing counter. When the threads have completed, the ringbuffer is validated using lfds601_ringbuffer_query and LFDS601_RINGBUFFER_QUERY_VALIDATE and then the user data in the ringbuffer elements is checked to ensure the counters increment on a per-thread basis.
  3. Reading and Writing
    A ringbuffer with 100,000 elements is created. One thread is created per CPU and each thread busy-works for ten seconds; the work done is to write an element to ringbuffer and then to immediately read an element from the ringbuffer. The user data written is a combination of the thread number and an incrementing counter. As elements are read, each thread keeps track of the counters it sees on a per-thread basis and throws an error if a counter is seen which is equal to or smaller than the previously read counter for that thread.

SList Tests

Too many to mention and since they're all going to be rewritten soon, documentation is deferred.

Stack Tests

  1. one reader thread per CPU
    • this always returns zero reads, since the stack is empty
  2. one writer thread per CPU
    • this test, since there is no reader, exhausts the stack's freelist and so the count of writes is equal to the freelist size
  3. one reader and one writer thread per CPU
  4. one reader and two writer threads per CPU
  5. two reader and one writer thread per CPU

Benchmarks

Introduction

Benchmarks are executed many times, starting with one thread, where the number of threads increases by one each time, until the number of threads equals the number of CPUs. This is to test scalability.

Output is CSV format to stdout. The first line is the test ID, the second line are the column headers, the third and successive lines are data, e.g.;

Release 6.0.1 Freelist Benchmark #1
CPUs,total ops,mean ops/sec per CPU,standard deviation,scalability
1,216393136,21639314,0,1.00
2,181827358,9091368,40879,0.42
3,60543702,2018123,463660,0.09
4,39903510,997588,314877,0.05

The columns so far are identical for all tests.

Freelist

  1. Create a small freelist; each thread receives the freelist and then busy-works for ten seconds, where the thread simply loops, performing a pop and then a push. An operation is a pop or a push, so each iteration of the loop in a thread is two operations, since the test pops and then pushes.

Queue

  1. Create a small queue; each thread receives the queue and then busy-works for ten seconds, where the thread simply loops, performing a dequeue and then an enqueue. An operation is an enqueue or a dequeue, so each iteration of the loop in a thread is two operations, since the test dequeues and then enqueues.

Ringbuffer

  1. Create a small ringbuffer; each thread receives the ringbuffer and then busy-works for ten seconds, where the thread simply loops, performing a write and then a read. A read and a write both require a get and then a put, but a get and a put are never seperate, so each iteration of the loop in a thread is two operations, one for the indivisible pair of read operations and one for the indivisible pair of write operations.

Stack

  1. Create a small stack; each thread receives the stack and then busy-works for ten seconds, where the thread simply loops, performing a push and then a pop. An operation is a push or a pop, so each iteration of the loop in a thread is two operations, since the test pushes and then pops.

See the the benchmark database for existing benchmark data.