Usage Guide (liblfds)

Introduction

This page describes how to use the liblfds library, in terms of compiling and linking, and then covers the novel and peculiar issues which originate in the lock-free nature of the data structures in the library.

Wherever possible such issues have been hidden from the user, but there are some - those described here - which simply cannot be hidden and which the user has to know about.

Compiling and Linking

The liblfds library compiles to either a static or a shared library, at the pathname liblfds700/bin/liblfds700.[nnn], where the suffix of the library file varies by platform and by the type of library (static or dynamic).

Library Initialization and Cleanup

The function lfds700_misc_library_init_valid_on_current_logical_core must be called before any other functions are called. This function is called once, by one thread.

Other threads must call LFDS700_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE for the library initialization to be valid for them. Threads are permitted to call this macro, and only this macro, even though they have not themselves called lfds700_misc_library_init_valid_on_current_logical_core.

The function lfds700_misc_cleanup must be called when finished with the library. Again, this is called once, by one thread. Other threads have no cleanup work of their own to do.
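
As a minimal sketch (the liblfds700.h header name is an assumption; the function and macro names are those given above):

  #include "liblfds700.h"

  int main( void )
  {
    /* called once, by one thread, before any other liblfds function */
    lfds700_misc_library_init_valid_on_current_logical_core();

    /* every OTHER thread calls this macro, once, before its first liblfds call:
       LFDS700_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE;
    */

    /* ...use the library... */

    /* called once, by one thread, when wholly finished with the library */
    lfds700_misc_cleanup();

    return( 0 );
  }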

General Usage Style

All allocations are performed by the user and passed into the library; as such, there are no new and delete functions, but rather init and cleanup. Each data structure is represented by a state structure (e.g. lfds700_freelist_state) and usually also has an element structure (e.g. lfds700_freelist_element). Each element carries both a key and a value; the key is mandatory for the btree, hash and ordered list, and optional for all other data structures (where it is provided for convenience, for the case of passing data between different data structures while wishing to retain the key). There are getter and setter macros for keys and values. The data structure functions themselves almost exclusively deal with operations on data structure elements; the user sets or gets the keys and values in those elements, before or after a data structure function is called.
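
A minimal sketch of this style, using the freelist; the state and element structure names are as above, while the prng init, push/pop functions and value macros are assumptions based on the library's naming conventions - consult the API documentation for the exact signatures:

  #include <stdlib.h>
  #include "liblfds700.h"

  struct lfds700_freelist_state fs;
  struct lfds700_freelist_element fe;
  struct lfds700_misc_prng_state ps;

  void example( void )
  {
    struct lfds700_freelist_element *popped;
    void *value;

    lfds700_misc_prng_init( &ps );                        /* per-thread prng state    */
    lfds700_freelist_init_valid_on_current_logical_core( &fs, NULL );

    value = malloc( 64 );                                 /* the user allocates...    */
    LFDS700_FREELIST_SET_VALUE_IN_ELEMENT( fe, value );   /* ...and sets the value... */
    lfds700_freelist_push( &fs, &fe, &ps );               /* ...before the push       */

    if( lfds700_freelist_pop(&fs, &popped, &ps) )
      value = LFDS700_FREELIST_GET_VALUE_FROM_ELEMENT( *popped );

    lfds700_freelist_cleanup( &fs, NULL );                /* when wholly finished     */
  }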

Novel and Peculiar Issues

Memory Deallocation

Any data structure element which has at any time been present in a lock-free data structure can never be passed to free until the data structure in question is no longer in use and has had its cleanup function called.

As such, typical usage is for data structure elements to be supplied from (and returned to) a lock-free freelist.

This is true for both the add-only and non-add-only data structures.

So, for example, if the user allocates a queue element and some user data and enqueues them, when that queue element is finally dequeued, it cannot be passed to free until the queue itself has been terminated and has had its cleanup function called.
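
A sketch of the rule (the queue function names and signatures here are assumptions, based on the library's naming conventions - consult the API documentation):

  #include <stdlib.h>
  #include "liblfds700.h"

  void shutdown_example( struct lfds700_queue_state *qs, struct lfds700_misc_prng_state *ps )
  {
    struct lfds700_queue_element *qe = NULL;

    if( lfds700_queue_dequeue(qs, &qe, ps) )
    {
      /* calling free( qe ) here would be WRONG - other threads may still be
         dereferencing qe inside their own in-flight queue operations */
    }

    /* once EVERY thread is finished with the queue... */
    lfds700_queue_cleanup( qs, NULL );

    free( qe );  /* ...only now is freeing the element (and its user data) safe */
  }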

Memory Allocation (and Shared Memory Support)

The liblfds library performs no memory allocation or deallocation.

The user is responsible for all allocation and all deallocation. As such, allocations can be from the heap or the stack, from user-mode or from the kernel; the library simply uses what it is given, and does not know about - and so does not differentiate between - virtual and physical addresses. Allocations can be in shared memory, but note that the virtual memory ranges must be the same in all processes - liblfds uses addresses directly, rather than offsets. Being able to use shared memory is particularly important on Windows, which lacks a high-performance cross-process lock; the data structures in liblfds, when used with shared memory, meet this need.

Data Structure Initialization

Passing a data structure state to its init function initializes that state, but that initialization is valid only for the logical core upon which it occurs.

The macro LFDS700_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE is used to make the initialization valid on other logical cores; it makes the initialization valid upon, and only upon, the logical core which calls the macro.

Expected use is that a thread will initialize a data structure, pass a pointer to its state to the other threads, each of whom will then call LFDS700_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE.
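
A sketch of this pattern (pthreads is used purely for illustration, and the freelist init function name is an assumption; the macro is as above):

  #include <pthread.h>
  #include "liblfds700.h"

  struct lfds700_freelist_state fs;

  void *worker( void *arg )
  {
    struct lfds700_freelist_state *shared_fs = (struct lfds700_freelist_state *) arg;

    /* make inits performed before now, on other logical cores, valid here */
    LFDS700_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE;

    /* ...shared_fs is now safe to use on this logical core... */

    return NULL;
  }

  void spawn( void )
  {
    pthread_t thread;

    lfds700_freelist_init_valid_on_current_logical_core( &fs, NULL );
    pthread_create( &thread, NULL, worker, &fs );
    pthread_join( thread, NULL );
  }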

Lock-Free Exponential Backoff

Lock-free data structure operations generally are implemented as loops which attempt an operation, check to see if it succeeded or failed, and if it failed, try again. This is because lock-free operations can collide; if two or more threads are attempting to perform the same operation concurrently (such as pushing to a stack), then the thread which succeeds will in fact disrupt the other thread's operation and cause it to fail - and so to need to repeat the operation in its loop.

This can lead to a problem known as livelock, where so many threads are trying to perform a given operation that they make no or almost no forward progress, because they keep running into each other.

A method which ameliorates this problem is exponential backoff. When a thread attempts and fails to perform an operation, it performs a very brief busy-wait loop, before retrying.

The duration of the loop is computed by randomly picking a value between 0 and a maximum value, where that maximum value starts at one and doubles every time the attempted operation fails - i.e. the maximum possible backoff (and so also the mean actual backoff, since the random value will on average be half of the maximum) increases exponentially. The chosen value is then multiplied by a constant, and the result is the actual number of iterations of the busy-wait loop to perform.

There is one constant for CAS operations and another for DWCAS operations. They are intended to reflect how long the CAS and DWCAS take to operate; so in fact when we randomly select the first number, between 0 and the current maximum value, we are conceptually selecting how many CAS or DWCAS operations to wait for.
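
By way of illustration, a conceptual sketch of the computation - this is NOT the library's internal code; prng_next and the constant value stand in for the real per-thread generator and tunable constants:

  #include <stdint.h>

  #define BACKOFF_CAS_CONSTANT  10   /* illustrative only; the real values are tunable */

  static void backoff( uint64_t *backoff_max, uint64_t (*prng_next)(void) )
  {
    uint64_t iterations;
    volatile uint64_t loop;

    /* random value in [0, backoff_max], scaled by the per-operation constant */
    iterations = ( prng_next() % (*backoff_max + 1) ) * BACKOFF_CAS_CONSTANT;

    for( loop = 0 ; loop < iterations ; loop++ )
      ;  /* brief busy-wait; volatile stops the compiler eliding the loop */

    *backoff_max <<= 1;  /* double the ceiling for the next failed attempt */
  }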

The values of these constants can be obtained and set, at any time, by calling lfds700_misc_query. The ability to set them is only really useful when the system liblfds will run on is fixed and known, as the optimal values for these constants vary by processor model, clock speed and current system load.

Random Number Generation

A key issue here is the need to be able to generate random numbers. The standard library function rand is unsuitable in every way: it is not thread-safe, it has a very poor algorithm, it only has a guaranteed maximum value of 2^15-1, the performance of the implementation is unknown and may vary across implementations, and it is in any case a function call (i.e. slow) - remember, we want to generate a random number in the middle of a lock-free operation loop; we want to be FAST. The idea of a function call, with its attendant cache disruption and delay, is beyond the pale.

Out of necessity, then, liblfds implements a well-known, fast and lightweight random number generator (on 32-bit platforms, a 32-bit xorshift; on 64-bit platforms, an xorshift64*) and uses this for timely random number generation.
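
For reference, the textbook forms of these generators (liblfds' own implementation may differ in detail):

  #include <stdint.h>

  /* Marsaglia's xorshift32 - used on 32-bit platforms */
  static uint32_t xorshift32( uint32_t *state )
  {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
  }

  /* Vigna's xorshift64* - used on 64-bit platforms */
  static uint64_t xorshift64_star( uint64_t *state )
  {
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * UINT64_C(0x2545F4914F6CDD1D);
  }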

Each thread using liblfds is expected to initialize one instance of struct lfds700_misc_prng_state, and it is this which is passed as the last argument to those function calls which require the ability to produce random numbers for exponential backoff.

The random number generator code is not thread safe; it is expected that there is one state per thread. This is also processor cache friendly.

This leads to the question of the seed value for the random number generator.

It is often the case that the time function is used to produce a seed. However, the liblfds library is written to compile on a bare C implementation, so time cannot be used. In fact, on a bare C implementation, there are no external sources of entropy at all. So what to do? If we use the same seed everywhere, all of the threads will progress through the same sequence of random numbers - probably it would be okay, but all the same, it would be nicer to do something better than this.

What has been done is to implement a second random number generator, an xorshift1024*. This is big (1024 bits of state, vs 32 bits for xorshift and 64 bits for xorshift64*), slow and high quality. There is a single fixed seed, LFDS700_MISC_PRNG_SEED, which was taken from an atmospheric-noise-based true random number generator; this seed is passed through murmurhash3 to initialize the 1024 bits of xorshift1024* state. When one of the per-thread random number generator states is initialized, it obtains its seed from the xorshift1024*.
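
For reference, the textbook xorshift1024* (the liblfds implementation, and its murmurhash3 seeding, may differ in detail):

  #include <stdint.h>

  static uint64_t s[16];  /* 16 x 64-bit words = 1024 bits of state */
  static int p;

  static uint64_t xorshift1024_star( void )
  {
    const uint64_t s0 = s[p];
    uint64_t s1 = s[p = (p + 1) & 15];

    s1 ^= s1 << 31;
    s[p] = s1 ^ s0 ^ (s1 >> 11) ^ (s0 >> 30);

    return s[p] * UINT64_C(1181783497276652981);
  }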

The Standard Library and liblfds_uint_t

The liblfds library is intended for both 32-bit and 64-bit platforms. As such, there is a need for an unsigned type which is 32 bits long on a 32-bit platform and 64 bits long on a 64-bit platform - but remember that the Standard Library is not used, so we cannot turn to it for a solution (and also that in C89 there was not really such a type anyway - size_t did in fact behave this way on Windows and Linux, but semantically size_t means something else, and so it only coincidentally behaves in this way).

As such, liblfds, in the platform abstraction layer, typedefs liblfds_uint_t. This is set to an unsigned integer which is the natural length for the platform, i.e. the length of a processor register.
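
The idea, as a sketch only (the real typedef lives in the porting abstraction layer headers; the predefined macros tested here are merely illustrative):

  #if defined( _WIN64 ) || defined( __LP64__ )
    typedef unsigned long long int liblfds_uint_t;  /* 64-bit platform: 64-bit register */
  #else
    typedef unsigned int liblfds_uint_t;            /* 32-bit platform: 32-bit register */
  #endif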

Alignment and Padding of Atomic Variables

There are two approaches in hardware - in processors - to implementing atomic operations, and there are two issues relating to these approaches.

The first approach is compare-and-swap (CAS), the second approach is load-linked/store-conditional (LL/SC).

Atomic operations involve writing a variable to memory. CAS implements atomicity by locking the cache-line containing the variable; as such, it has no failure mode - it simply stalls until it can lock the cache-line. (The compare will come back saying equal or not-equal - the data may or may not match - but the comparing operation itself cannot fail.)

LL/SC works by loading the variable into a register, for further work, while watching the memory location it came from until the variable is stored again; if in the meantime another write occurred to that memory location, the store aborts. The compare is performed by the user in his normal code after the load and, assuming the user decided he did want to store, then upon attempting to do so the processor aborts the store, because it knows the variable has in the meantime changed; the compare returned equal or not-equal as may be, but the store itself failed - it was aborted by the processor.

The granularity of the 'watching' varies a great deal. On some platforms, memory is conceptually divided up into pages, known as "exclusive reservation granules" (on ARM, for example, these range in size from 8 to 2048 bytes, depending on implementation), and if a write occurs anywhere in the granule which contains the target variable, then the store fails. On other platforms (MIPS, for example), there is one 'watcher' per processor, so if ANY other write occurs on that processor between the load and the store, the store fails. Good luck with that :-)

The first issue, which applies to both approaches, is that all writes to memory actually write out the full cache-line the variable sits in (this is true for all processors; when you write to a variable, the cache-line it sits in is loaded from memory, the variable - which may be only one byte, say - is modified, and then the entire cache-line is written back out). This causes the cache-line to be valid for and only for the writing logical core, and to be invalidated on all other logical cores (all other copies of this line in other cores' caches are now out of date). Generally speaking, any variable which is the target of atomic operations will be written to regularly, or relatively regularly, by many logical cores - why else would atomic operations be used? The concern here is collateral damage - that we may have other important variables in that same cache-line, and they are continually being (inadvertently) invalidated by the atomic operations. A solution to this is to align and pad all variables subject to atomic operations such that each has a cache-line to itself.

The second issue, which applies only to LL/SC, is somewhat similar: where a write to a granule will break any in-progress atomic operation within that granule, it is best to ensure that a variable subject to atomic operations is aligned and padded such that it has a granule to itself.

These two issues then come down to the same thing - variables subject to atomic operations need alignment and padding, although what that alignment and padding should be varies by platform. On CAS platforms, it should be the cache-line length. On LL/SC platforms with granules, it should be the larger of the cache-line length and the granule length (granule lengths are always a power of two and are always exactly divided by cache-line lengths). On LL/SC platforms where there is only one watcher per logical core, it should be the cache-line length.
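
By way of illustration, a sketch using C11 _Alignas - the library itself achieves this through its porting abstraction layer, and the 64 here is a stand-in for the platform's real isolation length:

  #include <stdint.h>

  #define ATOMIC_ISOLATION_IN_BYTES  64  /* illustrative; use the platform's real value */

  struct isolated_counter
  {
    _Alignas( ATOMIC_ISOLATION_IN_BYTES ) uint64_t value;          /* target of atomic ops  */
    char padding[ ATOMIC_ISOLATION_IN_BYTES - sizeof(uint64_t) ];  /* keeps neighbours out  */
  };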

The kicker though is this - on platforms where the granule size varies, the only way to set it correctly is for the user to configure it; the library cannot find out the size by itself. The porting abstraction layer is configured out-of-the-box to use the worst-case value, and that worst-case value can be LARGE. On ARM it is 2048 bytes - and a btree element structure, for example, contains THREE atomic variables, i.e. it is about 6KB in size, for one element!

So the moral of the story is this - on such platforms, be sure to set LFDS700_PAL_ATOMIC_ISOLATION_IN_BYTES correctly, or your structures will be painfully large.
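
For example, on an ARM implementation whose Exclusive Reservation Granule is documented to be 64 bytes (check the processor manual - the value here is an assumption for illustration):

  #define LFDS700_PAL_ATOMIC_ISOLATION_IN_BYTES  64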

The platforms which need to care about this are:

  • ARM
  • POWERPC
