Monthly Archives: May 2011

Back at last, plus summary of the backoff problem

Long delay again folks – work!

The delay has given my brain time to digest backoff some more, though.

I think I can summarize now, at least for x86/x64;

The DCAS/CAS operations themselves are serialised by locking done by the CPU. There are no destructive collisions between DCAS/CAS themselves. There exists however a sort of destructive *delay* between DCAS/CAS, because a DCAS/CAS operation will invalidate for all other cores the cache line holding the destination. So if all the threads trying to DCAS/CAS are on the same core, not a problem – all threads are operating from the same cache and so cache line invalidation does not occur. However, if we have say one thread per core with many cores, each time a core successfully performs a DCAS/CAS, all other cores find their cache line for the destination invalidated, which necessitates a memory read. This Is Very Slow! So that’s one issue.

The other issue, a real destructive collision, occurs at a higher level, in the data structure when an algorithm wishes to load a pointer and then successfully CAS that pointer; it needs the pointer to remain undisturbed between its load and the CAS. If other threads are trying to do the same, a thread can find that the destination pointer is repeatedly being modified by other threads, causing its own DCAS/CAS to fail – even though the CPU is ensuring DCAS/CAS operations themselves are serialised. So that’s the other issue.

So, there are two delay periods which would seem to interest us;

1. if our cache line is invalid, the time taken to load from memory and then DCAS/CAS
2. if our cache line is valid, the time taken to load from cache and then DCAS/CAS

E.g. as the fundamental unit of backoff delay – but these two times different probably by about two orders of magnitude; something like 10 vs 300. However, the problem is, when performing backoff, you have no idea if your cache line is valid or not!

But this also isn’t very relevant in a way – in a busy system, in any given ongoing backoff situation involving n threads, new threads will be joining all the time – but they will be starting the backoff cycle, whereas the other threads will be further into that cycle.

Ticket locks for ordered backoff serialisation

A poster at StackOverflow, supercat, suggested the use of atomic counters for ordering CAS retries.

Basically, when a data structure tries a CAS it needs to succeed and that CAS fails, that thread will do an atomic increment on a counter, to obtain its ticket; and then read another counter, which is the current ticket. When the current ticket becomes equal to the threads ticket, the thread performs its CAS and the increments the current ticket.

This is a well-known locking mechanism, called a ticket lock. (I created that wiki page – back in 2005! – when I figured ticket locks out for myself, and then found (of course) someone else had figured them out first, back in 1991).

There is a difference, which is that with classic ticket locks, a thread spins on the current ticket counter, whereas supercat (as with exponential backoff) is instead proposing the thread estimates a wait period and performs NOPs for that period and then checks the current ticket.

Two things come to mind.

First, we must quantify the overhead. There is an atomic increment (therefore a write) on the ticket counter, plus one or more reads of the current ticket counter. The write will invalidate all other cache line entries (and thus require a read as well as a write). The reads will almost certainly require a memory read, since another thread will have written to that value in the meantime.

Second, how do we know how long to wait for?

This is of course the same problem faced by exponential backoff. A thread knows it own operation, but not that of other threads; and an operation may take different periods of time, depending on the amount of helping it must do. Also, operations take very different lengths of time whether or not they’re on the same core.

We could have an array, with one element per logical core, since a logical core can run only one thread at a time and each thread is busy-waiting. The element a thread owns is its ticket number modulus the number of elements in the array. A thread takes posession of its element with a CAS (to handling timing issues, where a thread may have obtained its ticket but not yet written to its element, while another thread has its ticket and is already scanning the array).

We know in the array the element for the active thread, because it is the element of the current ticket. Any new thread scans the array from that element onwards to its own element. Each element indicates the operation to be performed by the thread holding that ticket and also which core that thread is on. A thread can therefore estimate how long it should wait for (barring the fact that an operation take different periods of time depending on the amount of helping it does – we will have to assume worst case).

As mentioned, each element can indicate the core the thread is on. This assuming pinning. If not pinned, then we simply assume the thread continues on the core it is currently on – something which usually will be true, since the OS avoids moving threads.

Importantly, since we can see the physical core the *previous* thread was on, we can see if cache invalidation will occur. So if we have a sequence 1/4/4/2/3, although we don’t know if the first operation will produce cache invalidation (perhaps we simply have to assume worst case), we know the second will, the third will not, and the fourth and fifth will.

This leads to the thought of re-ordering. While scanning the array, adding up the time to wait, if a thread finds another operation on the same core, it swaps location in the array (assuming this can be made to work – the other thread will be surprised when it finds its not longer where it thought it was).

Ah, unfortunately, this will screw up the existing timing estimates made by other threads! but this raise the idea of re-ordering, regardless of use in a backoff algorithm (although I’ve just realised the obvious; if you’re re-ordering, you must be imposing an order, which means backoff… and in fact takes me back to a thread executing order idea I had years ago, where you have a queue of thread IDs, a new thread is pushed into the queue, when the currently executing thread finishes it pops the queue and alerts that thread that it can now run).

When a thread breaks out of its busy wait to check the current ticket and finds it is not yet ready to go, we recompute the delay and go back to busy waiting.

So this all sounds nice, but does it take so long and have so much overhead that the cure is worst than the problem?

Consider the benefit and cost of re-ordering. If we successfully re-order, and move an operation next to an operation on the same core, we save one RFO. But the swapping of elements in the array will cause an RFO when threads on other cores come to scan the array, since their cache lines will have been invalidated. But it might be if there are many logical cores in a physical core that we save many RFOs for the cost of one or two RFOs from the swapping.

Most of the reads and writes are contentionless; although we do have to perform a CAS when accessing each array element, to check that it is in a valid state (another thread may have obtained a ticket but not yet written to its element, while we, although obtaining our ticket later, are already scanning the array; or we ourselves may be slow accessing the array, and the ticket has already incremented, perhaps even several times. What do we do if while scanning we come across an element which will be populated but isn’t yet? we can’t know how long that operation will take).

That’s enough for now – I have to go buy a kayak :-)

x86/x64 CAS and backoff

Currently, there are on my mind two issues which are handled by backoff;

1. RFO
2. destructive “collisions” between threads concurrently performing CAS

RFO is caused by CAS on cores which do not share cache. Consider we have two cores with entirely separate caches. When one core completes a CAS, it will have written the new value back to its cache and back to memory. The cache line in the other core will be invalidated; when that core then attempts to CAS, it will need to read over the bus from the other cores cache (it could read from memory, but it’s quicker to read from another cache – it’s still very slow).

Imagine instead we have two cores which share the same level one cache. One core completes a CAS and writes the new value back into the cache and to memory. When the other core comes to CAS, the cache line is still valid! so no read required.

Destructive “collisions” occur at the data structure level. Remember that generally when a data structure performs a CAS, it is looking for that CAS to succeed – it will have read in a destination value and then will CAS on the same destination, hoping that the destination is unchanged.

So imagine we have one physical core with, say, eight logical cores. Each core reads in the destination and then moves to the CAS. x86/x64 CAS busy-waits on a cache-line lock, so one core will have obtained that lock while the others all busy-wait. The core which won the lock will successfully CAS; this will of course change the value of destination – and that means all the other core’s CAS will fail, because the have the old value for destination. Pretty horrific! we see clearly here that the basic unit of operation at the data structure level is the read prior to the CAS and the CAS itself. The destination needs to be stable over the whole of the time required for these two operations, or the load-and-CAS attempt is wasted.

Returning to the RTO problem, the question is if and how backoff can be useful to ameilorate this problem. Consider one of the extreme scenarios; say we have two threads, one per physical core (so no shared caches) and both threads are simply doing constant read-and-CAS, where the cache-line lock alternates between the two cores. Here we have an RTO per CAS. How does backoff help? let’s imagine when we collide, we backoff for ten units of read-and-CAS. Well what we see here is that greatly reduces the number of RTOs. Instead of each thread doing one unit of work, with an RTO inbetween each, each thread does ten units of work, with an RTO inbetween. Much more efficient.

But this is only true when the threads are both performing 100% read-and-CAS. What if both are only occasionally doing read-and-CAS? in that situation, the wait gives us no efficiency gains, because no read-and-CAS operations occur during the wait period.

Here then the exponential part of exponential backoff becomes useful; the starting delay is very short, but it becomes long fairly quickly, so the threads find an appropriate wait period.

In this case, the basic unit of delay is the time taken for a read from memory (since the cache line is invalid), followed by a CAS.

Very closely related to this is the scenario where the thread are on cores which share a cache. Here backoff is still required, but the basic unit of delay is much shorter – the time taken for a read from cache, followed by a CAS.

Of course, if there are mixed threads, some sharing a cache, some not, you’re stuffed. You can’t know which are which.

With regard to destructive “collisions” at the data structure level, we immediately see that the basic unit of delay has to be the time taken for the operation in question, not just for a single read-and-CAS. It will be in some cases the operation consists of a single read-and-CAS, in which case the basic unit of delay is the same, but in other cases the data structure operation may have more than read-and-CAS and indeed may have a number of different paths, each taking a different time.

What’s worse is that there are likely to be a number of separate operations, each perhaps having multiple paths, operating on the same destination. If a thread is trying to perform operation A (which takes, say, 10 units of time) while another thread is trying to perform operation B (which takes, say, 5 units of time), what is the basic unit of backoff delay?

This is basically the same problem as RTO has when some threads are sharing caches and others are not; the basic delay period is incompatable between the contending threads.

One thing is clear, however; the basic unit of time must be derived from the data structures, not just the time taken for a CAS or read-and-CAS. Even if we were looking to reduce RTO by permitting a thread to perform work in blocks of ten rather than one, those individual blocks of work must be that of a data structure operation, not a single CAS or read-and-CAS.

So, let’s imagine we now measure the time taken for each operation (and that each operation has only one path). How do we arrange backoff on a single destination for operations with different basic units of time? each thread can know the length of time each operation takes and of course the operation it itself is trying to perform, but it can’t know the operations in use by other threads.

(We can ignore the problem of a given operation having multiple paths, because that’s just the same as having more operations with only a single path. Also note the question effectively covers the problem of when the read part of the read-and-CAS is sometimes from cache but sometimes from memory – it is again the same as having more single operations, but which take different periods of time).

Consider the case where we have eight cores trying to perform the same operation simultanious. All eight read the destination, one wins the cache lock and completes its CAS; the other seven all fail.

What do we want to happen? we want those seven cores to now arrange their data structure opreations such they do not collide with each other.