Currently, there are on my mind two issues which are handled by backoff;
1. RFO
2. destructive “collisions” between threads concurrently performing CAS
RFO is caused by CAS on cores which do not share cache. Consider we have two cores with entirely separate caches. When one core completes a CAS, it will have written the new value back to its cache and back to memory. The cache line in the other core will be invalidated; when that core then attempts to CAS, it will need to read over the bus from the other cores cache (it could read from memory, but it’s quicker to read from another cache – it’s still very slow).
Imagine instead we have two cores which share the same level one cache. One core completes a CAS and writes the new value back into the cache and to memory. When the other core comes to CAS, the cache line is still valid! so no read required.
Destructive “collisions” occur at the data structure level. Remember that generally when a data structure performs a CAS, it is looking for that CAS to succeed – it will have read in a destination value and then will CAS on the same destination, hoping that the destination is unchanged.
So imagine we have one physical core with, say, eight logical cores. Each core reads in the destination and then moves to the CAS. x86/x64 CAS busy-waits on a cache-line lock, so one core will have obtained that lock while the others all busy-wait. The core which won the lock will successfully CAS; this will of course change the value of destination – and that means all the other core’s CAS will fail, because the have the old value for destination. Pretty horrific! we see clearly here that the basic unit of operation at the data structure level is the read prior to the CAS and the CAS itself. The destination needs to be stable over the whole of the time required for these two operations, or the load-and-CAS attempt is wasted.
Returning to the RTO problem, the question is if and how backoff can be useful to ameilorate this problem. Consider one of the extreme scenarios; say we have two threads, one per physical core (so no shared caches) and both threads are simply doing constant read-and-CAS, where the cache-line lock alternates between the two cores. Here we have an RTO per CAS. How does backoff help? let’s imagine when we collide, we backoff for ten units of read-and-CAS. Well what we see here is that greatly reduces the number of RTOs. Instead of each thread doing one unit of work, with an RTO inbetween each, each thread does ten units of work, with an RTO inbetween. Much more efficient.
But this is only true when the threads are both performing 100% read-and-CAS. What if both are only occasionally doing read-and-CAS? in that situation, the wait gives us no efficiency gains, because no read-and-CAS operations occur during the wait period.
Here then the exponential part of exponential backoff becomes useful; the starting delay is very short, but it becomes long fairly quickly, so the threads find an appropriate wait period.
In this case, the basic unit of delay is the time taken for a read from memory (since the cache line is invalid), followed by a CAS.
Very closely related to this is the scenario where the thread are on cores which share a cache. Here backoff is still required, but the basic unit of delay is much shorter – the time taken for a read from cache, followed by a CAS.
Of course, if there are mixed threads, some sharing a cache, some not, you’re stuffed. You can’t know which are which.
With regard to destructive “collisions” at the data structure level, we immediately see that the basic unit of delay has to be the time taken for the operation in question, not just for a single read-and-CAS. It will be in some cases the operation consists of a single read-and-CAS, in which case the basic unit of delay is the same, but in other cases the data structure operation may have more than read-and-CAS and indeed may have a number of different paths, each taking a different time.
What’s worse is that there are likely to be a number of separate operations, each perhaps having multiple paths, operating on the same destination. If a thread is trying to perform operation A (which takes, say, 10 units of time) while another thread is trying to perform operation B (which takes, say, 5 units of time), what is the basic unit of backoff delay?
This is basically the same problem as RTO has when some threads are sharing caches and others are not; the basic delay period is incompatable between the contending threads.
One thing is clear, however; the basic unit of time must be derived from the data structures, not just the time taken for a CAS or read-and-CAS. Even if we were looking to reduce RTO by permitting a thread to perform work in blocks of ten rather than one, those individual blocks of work must be that of a data structure operation, not a single CAS or read-and-CAS.
So, let’s imagine we now measure the time taken for each operation (and that each operation has only one path). How do we arrange backoff on a single destination for operations with different basic units of time? each thread can know the length of time each operation takes and of course the operation it itself is trying to perform, but it can’t know the operations in use by other threads.
(We can ignore the problem of a given operation having multiple paths, because that’s just the same as having more operations with only a single path. Also note the question effectively covers the problem of when the read part of the read-and-CAS is sometimes from cache but sometimes from memory – it is again the same as having more single operations, but which take different periods of time).
Consider the case where we have eight cores trying to perform the same operation simultanious. All eight read the destination, one wins the cache lock and completes its CAS; the other seven all fail.
What do we want to happen? we want those seven cores to now arrange their data structure opreations such they do not collide with each other.
Recent Comments