I’ve been trying to learn how CAS works on x86/x64, at the CPU level. What’s going on in the core? What happens if two cores try to CAS the same variable at the same time?
I need to know this for exponential backoff; I’m trying to figure out what the theoretically optimal backoff period is.
My original assumption – entirely incorrect, and probably derived from back-of-the-mind assumptions about Ethernet – was that concurrent CASes destructively collide, and that this was the source of the rapid performance drop-off as the number of threads increases.
Having read the Intel Architecture Manual (and been misled by it – I didn’t read slowly enough) and asked around on Usenet and StackOverflow, what I now understand is that there is a special “cache line lock” mechanism. When a core is CASing, it will first lock the cache line. If another core is CASing and has already locked that cache line, the first core will stall – and stall it must, because CAS is an implicit full memory barrier: all earlier in-flight instructions will have completed and later instructions cannot yet begin.
The performance loss from an increasing number of threads comes from increased RFO (read-for-ownership) traffic. Consider one thread per core. At any given time, one core holds the cache line lock and is CASing. When it completes, it will have updated memory, so every other core’s copy of the destination cache line is now invalid. An RFO is required!
Two threads on the same core do much, much better, because they share the L1 cache and the cache line remains exclusively held by that core. No RFOs have to occur.
This however only partially explains the freelist benchmark results.
Remember that a CAS takes a certain duration, so in any given period of time, only a certain number of CAS ‘slots’ are available.
Release 7 Lock-Free Freelist Benchmark #1
(Each “L” column is a logical core; 1 indicates a benchmark thread is running on it.)

L L L L   total ops   mean ops/sec per thread   std dev   scalability
0 0 0 1   310134488                  31013449         0          1.00
0 1 0 1   136313300                   6815665     38365          0.22
0 1 0 1   136401284                   6820064     50706          0.22
1 1 1 1   111134328                   2778358     23851          0.09
0 0 1 1   334747444                  16737372      2421          0.54
1 1 1 1   111105898                   2777647     40399          0.09
With one thread only, we see maximum performance. With two threads on one core, we again see maximum total throughput – all the CAS slots are used, divided equally between the two threads, and no RFOs are required since the cache line remains valid in the shared L1 cache.
The problem is the case of four threads, two on each core. Here, after a CAS completes, there are three threads waiting to CAS, all competing for the cache line lock: one on the same core as the thread which just completed, and two on the other core. So a third of the time, no RFO is required; two thirds of the time, an RFO is required. We should therefore be FASTER than one thread per core. But we are in fact SLOWER.