Monthly Archives: April 2011

Purpose and nature of backoff

Apart from the significant outstanding question regarding CAS on x86/x64, the purpose and effect of backoff is now apparent; it is to reduce the number of RFOs.

This is in fact very different to the purpose and effect of backoff on ethernet. Ethernet backoff is to prevent destructive operations from occurring simultanously.

Also, I think there is a second purpose; for higher level, data structure operations, we need to ensure that the data structure is stable for the duration of the operation – this is destructive interference, as we do want to stop it. But I begin to think perhaps there should be a separate backoff in the data structure itself, for this.

Finally, all of this so far is true only for x86/x64. ARM et al have LL/SC, which behaves differently – it does use MESI, and so things are different – I need to think about it.

I have to think about what all this means for how we should go about designing backoff for CAS.

Low-level CAS implementation details

I’ve been trying to learn how CAS works, on x86/x64, at the CPU level. What’s going on in the core? what happens if cores try to CAS the same variable at the same time?

I need to know this for exponential backoff; I’m trying to figure out what the theoretically optimal backoff period is.

My original – entirely incorrect, and probably derived from back-of-the-mind assumptions about ethernet – assumption was that concurrent CASs destructively collide and that this was the source of the rapid performance drop off as the number of threads increases.

Having read the Intel Architecture Manual (and been misled by it – I didn’t read slowly enough) and asked around on Usenet and StackOverflow, what I now understand is that there is a special “cache line lock” mechanism. When a core is CASing, it will first lock the cache line. If another core is CASing and has locked that CAS line already, the core will stall (and it will, because CAS is an implicit full memory barrier, so all earlier in-flight instructions will have completed and future instructions cannot yet be executed).

The performance loss from increasing number of threads comes from increased RFO traffic. Consider one thread per core. At any given time, one core holds the cache line lock and is CASing. When it completes, it will have updated memory. Every other cores cache line for the destination is now invalid. An RFO is required!

Two threads on the same core do much, much better, because they share L1 cache and the cache line remains in the exclusive state. No RFOs have to occur.

This however only partially explains the freelist benchmark results.

Remember that a CAS is of a certain duration, so in any given period of time, a particular number of CAS ‘slots’ are available.

Release 7 Lock-Free Freelist Benchmark #1

   M
   N
   S
  L3U
L2U L2U
L1D L1D
L1I L1I
 P   P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
0 0 0 1 310134488,31013449,0,1.00
0 1 0 1 136313300,6815665,38365,0.22

0 1 0 1 136401284,6820064,50706,0.22
1 1 1 1 111134328,2778358,23851,0.09

0 0 1 1 334747444,16737372,2421,0.54
1 1 1 1 111105898,2777647,40399,0.09

With one thread only, we see maximum performance. With two threads on one core, we see maximum performance – all the CAS slots are used, equally divided between the two threads; no RFOs are required since everything remains valid in the L1 cache.

The problem is the case of four threads, two on each core. Here, affter a CAS completes, there are three threads trying to CAS who will compete for the CAS lock. One on the same core as the thread which just completed the CAS, two on another core. So 33% of the time, no RFO is required. 66% of the time an RFO is required. We should therefore be FASTER than one thread per core. But we are in fact SLOWER.

What gives?

Critical section freelist benchmark

A poster on StackOverflow suggested a Critical Section benchmark, so I’ve made one for the freelist.

Release 7 Lock-Free Freelist Benchmark #1

   M
   N
   S
  L3U
L2U L2U
L1D L1D
L1I L1I
 P   P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
  1     303198412,30319841,0,1.00
  1   1 224521918,11226096,70385,0.37

  1   1 224046912,11202346,84033,0.37
1 1 1 1 194478866,4861972,160979,0.16

1 1     247318302,12365915,118548,0.41
1 1 1 1 194720526,4868013,123328,0.16
Release 7 Critical Section Freelist Benchmark #1

   M
   N
   S
  L3U
L2U L2U
L1D L1D
L1I L1I
 P   P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
  1     290107092,29010709,0,1.00
  1   1 174221358,8711068,6101440,0.30

  1   1 138109586,6905479,750688,0.24
1 1 1 1 181751090,4543777,323308,0.16

1 1     222556400,11127820,77442,0.38
1 1 1 1 181005520,4525138,1745089,0.16

So, with four threads, they run at the same speed. With less than four, lock-free wins. Wish I had a 64 core system right about now :-)

Critical sections are a Windows mechanism, I think basically a spin-lock; they can only be used by threads in the same process.

Theoretical best performance

One thread produces with the freelist benchmark about 330 million operations.

If we extend backoff to a ridiculously high value, then what we have in effect is of the n threads, one thread running at a time. So if we have four threads, with one running at any one time (but rapidly switching between them), then we expect each thread to have one quarter the performance of the single thread, e.g. scalability will be 0.25.

A quick test, with CAS/DCAS delays set to 10,000 (rather than 3 and 5, respectively), gives the following…

Release 7 Lock-Free Freelist Benchmark #1

   M
   N
   S
  L3U
L2U L2U
L1D L1D
L1I L1I
 P   P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
  1     293046788,29304679,0,1.00
  1   1 304562440,15228122,1071082,0.52

  1   1 304929044,15246452,366114,0.52
1 1 1 1 240757694,6018942,925353,0.21

1 1     245785988,12289299,133006,0.42
1 1 1 1 240915528,6022888,687262,0.21

As expected, the two thread case scales at 0.50, and we see the four thread case doing slightly less well than we expect, but about right.

What’s more, if we consider two threads running flat out (the simplest case, easiest to reason about), then the maximum performance we can achieve, given that push and pop destructively interfere, is for each thread 50% of the performance of a single thread; each uses half the possible operation slots.

We note though that with one freelist per thread, two threads almost doubles performance – we scale at 0.92. Clearly, we are not as a single thread bottlenecked by memory – we must be bottlenecked simply by how long a push/pop operation takes to perform. Two threads look like they might be slightly bottlenecked and four threads certainly looks bottlenecked, since they scale at 0.67.

CAS/DCAS

I’ve realised something (obvious when you realise it). I’ve been thinking about CAS/DCAS in the wrong way – I’ve been thinking when you have two at once, they destructively interfere. This is true, but not in the way I was thinking. Imagine you have a thread which is in the middle of a CAS. A second thread then issues a CAS as well (on the same memory). What occurs? I think now the first CAS completes, then the second CAS. CAS instructions themselves do not destructively interfere, they merely delay.

The destructive interference comes at a higher level – in the data structure algorithm itself. We are after all trying to make a CAS actually occur – which requires the destination operand to be stable, so it successfully compares with the comparand. If someone else keeps changing our destination, we keep failing!

So it’s actually the push and pop operations in the freelist which destructively interfere, not the CAS/DCAS operations themselves.

The backoff delay is acting to protect the destination from change so that there’s time for a CAS/DCAS to succeed.

So what we want to achieve with our backoff is having the destination unchanged for long enough for an operation to succeed, but nothing longer – no periods where the destination is needlessly protected from change.

Freelist per thread benchmark results

So, I’ve finally rewritten the test code from its original copy-and-paste for each benchmark to a framework and I’ve added a freelist test where each thread has its own freelist (we still use atomic operations). The purpose of this test is to provide a benchmark where there is no contention, so we can begin to get a feel for how much contention is costing us.

Benchmark #1 is the normal freelist benchmark, a single freelist shared across threads.

Benchmark #2 is identical, except we have a freelist per thread.

These results come from my laptop, with a CAS delay of 3 and a DCAS delay of 5 (for exponential backoff). Removing the delays returns benchmark #1 to its historical values, while having no impact on benchmark #2 (collisions never occur).

Release 7 Lock-Free Freelist Benchmark #1

   M
   N
   S
  L3U
L2U L2U
L1D L1D
L1I L1I
 P   P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
  1     326734118,32673412,0,1.00
  1   1 260885708,13044285,41466,0.40

  1   1 261520490,13076025,14186,0.40
1 1 1 1 197457130,4936428,27432,0.15

1 1     270206346,13510317,24557,0.41
1 1 1 1 197518532,4937963,24360,0.15
Release 7 Lock-Free Freelist Benchmark #2

   M
   N
   S
 L3U
L2U L2U
L1D L1D
L1I L1I
 P   P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
  1     332387914,33238791,0,1.00
  1   1 610409486,30520474,18490,0.92

  1   1 609682208,30484110,28705,0.92
1 1 1 1 886672632,22166816,72449,0.67

1 1     483635212,24181761,48864,0.73
1 1 1 1 886780032,22169501,51639,0.67

Have to think about this…

Been busy at work

Nothing done just recently, ben busy at work last two weekends and the evenings in between.

Bit of a break now. Might write an image viewer, been meaning to do one for ages.

Laptop off to Sony for fan repair – setting up the server PC for development.