Bleark

I have not been looking closely enough at the results. With so many digits in each number, it’s easy to miss exactly how many digits of difference there are between two numbers – and in particular I have not looked closely enough at the standard deviation column.

The back-off testing I’ve done so far has been on ARM.

I’ve been repeating the tests now on x64.

The standard deviation values are huge – i.e. one thread is dominating, the others are doing very little work.

I’ve modified the benchmark output to actually show the number of iterations performed by each thread.

Here’s an example, the freelist;

  1   1 28060032,2806003,1295583,Lock-Free (DWCAS 0),18610592,9449440
  1   1 44278234,4427823,2881935,Lock-Free (DWCAS 1),32328296,11949938
  1   1 66740674,6674067,7405432,Lock-Free (DWCAS 2),59552493,7188181
  1   1 76710634,7671063,5774912,Lock-Free (DWCAS 3),58772714,17937920
  1   1 84710626,8471063,4498865,Lock-Free (DWCAS 4),58261203,26449423
  1   1 99004906,9900491,11592576,Lock-Free (DWCAS 5),90488398,8516508
  1   1 97875778,9787578,6058038,Lock-Free (DWCAS 6),70356286,27519492
  1   1 99880781,9988078,5948326,Lock-Free (DWCAS 7),70970900,28909881
  1   1 104587483,10458748,7230319,Lock-Free (DWCAS 8),77856779,26730704
  1   1 109813704,10981370,11987819,Lock-Free (DWCAS 9),97290193,12523511
  1   1 108797689,10879769,7387170,Lock-Free (DWCAS 10),80516436,28281253

One thread is generally doing about 70% to 80% of the work.

Now, it’s kinda interesting – very interesting in fact – to look at the values for the GCC spinlocks;

  1   1 16391375,1639138,13448,GCC Spinlock (atomic),8148140,8243235
  1   1 17053036,1705304,20385,GCC Spinlock (sync),8454446,8598590

These guys are very fair. I’m not quite sure how they’re doing it, in fact! Although with DWCAS 0 the back-off code is still doing some work (I’ve not removed it, just set the value so it’s always 0), and that might in effect be a back-off which the GCC spinlocks simply do not have.

But – fair as they are – if we compare them to all of the DWCAS values, we still see that every DWCAS thread is doing more work than either of the GCC spinlock threads.

On ARM, this behaviour happens MUCH less. Here’s an example for ARM;

1 1 1 1 17640623,882031,19892,Lock-Free (DWCAS 0)
1 1 1 1 23285262,1164263,73337,Lock-Free (DWCAS 1)
1 1 1 1 26168142,1308407,84105,Lock-Free (DWCAS 2)
1 1 1 1 28152124,1407606,96889,Lock-Free (DWCAS 3)
1 1 1 1 29036007,1451800,116736,Lock-Free (DWCAS 4)
1 1 1 1 29464435,1473222,86809,Lock-Free (DWCAS 5)
1 1 1 1 29650621,1482531,95732,Lock-Free (DWCAS 6)
1 1 1 1 29559530,1477976,98744,Lock-Free (DWCAS 7)
1 1 1 1 29763734,1488187,94874,Lock-Free (DWCAS 8)
1 1 1 1 29939910,1496996,108147,Lock-Free (DWCAS 9)
1 1 1 1 30086056,1504303,62667,Lock-Free (DWCAS 10)

Now, I may be – ha, may WELL be – utterly wrong, but I suspect what’s going on here may be a combination of the difference between LL/SC and CAS, and the work the benchmark is doing.

The benchmark is simply a loop, where a thread pops and then pushes. The freelist starts with one element per thread.
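
To make that concrete, here’s a minimal sketch of what each benchmark thread does – freelist_pop(), freelist_push() and benchmark_running are hypothetical stand-ins, not the actual liblfds or benchmark code:

#include <stdatomic.h>
#include <stdint.h>

struct freelist_element;
extern struct freelist_element *freelist_pop( void );        /* hypothetical stand-in for the real pop call */
extern void freelist_push( struct freelist_element *fe );    /* hypothetical stand-in for the real push call */
extern atomic_int benchmark_running;                          /* cleared by the main thread when the timed period ends */

uint64_t benchmark_thread( void )
{
  uint64_t iteration_count = 0;
  struct freelist_element *fe;

  /* the freelist starts with one element per thread, so the pop should always find an element */
  while( atomic_load(&benchmark_running) )
  {
    fe = freelist_pop();
    freelist_push( fe );
    iteration_count++;
  }

  /* per-thread count, so the output can show how much work each thread actually did */
  return iteration_count;
}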

Now, with CAS on Intel, as I understand it, a logical core will lock the cache line which is the target for the CAS, and then perform the operation.

So the benchmark begins: one of the threads gets to lock the cache line first, the other stalls; then, when the first is done, the second runs – and fails, because a push or a pop has occurred – *and by then the first thread is probably already stalling on its next pop or push*, so it will again succeed, causing the second thread to fail again…

With LL/SC, there’s no stalling – every core just runs its code – and if someone else performs the operation in question then our attempt fails.

So with CAS, it’s almost like we need to run the back-off on EVERY operation, not just a FAILED operation.
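
In code terms, the difference is just where the backoff sits relative to the atomic operation. A sketch, with attempt_dwcas_operation() and backoff() as generic stand-ins rather than the liblfds internals:

extern int attempt_dwcas_operation( void );   /* hypothetical: one pop or push attempt, returns 1 on success */
extern void backoff( void );                  /* hypothetical: some backoff delay */

/* the usual pattern: back off only after a FAILED attempt */
void operation_backoff_on_failure( void )
{
  while( !attempt_dwcas_operation() )
    backoff();
}

/* the variant suggested above for CAS machines: back off before EVERY attempt */
void operation_backoff_always( void )
{
  do
    backoff();
  while( !attempt_dwcas_operation() );
}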

Further experimental results

I modified the backoff to be non-exponential; to just be a fixed number of loop iterations.
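
The fixed backoff is nothing more than a busy loop of a given length; something like this sketch (the volatile is just to stop the compiler optimising the loop away):

void fixed_backoff( unsigned long backoff_iterations )
{
  volatile unsigned long loop;

  /* spin for a fixed number of iterations, regardless of how contended things are */
  for( loop = 0 ; loop < backoff_iterations ; loop++ )
    ;
}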

First, here’s the one-thread-per-core result, with exponential backoff, for the freelist on ARM;

1 1 1 1 17640623,882031,19892,Lock-Free (DWCAS 0)
1 1 1 1 23285262,1164263,73337,Lock-Free (DWCAS 1)
1 1 1 1 26168142,1308407,84105,Lock-Free (DWCAS 2)
1 1 1 1 28152124,1407606,96889,Lock-Free (DWCAS 3)
1 1 1 1 29036007,1451800,116736,Lock-Free (DWCAS 4)
1 1 1 1 29464435,1473222,86809,Lock-Free (DWCAS 5)
1 1 1 1 29650621,1482531,95732,Lock-Free (DWCAS 6)
1 1 1 1 29559530,1477976,98744,Lock-Free (DWCAS 7)
1 1 1 1 29763734,1488187,94874,Lock-Free (DWCAS 8)
1 1 1 1 29939910,1496996,108147,Lock-Free (DWCAS 9)
1 1 1 1 30086056,1504303,62667,Lock-Free (DWCAS 10)

Now here’s the same, with a non-exponential backoff;

1 1 1 1 17013997,850700,14393,Lock-Free (DWCAS 0)
1 1 1 1 19076057,953803,490883,Lock-Free (DWCAS 1)
1 1 1 1 17042025,852101,14743,Lock-Free (DWCAS 2)
1 1 1 1 18177159,908858,42810,Lock-Free (DWCAS 3)
1 1 1 1 17071054,853553,45057,Lock-Free (DWCAS 4)
1 1 1 1 16705689,835284,33913,Lock-Free (DWCAS 5)
1 1 1 1 20509489,1025474,66327,Lock-Free (DWCAS 6)
1 1 1 1 17431414,871571,24032,Lock-Free (DWCAS 7)
1 1 1 1 17526509,876325,23812,Lock-Free (DWCAS 8)
1 1 1 1 18478460,923923,32485,Lock-Free (DWCAS 9)
1 1 1 1 18889871,944494,28001,Lock-Free (DWCAS 10)

Obviously, comparing these two sets of results, exponential has much better performance.

However, what exponential really means is longer backoffs when things get busy. This could in fact be approximately the same as simply having longer backoffs, full stop.

The exponential backoff, when set to a value of 10, has a maximum iteration count of 2^10 * 10 (10240), as the exponential part has a maximum multiplier of 2^10 before it resets back to 1.

The results for a fixed period backoff of 10240-10250 are;

1 1 1 1 31359328,1567966,277086,Lock-Free (DWCAS 10240)
1 1 1 1 31350319,1567516,187042,Lock-Free (DWCAS 10241)
1 1 1 1 31350319,1567516,337291,Lock-Free (DWCAS 10242)
1 1 1 1 31367336,1568367,260849,Lock-Free (DWCAS 10243)
1 1 1 1 31372341,1568617,247379,Lock-Free (DWCAS 10244)
1 1 1 1 31375344,1568767,250752,Lock-Free (DWCAS 10245)
1 1 1 1 31363332,1568167,220210,Lock-Free (DWCAS 10246)
1 1 1 1 31349318,1567466,201058,Lock-Free (DWCAS 10247)
1 1 1 1 31353322,1567666,130502,Lock-Free (DWCAS 10248)
1 1 1 1 31353322,1567666,361582,Lock-Free (DWCAS 10249)
1 1 1 1 31351320,1567566,348364,Lock-Free (DWCAS 10250)

Well! ain’t dat interestin’. We have the same performance as with exponential – not quite as good, because the standard deviation is about 10x larger, which means the cores are not being treated as fairly; some cores are getting more, others less – but still, overall, three times faster than the original above.

At this point it is illuminating to display the results for a single thread.

1 31495464,6299093,0,Lock-Free (DWCAS 1024)
1 31495464,6299093,0,Lock-Free (DWCAS 1025)
1 31494463,6298893,0,Lock-Free (DWCAS 1026)
1 31495464,6299093,0,Lock-Free (DWCAS 1027)
1 31494463,6298893,0,Lock-Free (DWCAS 1028)
1 31495464,6299093,0,Lock-Free (DWCAS 1029)
1 31495464,6299093,0,Lock-Free (DWCAS 1030)
1 31495464,6299093,0,Lock-Free (DWCAS 1031)
1 31495464,6299093,0,Lock-Free (DWCAS 1032)
1 31495464,6299093,0,Lock-Free (DWCAS 1033)
1 31495464,6299093,0,Lock-Free (DWCAS 1034)

What do we understand from this? Well, we see the single-thread total is approximately equal to the total for all four cores. In other words, there is a bottleneck, upstream of the cores, which determines maximum performance. (Memory bandwidth, basically.)

Next I tried an outlier, with a non-exponential backoff of 4000-4010, and then another, 8000-8010.

1 1 1 1 31232201,1561610,116169,Lock-Free (DWCAS 4000)
1 1 1 1 31233202,1561660,101224,Lock-Free (DWCAS 4001)
1 1 1 1 31237206,1561860,157937,Lock-Free (DWCAS 4002)
1 1 1 1 31270239,1563512,104374,Lock-Free (DWCAS 4003)
1 1 1 1 31280249,1564012,105804,Lock-Free (DWCAS 4004)
1 1 1 1 31282251,1564113,176778,Lock-Free (DWCAS 4005)
1 1 1 1 31273242,1563662,154924,Lock-Free (DWCAS 4006)
1 1 1 1 31237206,1561860,181227,Lock-Free (DWCAS 4007)
1 1 1 1 31234203,1561710,175386,Lock-Free (DWCAS 4008)
1 1 1 1 31239208,1561960,194485,Lock-Free (DWCAS 4009)
1 1 1 1 31233202,1561660,113455,Lock-Free (DWCAS 4010)
1 1 1 1 31330299,1566515,275351,Lock-Free (DWCAS 8000)
1 1 1 1 31330299,1566515,229374,Lock-Free (DWCAS 8001)
1 1 1 1 31326295,1566315,324406,Lock-Free (DWCAS 8002)
1 1 1 1 31350319,1567516,107952,Lock-Free (DWCAS 8003)
1 1 1 1 31350319,1567516,214843,Lock-Free (DWCAS 8004)
1 1 1 1 31355324,1567766,77613,Lock-Free (DWCAS 8005)
1 1 1 1 31349318,1567466,220408,Lock-Free (DWCAS 8006)
1 1 1 1 31330299,1566515,214035,Lock-Free (DWCAS 8007)
1 1 1 1 31333302,1566665,255280,Lock-Free (DWCAS 8008)
1 1 1 1 31331300,1566565,358338,Lock-Free (DWCAS 8009)
1 1 1 1 31330299,1566515,293006,Lock-Free (DWCAS 8010)

Total throughput is the same, but the standard deviation doubles – so the 8000 backoff is worse, because the performance is the same but it’s less fair.

So we see, so far, that once the backoff is long enough, we reach the maximum performance allowed by the upstream bottleneck, but after that, we become more and more unfair.

So it would be instructive now to find the lowest non-exponential backoff which achieves maximum or nearly maximum performance (i.e. with the least unfairness).
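
The search itself is just a sweep over the backoff range; a sketch, where run_benchmark() is a hypothetical stand-in that runs the full four-thread test and returns total operations:

#include <stdint.h>

extern uint64_t run_benchmark( unsigned long backoff_iterations );   /* hypothetical */

unsigned long find_lowest_good_backoff( unsigned long lo, unsigned long hi, unsigned long step )
{
  unsigned long backoff, best_backoff = lo;
  uint64_t ops, peak_ops = 0;

  /* first pass: find the peak throughput over the range
     (a real version would cache the results rather than run everything twice) */
  for( backoff = lo ; backoff <= hi ; backoff += step )
    if( (ops = run_benchmark(backoff)) > peak_ops )
      peak_ops = ops;

  /* second pass: the smallest backoff within, say, 5% of that peak */
  for( backoff = lo ; backoff <= hi ; backoff += step )
    if( run_benchmark(backoff) >= peak_ops - peak_ops / 20 )
    {
      best_backoff = backoff;
      break;
    }

  return best_backoff;
}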

This turns out to be about the 550 mark, if I pick about 30,000,000 as the target total ops.

1 1 1 1 30066036,1503302,123386,Lock-Free (DWCAS 550)
1 1 1 1 30075045,1503752,111128,Lock-Free (DWCAS 551)
1 1 1 1 30091061,1504553,132532,Lock-Free (DWCAS 552)
1 1 1 1 30312282,1515614,121997,Lock-Free (DWCAS 553)
1 1 1 1 30350320,1517516,130751,Lock-Free (DWCAS 554)
1 1 1 1 30357327,1517866,69512,Lock-Free (DWCAS 555)
1 1 1 1 30282252,1514113,101445,Lock-Free (DWCAS 556)
1 1 1 1 30070040,1503502,129679,Lock-Free (DWCAS 557)
1 1 1 1 30054024,1502701,97377,Lock-Free (DWCAS 558)
1 1 1 1 30051021,1502551,101635,Lock-Free (DWCAS 559)
1 1 1 1 30056026,1502801,113993,Lock-Free (DWCAS 560)

For easy reference, here are the original exponential results;

1 1 1 1 17640623,882031,19892,Lock-Free (DWCAS 0)
1 1 1 1 23285262,1164263,73337,Lock-Free (DWCAS 1)
1 1 1 1 26168142,1308407,84105,Lock-Free (DWCAS 2)
1 1 1 1 28152124,1407606,96889,Lock-Free (DWCAS 3)
1 1 1 1 29036007,1451800,116736,Lock-Free (DWCAS 4)
1 1 1 1 29464435,1473222,86809,Lock-Free (DWCAS 5)
1 1 1 1 29650621,1482531,95732,Lock-Free (DWCAS 6)
1 1 1 1 29559530,1477976,98744,Lock-Free (DWCAS 7)
1 1 1 1 29763734,1488187,94874,Lock-Free (DWCAS 8)
1 1 1 1 29939910,1496996,108147,Lock-Free (DWCAS 9)
1 1 1 1 30086056,1504303,62667,Lock-Free (DWCAS 10)

And here, with non-exponential backoff 0-10;

1 1 1 1 17013997,850700,14393,Lock-Free (DWCAS 0)
1 1 1 1 19076057,953803,490883,Lock-Free (DWCAS 1)
1 1 1 1 17042025,852101,14743,Lock-Free (DWCAS 2)
1 1 1 1 18177159,908858,42810,Lock-Free (DWCAS 3)
1 1 1 1 17071054,853553,45057,Lock-Free (DWCAS 4)
1 1 1 1 16705689,835284,33913,Lock-Free (DWCAS 5)
1 1 1 1 20509489,1025474,66327,Lock-Free (DWCAS 6)
1 1 1 1 17431414,871571,24032,Lock-Free (DWCAS 7)
1 1 1 1 17526509,876325,23812,Lock-Free (DWCAS 8)
1 1 1 1 18478460,923923,32485,Lock-Free (DWCAS 9)
1 1 1 1 18889871,944494,28001,Lock-Free (DWCAS 10)

What do we draw from all this?

First, that large variations in the non-exponential backoff are required to make a difference to performance. Exponential backoff side-steps this problem by increasing in a non-linear fashion: the backoff value, although fixed, is on each invocation of the backoff multiplied by 1 – 2 – 4 – 8 – 16 – 32 – 64 – 128 – 256 – 512 and finally 1024 (before going back to 1).

So the exponential backoff of 10 gives us a non-exponential backoff of 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240.
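
As a sketch (simplified – in particular without the PRNG randomisation the real liblfds backoff uses), the exponential version looks something like this:

#define BACKOFF_MAX_MULTIPLIER 1024   /* 2^10, after which the multiplier resets */

void exponential_backoff( unsigned long base_backoff, unsigned long *multiplier )
{
  volatile unsigned long loop;
  unsigned long limit = base_backoff * *multiplier;

  /* spin for base_backoff * multiplier iterations; with a base of 10
     this gives 10, 20, 40, ... 10240 across successive invocations */
  for( loop = 0 ; loop < limit ; loop++ )
    ;

  /* double the multiplier for the next backoff, resetting after 1024 */
  if( *multiplier >= BACKOFF_MAX_MULTIPLIER )
    *multiplier = 1;
  else
    *multiplier *= 2;
}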

What we also see is that the lower the non-exponential backoff, the fairer the results – the lower the standard deviation (the strange and not obviously explicable huge value for DWCAS 1 aside – and that result repeats; it’s not a one-off). However, we also see that the lower the non-exponential value, the lower the overall performance.

We also see exponential backoff values from 4 to 10, inclusive, all give good performance and pretty good fairness.

My general view then – for the ARM platform, since what we see here may be totally different on MIPS or Intel, etc., or with many more cores – is that exponential backoff is better, because it makes performance MUCH less sensitive to the choice of backoff value.

Next experiments are going to be on high-core count Intel systems. I want to see if the PRNG part of the backoff matters there, and I want to see how all the above pans out on a CAS system, rather than an LL/SC system.

Sixth light – ARM with backoff

Somewhat puzzling results.

So, backoff matters – for the freelist, and, as before, when crossing physical cores.

That’s fine.

Now check the single-thread queue result for lock-free. It’s deterministic! That’s totally bizarre and unexpected – this is on a live (albeit idle) Linux system. It must be a bug!

Then, similarly, for the rest of the queue results, back-off seems to make no difference.

Now I have to figure out what’s going on here =-)

Freelist Push One Pop One

   M   
   S   
P P P P
L L L L total ops,mean ops/sec/core,std dev/sec/core,lock type
      1 35185150,7037030,0,GCC Spinlock (atomic)
      1 23315292,4663058,0,GCC Spinlock (sync)
      1 31497466,6299493,0,Lock-Free (DWCAS 0)
      1 31497466,6299493,0,Lock-Free (DWCAS 1)
      1 31497466,6299493,0,Lock-Free (DWCAS 2)
      1 31497466,6299493,0,Lock-Free (DWCAS 3)
      1 31498467,6299693,0,Lock-Free (DWCAS 4)
      1 31498467,6299693,0,Lock-Free (DWCAS 5)
      1 31498467,6299693,0,Lock-Free (DWCAS 6)
      1 31498467,6299693,0,Lock-Free (DWCAS 7)
      1 31498467,6299693,0,Lock-Free (DWCAS 8)
      1 31499468,6299894,0,Lock-Free (DWCAS 9)
      1 31498467,6299693,0,Lock-Free (DWCAS 10)
      1 14279265,2855853,0,PThread Mutex
      1 25433408,5086682,0,PThread Spinlock (private)
      1 25432407,5086481,0,PThread Spinlock (shared)
1 1 1 1 16087071,804354,2961,GCC Spinlock (atomic)
1 1 1 1 13944931,697247,49849,GCC Spinlock (sync)
1 1 1 1 17640623,882031,19892,Lock-Free (DWCAS 0)
1 1 1 1 23285262,1164263,73337,Lock-Free (DWCAS 1)
1 1 1 1 26168142,1308407,84105,Lock-Free (DWCAS 2)
1 1 1 1 28152124,1407606,96889,Lock-Free (DWCAS 3)
1 1 1 1 29036007,1451800,116736,Lock-Free (DWCAS 4)
1 1 1 1 29464435,1473222,86809,Lock-Free (DWCAS 5)
1 1 1 1 29650621,1482531,95732,Lock-Free (DWCAS 6)
1 1 1 1 29559530,1477976,98744,Lock-Free (DWCAS 7)
1 1 1 1 29763734,1488187,94874,Lock-Free (DWCAS 8)
1 1 1 1 29939910,1496996,108147,Lock-Free (DWCAS 9)
1 1 1 1 30086056,1504303,62667,Lock-Free (DWCAS 10)
1 1 1 1 6391385,319569,13732,PThread Mutex
1 1 1 1 10378368,518918,6813,PThread Spinlock (private)
1 1 1 1 10356346,517817,13782,PThread Spinlock (shared)
    1 1 19879860,1987986,9343,GCC Spinlock (atomic)
    1 1 13400387,1340039,4672,GCC Spinlock (sync)
    1 1 17836819,1783682,296574,Lock-Free (DWCAS 0)
    1 1 27857830,2785783,10759,Lock-Free (DWCAS 1)
    1 1 29423394,2942339,10193,Lock-Free (DWCAS 2)
    1 1 30032002,3003200,5663,Lock-Free (DWCAS 3)
    1 1 30385355,3038536,20810,Lock-Free (DWCAS 4)
    1 1 30545515,3054552,25623,Lock-Free (DWCAS 5)
    1 1 30738708,3073871,2831,Lock-Free (DWCAS 6)
    1 1 30836806,3083681,51529,Lock-Free (DWCAS 7)
    1 1 30910880,3091088,53228,Lock-Free (DWCAS 8)
    1 1 30974944,3097494,48414,Lock-Free (DWCAS 9)
    1 1 31015985,3101598,15430,Lock-Free (DWCAS 10)
    1 1 3898895,389890,1274,PThread Mutex
    1 1 11090079,1109008,6937,PThread Spinlock (private)
    1 1 11090079,1109008,12316,PThread Spinlock (shared)
  1 1 1 16413397,1094226,22509,GCC Spinlock (atomic)
  1 1 1 13315302,887687,48504,GCC Spinlock (sync)
  1 1 1 18945927,1263062,221660,Lock-Free (DWCAS 0)
  1 1 1 26529503,1768634,120754,Lock-Free (DWCAS 1)
  1 1 1 28478450,1898563,129715,Lock-Free (DWCAS 2)
  1 1 1 28676648,1911777,171176,Lock-Free (DWCAS 3)
  1 1 1 29308279,1953885,157572,Lock-Free (DWCAS 4)
  1 1 1 29695666,1979711,155581,Lock-Free (DWCAS 5)
  1 1 1 29959930,1997329,118840,Lock-Free (DWCAS 6)
  1 1 1 30145115,2009674,72812,Lock-Free (DWCAS 7)
  1 1 1 30565535,2037702,113511,Lock-Free (DWCAS 8)
  1 1 1 30658628,2043909,120734,Lock-Free (DWCAS 9)
  1 1 1 30739709,2049314,122498,Lock-Free (DWCAS 10)
  1 1 1 5842837,389522,13478,PThread Mutex
  1 1 1 11063052,737537,555199,PThread Spinlock (private)
  1 1 1 11064053,737604,561915,PThread Spinlock (shared)

Queue Enqueue One Dequeue One

   M   
   S   
P P P P
L L L L total ops,mean ops/sec/core,std dev/sec/core,lock type
      1 31280249,6256050,0,GCC Spinlock (atomic)
      1 20549529,4109906,0,GCC Spinlock (sync)
      1 14745731,2949146,0,Lock-Free (DWCAS 0)
      1 14745731,2949146,0,Lock-Free (DWCAS 1)
      1 14745731,2949146,0,Lock-Free (DWCAS 2)
      1 14745731,2949146,0,Lock-Free (DWCAS 3)
      1 14745731,2949146,0,Lock-Free (DWCAS 4)
      1 14745731,2949146,0,Lock-Free (DWCAS 5)
      1 14745731,2949146,0,Lock-Free (DWCAS 6)
      1 14745731,2949146,0,Lock-Free (DWCAS 7)
      1 14745731,2949146,0,Lock-Free (DWCAS 8)
      1 14745731,2949146,0,Lock-Free (DWCAS 9)
      1 14745731,2949146,0,Lock-Free (DWCAS 10)
      1 13629616,2725923,0,PThread Mutex
      1 23201178,4640236,0,PThread Spinlock (private)
      1 23202179,4640436,0,PThread Spinlock (shared)
1 1 1 1 23403380,1170169,14742,GCC Spinlock (atomic)
1 1 1 1 20153133,1007657,26732,GCC Spinlock (sync)
1 1 1 1 17854837,892742,10035,Lock-Free (DWCAS 0)
1 1 1 1 18008991,900450,11774,Lock-Free (DWCAS 1)
1 1 1 1 18134116,906706,14812,Lock-Free (DWCAS 2)
1 1 1 1 18220202,911010,15063,Lock-Free (DWCAS 3)
1 1 1 1 18278260,913913,18031,Lock-Free (DWCAS 4)
1 1 1 1 18285267,914263,17856,Lock-Free (DWCAS 5)
1 1 1 1 18408390,920420,21765,Lock-Free (DWCAS 6)
1 1 1 1 18418400,920920,21361,Lock-Free (DWCAS 7)
1 1 1 1 18538520,926926,20086,Lock-Free (DWCAS 8)
1 1 1 1 18536518,926826,22936,Lock-Free (DWCAS 9)
1 1 1 1 18656638,932832,24218,Lock-Free (DWCAS 10)
1 1 1 1 8533525,426676,10860,PThread Mutex
1 1 1 1 17157140,857857,10236,PThread Spinlock (private)
1 1 1 1 17181164,859058,10352,PThread Spinlock (shared)
    1 1 26557531,2655753,708,GCC Spinlock (atomic)
    1 1 25475450,2547545,566,GCC Spinlock (sync)
    1 1 22513491,2251349,991,Lock-Free (DWCAS 0)
    1 1 22511489,2251149,991,Lock-Free (DWCAS 1)
    1 1 22514492,2251449,1133,Lock-Free (DWCAS 2)
    1 1 22511489,2251149,991,Lock-Free (DWCAS 3)
    1 1 22509487,2250949,991,Lock-Free (DWCAS 4)
    1 1 22515493,2251549,991,Lock-Free (DWCAS 5)
    1 1 22508486,2250849,849,Lock-Free (DWCAS 6)
    1 1 22510488,2251049,849,Lock-Free (DWCAS 7)
    1 1 22515493,2251549,991,Lock-Free (DWCAS 8)
    1 1 22509487,2250949,991,Lock-Free (DWCAS 9)
    1 1 22514492,2251449,1133,Lock-Free (DWCAS 10)
    1 1 5277272,527727,18686,PThread Mutex
    1 1 24609585,2460958,708,PThread Spinlock (private)
    1 1 24611587,2461159,708,PThread Spinlock (shared)
  1 1 1 24698674,1646578,18354,GCC Spinlock (atomic)
  1 1 1 19022003,1268134,8570,GCC Spinlock (sync)
  1 1 1 18665647,1244376,22695,Lock-Free (DWCAS 0)
  1 1 1 18686668,1245778,21320,Lock-Free (DWCAS 1)
  1 1 1 18689671,1245978,23144,Lock-Free (DWCAS 2)
  1 1 1 18701683,1246779,24981,Lock-Free (DWCAS 3)
  1 1 1 18705687,1247046,24622,Lock-Free (DWCAS 4)
  1 1 1 18726708,1248447,32907,Lock-Free (DWCAS 5)
  1 1 1 18776758,1251784,34281,Lock-Free (DWCAS 6)
  1 1 1 18781763,1252118,34080,Lock-Free (DWCAS 7)
  1 1 1 18871853,1258124,36373,Lock-Free (DWCAS 8)
  1 1 1 18858840,1257256,36338,Lock-Free (DWCAS 9)
  1 1 1 18945927,1263062,37976,Lock-Free (DWCAS 10)
  1 1 1 8786778,585785,6375,PThread Mutex
  1 1 1 18502484,1233499,57919,PThread Spinlock (private)
  1 1 1 18499481,1233299,57914,PThread Spinlock (shared)

Fifth light – backoff period really matters

I’ve added to the benchmark the capability to benchmark a range of DWCAS backoff values.

Performance is sensitive to the backoff value – the difference in performance can be as large as about 8x – it really matters. Also, as I suspected, the optimal backoff value varies by data structure (so I can’t have one global value, as I do now).

When running a single thread the backoff value doesn’t matter – you never collide with yourself.

When all the running threads are on the same physical core (i.e. say a four-hyperthread physical core, so four logical processors, but all on the same core), backoff makes no or almost no difference, regardless of data structure – the physical processor *knows what’s going on across all its logical cores* and arranges scheduling so there are no collisions.

When we move threads out across multiple physical cores, though, the backoff period makes a lot of difference to performance. My current (untested) thought in this case is that setting the backoff correctly for the maximum workloads is the right choice; to the extent you’re less busy than that, you’re less likely to be having collisions anyway, and so you care less and less about the backoff value being right.

I need to test this though. I need a partial load benchmark – right now I have a full load benchmark, where the threads run a tight loop simply performing pop/push or dequeue/enqueue calls. So I need a loop basically with a controllable delay between atomic ops, and I need to vary that delay.
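
So the partial load benchmark thread would look something like the following sketch – the same hypothetical freelist_pop()/freelist_push()/benchmark_running stand-ins as earlier, plus a controllable delay loop between the atomic operations:

#include <stdatomic.h>
#include <stdint.h>

struct freelist_element;
extern struct freelist_element *freelist_pop( void );        /* hypothetical stand-in */
extern void freelist_push( struct freelist_element *fe );    /* hypothetical stand-in */
extern atomic_int benchmark_running;

uint64_t partial_load_thread( unsigned long delay_iterations )
{
  volatile unsigned long loop;
  uint64_t iteration_count = 0;
  struct freelist_element *fe;

  while( atomic_load(&benchmark_running) )
  {
    fe = freelist_pop();
    freelist_push( fe );
    iteration_count++;

    /* simulated work between atomic operations; sweep delay_iterations
       across runs to go from full load down to a light load */
    for( loop = 0 ; loop < delay_iterations ; loop++ )
      ;
  }

  return iteration_count;
}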

Turning now to the results (which are getting longer and longer!), we can firstly set aside the single-thread and all-threads-on-one-physical-core thread sets, as backoff doesn’t matter.

The DWCAS range tested is 0 to 10, inclusive.

If we look then at the other two thread sets, one thread per physical core and one thread on every logical core, we see for the freelist that as the DWCAS backoff increases, performance increases. We do not in fact see performance decrease, so even a backoff of 10 is too small (although the increases have become small by then).

If we look however at the queue, we see with one thread per physical core the optimal DWCAS backoff is 6. This gives a peak of 40,205,165 operations per second. If we increase or decrease backoff by one – just one! – we lose about 10,000,000 operations per second. Whooosh!

The problem is, this backoff value is only going to be right for my particular processor with its particular clock speed.

If we then look at one-thread-on-every-logical-core, we see – awkwardly – performance still rising as we get to a backoff of 10. However, in this case, the difference is much, much smaller – going from DWCAS 0 to 10 adds about 4,000,000 operations; going from DWCAS 9 to 10 adds about 100,000 – probably just within the usual per-test variance.

Freelist Push One Pop One

   M   
   N   
   S   
  L3U  
L2U L2U
L1D L1D
L1I L1I
 P   P 
L L L L total ops,mean ops/sec/core,std dev/sec/core,lock type
  1     159325166,31865033,0,GCC Spinlock (atomic)
  1     159512353,31902471,0,GCC Spinlock (sync)
  1     134521387,26904277,0,Lock-Free (DWCAS 0)
  1     134466332,26893266,0,Lock-Free (DWCAS 1)
  1     134568434,26913687,0,Lock-Free (DWCAS 2)
  1     134558424,26911685,0,Lock-Free (DWCAS 3)
  1     134549415,26909883,0,Lock-Free (DWCAS 4)
  1     134605471,26921094,0,Lock-Free (DWCAS 5)
  1     134438304,26887661,0,Lock-Free (DWCAS 6)
  1     134520386,26904077,0,Lock-Free (DWCAS 7)
  1     133647514,26729503,0,Lock-Free (DWCAS 8)
  1     134596462,26919292,0,Lock-Free (DWCAS 9)
  1     134496362,26899272,0,Lock-Free (DWCAS 10)
  1     97150053,19430011,0,PThread Mutex
  1     183086904,36617381,0,PThread Spinlock (private)
  1     183145963,36629193,0,PThread Spinlock (shared)
  1   1 16499483,1649948,10900,GCC Spinlock (atomic)
  1   1 17125108,1712511,27463,GCC Spinlock (sync)
  1   1 28201173,2820117,1271658,Lock-Free (DWCAS 0)
  1   1 42947905,4294790,2638872,Lock-Free (DWCAS 1)
  1   1 67146079,6714608,7524203,Lock-Free (DWCAS 2)
  1   1 76759683,7675968,5822052,Lock-Free (DWCAS 3)
  1   1 85787702,8578770,4353339,Lock-Free (DWCAS 4)
  1   1 98828730,9882873,11478759,Lock-Free (DWCAS 5)
  1   1 97956859,9795686,6006933,Lock-Free (DWCAS 6)
  1   1 99874775,9987478,5921146,Lock-Free (DWCAS 7)
  1   1 104545441,10454544,7179073,Lock-Free (DWCAS 8)
  1   1 109721612,10972161,12044161,Lock-Free (DWCAS 9)
  1   1 110770660,11077066,12010469,Lock-Free (DWCAS 10)
  1   1 12278266,1227827,45583,PThread Mutex
  1   1 22149127,2214913,1049122,PThread Spinlock (private)
  1   1 19243224,1924322,2330123,PThread Spinlock (shared)
1 1     74481407,7448141,47707,GCC Spinlock (atomic)
1 1     74966892,7496689,80125,GCC Spinlock (sync)
1 1     102240138,10224014,105040,Lock-Free (DWCAS 0)
1 1     103181078,10318108,888448,Lock-Free (DWCAS 1)
1 1     102488386,10248839,764722,Lock-Free (DWCAS 2)
1 1     102219117,10221912,685305,Lock-Free (DWCAS 3)
1 1     102142040,10214204,1166194,Lock-Free (DWCAS 4)
1 1     101971870,10197187,728199,Lock-Free (DWCAS 5)
1 1     101869768,10186977,646942,Lock-Free (DWCAS 6)
1 1     101861760,10186176,485560,Lock-Free (DWCAS 7)
1 1     101811710,10181171,710645,Lock-Free (DWCAS 8)
1 1     101786685,10178668,519960,Lock-Free (DWCAS 9)
1 1     101738637,10173864,568941,Lock-Free (DWCAS 10)
1 1     40219179,4021918,27888,PThread Mutex
1 1     114551437,11455144,933607,PThread Spinlock (private)
1 1     115891776,11589178,206399,PThread Spinlock (shared)
1 1 1 1 13513500,675675,119368,GCC Spinlock (atomic)
1 1 1 1 13740727,687036,212698,GCC Spinlock (sync)
1 1 1 1 17002986,850149,43669,Lock-Free (DWCAS 0)
1 1 1 1 30076046,1503802,233870,Lock-Free (DWCAS 1)
1 1 1 1 51405354,2570268,1130894,Lock-Free (DWCAS 2)
1 1 1 1 57724667,2886233,369728,Lock-Free (DWCAS 3)
1 1 1 1 66267201,3313360,392702,Lock-Free (DWCAS 4)
1 1 1 1 70794724,3539736,391461,Lock-Free (DWCAS 5)
1 1 1 1 75239164,3761958,359040,Lock-Free (DWCAS 6)
1 1 1 1 77145068,3857253,376639,Lock-Free (DWCAS 7)
1 1 1 1 80203123,4010156,497413,Lock-Free (DWCAS 8)
1 1 1 1 80796716,4039836,375427,Lock-Free (DWCAS 9)
1 1 1 1 82812730,4140636,485744,Lock-Free (DWCAS 10)
1 1 1 1 15654639,782732,5485,PThread Mutex
1 1 1 1 20225205,1011260,84782,PThread Spinlock (private)
1 1 1 1 16291275,814564,187390,PThread Spinlock (shared)

Queue Enqueue One Dequeue One

   M   
   N   
   S   
  L3U  
L2U L2U
L1D L1D
L1I L1I
 P   P 
L L L L total ops,mean ops/sec/core,std dev/sec/core,lock type
  1     139550411,27910082,0,GCC Spinlock (atomic)
  1     139536397,27907279,0,GCC Spinlock (sync)
  1     105669564,21133913,0,Lock-Free (DWCAS 0)
  1     105723618,21144724,0,Lock-Free (DWCAS 1)
  1     105669564,21133913,0,Lock-Free (DWCAS 2)
  1     105736631,21147326,0,Lock-Free (DWCAS 3)
  1     105085981,21017196,0,Lock-Free (DWCAS 4)
  1     105662557,21132511,0,Lock-Free (DWCAS 5)
  1     105607502,21121500,0,Lock-Free (DWCAS 6)
  1     105591486,21118297,0,Lock-Free (DWCAS 7)
  1     105640535,21128107,0,Lock-Free (DWCAS 8)
  1     105569464,21113893,0,Lock-Free (DWCAS 9)
  1     105628523,21125705,0,Lock-Free (DWCAS 10)
  1     96419323,19283865,0,PThread Mutex
  1     202514312,40502862,0,PThread Spinlock (private)
  1     212104893,42420979,0,PThread Spinlock (shared)
  1   1 23690667,2369067,142,GCC Spinlock (atomic)
  1   1 23798775,2379878,142,GCC Spinlock (sync)
  1   1 28200172,2820017,484428,Lock-Free (DWCAS 0)
  1   1 28158130,2815813,17837,Lock-Free (DWCAS 1)
  1   1 29881852,2988185,705832,Lock-Free (DWCAS 2)
  1   1 28610582,2861058,265855,Lock-Free (DWCAS 3)
  1   1 35524489,3552449,2168317,Lock-Free (DWCAS 4)
  1   1 30298268,3029827,837485,Lock-Free (DWCAS 5)
  1   1 40205165,4020516,3741929,Lock-Free (DWCAS 6)
  1   1 28476448,2847645,5663,Lock-Free (DWCAS 7)
  1   1 30428398,3042840,786806,Lock-Free (DWCAS 8)
  1   1 28037009,2803701,25057,Lock-Free (DWCAS 9)
  1   1 28542514,2854251,114949,Lock-Free (DWCAS 10)
  1   1 16863847,1686385,34400,PThread Mutex
  1   1 25633608,2563361,140430,PThread Spinlock (private)
  1   1 25596571,2559657,7786,PThread Spinlock (shared)
1 1     110576466,11057647,3114,GCC Spinlock (atomic)
1 1     116036921,11603692,2973,GCC Spinlock (sync)
1 1     90363273,9036327,99802,Lock-Free (DWCAS 0)
1 1     89785696,8978570,82106,Lock-Free (DWCAS 1)
1 1     89660571,8966057,83380,Lock-Free (DWCAS 2)
1 1     89572483,8957248,100934,Lock-Free (DWCAS 3)
1 1     89570481,8957048,203426,Lock-Free (DWCAS 4)
1 1     89332243,8933224,116506,Lock-Free (DWCAS 5)
1 1     89221132,8922113,204700,Lock-Free (DWCAS 6)
1 1     89082994,8908299,92299,Lock-Free (DWCAS 7)
1 1     88771683,8877168,107729,Lock-Free (DWCAS 8)
1 1     88816728,8881673,102491,Lock-Free (DWCAS 9)
1 1     88632544,8863254,144111,Lock-Free (DWCAS 10)
1 1     76736660,7673666,9060,PThread Mutex
1 1     125234109,12523411,83097,PThread Spinlock (private)
1 1     107783676,10778368,2831,PThread Spinlock (shared)
1 1 1 1 28380352,1419018,466288,GCC Spinlock (atomic)
1 1 1 1 28051023,1402551,372684,GCC Spinlock (sync)
1 1 1 1 28151123,1407556,575764,Lock-Free (DWCAS 0)
1 1 1 1 28726698,1436335,353561,Lock-Free (DWCAS 1)
1 1 1 1 28735707,1436785,482692,Lock-Free (DWCAS 2)
1 1 1 1 29066037,1453302,382135,Lock-Free (DWCAS 3)
1 1 1 1 29338309,1466915,314677,Lock-Free (DWCAS 4)
1 1 1 1 29815786,1490789,623586,Lock-Free (DWCAS 5)
1 1 1 1 30274244,1513712,527132,Lock-Free (DWCAS 6)
1 1 1 1 30673643,1533682,467011,Lock-Free (DWCAS 7)
1 1 1 1 31248217,1562411,570384,Lock-Free (DWCAS 8)
1 1 1 1 32043011,1602151,833219,Lock-Free (DWCAS 9)
1 1 1 1 32117085,1605854,715797,Lock-Free (DWCAS 10)
1 1 1 1 21455434,1072772,4463,PThread Mutex
1 1 1 1 35506471,1775324,620979,PThread Spinlock (private)
1 1 1 1 34894860,1744743,682990,PThread Spinlock (shared)

So what do we make of it all?

If you are targeting a particular device, then you can find the optimal backoff values and use them – no problem.

If you are not, however, and so the processor model and clock speed can vary, then you could run the optimal backoff finder on an idle system of that type and use those values.

However, this may not be possible; in which case, *other* backoff values have to be used – and it just has to be hoped that they are reasonably correct, i.e. that backoff values don’t vary much across processor model and clock. Probably they do though – I mean, if a change of *one* in the queue backoff removes 25% performance, changing the clock speed by say 1 GHz is probably going to matter!

I wonder if it’s possible to build in something like automatic gain control.

Also – is the PRNG/Ethernet-like random exponential backoff necessary? Would a fixed backoff period work just as well, and so allow the removal of the PRNG complexity and of passing around the PRNG state?
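
For reference, the randomised (Ethernet-like) variant being questioned here looks roughly like the sketch below – prng_next() and struct prng_state are hypothetical stand-ins, not the liblfds internals; the point is that the PRNG state has to be threaded through to every call site, which is the complexity a fixed period would remove:

#include <stdint.h>

struct prng_state;                                   /* hypothetical */
extern uint64_t prng_next( struct prng_state *ps );  /* hypothetical */

void randomised_backoff( struct prng_state *ps, unsigned long base_backoff, unsigned long *multiplier )
{
  volatile unsigned long loop;

  /* spin for a random length, up to the current ceiling */
  unsigned long limit = (unsigned long) ( prng_next(ps) % (base_backoff * *multiplier + 1) );

  for( loop = 0 ; loop < limit ; loop++ )
    ;

  /* double the ceiling for the next backoff, resetting after 1024 */
  if( *multiplier >= 1024 )
    *multiplier = 1;
  else
    *multiplier *= 2;
}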

Fourth light – ARM

Benchmark results from a Raspberry Pi 2 Model B – a single-package, four physical core, non-hyperthreaded ARM processor.

The results are not clearly understandable to me. All of this also has the unknown factor of how much the DWCAS backoff value of 16 is a good or bad choice – and this could be affecting results hugely.

With the freelist, pthread spinlocks seem to perform very poorly on ARM, compared to Intel. The lock-free freelist wins across the board, and by 2x or 3x – the exception being the GCC 4.7.3 or better intrinsics, which are about 10% faster. I’m guessing the pthread spinlocks are issuing load and store memory barriers – I suspect the sync intrinsics do this, and the spinlocks run at the same speed as them – whereas liblfds doesn’t, and it runs at basically the same speed as the GCC atomic intrinsics.

When we come to every core running, liblfds is way out in the lead – about 2x over GCC atomics and so 3x over pthreads/sync.

The queue is a different matter. There’s quite a bit more atomic work being done and so its behaviour is really very different to the freelist. Lock-free performs about as well as the GCC atomics and the pthread spinlocks – sometimes a bit better, sometimes a bit worse.

Experiments with the backoff value are now required.

Freelist Push One Pop One

   M   
   S   
P P P P
L L L L total ops,mean ops/sec/core,std dev/sec/core,lock type
      1 35310275,7062055,0,GCC Spinlock (atomic)
      1 23316293,4663259,0,GCC Spinlock (sync)
      1 31498467,6299693,0,Lock-Free
      1 14279265,2855853,0,PThread Mutex
      1 25436411,5087282,0,PThread Spinlock (private)
      1 25435410,5087082,0,PThread Spinlock (shared)
1 1 1 1 16188172,809409,3536,GCC Spinlock (atomic)
1 1 1 1 13913900,695695,49224,GCC Spinlock (sync)
1 1 1 1 30561531,1528077,104056,Lock-Free
1 1 1 1 6315309,315765,7802,PThread Mutex
1 1 1 1 10346336,517317,5497,PThread Spinlock (private)
1 1 1 1 10334324,516716,13563,PThread Spinlock (shared)
    1 1 21355334,2135533,9343,GCC Spinlock (atomic)
    1 1 13427414,1342741,12174,GCC Spinlock (sync)
    1 1 31201170,3120117,187146,Lock-Free
    1 1 3972969,397297,1557,PThread Mutex
    1 1 11071060,1107106,15572,PThread Spinlock (private)
    1 1 11065054,1106505,19819,PThread Spinlock (shared)
  1 1 1 16608592,1107239,25416,GCC Spinlock (atomic)
  1 1 1 14296282,953085,24382,GCC Spinlock (sync)
  1 1 1 30841811,2056121,65744,Lock-Free
  1 1 1 5917912,394527,7868,PThread Mutex
  1 1 1 11028017,735201,517840,PThread Spinlock (private)
  1 1 1 11028017,735201,504296,PThread Spinlock (shared)

Queue Enqueue One Dequeue One

   M   
   S   
P P P P
L L L L total ops,mean ops/sec/core,std dev/sec/core,lock type
      1 31280249,6256050,0,GCC Spinlock (atomic)
      1 20548528,4109706,0,GCC Spinlock (sync)
      1 14745731,2949146,0,Lock-Free
      1 13629616,2725923,0,PThread Mutex
      1 23202179,4640436,0,PThread Spinlock (private)
      1 23202179,4640436,0,PThread Spinlock (shared)
1 1 1 1 23410387,1170519,14323,GCC Spinlock (atomic)
1 1 1 1 20547527,1027376,35305,GCC Spinlock (sync)
1 1 1 1 18781763,939088,25213,Lock-Free
1 1 1 1 8489481,424474,8878,PThread Mutex
1 1 1 1 17234217,861711,9866,PThread Spinlock (private)
1 1 1 1 17153136,857657,10259,PThread Spinlock (shared)
    1 1 30177147,3017715,2973,GCC Spinlock (atomic)
    1 1 23542519,2354252,10334,GCC Spinlock (sync)
    1 1 22512490,2251249,1982,Lock-Free
    1 1 4659655,465966,11750,PThread Mutex
    1 1 24581557,2458156,708,PThread Spinlock (private)
    1 1 24612588,2461259,849,PThread Spinlock (shared)
  1 1 1 25284259,1685617,19065,GCC Spinlock (atomic)
  1 1 1 18883865,1258924,4988,GCC Spinlock (sync)
  1 1 1 19091072,1272738,36525,Lock-Free
  1 1 1 8787779,585852,910,PThread Mutex
  1 1 1 18312294,1220820,58245,PThread Spinlock (private)
  1 1 1 18281263,1218751,56777,PThread Spinlock (shared)

Third light

Fixed two bugs. First, the tests were running for about three times as long as they should – the performance numbers were correct in relation to each other, but were themselves incorrect. Second, I forgot to init the liblfds library, and so the PRNG was always returning 0, which totally fubared backoff! I thought about adding an assert for this in 7.1.0 but decided not to – I’m going to add it now; it’s a nasty silent failure, because for example when the PRNG is correctly initialized there’s a *massive* (3x) performance improvement in the freelist (when running one thread per physical core – other factors come into play in other thread combinations – mainly memory contention).
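
The assert in question is straightforward – something like this sketch, where the names are hypothetical and not the actual liblfds internals; the idea is simply that the PRNG state carries a sentinel set by the library init function, and the backoff code asserts on it rather than silently spinning with a PRNG that always returns 0:

#include <assert.h>
#include <stdint.h>

#define PRNG_INITIALIZED_SENTINEL 0x0123456789abcdefULL   /* hypothetical value, set by library init */

struct prng_state
{
  uint64_t initialized_sentinel;
  uint64_t seed;
};

static inline uint64_t prng_next( struct prng_state *ps )
{
  /* fail loudly, rather than silently returning 0 forever */
  assert( ps->initialized_sentinel == PRNG_INITIALIZED_SENTINEL );

  /* ...the actual PRNG update and return would go here... */
  return ps->seed;
}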

Current results (DWCAS backoff set to 16) on Linux on a Core i5 are…

Freelist Push One Pop One

   M   
   N   
   S   
  L3U  
L2U L2U
L1D L1D
L1I L1I
 P   P 
L L L L total ops,mean ops/sec/core,std dev/sec/core,lock type
  1     158972814,31794563,0,GCC Spinlock (atomic)
  1     159383224,31876645,0,GCC Spinlock (sync)
  1     134314180,26862836,0,Lock-Free
  1     97038942,19407788,0,PThread Mutex
  1     183193010,36638602,0,PThread Spinlock (private)
  1     181957776,36391555,0,PThread Spinlock (shared)
  1   1 17720703,1772070,440968,GCC Spinlock (atomic)
  1   1 17745728,1774573,302378,GCC Spinlock (sync)
  1   1 113294181,11329418,1430492,Lock-Free
  1   1 14249235,1424924,19111,PThread Mutex
  1   1 21189168,2118917,1752830,PThread Spinlock (private)
  1   1 22103081,2210308,1183323,PThread Spinlock (shared)
1 1     75757682,7575768,13307,GCC Spinlock (atomic)
1 1     74735661,7473566,119621,GCC Spinlock (sync)
1 1     94664570,9466457,353058,Lock-Free
1 1     40270230,4027023,18969,PThread Mutex
1 1     114018905,11401890,10617,PThread Spinlock (private)
1 1     113393280,11339328,712627,PThread Spinlock (shared)
1 1 1 1 14968954,748448,128613,GCC Spinlock (atomic)
1 1 1 1 15102087,755104,310864,GCC Spinlock (sync)
1 1 1 1 86316230,4315812,300009,Lock-Free
1 1 1 1 15692677,784634,2765,PThread Mutex
1 1 1 1 18496478,924824,85572,PThread Spinlock (private)
1 1 1 1 18526508,926325,79346,PThread Spinlock (shared)

Queue Enqueue One Dequeue One

   M   
   N   
   S   
  L3U  
L2U L2U
L1D L1D
L1I L1I
 P   P 
L L L L total ops,mean ops/sec/core,std dev/sec/core,lock type
  1     137902765,27580553,0,GCC Spinlock (atomic)
  1     149820671,29964134,0,GCC Spinlock (sync)
  1     105693588,21138718,0,Lock-Free
  1     96371275,19274255,0,PThread Mutex
  1     211975764,42395153,0,PThread Spinlock (private)
  1     200698498,40139700,0,PThread Spinlock (shared)
  1   1 26540514,2654051,4247,GCC Spinlock (atomic)
  1   1 27436409,2743641,425,GCC Spinlock (sync)
  1   1 31039008,3103901,1169309,Lock-Free
  1   1 17809792,1780979,14439,PThread Mutex
  1   1 25593568,2559357,1699,PThread Spinlock (private)
  1   1 25189164,2518916,148641,PThread Spinlock (shared)
1 1     113591478,11359148,1133,GCC Spinlock (atomic)
1 1     114401287,11440129,1274,GCC Spinlock (sync)
1 1     84650566,8465057,152888,Lock-Free
1 1     79341262,7934126,140713,PThread Mutex
1 1     107801694,10780169,3681,PThread Spinlock (private)
1 1     134627493,13462749,156710,PThread Spinlock (shared)
1 1 1 1 28101073,1405054,48162,GCC Spinlock (atomic)
1 1 1 1 26123097,1306155,39169,GCC Spinlock (sync)
1 1 1 1 34445411,1722271,304450,Lock-Free
1 1 1 1 21404383,1070219,11668,PThread Mutex
1 1 1 1 34058024,1702901,407905,PThread Spinlock (private)
1 1 1 1 35752717,1787636,595948,PThread Spinlock (shared)

The freelist result is great – when running just one thread on one logical core, or when running both threads on a single physical core, the pthread spinlocks win; but when moving out over multiple physical cores, liblfds is four to five times faster. I suspect the magnitude of the win will grow as the number of cores increases – experiments on Amazon with high core counts to follow.

The queue result is similar, but the order of improvement is much smaller, and indeed with one thread on every logical core there is no improvement – here I suspect the memory bus is getting saturated, though it could be related to the backoff value. More experiments to follow.

Second light

This is with DWCAS backoff set to 2 (rather than 16).

   M   
   N   
   S   
  L3U  
L2U L2U
L1D L1D
L1I L1I
 P   P 
L L L L total ops,mean ops/sec/core,std dev/sec/core,lock type
  1     685502818,137100564,0,GCC Spinlock (atomic)
  1     684662979,136932596,0,GCC Spinlock (sync)
  1     577258682,115451736,0,Lock-Free
  1     417807390,83561478,0,PThread Mutex
  1     788132345,157626469,0,PThread Spinlock (private)
  1     787191405,157438281,0,PThread Spinlock (shared)
  1   1 75191116,7519112,1834370,GCC Spinlock (atomic)
  1   1 74801727,7480173,1167185,GCC Spinlock (sync)
  1   1 123331208,12333121,545300,Lock-Free
  1   1 60827767,6082777,860277,PThread Mutex
  1   1 84286202,8428620,9465170,PThread Spinlock (private)
  1   1 84852768,8485277,10199315,PThread Spinlock (shared)
1 1     328755427,32875543,109428,GCC Spinlock (atomic)
1 1     323336013,32333601,512882,GCC Spinlock (sync)
1 1     430760330,43076033,8492351,Lock-Free
1 1     164467303,16446730,91874,PThread Mutex
1 1     491495004,49149500,1599093,PThread Spinlock (private)
1 1     490755265,49075526,3211918,PThread Spinlock (shared)
1 1 1 1 64488424,3224421,617048,GCC Spinlock (atomic)
1 1 1 1 64559495,3227975,1185225,GCC Spinlock (sync)
1 1 1 1 71972901,3598645,65801,Lock-Free
1 1 1 1 67026960,3351348,88910,PThread Mutex
1 1 1 1 69056988,3452849,763551,PThread Spinlock (private)
1 1 1 1 68906838,3445342,739913,PThread Spinlock (shared)

There are variations on the order of 10% for the non-locking benchmarks compared to the previous post. I think what’s happening is thermal throttling in my laptop. It’s not a stable benchmarking platform. I need to rate-limit the CPU to something sustainable and then benchmark.

(This is on a Core i5).

First light

These results are with DWCAS backoff of 16 – it may be (may well be) this is a non-optimal choice and is depressing performance.

I’m about to add code to iterate over a range of backoff values.

   M   
   N   
   S   
  L3U  
L2U L2U
L1D L1D
L1I L1I
 P   P 
L L L L total ops,mean ops/sec/core,std dev/sec/core,lock type
  1     684885201,136977040,0,GCC Spinlock (atomic)
  1     685716031,137143206,0,GCC Spinlock (sync)
  1     578266689,115653338,0,Lock-Free
  1     417751334,83550267,0,PThread Mutex
  1     787129343,157425869,0,PThread Spinlock (private)
  1     788219432,157643886,0,PThread Spinlock (shared)
  1   1 71237166,7123717,52944,GCC Spinlock (atomic)
  1   1 67975908,6797591,1416,GCC Spinlock (sync)
  1   1 122725603,12272560,2365656,Lock-Free
  1   1 62373311,6237331,671149,PThread Mutex
  1   1 83235152,8323515,9885612,PThread Spinlock (private)
  1   1 84115031,8411503,9632639,PThread Spinlock (shared)
1 1     328113786,32811379,77860,GCC Spinlock (atomic)
1 1     323700377,32370038,522225,GCC Spinlock (sync)
1 1     430896466,43089647,9112679,Lock-Free
1 1     167093927,16709393,19111,PThread Mutex
1 1     491238748,49123875,1935729,PThread Spinlock (private)
1 1     491291801,49129180,1430492,PThread Spinlock (shared)
1 1 1 1 62177115,3108856,8552,GCC Spinlock (atomic)
1 1 1 1 59761702,2988085,33801,GCC Spinlock (sync)
1 1 1 1 71814743,3590737,65118,Lock-Free
1 1 1 1 67450383,3372519,99140,PThread Mutex
1 1 1 1 69443374,3472169,773448,PThread Spinlock (private)
1 1 1 1 79839760,3991988,377803,PThread Spinlock (shared)

So with one thread per physical core (i.e. not hyperthreading), liblfds is WAAAAY out in front. However, when we max out – one thread per logical core – the pthread spinlock wins. What’s amazing and unexpected is that the *shared* pthread spinlock is substantially faster than the private spinlock!

I’m about to examine the source code for the pthread spinlocks. I guess they’re doing something extra which makes them more efficient – and also they only have to CAS, whereas the freelist is using DWCAS. However, I think the main issue here is the DWCAS backoff in the freelist – when we go to one thread per logical core, there’s a lot more contention, and the backoff is being invoked much more often (I think!)