Implementing RCU… :-)
Monthly Archives: February 2011
Here it comes!
RCU design
I’ve written out my RCU design. It’s obviously not just the product of last night – I was thinking about it a lot before the benchmarking stuff came out. I’ve posted it to comp.programming.threads – want to see what people say / think.
It’s really remarkably simple. If the design is sound, it’ll take a day to do. Of course, then you have to test the thing :-)
Test update done
Tests lookin’ more interestin’
Test programme now uses the system architecture info to run itself over interesting thread sets. TBH, I was little bit in two minds about this, because now it is necessary for developers to implement something at least in the abstraction_system_info() call – whereas before, they just had to indicate the number of logical processors. Most people don’t need to implement the abstraction layer necessary for benchmarking, but most people do need to implement for the test programme – and abstraction_system_info() requires a bit more thought than just returning the number of logical processors.
Test Iteration 01 ================= Freelist Tests ============== M N S L3U L2U L2U L1D L1D L1I L1I P P L L L L 0 0 0 1 Popping...passed 0 0 0 1 Pushing...passed 0 0 0 1 Popping and pushing (10 seconds)...passed 0 0 0 1 Rapid popping and pushing (10 seconds)...passed 0 1 0 1 Popping...passed 0 1 0 1 Pushing...passed 0 1 0 1 Popping and pushing (10 seconds)...passed 0 1 0 1 Rapid popping and pushing (10 seconds)...passed 0 1 0 1 Popping...passed 0 1 0 1 Pushing...passed 0 1 0 1 Popping and pushing (10 seconds)...passed 0 1 0 1 Rapid popping and pushing (10 seconds)...passed 1 1 1 1 Popping...passed 1 1 1 1 Pushing...passed 1 1 1 1 Popping and pushing (10 seconds)...passed 1 1 1 1 Rapid popping and pushing (10 seconds)...passed 0 0 1 1 Popping...passed 0 0 1 1 Pushing...passed 0 0 1 1 Popping and pushing (10 seconds)...passed 0 0 1 1 Rapid popping and pushing (10 seconds)...passed 1 1 1 1 Popping...passed 1 1 1 1 Pushing...passed 1 1 1 1 Popping and pushing (10 seconds)...passed 1 1 1 1 Rapid popping and pushing (10 seconds)...passed
Full benchmark results
Release 7 Lock-Free Freelist Benchmark #1 M N S L3U L2U L2U L1D L1D L1I L1I P P L L L L total ops,mean ops/sec per thread,standard deviation,scalability 0 0 0 1 310134488,31013449,0,1.00 0 1 0 1 136313300,6815665,38365,0.22 0 1 0 1 136401284,6820064,50706,0.22 1 1 1 1 111134328,2778358,23851,0.09 0 0 1 1 334747444,16737372,2421,0.54 1 1 1 1 111105898,2777647,40399,0.09 Release 7 Mutex Freelist Benchmark #1 M N S L3U L2U L2U L1D L1D L1I L1I P P L L L L total ops,mean ops/sec per thread,standard deviation,scalability 0 0 0 1 22834766,2283477,0,1.00 0 1 0 1 1570890,78545,597,0.03 0 1 0 1 1559400,77970,483,0.03 1 1 1 1 1630422,40761,11,0.02 0 0 1 1 1456462,72823,515,0.03 1 1 1 1 1596886,39922,5,0.02 Release 7 Lock-Free Queue Benchmark #1 M N S L3U L2U L2U L1D L1D L1I L1I P P L L L L total ops,mean ops/sec per thread,standard deviation,scalability 0 0 0 1 151265732,15126573,0,1.00 0 1 0 1 92171864,4608593,62491,0.30 0 1 0 1 92083732,4604187,25980,0.30 1 1 1 1 116592852,2914821,144423,0.19 0 0 1 1 199383750,9969188,4519,0.66 1 1 1 1 116831770,2920794,142648,0.19 Release 7 Mutex Queue Benchmark #1 M N S L3U L2U L2U L1D L1D L1I L1I P P L L L L total ops,mean ops/sec per thread,standard deviation,scalability 0 0 0 1 21378718,2137872,0,1.00 0 1 0 1 10578876,528944,1500,0.25 0 1 0 1 11732340,586617,673,0.27 1 1 1 1 3391838,84796,26,0.04 0 0 1 1 4512730,225637,818,0.11 1 1 1 1 3381934,84548,99,0.04 Release 7 Lock-Free Ringbuffer Benchmark #1 M N S L3U L2U L2U L1D L1D L1I L1I P P L L L L total ops,mean ops/sec per thread,standard deviation,scalability 0 0 0 1 113368780,11336878,0,1.00 0 1 0 1 68645920,3432296,40189,0.30 0 1 0 1 69198196,3459910,18016,0.31 1 1 1 1 93855158,2346379,128596,0.21 0 0 1 1 143133368,7156668,535,0.63 1 1 1 1 93840968,2346024,123914,0.21 Release 7 Lock-Free Stack Benchmark #1 M N S L3U L2U L2U L1D L1D L1I L1I P P L L L L total ops,mean ops/sec per thread,standard deviation,scalability 0 0 0 1 180235996,18023600,0,1.00 0 1 0 1 82136974,4106849,16235,0.23 0 1 0 1 82048308,4102415,5581,0.23 1 1 1 1 101999746,2549994,174875,0.14 0 0 1 1 230598452,11529923,18709,0.64 1 1 1 1 101977610,2549440,171516,0.14
I’m about to change the freelist allocator so it allocates a single block, rather than one block per element. That will reduce the load on the TLBs. I’m not sure the current freelist benchmark will show this though, since it doesn;’t iterate up and down over the freelist. Might be time to add a second benchmark :-)
Benchmarking done
The benchmarking programme is now back together again.
The new code needs reviewing and rewriting, because it was experimental – but I’m thinking of changing the test programme so it executes every test on the thread sets, because a system running one thread per logical processor is going to be maxed out, which maximizes CAS collisions. That’s useful for detecting many types of lock-free bug, but not for all – I think there is value for example in a two thread test, since it will more heavily exercise other sections of the code (rather than the CAS contention loops).
Furthermore, depending on the system architecture, some combinations of logical processors will be more interesting than others. Running two threads on the same physical core is interesting, but in a different way to two threads on seperate physical cores.
The test application is only used when implementing a port or for confidence building, so I think there’s no harm in it taking a longer time to run. Currently, it takes a minute or two. With the three thread sets, it would take maybe ten times longer on a hyperthreaded dual core system – but of course, the bigger your system, the longer it would take.
Comparative benchmarking dilemma
So, here’s a question.
I’m working on the benchmark programme. I’m adding the mutex based queue, one of the alternative lock-type benchmarks for comparative purposes.
The issue is that the queue uses a freelist. In the mutex based queue, do I have it use a mutex based freelist, or do I have it use a lock-free based freelist?
The former will isolate the impact of lock-free on the queue; the latter will provide an accurate picture of whole-system performance when using mutexes rather than lock-free.
It’s so beautiful!
Release 7 Lock-Free Freelist Benchmark #1 M N S L3U L2U L2U L1D L1D L1I L1I P P L L L L total ops,mean ops/sec per thread,standard deviation,scalability 0 0 0 1 322909298,32290930,0,1.00 0 1 0 1 138967842,6948392,90039,0.22 0 1 0 1 137600574,6880029,51975,0.21 1 1 1 1 112837088,2820927,141452,0.09 0 0 1 1 337294400,16864720,111676,0.52 1 1 1 1 113630322,2840758,375358,0.09 Release 7 Mutex Freelist Benchmark #1 M N S L3U L2U L2U L1D L1D L1I L1I P P L L L L total ops,mean ops/sec per thread,standard deviation,scalability 0 0 0 1 22279952,2227995,0,1.00 0 1 0 1 3663262,183163,1549,0.08 0 1 0 1 3651956,182598,2308,0.08 1 1 1 1 1493764,37344,195,0.02 0 0 1 1 1570466,78523,2285,0.04 1 1 1 1 1469258,36731,165,0.02
Release 7 Lock-Free Freelist Benchmark #1
M
N
S
L2U L2U
L1D L1D L1D L1D
L1I L1I L1I L1I
P P P P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
0 0 0 1 176535348,17653535,0,1.00
0 0 1 1 183097082,9154854,79216,0.52
0 1 1 1 57109090,1903636,193434,0.11
1 1 1 1 39186474,979662,144424,0.06
1 1 1 1 39274834,981871,153069,0.06
0 0 0 1 197659570,19765957,0,1.12
0 0 1 1 185598534,9279927,84064,0.53
0 1 1 1 62994198,2099807,181144,0.12
1 1 1 1 39214612,980365,142972,0.06
Release 7 Mutex Freelist Benchmark #1
M
N
S
L2U L2U
L1D L1D L1D L1D
L1I L1I L1I L1I
P P P P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
0 0 0 1 11706514,1170651,0,1.00
0 0 1 1 4485136,224257,1132,0.19
0 1 1 1 890686,29690,1089,0.03
1 1 1 1 1046698,26167,125,0.02
1 1 1 1 1045442,26136,116,0.02
0 0 0 1 11767104,1176710,0,1.01
0 0 1 1 4416886,220844,1263,0.19
0 1 1 1 881374,29379,634,0.03
1 1 1 1 1040874,26022,117,0.02
You can see the three different thread set generation algorithms. With these narrow system architectures, they produce fairly similar results or overlapping results, but on say SPARC systems, where you see four or more physical cores with four or more logical cores per CPU, that they produce orthagonal sets becomes much more apparent.
Note here that the mutex based freelist hasn’t yet been upgraded with run-time cache-line alignment.
If only I’d known…
Turns out iterating over memory alone will always generate a full set of all iterations over lower level NUMA nodes/caches, certainly for the three thread set algorithms I have in mind.
If I’d realised that at the beginning, it would have saved a lot of work!
Actually, that’s not quite true. You do need to see the full component tree to understand the work being done by the system for a given thread set.
Benchmarking
Fiiiiinally. Non-stop over the weekend, plus another couple of hours today.
So, this is the output from the implementation of one of the three thread set generating algorithms.
Some explanation is required.
First, let’s look at a typical system diagram.
M
N
L3U
S S
L2U L2U
L1D L1D L1D L1D
L1I L1I L1I L1I
P P P P
L L L L L L L L
So, the letters stand for;
M = Memory N = NUMA node S = Socket L2U = Level 2 cache, unified L1D = Level 1 cache, data L1I = Level 1 cache, instruction P = Physical core L = Logical core
So we see each physical core has two logical cores (e.g. hyperthreading) and each socket has a two core CPU package.
Now, we don’t care about physical cores or sockets; all that matters is the arrangement of caches with respect to logical processors. Imagine for example we had one physical processor which had two L1I caches – each of which was used by a different logical processor, e.g.;
P L1I L1I L L
rather than the more usual physical processor with a single L1U which is shared by logical processors, e.g.;
L1I P L L
Now, how we go about testing these depends purely on the arrangement of caches – it doesn’t change a thing where the physical processor is.
The way we approach testing is to iterate, starting at the lowest level cache and then, per iteration, moving up one level, where a level is a cache of any type, or a NUMA node, or memory. (Memory is always last, and we assume there is only one memory).
So in our original diagram, we start at L1I, then move up to L1D, then L2U, then L3U, then our single NUMA node and finally, our final and single memory node.
We do this on the left-hand side of the tree, e.g. we start at the bottom left and work our ways upwards, following the left side of the tree.
For each iteration, we scan the system tree below us, counting the number of lowest-level caches we can see.
For this particular thread set generation algorithm, we then loop from 1 to equal to or less than number of lowest level caches; and on each iteration of that loop, we create a thread set, where the thread set is all the logical processors from 0 to the value of loop.
That doesn’t explain a thing, so here’s a nice picture :-)
M
N
L3U
S S
L2U L2U
L1D L1D L1D L1D
L1I L1I L1I L1I
P P P P
L L L L L L L L
1 1 0 0 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 1
These are the four thread sets generation by the algorithm when it operates on the memory element.
Now the purpose becomes clear – to bring into the test, one by one, full sets of lowest-level caches worth of logical processors.
This is done for each level of the tree. So if we look at the result at L2U, we get this;
M
N
L3U
S S
L2U L2U
L1D L1D L1D L1D
L1I L1I L1I L1I
P P P P
L L L L L L L L
1 1 0 0 0 0 0 0
1 1 1 1 0 0 0 0
Here’s the output from my laptop;
M N S L3U L2U L2U L1D L1D L1I L1I P P L L L L 1 1 0 0 (L1I) 1 1 0 0 (L1D) 1 1 0 0 (L2U) 1 1 0 0 1 1 1 1 (L3U) 1 1 0 0 1 1 1 1 (NUMA) 1 1 0 0 1 1 1 1 (Memory)
The bracketed notes I’ve added here so you can see which level each set group is for.
Here’s the output from my server, an old Intel quad core with two two-CPU dies bonded together (which is why you see one L2U per pair of physical cores);
M
N
S
L2U L2U
L1D L1D L1D L1D
L1I L1I L1I L1I
P P P P
L L L L
1 0 0 0 (L1I)
1 0 0 0 (L1D)
1 0 0 0
1 1 0 0 (L2U)
1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1 (NUMA)
1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1 (Memory)
Fantastic! now I need to implement the other two algorithms.
Obviously, the thread sets will be culled, removing duplicates. In fact, looking at the output, I wonder if it’s not going to be the case that the thread set from memory will always be the full set of all other levels.
Recent Comments