Monthly Archives: March 2011

New test

I’m going to write a test, identical to the current freelist test (pop/push in a loop), except each thread will have its own freelist.

The performance there will be the same as the normal test, except that and only that CAS/DCAS collisions will have been removed.

Being able to see how much CAS/DCAS costs I think could be useful. It will also reveal to me what the bottlenecks are without CAS/DCAS – memory access, almost certainly.

Random factor

I think I’ve realised why the random factor in the backoff is making almost no difference – it’s an artefact of the benchmark.

The benchmark is a simple loop, where each iteration pops and then pushes.

So say two threads collide. They both back off, one selects a wait of 0, the other selects a wait of 1. So the first goes and successfully swaps; the other waits for one DCAS time and then… then bumps right into the first thread again, who is of course by then doing another operation.

So it doesn’t matter how long you wait – someone else is ALWAYS doing an operation.

One thought then is that for the maximum delay for the current backoff iteration, you attempt your DCAS at your selected delay but then you need to keep waiting, doing nothing, until you reach the maximum backoff delay.

But this assumes between threads a certain synchronicity which may not be there – e.g. you may have say ten threads all of which experience a collision on their first swap attempt and succeed on the second, and have one very unfortunate thread who bumps into each and every one of them, taking his backoff iteration to ten. Well – there’s no point him waiting till his backoff time is complete – he won’t be avoiding colliding with anyone else, they were all at their first backoff iteration.

OTOH, I think on average as a system becomes more loaded threads in general will find their average backoff iteration increases – assuming, that is, that threads are equally loaded with regard to the number of CAS/DCAS operations they’re performing.

I need to make that second benchmark, where the threads are idle about 90% of the time. It’ll give the backoff something to back off into.

And this I think begins to explain why 500/500 appears to give better results – in fact what’s happening is we’re starving the threads who backoff; giving us a single thread who ends up running on his own a lot of the time. Memory access appears to be the bottleneck on performance, so multiple threads can only reduce performance because of collisions, so reducing the number of active threads appears to improve performance. In fact, you have at any one time one fast thread and a bunch of slow threads.

I’ll need to modify the benchmark code to print per-thread results… I thought about this before, but there wasn’t an obvious way to present the output without making the output very wide.

Actually, I’m not sure this will reveal the problem – it won’t be a particular thread which is fast all the way through the test – it’s whichever thread fails to lose the backoff who becomes fast, while the others are slow, so it averages out.

Well, I’ll think of something :-) I’m aware of the issue now.

And the answer is…

On my machine, 20/22 gives significantly better results than 4/5.

Depressingly, hard-setting the value to 500/500 gives even better results than 20/22.

Well – I guess if the results are wrong on my machine, they’re wrong – so 20/22 is unlikely to happen to be the correct value.

Unfortunately, it’s also still true on the Linux machine. 500/500 is quite a bit better than 20/22 (for example, scaling 0.49 vs 0.34, for two threads on the same core).

Now, the backoff algorithm calculates a max value, which is 2 to the power of the number of backoff iterations and then randomly chooses a value in that range, and waits for that number of delay operations, where the delay is the loop of n NOPs, where we’ve been playing here with n, setting it (for DCAS) to 5, or 22, or 500.

Curiously – and I suspect this is an important clue – removing the random factor and just using the maximum delay makes little difference.

Oh man…

The Linux results are very consistent, which is great – they give 19.5 NOP for CAS, 21.5 for DCAS.

So that’s would really be happy-joy-happy-joy…

…except it turns out the debug build runs *faster* for CAS.

The debug build gives 18.4 NOP for CAS, but a much larger 36.2 for DCAS.

Well – the results are highly consistent, which is very important. It doesn’t matter per se that debug is faster than release – all that matters is you have the timing for yourself; you don’t care about your build.

The Linux machine I’m working on is about 50% faster than my home machine, for a single thread running the freelist test. I get about 450,000,000 operations per second on the Linux machine, vs about 300,000,000 per second on my Windows 7 machine.

The Linux machine – Intel Xeon @ 2.8 GHz – thinks roughly 20 NOP for CAS, 22 for DCAS.

The Windows machine – Intel i5 at 2.4 GHz – thinks roughly 3 NOP for CAS, 5 for DCAS.

So what’s up wit dat den?

Next step is benchmarking with backoff using these values. If the 20/22 value performs better on my system than 3/5, then I’ll definitely know something is amiss.

NOP to CAS/DCAS results

So…

Wrote the simplest possible assembly file. Figured out how to use ml64.exe. Fixed up the makefiles. Wrote some timing code, to try to figure out how many NOPs per CAS and per DCAS.

The approach is to time how long it takes to perform n iterations.

I want to time an empty loop, so I can subtract that time from the CAS and DCAS results before dividing by the time taken for the NOP loop to figure out how many NOP loops are required for a CAS/DCAS.

(CAS and DCAS in the code occur as singletons; but the backoff will perform a for() loop for NOP calls, just as it does in this timing code, so I want to subtract the loop time from CAS/DCAS but not from NOP).

There is a complication in that CAS and DCAS are inlined, so the empty loop should in fact do nothing (not even call an empty function), but the optimizer then makes that loop go away. I can fix that by compiling the file with optimization off, but I tried it once and it doens’t make much difference, so I’m not worrying about it right now).

And the results… …man, what a PITA.

I mean, I knew it was going to be pretty fuzzy, trying to time an instruction. But what I’ve found is that the order of the loops has more effect than the NOP operation.

The first loop (empty or NOP) is slow. The second loop is fast. So much so that when the first loop is empty and the second is NOP, the NOP loop – which is identical in every way to the first loop except for the addition of the NOP – does MORE operations than the empty loop. The empty loop takes about 10% longer than the NOP loop.

If I reverse the order of the loops, I reverse the results! NOP becomes slower than empty.

And, with the original loop order, if I then do a third loop, repeating the NOP code, that loop is slow and NOP now actually does less operations than the empty loop.

Those results on my machine are highly consistent and vary by maybe 1% per run.

Of course, I can just blunder ahead – subtract the empty loop time from the CAS/DCAS time and divide by the NOP time. That gives me about 3 NOPs per CAS and 5 per DCAS (remember, that includes for incrementing for() loop as well as the call to the NOP function, and that the CAS/DCAS code is inlined – well, it’s supposed to be – I’ve had a report it’s not! I think I may need to check that now…)

So, I can make some attempt at validating these results, by running the benchmark with the given CAS/DCAS values and seeing the results, then varying the delays and seeing if they improve the results. If they do, something is amiss.

I will also run the code on a Linux x64 machine and see what numbers I find there. They should be broadly similar.

MS *facepalm*

Inline assembler is forbidden on 64 bit architectures.

I have to get hold of an assembler and compile a .s file into a .o file and link to it.

I checked the MS C compiler instrinics – there’s one called “__noop” and just for a second I thought YES! but it turns out it’s purpose is to make it so a function call is parsed by the compiler but produces no code (for debugging calls and the link, which you turn on and off with a #define).

Assembler time…

…this means the makefile now has to care about architecures…

…and I see that there are TWO MS assemblers, with different names, one for 32-bit platforms and one for 64-bit platforms. So I can’t even call the same frickin’ BINARY NAME…

What a huge stinking mess!

(It’s become worse. There are assembly samples on the MS site. File type is “.zip.exe”. I mean, why? WHY!? I can open a zip with a click. Why make me run a potentially infected exe?)

Timing instructions

So, what I’m playing with now is timing a NOP and a CAS, so I can see how many NOP I need for a single CAS delay.

So let’s say I time one million NOP and one million CAS, which lets me figure out the ratio of NOP to CAS.

But the time for a single instruction is very short, so I time the whole loop.

Problem is, what happens if the thread is pre-empted during the loop? that’ll screw the timing up something rotten!

One idea is to time a million operations but also ten thousand operations. The ratio of time taken should be about 10:1. If one of the two loops is pre-empted, the ratio will be crazy. If both are pre-empted, the ratio remains crazy – the ten to one ratio of the work done is overwhelmed by the time spent ready-to-run, so the ratio looks like 1:1.

I’m curious to see what the ratio is (I’m guessing about 300 NOP to a CAS – although of course CAS is quicker than DCAS, so it varies on that as well).

I then want to see how much difference it makes when you have optimal (or at least approximately optimal) values for delay granularity.

Single backoff delay period

Okay –

The ONLY way I can see to solve this is to benchmark it on the system during the library init.

It’s not just that a given system will be slower or faster – it’s also that different CPUs will take longer or slower for their NOP!

There’s just no way you can pick some value, say 500, and say – yeah, this is okay for ARM/SPARC/x64/etc/etc.

This in turn leads to the timing problem. Now I need a high resolution timer… the abstraction layer gets bigger. (This is easy under Windows, okay under Linux… and maybe a big no-no for some other platforms).

I think I’ll also offer an API to set the delay period manually, so people on systems without such timers can do some playing around with the benchmark code to come up with a reasonable value.

Exponential Backoff

Okay, so I’ve been thinking about and reading a bit (because there isn’t much to read!) about exponential backoff.

To cut lots of thought short, here’s the conclusions;

1. you need to decide your delay granularity – so for each backoff, you compute how many of these delays you will have – and the granularity should be the length of time required for one operation of the type you’re attempting to perform; and I don’t know how to find this out!

2. I’ll to begin with use a straightforward, classic exponential backoff, a la Ethernet – you pick a random number between 0 and a maximum value and that’s the number of delays you wait. The maximum value is two to the power of backoff iterations (e.g. 2, 4, 8, 16, etc).

3. I’ve added a new abstraction function, abstraction_nop(), where the function performs a single time-delay inducing no operation. (“NOP” on Intel, “mov r0,r0″ for ARM, etc).

So it’s all find, except I don’t know how to know the length of a single delay period.

I can – at least for the first release of backoff – code up the backoff in C. It seems good enough, certainly to be used, and it’ll be quicker than getting it all done in assembly for all the different CPUs.

What I’ve found with the freelist benchmark I’ve been experimenting with is that on a 64-bit Intel chip, scalability for a test which loops doing pop-push won’t rise above about 0.54. I think this is because the memory bus is saturated; one core by itself is already saturating the bus. Adding more cores (without backoff) results in horrific live-lock; adding with backoff results in the bus still being maximally used – and so two cores being the same speed as one. And we see then when we move to three, scalability is about 0.3 (a third) and four to 0.25 and so on.

I’m thinking of a new test, where 90% of the time the test executes a single backoff delay and 10% of the time a freelist operation. Of course, in a sense it’s a fabricated test – all you’re doing is spreading your memory bandwidth over so many more processors – but it’s an easy way to see the results in a more realistic scenario.

I did some experimenting with different numbers of NOPs per delay, and also whether or not the wait was randomly selecting a value from 0 to the maximum or just using the maximum. Turning random selection on or off made very little difference – which is puzzling. Changing the value of NOPs made some difference for small values. 64 gave 0.46 scalability with two logical processors, where 500 gave 0.54. But 100,000 still gave 0.53… which also puzzles me. It implies the memory bus is so fully used that even with huge delays between operations, the two cores still are fully saturating the bus. Which is entirely possible of course :-) but doesn’t help me pick the optimal value for when the bus is not saturated.

The big problem remains – how do I select a sensible delay period for a given platform? maybe it’s something you can have dynamically adapt – but you’re taking a hell of a risk there. What happens if you adaptation doesn’t always work? or I could set some reasonable value but offer an API to set a specific value – and perhaps have something in the test programme which computes a good value (which you use on an idle system, so you get a sane result).

Guess what…

I’m now learning about exponential backoff algorithms =-)