Wrote the simplest possible assembly file. Figured out how to use ml64.exe. Fixed up the makefiles. Wrote some timing code, to try to figure out how many NOPs per CAS and per DCAS.
The approach is to time how long it takes to perform n iterations.
I also want to time an empty loop, so I can subtract its time from the CAS and DCAS results and then divide what is left by the time taken for the NOP loop. That gives the number of NOP loop iterations equivalent to one CAS/DCAS.
(CAS and DCAS occur in the code as singletons, but the backoff will run a for() loop of NOP calls, just as this timing code does, so I want to subtract the loop time from CAS/DCAS but not from NOP.)
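As a rough sketch of that arithmetic (not the actual timing code): the function name `nops_per_op` and its parameters are made up for illustration, and it assumes all three loops were timed over the same number of iterations.

```c
/* Hypothetical helper: estimate how many NOP-loop iterations cost as much
   as one CAS or DCAS.

   op_loop_time    - time for n iterations of the CAS (or DCAS) loop
   empty_loop_time - time for n iterations of the empty loop
   nop_loop_time   - time for n iterations of the NOP loop

   The empty loop time is subtracted from the CAS/DCAS time only. The NOP
   time keeps its loop overhead, because the backoff will also run its NOPs
   inside a for() loop. */
double nops_per_op(double op_loop_time, double empty_loop_time,
                   double nop_loop_time)
{
    if (nop_loop_time <= 0.0) {
        return 0.0; /* no usable NOP measurement, so no estimate */
    }
    return (op_loop_time - empty_loop_time) / nop_loop_time;
}
```

With placeholder timings of 4.0 for the CAS loop, 1.0 for the empty loop and 1.0 for the NOP loop, `nops_per_op(4.0, 1.0, 1.0)` returns 3.0. Those numbers are purely illustrative, not measurements.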
There is a complication: CAS and DCAS are inlined, so the empty loop should in fact do nothing (not even call an empty function), but the optimizer then makes that loop go away. I can fix that by compiling the file with optimization off, but I tried it once and it doesn’t make much difference, so I’m not worrying about it right now.
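For reference, another way to keep an empty loop alive with optimization on is to make the loop counter volatile. This is a sketch of that alternative, not what the timing code here does: `spin_empty` and `time_empty_loop` are hypothetical names, and the volatile counter forces a memory access on every iteration, so the loop is likely to cost more than a plain register-counted one.

```c
#include <time.h>

/* Empty loop with a volatile counter. Every read and write of a volatile
   object counts as observable behaviour, so the compiler has to perform
   each increment and comparison even when optimizing.
   Returns the final counter value, which equals n. */
unsigned long spin_empty(unsigned long n)
{
    volatile unsigned long i;
    for (i = 0; i < n; i++) {
        /* intentionally empty */
    }
    return i;
}

/* Time n iterations of the empty loop, in seconds as reported by clock().
   clock() is coarse, so n needs to be large for a meaningful figure. */
double time_empty_loop(unsigned long n)
{
    clock_t start = clock();
    (void)spin_empty(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```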
And the results… …man, what a PITA.
I mean, I knew it was going to be pretty fuzzy, trying to time an instruction. But what I’ve found is that the order of the loops has more effect than the NOP operation.
The first loop (empty or NOP) is slow. The second loop is fast. So much so that when the first loop is empty and the second is NOP, the NOP loop – which is identical in every way to the first loop except for the addition of the NOP – does MORE operations than the empty loop. The empty loop takes about 10% longer than the NOP loop.
If I reverse the order of the loops, I reverse the results! NOP becomes slower than empty.
And, with the original loop order, if I then do a third loop, repeating the NOP code, that loop is slow and NOP now actually does fewer operations than the empty loop.
Those results on my machine are highly consistent and vary by maybe 1% per run.
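To make the ordering effect easy to reproduce, the loops can be timed back to back in whatever order is passed in. This is only a sketch under assumptions: `nop_stub` is an empty C function standing in for the assembly NOP function, it is called through a volatile pointer so the compiler cannot drop the call, and the loop counters are volatile so the loops survive optimization. None of these names come from the real code.

```c
#include <stddef.h>
#include <time.h>

typedef void (*loop_fn)(unsigned long n);

/* Hypothetical stand-in for the assembly NOP function. */
static void nop_stub(void) {}

/* Calling through a volatile pointer stops the compiler from inlining
   or removing the call. */
static void (*volatile nop_call)(void) = nop_stub;

/* n iterations doing nothing. */
void empty_loop(unsigned long n)
{
    volatile unsigned long i;
    for (i = 0; i < n; i++) {
    }
}

/* n iterations, each calling the NOP stand-in. */
void nop_loop(unsigned long n)
{
    volatile unsigned long i;
    for (i = 0; i < n; i++) {
        nop_call();
    }
}

/* Run each loop in the order given and store the seconds each one took,
   as reported by clock(), in seconds[0..count-1]. */
void time_in_order(const loop_fn *loops, size_t count, unsigned long n,
                   double *seconds)
{
    size_t k;
    for (k = 0; k < count; k++) {
        clock_t start = clock();
        loops[k](n);
        seconds[k] = (double)(clock() - start) / CLOCKS_PER_SEC;
    }
}
```

Passing `{ empty_loop, nop_loop, nop_loop }` mirrors the original order plus the repeated third loop, and `{ nop_loop, empty_loop }` gives the reversed order. Comparing the arrays that come back shows whether position or loop body dominates the timings.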
Of course, I can just blunder ahead: subtract the empty loop time from the CAS/DCAS time and divide by the NOP time. That gives me about 3 NOPs per CAS and 5 per DCAS. (Remember, the NOP time includes incrementing the for() loop as well as the call to the NOP function, and the CAS/DCAS code is inlined. Well, it’s supposed to be; I’ve had a report it’s not! I think I may need to check that now…)
So, I can make some attempt at validating these results by running the benchmark with the measured CAS/DCAS values, then varying the delays and seeing whether that improves the benchmark results. If it does, something is amiss.
I will also run the code on a Linux x64 machine and see what numbers I find there. They should be broadly similar.