Back on the case (again)…
Too much work, see.
So, I think I have a half-formed conclusion for Intel backoff. There are two modes: backoff where all the threads accessing the data structure are on the same physical core (and so share the CAS cache-line locking), and backoff where the threads are spread across multiple physical cores. (And of course no backoff – the third mode.)
When all threads are on the same physical core, the atomic backoff delay is the time for one memory read from cache plus the CAS itself. When they are on multiple cores, it is the time for one read from main memory plus the CAS itself.
This handles x86/x64-style CAS cache-line locking. I now need to consider LL/SC-style locking (i.e. everyone else).
Also, there remains the basic question of whether this is the right way to go at all. The problem is that we benchmark at library init. What if init happens to run during a busy period? That seems to me to be a fatal flaw…
Perhaps we can do some kind of dynamic benchmarking? By that I mean: derive the figure we use for the atomic backoff period from our recent history. That way we would self-correct.