Performing tuning

I’ve been spending some time poring over the M&S queue.

If I look at the benchmark for the freelist, it goes ballistic with one thread on each of two cores.

The queue just *doesn’t*. It goes at the same speed as the normal locking stuff.

It’s driving me nuts – I need to figure out why, so I can know it’s real and not just me messing up the implementation.

Related to this, though in the end it made no difference: I found out today that by default on Intel, when a cache line is fetched, the following cache line is also fetched (the adjacent cache line prefetch). I tried doubling the atomic isolation alignment to two times the cache line width, but it made no difference to anything except the btree, which slowed down by about 25%.
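
For the record, the alignment change was roughly this shape. A minimal sketch, using GCC's aligned attribute, with made-up type and field names rather than whatever the real structures are called:

    /* sketch only : made-up names, GCC syntax
       the idea is to align (and so pad) each atomically-operated-on
       structure to TWO cache lines, so the adjacent line prefetch can
       never pull two of them into the same 128 byte pair
    */

    #define CACHE_LINE_LENGTH_IN_BYTES  64
    #define ATOMIC_ISOLATION_IN_BYTES   (2 * CACHE_LINE_LENGTH_IN_BYTES)

    struct pointer_and_counter
    {
      void               *pointer;
      unsigned long long  counter;    /* ABA counter, swapped by DWCAS */
    } __attribute__( (aligned(ATOMIC_ISOLATION_IN_BYTES)) );

    /* GCC rounds the structure size up to its alignment,
       so the padding out to 128 bytes comes for free
    */
    _Static_assert( sizeof(struct pointer_and_counter) == ATOMIC_ISOLATION_IN_BYTES,
                    "isolation padding missing" );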

One thing I noticed was that I had some store barriers in the dequeue/enqueue code which were absolutely unnecessary – they were being used on counter fields, which would be pushed out anyway by a following DWCAS.
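
The pattern was roughly this. A sketch only, with illustrative names and GNU C built-ins rather than the real enqueue code:

    #include <stddef.h>
    #include <stdbool.h>

    struct pointer_and_counter
    {
      void               *pointer;
      unsigned long long  counter;
    } __attribute__( (aligned(16)) );    /* DWCAS wants 16 byte alignment */

    struct queue_element
    {
      struct pointer_and_counter  next;
      void                       *user_data;
    };

    bool link_new_element( struct pointer_and_counter *tail_next,
                           struct pointer_and_counter expected,
                           struct queue_element *element )
    {
      struct pointer_and_counter desired;

      /* plain stores, preparing the new element */
      element->next.pointer = NULL;
      element->next.counter = 0;

      /* __atomic_thread_fence( __ATOMIC_RELEASE );
         ^ this is the kind of barrier I removed : the DWCAS below is
           itself a full barrier, so it pushes these stores out anyway
      */

      desired.pointer = element;
      desired.counter = expected.counter + 1;

      /* 16 byte compare-and-swap (cmpxchg16b on x86-64, build with -mcx16) */
      return __atomic_compare_exchange( tail_next, &expected, &desired,
                                        false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE );
    }

The release semantics of the compare-and-swap are what guarantee the element's fields are visible before the element itself becomes reachable, so the separate barrier buys nothing.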

I also noticed I had an inefficient store barrier in the freelist/stack push – the barrier was pushing out two fields, when in fact one of them was going to be pushed out anyway by the DWCAS. I reordered the stores, and the freelist/stack sped up by about 10% in the single-thread case (which is the clearest measure of the effect of such a change).
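
Again, a sketch rather than the real code, just to show the shape of the reorder (made-up names, GNU C built-ins):

    struct freelist_element
    {
      struct freelist_element *next;
      void                    *user_data;
    };

    /* before : the explicit barrier has to drain both stores */
    void prepare_push_before( struct freelist_element *element,
                              struct freelist_element *current_top,
                              void *user_data )
    {
      element->user_data = user_data;
      element->next      = current_top;
      __atomic_thread_fence( __ATOMIC_RELEASE );
      /* ...DWCAS of the freelist top {pointer, counter} follows... */
    }

    /* after : the barrier drains one store, the DWCAS covers the other */
    void prepare_push_after( struct freelist_element *element,
                             struct freelist_element *current_top,
                             void *user_data )
    {
      element->user_data = user_data;
      __atomic_thread_fence( __ATOMIC_RELEASE );
      element->next      = current_top;
      /* ...the DWCAS which follows is a full barrier,
         so it pushes the ->next store out for us...   */
    }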

However, the change I made to the queue, where I removed the unnecessary store barriers, has made it about 50% faster with one thread on each of two physical cores, but noticeably SLOWER with one thread on each of four logical cores (i.e. two threads on each of two physical cores) – and this I do *not* understand.