Performing tuning

I’ve been spending some time poring over the M&S queue.

If I look at the benchmark for the freelist, it goes ballistic on one thread on each of two cores.

The queue just *doesn’t*. It goes at the same speed at the normal locking stuff.

It’s driving me nuts – I need to figure out why, so I can know it’s real and not just me messing up the implementation.

Relating to this, but it didn’t make a difference, I found out today by default on Intel when a cache line is fetched, the following cache lins is also fetched. I tried doubling the atomic isolation aligned to be two times the cache line width, but it made no difference to anything except the btree, which slowed down by about 25%.

One thing I noticed was that I had some store barriers in the dequeue/enqueue code which were absolutely unnecessary – they were being used on counter fields, which would be pushed out anyway by a following DWCAS.

I noticed also I had an inefficient store barrier in the freelist/stack push – I was causing two fields to get pushed by a DWCAS, when in fact one of them was going to be pushed out anyway in the DWCAS. I reordered them, and the freelist/stack sped up by about 10% in the single-thread case (which is the clearest measure of the effect of such a change).

However, the change I made to the queue, where I removed the unnecessary stores, has made it about 50% faster with one thread on each of two physical cores, but noticably SLOWER when one thread on each of four logical cores (i.e. two threads on each of two physical cores) – and this I do *not* understand.

Update

Last test to write on the new queue.

I’ve had a bit of rename. Everything used to be alphabetic, i.e.

list_addonly_ordered_singlylinked

However, thinking of Orwell’s comment “ignore all of above” I’ve decided to be consistent in ordering within each data structure type. I’ve also needed to rename lfds710_queue, because it needs to be clear it’s MPMC.

I had a bit of an insight earlier, in the shower.

I think I can modify the M&S queue to be bounded, by giving it an array of pointers to elements. You can’t use an actual array directly, because it constrains the order in which dequeues must occur for other threads to be able to progress. So it has to be pointers.

However, I’m not going to do that now – I’m moving to release now.

I may do the benchmark for the new queue, but that’s the only extra dev work I’ll do.

In other news, this is my first year with two Easters. Late March in Malta, late April in Athens!

Ringbuffer disappointment

The ringbuffer uses the M&S queue and so needs a dummy element.

This disfigures the API.

I implemented Dmitry Vyukov’s bounded MPMC queue, to use that instead.

However, having implemented and now coming to adjust the ringbuffer, I realise it’s no use.

The problem is that it’s a soft queue and also that it runs from an array. By soft I mean that although all enqueue and dequeue operations occur, they are not necessarily visible to all other cores by the time the enqueue/dequeue function returns.

If we imagine a queue as a ringbuffer, then what it means is that if we try to enqueue and we cannot, because the queue is full, we dequeue, and then enqueue – throwing away one element.

The problem is that when we come to dequeue, someone else may dequeue just before us; our dequeue will succeed but if the other thread is first, then until that other thread’s dequeue becomes visible to us, we can’t continue, because we need the *next element in the array*, the one just dequeued, to become visible to us as dequeued – we can’t enqueue until *it* is available, regardless of the success of our own dequeue.

What’s more, we can’t just say “well, we did a dequeue, so we’ll get something sooner or later” because other threads could be genuinely enqueuing more stuff. So we can’t *know* whether we need to re-issue the dequeue.

The basic issue is that the queue is blocked by any one thread performing a dequeue where that thread is slow (to be clear, though, the queue was never meant to be otherwise – it’s just I’m really appreciating the impact of this now).

It can be solved by forcing a store at the end of the dequeue, so everyone has visibility. This halves the speed of the data structure (currently one CAS for enqueue and one for dequeue). Also, we still are exposed to the problem of a thread performing a dequeue being idled by the OS when it’s halfway – it’s done the first CAS, so the dequeue has occurred, bu it’s not yet made the element usable (adjusting the sequence number). During this period, an idled thread blocks the data structure.

anti-aging news

About seven months ago a woman in the US, a normal woman, not ill, flew to Colombia – to evade the restrictions placed upon her freedom by the FDA – and underwent a gene therapy treatment to help address two of the declines which come with age, loss or muscle mass and telomere shortening.

This news finally hit slashdot, and so I finally read about it – I’m right on the cusp of getting strongly involved in this field, so I’m not yet well read.

The comments on slashdot reflect in part the staggering lack of imagination and abudance of banality which humans can possess.

This technology is on a par with the invention of fire, and there are a non-negligible number of respondants along the lines of “death is death, retards”. I can only hope for timely automobile collisions 😉

One of my thoughts with regard to a longer or open-ended lifespan is that it will give people time to mature and become wiser. Some people become wise and considerate in twenty years; others need forty, others still sixty or eight. If we can live for hundreds, or thousands, or hundreds of thousands of years, perhaps we might all become wise enough to have a decent planet – and more than one planet – to live on.

Christ, finally

All the benchmarks are running again.

Adding NUMA support took SOOO long, OMG I don’t even etc.

(So the benchmarks can now run SMP, NUMA or NUMA unused, where the benchmark admin data is NUMAed but the data which is actually liblfds data is only one the NUMA node which has the largest number of logical cores in the current thread set).

Need to fix up the gnuplots now, add a benchmark duration argument, some other tidying up.

Then I can make the docs, run the tests, then publish 7.1.0 – and then I can have snazzy glittery gnuplots on the home page!

Update

Test is finally back on its feet after rewriting to not use the standard library – all passes on release, valgrind clean on debug.

Unfortunately, it’s not QUITE free of the standard library…

One test uses qsort and bsearch, and a lot of tests use time(), to ensure they run for a reasonable period.

I need to rewrite those tests to run for enough iterations, so time isn’t used.

However, back to the benchmark now.

Update

Still working away on the benchmark app.

Added NUMA support.

This means if you have no NUMA, or one NUMA node, you get an SMP benchmark.

If you have two or more NUMA nodes, the benchmark allocates NUMA aware – so for example, the btree benchmark, which has 1024 elements per thread, allocates from the NUMA node of the thread.

There is however a second NUMA mode, for comparison purposes, where NUMA is not used. This is different to SMP or a single NUMA node sysem, however. Benchmarks themselves sometimes have significant admin work to do in the backround. As such, for any given iteration of the benchmark there might be say 20 memory accesses, but only 5 are actually by liblfds. As such, if we were to compare NUMA with non-NUMA simply by in the latter case putting all allocs in the same NUMA, we would not be comparing apples with apples – we would be seeing a MUCH larger decrease in performance, but where most of it is due to all the work the benchmark itself is doing.

As such, when testing non-NUMA, we need to ensure ONLY the memory accesses performed by liblfds are non-NUMA, so we need to ensure all the admin work is going to the local NUMA node. So that mode exists as well.

I’m thinking once this is done of quickly knocking out a bounded MCMP queue, so I can simplify the ringbuffer API and improve its performance.