Update

Weird shit be going down.

So, we have the lock-free freelist.

Fine.

Performs like crap. The problem is contention. Threads keep running into each other.

backoff[0] = 371608
backoff[1] = 1798761
backoff[2] = 212273
backoff[3] = 401016
backoff[4] = 92713
backoff[5] = 265962
backoff[6] = 59579
backoff[7] = 96049
backoff[8] = 35300
backoff[9] = 58019

That list there is for four threads (two cores), on the benchmark. The index is how many times a given push operation collided before it managed to succeed. So 371,608 operations went through without colliding, then 1,798,761 had one collision, 212,273 had two, and so on.

So we add backoff. We look at index 0 and index 1, and aim for a ratio of 100:1, adjusting the backoff up or down accordingly.
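
In rough terms it works something like this; a sketch of the idea only, with made-up names and a made-up adjustment step, not the actual code.

/* collision_count[n] counts pushes which succeeded after n failed CAS attempts */

static unsigned long long int
  collision_count[10];

static unsigned long long int
  backoff_iterations = 1;

#define DESIRED_RATIO  100

static void backoff_autotune( void )
{
  if( collision_count[1] == 0 )
    return;

  if( collision_count[0] / collision_count[1] < DESIRED_RATIO )
    backoff_iterations *= 2;     /* too many collisions - back off harder */
  else
    if( backoff_iterations > 1 )
      backoff_iterations /= 2;   /* plenty of headroom - ease the backoff down */
}

/* after a failed CAS, a thread then busy-spins for a pseudo-random number of
   iterations, in the range 0 to backoff_iterations, before retrying */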

Woohoo! 5x improvement. The collision array looks like this now;

backoff[0] = 70565880
backoff[1] = 391154
backoff[2] = 7050
backoff[3] = 663
backoff[4] = 208
backoff[5] = 51
backoff[6] = 22
backoff[7] = 11
backoff[8] = 16
backoff[9] = 6

Notice first how much higher the numbers are (we’re getting a lot more done) and of course notice how heavily skewed we are now to index 0.

Great.

But we want to improve performance some more. The basic problem now is that every thread is trying to hit the freelist head pointer. So we add an elimination layer. Basically, it’s a set of cache lines which are filled with freelist element pointers. When we come to push, we randomly choose a line and then look for a free slot (NULL) and if we find one, we atomic exchange our element into that slot and we’re done; when popping, we randomly pick a line and search for a non-NULL and if we find one, we’re done.

This means we’re now distributing the freelist memory load over multiple cache lines, not just the one.

So what happens when we have this?

Well we get this;

backoff[0] = 3674707
backoff[1] = 334781
backoff[2] = 8646
backoff[3] = 3982
backoff[4] = 2058
backoff[5] = 1046
backoff[6] = 465
backoff[7] = 204
backoff[8] = 64
backoff[9] = 20

Index 0 has gone down by about a factor of 20. We’re REALLY offloading from the freelist…

…but what’s weird, really freakishly weird, is that index 1 is basically UNCHANGED.

And this is throwing the backoff autotuner off, because it’s trying to maintain that 100:1 ratio. So now it’s continually trying to raise the backoff.

*I do not understand why index 1 is unchanged*.

Index 2 is unchanged as well, but after that, we see big (5x, 10x) drops.

Freelist scaling

I added an elimination layer in front of the freelist.

Works like this; you have an array of freelist element pointers, cache line aligned, n cache lines’ worth in size. All these pointers start NULL.

When a thread comes to push, it randomly selects one of the cache lines and scans it from left to right. The first NULL it finds, it atomic exchanges in the pointer to the freelist element it wants to push. If the original value returned by the exchange is NULL, that’s it! the pusher can make a run for it. No need to hit the main freelist at all. If the original value of the exchange is actually a pointer, another pusher exchanged onto that slot before us; so we’ve got rid of *our* element, but only to get hold of someone else’s, so we’re not done – we keep scanning the cache line of pointers to try to find another NULL.

If we can’t find a NULL, then we push to the main freelist as normal.

For popping, when a thread comes to pop, it likewise randomly selects a line from the array and scans left to right, looking for non-NULL values. If it finds one, it tries to exchange it with NULL. If the original value returned is a pointer, then bingo! the thread has got hold of an available freelist element. If the original value is a NULL, another popper grabbed this element before us and so we keep scanning. If we can’t find a non-NULL, then we pop as normal from the main freelist.

So the upshot of all this is that we offload memory accesses from the single freelist head pointer, over a range of cache lines – as many or as few as we want – in the elimination layer.
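
To make that concrete, here’s a rough sketch of the push and pop paths into the elimination layer. This is my own illustrative code, not the library source; it uses C11 atomics rather than the library’s abstraction macros, and assumes a hypothetical prng() helper, 64-byte cache lines and 8-byte pointers.

#include <stdatomic.h>
#include <stddef.h>

#define NUMBER_OF_LINES    8
#define POINTERS_PER_LINE  8    /* 64 byte cache line / 8 byte pointer */

struct freelist_element;        /* the usual freelist element */

unsigned int prng( void );      /* hypothetical per-thread random number generator */

/* one row per cache line, all slots start NULL */
static _Alignas( 64 ) _Atomic( struct freelist_element * )
  elimination_array[NUMBER_OF_LINES][POINTERS_PER_LINE];

/* returns 1 if the element went into the elimination layer,
   0 if the caller must push to the main freelist as normal */
static int elimination_push( struct freelist_element *element )
{
  unsigned int
    line = prng() % NUMBER_OF_LINES,
    slot;

  for( slot = 0 ; slot < POINTERS_PER_LINE ; slot++ )
  {
    /* swap our element in; if the slot held NULL, we're done */
    element = atomic_exchange( &elimination_array[line][slot], element );

    if( element == NULL )
      return 1;

    /* otherwise we now hold someone else's element - keep scanning */
  }

  return 0;
}

/* returns an element, or NULL if the caller must pop from the main freelist */
static struct freelist_element *elimination_pop( void )
{
  unsigned int
    line = prng() % NUMBER_OF_LINES,
    slot;

  struct freelist_element
    *element;

  for( slot = 0 ; slot < POINTERS_PER_LINE ; slot++ )
  {
    element = atomic_exchange( &elimination_array[line][slot], NULL );

    if( element != NULL )
      return element;

    /* slot was empty, or another popper got there first - keep scanning */
  }

  return NULL;
}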

So I thought this would help a lot.

It’s so far a big fat non-event. It’s helped a bit, I dunno, maybe 10%, something like that? But I was expecting far, far more.

I’m trying to figure out why this is happening. One thought is the absence of backoff. Obviously we expect a lot of cache line contention between threads, but if there are enough cache lines in the elimination array compared to the number of threads, I’m thinking it shouldn’t be that much of a problem…

Experimentation continues.

Progress

Benchmark (and so topology) works on Win32 with NUMA.

Need to run it on a big Linux machine now which actually has more than one NUMA node.

GetLogicalProcessorInformation / GetLogicalProcessorInformationEx

I think I’ve figured it out.

VirtualBox is not at fault, if I’m right.

The original version of this function, GetLogicalProcessorInformation(), returns an array of SYSTEM_LOGICAL_PROCESSOR_INFORMATION structs. Each contains a union, but in C a union is sized to its largest member, so the size of the struct is fixed – and indeed, you know how many you have by dividing the size of the returned data by the size of the struct.
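
So usage looks something like this (a sketch, error handling omitted);

#include <windows.h>
#include <stdlib.h>

void show_old_api( void )
{
  DWORD
    length_in_bytes = 0,
    count,
    loop;

  SYSTEM_LOGICAL_PROCESSOR_INFORMATION
    *buffer;

  /* first call fails, but fills in the required buffer length */
  GetLogicalProcessorInformation( NULL, &length_in_bytes );

  buffer = malloc( length_in_bytes );

  GetLogicalProcessorInformation( buffer, &length_in_bytes );

  /* fixed-size structs, so it really is an array */
  count = length_in_bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);

  for( loop = 0 ; loop < count ; loop++ )
  {
    /* use buffer[loop].Relationship, buffer[loop].ProcessorMask, etc */
  }

  free( buffer );
}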

The later version, GetLogicalProcessorInformationEx(), also – according to the docs – returns an array, but now of SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX. The docs actually say “array”. However, this new struct has a new member – “Size”.

I now suspect it returns a *packed* “array” – not an array, because you can’t index into it – of these structs, and you move to the next one by casting your pointer to an unsigned char pointer and then adding the size of the current element to it.
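
Which, if I’m right, means the traversal has to look something like this (a sketch, error handling omitted);

#include <windows.h>
#include <stdlib.h>

void show_new_api( void )
{
  DWORD
    length_in_bytes = 0;

  char
    *buffer,
    *pointer;

  SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX
    *record;

  /* first call fails, but fills in the required buffer length */
  GetLogicalProcessorInformationEx( RelationAll, NULL, &length_in_bytes );

  buffer = malloc( length_in_bytes );

  GetLogicalProcessorInformationEx( RelationAll, (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *) buffer, &length_in_bytes );

  pointer = buffer;

  while( pointer < buffer + length_in_bytes )
  {
    record = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *) pointer;

    /* use record->Relationship, record->NumaNode, record->Cache, etc */

    /* records are variable length - no indexing, just step by each record's Size */
    pointer += record->Size;
  }

  free( buffer );
}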

THIS IS NOWHERE IN THE DOCUMENTATION. THERE IS NO EXAMPLE CODE. THE ONLY WAY YOU CAN KNOW IS TO ***GUESS*** BECAUSE THERE IS NOW A SIZE MEMBER IN THE STRUCT.

lols

So, I’m giving KVM a try.

I came across a page of KVM setup instructions so complex that I thought I was reading advanced hypermathematics in ancient Greek, but then later came across a different page of instructions which involved three commands – and which worked.

So I came to the point where I was to create a guest. I thought I’d try bringing over the existing VirtualBox Win7 (with SDK) guest, as an experiment – Windows in my experience hates having the machine change under it, and that might screw up the topology stuff by itself anyway, but I’d like to see KVM work in the first place, and it *might* work anyway, and it should be pretty much instant – so all in all, a good first test.

VirtualBox uses its .vdi format, KVM uses the .qcow format (quad cow? I hope so 🙂). You can export from VB to a raw format, and import into KVM from raw.

So I issue the command to VB to export the Win7 VM to raw.

It starts – “0%…”

I wait. Wait. Wait. Start doing other things – well, it’s a 25GB file, it’s got quite a bit to do. It’s reading from USB3 and writing to a local SSD.

Linux interestingly turns out not to be very good at coping with lots of disk I/O – at least, not on my system. The whole system regularly becomes unresponsive for a few seconds at a time. Linux is much better than Windows at handling intense CPU work (in that situation, Windows regularly becomes unresponsive for a few seconds at a time) but it seems the opposite for disk.

Anyways – I start cooking dinner. I’ve seen VB behave like this before – it shows it’s started, takes forever, and then in about half a second the progress bar completes.

I get on with cooking. I try to browse a bit, but it’s awkward, with the freezes. Then – wait! what’s that!! ACTION! VB has printed…

10%!

AHHHHHHHHHHHHHHHH!!!!! it really IS going to take hours!!!

Well, bugger. Well, I can get to the gym, anyway, and it’ll be done when I get back.

I start eating dinner.

Then VB goes 20/30/40/50/60/70/80/90/100 in about half a second and completes 🙂

Update

Need to make benchmark work on Windows with NUMA.

This is (as previously blogged) a problem, because VirtualBox (up to and including the very recently released version 5) returns broken data from the Windows topology API.

I looked at using Xen, in the hope a different VM would do the trick, but Xen setup is, ah, let’s say – it’s complex enough that I am absolutely, completely, totally and utterly certain from years of experience with Linux that there isn’t a chance in hell of me being able to make it run.

So either I look around for some more VMs (VMware, maybe? but I vaguely think it’s commercial), or I get hold of a Windows machine (I looked at this, with Amazon, but the problem is – Windows expects a GUI; there *are* instructions for installing Windows so that you get remote VNC-like access, but they too are WAY WAY past the point of viable complexity – I know from years of experience blah blah etc no way in HELL I could ever make it work – people keep writing all this software which is so complex it can’t be made to work, which is exactly the same as not writing it at all), or I simply say for this release that it’s been compiled on Windows and ought to work but the code hasn’t been run. Which sucks, but it’s not obvious what the alternative is.

The other bit of work to be done is to run all the code on a NUMA Linux machine. Amazon can help here (for a pretty penny). That’s the next job. Nap first.

ARM #4

Btree on ARM now 3x faster than next best.

That’s a 30x improvement by being sensible about cache line use.

ARM #3

I think I’ve made a big mistake in the lock-free btree structs. The element struct looks like this;

struct lfds710_btree_au_element
{
  struct lfds710_btree_au_element LFDS710_PAL_ALIGN(LFDS710_PAL_ATOMIC_ISOLATION_IN_BYTES)
    *volatile left,
    *volatile right,
    *volatile up;

  void LFDS710_PAL_ALIGN(LFDS710_PAL_ATOMIC_ISOLATION_IN_BYTES)
    *volatile value;

  void LFDS710_PAL_ALIGN(LFDS710_PAL_ATOMIC_ISOLATION_IN_BYTES)
    *key;
};

But all those volatile/aligned members – they’re all given those properties because they can be subject to an atomic operation. The thing is, ONLY ONE OF THEM CAN BE OPERATED ON AT A TIME.

So it’s not necessary for EACH MEMBER to be so aligned – all that’s needed is that the STRUCT is so aligned.

In a fully read-write btree, especially a balanced one where we see plenty of atomic ops all over the place on an ongoing basis, the per-member alignment may pay off. In this simple add-only tree, each pointer in the element is only ever written once anyway, no matter what! So it’s okay to invalidate all the other pointers in the element; the cost of doing so once is profoundly less than the ongoing cost of the multiple cache-line loads from tree traversal.
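
So something like this instead (a sketch, untested); aligning only the first member forces the struct as a whole onto the atomic isolation boundary, while the members stay packed together and, with luck, in one cache line.

struct lfds710_btree_au_element
{
  struct lfds710_btree_au_element LFDS710_PAL_ALIGN(LFDS710_PAL_ATOMIC_ISOLATION_IN_BYTES)
    *volatile left;

  struct lfds710_btree_au_element
    *volatile right,
    *volatile up;

  void
    *volatile value;

  void
    *key;
};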

The add-only linked lists are going to be experiencing the same problem. No benchmarks for them yet though.

ARM #2

The problem, at least in major part, is the alignment of the variables in the btree struct.

Adding matching atomic-isolation alignment (64 bytes) to the GCC atomic struct halves performance and in fact makes it equal on a single core to the lock-free btree.

Without the alignment, the entire btree element fits in a single cache line.

With it, it’s two cache lines PER ALIGNED MEMBER OF THE STRUCT… ahhhkkkkk…

ARM

So, the benchmark runs fine now on MIPS and ARM.

The results however are far from fine.

Now, these are only little dev boards and I suspect the hardware has a big part to play in the relative performance of the locking vs lock-free data structures – in particular, the lock-free btree has a lot of memory accesses compared to the locking btree.

However, what I see is this; on MIPS, lock-free wins, but not by much – 2x at most.

On ARM, I can’t remember the queue/freelist results offhand because I’m staggered by the btree result, and that’s what I’m currently investigating.

The btree on ARM (well, a Raspberry Pi 2 Model B) is TEN TIMES SLOWER THAN USING SPINLOCKS.

On Intel, it’s ten times *faster*.

So I’m trying to figure this out now. Colossal blunder somewhere? I hope so…

One thing I have just found, which really surprised me, is how much difference even small changes can make. In the btree compare function, the code currently is like this;

static int key_compare_function( void const *new_key, void const *existing_key )
{
  int
    cr = 0;

  struct libbenchmark_benchmark_btree_au_liblfds_lockfree_readn_writen_benchmark_element
    *new_benchmark_element,
    *existing_benchmark_element;

  LFDS710_PAL_ASSERT( new_key != NULL );
  LFDS710_PAL_ASSERT( existing_key != NULL );

  new_benchmark_element = (struct libbenchmark_benchmark_btree_au_liblfds_lockfree_readn_writen_benchmark_element *) new_key;
  existing_benchmark_element = (struct libbenchmark_benchmark_btree_au_liblfds_lockfree_readn_writen_benchmark_element *) existing_key;

  if( new_benchmark_element->datum < existing_benchmark_element->datum )
    cr = -1;

  if( new_benchmark_element->datum > existing_benchmark_element->datum )
    cr = 1;

  return( cr );
}

Change it to this;

static int key_compare_function( void const *new_key, void const *existing_key )
{
  struct libbenchmark_benchmark_btree_au_liblfds_lockfree_readn_writen_benchmark_element
    *new_benchmark_element,
    *existing_benchmark_element;

  LFDS710_PAL_ASSERT( new_key != NULL );
  LFDS710_PAL_ASSERT( existing_key != NULL );

  new_benchmark_element = (struct libbenchmark_benchmark_btree_au_liblfds_lockfree_readn_writen_benchmark_element *) new_key;
  existing_benchmark_element = (struct libbenchmark_benchmark_btree_au_liblfds_lockfree_readn_writen_benchmark_element *) existing_key;

  if( new_benchmark_element->datum < existing_benchmark_element->datum )
    return( -1 );

  if( new_benchmark_element->datum > existing_benchmark_element->datum )
    return( 1 );

  return( 0 );
}

And that actually gives a 1.01% speedup – over five seconds an extra 12,000 operations.