liblfds.org blog

blog

2016-05-01

More M&S Queue Performance Tuning / Analysis

I realised that I had missed in the M&S queue a place where backoff needs to occur.

I had backoff where-ever a CAS or DWCAS occurs. However, backoff is actually needed where-ever an operation can fail due to contention with other threads.

In the M&S, such an operation exists - it’s not CAS or DWCAS, it’s just reading variables, then a load barrier, then a check the variables are unchanged. It needs backoff, didn’t have it.

I had high hopes this might be the cause of the mysterious missing performance on two threads (one per core) - but, alas, no.

It does add about 15% on four threads (two per core) so it’s worth having, but it’s the not the key issue (if there is one!)

I have instrumented the queue code to record in an per-thread array the backoff iteration number reached for a dequeue or an enqueue, and I now print these out from the benchmark.

They are very interesting.

Here’s the output for two threads, one per core. En for enqueue, Dn for dequeue, backoff iteration begins at 0 (and a count for zero means the queue operation went through first time) and a count for a higher number means we were on attempt n to perform that given operation.

This is for a backoff quantum of 8 (ranges from 0 to whatever you want - the higher, the more backoff, but the more backoff, the more unfair you become - the usual tradeoffs, etc). Eight is a pretty good value for most data structures on my platform.

============ LP 3 : E0 : 7684462 LP 3 : E1 : 6042450 LP 3 : E2 : 356540 LP 3 : E3 : 67853 LP 3 : E4 : 47549 LP 3 : E5 : 42962 LP 3 : E6 : 41005 LP 3 : E7 : 39853 LP 3 : E8 : 40438 LP 3 : E9 : 39311 LP 3 : D0 : 10008606 LP 3 : D1 : 649547 LP 3 : D2 : 40764 LP 3 : D3 : 7992 LP 3 : D4 : 2092 LP 3 : D5 : 1234 LP 3 : D6 : 512 LP 3 : D7 : 266 LP 3 : D8 : 194 LP 3 : D9 : 120 ============ LP 2 : E0 : 10859024 LP 2 : E1 : 7284726 LP 2 : E2 : 388366 LP 2 : E3 : 15056 LP 2 : E4 : 3982 LP 2 : E5 : 1748 LP 2 : E6 : 1137 LP 2 : E7 : 942 LP 2 : E8 : 970 LP 2 : E9 : 806 LP 2 : D0 : 13080513 LP 2 : D1 : 776180 LP 2 : D2 : 6417 LP 2 : D3 : 384 LP 2 : D4 : 51 LP 2 : D5 : 15 LP 2 : D6 : 5 LP 2 : D7 : 1 LP 2 : D8 : 1 LP 2 : D9 : 2

There are many things to see.

Dequeue basically has an easier time of it. The count for 0 is TWO orders of magnitude greater than the count for 1, and the higher counts are almost 0.

Enqueue OTOH is having a hard time.

What this suggests to me is the possibility of auto-tuning backoff. Have the user pass in a per-thread array for keeping track of backoff iteration numbers reached, and a per-thread (and per data structure operation - enqueue being different to dequeue, etc), and then every n operations, if E0 is not the highest count, increase backoff by one. Need a mechanism to decrease though - perhaps aiming for E0 and E1 to be equal. ### auto-setting backoff

So, auto-setting backoff.

There are a number of important advantages.

As it is, it’s become clear that the backoff value varies not just by CAS/DWCAS, but by data structure and indeed by operation in each data structure.

It also varies, I expect and think I’ve now seen (I’ve not been looking for it yet) by the level of contention - the more threads banging away at once, the ideal backoff value changes (as you’d expect).

It will I am sure also vary by CPU topology (which also relates in part to how many threads are banging away at once).

So it’s basically freaking impossible to get right ahead of time, and it matters a LOT for performance (although one saving grace is that the window of good performance seems pretty wide).

What I have in mind for a method is to keep track of the backoff iteration reached for each operation, and then every n operatioin, decide based on the ratio between backoff iteration count for 0 and the count for 1. We need to know the previous ratio, to compare it to the current ratio, so we can know if what we did last time (increase or decrease) helped or hindered.

A problem here is where we keep this info. In a perfect world, it would be retained per thread and per data structure instance and per-operation state (i.e. separate state for queue enqueue and dequeue, freelist pop and freelist push, etc), which the user passes in via an argument to every liblfds lock-free operation. This is a huge PITA for a user.

We could reduce this to per-thread, per-operation state, which would amalgamte auto-setting across all data structures. This isn’t right, since different instances will have different numbers of threads using them. At least it takes topology into account.

Ideally, we avoid user-visible state entirely.

This would implies per data structure instance state, which is therefore valid for all threads using that instance, and will also be per-operation. The problem is then this would need atomic increments - splat, performance hit. We might find however that non-atomic increments work well enough. The loss of increments will of course be in proportion to how often a given counter is increments, so it will tend to bias the ratio we conpute - but it might be just fine in practise.

2016-05-03

Backoff auto-tune successful

Has been a success.

It works well, and removes the need to specify backoff values.

I’ve moved over fully to it now, and removed all the CAS/DCAS setting code in the benchmark app.

2016-05-04

Code complete

Development work is complete.

Now to sort out build configuration, run test and benchmark on all platforms, bring the 7.0.0 docs over and update them to 7.1.0, then release.

2016-05-06

MS build problems

So, I’m getting the port to Windows back into shape.

I’m only supporting Windows Platform SDK for Windows 7 - it’s the very last release of stand-alone command line compilers, that I can find, from MS. I’m using this with gnumake.

I have however run into what seems to be a strange and novel problem. I’m using the latest version of the SDK, 7.1 (before I used 7.0). I recall with 7.0 there was a command line offered from the start menu for each build target (XP, 2000, Win7, etc) and I used them all.

Now though there’s only a command line offered for Windows 7.

So I fire up a normal command line, then run setenv and target XP.

Problem is, it’s not compilting like it’s targetting XP. NTDDI_VERSION is set for Windows 7 (the host OS). I was expect this and _WIN32_WINNT and so on to end up being set for XP. They’re not.

So how do I target XP?

The only thing I find setenv doing is setting an env-var, “TARGET_PLATFORM”.

I’ve also found out, far as I can tell, there’s no support in MSVC for dumping the compiler macros from the command line, as there is in GCC.

So, WTH, you know?

I have a command line option in setenv to target specific Windows versions, which simply does not work, or so it seems. I have no way of finding out what the compiler has set, and so the only thing I can see to do is that in MY code, I look at the values for TARGET_PLATFORM (whatsever they are) and then MYSELF set _WIN32_WINNT.

Which is absolutely freaking insane and can’t be right - but there’s nothing I can find anywhere about what is right, and what I can see is that setenv targetting for a given Windows versions simply does not work.

So I’m going to run up another VM and install 7.0 - about 30 minutes work - and then see if it will compile XP properly.

Ha, in other news, DHL failed to deliver a package today, so I have to go pick it up from their depot Monday. They failed to deliver because the intercom on the AirBnB I’m in does not work - so they couldn’t call me from the door.

Broken topology info reporting for Windows 7 in VirtualBox under Linux

GetLogicalProcessorInformationEx() on Windows 7 on VirtualBox on Linux returns partially garbage results.

As such, I cannot enumerate topology, as such, I cannot develop the benchmark app on this VM.

TBH, I’m not surprised this happens, given how appallingly messed up the Windows topology API is. It’s impossible to use it correctly - implementing it correctly must be twice as hard.

Win7 topology reporting under VirtualBox

So, the Ex() functions are totally fucked. They return complete garbage. I think the first record is right, and then it goes rapidly down into 0xcdcdcdcdcd.

The old-style, pre-processor-group API, that works fine except for caches - it only reports data caches. No instruction caches, no unified caches.

I had assumed when I went over to Linux I could continue to support Windows by means of a VM.

Looks like it’s time to try something other than VirtualBox, which is a shame, I’ve used it for a long time and it’s familiar and in every other way has been fine.

2016-05-08

Doccing

Writing the 7.1.0 docs! :-)

Update

1am. Finished the first draft of the API docs (the left-hand column).

Tomorrow will pick up some new computer kit, including a Syncwire four-port USB charger which the dev boards will run off of, which will be a marvellous help with getting the liblfds ports testing done.

Basically, I think, two days work to go. One day to finish the docs (the right-hand column - all the build and porting guides) and one day to finish the testing work on the code.

2016-05-11

Update

Getting the final coding done, making the library compile on all platforms.

I’ve realised I want to change atomic exchange into atomic set (i.e. an exchange, but throw away the original value).

Problem is you can’t test atomic set.

Update

Whewf.

So, atomic exchange. Currently, the code doesn’t exchange. It only does a set. That means we need to throw away the unwanted original value. Two problems with this. First, you can’t test set. Think about it - you have a function, it does an atomic set, you don’t know the original value. How do check it works? you can test exchange just fine. Second, type-fucking-punning. Why the hell is it called punning? the only meaning I know for the word pun is “joke”. So the problem is that to disgard the type, we have to declare a variable, set it to the value we want to set, and pass the address of that variable into the exchange - but we can’t know the type unless the user passes it in.

So what I’ve done is kept exchange, and it’s test, and added set as well. Exchange is not used - but people making a port can write code for exchange and test it. Making set them means taking that exchange code and just throwing away the return value.

So that was one issue.

Second issue, type-fucking-punning, again. The problem turns up with some versions of GCC and not others, which worries me. The basic issue is that I’m using an array for pointer-and-counter. The GCC atomic instrincs are typeless - which means on 32 bit platforms to get double-word compare-and-swap I have an array of to 32-bit elements, which I was casting to an int long long unsigned… and which breaks type-punning. I’ve just experimented, and far as I can see, GCC doesn’t do atomic ops on arrays, so I HAVE to use the int long long unsigned type - which means converting all the pointer-and-counter arrays to that type and casting the first half of it to a pointer… which I’m worried will also throw type-punning errors.

I understand why the linux kernel has strict aliasing off!

This is only showing up as a problem on 32 bit GCC platforms, because on 64-bit I’ve got to use some inline assembly anyway (there being no native 128 bit type). I don’t want to use inline assembly unless I have to, because I have to write it for every platform.

In fact, this is a pain in the ass. On 64-bit platforms, there’s no native 128 bit type, so I HAVE to use an array of two elements. The code has to work with this - so obviously I want to use the same array of two on 32 bit platforms - only, ha-ha-ha, type-punning kicks me in the ass.

In fact, I can’t see a way around this. If I have arrays, type-punning bitches. If I have an int long long unsigned for 32 bit, I can’t compile for 64 bit, because there’s no native 128 bit type.

What concerns me is that I only see this problem on the Raspberry Pi, not on the Ci20… ah, now, the Pi is using the new atomic instrincts, the Ci20 the old. That might have something to do with it too, although it shouldn’t. On my laptop, I’m 64 bit, so I’m not using the GCC atominc intrinsics, I’m using the inline assembley.

GCC has an attribute may_alias, which I think I’m going to have to apply to the arrays.

That or use a union. Thinking.

2016-05-12

may_alias is an alias for will_go_horribly_wrong

So, I’ve been thinking over the use of a union for the pointer-and-counter pairs.

It’s a better representation - it’s more showing things to be what they really are - but it means some rewriting, and more to the point, pretty ugly code, physically ugly, as I have to use BIG_LONG_DEFINES to access the pointer and counter.

So I decided to use may_alias on those pointer-and-counter pairs, for now.

Know what happens?

GCC internal compiler error.

staggeringly vast facepalm

What I said? about everything being fucked? totally and utterly? you can’t rely on anything, at all, ever, because it’s all fucked?

Right.

Works okay on the Pi, with it’s moderately older compiler.

So I think I actually have to fully disable strict aliasing.

I need a build system which has ever major release of GCC on it, and I need to compile on that. Oh - and for every CPU architecture I can get my hands on.

Damned if you do, damned if you don’t

Getting thoroughly sick of type-punning.

If I omit a particular cast, GCC throws a warning - you’re disgarding the volatile qualifier.

If I put the case in to explicitly indicate I mean to do so, GCC throws another fucking warning, you might break type-punning.

The particular function I’m calling is an atomic intrinsic. The argument being passed is expected by the instrinct to be non-volatile, but linked lists being what they are, I have to be passing in the next pointer of the previous element, which has to be volatile.

GCC does not support make uninstall

“Please note that GCC does not support ‘make uninstall’ and probably won’t do so in the near future as this would open a can of worms.”

facepalm

This means back in Feb I actually made a one-way install of GCC 6.0…

So I can manually delete the binaries, but then I’ve got to hunt down the libraries. Without breaking my system…

Utter GCC frustration

Type fucking punning, jesus christ on a fucking stick.

So, I have pointer-and-counter pairs. I use a double-word CAS to atomically modify them. As a pair, they’re arranged as an array of two elements. On a 32-bit systems, that’s two 32-bit unsigned ints. To DWCAS, I pass them to GCC’s atomic instrinsic, and it MUST have the type as an “int long long unsigned”, so I pass in the address of the first element and cast it to that.

Doesn’t work, breaks type punning of course.

Problem is, there’s NO FUCKING WAY to FIX type punning, short - and this is a theory only - of rewriting the code base to use a union instead of an array.

I’ve spent twelve fucking hours on this now.

There’s an attribute, may_alias, which SIMPLY DOES NOT FUCKING WORK AT ALL EVER NO MATTER WHAT YOU DO. Google for it. Almost no hits, and those which do exist are “how do I make this work? anyone have a mininal code example where it does work? at all?” followed by “I used a union in the end”.

The only way I can see out of this is to fully disable strict aliasing both for libfds and for the test library, since it calls the header file code for atomic ops from liblfds, and so will show the same problems when compiling.

Whack-a-mole with GCC

Starting to feel like getting code to compile with multiple versions of GCC is like playing whack-a-mole.

Fix a problem on one platform, you’ll produce a new problem on another platform.

Are we having fun yet?

ARM ERG

Finding out the ERG for ARM (or indeed any platform) is problematic. So many different models, so hard to get such specific abstruse technical information.

I’m going to write a bit of code - will all have to be platform specific, so it’ll be ARM only to start with - which empirically determines the ERG.

2016-05-13

ERG

Been writing the ERG finder.

Spent all day making it not work.

Eventually found out - there’s a local monitor and a global monitor. The local monitor doesn’t do much, and if your write ops all exist inside the same core, it’s the only monitor involved. You must have another core write, to involve the global monitor. As such, the code to determine ERG has to be multi-threaded.

In other news, the best mangos in Greece cost about four times as much as in the UK, but the best cherries cost about a quarter.

In every country you go to, you must adjust your diet.

2016-05-14

Determining the ERG

So I’ve been banging away on this code, and remembering some things I’d forgotten - like a given core I think can only have one LDREX open at a time. This constrains implementation, because if you say start a thread, it’ll for sure performing some atomic work, and bang - your LDREX fails automatically.

So I have to have one thread doing the ldrex/strex on the first word of the ERG length we’re testing, while the other threads write to the test word which is just inside the ERG length we’re testing, and the first thread has to KNOW the other threads have written, without using atomic operations.

It’s not that hard, but I did have to think a bit! assuming I have it right, of course :-)

However, while doing this, I thought - hang on, there’s a register in the chip which knows the ERG length. Why don’t I just write a bit of assembly and read it? IDIOT!

So I google a bit, and yup sure enough there it is, and a trivial bit of code to read it, so I write the code, and hey presto it doesn’t work. Illegal instruction.

After an hour or two of investigation, the code to find the ERG length can only be run from supervisor mode, and I can’t get to supervisor mode from user mode, so it’s absolutely no bloody use to me at all.

What I actually need, if there’s any milage in this at all, is for the OS to publish the ERG value to me, in user-mode. Far as I can tell, that doesn’t exist.

So back to my empirical code :-)

ERG code done

Results on a Cortex A7, where the value is the number of successful LL/SC operations (out of 1024 - we expect an odd failure due to thread swapping);

ERG 8 bytes : 0
ERG 16 bytes : 0
ERG 32 bytes : 0
ERG 64 bytes : 1023
ERG 128 bytes : 1024
ERG 256 bytes : 1023
ERG 512 bytes : 1024
ERG 1024 bytes : 1024
ERG 2048 bytes : 10

So the ERG size looks like 64 bytes. I don’t know why the 2048 byte size fails so much!

Docs say this;

“Exclusive reservation granule size is 16 words.”

Which is indeed 64 bytes.

2016-05-15

ARM

So, the benchmark runs fine now on MIPS and ARM.

The results however are far from fine.

Now, these are only little dev boards and I suspect the hardware has a big part of play in the relative performance of the locking vs lock-free data structures - in particular the lock-free btree has a lot of memory accessess compared to the locking btree.

However, what I see is this; on MIPS, lock-free wins, but either by not much or only 2x.

On ARM, I can’t remember the queue/freelist results offhand because I’m staggered by the btree result, and it’s that I am currently investigating.

The btree on ARM (well, a Raspberry Pi 2 Model B) is TEN TIMES SLOWER THAN USING SPINLOCKS.

On Intel, it’s ten times faster.

So I’m trying to figure this out now. Collosal blunder somewhere? I hope so…

One thing I have just found, which really surprised me, is how much difference even small changes can make. In the btree compare function, the code currently is like this;

static int key_compare_function( void const *new_key, void const *existing_key )
{
  int
    cr = 0;

  struct libbenchmark_benchmark_btree_au_liblfds_lockfree_readn_writen_benchmark_element
    *new_benchmark_element,
    *existing_benchmark_element;

  LFDS710_PAL_ASSERT( new_key != NULL );
  LFDS710_PAL_ASSERT( existing_key != NULL );

  new_benchmark_element = (struct libbenchmark_benchmark_btree_au_liblfds_lockfree_readn_writen_benchmark_element *) new_key;
  existing_benchmark_element = (struct libbenchmark_benchmark_btree_au_liblfds_lockfree_readn_writen_benchmark_element *) existing_key;

  if( new_benchmark_element->datum < existing_benchmark_element->datum )
    cr = -1;

  if( new_benchmark_element->datum > existing_benchmark_element->datum )
    cr = 1;

  return( cr );
}

Change it to this;

static int key_compare_function( void const *new_key, void const *existing_key )
{
  struct libbenchmark_benchmark_btree_au_liblfds_lockfree_readn_writen_benchmark_element
    *new_benchmark_element,
    *existing_benchmark_element;

  LFDS710_PAL_ASSERT( new_key != NULL );
  LFDS710_PAL_ASSERT( existing_key != NULL );

  new_benchmark_element = (struct libbenchmark_benchmark_btree_au_liblfds_lockfree_readn_writen_benchmark_element *) new_key;
  existing_benchmark_element = (struct libbenchmark_benchmark_btree_au_liblfds_lockfree_readn_writen_benchmark_element *) existing_key;

  if( new_benchmark_element->datum < existing_benchmark_element->datum )
    return( -1 );

  if( new_benchmark_element->datum > existing_benchmark_element->datum )
    return( 1 );

  return( 0 );
}

And that actually gives a 1.01% speedup - over five seconds an extra 12,000 operations.

ARM #2

The problem at least in major part is the alignment of the variables in the btree struct.

Adding matching atomic-isolation alignment (64 bytes) to the GCC atomic struct halves performance and in fact makes it equal on a single core to the lock-free btree.

Without the alignment, the entire btree element fits in a single cache line.

With it, it’s two caches line PER ALIGNED MEMBER OF THE STRUCT… ahhhkkkkk…

ARM #3

I think I’ve made a big mistake in the lock-free btree structs. The element struct looks like this;

struct lfds710_btree_au_element
{
  struct lfds710_btree_au_element LFDS710_PAL_ALIGN(LFDS710_PAL_ATOMIC_ISOLATION_IN_BYTES)
    *volatile left,
    *volatile right,
    *volatile up;

  void LFDS710_PAL_ALIGN(LFDS710_PAL_ATOMIC_ISOLATION_IN_BYTES)
    *volatile value;

  void LFDS710_PAL_ALIGN(LFDS710_PAL_ATOMIC_ISOLATION_IN_BYTES)
    *key;
};

But all those volatile/aligned members, they’re all given those properties because they can be subject to an atomic operation, the thing is, ONLY ONE OF THEM CAN BE OPERATED ON AT A TIME.

So it’s not necessary for EACH MEMBER to be so aligned - all that’s needed is that the STRUCT is so aligned.

In a fully read-write btree, especially balanced where we see plenty of atomic opts all over the place on an ongoing basis, the alignment may pay off. In this simple tree, add-only, each pointer element is only ever written once anyway, no matter what! so it’s okay to invalidate all the other pointers in the element; the cost of doing so once is profoundly less and on an on-going basis than the cost of the multiple cache-line loads from tree traversal.

The add-only linked lists are going to be experiencing the same problem. No benchmarks for them yet though.

ARM #4

Btree on ARM now 3x faster than next best.

That’s a 30x improvement by being sensible about cache line use.

2016-05-17

Update

Need to make benchmark work on Windows with NUMA.

This is (as previously blogged) a problem, because VirtualBox (up to and including the very recently released version 5) return broken data from the Windows topology API.

I looked at using Xen, in the hope a different VM would do the trick, but Xen setup is, ah, let’s say - it’s complex enough that I am absolutely, completely, totally and utterly certain from years of experience with Linux that there isn’t a chance in hell of me being able to make it run.

So either I look around for some more VMs (VMware, maybe? but I vaguely think it’s commerical), or I get hold of a Windows machine (I looked at this, with Amazon, but the problem is - Windows expects a GUI; there are instructions for installing Windows so that you get remote VNC-like access, but they too are WAY WAY past the point of viable complexity - I know from years of experience blah blah etc no way in HELL I could ever make it work - people keep writing all this software which is so complex it can’t be made to work, which is exactly the same as not writing it at all), or I simply say for this release that it’s been compiled on Windows and ought to work but the code hasn’t been run. Which sucks, but it’s not obvious what the alternative is.

The other bit of work to be done is to run all the code on a NUMA Linux machine. Amazon can help here (for a pretty penny). That’s the next job. Nap first.

lols

So, I’m giving KVM a try.

I came across a page of KVM setup instructions so complex that I thought I was reading advanced hypermathamtics in ancient Greek but then later came across a different page of instructions which involved three commands - and which worked.

So I came to the point where I was to create a guest. I thought I’d try bringing over the existing VirtualBox Win7 (with SDK) guest, as an experiment - Windows IME hates having the machine change under it, and that might screw up the topology stuff by itself anyway, but I’d like to see KVM work in the first place, and it might work anyway, and it should be pretty much instant - so all in all, good first test.

VirtualBox uses its .vdi format, KVM uses the .qcow format (quad cow? I hope so :-) you can export from VB to a raw format, and import in KVM from raw.

So I issue th command to VB to export the Win7 VM to raw.

It starts - “0%…”

I wait. Wait. Wait. Start doing other things - well, it’s a 25GB file, it’s got quite a bit to do. It’s reading from USB3 and writing to a local SSD.

Linux interestingly turns out not to be very good at coping with lots of disk I/O - at least, not on my system. The whole system regularly becomes unresponsive for a few seconds at time. Linux is much better than Windows at handling intense CPU work (in that situation, Windows regularly becomes unresponsive for a few seconds at a time) but it seems the opposite for disk.

Anyways - I start cooking dinner. I’ve seen VB behave like this before - it shows it’s started, takes forever, and then in about half a second the progress bar completes.

I get on with cooking. I try to browse a bit, but it’s awkward, with the freezes. Then - wait! what’s that!! ACTION! VB has printed…

10%!

AHHHHHHHHHHHHHHHH!!!!! it really IS going to take hours!!!

Well, bugger. Well, I can get to the gym, anyway, and it’ll be done when I get back.

I start eating dinner.

Then VB goes 20/30/40/50/60/70/80/90/100 in about half a second and completes :-)

GetLogicalProcessorInformation / GetLogicalProcessorInformationEx

I think I’ve figured it out.

VirtualBox is not at fault, if I’m right.

The original version of this function, GetLogicalProcessorInformation() returns an array of SYSTEM_LOGICAL_PROCESSOR_INFORMATION. They contain a union, but in C, the union is sized to its largest member, so the size of the struct is fixed and indeed, you know how many you have by dividing the size of the returned data by the size of the struct.

The later version, GetLogicalProcessorInformationEx() also - according to the docs - returns an array, but it’s now SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX. The docs actually say “array”. However, this new struct has a new member - “Size”.

I now suspect it returns a packed “array” - not an array, because you can’t index into it - of these structs, and you move to the next one by casting your pointer to an unsigned char and then adding the size of the current element to it.

THIS IS NOWHERE IN THE DOCUMENTATION. THERE IS NO EXAMPLE CODE. THE ONLY WAY YOU CAN KNOW IS TO GUESS BECAUSE THERE IS NOW A SIZE MEMBER IN THE STRUCT.

Progress

Benchmark (and so topology) works on Win32 with NUMA.

Need to run it on a big Linux machine now which actually has more than one NUMA node.

2016-05-19

Freelist scaling

I added an elimination layer in front of the freelist.

Works like this; you have an array of freelist element pointers. Cache line aligned, and you have n cache lines worth of array. All these pointers start NULL.

When a thread come to push, it randomly selects one of the cache lines and scans it from left to right. The first NULL it finds, it atomic exchanges the pointer to the freelist element it wants to push. If the original value returned by the exchange is NULL, that’s it! the pusher can make a run for it. No need to hit the main freelist at all. If the original value of the exchange is actually a pointer, another pusher exchanged onto that pointer before us; so we’ve got rid of our element, but only to get hold of someone elses, so we’re not done - we keep scanning the cache line of pointers to try to find another NULL.

If we can’t find a NULL, then we push to the main freelist as normal.

For popping, when a thread comes to pop, it likewise randomly selects a line from the array and scans left to right, looking for non-NULL values. If it finds one, it tries to exchange with NULL. If the original value returned is a pointer, then bingo! the thread has got hold of an available freelist element. If the original value is a NULL, another popper grabbed this element before us and so we keep canning. If we can’t find a non-NULL, then we pop as normal from the main freelist.

So the upshot of all this is that we offload memory accesses from the single freelist head pointer, over a range of cache lines - as many or as few as we want - in the elimination layer.

So I thought this would help a lot.

It’s so far a big fat non-event. It’s helped a bit, I dunno, maybe 10%, something like that? but I was expecting far, far more.

I’m trying to figure out why this is happening. One thought is the absence of back-off. Obviously we expect a lot of cache line contention between threads, but if there are enough cache lines in the elimination array, compared to the number of threads, I’m thinking it shouldn’t be that much of a problem…

Experimentation continues.

2016-05-21

Update

Wierd shit be going down.

So, we have the lock-free freelist.

Fine.

Performs like crap. The problem is contention. Threads keep running into each other.

backoff[0] = 371608 backoff[1] = 1798761 backoff[2] = 212273 backoff[3] = 401016 backoff[4] = 92713 backoff[5] = 265962 backoff[6] = 59579 backoff[7] = 96049 backoff[8] = 35300 backoff[9] = 58019

That list there is for four threads (two cores), on the benchmark. The index is how many times a given push operation collided, before it managed to succeed. So 371,608 operations went through without colliding - then 1,798,761 had one collisions, 213,273 had three, etc.

So We add backoff. We look at index 0 and index 1, and aim for a ratio of 100:1, adjusting the backoff up or down accordingly.

Woohoo! 5x improvement. The collision array looks like this now;

backoff[0] = 70565880 backoff[1] = 391154 backoff[2] = 7050 backoff[3] = 663 backoff[4] = 208 backoff[5] = 51 backoff[6] = 22 backoff[7] = 11 backoff[8] = 16 backoff[9] = 6

Notice first how much higher the numbers are (we’re getting a lot more done) and of course notice how heavily skewed we are now to index 0.

Great.

But we want to improvement performance some more. The basic problem now is every thread is trying to hit the freelist head point. So we add an elimination layer. Basically, it’s a set of cache lines, which are filled with freelist element pointers, and when we come to push, we randomly choose a line and then look for a free slot (NULL) and if we find one, atomic exchange our element into that slot and we’re done; and popping, we randomly pick a line and search for a non-NULL and if we fine one, we’re done.

This means we’re now distributing the freelist memory load over multiple cache lines, not just the one.

So what happens when we have this?

Well we get this;

backoff[0] = 3674707 backoff[1] = 334781 backoff[2] = 8646 backoff[3] = 3982 backoff[4] = 2058 backoff[5] = 1046 backoff[6] = 465 backoff[7] = 204 backoff[8] = 64 backoff[9] = 20

Index 0 has gone down by about a factor of 20. We’re REALLY offloading from the freelist…

…but what’s wierd, really freakingly wierd, is that index 1 is basically UNCHANGED.

And this is throwing the backoff autotuner off, because it’s trying to maintain a 100:1 ratio for backoff. So now it’s continually trying to raise the backoff metric.

I do not understand why index 1 is unchanged.

Index 2 is unchanged as well, but after that, we see big (5x, 10x) drops.

2016-05-23

makefile whack-a-mole

I finally found the gnumake recommended practises documentation and I’ve been making the makefiles comformant.

When you have fourteen makefiles, over four platforms, it takes a while to get all of them happy everywhere…

2016-05-24

Going nuts

Been working non-stop, mininal sleep, to get the release out.

There’s SO MUCH WORK to get builds working across what in fact in the end is 17 different sets of build files, which run on six different plaforms, especially when you’ve largely rewritten most of them.

I in fact hadn’t consciously realised it, but I had not implemented some of the abstraction layers - I’ve just implemented (really a re-implementation, mainly, I’d done a lot of in a long time ago, but removed the code since then) the Windows kernel abstraction layer and I’ve just right now been implementing the Linux kernel abstraction layer for libshared…

…and, now, finally, after all this time, I have discovered the Linux kernel does not offer a “wait for thread to terminate” function, which totally and utterly fucks the entire abstraction mechanism for threads.

truly -epic- facepalm

Linux does has a function to create threads in the kernel and to put them on a given node and CPU, but you have to manually implement signalling between the thread(s) and the issuing thread, to sync between them.

This means I now need to implement this style of mechanism across all platforms, which means rewriting a metric fucking ton of code. It also means making an abstraction layer for an inter-thread communication mechanism. Jesus, headache, headache, headache. WHY NOT JUST OFFER WAIT-ON-THREAD-HANDLE FFS EVERYONE ELSE DOES.

It also means 7.1.0 is going to be released without Linux kernel support for test and benchmark. You make your choices, you get what you get. Linux made its choice, and so it doesn’t get test and benchmark.

Linux kernel source code

Just found out Linux kernel source code base has lower case #defines.

I found out ’cause they’re colliding with my VARIABLE NAMES…

Linux may not be the super-awesome-power code base we’re thinking it is =-)

2016-05-25

Learned something new

So, Windows kernel has no return type for its thread functions.

To abstract this, I have a macro, which is for Windows kernel defined like this;

#define LIBSHARED_THREAD_RETURN_TYPE( return_value )

So at the end of the thread function when I come to return, I have this;

return( LIBSHARED_THREAD_RETURN_TYPE(return_value) );

(The type of return_value is “int” for Windows kernel, just so the assignments in the code compile).

So the code ends up post-pre-processor like this;

return();

Which I thought would be fine. Return is not a function call, it’s a keyword, so the brackets just disappear.

Guess what?

Of COURSE it doesn’t work.

The compiler doesn’t like it. “error C2059: syntax error : ‘)’”

Update

So, read the spec, return with an expression from void is a warning/error.

I fixed it by doing something I’ve thought for quite a while I should do - not use brackets with return. It’s not a function, don’t use brackets. Turns out now to matter, since it allows me to mask windows kernel threads in a readable way.

So, the code now compiles on WDK, all platforms.

This means I now have successful compilation on all platforms with all build files sets - all twenty-one of them…!

I”m not going to go round for the second pass and make sure everything still compiles on everything else until I’ve done the docs, since they will prolly lead to changes.

I’m now writing the docs.

2016-05-26

Getting there!

One more set of API function docs to write - about a dozen or so pages.

Then a bunch of pre-release prep, then release!

2016-05-27

Update

Public APIs which are too complex or too big to be documented are totally worthless.

When you come to write your docs, you have to rewrite all that code.

I’ve spent the last half-day rewriting the libbenchmark abstraction API for enumerating processer/memory topology.

The problem has been that the function was directly using a bunch of topology_node API stuff, which then would all needed to be documented.

I’ve written a small set of helper functions, which mask all of the topology_node stuff.

2016-05-28

Update / benchmarking 7.0.0

Documentation first-pass finished.

No missing articles.

I need to fix up the “enum flag” type use in the codebase - there’s a type in liblfds itself, and another in the test_and_benchmark code, and they’re intermixed in that code. It needs to be one or the other.

Changing subject, the code for each release of liblfds is wholly indpendent and occupies different namespaces (by means of prefixes to the names of all public entities). As such, every version of the library can be compiled against and linked to, concurrently.

As such, the benchmark app can benchmark the different versions of the same data structure.

I’ve just spent an hour adding in a benchmark for the 7.0.0 btree!

The performance is really bad, though - and I never remembered it being like this; it was also pretty good. So I’m wondering if I’ve flubbed up the implementation somewhere along the line.

benchmarking earlier liblfds releases

So, problem.

Consider - as time goes by, liblfds supports more platforms. 7.1.0 supports MIPS. 6.1.1 did not. If benchmark benchmarks all version of liblfds, well, then it will try to include the earlier version’s header file and it will try to link to its library.

The header file is fine - it will work, and it can be modified (I can make bugfix releases for earlier versions) to make public which data structures are available. The library file isn’t fine, because right now it has to be hard coded in the makefile…!

It’s the libnuma problem all over again. With libnuma, I have a second build type. Adopting that solution for liblfds versions would mean one of two things, depending on whether the assumption is made platform support only increases over time. If as releases go by, platform support extends, then we’d end up with one extra build variant per release, i.e.;

gcc_gnumake_liblfds611_liblfds700_liblfds711 gcc_gnumake_liblfds700_liblfds711 gcc_gnumake_liblfds711

This would be doubled, because of libnuma.

However, if platform support can simply vary, then this approach becomes awkward.

I am however strongly inclined to the view it will only increase over time.

However however, this then leaves the user with what on the face of it is an awkward problem - when the user comes to build, he sees these build variant directories, and to select the correct variant, he has to know what releases of liblfds he can build.

OTOH that’s not so hard - he can just try to build them, and see if they build okay.

2016-05-31

7.1.0 is out

Finally, I can get some sleep :-)

And a haircut.

Some exercise!

A social life.

And I can start applying for jobs.

Home Blog Forum Mailing Lists Documentation GitHub Contact

admin at liblfds dot org