Update

Getting the final coding done, making the library compile on all platforms.

I’ve realised I want to change atomic exchange into atomic set (i.e. an exchange, but throw away the original value).

Problem is *you can’t test atomic set*.
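
To make the distinction concrete, here’s a minimal sketch using C11 atomics (liblfds has its own abstraction layer – these names are purely illustrative). An exchange hands the old value back, so a test can check that across all threads every old value was seen exactly once; a set throws it away, so there’s nothing to observe.

#include <stdatomic.h>
#include <stdint.h>

void example( atomic_uintptr_t *shared, uintptr_t new_value )
{
  uintptr_t
    old;

  /* exchange : the old value comes back, so atomicity is observable - a test
     can check every old value was seen by exactly one thread */

  old = atomic_exchange( shared, new_value );
  (void) old;

  /* set : the same store, but the old value is thrown away - there is
     nothing for a test to observe */

  atomic_store( shared, new_value );
}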

Update

1am. Finished the first draft of the API docs (the left-hand column).

Tomorrow I will pick up some new computer kit, including a Syncwire four-port USB charger for the dev boards to run off, which will be a marvellous help with getting the liblfds ports tested.

Basically, I think, two days work to go. One day to finish the docs (the right-hand column – all the build and porting guides) and one day to finish the testing work on the code.

Win7 topology reporting under VirtualBox

So, the Ex() functions are totally fucked. They return complete garbage. I think the first record is right, and then it goes rapidly down into 0xcdcdcdcd.

The old-style, pre-processor-group API works fine, except for caches – it only reports data caches. No instruction caches, no unified caches.

I had assumed when I went over to Linux I could continue to support Windows by means of a VM.

Looks like it’s time to try something other than VirtualBox, which is a shame – I’ve used it for a long time, it’s familiar, and in every other way it’s been fine.

Broken topology info reporting for Windows 7 in VirtualBox under Linux

GetLogicalProcessorInformationEx() on Windows 7 on VirtualBox on Linux returns partially garbage results.

As such, I cannot enumerate topology, and so I cannot develop the benchmark app on this VM.
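
For context, here’s roughly how the Ex() records get walked – a minimal sketch against the documented API, not liblfds code. The records are variable length and you advance by each record’s Size field, so one bad Size and everything after it is garbage, which fits what I’m seeing (first record right, the rest 0xcdcdcdcd).

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main( void )
{
  char
    *buffer;

  DWORD
    length = 0,
    offset = 0;

  SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX
    *record;

  /* first call fails by design, telling us the required buffer length */

  GetLogicalProcessorInformationEx( RelationAll, NULL, &length );

  if( NULL == (buffer = malloc(length)) )
    return EXIT_FAILURE;

  if( !GetLogicalProcessorInformationEx(RelationAll, (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *) buffer, &length) )
    return EXIT_FAILURE;

  while( offset < length )
  {
    record = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *) ( buffer + offset );

    printf( "relationship %d, size %u\n", (int) record->Relationship, (unsigned int) record->Size );

    offset += record->Size; /* variable-length records - one corrupt Size and the walk is lost */
  }

  free( buffer );

  return EXIT_SUCCESS;
}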

TBH, I’m not surprised this happens, given how appallingly messed up the Windows topology API is. It’s impossible to *use* it correctly – *implementing* it correctly must be twice as hard.

MS build problems

So, I’m getting the port to Windows back into shape.

I’m only supporting the Windows Platform SDK for Windows 7 – it’s the very last release of stand-alone command-line compilers from MS that I can find. I’m using this with gnumake.

I have however run into what seems to be a strange and novel problem. I’m using the latest version of the SDK, 7.1 (before I used 7.0). I recall with 7.0 there was a command line offered from the start menu for each build target (XP, 2000, Win7, etc) and I used them all.

Now though there’s only a command line offered for Windows 7.

So I fire up a normal command line, then run setenv and target XP.

Problem is, it’s not compiling like it’s targeting XP. NTDDI_VERSION is set for Windows 7 (the host OS). I was expecting this, and _WIN32_WINNT and so on, to end up being set for XP. *They’re not*.

So how *do* I target XP?

The only thing I find setenv doing is setting an env-var, “TARGET_PLATFORM”.

I’ve also found out, as far as I can tell, that there’s no support in MSVC for dumping the compiler macros from the command line, as there is in GCC (gcc -dM -E).

So, WTH, you know?

I have a command line option in setenv to target specific Windows versions, which simply does not work, or so it seems. I have no way of finding out what the compiler has set, and so the only thing I can see to do is that in MY code, I look at the values for TARGET_PLATFORM (whatever they are) and then MYSELF set _WIN32_WINNT.

Which is absolutely freaking insane and *can’t* be right – but there’s nothing I can find anywhere about what is right, and what I *can* see is that setenv targeting a given Windows version simply *does not work*.
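
If I do end up going the manual route, this is the kind of thing I mean – a sketch only, with a hypothetical define name the makefile would derive from TARGET_PLATFORM.

/* hypothetical : the makefile turns the TARGET_PLATFORM env-var into,
   say, -DLIBLFDS_TARGET_XP on the compiler command line, and then a
   header sets the version macros itself */

#if defined( LIBLFDS_TARGET_XP )
  #undef _WIN32_WINNT
  #undef NTDDI_VERSION
  #define _WIN32_WINNT   0x0501      /* Windows XP */
  #define NTDDI_VERSION  0x05010000  /* NTDDI_WINXP */
#endif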

So I’m going to run up another VM and install 7.0 – about 30 minutes work – and then see if it will compile XP properly.

Ha, in other news, DHL failed to deliver a package today, so I have to go pick it up from their depot Monday. They failed to deliver because the intercom on the AirBnB I’m in does not work – so they couldn’t call me from the door.

Code complete

Development work is complete.

Now to sort out build configuration, run test and benchmark on all platforms, bring the 7.0.0 docs over and update them to 7.1.0, then release.

Backoff auto-tune successful

Has been a success.

It works well, and removes the need to specify backoff values.

I’ve moved over fully to it now, and removed all the CAS/DWCAS backoff setting code in the benchmark app.

auto-setting backoff

So, auto-setting backoff.

There are a number of important advantages.

As it is, it’s become clear that the backoff value varies not just by CAS/DWCAS, but by data structure and indeed by operation in each data structure.

It also varies, I expect and think I’ve now seen (I’ve not been looking for it yet), with the level of contention – as more threads bang away at once, the ideal backoff value changes (as you’d expect).

It will I am sure also vary by CPU topology (which also relates in part to how many threads are banging away at once).

So it’s basically freaking impossible to get right ahead of time, and it matters a LOT for performance (although one saving grace is that the window of good performance seems pretty wide).

What I have in mind for a method is to keep track of the backoff iteration reached for each operation, and then every n operations, decide based on the ratio between the backoff iteration count for 0 and the count for 1. We need to know the previous ratio, to compare to the current ratio, so we can know whether what we did last time (increase or decrease) helped or hindered.
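
A sketch of what I mean – hypothetical names, not liblfds API – assuming we keep counts for backoff iterations 0 and 1 and treat the adjustment as a one-dimensional hill climb.

struct backoff_tune_state
{
  unsigned long long int
    count[2]; /* count[0] = went through first time, count[1] = needed one retry */

  double
    previous_ratio;

  int
    direction; /* +1 = last adjustment increased backoff, -1 = decreased it */

  unsigned int
    backoff;
};

void backoff_tune( struct backoff_tune_state *bts )
{
  double
    ratio;

  ratio = (double) bts->count[0] / (double) ( bts->count[1] + 1 ); /* +1 dodges divide-by-zero */

  /* if the ratio got worse, whatever we did last time hindered - reverse direction */

  if( ratio < bts->previous_ratio )
    bts->direction = -bts->direction;

  if( bts->direction == 1 )
    bts->backoff++;
  else
    if( bts->backoff > 0 )
      bts->backoff--;

  bts->previous_ratio = ratio;
  bts->count[0] = bts->count[1] = 0;
}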

A problem here is where we keep this info. In a perfect world, it would be retained per thread, per data structure instance and per operation (i.e. separate state for queue enqueue and dequeue, freelist pop and freelist push, etc), which the user passes in via an argument to every liblfds lock-free operation. This is a huge PITA for a user.

We could reduce this to per-thread, per-operation state, which would amalgamate auto-setting across all data structures. This isn’t right, since different instances will have different numbers of threads using them. At least it takes topology into account.

Ideally, we avoid user-visible state entirely.

This implies per data structure instance state, which is therefore valid for all threads using that instance, and will also be per-operation. The problem is then that this would need atomic increments – splat, performance hit. We might find however that non-atomic increments work well enough. The loss of increments will of course be in proportion to how often a given counter is incremented, so it will tend to bias the ratio we compute – but it might be just fine in practice.
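
A sketch of that last option (names hypothetical) – the counters live in the instance, one set per operation, and the increments are deliberately plain.

#define BACKOFF_ITERATION_LIMIT 10

struct queue_state
{
  /* ...the actual queue state... */

  unsigned long long int
    enqueue_backoff_count[BACKOFF_ITERATION_LIMIT],
    dequeue_backoff_count[BACKOFF_ITERATION_LIMIT];
};

/* in the enqueue retry loop, on reaching backoff iteration i :

     qs->enqueue_backoff_count[i]++;

   a plain, non-atomic increment - concurrent threads will lose some
   increments, biasing the ratio we compute, but perhaps not enough
   to matter in practice */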

More M&S Queue Performance Tuning / Analysis

I realised that I had missed in the M&S queue a place where backoff needs to occur.

I had backoff wherever a CAS or DWCAS occurs. However, backoff is actually needed wherever an operation can fail due to contention with other threads.

In the M&S, such an operation exists – it’s not a CAS or DWCAS, it’s just reading variables, then a load barrier, then a check that the variables are unchanged. It needs backoff, and didn’t have it.
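
The pattern in question, as a minimal sketch – simplified layout, illustrative names, not the liblfds source.

#include <stdatomic.h>

struct ms_node
{
  struct ms_node *_Atomic
    next;

  void
    *value;
};

struct ms_queue
{
  struct ms_node *_Atomic
    head;

  struct ms_node *_Atomic
    tail;
};

/* returns 1 if the snapshot was consistent, 0 if another thread got in the
   way - the 0 path is failure due to contention, and so needs backoff */

int dequeue_snapshot( struct ms_queue *q, struct ms_node **head, struct ms_node **next )
{
  *head = atomic_load_explicit( &q->head, memory_order_relaxed );
  *next = atomic_load_explicit( &(*head)->next, memory_order_relaxed );

  atomic_thread_fence( memory_order_acquire ); /* the load barrier */

  /* re-check : if head has moved, our two reads may be mutually inconsistent */

  if( *head != atomic_load_explicit( &q->head, memory_order_relaxed ) )
    return 0;

  return 1;
}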

I had high hopes this might be the cause of the mysterious missing performance on two threads (one per core) – but, alas, no.

It does add about 15% on four threads (two per core), so it’s worth having, but it’s not the key issue (if there is one!)

I have instrumented the queue code to record in a per-thread array the backoff iteration number reached for a dequeue or an enqueue, and I now print these out from the benchmark.

They are very interesting.

Here’s the output for two threads, one per core. En is for enqueue, Dn for dequeue; backoff iteration begins at 0, a count for zero means the queue operation went through first time, and a count for a higher number n means we were on attempt n when the operation went through.

This is for a backoff quantum of 8 (ranges from 0 to whatever you want – the higher, the more backoff, but the more backoff, the more unfair you become – the usual tradeoffs, etc). Eight is a pretty good value for most data structures on my platform.
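
To make “quantum” concrete – a plausible reading only, the actual liblfds backoff loop may well differ – on failed attempt n you spin for roughly n times the quantum in empty loops.

/* a sketch only - the real liblfds backoff may differ; the point is just
   what "quantum" means in the numbers below : the higher the quantum,
   the more spinning per failed attempt */

void backoff( unsigned int attempt, unsigned int quantum )
{
  volatile unsigned long long int
    counter;

  /* volatile stops the compiler optimising the empty loop away */

  for( counter = 0 ; counter < (unsigned long long int) attempt * quantum ; counter++ );
}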

============
LP 3 : E0 : 7684462
LP 3 : E1 : 6042450
LP 3 : E2 : 356540
LP 3 : E3 : 67853
LP 3 : E4 : 47549
LP 3 : E5 : 42962
LP 3 : E6 : 41005
LP 3 : E7 : 39853
LP 3 : E8 : 40438
LP 3 : E9 : 39311
LP 3 : D0 : 10008606
LP 3 : D1 : 649547
LP 3 : D2 : 40764
LP 3 : D3 : 7992
LP 3 : D4 : 2092
LP 3 : D5 : 1234
LP 3 : D6 : 512
LP 3 : D7 : 266
LP 3 : D8 : 194
LP 3 : D9 : 120
============
LP 2 : E0 : 10859024
LP 2 : E1 : 7284726
LP 2 : E2 : 388366
LP 2 : E3 : 15056
LP 2 : E4 : 3982
LP 2 : E5 : 1748
LP 2 : E6 : 1137
LP 2 : E7 : 942
LP 2 : E8 : 970
LP 2 : E9 : 806
LP 2 : D0 : 13080513
LP 2 : D1 : 776180
LP 2 : D2 : 6417
LP 2 : D3 : 384
LP 2 : D4 : 51
LP 2 : D5 : 15
LP 2 : D6 : 5
LP 2 : D7 : 1
LP 2 : D8 : 1
LP 2 : D9 : 2

There are many things to see.

Dequeue basically has an easier time of it. The count for 0 is more than an order of magnitude greater than the count for 1, and the higher counts are almost zero.

Enqueue OTOH is having a hard time.

What this suggests to me is the possibility of auto-tuning backoff. Have the user pass in a per-thread array for keeping track of backoff iteration numbers reached, per data structure operation (enqueue being different to dequeue, etc), and then every n operations, if E0 is not the highest count, increase backoff by one. Need a mechanism to decrease though – perhaps aiming for E0 and E1 to be equal. A first sketch of the rule is below.
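
Something like this, say – hypothetical names, with the per-operation counts passed in.

/* hypothetical sketch of the rule above; called every n operations,
   where count[i] = operations which completed on backoff iteration i */

void adjust_backoff( unsigned long long int const *count, unsigned int *backoff )
{
  if( count[0] < count[1] )
    (*backoff)++; /* E0 is not the highest count - more backoff */
  else
    if( count[0] > 2 * count[1] && *backoff > 0 )
      (*backoff)--; /* E0 well clear of E1 - ease off, aiming for them roughly equal */
}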