source control / github / versioning

I’ve opened a new github account, “liblfds2”. There will be one repo, “liblfds”. I’m going to follow a trunk-with-branches approach to source control. I couldn’t use the existing account, because the older repos in there would make it confusing.

I’m using SVN myself, so I’ve moved over to the standard trunk/branches/tags layout.

libbenchmark has a private branch of the earlier versions of liblfds, with the porting abstraction layer and build files updated so they work with the current version (i.e. support the same platforms, so they can all be compiled and run together).

versioning

So, libbenchmark contains the older versions of the library (from 7.0.x onwards) so you can see how well the older versions are performing compared to the current version.

This leads to a problem where the current version can compile and run on more platforms than older versions, and this comes down to differences in the porting abstraction layer.

Deciding what to do with the abstraction layer is a bit problematic.

There are two reasons for change;

1. bug fixes
2. ensuring older library versions can compile with the current benchmark app

The policy with regard to bug fix releases is of course bug fixes only.

This is not compatible with the sometimes major changes needed to work with the current benchmark app (such as adding new platforms).

The latest version may well have significant modifications (internal improvements), not just additions (new platforms).

We could imagine having two branches, one with bug fix changes only, one with an up-to-date abstraction layer for libbenchmark, but this is getting more complex for the user and also the current versioning system does not properly support this.

I begin to realise versioning systems are a straitjacket. It seems problematic to encode everything in them, but anything you can’t encode can’t happen in your source control system.

For a while I had (and originally did have) an integer version which simply increments. When you have branching, you then need something on your web-site to provide information about which version is which.

Problem is anything which isn’t directly obvious is going to lead people to mistakes.

Pick your poison.

I think what I may do is keep a private branch for libbenchmark.

Users won’t know about it, so it won’t muddy the waters.

twenty hours of debugging, problem now solved

I have just spent twenty hours (broken up by a night of sleep) debugging.

There was something amiss on aarch64 – benchmarks using DWCAS were segfaulting.

I uncovered one bug with the freelist elimination layer, which I fixed. I found another with thread startup ordering, and another with the pointer/counter being the wrong way round, still present in 7.0.1 and 7.1.1.

Still, this didn’t fix the problem.

The particular example of the problem I focused on was this : the freelist benchmark for 7.0.1 with two threads. There are two freelist elements, two threads, each pops and then pushes, in a tight loop. The problem was that one of the threads was popping a NULL (empty freelist).

This in principle was impossible.

Well, I’ve just figured it out, after twenty hours of doing nothing but working on this. I’m pretty relieved and happy 🙂

So, on aarch64, I implement my own DWCAS, using LL/SC. Have to, because libatomic is totally messed up right now. I don’t use memory barriers in the DWCAS itself – for me, they’re separate, in the code itself, rather than in the DWCAS.

This means though that the DWCAS is issuing “ldxp” to load, and then “stxp” to store.

If any other thread writes to the load/store location after the ldxp, the stxp will fail – so we’re protected against another thread writing between our load and store.

But…

…we are NOT protected against LOADING THE WRONG VALUE IN THE FIRST PLACE.

So we have (say) no memory barriers. We load the original value – this is prior to the DWCAS. Someone else then does a write. *We don’t see it*, because we have not issued a load barrier – and then we enter into the DWCAS. We load the wrong value – we’ve not seen their write yet – and then we compare it with the compare value we loaded prior to the DWCAS, and lo and behold, they are the same *and no one has written to the load/store location during the DWCAS*. So we store!! and that messes things up completely.

There has to be a load barrier between the load of the compare and the DWCAS.
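To make the fix concrete, here’s a minimal sketch using C11 atomics rather than the aarch64 inline assembly. This is illustrative only, not the liblfds implementation: the names are mine, and a single 64-bit word stands in for the pointer/counter double word. The point is barrier placement – the compare value is loaded some time before the DWCAS, so an acquire (load) barrier must sit between that earlier load and the DWCAS itself.

```c
#include <stdint.h>
#include <stdatomic.h>

typedef struct { _Atomic uint64_t dword; } dwcas_target;

static int dwcas( dwcas_target *t, uint64_t *compare, uint64_t new_value )
{
  /* the fix: a load barrier between the caller's earlier load of
     *compare and the DWCAS itself - without it, the DWCAS can compare
     against a stale value, see no concurrent write during its own
     LL/SC window, and wrongly succeed */
  atomic_thread_fence( memory_order_acquire );

  /* relaxed orderings, as the barriers live outside the DWCAS itself */
  return atomic_compare_exchange_strong_explicit( &t->dword, compare, new_value,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed );
}
```

The exchange updates `*compare` with the observed value on failure, which matches the usual retry-loop usage.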

It’s plain now – extra plain – I need data structure implementations per processor family.

freelist elimination array design flaw

So!

I’ve been working on the build system.

I uncovered one subtle bug (wrong type used) in the test library, fixed that.

Then I found that the benchmark for the freelist with elimination layer was crashing consistently with two threads on aarch64 for liblfds 7.1.2 (the in-development successor to 7.1.1; the liblfds core library is also present in the benchmark library of 7.2.0, so it can be benchmarked and compared against other versions of liblfds).

I’ve been looking into this for two days.

I think I’ve found a design flaw in the elimination array. I’m still thinking it over, but I think so.

In the current design, you specify when initializing the freelist how many cache lines worth of elimination array you are providing – recommendation is on the order of one per thread you expect to be using the freelist (where the cache lines are used to store freelist element pointers).

You will then push elements onto the freelist. I think often it’s necessary to know how many elements are available – so you might push 32 elements, and you *know* you have 32 elements, and it might be for your use case you know you’ll *never* use more than 32, so you can then know the freelist will never be empty.

However, the elimination array is present.

The way it works (in 7.1.x – it’s a bit different in 7.2.0, I’ll come to that) is that a cache line is randomly chosen (using a decent PRNG) and it’s scanned – if you want to push, you’re looking for a NULL, if you want to pop, you’re looking for a pointer.

Now, the first problem is the use of the PRNG (rather than round robin, which is how it works in 7.2.0). The PRNG produces a random value, and the number of elimination array lines is always a power of two, so we subtract one from the number of array lines and binary-AND that with the random value, to select the array line to use.
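As a sketch (names are mine, not the liblfds API) – since the line count is a power of two, the mask selects a line without a modulo:

```c
#include <stdint.h>

/* number_of_lines must be a power of two; masking the PRNG output with
   (number_of_lines - 1) yields an index in [0, number_of_lines) */
static unsigned int select_array_line( uint64_t prng_value, unsigned int number_of_lines )
{
  return (unsigned int) ( prng_value & (uint64_t) (number_of_lines - 1) );
}
```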

Of course, being a random number generator, the counts of each value produced always differ a little – the percentage imbalance stays roughly constant, but the absolute difference it represents grows as the number of operations increases. So if we have say 100 pops and 2 cache lines, we might find 49 went to cache line 0 and 51 to cache line 1. After 100,000 pops it might be 49,800 on cache line 0 and 50,200 on cache line 1.

The problem is that the number of elements in the cache lines is very small – usually eight.

This means that fairly quickly one cache line is full and the other is empty.

That breaks the elimination layer function *anyway*, but it then – after much crossing of eyes and scratching of head – led me on to a second issue.

So, consider the user wants to put 32 elements in the freelist – and rely on there being 32 elements in there. However, we have the elimination array. The freelist selects one line, scans it, if no luck it pops from the freelist proper.

Imagine we have 32 elements, but say 20 cache lines, each with eight pointers. We could imagine all the elements being popped, then pushed back to the array, but leaving some array lines empty – and then we try to pop, we pick an empty array line, then we go to the freelist proper – and it’s empty. Embarrassed looks all round.

To cope with this, the user is expected to push in an extra number of elements – enough to fill the elimination array completely *but minus one cache line*, the reasoning being that the worst possible arrangement of these extra elements leaves all cache lines full except for one, we pick that one, miss, and then go to the freelist proper – so it all works.
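The arithmetic for that sizing rule, as a sketch (illustrative names, not the liblfds API):

```c
/* extra elements the user is asked to push: enough to fill every
   elimination array line except one, so that in the worst case the one
   line we pick is the empty one and we fall through to the freelist
   proper (which then still holds the user's own elements) */
static unsigned int extra_elements( unsigned int array_lines, unsigned int pointers_per_line )
{
  return ( array_lines - 1 ) * pointers_per_line;
}
```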

It doesn’t work though.

Imagine the above scenario. Say there are four cache lines; three are full, one is empty. There are say four elements on the freelist proper. We want to pop, we select the empty cache line, we begin to scan it – nothing so far… we’re halfway down the scan *and the other threads pop the elements from the freelist, and push them into the array line BEHIND where we are now*.

We think the array is empty, we go to pop from the freelist proper, embarrassed looks all round.

I half-think the only solution to this is that elimination array begins fully populated.

build system development

It’s not perfect yet – there should be listings I think for test and benchmark failures – but you can see it’s getting there! The output isn’t very ordered yet, either. I also have a ton of benchmark gnuplots, which were automatically generated 🙂

Total time about 420 minutes on the slowest platform, so about seven hours (platforms run in parallel).

The “512 FAILED” is a bit of a bug – it should say failed, but the number seems unexpected =-)

localhost : x86_64 : gcc : 5.3.0 : ar_vanilla : build : 0
localhost : x86_64 : gcc : 5.3.0 : so_dbg : build : 0
localhost : x86_64 : gcc : 5.3.0 : so_rel : build : 0
localhost : x86_64 : gcc : 5.3.0 : ar_prof : build : 0
localhost : x86_64 : gcc : 5.3.0 : ar_cov : build : 0
localhost : x86_64 : gcc : 5.3.0 : ar_dbg : build : 0
localhost : x86_64 : gcc : 5.3.0 : so_prof : build : 0
localhost : x86_64 : gcc : 5.3.0 : so_vanilla : build : 0
localhost : x86_64 : gcc : 5.3.0 : so_tsan : build : 0
localhost : x86_64 : gcc : 5.3.0 : ar_rel : build : 0
localhost : x86_64 : gcc : 5.3.0 : ar_tsan : build : 0
localhost : x86_64 : gcc : 5.1.0 : ar_vanilla : build : 0
localhost : x86_64 : gcc : 5.1.0 : so_dbg : build : 0
localhost : x86_64 : gcc : 5.1.0 : so_rel : build : 0
localhost : x86_64 : gcc : 5.1.0 : ar_prof : build : 0
localhost : x86_64 : gcc : 5.1.0 : ar_cov : build : 0
localhost : x86_64 : gcc : 5.1.0 : ar_dbg : build : 0
localhost : x86_64 : gcc : 5.1.0 : so_prof : build : 0
localhost : x86_64 : gcc : 5.1.0 : so_vanilla : build : 0
localhost : x86_64 : gcc : 5.1.0 : so_tsan : build : 0
localhost : x86_64 : gcc : 5.1.0 : ar_rel : build : 0
localhost : x86_64 : gcc : 5.1.0 : ar_tsan : build : 0
localhost : x86_64 : gcc : 6.2.0 : ar_vanilla : build : 0
localhost : x86_64 : gcc : 6.2.0 : so_dbg : build : 0
localhost : x86_64 : gcc : 6.2.0 : so_rel : build : 0
localhost : x86_64 : gcc : 6.2.0 : ar_prof : build : 0
localhost : x86_64 : gcc : 6.2.0 : ar_cov : build : 0
localhost : x86_64 : gcc : 6.2.0 : ar_dbg : build : 0
localhost : x86_64 : gcc : 6.2.0 : so_prof : build : 0
localhost : x86_64 : gcc : 6.2.0 : so_vanilla : build : 0
localhost : x86_64 : gcc : 6.2.0 : so_tsan : build : 0
localhost : x86_64 : gcc : 6.2.0 : ar_rel : build : 0
localhost : x86_64 : gcc : 6.2.0 : ar_tsan : build : 0
localhost : x86_64 : gcc : 4.6.4 : ar_vanilla : build : 512 FAILED
localhost : x86_64 : gcc : 4.6.4 : so_dbg : build : 512 FAILED
localhost : x86_64 : gcc : 4.6.4 : so_rel : build : 512 FAILED
localhost : x86_64 : gcc : 4.6.4 : ar_prof : build : 512 FAILED
localhost : x86_64 : gcc : 4.6.4 : ar_cov : build : 512 FAILED
localhost : x86_64 : gcc : 4.6.4 : ar_dbg : build : 512 FAILED
localhost : x86_64 : gcc : 4.6.4 : so_prof : build : 512 FAILED
localhost : x86_64 : gcc : 4.6.4 : so_vanilla : build : 512 FAILED
localhost : x86_64 : gcc : 4.6.4 : ar_rel : build : 512 FAILED
localhost : x86_64 : gcc : 7.1.0 : ar_vanilla : build : 0
localhost : x86_64 : gcc : 7.1.0 : so_dbg : build : 0
localhost : x86_64 : gcc : 7.1.0 : so_rel : build : 0
localhost : x86_64 : gcc : 7.1.0 : ar_prof : build : 0
localhost : x86_64 : gcc : 7.1.0 : ar_cov : build : 0
localhost : x86_64 : gcc : 7.1.0 : ar_dbg : build : 0
localhost : x86_64 : gcc : 7.1.0 : so_prof : build : 0
localhost : x86_64 : gcc : 7.1.0 : so_vanilla : build : 0
localhost : x86_64 : gcc : 7.1.0 : so_tsan : build : 0
localhost : x86_64 : gcc : 7.1.0 : ar_rel : build : 0
localhost : x86_64 : gcc : 7.1.0 : ar_tsan : build : 0
localhost : x86_64 : gcc : 6.3.0 : ar_vanilla : build : 0
localhost : x86_64 : gcc : 6.3.0 : so_dbg : build : 0
localhost : x86_64 : gcc : 6.3.0 : so_rel : build : 0
localhost : x86_64 : gcc : 6.3.0 : ar_prof : build : 0
localhost : x86_64 : gcc : 6.3.0 : ar_cov : build : 0
localhost : x86_64 : gcc : 6.3.0 : ar_dbg : build : 0
localhost : x86_64 : gcc : 6.3.0 : so_prof : build : 0
localhost : x86_64 : gcc : 6.3.0 : so_vanilla : build : 0
localhost : x86_64 : gcc : 6.3.0 : so_tsan : build : 0
localhost : x86_64 : gcc : 6.3.0 : ar_rel : build : 0
localhost : x86_64 : gcc : 6.3.0 : ar_tsan : build : 0
localhost : x86_64 : gcc : 4.8.0 : ar_vanilla : build : 0
localhost : x86_64 : gcc : 4.8.0 : so_dbg : build : 0
localhost : x86_64 : gcc : 4.8.0 : so_rel : build : 0
localhost : x86_64 : gcc : 4.8.0 : ar_prof : build : 0
localhost : x86_64 : gcc : 4.8.0 : ar_cov : build : 0
localhost : x86_64 : gcc : 4.8.0 : ar_dbg : build : 0
localhost : x86_64 : gcc : 4.8.0 : so_prof : build : 0
localhost : x86_64 : gcc : 4.8.0 : so_vanilla : build : 0
localhost : x86_64 : gcc : 4.8.0 : so_tsan : build : 0
localhost : x86_64 : gcc : 4.8.0 : ar_rel : build : 0
localhost : x86_64 : gcc : 4.8.0 : ar_tsan : build : 0
localhost : x86_64 : gcc : 6.1.0 : ar_vanilla : build : 0
localhost : x86_64 : gcc : 6.1.0 : so_dbg : build : 0
localhost : x86_64 : gcc : 6.1.0 : so_rel : build : 0
localhost : x86_64 : gcc : 6.1.0 : ar_prof : build : 0
localhost : x86_64 : gcc : 6.1.0 : ar_cov : build : 0
localhost : x86_64 : gcc : 6.1.0 : ar_dbg : build : 0
localhost : x86_64 : gcc : 6.1.0 : so_prof : build : 0
localhost : x86_64 : gcc : 6.1.0 : so_vanilla : build : 0
localhost : x86_64 : gcc : 6.1.0 : so_tsan : build : 0
localhost : x86_64 : gcc : 6.1.0 : ar_rel : build : 0
localhost : x86_64 : gcc : 6.1.0 : ar_tsan : build : 0
localhost : x86_64 : gcc : 4.9.4 : ar_vanilla : build : 0
localhost : x86_64 : gcc : 4.9.4 : so_dbg : build : 0
localhost : x86_64 : gcc : 4.9.4 : so_rel : build : 0
localhost : x86_64 : gcc : 4.9.4 : ar_prof : build : 0
localhost : x86_64 : gcc : 4.9.4 : ar_cov : build : 0
localhost : x86_64 : gcc : 4.9.4 : ar_dbg : build : 0
localhost : x86_64 : gcc : 4.9.4 : so_prof : build : 0
localhost : x86_64 : gcc : 4.9.4 : so_vanilla : build : 0
localhost : x86_64 : gcc : 4.9.4 : so_tsan : build : 0
localhost : x86_64 : gcc : 4.9.4 : ar_rel : build : 0
localhost : x86_64 : gcc : 4.9.4 : ar_tsan : build : 0
localhost : x86_64 : gcc : 4.8.5 : ar_vanilla : build : 0
localhost : x86_64 : gcc : 4.8.5 : so_dbg : build : 0
localhost : x86_64 : gcc : 4.8.5 : so_rel : build : 0
localhost : x86_64 : gcc : 4.8.5 : ar_prof : build : 0
localhost : x86_64 : gcc : 4.8.5 : ar_cov : build : 0
localhost : x86_64 : gcc : 4.8.5 : ar_dbg : build : 0
localhost : x86_64 : gcc : 4.8.5 : so_prof : build : 0
localhost : x86_64 : gcc : 4.8.5 : so_vanilla : build : 0
localhost : x86_64 : gcc : 4.8.5 : so_tsan : build : 0
localhost : x86_64 : gcc : 4.8.5 : ar_rel : build : 0
localhost : x86_64 : gcc : 4.8.5 : ar_tsan : build : 0
localhost : x86_64 : gcc : 5.4.0 : ar_vanilla : build : 0
localhost : x86_64 : gcc : 5.4.0 : so_dbg : build : 0
localhost : x86_64 : gcc : 5.4.0 : so_rel : build : 0
localhost : x86_64 : gcc : 5.4.0 : ar_prof : build : 0
localhost : x86_64 : gcc : 5.4.0 : ar_cov : build : 0
localhost : x86_64 : gcc : 5.4.0 : ar_dbg : build : 0
localhost : x86_64 : gcc : 5.4.0 : so_prof : build : 0
localhost : x86_64 : gcc : 5.4.0 : so_vanilla : build : 0
localhost : x86_64 : gcc : 5.4.0 : so_tsan : build : 0
localhost : x86_64 : gcc : 5.4.0 : ar_rel : build : 0
localhost : x86_64 : gcc : 5.4.0 : ar_tsan : build : 0
localhost : x86_64 : gcc : 4.5.4 : ar_vanilla : build : 512 FAILED
localhost : x86_64 : gcc : 4.5.4 : so_dbg : build : 512 FAILED
localhost : x86_64 : gcc : 4.5.4 : so_rel : build : 512 FAILED
localhost : x86_64 : gcc : 4.5.4 : ar_prof : build : 512 FAILED
localhost : x86_64 : gcc : 4.5.4 : ar_cov : build : 512 FAILED
localhost : x86_64 : gcc : 4.5.4 : ar_dbg : build : 512 FAILED
localhost : x86_64 : gcc : 4.5.4 : so_prof : build : 512 FAILED
localhost : x86_64 : gcc : 4.5.4 : so_vanilla : build : 512 FAILED
localhost : x86_64 : gcc : 4.5.4 : ar_rel : build : 512 FAILED
raspberrypi : arm : gcc : 5.1.0 : ar_vanilla : build : 512 FAILED
raspberrypi : arm : gcc : 5.1.0 : so_dbg : build : 512 FAILED
raspberrypi : arm : gcc : 5.1.0 : so_rel : build : 512 FAILED
raspberrypi : arm : gcc : 5.1.0 : ar_prof : build : 512 FAILED
raspberrypi : arm : gcc : 5.1.0 : ar_cov : build : 512 FAILED
raspberrypi : arm : gcc : 5.1.0 : ar_dbg : build : 512 FAILED
raspberrypi : arm : gcc : 5.1.0 : so_prof : build : 512 FAILED
raspberrypi : arm : gcc : 5.1.0 : so_vanilla : build : 512 FAILED
raspberrypi : arm : gcc : 5.1.0 : ar_rel : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.5 : ar_vanilla : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.5 : so_dbg : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.5 : so_rel : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.5 : ar_prof : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.5 : ar_cov : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.5 : ar_dbg : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.5 : so_prof : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.5 : so_vanilla : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.5 : ar_rel : build : 512 FAILED
raspberrypi : arm : gcc : 5.4.0 : ar_vanilla : build : 0
raspberrypi : arm : gcc : 5.4.0 : so_dbg : build : 0
raspberrypi : arm : gcc : 5.4.0 : so_rel : build : 0
raspberrypi : arm : gcc : 5.4.0 : ar_prof : build : 0
raspberrypi : arm : gcc : 5.4.0 : ar_cov : build : 0
raspberrypi : arm : gcc : 5.4.0 : ar_dbg : build : 0
raspberrypi : arm : gcc : 5.4.0 : so_prof : build : 0
raspberrypi : arm : gcc : 5.4.0 : so_vanilla : build : 0
raspberrypi : arm : gcc : 5.4.0 : ar_rel : build : 0
raspberrypi : arm : gcc : 4.9.4 : ar_vanilla : build : 512 FAILED
raspberrypi : arm : gcc : 4.9.4 : so_dbg : build : 512 FAILED
raspberrypi : arm : gcc : 4.9.4 : so_rel : build : 512 FAILED
raspberrypi : arm : gcc : 4.9.4 : ar_prof : build : 512 FAILED
raspberrypi : arm : gcc : 4.9.4 : ar_cov : build : 512 FAILED
raspberrypi : arm : gcc : 4.9.4 : ar_dbg : build : 512 FAILED
raspberrypi : arm : gcc : 4.9.4 : so_prof : build : 512 FAILED
raspberrypi : arm : gcc : 4.9.4 : so_vanilla : build : 512 FAILED
raspberrypi : arm : gcc : 4.9.4 : ar_rel : build : 512 FAILED
raspberrypi : arm : gcc : 6.2.0 : ar_vanilla : build : 512 FAILED
raspberrypi : arm : gcc : 6.2.0 : so_dbg : build : 512 FAILED
raspberrypi : arm : gcc : 6.2.0 : so_rel : build : 512 FAILED
raspberrypi : arm : gcc : 6.2.0 : ar_prof : build : 512 FAILED
raspberrypi : arm : gcc : 6.2.0 : ar_cov : build : 512 FAILED
raspberrypi : arm : gcc : 6.2.0 : ar_dbg : build : 512 FAILED
raspberrypi : arm : gcc : 6.2.0 : so_prof : build : 512 FAILED
raspberrypi : arm : gcc : 6.2.0 : so_vanilla : build : 512 FAILED
raspberrypi : arm : gcc : 6.2.0 : ar_rel : build : 512 FAILED
raspberrypi : arm : gcc : 7.1.0 : ar_vanilla : build : 0
raspberrypi : arm : gcc : 7.1.0 : so_dbg : build : 0
raspberrypi : arm : gcc : 7.1.0 : so_rel : build : 0
raspberrypi : arm : gcc : 7.1.0 : ar_prof : build : 0
raspberrypi : arm : gcc : 7.1.0 : ar_cov : build : 0
raspberrypi : arm : gcc : 7.1.0 : ar_dbg : build : 0
raspberrypi : arm : gcc : 7.1.0 : so_prof : build : 0
raspberrypi : arm : gcc : 7.1.0 : so_vanilla : build : 0
raspberrypi : arm : gcc : 7.1.0 : ar_rel : build : 0
raspberrypi : arm : gcc : 4.8.0 : ar_vanilla : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.0 : so_dbg : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.0 : so_rel : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.0 : ar_prof : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.0 : ar_cov : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.0 : ar_dbg : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.0 : so_prof : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.0 : so_vanilla : build : 512 FAILED
raspberrypi : arm : gcc : 4.8.0 : ar_rel : build : 512 FAILED
pine64 : aarch64 : gcc : 5.1.0 : ar_vanilla : build : 0
pine64 : aarch64 : gcc : 5.1.0 : so_dbg : build : 0
pine64 : aarch64 : gcc : 5.1.0 : so_rel : build : 0
pine64 : aarch64 : gcc : 5.1.0 : ar_cov : build : 0
pine64 : aarch64 : gcc : 5.1.0 : ar_dbg : build : 0
pine64 : aarch64 : gcc : 5.1.0 : so_prof : build : 0
pine64 : aarch64 : gcc : 5.1.0 : so_vanilla : build : 0
pine64 : aarch64 : gcc : 5.1.0 : ar_prof : build : 0
pine64 : aarch64 : gcc : 5.1.0 : ar_rel : build : 0
pine64 : aarch64 : gcc : 4.8.5 : ar_vanilla : build : 0
pine64 : aarch64 : gcc : 4.8.5 : so_dbg : build : 0
pine64 : aarch64 : gcc : 4.8.5 : so_rel : build : 0
pine64 : aarch64 : gcc : 4.8.5 : ar_cov : build : 0
pine64 : aarch64 : gcc : 4.8.5 : ar_dbg : build : 0
pine64 : aarch64 : gcc : 4.8.5 : so_vanilla : build : 0
pine64 : aarch64 : gcc : 4.8.5 : ar_rel : build : 0
pine64 : aarch64 : gcc : 5.4.0 : ar_vanilla : build : 0
pine64 : aarch64 : gcc : 5.4.0 : so_dbg : build : 0
pine64 : aarch64 : gcc : 5.4.0 : so_rel : build : 0
pine64 : aarch64 : gcc : 5.4.0 : ar_cov : build : 0
pine64 : aarch64 : gcc : 5.4.0 : ar_dbg : build : 0
pine64 : aarch64 : gcc : 5.4.0 : so_prof : build : 0
pine64 : aarch64 : gcc : 5.4.0 : so_vanilla : build : 0
pine64 : aarch64 : gcc : 5.4.0 : ar_prof : build : 0
pine64 : aarch64 : gcc : 5.4.0 : ar_rel : build : 0
pine64 : aarch64 : gcc : 4.9.1 : ar_vanilla : build : 0
pine64 : aarch64 : gcc : 4.9.1 : so_dbg : build : 0
pine64 : aarch64 : gcc : 4.9.1 : so_rel : build : 0
pine64 : aarch64 : gcc : 4.9.1 : ar_cov : build : 0
pine64 : aarch64 : gcc : 4.9.1 : ar_dbg : build : 0
pine64 : aarch64 : gcc : 4.9.1 : so_prof : build : 0
pine64 : aarch64 : gcc : 4.9.1 : so_vanilla : build : 0
pine64 : aarch64 : gcc : 4.9.1 : ar_prof : build : 0
pine64 : aarch64 : gcc : 4.9.1 : ar_rel : build : 0
pine64 : aarch64 : gcc : 4.9.0 : ar_vanilla : build : 0
pine64 : aarch64 : gcc : 4.9.0 : so_dbg : build : 0
pine64 : aarch64 : gcc : 4.9.0 : so_rel : build : 0
pine64 : aarch64 : gcc : 4.9.0 : ar_cov : build : 0
pine64 : aarch64 : gcc : 4.9.0 : ar_dbg : build : 0
pine64 : aarch64 : gcc : 4.9.0 : so_prof : build : 0
pine64 : aarch64 : gcc : 4.9.0 : so_vanilla : build : 0
pine64 : aarch64 : gcc : 4.9.0 : ar_prof : build : 0
pine64 : aarch64 : gcc : 4.9.0 : ar_rel : build : 0
pine64 : aarch64 : gcc : 7.1.0 : ar_vanilla : build : 0
pine64 : aarch64 : gcc : 7.1.0 : so_dbg : build : 0
pine64 : aarch64 : gcc : 7.1.0 : so_rel : build : 0
pine64 : aarch64 : gcc : 7.1.0 : ar_cov : build : 0
pine64 : aarch64 : gcc : 7.1.0 : ar_dbg : build : 0
pine64 : aarch64 : gcc : 7.1.0 : so_prof : build : 0
pine64 : aarch64 : gcc : 7.1.0 : ar_prof : build : 0
pine64 : aarch64 : gcc : 7.1.0 : so_vanilla : build : 0
pine64 : aarch64 : gcc : 7.1.0 : so_tsan : build : 0
pine64 : aarch64 : gcc : 7.1.0 : ar_rel : build : 0
pine64 : aarch64 : gcc : 7.1.0 : ar_tsan : build : 0
pine64 : aarch64 : gcc : 6.1.0 : ar_vanilla : build : 0
pine64 : aarch64 : gcc : 6.1.0 : so_dbg : build : 0
pine64 : aarch64 : gcc : 6.1.0 : so_rel : build : 0
pine64 : aarch64 : gcc : 6.1.0 : ar_cov : build : 0
pine64 : aarch64 : gcc : 6.1.0 : ar_dbg : build : 0
pine64 : aarch64 : gcc : 6.1.0 : so_prof : build : 0
pine64 : aarch64 : gcc : 6.1.0 : ar_prof : build : 0
pine64 : aarch64 : gcc : 6.1.0 : so_vanilla : build : 0
pine64 : aarch64 : gcc : 6.1.0 : so_tsan : build : 0
pine64 : aarch64 : gcc : 6.1.0 : ar_rel : build : 0
pine64 : aarch64 : gcc : 6.1.0 : ar_tsan : build : 0
pine64 : aarch64 : gcc : 6.3.0 : ar_vanilla : build : 0
pine64 : aarch64 : gcc : 6.3.0 : so_dbg : build : 0
pine64 : aarch64 : gcc : 6.3.0 : so_rel : build : 0
pine64 : aarch64 : gcc : 6.3.0 : ar_cov : build : 0
pine64 : aarch64 : gcc : 6.3.0 : ar_dbg : build : 0
pine64 : aarch64 : gcc : 6.3.0 : so_prof : build : 0
pine64 : aarch64 : gcc : 6.3.0 : ar_prof : build : 0
pine64 : aarch64 : gcc : 6.3.0 : so_vanilla : build : 0
pine64 : aarch64 : gcc : 6.3.0 : so_tsan : build : 0
pine64 : aarch64 : gcc : 6.3.0 : ar_rel : build : 0
pine64 : aarch64 : gcc : 6.3.0 : ar_tsan : build : 0
pine64 : aarch64 : gcc : 4.9.4 : ar_vanilla : build : 0
pine64 : aarch64 : gcc : 4.9.4 : so_dbg : build : 0
pine64 : aarch64 : gcc : 4.9.4 : so_rel : build : 0
pine64 : aarch64 : gcc : 4.9.4 : ar_cov : build : 0
pine64 : aarch64 : gcc : 4.9.4 : ar_dbg : build : 0
pine64 : aarch64 : gcc : 4.9.4 : so_prof : build : 0
pine64 : aarch64 : gcc : 4.9.4 : so_vanilla : build : 0
pine64 : aarch64 : gcc : 4.9.4 : ar_prof : build : 0
pine64 : aarch64 : gcc : 4.9.4 : ar_rel : build : 0
pine64 : aarch64 : gcc : 4.8.0 : ar_vanilla : build : 0
pine64 : aarch64 : gcc : 4.8.0 : so_dbg : build : 0
pine64 : aarch64 : gcc : 4.8.0 : so_rel : build : 0
pine64 : aarch64 : gcc : 4.8.0 : ar_cov : build : 0
pine64 : aarch64 : gcc : 4.8.0 : ar_dbg : build : 0
pine64 : aarch64 : gcc : 4.8.0 : so_vanilla : build : 0
pine64 : aarch64 : gcc : 4.8.0 : ar_rel : build : 0
minnow : i686 : gcc : 5.1.0 : ar_vanilla : build : 0
minnow : i686 : gcc : 5.1.0 : so_dbg : build : 0
minnow : i686 : gcc : 5.1.0 : so_rel : build : 0
minnow : i686 : gcc : 5.1.0 : ar_prof : build : 0
minnow : i686 : gcc : 5.1.0 : ar_cov : build : 0
minnow : i686 : gcc : 5.1.0 : ar_dbg : build : 0
minnow : i686 : gcc : 5.1.0 : so_prof : build : 0
minnow : i686 : gcc : 5.1.0 : so_vanilla : build : 0
minnow : i686 : gcc : 5.1.0 : ar_rel : build : 0
minnow : i686 : gcc : 4.8.5 : ar_vanilla : build : 0
minnow : i686 : gcc : 4.8.5 : so_dbg : build : 0
minnow : i686 : gcc : 4.8.5 : so_rel : build : 0
minnow : i686 : gcc : 4.8.5 : ar_prof : build : 0
minnow : i686 : gcc : 4.8.5 : ar_cov : build : 0
minnow : i686 : gcc : 4.8.5 : ar_dbg : build : 0
minnow : i686 : gcc : 4.8.5 : so_prof : build : 0
minnow : i686 : gcc : 4.8.5 : so_vanilla : build : 0
minnow : i686 : gcc : 4.8.5 : ar_rel : build : 0
minnow : i686 : gcc : 5.4.0 : ar_vanilla : build : 0
minnow : i686 : gcc : 5.4.0 : so_dbg : build : 0
minnow : i686 : gcc : 5.4.0 : so_rel : build : 0
minnow : i686 : gcc : 5.4.0 : ar_prof : build : 0
minnow : i686 : gcc : 5.4.0 : ar_cov : build : 0
minnow : i686 : gcc : 5.4.0 : ar_dbg : build : 0
minnow : i686 : gcc : 5.4.0 : so_prof : build : 0
minnow : i686 : gcc : 5.4.0 : so_vanilla : build : 0
minnow : i686 : gcc : 5.4.0 : ar_rel : build : 0
minnow : i686 : gcc : 4.6.4 : ar_vanilla : build : 512 FAILED
minnow : i686 : gcc : 4.6.4 : so_dbg : build : 512 FAILED
minnow : i686 : gcc : 4.6.4 : so_rel : build : 512 FAILED
minnow : i686 : gcc : 4.6.4 : ar_prof : build : 512 FAILED
minnow : i686 : gcc : 4.6.4 : ar_cov : build : 512 FAILED
minnow : i686 : gcc : 4.6.4 : ar_dbg : build : 512 FAILED
minnow : i686 : gcc : 4.6.4 : so_prof : build : 512 FAILED
minnow : i686 : gcc : 4.6.4 : so_vanilla : build : 512 FAILED
minnow : i686 : gcc : 4.6.4 : ar_rel : build : 512 FAILED
minnow : i686 : gcc : 4.5.4 : ar_vanilla : build : 512 FAILED
minnow : i686 : gcc : 4.5.4 : so_dbg : build : 512 FAILED
minnow : i686 : gcc : 4.5.4 : so_rel : build : 512 FAILED
minnow : i686 : gcc : 4.5.4 : ar_prof : build : 512 FAILED
minnow : i686 : gcc : 4.5.4 : ar_cov : build : 512 FAILED
minnow : i686 : gcc : 4.5.4 : ar_dbg : build : 512 FAILED
minnow : i686 : gcc : 4.5.4 : so_prof : build : 512 FAILED
minnow : i686 : gcc : 4.5.4 : so_vanilla : build : 512 FAILED
minnow : i686 : gcc : 4.5.4 : ar_rel : build : 512 FAILED
ci20 : mipsel : gcc : 5.1.0 : ar_vanilla : build : 512 FAILED
ci20 : mipsel : gcc : 5.1.0 : so_dbg : build : 512 FAILED
ci20 : mipsel : gcc : 5.1.0 : so_rel : build : 512 FAILED
ci20 : mipsel : gcc : 5.1.0 : ar_prof : build : 512 FAILED
ci20 : mipsel : gcc : 5.1.0 : ar_cov : build : 512 FAILED
ci20 : mipsel : gcc : 5.1.0 : ar_dbg : build : 512 FAILED
ci20 : mipsel : gcc : 5.1.0 : so_prof : build : 512 FAILED
ci20 : mipsel : gcc : 5.1.0 : so_vanilla : build : 512 FAILED
ci20 : mipsel : gcc : 5.1.0 : ar_rel : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.5 : ar_vanilla : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.5 : so_dbg : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.5 : so_rel : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.5 : ar_prof : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.5 : ar_cov : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.5 : ar_dbg : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.5 : so_prof : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.5 : so_vanilla : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.5 : ar_rel : build : 512 FAILED
ci20 : mipsel : gcc : 4.6.4 : ar_vanilla : build : 0
ci20 : mipsel : gcc : 4.6.4 : so_dbg : build : 0
ci20 : mipsel : gcc : 4.6.4 : so_rel : build : 0
ci20 : mipsel : gcc : 4.6.4 : ar_prof : build : 0
ci20 : mipsel : gcc : 4.6.4 : ar_cov : build : 0
ci20 : mipsel : gcc : 4.6.4 : ar_dbg : build : 0
ci20 : mipsel : gcc : 4.6.4 : so_prof : build : 0
ci20 : mipsel : gcc : 4.6.4 : so_vanilla : build : 0
ci20 : mipsel : gcc : 4.6.4 : ar_rel : build : 0
ci20 : mipsel : gcc : 4.8.0 : ar_vanilla : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.0 : so_dbg : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.0 : so_rel : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.0 : ar_prof : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.0 : ar_cov : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.0 : ar_dbg : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.0 : so_prof : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.0 : so_vanilla : build : 512 FAILED
ci20 : mipsel : gcc : 4.8.0 : ar_rel : build : 512 FAILED
ci20 : mipsel : gcc : 4.5.4 : ar_vanilla : build : 0
ci20 : mipsel : gcc : 4.5.4 : so_dbg : build : 0
ci20 : mipsel : gcc : 4.5.4 : so_rel : build : 0
ci20 : mipsel : gcc : 4.5.4 : ar_prof : build : 0
ci20 : mipsel : gcc : 4.5.4 : ar_cov : build : 0
ci20 : mipsel : gcc : 4.5.4 : ar_dbg : build : 0
ci20 : mipsel : gcc : 4.5.4 : so_prof : build : 0
ci20 : mipsel : gcc : 4.5.4 : so_vanilla : build : 0
ci20 : mipsel : gcc : 4.5.4 : ar_rel : build : 0
localhost : 154.152909 minutes
raspberrypi : 136.283255 minutes
pine64 : 300.401916 minutes
minnow : 201.507361 minutes
ci20 : 423.456168 minutes
total time = 423.466951 minutes
...done

freelist scaling update #2

The single-threaded freelist with elimination layer is faster than spinlocks because it gets to pop/push with a single exchange, whereas spinlocks have to do two single-word CAS operations.

The best solution so far (for the current implementation) seems to be incrementing the push/pop EA array index on every push/pop, but when a pop succeeds, setting the push EA array index to the index the pop succeeded on (i.e. we know we’ve just successfully popped on a given EA array line, so we can know – for a bit – that the next push will be okay on that index).

freelist scaling update

So, the freelist scaling news is not, alas, as good as it seemed. In my delight I rather – but not fully – jumped the gun.

The work in question is that of the elimination layer. A freelist pop cancels out a freelist push, so if two threads attempting those operations can somehow “meet up”, and exchange their operations, then the freelist head pointer (the point of memory contention, which slows everything to a crawl as processor counts rise) is relieved of those two operations.

I’ve implemented this as an array of cache lines. Rule of thumb is one cache line per thread using the freelist, but I’ve no proof for this being right.

Each cache line stores an array of freelist element pointers. They start NULL.

When a thread pushes, it round-robin picks a cache line in the elimination array, and then iterates through it. If it finds a NULL, it attempts to exchange its freelist element pointer (for the element it wants to push) with the NULL. If it succeeds, well, then it just handed over its element and is free to go.

If it fails, and keeps failing (or the pointers are already non-NULL, i.e. in use), so that it gets to the end of the cache line without handing over its element, *then* it falls back to pushing onto the freelist proper.

Popping is much the same. Round robin to the cache line, iterate, find a non-NULL then try to exchange it with NULL.
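A minimal single-file sketch of those push and pop paths, using GCC’s atomic builtins – all names and sizes here are illustrative assumptions, not the actual liblfds implementation:

```c
#include <stddef.h>

#define NUMBER_OF_LINES 4  /* rule of thumb: one per thread using the freelist */
#define SLOTS_PER_LINE  8  /* pointers in one 64-byte cache line */

struct freelist_element
{
  struct freelist_element *next;
  void *value;
};

/* the elimination array; one row per cache line, starts all-NULL */
struct freelist_element *elimination_array[NUMBER_OF_LINES][SLOTS_PER_LINE];

/* try to hand the element over via the elimination layer; returns 1 on
   success, 0 if the caller must push onto the freelist proper */
int elimination_push( unsigned int line, struct freelist_element *fe )
{
  size_t loop;
  struct freelist_element *expected;

  for( loop = 0 ; loop < SLOTS_PER_LINE ; loop++ )
  {
    expected = NULL;
    if( __atomic_compare_exchange_n( &elimination_array[line][loop], &expected, fe,
                                     0, __ATOMIC_RELAXED, __ATOMIC_RELAXED ) )
      return 1;
  }

  return 0;
}

/* try to grab an element; returns NULL if the line has none,
   in which case the caller pops from the freelist proper */
struct freelist_element *elimination_pop( unsigned int line )
{
  size_t loop;
  struct freelist_element *fe;

  for( loop = 0 ; loop < SLOTS_PER_LINE ; loop++ )
    if( (fe = __atomic_exchange_n( &elimination_array[line][loop], NULL, __ATOMIC_RELAXED )) != NULL )
      return fe;

  return NULL;
}
```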

Now, what happened that got me so excited was basically a fluke – I had four threads running, each in a tight pop/push loop, and each one had ended up (I wasn’t using round-robin yet, but a fixed cache line per thread) popping and pushing from its own, separate cache line. Basically, it was as if each thread had its own freelist. Result – great performance – but it’s totally unrealistic.

I then dug into it, collected cache line hit/miss stats and so on, and once I’d sorted everything out and had the threads doing round robin, with separate round-robin counters for push and pop, we end up on four cores about two to two and a half times faster than without the elimination layer – but it’s not scaling – and in fact we’re only this fast because the benchmark is particularly favoured.

The benchmark is doing a pop and then a push, in a loop. Pop and push have independent counters, both start at zero. Counters are per thread. The elimination layer begins half-full. So we pop – and we get an element. We’ve made a space. We then push. There’s *always* a space there, because we just made a space!

In fact, if we use round-robin but add a hash mixing function to it, we then find about 10% misses, and performance is still better but only by about 25%.
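For reference, the hash mixing function could be something as simple as the following – this particular mixer is my assumption, the post doesn’t say which one was used. It spreads consecutive round-robin counter values across the lines instead of letting every thread walk them in lockstep:

```c
/* mix a round-robin counter into an elimination array line index;
   number_of_lines must be a power of 2 for the mask to be valid */
unsigned int mix_to_line( unsigned int counter, unsigned int number_of_lines )
{
  counter *= 2654435761u;    /* Knuth's multiplicative hash constant */
  counter ^= counter >> 16;  /* fold the high bits into the low bits */
  return counter & (number_of_lines - 1);
}
```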

There are still a few things I do not understand (or there are bugs in my code somewhere – probably the benchmark app). On a single core, *with* the elimination layer we’re as fast as a spin lock, and without it we’re about 20% slower.

On two logical cores on the same physical core, without the hash mixing function, there’s a harmonic such that each thread seems to be getting an elimination layer cache line to itself, although this might just be where one physical core knows what its logical cores are up to and so can optimize the atomic operations. With hash mixing, performance drops to the same as one thread on each of two separate physical cores.

This is making me think somewhat of other ways to achieve the same end. I read about a “scalable stack” recently, which is actually a list (I can’t remember the details offhand – it might be deferred updates to the list, that kind of thing).

In fact you could imagine a btree as a freelist, where popping gets a random leaf node, and pushing pushes to a random leaf node.

A hash mixer on the address of the element being pushed would be enough to take care of balance. The problem would be detaching leaf nodes – the same problem as in lists – not detaching a node which has just had an element attached to it. Harris’ delete bit would deal with this, and there’s no need for garbage collection because it’s a freelist.

Freelist now scaling

Just a bit of experimentation – and lo and behold, it is the random number generation which is killing performance.

Take it out and the freelist is scaling just like the btree.

So – more investigation required, have to see what’s up with the RNG, if I messed it up or it’s just slow.

I actually read something on this matter quite a while ago – about how you can’t use modulus *even in single threads* because it’s so slow it kills you. You have to use shift-left/shift-right and powers of 2 only.
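The usual fix is to size the array to a power of 2 and reduce with a mask rather than `%`, since the modulus compiles to an integer divide. A sketch (the slot count is illustrative):

```c
#define NUMBER_OF_SLOTS 8u  /* must be a power of 2 for the mask to work */

/* reduce a random value to a slot index with a divide - slow */
unsigned int index_by_modulus( unsigned int random_value )
{
  return random_value % NUMBER_OF_SLOTS;
}

/* same range of results, but a single AND instruction */
unsigned int index_by_mask( unsigned int random_value )
{
  return random_value & (NUMBER_OF_SLOTS - 1u);
}
```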

I think I’ve run into the same problem.

Flying to Athens from SXF

I’m in SXF, Schoenfeld airport, Berlin.

It has so far not been a pleasant trip.

I bought budget tickets because the company I’m flying to see paid for them – you don’t spend other people’s money.

Ryanair charge 45 euros to print a boarding pass, so it’s important to get that done yourself; I remembered this yesterday evening, after gym. I don’t have a printer, and the flight is on a Sunday, so I went to the airport via work to print there. It’d be nice if the airport had a printing service, but this (bloody obvious) idea seems to be a bridge too far, even after all these years, in every airport I’ve been in.

Most of the metro route to work is still out of action – it’s down for 45 days and there’s a bus service instead. I actually got the dates mixed up – I thought it was back in action the end of *this* month, but in fact there’s another month to go.

So I got that done – used up some of my buffer time – and then hopped on the metro to get to SXF.

Turns out there’s also a chunk of the S-Bahn line to SXF which is out of action and being replaced by a bus service (in fact, it comes back into action *tomorrow*).

So I read this, swear, because it’s getting a bit late, and I head out to get the bus.

Problem is, where’s the bus stop? I mean, *AT ALL*? There are no signs, no directions, no information except in the station, where it says “you can get the bus from bus stop 140”. No maps – and the bus stops are not marked, so you don’t know which stop is “140” anyway.

I’m standing outside the metro, looking at two bus stops and Lord knows if they have anything to do with the replacement bus service. They look just like normal stops. I’m at the crossing of a couple of major roads, and there’s probably bus stops dotted around all over the place. These are probably just two random stops.

So at this point in fact it is impossible for me to continue my journey. BVG completely fucked it up. “There’s a bus service!” “Great, where!” “We’ve hidden it! you’ll have to find it!!!”

I look at the metro map – I can retrace my metro route, then take a different metro route, which gets me close to the airport, but then I need a bus to finish the trip.

I’m pretty sure there isn’t time for that, not now I’ve spent another 10 minutes trying to figure out what the fuck with the bus service.

So I hail a taxi and get to the airport that way. 35 euros. The useless BVG ticket for the day was another 8 euros.

So now I’m in the airport. I begin to think there’s some kind of systemic weakness in German culture when it comes to signage. I can’t tell how to get from arrivals to departures. I have to *ask* someone.

COME ON.

Arrivals.

Departures.

THE TWO PLACES EVERYONE WILL NEED TO GET TO.

It reminds me of the absence of testing in software. Just writing the code and then never thinking to check if it works. Build an airport, then never check if people can find out where they need to go.

(I later was looking for a bathroom. I found a sign – it was probably installed in the wrong airport – and I followed it. Eventually I found the bathrooms. “Staff only!” “Your bathrooms are up on the first floor!” “good luck finding them!”)

So I head over to security, do the thing, now I’m waiting for flight info.

I’ve been looking around at the shops.

It’s actual extortion, in the true and literal sense of the word, as opposed to me exaggerating.

See, we’re not allowed to bring water in. Then when you come to a shop on the inside where you can buy water, it costs quite literally more than twice as much as the same water outside the airport.

Now, extortion is where you force people to give you money.

See it?

You force people to surrender their water when they come in, and then you control the prices in the airport, and you charge 100% more than outside the airport.

In fact, everything throughout the airport has the same price; there’s no competition, and that price is about 100% more than outside the airport. I presume the airport is controlling the prices, or has issued a monopoly to a retailer here. A bar of Ritter Sport, for example, is 1.05 euros in the supermarket, but it’s 2.30 here. The Burger King here is, I believe, charging about 100% more than outside. I see the same price differential on all the chocolate I know the prices for.

A bunch of these shops have large signs reading “DUTY FREE”.

Extortion with deception. If you’ve got the former going on, the presence of the latter as well should not come as a surprise.

You tell people they’re not paying duty (IS there any duty on goods produced within the EU anyway?) and then you charge them 100% more.

I *hate* this, with a deep, visceral and undying passion.

People are here to make a trip. They will want to drink, eat, relax, get some choccie maybe for the trip, maybe something for a friend.

What’s being done is that the people here are being fucked over, *knowingly, specifically and deliberately*. They are being ripped off – *and the airport is totally down with this*. It has *chosen* this.

That… that choice, that deliberate decision, it’s repellent. It turns my stomach over. *How can you think and behave like that?* Do you also go out and kick dogs? Do you *have* someone in your life who loves you, or whom you love? How do you have that when your mind is like this? Got kids? How do you treat them?

I remember seeing something even more like this back in Newark, in New Jersey. It’s a bad airport – I mean, the security is so physically rough you can call it mistreatment, there’s almost no shops, and little seating. It, along with Gatwick, is on my personal no-fly list. I thought I’d get a plug converter for the flight, just in case they had US style sockets.

There’s *one* electronics shop. They want *thirty-five dollars* for a converter that sells for one dollar fifty on Amazon. Okkkay. I mean, there’s overcharging, and then there’s obscene, gratuitous, deliberate abuse of customers. Do they really think they make more money selling one converter once in a blue moon at an obscene price, than selling dozens at a reasonable price? It’s not just unethical, it’s nasty, stupid, short-sighted and unprofitable. Hey though – if you are nasty and stupid and you’re one of the guys doing all this, it’s no surprise you’re unethical too.

Coming back to SXF, honourable mention must also be given to the almost complete absence of power sockets. I ended up sitting in a stairwell, next to a plug socket there. I had to walk around for 20 minutes to find it.

My view is that airports don’t care about passengers. They deliberately and knowingly fuck them over. I *don’t* want to feel like I’ve just voluntarily gone along with extortion. It’s *not* a good feeling, thanks. There are better experiences to spend my life having, you know, than having you fuck me over. It *does* spoil the trip, yes. So I can buy nothing here. It’s bad for me – my trip is less pleasant, no choccie, no water – and it’s bad for the airport, because they get nothing from me at all, and I will avoid SXF in the future.

This of course brings us back to the lack of competition in airports. This occurs because of the staggeringly profound regulation and State involvement in their construction.

The only safety people like you and me have is in *choice*. Whatever attacks choice attacks us, and makes us helpless, and then we get what we have here today.

Just uncovered a stonking design flaw in the freelist elimination layer.

The idea is to have one cache line for every logical core.

What I was actually doing was having one atomic_isolation per logical core.

On CAS platforms, this is nominally one cache line (except of course Intel are now bringing over two cache lines at once, so even there it’s wrong), but on ARM, where the max ERG is 2048 bytes, instead of having one cache line I had a huge 2 KB.

So what I actually need is one cache line, with atomic_isolation separation.
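In C, with GCC, that might be expressed as follows – the sizes here are illustrative assumptions (on ARM the isolation value would be the ERG, up to 2048 bytes). Each line holds one cache line’s worth of slots, but the struct is aligned – and so padded out and separated – to the atomic isolation size, so adjacent lines can’t interfere with each other’s atomic operations.

```c
#define CACHE_LINE_SIZE_IN_BYTES    64
#define ATOMIC_ISOLATION_IN_BYTES  128  /* e.g. Intel's two-line fetch; up to 2048 (ERG) on ARM */

struct freelist_element;  /* opaque here; the payload lives elsewhere */

/* one cache line of slots per logical core, padded out to atomic_isolation */
struct elimination_line
{
  struct freelist_element *slots[ CACHE_LINE_SIZE_IN_BYTES / sizeof(struct freelist_element *) ];
} __attribute__(( aligned(ATOMIC_ISOLATION_IN_BYTES) ));
```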