Got it

I’ve worked out the SMR bug.

Lying in bed at 23:40, taking what I had learned today about the problem and thinking again about the algorithm itself – and there it was. It’s the entry/exit macros for the lock-free sections. They’re reverting the SMR thread state generation counter. I’ve not tested it to prove it, but I am certain.

Have to think how to modify the design now.

What a marvellous and wonderful relief!


I’ve just compiled everything with clang.

Almost effortless – there’s just one GCC argument I’m using which clang doesn’t support.

I have found, however, that it’s only with the VERY latest clang (2nd Sep 2016 – about a month ago!) that the __atomic* GCC intrinsics are supported. With the version shipped with Debian Jessie, 3.8, I only get __sync*. I note the __GNUC__ version set by clang is 4.1.2, which is exactly right for the __sync* stuff – that’s the version in which it was introduced. The latest clang (with __atomic*) is 3.9.0.
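For the curious, a quick sketch of how the selection can be done in code. My assumption here: the __ATOMIC_* macros are only defined once the __atomic* builtins exist, so their presence can serve as the test – older compilers fall back to __sync*.

```c
#include <assert.h>

/* sketch : select the intrinsic family by what the compiler provides
   assumption : the __ATOMIC_* macros exist if and only if the
   __atomic* builtins do, so older compilers take the __sync* path */

static int atomic_increment( int *target )
{
#if defined( __ATOMIC_RELAXED )
  return __atomic_add_fetch( target, 1, __ATOMIC_RELAXED );
#else
  return __sync_add_and_fetch( target, 1 );
#endif
}
```

Either branch returns the post-increment value, so calling code doesn’t care which family it got.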

I’ve not figured out the SMR bug yet. It’s weird shit. I can read the value before the CAS, get a 1 from the CAS (the CAS happened), read the value immediately after and see it’s correct – and then one or two lines later, when I read it again – IN THE SAME THREAD – it’s reverted to the previous value.

As far as I can tell from logging, no other thread has set the value, and also no other thread according to the logic of the code COULD set the value.

I’ve tried with GCC 6.2.0 and I get the same problem. About to try now with clang…

Ah, one other thing: my final comment in the previous post was wrong. I do not need to move on from GCC 4.1.2 – I simply have to write the compiler abstraction layer such that DWCAS uses inline assembly from 4.1.2 to 4.6.0, __sync* with __int128 from 4.6.0 to 4.7.3, and __atomic* with __int128 from 4.7.3 onwards.
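A sketch of what I mean – the macro names are mine, not the actual porting layer, and I’ve used a 64-bit stand-in type so the sketch compiles and runs anywhere (the real operand is of course __int128, and the bottom tier then needs the inline assembly instead of __sync*):

```c
#include <assert.h>
#include <stdint.h>

/* sketch of the three-tier dispatch : illustrative names, not the real porting layer
   stand-in 64-bit operand so this runs without -mcx16; real operand is __int128 */

#if defined( __GNUC__ )
  #define GCC_VERSION ( __GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__ )
#else
  #define GCC_VERSION 0
#endif

#if GCC_VERSION >= 40703 || defined( __ATOMIC_ACQ_REL )
  /* 4.7.3 and later (and recent clang) : __atomic* */
  #define ABSTRACTION_DWCAS( target, expected, desired )              \
    __atomic_compare_exchange_n( (target), (expected), (desired), 0,  \
                                 __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE )
#else
  /* 4.6.0 to 4.7.2 : __sync* works on the stand-in type here;
     on the real __int128 operand, 4.1.2 to 4.5.x needs the inline assembly tier */
  #define ABSTRACTION_DWCAS( target, expected, desired )              \
    __sync_bool_compare_and_swap( (target), *(expected), (desired) )
#endif
```

Calling code sees one macro, whichever tier the compiler lands in.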

Either way, I have ARM64 support now, for GCC 4.6.0 and later.

Oh, I also looked at downloading and building all the formally released GCC versions from 4.1.2 onwards. It looks pretty straightforward, except I have to see how much the configure options may have changed over time.

So I expect to end up with a script which sets up the build environment – downloading, building and installing every GCC version. Benchmarking across versions will be very interesting!
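Something along these lines – the mirror URL and configure options are guesses on my part, and as I say, the configure options for the older releases are exactly the bit I still need to check. It emits the command plan per version; pipe it through sh to actually build.

```shell
#!/bin/sh
# sketch : per-version GCC download/build/install plan
# mirror URL and configure flags are assumptions - older releases may need different options
# prints the commands; pipe through sh to execute

gcc_build_plan()
{
  v="$1"
  prefix="$HOME/gcc/$v"
  cat <<EOF
wget -c https://ftp.gnu.org/gnu/gcc/gcc-$v/gcc-$v.tar.bz2
tar xf gcc-$v.tar.bz2
mkdir -p build-$v && cd build-$v
../gcc-$v/configure --prefix=$prefix --enable-languages=c --disable-multilib
make -j4 && make install
cd ..
EOF
}

for v in 4.1.2 4.6.0 4.7.3 6.2.0
do
  gcc_build_plan "$v"     # append "| sh" to actually build
done
```

Building out-of-tree (the build-$v directory) is required for older GCC releases anyway, so the plan does it for all of them.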


I’ve been working on SMR.

There’s a bug, and it’s maddening, and I’ve been working on it and it alone for a week now. I mean, every spare hour, including all of Monday (a public holiday).

I’ve finally narrowed it down to this: a CAS sometimes (after a few seconds of 100% CPU on four cores) seemingly returns 1 for success but doesn’t actually change the target. Yes, really.

This seemed to indicate it was my mistake – that I had misunderstood __ATOMIC_RELAXED, and the lack of a compiler barrier might be the cause. However, changing over to __ATOMIC_ACQ_REL didn’t fix the problem.

Now I’m googling for GCC bugs – and I found something unrelated, but which finally answers an ages-old question of mine as to GCC support for 16-byte CAS on x64.

The bug is a docs bug, and the reporter says “the docs don’t say a word about 16 byte CAS being supported”. He’s right, too – they don’t. However, this is only half the problem – the GCC atomic intrinsics work on types, not arrays, so you need a 128-bit type. The GCC docs for this are even worse.

Reading that, I come away thinking I need native support for 128-bit types. This, as I have now discovered, is not the case. x64 has __int128 just fine, and with “-march=native” you can compile a 16-byte GCC atomic CAS just fine.
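To illustrate the call shape – shown at 64 bits purely so the sketch runs without -mcx16/libatomic; instantiate the same call at __int128 and compile with “-march=native” and you get the 16-byte CAS:

```c
#include <assert.h>
#include <stdint.h>

/* the same call, instantiated at __int128 and compiled with -march=native
   (or -mcx16), is the 16-byte CAS; 64-bit here only so the sketch links
   without libatomic */

static int cas( uint64_t *target, uint64_t *expected, uint64_t desired )
{
  /* on failure, *expected is updated to the value actually observed */
  return __atomic_compare_exchange_n( target, expected, desired, 0,
                                      __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE );
}
```

The failure-path update of *expected is handy: a retry loop doesn’t need a separate load before the next attempt.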

Niiiiiiiiiiiiiiiiiiice. So all the arrays can go away – except – it’s GCC 4.6.0 and later only, and what about Microsoft? MS do not have a native 128-bit type.

However, even with keeping the arrays, I can use the GCC atomic CAS for DWCAS on x64 and ARM64 – so I can remove the inline assembly on both platforms (the x64 inline assembly works; the ARM64 version is a first pass which doesn’t work yet, since I don’t have an ARM64 machine to test on).

So I could do that… GCC 4.1.2 was released 13th Feb 2007, 4.6.0 was released March 25, 2011. So it would move the minimum compiler version forward by four years.

Valve and Steam – country change is a one-way process

I was wondering how the Black Mesa mod was getting on – if they’d published Xen yet. I googled, went to their site – no, *but* Black Mesa is now on Steam, 20 euro.

Sounds alright to me. I liked what they did, I love Half-life, I’d buy it.

I go to Steam. It’s selling for 14.99 *GBP*, not Euro.

Ah, I think. They think I’m in the UK. I’ll change it to Germany so I can pay in euros.

I change to Germany. No problem.

Then I think – ah, wait. The German version is censored, right? I google. Yes, it is.

Okay, no problem, also 20 euro is a bit more than 15 GBP, I’ll change it back to the UK.

Here’s the punch-line, folks!


Steam is telling me they will only accept my country as being Germany (presumably based on IP lookup, so if I was still going to purchase, I’d run Steam through Tor with a UK exit node selected).

Important lesson here. One I have already learned with regard to banks and which I now know I need to apply to other organizations – probably all. NEVER TELL ORGANIZATIONS WHERE YOU LIVE. In fact, in general, tell them NOTHING. If you tell a bank you’re in a different country, for example, they will usually go bonkers in some bizarre and totally crazy ways – ways which you could never predict, because they make no sense whatsoever. *Just don’t do it*.

So I have a UK account, I want to buy the UK version, etc, etc, etc – buttttttttttttttttttt no. Steam and Valve know better! I told them ONCE I live in Germany, so it MUST BE TRUE. And now I can only change it back when I physically return to the UK and a nice UK IP address.

Reminds me of Skype. One day, years ago, the money I had in the account disappeared. Only about five quid. I contacted Skype and said: I’ve changed my password, but there’s been fraud – what do you do?

Their reply was that they had locked my account until I changed my password and provided a copy of my passport or ID to prove who I am (*facepalm*), and that since it was fraud, it was my fault, not theirs, and they do not give refunds.

So because *I* said it was fraud, it was absolutely and definitely fraud. And not, say, a bug in their accounting software. Total unmitigated fuckwits. This was of course also roughly around the time they stopped being peer-to-peer and became server-based, so the NSA could record all the calls, so I had what was probably a lucky escape, as I – as you can imagine, given their response – certainly never went to the risk and trouble of sending them a copy of my ID after they needlessly locked the account. In fact, when googling, I read of people from around that time who’d had auto-topup on and had lost sizeable chunks of their current accounts to fraud. I had never given my password to anyone, of course – it was either a brute-force attack which Skype never picked up on, or Skype were hacked.

Back to Valve being fuckwits. I mean, I understand it, I think – I’m sure this is forced on them by Governments. But it means Valve/Steam are now in the category of “major bank”, i.e. major fucking morons due to utterly insane behaviour from totally unexpected causes, so you can’t trust them with anything. Achievement unlocked.

In other news, I’m still waiting for my volume-unlocked iPod shuffle from the USA, to get around the EU regulation which sharply limits maximum volume.

Pertinent aside : I pay about 6000 euros a month, in total, to the German State, to live here. This is part of what I get.

SMR update

I’ve pinned down the SMR bug.

A thread would miss an SMR generation and thereby end up two generations behind, and the code wasn’t coping with this.
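To illustrate the class of bug – this is a heavily simplified sketch, not the actual SMR code; the point is only the difference between assuming a lag of at most one generation and catching up over any lag:

```c
#include <assert.h>

/* heavily simplified sketch - not the actual SMR code
   the buggy version in effect did:
     if( thread_generation + 1 == global_generation ) thread_generation++;
   leaving a thread which missed a generation permanently behind;
   the fix catches up over any number of missed generations */

static unsigned long smr_thread_catch_up( unsigned long thread_generation,
                                          unsigned long global_generation )
{
  while( thread_generation < global_generation )
  {
    /* per-generation cleanup (releasing that generation's pointers) would occur here */
    thread_generation++;
  }

  return thread_generation;
}
```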

Performance hit from SMR is about 50%.

This is appalling – I suspect hazard pointers are about free – but hazard pointers are someone else’s invention and I’d need permission to use them. The SMR I have now is purely my own work.

First unbounded/single/single queue benchmark


[Benchmark chart: liblfds720_queue_uss_enqueue_and_dequeue, Core i5]

Well, maybe disappointing.

The benchmark has two threads (naturally enough) and we see that when they run on separate physical cores we get about 8.8m ops (the producer thread enqueues 8.8m elements, the consumer thread dequeues 8.8m elements).

This is about the same as the unbounded/many/many running two threads on two physical cores (although the benchmark there is a bit different – there’s one queue, and each thread does an enqueue followed by a dequeue, so one op is an enqueue *and* a dequeue – and there we get about 5m ops on one core and about 8m on the other).

What I’m thinking though is that single/single will scale, where many/many does not. Many/many, as the core counts go up, dives into a deep hole in the ground. I think single/single might scale linearly – there will be many queues, inherently, but each one will still manage the 8.8m no matter how many of them are running.

The benchmark needs more enhancement to test this though.

M&S queue API improvement – solves the consequences of the dummy element awkwardnesses

I’ve made a small but important improvement to the unbounded/many/many and unbounded/single/single queue APIs.

The queue elements now have a user_state, which you can get and set.

Those of you (both of you 😉) who have used the unbounded/many/many queue (the single/single is in 7.2.0, which isn’t out yet) will know that the dummy element is a pain in the ass.

When you dequeue, you get a queue element and you get a key/value pair. Problem is, the queue element you’re given is what was the dummy element in the queue; you can reuse it, for sure, but often what you’ll want to do now is put it on a freelist, and for that you need a freelist element. So you may well have defined a struct of your own which holds a queue element and a freelist element – only now you have a problem; you just dequeued, fine, you have a queue element, fine, you have a value, fine, the value is your struct with the freelist element and queue element, fine… only it’s not fine. The struct of yours which you have is NOT the one containing the queue element you’ve been handed by the dequeue call – it’s the one for the *next* element. So you can use the freelist element in there, but now you’re physically separating the freelist element from the queue element you’re putting on the freelist. This is bad for performance – you get a cache-line miss.

Now though, we have a solution. We set the user state *for the queue element* to be the struct it is in. Bingo. We dequeue, we get the value (which is what we need for the queue to be correct) but we ALSO get back the correct freelist/queue struct for the queue element the dequeue function just gave us.
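A sketch of the idea – struct and field names here are illustrative, not the actual 7.2.0 API:

```c
#include <assert.h>
#include <stddef.h>

/* sketch - illustrative names, not the actual 7.2.0 API
   each queue element's user_state points back at the struct containing it,
   so whichever queue element dequeue hands back, we recover the right
   freelist element - in the same cache line as that queue element */

struct queue_element    { void *user_state; void *value; };
struct freelist_element { void *next; };

struct my_element
{
  struct queue_element    qe;
  struct freelist_element fe;
  int                     payload;
};

static void my_element_init( struct my_element *me )
{
  me->qe.user_state = me;   /* set once, at init */
}

/* after dequeue : given whichever queue element we were handed,
   find the struct (and so the freelist element) it lives in */

static struct my_element *owner_of( struct queue_element *qe )
{
  return (struct my_element *) qe->user_state;
}
```

So the value from the dequeue tells you your data, and user_state tells you which struct to put on the freelist – two different structs, and now both correctly identified.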

Problem solved. I need to update the ringbuffer with this too – it’ll improve performance. Shame there’s no benchmark yet for the ringbuffer; I might need to make one now, to see how much difference this makes.