run-time ERG/cache-line alignment

I’m going to go back to run-time alignment.

I think when I first did this ages ago I failed to realise that the ERG size would always be a positive integer multiple of the cache-line size, and so I thought there had to be some math done to figure out the correct atomic isolation size.

Run-time alignment is necessary because of ARM.

When the ERG size can be anywhere from 8 to 2048 bytes, and the code fails completely if the ERG is set too small, you basically have a problem and/or need to be pessimistic – and it’s painful (in wasted memory) to set a large ERG.

Compile-time configuration also means people need to be *aware* of the whole ERG issue, and I really don’t want that. I want them to be able to pick up the data structures like they’re normal and just use them.

Run-time determination isn’t needed on Intel, which has fixed cache line sizes – although, ha – I’m already treating Intel as the pessimistic case, since these days it usually transfers two cache lines at a time (adjacent cache line prefetch), something you can turn off in the BIOS on some machines.
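To make that concrete, below is a minimal sketch of the run-time approach – determine the ERG and the cache line size at start-up, take the larger as the atomic isolation length, and round allocations up to a multiple of it. The function names are illustrative, not liblfds’ actual API.

#include <stddef.h>

/* illustrative only - determine both values at run time, then use the
   larger as the atomic isolation length
*/
static size_t atomic_isolation_in_bytes;

void set_atomic_isolation( size_t erg_in_bytes, size_t cache_line_in_bytes )
{
  atomic_isolation_in_bytes = (erg_in_bytes > cache_line_in_bytes) ? erg_in_bytes : cache_line_in_bytes;
}

/* round a length up to a whole number of atomic isolation units */
size_t round_up_to_atomic_isolation( size_t length_in_bytes )
{
  return ( (length_in_bytes + atomic_isolation_in_bytes - 1) / atomic_isolation_in_bytes ) * atomic_isolation_in_bytes;
}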

Update

Converted all the lfds atomic abstraction macros to my current standard, where you always pass in the thing itself, never a pointer.
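As an illustration of the convention (the macro name here is hypothetical, not one of the actual lfds macros), the caller passes the variable itself and the macro takes its address internally:

#include <stdint.h>

/* hypothetical macro following the convention described above -
   the caller passes the variable itself; the macro takes the address
*/
#define EXAMPLE_ATOMIC_INCREMENT( value )  __sync_add_and_fetch( &(value), 1 )

void example( void )
{
  uint64_t counter = 0;

  EXAMPLE_ATOMIC_INCREMENT( counter );   /* counter itself, not &counter */
}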

Corrected the ERG determination code – need to check it closely now though.

That leaves me with one problem before I can use the build system in anger and get it to build every variant on every GCC on every platform; and that’s setting the ERG size in the header file before building…

…which is a bit problematic seeing as you need to build to run the ERG determination code =-)

Back in the day, I did a lot of work to arrange run-time rather than compile-time support for different atomic isolation lengths. You need, though, to use the larger of the ERG and the cache line size, and I recall backing all that work out because you couldn’t handle that problem in the preprocessor – something to do with needing to find the greatest common divisor; I have no idea now…

I think you can actually issue an instruction to get the ERG length, though. You don’t need to empirically determine it, as I currently do.
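For reference, here’s a minimal sketch of what that would look like on aarch64 – this is an assumption on my part about how it’d be done, not the code I’m currently using. CTR_EL0.ERG is bits [23:20] and holds log2 of the granule size in words (4 bytes); a value of zero means no information, in which case the architectural maximum of 2048 bytes has to be assumed. Whether the read works from user space depends on the kernel/CPU configuration.

#include <stdint.h>
#include <stdio.h>

/* read CTR_EL0 and decode the ERG field (aarch64, GCC inline asm) */
static uint64_t read_erg_in_bytes( void )
{
  uint64_t ctr, erg_field;

  __asm__ __volatile__ ( "mrs %0, ctr_el0" : "=r" (ctr) );

  erg_field = (ctr >> 20) & 0xF;

  if( erg_field == 0 )
    return 2048;                        /* no information - assume the maximum */

  return (uint64_t) 4 << erg_field;     /* field is log2 of size in 4-byte words */
}

int main( void )
{
  printf( "ERG = %llu bytes\n", (unsigned long long) read_erg_in_bytes() );
  return 0;
}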

It would be very good to deal with this at run time, because people writing for ARM are probably writing for phones and they can’t know what core their code will be running on.

It also saves users from having to even know about the issue.

Update

So, it’s taken this whole day, but I have just now successfully written the GCC in-line asm for DWCAS on aarch64. Woot!

This means I am not dependent on libatomic.
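For the curious, a minimal sketch of the kind of thing involved is below – a load-exclusive/store-exclusive pair (LDAXP/STLXP) loop. This is not the actual liblfds abstraction code; the operand convention is my own for the example, and 16-byte alignment of the destination is assumed.

#include <stdint.h>

/* double-word CAS via a load-exclusive/store-exclusive pair loop;
   returns 1 on success, 0 on failure; on failure, expected[] is
   updated with the value actually observed
*/
static int dwcas_aarch64( volatile uint64_t *destination, uint64_t *expected, uint64_t const *desired )
{
  uint64_t observed_low, observed_high;
  uint32_t failed;

  __asm__ __volatile__
  (
    "1:  ldaxp  %[ol], %[oh], [%[dst]]        \n\t"
    "    cmp    %[ol], %[el]                  \n\t"
    "    ccmp   %[oh], %[eh], #0, eq          \n\t"
    "    b.ne   2f                            \n\t"
    "    stlxp  %w[fl], %[dl], %[dh], [%[dst]]\n\t"
    "    cbnz   %w[fl], 1b                    \n\t"
    "    b      3f                            \n\t"
    "2:  clrex                                \n\t"
    "    mov    %w[fl], #1                    \n\t"
    "3:                                       \n\t"
    : [ol] "=&r" (observed_low), [oh] "=&r" (observed_high), [fl] "=&r" (failed)
    : [dst] "r" (destination),
      [el] "r" (expected[0]), [eh] "r" (expected[1]),
      [dl] "r" (desired[0]),  [dh] "r" (desired[1])
    : "cc", "memory"
  );

  expected[0] = observed_low;
  expected[1] = observed_high;

  return failed == 0;
}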

As an aside, I also discovered I’d messed up the empirical ERG determination in the test binary – the concept works, but the mistake in the implementation made the results look wrong.

I finish my current gig in two weeks, so then I’ll have a bunch more time for liblfds.

Just bought an x86 (32-bit) dev board

Only one core, but it’s hyperthreaded.

It’s from about 2014 – the original Minnowboard.

I found a new one on ebay.

The later Minnowboards (they’re up to the third version now) are 64-bit. In fact, the only other 32-bit board I could find was AMD’s Geode, and it’s single-core, single-thread.

Update

So, I’ve been slowly making progress with the build system – actually putting it through its paces.

I’ve learned *so* much, and revealed a number of problems; it’s been an absolute God-send. It shows once again that with software and computers nothing works until you actually *do it* and make it work.

The big thing has been GCC 7.1.0 and the changes to how it supports double-word CAS on 64-bit targets (which is to say, aarch64 and x86_64, the only targets which offer this functionality).

Those changes don’t seem viable for my use, and they’ve led me to implement in-line assembly for double-word CAS on x86_64 and aarch64 in the abstraction layer.

*Thank God I had a build for this compiler and found out about all this before I made a release and users found out about it by it not working*.

So now I get to build every build variant with every compiler version (which I can build) on every platform. I can at least see my software passes tests, runs benchmarks, etc, on my own systems. Of course, they may (well!) indeed then fail on the vast range of other systems out there – but if they failed on mine *even before that*, then they DEFINITELY wouldn’t be working on any of the systems out there!

I can’t wait to see if there are significant performance differences between GCC versions.

GCC 7.1.0 removes -mcx16

Thank God I built the build system and thank God I spent an entire *quarter* figuring out how to build GCC.

Turns out GCC 7.1.0 has effectively removed -mcx16 – it no longer generates double-word CAS inline on x86_64.

It still does on arm, aarch64 and x86 – just not x86_64.

I filed a bug.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

So this means I need to use an abstraction layer with inline assembly for DWCAS on x86_64 for GCC 7.1.0, and also for GCCs earlier than when this switch was introduced (it’s not present in 4.2.4, but the 4.3.x and 4.4.x versions of GCC do not build on x86_64, so I can’t tell exactly when it turned up).

In fact, the build system right now is duplicated because of x86_64 and this switch. All the makefiles are duplicated, one set with the switch and one without. What I will do – which also allows for consistency across builds – is drop -mcx16 and always use the inline assembly.

Honestly, it’s a bit of a risk. Inline assembly in GCC is like sticking your head in a black bag and hitting the keys. It’s impossible to know if you’ve done it right, unless you’re already an assembly programmer. Also, I suspect it will optimize less well than using the intrinsics.

However, with 7.1.0 I have no choice anyway.
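For illustration, here’s a minimal sketch of the cmpxchg16b approach – not the actual liblfds abstraction code, and the operand convention is my own for the example. The destination has to be 16-byte aligned, and cmpxchg16b has to be present on the CPU (the very earliest x86_64 processors lack it).

#include <stdint.h>

/* double-word CAS via lock cmpxchg16b, no -mcx16 or libatomic required;
   returns 1 if the exchange occurred, 0 otherwise; on failure, expected[]
   is updated with the value actually observed; destination must be
   16-byte aligned
*/
static int dwcas_x86_64( volatile uint64_t *destination, uint64_t *expected, uint64_t const *desired )
{
  unsigned char result;

  __asm__ __volatile__
  (
    "lock cmpxchg16b %[dst]\n\t"
    "setz %[res]"
    : [dst] "+m" (*(volatile __int128 *) destination),   /* the 16 bytes being operated on  */
      [res] "=q" (result),
      "+a" (expected[0]),                                /* RDX:RAX = expected value        */
      "+d" (expected[1])
    : "b" (desired[0]),                                  /* RCX:RBX = replacement value     */
      "c" (desired[1])
    : "cc", "memory"
  );

  return result;
}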

huh

Remember I was having network connectivity problems with the PINE64? Five- to ten-second drop-outs?

Someone has just described pretty much the experience I had, but he ascribes it to the Cisco switch (which I also have).

I have in fact already begun to suspect the switch. Often when I issue a build on the PINE, during the file copy phase (getting GCC onto the local store), connectivity to the PINE is lost, and after that I can’t route to the PINE until I reboot the Cisco switch.

Of course, that doesn’t by any means prove it’s the switch – it could be the wireless router, or my own Linux install, or the OS on the PINE.

But given the bug reports I’m googling now, this has definitely gone up the list and I’m now looking for a new switch.

First round build results

“-march=native” only arrived on aarch64 with GCC 6.0.0, so the existing makefiles all fail on the PINE64 (as I’ve yet to build a 6.0.0 or later – I tried building 6.2.0, but the build failed with internal compiler errors and the compiler seg faulting).

(Have to decide what to do about this, because earlier versions of GCC are obviously going to be in use.)

x86_64 GCC 4.2.4 (and presumably earlier) doesn’t understand -mcx16 (16 byte CAS).

arm (ARM32) 4.7.4 and 4.7.3 both fail with an internal GCC error.

So now I need to build a 6.x.x on aarch64. Thankfully, that’s the fastest board of the three – it takes only about four hours to build a GCC.

Update

I’ve spent a long, long, LONG time building – or trying to build – GCC versions, on the platforms available to me.

I’ve learned quite a bit, although mainly that GCC builds are not tested before being released, and the build system is extremely complex, undocumented, buggy *and* depends on a number of other build systems, which are also complex, undocumented and buggy.

Most GCC versions on most platforms do not build. x86_64 does better, and most versions build – I think only 4.3.x and 4.4.x fail to build.

On the Raspberry Pi, the first version which can be built is 4.7.3, because of the floating point options chosen when building the Raspbian glibc.

On PINE64 (64-bit ARM), the earliest version which can be built is 4.8.0, as that introduced aarch64.

On mipsel (MIPS32, little endian variant) building is painfully slow, and highly error prone as the Ci20 dev board freezes up a lot under high load, but I think I can build 4.5 onwards.

For now, I’ve built (or tried to build, see above!) the final point release of each minor release.

Now I’ve just started using these compilers with the build system.

I’ve discovered right away that 4.2.4 (and earlier) on x86_64 do not understand -mcx16 (16 byte CAS) and so the build fails.

The build system builds all variants, builds test and benchmark, runs both, and collects the gnuplots from benchmark.

Tomorrow, I’ll be running them through on every platform, and we’ll see what we get!

About done building GCCs…

Building GCC is kinda an ever-ongoing task, because it’s so slow.

However, on all platforms except ARM, I have a compiler before and after 4.7.3.

On ARM, I can’t build anything earlier than 4.7.4 (painful irony!)

However, apt-get reveals 4.4 (and others) are available. I can use them for test – though not for benchmark, since they’re not built with the same configure as the other GCCs – so I’m covered.

Far as I can tell, no GCC earlier than 4.7.4 can be built on the Raspberry Pi 2.

I’ll keep building GCCs in the meantime, so I can benchmark and check makefiles, but I have enough to be getting on with, to get the next release out.

This brings me back now to getting bugzilla going. Ahhh, back to god-damn hell-smitten HTTP servers. BLEAHGAKKKK.