So, it’s taken this whole day, but I have just now successfully written the GCC inline asm for DWCAS on aarch64. Woot!
This means I am not dependent on libatomic.
As an aside, I also discovered I’d messed up the empirical ERG determination in the test binary - the concept works, but I botched the implementation, so the results came out looking wrong.
I finish my current gig in two weeks, so then I’ll have a bunch more time for liblfds.
Converted all the lfds atomic abstraction macros to my current standard, where you always pass in the thing itself, never a pointer.
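A minimal sketch of what "pass in the thing itself" could look like, using GCC's `__atomic` builtins. The macro names here are made up for illustration and are not the real liblfds macros; the point is only that the caller names the variable and the macro takes its address internally.

```c
#include <stdint.h>

/* Hypothetical macro names, not the real liblfds abstraction layer.
   The caller passes the variable itself; the macro takes the address. */
#define EXAMPLE_ATOMIC_ADD(target, value) \
  __atomic_fetch_add(&(target), (value), __ATOMIC_SEQ_CST)

#define EXAMPLE_ATOMIC_CAS(target, compare, exchange) \
  __atomic_compare_exchange_n(&(target), &(compare), (exchange), 0, \
                              __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)

uint64_t example_atomic_macros_demo(void)
{
  uint64_t counter = 5;
  uint64_t expected = 6;

  EXAMPLE_ATOMIC_ADD(counter, 1);      /* counter goes from 5 to 6 */

  /* counter == expected, so the CAS succeeds and stores 10 */
  if (EXAMPLE_ATOMIC_CAS(counter, expected, 10))
    return counter;

  return 0;
}
```

The call sites read `EXAMPLE_ATOMIC_ADD(counter, 1)` rather than `EXAMPLE_ATOMIC_ADD(&counter, 1)`, which is the convention described above.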
Corrected the ERG determination code - need to check it closely now though.
That leaves me with one problem before I can use the build system in anger and get it to build every variant on every GCC on every platform; and that’s setting the ERG size in the header file before building…
…which is a bit problematic seeing as you need to build to run the ERG determination code =-)
Back in the day, I did a lot of work to arrange run-time rather than compile-time support for different atomic isolation lengths. You need to use the larger of the ERG and the cache line size, though, and I recall backing all that work out because you couldn’t handle that problem in the preprocessor - something to do with needing to find the greatest common divisor, I have no idea now…
I think you can actually issue an instruction to get the ERG length, though. You don’t need to empirically determine it, as I currently do.
It would be very good to deal with this at run time, because people writing for ARM are probably writing for phones and they can’t know what core their code will be running on.
It also saves users having to even know about the issue.
I’m going to go back to run-time alignment.
I think when I first did this ages ago I failed to realise that the ERG size would always be a positive integer multiple of the cache-line size, and so I thought there had to be some math done to figure out the correct atomic isolation size.
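As a rough sketch of why no extra math is needed: if both sizes are powers of two (my assumption here, not something the preprocessor or the hardware is being asked to prove), the larger one is automatically a multiple of the smaller, so the atomic isolation size is just the larger of the two. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns non-zero if n is a power of two. */
static int is_power_of_two(size_t n)
{
  return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical helper: picks the atomic isolation size in bytes.
   Assumes both inputs are powers of two, in which case the larger
   is always a whole multiple of the smaller and no GCD is needed. */
size_t atomic_isolation_size(size_t cache_line_bytes, size_t erg_bytes)
{
  assert(is_power_of_two(cache_line_bytes));
  assert(is_power_of_two(erg_bytes));

  return erg_bytes > cache_line_bytes ? erg_bytes : cache_line_bytes;
}
```

For example, a 64 byte cache line with a 2048 byte ERG gives 2048, and a 64 byte cache line with an 8 byte ERG gives 64.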
Run-time alignment is necessary because of ARM.
Where the ERG size can be anything from 8 to 2048 bytes, and the code fails completely if the ERG is set too small, you basically have a problem and/or need to be pessimistic, and it’s painful to set a large ERG.
Setting it at compile time also means people need to be aware of the whole ERG issue, and I really don’t want that. I want them to be able to pick up the data structures like they’re normal and just use them.
Run-time isn’t needed on Intel, which has fixed cache line sizes - although, ha - I’m already treating them as a pessimistic case, since Intel these days usually transfers two cache lines at a time, something you can turn off in the BIOS on some machines.
I’m not doing run-time cache-line alignment after all.
As I started to get back to it, I remembered why I hadn’t done it in the first place - the instruction to get ERG length on ARM is privileged. You can’t run it from user-mode.
So I have my code which can try empirically to determine the ERG length, but I have no code to figure out cache-line length. Maybe I could try to think of some, but… all this empirical stuff sucks.
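One non-empirical option, at least on Linux with glibc, is to ask the C library. This is only a sketch under that assumption: `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension rather than POSIX, it may not exist on other C libraries, and on some systems it reports 0 because the information is not available. The function name is made up.

```c
#include <unistd.h>

/* Hypothetical helper: returns the L1 data cache line size in bytes
   as reported by the C library, or -1 if it cannot be determined.
   Relies on a glibc-specific sysconf name, so it is guarded. */
long l1d_cache_line_bytes(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
  long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

  /* glibc can return 0 or -1 when the value is unknown */
  return n > 0 ? n : -1;
#else
  return -1;
#endif
}
```

This does not help on other platforms, and a -1 still leaves you needing a pessimistic default, so it is a partial answer at best.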
In other news, the x86 dev board I ordered arrived and I have successfully installed Debian and configured the board. It is now building GCC 4.8.5!
Oddly, the cache line length is marked as 64 bytes, rather than 32. My laptop is marked as 64 bytes, which is what I’d expect for an x86_64.
In other other news, this web server is likely to have a fresh OS install which means I’ll need to rebuild everything. That may happen over this weekend, and by the end of it I damn well intend to have Bugzilla again at last.
Just found this…
We would have to extend our notion of “CPU architecture” for that to make sense. For example, Pentium Pro / II CPUs had cache line size of 32 bytes, Intel Netburst CPUs (all Pentium-4 and Xeons of the time) have / had 128 bytes, while Pentium-III, Pentium-M and later Core CPUs have 64 bytes. They are all I686_CPU in our view.
I assumed 32 bit Intel was 32 byte cache line, 64 bit Intel was 64 byte cache line.
See? getting that Minnowboard has already uncovered something critical.
Friday before last I asked for the VM image on the server to be replaced by a fresh Debian 8 Jessie image. It was on Wheezy and the distro-update hadn’t worked properly.
This was done, and the new root password emailed to me (I know - I’ve told them about this already, a year ago; but apart from that they’re still the best Swiss provider wholly within Switzerland).
Problem was that the email address they sent that to…
…was on the server which was wiped by the VM image replacement.
So I got the password on Monday.
I then had my final three days in my (now previous) job, which were busy - and I continue to be busy with that, in fact.
My most pressing concern was to get the mediawiki back up. This took a day and a half, because the instructions in the Apache 2.4 docs for using FastCGI with php5-fpm simply do not work and it took up a lot of time finding that out. I found a different way to configure in the end.
The nice thing is that, since I have in the end moved back to Apache and slapped the seemingly-crazy mpm_event config into shape (I do not understand all this “we can spin up more servers” stuff - if your machine lacks the resources for high load, it won’t be able to do this anyway), I can now install Bugzilla and Mailman, neither of which could be done with nginx (no CGI support - in fact Apache has the obvious solution to handle the problem: have a tiny extra server which you issue CGI requests to, and it spawns itself).
So I’m now installing Bugzilla and Mailman.
It takes 37 hours to build GCC 7.1.0 on a Raspberry Pi :-)
GCC only supports profiling on aarch64 from version 4.9.0 onwards :-)
So, bugger me, been off work for a week and a half and only now finally had time to think a bit about something which has been on my mind for agggges - using offsets rather than absolute pointers.
The idea is to make it possible to share data structure instances across processes without having to have the same virtual address base, and between the kernel and user-mode.
So, having thought it over, there’s an obvious limitation; we can only use entities (structures, for state and elements) which are in the same single contiguous memory block.
Consider; say you allocate one block of memory, for a freelist state and a hundred freelist elements.
You share this between two processes. They each now have a different virtual address (the first is at say 10,000, the second at 10 million), but the whole block is contiguous from that address.
The freelist internally uses offsets; so both processes can push and pop from the freelist and everything is fine.
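A minimal sketch of the idea, with made-up names. It is single-threaded for brevity (a real lock-free version would need an atomic CAS with ABA protection on the top offset), and it fakes the second process by copying the block to a different address with `memcpy`, which is only a stand-in for a second mapping of shared memory. Every link is an offset in bytes from the start of the block, so nothing in the block depends on where it is mapped.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset 0 is the state itself, so it can never be an element: use it as "none". */
#define OFFSET_NULL ((uint64_t) 0)

struct offset_freelist_state   { uint64_t top_offset; };
struct offset_freelist_element { uint64_t next_offset; uint64_t value; };

/* One contiguous block: the state followed by all of its elements. */
struct example_block
{
  struct offset_freelist_state   state;
  struct offset_freelist_element elements[3];
};

/* Turn an offset from the block base into a pointer valid in this mapping. */
static struct offset_freelist_element *element_at(void *base, uint64_t offset)
{
  return (struct offset_freelist_element *) ((unsigned char *) base + offset);
}

void offset_freelist_init(void *base)
{
  ((struct offset_freelist_state *) base)->top_offset = OFFSET_NULL;
}

void offset_freelist_push(void *base, uint64_t element_offset)
{
  struct offset_freelist_state *fs = base;

  element_at(base, element_offset)->next_offset = fs->top_offset;
  fs->top_offset = element_offset;
}

/* Returns the offset of the popped element, or OFFSET_NULL if empty. */
uint64_t offset_freelist_pop(void *base)
{
  struct offset_freelist_state *fs = base;
  uint64_t popped = fs->top_offset;

  if (popped != OFFSET_NULL)
    fs->top_offset = element_at(base, popped)->next_offset;

  return popped;
}

uint64_t offset_freelist_demo(void)
{
  struct example_block a, b;
  size_t i;

  offset_freelist_init(&a);

  for (i = 0; i < 3; i++)
  {
    a.elements[i].value = 100 + i;
    offset_freelist_push(&a, offsetof(struct example_block, elements)
                             + i * sizeof(struct offset_freelist_element));
  }

  /* Same bytes at a different address, standing in for a second process. */
  memcpy(&b, &a, sizeof a);

  /* The last element pushed (value 102) pops first, even at the new address. */
  return element_at(&b, offset_freelist_pop(&b))->value;
}
```

The list is built at one address and popped at another, and the offsets still resolve correctly, which is the property that makes sharing across processes work.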
Now let’s say the freelist elements each contain a btree element.
We then share the btree state between the processes.
Now we pop a freelist element and we want to insert it into the btree.
The problem is the btree is also using offsets; but the offset from the btree state to the freelist element is different in each process, because the freelists have different virtual memory addresses. (Also, the C spec only defines pointer subtraction, which is what gives you a ptrdiff_t, for pointers into the same array object - in part most likely because of situations like this!)
So we simply cannot use anything other than those things which are in the same contiguous block of memory.
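To put numbers on it, here is a small sketch using plain integers as pretend virtual addresses. All the addresses are invented for illustration: the freelist block sits at 10,000 in one process and 10,000,000 in the other, the btree state sits at 50,000 and 20,000,000, and the element is 256 bytes into the freelist block.

```c
#include <stdint.h>

/* Signed distance in bytes from one pretend address to another. */
static int64_t offset_between(uint64_t from_address, uint64_t to_address)
{
  return (int64_t) to_address - (int64_t) from_address;
}

/* Offset from the freelist block base to an element inside that block:
   the same in both processes, because both addresses move together. */
int same_block_offsets_agree(void)
{
  int64_t in_process_a = offset_between(10000, 10000 + 256);
  int64_t in_process_b = offset_between(10000000, 10000000 + 256);

  return in_process_a == in_process_b;   /* 256 == 256 */
}

/* Offset from a btree state in a separate block to that same element:
   the two blocks are mapped independently, so the distance differs. */
int cross_block_offsets_agree(void)
{
  int64_t in_process_a = offset_between(50000, 10000 + 256);        /* -39744 */
  int64_t in_process_b = offset_between(20000000, 10000000 + 256);  /* -9999744 */

  return in_process_a == in_process_b;
}
```

The first function returns 1 and the second returns 0, which is the whole problem in miniature.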
So if we had one block of memory which contained the btree state, the freelist state and the freelist elements, then we’d be fine - so it is still useful.
But one thing we do lose is the ability when a freelist is empty to do a malloc and throw some more elements on there - because they will be in a different, non-contiguous block.
One thing to also consider is the void pointer of user data. What this is set to is entirely up to the user, but the user faces the same problems as the data structure, so they will also need to use values which make sense across processes and/or kernel/user-mode.
So I’m working now on an experimental freelist with offsets.
Pretty cool if it works - you can take an instance of a data structure, and use it concurrently over multiple kernel threads at the same time as you use it with multiple threads in multiple different processes.
So much for kernel/user-mode isolation… =-)
So, I’ve just tried logging into both the forum and the mailing list, as admin, and neither work.
I have no clue why, and they were working last time I used them, and everything else seems to be fine, and there is as far as I can tell no logging.
I’ll fix it soon, but probably I will ditch both of them (Dadamail was likely to go anyway). That means going back to EsoTalk for the forum, and maybe a non-GUI install of mailman for the mailing list.
Ugh. Jesus. Forums and mailing lists and open source software. Ugh.
Just uncovered a stonking design flaw in the freelist elimination layer.
The idea is to have one cache line for every logical core.
What I was actually doing was having one atomic_isolation per logical core.
On CAS platforms, this is nominally one cache line (except of course Intel are now bringing over two cache lines at once, so even there it’s wrong) but on ARM, where the max ERG is 2048, instead of having one cache line I had a huge 2 KB.
So what I actually need is one cache line, with atomic_isolation separation.
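One possible reading of that, as a sketch with invented names: each logical core gets one cache line's worth of pointer slots, and the per-core lines are spaced atomic_isolation bytes apart. The slot count then depends on the cache line size and only the spacing depends on the ERG.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of an elimination array layout. */
struct elimination_layout
{
  size_t slots_per_core;  /* pointer slots each logical core actually uses */
  size_t stride_bytes;    /* distance from one core's line to the next */
  size_t total_bytes;     /* size of the whole array */
};

struct elimination_layout elimination_layout_for(size_t logical_cores,
                                                 size_t cache_line_bytes,
                                                 size_t atomic_isolation_bytes)
{
  struct elimination_layout layout;

  /* The isolation size is the larger of cache line and ERG, so never smaller. */
  assert(atomic_isolation_bytes >= cache_line_bytes);

  /* One cache line of slots per core, not one atomic_isolation of slots. */
  layout.slots_per_core = cache_line_bytes / sizeof(void *);

  /* Lines are kept atomic_isolation apart so cores do not disturb each other. */
  layout.stride_bytes = atomic_isolation_bytes;
  layout.total_bytes = logical_cores * atomic_isolation_bytes;

  return layout;
}
```

With a 64 byte cache line and a 2048 byte ERG on a 64-bit platform, this gives 8 slots per core rather than the 256 you get from filling the whole 2 KB.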