Update

Made the test and benchmark library compile.

Both liblfds and libtest_and_benchmark (libtab for short) contain subsets of libstds (the library of single-threaded data structures).

Originally and wrongly I was expanding the subset in liblfds, where those data structures were being used by libtab. Now I have only the data structures needed by liblfds in liblfds, and only those needed by libtab in libtab.

I now need to move libtab fully away from using liblfds to using libstds.

It was a blunder to have used liblfds, because liblfds provides data structures to the extent you have atomic support, which means you might not have a list, for example – but libtab uses the list everywhere.

Actually maintaining this portability behaviour in the code is a lot of work. If I just assumed x64-level atomics, the portability code would go away. In a sense it matters, because the portability code right now is untested. I do not – and I will need to – build variants which pretend to have less support. With software, if it’s not tested, it doesn’t work.

Work getting done

Tzo!

Coded all day.

Have the new test and benchmark app to the point it compiles.

Still need to do some key work, but it’s an important step.

Importantly, I realised I’d made a huge blunder all along in test and benchmark – both use liblfds data structures, the list in particular.

I can’t do that, because liblfds is designed to offer data structures to the extent your system offers atomics; so you might not *have* the list.

In fact, the test and benchmark code needs to use single threaded data structures throughout.

This means I need to put some of the single-threaded data structure (stds) library data structures in the test and benchmark library.

I also need to introduce versioning on the stds code in liblfds, so multiple releases can be compiled in the same project.

I finish my current contract work on Tuesday, and I’ll be taking a few years off, so the next release will come reasonably soon – a few months tops.

The road to hell is paved with affinity APIs

I’m working away on the new test and benchmark application.

I need to support creating processes, to test position-independent data structures.

That means I need to pin processes to particular logical cores.

Know what?

That’s what’s written on the sign that points the way into hell.

Let me put this bluntly : Windows has no API to set process affinity beyond the first processor group, which has a maximum of 64 logical cores.

You read that right.

So if you have say 128 cores, and let’s say Windows has split these up into two 64 core groups – you can only set process affinity to be on cores 0 to 63.

You *can* set *thread* affinity to be on any core – but this is *not the same* as process affinity, and is less performant – but it looks like this is the best you can do.

It’s problematic to do this remotely (from another process). To do so you’d need to call CreateRemoteThreadEx(). In my case, I’m spawning new processes and I want them to quit when the benchmark work is done, so I need to co-ordinate between the main thread (which begins when the process is spawned) and the thread created by CreateRemoteThreadEx(), which will be created at some point after the main thread… it’s hard to wait on things in the main thread which haven’t yet been created. I could busy wait on a global variable…. but this is stomach-twistingly bad. I don’t *want* to write code like this.

You can set thread affinity from within the process itself by calling SetThreadGroupAffinity(). Obviously to use this you have to pass in information about which logical core in which processor group. I’m passing in some information already to the child process, through the command line (shared memory name and length in bytes), so I’ll have to add this.
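A minimal sketch of the child-side argument parsing this implies, in portable C. The argument order, the `child_args` struct and the function names are all hypothetical, chosen only for illustration. The limits come from Windows itself: a processor group holds at most 64 logical cores, and the `Group` field of `GROUP_AFFINITY` is 16 bits wide.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical command line layout (an assumption, not the real libtab one):
   argv[1] = shared memory name
   argv[2] = shared memory length in bytes
   argv[3] = processor group
   argv[4] = logical core within that group (0 to 63) */
struct child_args {
    const char *shm_name;
    unsigned long long shm_length_in_bytes;
    unsigned int processor_group;
    unsigned int logical_core;
};

/* Parse an unsigned decimal number; reject empty strings, signs,
   trailing junk, overflow and values above max. */
static int parse_number(const char *text, unsigned long long max,
                        unsigned long long *out)
{
    char *end;
    unsigned long long value;

    if (text[0] < '0' || text[0] > '9')
        return -1;

    errno = 0;
    value = strtoull(text, &end, 10);
    if (errno != 0 || *end != '\0' || value > max)
        return -1;

    *out = value;
    return 0;
}

/* Returns 0 on success, -1 if the command line is malformed. */
int parse_child_args(int argc, char **argv, struct child_args *out)
{
    unsigned long long length, group, core;

    assert(argv != NULL && out != NULL);

    if (argc != 5)
        return -1;
    if (parse_number(argv[2], ULLONG_MAX, &length) != 0)
        return -1;
    if (parse_number(argv[3], 0xFFFFULL, &group) != 0)  /* Group is 16 bits */
        return -1;
    if (parse_number(argv[4], 63ULL, &core) != 0)       /* 64 cores per group */
        return -1;

    out->shm_name = argv[1];
    out->shm_length_in_bytes = length;
    out->processor_group = (unsigned int)group;
    out->logical_core = (unsigned int)core;
    return 0;
}
```

On Windows the child would then fill in a `GROUP_AFFINITY` with `Group` set to the parsed group and `Mask` set to a single bit at the parsed core, and pass it to SetThreadGroupAffinity() for its own thread.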

It’s still not what I actually want. I want to set process affinity, from the parent process.

Windows thread/process affinity APIs are Civil Service quality – and I don’t mean the British Civil Service. I mean the *Egyptian* Civil Service.

Next step, finding out how bad it is under Linux. It’ll be bad, but it won’t be as bad, even if it’s just by not having processor groups, which are the worst single concept I’ve encountered since MS-DOS was designed with a 640kb RAM limit.

Shared memory and NUMA

Windows always does things differently to Linux, and this is almost always a problem, because Linux gets them right.

NUMA is the one exception I know of. Linux got it wrong, and Windows did it differently, and Windows did it right.

Linux has a system-wide policy which controls NUMA, and this is applied whenever a page is paged back in after being paged out. The upshot is you’ll get the system-wide NUMA policy, unless you pin your pages into memory so they can’t be paged. You yourself in your application cannot control your NUMA behaviour. It’s controlled in the OS.

Windows does what you’d think would be done : when you make an allocation, you specify the NUMA node, and the OS tries as hard as it can to keep those pages in that node.

So this was all good and fine and clear until this week when I realised something.

I’ve been working on the test application for the position-independent data structures. They are intended for use with shared memory, where the shared memory has different virtual addresses in the different processes; the data structures internally are using offsets rather than proper virtual memory addresses.
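To make the offsets idea concrete, here is a deliberately simple, single-threaded sketch (so not lock-free, and not liblfds code). Every name in it is hypothetical. Links are stored as byte offsets from the start of the block, so the same bytes work wherever the block happens to be mapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset 0 is where the list header lives, so it can never be a valid
   element offset and doubles as the "no element" marker. */
#define PI_NULL_OFFSET 0

struct pi_list {
    uint64_t head_offset;   /* offset of first element, or PI_NULL_OFFSET */
    uint64_t used_bytes;    /* bump allocator: next free byte in the block */
};

struct pi_element {
    uint64_t next_offset;   /* an offset, never a virtual address */
    int value;
};

/* Turn an offset into a pointer, given where THIS process mapped the block. */
static struct pi_element *pi_element_at(void *base, uint64_t offset)
{
    return (struct pi_element *)((unsigned char *)base + offset);
}

void pi_list_init(void *base)
{
    struct pi_list *list = base;

    assert(base != NULL);
    list->head_offset = PI_NULL_OFFSET;
    list->used_bytes = sizeof(struct pi_list);
}

/* Push a value on the front of the list; returns -1 if the block is full. */
int pi_list_push(void *base, size_t block_bytes, int value)
{
    struct pi_list *list = base;
    struct pi_element *element;
    uint64_t offset = list->used_bytes;

    if (offset + sizeof(struct pi_element) > block_bytes)
        return -1;

    element = pi_element_at(base, offset);
    element->value = value;
    element->next_offset = list->head_offset;
    list->head_offset = offset;
    list->used_bytes = offset + sizeof(struct pi_element);
    return 0;
}

/* Walk the list using only the base address and the stored offsets. */
int pi_list_sum(void *base)
{
    struct pi_list *list = base;
    uint64_t offset;
    int sum = 0;

    for (offset = list->head_offset; offset != PI_NULL_OFFSET;
         offset = pi_element_at(base, offset)->next_offset)
        sum += pi_element_at(base, offset)->value;

    return sum;
}
```

Copying the whole block to a different address with memcpy() and walking the copy gives the same result, which is a cheap stand-in for two processes mapping the same shared memory at different virtual addresses.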

The new test application actually combines the test and the benchmark applications.

With the benchmarks, you want to be NUMA aware and take advantage of it. That means you need to pass in to the benchmark library a block of memory in each NUMA node, so it can put data in the correct places.

Now we see the problem – with shared memory, the data structure state, and all its elements, must be in the same single allocation.

How can you have one allocation per NUMA node *and* shared memory? Because that means you have multiple allocations.

Suddenly Linux looks like it’s doing the right thing. Say you select striped allocations – your pages in any given allocation are striped over NUMA nodes. Okay, it’s not what you really want in your app – you want more fine grained control – but at least you’re able to *do* something meaningful with NUMA *within a single allocation*.

On Windows, where an allocation specifies its NUMA node, you just can’t do this.

You could in theory actually still make things work. In the data structure state, you’d have an array, which shows the address ranges for each allocation, and when you get hold of an offset (by popping an element from a freelist, say) you can then figure out *which* address range it is in, and so know the start of that range, and so figure out the actual virtual address represented by that offset.

Here though obviously you’re needing to do an array scan per freelist access, which is really not what you want.
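A rough sketch of that lookup, to show where the cost comes from. The `numa_range` struct and function name are hypothetical. One assumption of mine: the base addresses are kept in a per-process table rather than in the shared state, since each process maps the allocations at its own addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per allocation (for example, one per NUMA node). Together the
   entries carve a single logical offset space into address ranges. */
struct numa_range {
    uint64_t first_offset;     /* first offset covered by this allocation */
    uint64_t length_in_bytes;  /* size of this allocation */
    unsigned char *base;       /* where THIS process mapped it */
};

/* Linear scan to find the range holding the offset, then rebase.
   Returns NULL if no range covers the offset. */
void *offset_to_pointer(const struct numa_range *ranges, size_t count,
                        uint64_t offset)
{
    size_t i;

    assert(ranges != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (offset >= ranges[i].first_offset &&
            offset - ranges[i].first_offset < ranges[i].length_in_bytes)
            return ranges[i].base + (offset - ranges[i].first_offset);
    }

    return NULL;
}
```

Every pop or push would pay for this loop. With a handful of NUMA nodes it is short, but it sits on the hot path of each operation.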

Ironically, it’s on Windows where the position independent stuff really matters, because there are no locking primitives on Windows which are cross-process.

Moving to GitLab

GitHub has been bought by Microsoft.

I will be moving to GitLab.

I am looking to move in such a way that the only change to end-users is that the domain name changes.

Will see how that works out when I make the move.

Shock, amazement, actual work being done

I’ve been working on rewriting the test programme to handle processes, for the position-independent data structures.

Long story short, I’ve taken some of the existing code from test and benchmark, and started again : I’m actually now back to a single library, which is both test and benchmark, with a command line convenience wrapper as before (the library has to be there for people running on embedded systems – they don’t have a command line).

I originally wanted a porting abstraction layer library, but it can’t be done, because it’s just too messy to abstract away getting processor topology. To do that you need a topology library, to reduce complexity, and to do that, the porting abstraction layer library has to include the main, non-porting library, which makes no sense : a porting abstraction layer should be at the very bottom, independent of everything else. You just can’t emit processor topology info in a clean way though – to *do* this you need a topology library.

Previously, in test, the test app simply ran one thread on every logical core. One really nice aspect of integrating test and benchmark is that the orthogonal logical processor sets can now be used also to run test, and that if only simple processor info is available (the user implements just say one system node, and then one logical core for each logical core), the benchmark app can still run on that topology.

It’s also much easier for the user to compile, and much easier to document.

I originally wanted to have test and benchmark run on threads in one process, then in one thread per process with many processes, then many threads in many processes.

Many threads in many processes turned out to be tricky – it’s not obvious what logical processor sets to compose.

So I backed out of that and now, using the normal logical processor sets, either run them as threads, or as processes.

I then ran into a nasty, messy problem, of starting up processes.

In Linux you fork and it’s great.

In Windows, Jesus, all you can do is call an external binary. It’s horrific. The only way to communicate with it, without needing another bunch of abstractions (for pipes and so on) is passing it a command line!

I spent today finally getting a passable solution to this, with some abstraction for processes, process sets, command line arguments – oh and command lines are a complete PITA under Windows. In Linux, you pass in an array of pointers to strings. In Windows, you actually have to form up a single long string!
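A minimal sketch of the Windows side of that difference: turning an argv-style array into the single string CreateProcess wants. The function name is hypothetical, and the quoting is deliberately conservative. Each argument is wrapped in double quotes, and any argument containing a double quote or ending in a backslash is rejected rather than escaped, because the full escaping rules are fiddly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Join argv into one space-separated, double-quoted command line.
   Returns 0 on success, -1 if the buffer is too small or an argument
   needs escaping this sketch does not implement. */
int join_command_line(int argc, char *const argv[], char *out, size_t out_size)
{
    size_t used = 0;
    int i;

    assert(argv != NULL && out != NULL);

    for (i = 0; i < argc; i++) {
        size_t length = strlen(argv[i]);
        /* separator (if any) + opening quote + text + closing quote */
        size_t needed = (i > 0 ? 1 : 0) + 1 + length + 1;

        /* An embedded quote, or a trailing backslash (which would escape
           our closing quote), needs real escaping: refuse instead. */
        if (strchr(argv[i], '"') != NULL)
            return -1;
        if (length > 0 && argv[i][length - 1] == '\\')
            return -1;

        /* keep one byte spare for the terminating NUL */
        if (used + needed + 1 > out_size)
            return -1;

        if (i > 0)
            out[used++] = ' ';
        out[used++] = '"';
        memcpy(out + used, argv[i], length);
        used += length;
        out[used++] = '"';
    }

    if (used + 1 > out_size)
        return -1;
    out[used] = '\0';
    return 0;
}
```

Under Linux the same array can be handed over as-is, which is the whole complaint.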

It’s like being in the dark ages.

So, making some progress at last.

I finish my current contract job in nine weeks, at which point I’ll be full time on liblfds until the next release is out (well, barring some time catching up with friends, which will take a week or two).

Apologies for web-site disruption

I noticed Apache was mis-configured and was serving liblfds pages from other virtual domains on the server.

I fixed it.

This broke my configuration. (I hadn’t really fixed it.)

Apache is hard to configure because the docs are all over the place and there are a dozen ways of doing the same thing and there are plenty of strange behaviours built into the server.

I’ve backed out HTTPS for now, to get things working till I have time to sort them out properly.

(As an aside, WordPress is a bit crap. If the site URL changes, you can only fix it by editing the MySQL database directly – it’s not actually config in a user-editable text file…)

Update

So, been working on the new test application.

I tried to just write it, but it’s too complex; I should have – and now have – composed a state machine.

I’m now implementing the state machine.
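For a flavour of what this looks like in C, here is a tiny sketch. The states and events are entirely made up for illustration and are not the real test application's; the point is only that every transition lives in one function.

```c
#include <assert.h>

/* Hypothetical states for a test run (assumptions, not the real design). */
enum test_state {
    STATE_INIT,
    STATE_SPAWNING,
    STATE_WAITING_FOR_READY,
    STATE_RUNNING,
    STATE_COLLECTING,
    STATE_DONE,
    STATE_FAILED
};

enum test_event {
    EVENT_OK,
    EVENT_ERROR
};

/* The whole transition table in one place. DONE and FAILED are terminal;
   an error in any other state leads to FAILED. */
enum test_state test_next_state(enum test_state state, enum test_event event)
{
    assert(state >= STATE_INIT && state <= STATE_FAILED);

    if (state == STATE_DONE || state == STATE_FAILED)
        return state;
    if (event == EVENT_ERROR)
        return STATE_FAILED;

    switch (state) {
    case STATE_INIT:              return STATE_SPAWNING;
    case STATE_SPAWNING:          return STATE_WAITING_FOR_READY;
    case STATE_WAITING_FOR_READY: return STATE_RUNNING;
    case STATE_RUNNING:           return STATE_COLLECTING;
    case STATE_COLLECTING:        return STATE_DONE;
    default:                      return STATE_FAILED;
    }
}
```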

Honestly speaking, I only feel like I’ve done a serious piece of work, which I’m happy for other people to see and judge me by, when I’m using a state machine.

Non-state machine code, unless it’s trivially small, is not serious work.

Shock horror an actual post about liblfds.

I’ve been working on the test application.

With the addition of position independent data structure variants, I need to be able to spawn processes and use shared memory, for testing.

I have a number of platforms to think about, to form an abstraction layer over;

1. Windows
2. Linux
3. Android
4. Embedded

There’s also kernel mode to think about, but kernels don’t have processes as such, and so they don’t have shared memory as such. I do in principle want to test user-mode and kernel-mode code executing concurrently on the same data structure instance, but then I’ll need to really actually make something work in the kernel for both Windows and Linux. I’m familiar with Windows kernel programming, so I could do that (install a driver, then the test app communicates with it), but I’m not familiar with Linux kernel programming (I’ve a kbuild build of liblfds, but it’s not *used* in anything; I have no idea if it’s a valid build) so these aren’t on the cards right now.

Shared memory is pretty much identical across Windows and Linux so that’s no problem.

No clue how it works on Android – Googling shows up various Java APIs – hopefully Linux under the covers.

Embedded platforms don’t have processes, well, they have one process, *the* process, so no shared memory.

Where embedded doesn’t offer processes or shared memory, the test app needs to run differently on different platforms, or, rather, depending on what’s available in the platform abstraction layer; position independent tests only happen if there’s support for shared memory and processes.

A weaker form of position independent testing is available just by running them in the same address space, over multiple threads, so that might be something which happens if processes/shared are not available.

Then we come to processes.

Processes are completely different between Linux and Windows.

Linux uses fork(). You call fork, and then you have two processes, and they each get a different return value from fork.
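A minimal POSIX sketch of that pattern. The helper names are hypothetical; `fork()`, `waitpid()` and `_exit()` are the standard calls. It does not retry if `waitpid()` is interrupted by a signal.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for real test work; the return value becomes the exit status. */
int example_child_work(void)
{
    return 7;
}

/* Run child_work() in a forked child and return its exit status,
   or -1 if the fork or the wait failed. */
int run_in_child(int (*child_work)(void))
{
    pid_t pid;
    int status;

    assert(child_work != NULL);

    pid = fork();
    if (pid < 0)
        return -1;               /* fork failed, no child exists */

    if (pid == 0) {
        /* fork() returned 0: this is the child, a copy of the parent */
        _exit(child_work());
    }

    /* fork() returned the child's pid: this is the parent */
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child starts with a copy of the parent's memory, including the function pointer, so there is no executable path and no command line involved.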

Windows uses CreateProcess, which takes a *pathname to an executable*, and spawns a new process running that executable. Parent and child by default have rights to access each other’s memory, child can inherit handles, etc.

These two things really are *not* the same.

Consider my use case; I want to spawn one process per logical core, have it open up a block of shared memory, and then, when everyone is ready (I’ll spinlock on a value in the shared memory) run a particular test.
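On Linux, that use case can be sketched roughly as below. This is an illustration under several assumptions, not the libtest design: all names are hypothetical, the shared block is an anonymous shared mapping inherited through `fork()` (a Windows port would need named shared memory instead), core pinning is left out, and C11 `atomic_int` is assumed to be lock-free so that it works across processes.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */

#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives in shared memory, visible to the parent and every child. */
struct shared_state {
    atomic_int ready_count;  /* children that have reached the start line */
    atomic_int go;           /* set to 1 by the parent to start the test */
    atomic_int work_done;    /* stand-in for the real test's results */
};

/* Spawn the processes, release them together, and return how many did
   their "work"; returns -1 on any failure. */
int run_process_set(int number_processes)
{
    struct shared_state *ss;
    int spawned = 0, i, result;

    assert(number_processes > 0);

    ss = mmap(NULL, sizeof *ss, PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (ss == MAP_FAILED)
        return -1;

    atomic_init(&ss->ready_count, 0);
    atomic_init(&ss->go, 0);
    atomic_init(&ss->work_done, 0);

    for (i = 0; i < number_processes; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            /* child: announce readiness, then spin until released */
            atomic_fetch_add(&ss->ready_count, 1);
            while (atomic_load(&ss->go) == 0)
                ;
            atomic_fetch_add(&ss->work_done, 1);  /* the "test" */
            _exit(0);
        }
        spawned++;
    }

    /* parent: wait for everyone who was spawned, then release them */
    while (atomic_load(&ss->ready_count) < spawned)
        ;
    atomic_store(&ss->go, 1);

    for (i = 0; i < spawned; i++)
        wait(NULL);

    result = atomic_load(&ss->work_done);
    munmap(ss, sizeof *ss);

    return spawned == number_processes ? result : -1;
}
```

The spin on `go` is what lines the processes up at the start, so the work begins at roughly the same moment in each.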

One problem to begin with is that the design of libtest is based around threads; a “test” is a function which inits a threadset, and that spawns threads which are given a function pointer to the test code. This needs now to be a processset, not a threadset; this wouldn’t be too bad under Linux – but under Windows, to make a process, I have to give a *pathname* to an executable! and that means, if I want just one test binary (and I do), I need to invoke the test binary *with command line arguments such that it knows what to do and will participate in the test which should now be run*.

This is nuts. Invoking new processes should not involve work on the command line parser.

I can get around a bit I suppose by having just one special command line argument, which tells the test programme to open up shared memory and get its instructions from there.

Forum blues

Back to square one for a forum.

Esotalk is really nice – but the recaptcha plugin has disappeared from GitHub, and no forum is viable without a registration captcha.

Flarum is the successor to Esotalk, but last time I tried it, maybe six months ago, I couldn’t get it working. It’s not production ready, even for open source values of production.

Have to remove the forum tab from the site for now.