Proving correctness

I’ve been working on the new test/benchmark programme.

It’s a lot of work. There’s a -lot- of code.

Problem is, the test code is not serious. All it does is run lots of threads doing lots of work and tries to make things go wrong – there’s no directed effort to cause unusual cases, it’s just run-and-hope – and there’s fairly limited checking you can do at the end of the test to see if data is correct.

It does find bugs, but that’s the most you can say.

I need – and I’ve known for a long time that I need – a more formal, rigorous test.

This is not a straightforward problem to solve. There’s ongoing research work into how to do this in a timely and practical manner – which is to say, on unmodified or essentially unmodified source code, no translating the source code into a model for a modeller (since such work obviously leads to bugs).

I just read a white paper which is fairly recent (2014). What’s being done, basically, is enumeration of paths of execution, with culling of paths which are equivalent, and with the compiler back-end being co-opted to notice when memory accesses occur (although I don’t know how it really knows, since what the hardware does is different to what the compiler does – OTOH, in theory, the world looks normal to a single thread, so maybe that all works out, and the processor ensuring the world looks normal is enough?). Then there’s a correctness checker, which I’ve not yet read about, so I’m not sure how it works.

The code is on GitHub, it’s for C++, but you know open source – it doesn’t work. I’m going to need to write my own.

Duration inspiration

So I’ve been banging away at getting the threading test/benchmark code working.

I had originally looked to use the Windows-style CreateProcess() approach on Linux (you can do so), but this is wrong – the way Windows does this is more limited, and I can imagine use cases on Linux where the more flexible fork() method will be -necessary-. So now Windows will use CreateProcess() (which needs a command line binary to execute) and Linux will use fork(), and that choice then defines the porting abstraction layer.
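
Roughly, the two platform-native spawn paths look like this – a sketch only, with wrapper names I’ve made up, not the programme’s real code – and the way the two signatures refuse to line up is exactly why this choice shapes the abstraction layer:

```c
/* Sketch only - wrapper names are mine, not the programme's real code.
   Windows spawns a separate binary via CreateProcess(); Linux duplicates
   the calling process via fork(), so no separate binary is needed. */

#if defined( _WIN32 )

  #include <windows.h>

  int pal_spawn_process( char *command_line )
  {
    STARTUPINFOA si;
    PROCESS_INFORMATION pi;

    ZeroMemory( &si, sizeof(si) );
    si.cb = sizeof(si);

    if( 0 == CreateProcessA(NULL, command_line, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi) )
      return -1;

    CloseHandle( pi.hThread );
    CloseHandle( pi.hProcess );

    return 0;
  }

#else

  #include <unistd.h>

  int pal_spawn_process( void (*child_main)(void *state), void *state )
  {
    pid_t pid = fork();

    if( pid == -1 )
      return -1;

    if( pid == 0 )
    {
      /* child - run the test body directly, then exit */
      child_main( state );
      _exit( 0 );
    }

    return 0;
  }

#endif
```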

I also decided to move all memory allocation fully back out to the test/benchmark library caller – there was helper code in the library, but it’s just making life complicated and is potentially going to get in the way for some users.

So I’ve sorted out the fundamentals of the porting abstraction layer.

That leaves the duration problem – bare C89, so no timers. How do we know how long to run tests for? See the previous post for details.

What’s just occurred to me is to classify each test as slow or fast; on the command line, the user can then specify the number of iterations for each class, to override the defaults.

It’s some complexity, but there has to be some user involvement, because there is no way to measure time, so the code itself cannot figure out how long to run a test for.
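
As a sketch of what I mean – names, defaults and the command line switches all invented for illustration:

```c
/* Sketch - all names and numbers invented for illustration.  Each test
   is classed fast or slow; each class has a default iteration count,
   which the user can override from the command line. */

enum test_duration_class
{
  TEST_DURATION_CLASS_FAST,
  TEST_DURATION_CLASS_SLOW
};

struct test_descriptor
{
  char *name;
  enum test_duration_class duration_class;
  void (*test_function)( unsigned long int number_iterations );
};

/* defaults - overridden by e.g. "-fast 10000000 -slow 100000" on the command line */
static unsigned long int iterations_per_class[2] = { 10000000, 100000 };

void run_test( struct test_descriptor *td )
{
  td->test_function( iterations_per_class[td->duration_class] );
}
```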

Update

Working on the test and benchmark programme.

There’s a lot of functionality, so a lot of work.

Currently getting thread-based testing off the ground.

Making progress. Two main things to sort out – how to handle durations and memory allocation.

The standard library isn’t used, so there are no time functions.

Duration is expressed as a number of iterations.

Problem is, tasks (tests and benchmarks) vary in how long an iteration takes, and this varies again as the platform varies.

So some users will want fewer iterations, some more, and the number of iterations will still need to vary by task.

Memory allocation is a problem for the command line wrapper.

The test and benchmark code itself performs no allocations, but the caller has to.

The test and benchmark programme, being general, then needs to deal with NUMA and NUMA/shared memory, on Windows and Linux.

So, there’s quite a few variations of allocation methods.
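
To give a sense of the variations – the platform calls below are real, but the wrapper shapes and names are my own sketch, with error handling omitted:

```c
/* Sketch of the allocation variants the command line wrapper ends up
   needing - NUMA-node-bound allocation on Windows and Linux, plus
   shared memory on Linux for the fork() case.  Wrapper names are
   illustrative, error handling omitted. */

#include <stddef.h>

#if defined( _WIN32 )

  #include <windows.h>

  void *wrapper_alloc_on_numa_node( size_t size, unsigned long numa_node )
  {
    return VirtualAllocExNuma( GetCurrentProcess(), NULL, size,
                               MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                               numa_node );
  }

#else

  #include <numa.h>       /* libnuma - link with -lnuma */
  #include <sys/mman.h>

  void *wrapper_alloc_on_numa_node( size_t size, unsigned long numa_node )
  {
    return numa_alloc_onnode( size, (int) numa_node );
  }

  void *wrapper_alloc_shared( size_t size )
  {
    /* anonymous shared mapping - visible to children after fork() */
    return mmap( NULL, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANONYMOUS, -1, 0 );
  }

#endif
```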

Once that’s airborne, then process-based.

After that, then re-implementation of all tests, and also then a complete set of benchmarks.

At that point I can get back to actually coding lock-free data structures… =-)

(This WordPress post was brought to you by Mousepad.)

Addendum – it’s even worse than I realised. You have to click “Publish” twice to publish.

Minor site update

Finally got rid of the frames.

Moved the site back over to https (had time to sort out the certs from Let’s Encrypt).  There’s still an http server for the site; it redirects to https.  I’m fairly sure everything is working.

WordPress updated to v5 and the new editor is absolutely appalling, with no way to use the previous editor.  It’s all but unusable.  It uses a tiny part of the screen for editing, the text and cursor keep being moved around when you perform operations in the UI, and the cursor disappears at times – not to mention the usual bizarro-world, totally unexpected jumping around when you’re editing and moving the cursor (that’s really the worst thing – the unpredictability and inconsistency of cursor movements – you have to *think* and *pay attention* to *move the cursor*, because it can’t be done on autopilot, because the movements are not consistent).  The answer is editing in a real text editor and just pasting here.

I’ve had to write a book, the last few months.

Another month or two to go.

Then back to working on the next release.

Cute little problem

Running into a cute little problem, to do with initializing the test and benchmark programme.

The test and benchmark code is in a library – there’s a command line wrapper provided for convenience. The code is in a library so users on embedded platforms and the like can use the test and benchmark functionality.

The code in the library performs no allocations – the user passes in memory. The user could after all be an embedded platform with no malloc and just global arrays.

The library code is complex enough that there needs to be some flexibility in memory allocation, so the user-provided memory is the store which is placed into an API, and that API offers malloc()-like behaviour.
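
Something along these lines – a minimal sketch, with invented names; the real API is more involved:

```c
/* Minimal sketch of a malloc()-like API over user-provided store:
   the user hands over a block once, the library then hands out
   aligned chunks from it.  Names are illustrative, not the real API. */

#include <stddef.h>

struct pal_allocator
{
  char *store;
  size_t size;
  size_t offset;
};

void allocator_init( struct pal_allocator *a, void *store, size_t size )
{
  a->store = (char *) store;
  a->size = size;
  a->offset = 0;
}

void *allocator_malloc( struct pal_allocator *a, size_t size, size_t alignment )
{
  /* alignment must be a power of two */
  size_t aligned_offset = (a->offset + (alignment-1)) & ~(alignment-1);

  if( aligned_offset + size > a->size )
    return NULL;

  a->offset = aligned_offset + size;

  return a->store + aligned_offset;
}
```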

The test and benchmark code, being NUMA aware, needs an allocation on each NUMA node.

Asking the user to do the work to figure out his NUMA arrangement is quite onerous, though – and in fact the porting layer already has this code in it.

So what we really want is to get the topology info from the porting layer.

To do this though we need… some store from the user.

So it kinda looks like first we make a topology instance, and then the user uses this to find out about his NUMA layout and make allocs.

To make a topology instance though the user needs to know how much store to allocate – and that’s the cute little problem.

How do you write code which can either work and do its job, *or*, tell you how many bytes of store it will need to do its job?

Now if the function is “flat”, in that it needs no allocations to *find out* how much store it needs, then it’s straightforward enough.

However, if the function is “deep”, and it needs allocations, as it goes along, to find out how much store it needs, then life is more complicated – in fact, what must happen is that the user calls the function repeatedly, until he gets the same result twice, and the user needs to pass in each time as much store as the function asked for the time before.

There are Windows functions like this.
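
The calling pattern for the “deep” case ends up looking something like this – a sketch, with an invented function standing in for the real one:

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed, illustrative function: when given enough store it builds the
   topology instance and reports the size it used; when given too little,
   it only reports how much it needs (possibly needing allocations along
   the way to find that out, hence the loop below). */
size_t topology_make( void *store, size_t store_size );

void *make_topology_instance( void )
{
  size_t size_this_time = 0, size_last_time;
  void *store = NULL;

  do
  {
    size_last_time = size_this_time;

    /* pass in as much store as was asked for the time before */
    size_this_time = topology_make( store, size_last_time );

    if( size_this_time > size_last_time )
      store = realloc( store, size_this_time );
  }
  while( size_this_time != size_last_time );

  return store;  /* the topology instance now lives in this store */
}
```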

Problem is… now it’s quite a bit of extra work, and I’m not sure I’m *getting* very much for all this.

Update

I’ve had some time to do a bit of work.

The library now compiles again.

I want to move over from using an array for pointer-and-counter to using a struct, but that’s a fundamental change, so I need to get the new test suite up and running.

So now I’m working to get the new test and benchmark programme working.

For now I think I’m only going to release single-segment position independence.

There’s a single multi-segment data structure, for the nodeallocate freelist; I’ll keep that, so I can benchmark it, to get a first sense of how much overhead multi-segment brings.

A thought

With position independent data structures, I have to use an offset.

Maximum convenience is when the offset is from the data structure state structure – but this implies negative offsets can occur, which means I have a problem because I can’t address the entire address range.

Well – if we say all pointers will be to data aligned on 4 or 8 byte boundaries, then we have two or three bits free at the LS bit end of things.

When we do pointer math we get a ptrdiff_t, like it or not.

What we could do is cast the pointers to unsigned long long int (or in any case to the unsigned integer type the same length as a pointer), figure out the difference and whether it’s negative or positive, and use one of those free LS bits as the sign bit.

That way the *normal* data structure can support the full address range, using offsets, and so is used in the normal case and is also sound for use with single segment relocations.
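
As a sketch of the encode/decode – assuming 4-byte-or-better alignment so the bottom bit really is free, and using uintptr_t for brevity where the real code would use the library’s own pointer-width unsigned type:

```c
#include <stdint.h>

typedef uintptr_t lfds_offset_t;   /* stand-in; the real code would use the library's PAL type */

/* Both pointers are to 4 (or 8) byte aligned data, so the magnitude of
   the difference always has its bottom bit clear - stash the sign there.
   Outside the Standard, as noted. */
lfds_offset_t offset_encode( void *base, void *target )
{
  uintptr_t b = (uintptr_t) base, t = (uintptr_t) target;

  if( t >= b )
    return (lfds_offset_t) (t - b);            /* forwards: bit 0 clear */
  else
    return (lfds_offset_t) ( (b - t) | 1 );    /* backwards: bit 0 set */
}

void *offset_decode( void *base, lfds_offset_t offset )
{
  uintptr_t b = (uintptr_t) base;

  if( offset & 1 )
    return (void *) ( b - (offset & ~(uintptr_t) 1) );
  else
    return (void *) ( b + offset );
}
```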

(Multi-segment still needs all the extra work with storing multiple base addresses.)

This is all fly-by-night, of course. Totally outside of the Standard.

The problem with this – user data is a void pointer for key and a void pointer for value.

These are both converted to an offset.

They could point to chars.

They could also be set directly. Being converted to an offset doesn’t hurt – it’s converted back afterwards – but having the LS bit used for something else will.

Ack. Just realised, direct setting is a problem for the normal signed offset variant. If the user sets a value directly, and the value is far enough away from the address of the data structure state, signed overflow will occur. Implementation-defined behaviour is not a place I want to be. I need a different macro for setting direct values.

Update

I was wrong about using unsigned offsets.

The result of pointer math will always be a ptrdiff_t, which is to say signed.

I can’t *get* an unsigned diff… well, mmm, unless I cast the pointers to unsigned ints first…?

I’m starting to worry that I’m pushing the limits here though.

In other news: where the offset/version combo needs the offset to be signed and the version unsigned, an array of lfds720_pal_uint_t doesn’t work. I can take the address of the first element, cast it to a pointer to ptrdiff_t and then deref, but actually, having gone over the struct packing stuff again to be sure, I think I will actually be fine using a struct. I didn’t want to go that way – I felt it was opening a window for compilers to get things wrong – but I read a very nice article about it and now I feel happy.

This is the article;

http://www.catb.org/esr/structure-packing/

“The interesting news is that NTP has apparently being getting away with this for decades across a very wide span of hardware, operating systems, and compilers, including not just Unixes but under Windows variants as well. This suggests that platforms with padding rules other than self-alignment are either nonexistent or confined to such specialized niches that they’re never either NTP servers or clients.”

Anyways, the structures are nicer than the two element array – it’s clearer what’s going on.
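
So, roughly, something like this – the struct itself is my sketch, with a stand-in typedef for the library’s PAL type:

```c
/* Sketch of the struct replacing the two-element array: the offset
   needs to be signed, the version unsigned, which a single array of
   lfds720_pal_uint_t cannot express.  The typedef is a stand-in for
   the library's PAL type; field names are illustrative. */

#include <stddef.h>

typedef unsigned long int lfds720_pal_uint_t;   /* stand-in only */

struct lfds720_offset_and_version
{
  ptrdiff_t          offset;    /* signed - the target can be below the state structure */
  lfds720_pal_uint_t version;   /* unsigned - a counter, free to wrap */
};
```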

Update / I’m sad

I am sad.

This is because I think I need three variants of every data structure.

The first is the normal variant – pointers are what they are, absolute values (virtual or physical as they may be).

The second is the single segment variant – pointers are offsets, and all data must be within a contiguous range which is half the address range; the only way the user can guarantee this is to make a single contiguous allocation. So here we’re looking at shared memory, or buffers passed down into or up from the kernel, that sort of thing.

The third is the multi segment variant – pointers are offsets, and we keep track of the base address of each segment, and that has to be done on a per-address range basis (so each process needs to register each new block of shared memory, to indicate what the base address is in its virtual address range). However, since all offsets are positive, we can use unsigned (and we have to anyway, since we borrow some MS bits for a segment number), and so we support the entire address range. Downside is the per-address range state, which the user has to set up and add segments to when they are made, and the overheads (array scan) in the data structure to figure out which segment a given pointer belongs to. Oh also actually since we borrow MS bits for the segment number, the more segments you want to support, the smaller the maximum address range for each segment. Again, this means users have to allocate blocks.
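
For the multi-segment variant, the encoding works out to something like this – segment count, field widths and names all invented for illustration:

```c
/* Sketch of the multi-segment encoding: the MS bits of an unsigned
   offset carry a segment number, the rest carry the (always positive)
   offset from that segment's per-process base address.  Here 16
   segments are supported, so 4 MS bits are borrowed. */

#include <stddef.h>
#include <stdint.h>

#define SEGMENT_BITS   4
#define SEGMENT_COUNT  (1u << SEGMENT_BITS)
#define OFFSET_BITS    (sizeof(uintptr_t)*8 - SEGMENT_BITS)
#define OFFSET_MASK    ( ((uintptr_t) 1 << OFFSET_BITS) - 1 )

struct segment_table
{
  void *base_address[SEGMENT_COUNT];   /* registered per-process, since each process has its own virtual addresses */
  size_t size[SEGMENT_COUNT];
};

uintptr_t ms_offset_encode( struct segment_table *st, unsigned int segment, void *target )
{
  uintptr_t offset = (uintptr_t) target - (uintptr_t) st->base_address[segment];

  return ( (uintptr_t) segment << OFFSET_BITS ) | (offset & OFFSET_MASK);
}

void *ms_offset_decode( struct segment_table *st, uintptr_t encoded )
{
  unsigned int segment = (unsigned int) (encoded >> OFFSET_BITS);

  return (char *) st->base_address[segment] + (encoded & OFFSET_MASK);
}

/* figuring out which segment a raw pointer falls inside is an array scan */
int ms_segment_for_pointer( struct segment_table *st, void *target )
{
  unsigned int loop;

  for( loop = 0 ; loop < SEGMENT_COUNT ; loop++ )
    if( (char *) target >= (char *) st->base_address[loop] &&
        (char *) target <  (char *) st->base_address[loop] + st->size[loop] )
      return (int) loop;

  return -1;
}
```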

The third variant supports what the second variant does, but the third variant is, I think, primarily for NUMA on Windows – not for everyone and their dog – whereas the second variant is *so* much easier to use, and its use case (shared memory between processes) is going to be common, so it’s worth having that variant.

Update

I finally had two free days.

I’ve been working on position independence.

Been backwards and forwards a bit.

Currently thinking to make all normal data structures work using offsets, which means they support a single address range regardless of absolute address values, as long as the relative positions of all data structure states, elements and user data are the same.

There can then be an extra variant for each data structure which supports multiple segments.

The one (and big) problem with this is that if I make the API nice, and use offsets relative to the data structure state, I have to support negative offsets, which means on 32 bit machines all data needs to be within 2 GB of the data structure state. That is by no means guaranteed, and is outside of the control of the user. Actually the same problem exists on 64 bit machines – if we imagine a 64 bit virtual address range (it’s not 64 bit in practice, but this doesn’t change things) we still only have 63 bits, so all data still needs to be within half the address range.

If the API is not quite so nice, and you specify a base address so that all offsets are positive, this problem goes away. Note though that in the nice API, reading/writing data structure element user data is relative to the data structure element, so in the not-quite-so-nice API we somehow have to know the base pointer, which means each element has to point to the state structure (which holds a copy of the base), and that is a significant extra cost in terms of memory access.
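
In struct terms the difference is roughly this – a sketch only, with invented names; real elements carry more fields:

```c
#include <stddef.h>
#include <stdint.h>

/* "Nice" API: user data offsets are relative to the element itself,
   so reading or writing them touches only the element. */
struct element_nice
{
  ptrdiff_t next_offset;
  ptrdiff_t key_offset;     /* relative to the element's own address */
  ptrdiff_t value_offset;
};

/* "Not so nice" API: offsets are relative to a base address, so every
   element carries a pointer back to the state structure (which holds
   a copy of the base), and every user data access takes that extra hop. */
struct data_structure_state;

struct element_base_relative
{
  struct data_structure_state *state;   /* extra per-element field, extra memory access */
  uintptr_t next_offset;
  uintptr_t key_offset;     /* relative to the base address held via *state */
  uintptr_t value_offset;
};
```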