How not to validate data

Trying to make an account on a Russian site, to order a book.

Need to enter name, email address, password.

I enter my name – in Latin.

“Enter your real name”.

I translate to Cyrillic, now we’re okay.

I guess they just don’t want all the extra business that comes from overseas. Damn all that extra profit!

I enter my email address – in Latin – this is fine.

I enter a password.

“Your password contains a link or email address.”

Actually, what it contains which they don’t like are spaces.

I remove the spaces, and get a green tick – password okay.

I hit submit.

“Your password is more than 20 characters.”

The only reason I’m even trying with these jokers is that this is an out of print book for a birthday present, which I already obtained once – but which is no longer available from that seller – and which the Royal Mail in the UK lost the instant I posted it with them; it didn’t even make it out of the post office.

Turns out when I first tried to submit, and was told the password was too long, they sent me a registration email – but with a blank password. When I tried again with a password shorter than 20 characters, they again sent me a registration email, *with the pasword in the clear*. At this point, I am bailing – I mean, I would make a burner credit card for them, with a balance in roubles equal to the bill, so I’m safe enough, but this is beyond the pale.

As an aside, yesterday, I was configuring the router in my friends place (where I am now) and I turned off WPS.

This made the 5 GHz network disappear.

Really disappear, not just no SSID.

No software works.

Software is the single most unreliable device that exists in the world, except for the Royal Mail.

More NUMA / shared memory thoughts

Spent the day thinking over shared memory and NUMA.

Supporting a single segment of shared memory is smooth and graceful. It looks good in the API, is simple and easy to understand for the user.

Multiple segments is messy. The user needs to provide per-process state, and to register each segment in each process, before it can be used. Most significant bits have to be taken from the offset value, to indicate which segment the offset is from. When the user passes in a pointer, a lookup has to occur to figure out which segment that pointer is from.

There is a reason to use multiple segments in Linux.

This is that memory policy is on a per-process basis, not per-data structure.

So if I go striped, fine, I can allocate one shared memory block and it’ll be striped on a page basis.

But what if I want striped for one data structure, but something else for another?

There is only one policy, and it is enforced when pages are swapped back in, so you can’t set it, do stuff, and then change it : whatever you have set *now* is what gradually comes to be applied, as pages swap in and out.

In fact this is a problem anyway : if I do have multiple shared memory segments, one per NUMA node, and I’m so controlling my NUMA directly, and striping on a per-entity bais – memory policy will mess it up for me by applying itself to my allocations.

So there is only one memory policy and it applies to everything in your process, like it or not. You’re fucked anyway. Multiple segments will not save you, unless you pin the pages so they can’t swap, which isn’t a reasonable thing to ask.

So on Linux, multple shared memory segments are not useful, because memory policy stops you from controlling your own NUMA anyway.

On Windows, you do need multiple shared memory segments because the OS does not control NUMA. You do it yourself. So if you want to spread an allocation over multiple NUMA nodes, you need to manually allocate on each of them and then put those elements into the data structure.

Multiple shared memory segments, NUMA, Linux and Windows

I bin learning fings, Oi have.

With position independent (i.e. or maybe e.g. shared memory) data structures;

On Linux you do not need support for multple shared memory segments *as far as NUMA is concerned*.

This is obvious really – you just turn on striping.

You do need support for multiple shared memory segments *just because*, i.e. the user may want this for whatever reason.

On Windows, you *do* need support for multiple shared memory segments *as far as NUMA is concerned*, to perform striping manually, which is how you have to do it under Windows.

You also need it for itself, as on Linux.

Shared memory

Position independent data structures support shared memory (i.e. differing virtual address ranges) by using offsets from a known base rather than full virtul addresses.

So far I’ve only supported s single shared memory segment, so all data used has to be in that one segment. The offset is from the data structure state.

This is obviously a problem with NUMA.

With NUMA, you might well want to have a shared memory segment in every NUMA node.

This means in general multiple shared memory segments, which means mutiple offsets, which means when you are manipulating elements in the data structure and so working with offsets, knowing which shared memory segment a given offset is from, so you can know its base.

Central to almost all data structures is the atomic compare-and-swap (CAS).

If we have one segment only, we can compare the offsets across all the different virtual memory ranges and we will know we’re comparing the same data structure element.

If we have multiple segments, we can have the same offset but in different segements. Somehow we have to know, in the CAS, which segment an offset belongs to.

The only way I can see to do this is to borrow some most significant bits.

On 64-bit platforms this should be fine.

If we borrow say 8 bits, we can have 256 shared memory segments, and we have 56 bits remaining for the offset.

On 32-bit platforms it barely works.

If we borrow just 4 bits, and so can have 16 shared memory segments, we have 28 bits left over for the offset – which is 512mb.

It also means we have at times to do a lookup, in the data structure; we have an array, and here we store the base addresses of the different segements, and we look them up when we need to convert the offset to a full virtual address (which we do when we pass elements back to the user, i.e. after a dequeue() or pop()).

Position independence without NUMA is basically a fail, so I think this has to happen.

Shared memory and NUMA

I’ve been thinking about shared memory and NUMA.

Windows always does things differently to Linux, which is usually bad, because Linux usually gets it right or pretty much right.

I think Linux made a bad job of NUMA. Linux tries to make NUMA go away, in the sense of making it so the developer doesn’t need to think about it. This is done by the OS offering NUMA policies, which control how memory allocations are handled with regard to NUMA – local node, striping across all nodes, etc. Critically, when a page has been paged out and then is paged back in, the page is normally expected to be able to change which NUMA node it is in (although it might well not do so).

Windows, which went for a more “here are the controls, do the right thing” approach, is more like C. The developer has to handle the matter.

The library supports bare metal platforms so it does not perform memory allocation; rather, the user passes memory in. The same has to be true for the test and benchmark application, so it can be run on bare metal platforms.

So the user allocates memory and passes it in.

But what happens about shared memory, for the position independent data structures?

THe user allocates shared memory, rather than normal memory, and passes it in, and the child test processes when they run open the shared memory and use it.

So that’s okay.

What happens with NUMA?

The user allocates equal memory on each NUMA node and passes it all in.

There’s a function for this in Windows and Linux, so that’s okay for Windows, but what about Linux moving pages between NUMA nodes on paging-in? the only way to stop this is to pin a memory page, so it cannot be paged out.

So, okay, I can do this for the tests and benchmarks.

What about shared memory with NUMA?

Well, obviously now I would need to allocate equal blocks of shared memory on each NUMA node and pass them in.

Oh. Problems.

On Windows it’s fine – there’s a function to allocate shared memory on a specific NUMA node.

On Linux, there is no such function. Shared memory is placed on NUMA nodes just as non-shared memory, according to the NUMA policy.

I think I might be able to change the NUMA policy just before creation of the shared memory to use and only use a singe NUMA node, the one I want to use; but shared memory like all allocations is really allocated on faulting, so doing this doesn’t *do* anything.

I suspect what I need to do is change NUMA policy, create shared memory, pin the memory, then fault every page, then revert NUMA policy.

(Another way, says SO, is to create, then move the pages to the desired NUMA node.)

Obviously, this all feels wrong.

Am I doing the wrong thing?

Should I just suck it up and let Linux do what it want to do?

One issue here is comparing like with like.

Actually it raises the question of what is like with like?

If I run the benchmarks on Windows, with low-level NUMA control, and then I run them on Linux, with the same low-level NUMA control, I have like with like.

But if on Linux users are simply using NUMA policy, then I’m coming apples and oranges… …except if Linux *is* normally like this, then it really is what you normally get, and so that *is* what you have to compare against.

Update

Just finished moving the test and benchmark library over to the single threaded data structures.

Update

Made the test and benchmark library compile.

Both liblfds and libtest_and_benchmark (libtab for short) have subsets of libstds (library single threaded data structures) in.

Originally and wrongly I was expanding the subset in liblfds, where those data structures were being used by libtab. Now I have only the data structures needed by liblfds in liblfds, and only those needed by libtab in libtab.

I now need to move libtab fully away from using liblfds to using libstds.

It was a blunder to have used liblfds, because liblfds provides data structures to the extent you have atomic support, which means you might not have a list, for example – but libtab uses the list everywhere.

Actually maintaining this portability behaviour in the code is a lot of work. If I just assumed x64-level atomics, the portability code would go away. In a sense it matters, because the portability code right now is untested. I do not – and I will need to – build variants which pretend to have less support. With software if it’s not tested, it doesn’t work.

Work getting done

Tzo!

Coded all day.

Have the new test and benchmark app to the point it compiles.

Still need to do some key work, but it’s an important step.

Importantly, I realised I’d made a huge blunder all along in test and benchmark – I use in test and in benchmark liblfds data structures, the list in particular.

I can’t do that, because liblfds is designed to offer data structures to the extent your system offers atomics; so you might not *have* the list.

In fact, the test and benchmark code needs to use single threaded data structures throughout.

This means I need to put some of the single-threaded data structure (stds) library data structures in the test and benchmark library.

I also need to introduce versioning on the stds code in liblfds, so multiple released can be compiled in the same project.

I finish my current contract work on Tuesday, and I’ll be taking a few years off, so the next release will come reasonable soon – few months tops.

The road to hell is paved with affinity APIs

I’m working away on the new test and benchmark application.

I need to support creating processes, to test position-independent data structures.

That means I need to pin processes to particular logical cores.

Know what?

That’s what’s written on the sign that points the way into hell.

Let me put this bluntly : Windows has no API to set process affinity beyond the first processor group, which has a maximum of 64 logical cores.

You read that right.

So if you have say 128 cores, and let’s say Windows has split these up into two 64 core groups – you can only set process affinity to be on cores 0 to 63.

You *can* set *thread* affinity to be on any core – but this is *not the same* as process affinity, and is less performant – but it looks like this is the best you can do.

It’s problematic to do this remotely (from another process). To do so you’d need to call CreateRemoteThreadEx(). In my case, I’m spawning new processes and I want them to quit when the benchmark work is done, so I need to co-ordinate between the main thread (which begins when the process is spawned) and the thread created by CreateRemoteThreadEx(), which will be created at some point after the main thread… it’s hard to wait on things in the main thread which haven’t yet been created. I could busy wait on a global variable…. but this is stomach-twistingly bad. I don’t *want* to write code like this.

You can set thread affinity from within the process itself by calling SetThreadGroupAffinity(). Obviously to use this you have to pass in information about which logical core in which processor group. I’m passing in some information already to the child process, through the command line (shared memory name and length in bytes), so I’ll have to add this.

It’s still not what I actually want. I want to set process affinity, from the parent process.

Windows thread/process affinity APIs are Civil Service quality – and I don’t mean the British Civil Service. I mean the *Egyptian* Civil Service.

Next step, finding out how bad it is under Linux. It’ll be bad, but it won’t be as bad, even if it’s just by not having processor groups, which are the worst single concept I’ve encountered since MS-DOS was designed with a 640kb RAM limit.

Shared memory and NUMA

Windows always does things differently to Linux, and this is almost always a problem, because Linux gets them right.

NUMA is the one exception I know of. Linux got it wrong, and Windows did it differently, and Windows did it right.

Linux has a system-wide policy which controls NUMA, and this is applied whenever a page is paged back in after being paged out. The upshot is you’ll get the system-wide NUMA policy, unless you pin your pages into memory so they can’t be paged. You youself in your application cannot control your NUMA behaviour. It’s controlled in the OS.

Windows does what you’d think would be done : when you make an allocation, you specify the NUMA node, and the OS tries as hard as it can to keep those pages in that node.

So this was all good and fine and clear until this week when I realised something.

I’ve been working on the test application for the position-independent data structures. They are intended for use with shared memory, where the shared memory has different virtual addresses in the difference processes; the data structures internally are using offsets rather than proper virtual memory addresses.

The new test application actually combines the test and the benchmark applications.

With the benchmarks, you want to be NUMA aware and take advantage of it. That means you need to pass in to the benchmark library a block of memory in each NUMA node, so it can put data in the correct places.

Now we see the problem – with shared memory, the data structure state, and all its elements, must be in the same single allocation.

How can you have one allocation per NUMA node *and* shared memory? because that means you have multiple allocations.

Suddenly Linux looks like it’s doing the right thing. Say you select striped allocations – your pages in any given allocation are striped over NUMA nodes. Okay, it’s not what you really want in your app – you want more fine grained control – but at least you’re able to *do* something meaningful with NUMA *within a single allocation*.

On Windows, where an allocation specifies its NUMA node, you just can’t do this.

You could in theory actually still make things work. In the data structure state, you’d have an array, which shows the address ranges for each allocation, and when you get hold of an offset (by popping an element from a freelist, say) you can then figure out *which* address range it is in, and so know the start of that range, and so figure out the actual virtual address represented by that offset.

Here though obviously you’re needing to do an array scan per freelist access, which is really not what you want.

Ironically, it’s on Windows where the position independent stuff really matters, because there are no locking primitives on Windows which are cross-process.