Shared memory and NUMA

Windows always does things differently to Linux, and this is almost always a problem, because Linux gets them right.

NUMA is the one exception I know of. Linux got it wrong, and Windows did it differently, and Windows did it right.

Linux has a system-wide policy which controls NUMA placement, and it is applied whenever a page is paged back in after being paged out. The upshot is that you get the system-wide NUMA policy, unless you pin your pages into memory so they can’t be paged out. You yourself, in your application, cannot control your NUMA behaviour. It’s controlled by the OS.

Windows does what you’d expect: when you make an allocation, you specify the NUMA node, and the OS tries as hard as it can to keep those pages on that node.
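For concreteness, here’s a minimal sketch of what that looks like on Windows, using `VirtualAllocExNuma` (available since Vista). The node number is illustrative, and the node argument is a preference the kernel tries to honour rather than a hard guarantee; error handling is kept to a minimum.

```c
#include <windows.h>
#include <stdio.h>

int main( void )
{
  DWORD node = 0;  /* illustrative: ask for memory on NUMA node 0 */

  /* allocate 1 MB, preferring pages on the given node */
  void *p = VirtualAllocExNuma( GetCurrentProcess(), NULL, 1 << 20,
                                MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                node );

  if( p == NULL )
  {
    printf( "allocation failed\n" );
    return 1;
  }

  /* ...use the memory; the OS works to keep its pages on the requested node... */

  VirtualFree( p, 0, MEM_RELEASE );

  return 0;
}
```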

So this was all good and fine and clear until this week when I realised something.

I’ve been working on the test application for the position-independent data structures. They are intended for use with shared memory, where the shared memory has different virtual addresses in the different processes; internally, the data structures use offsets rather than virtual memory addresses.

The new test application actually combines the test and the benchmark applications.

With the benchmarks, you want to be NUMA aware and take advantage of it. That means you need to pass in to the benchmark library a block of memory in each NUMA node, so it can put data in the correct places.

Now we see the problem – with shared memory, the data structure state, and all its elements, must be in the same single allocation.

How can you have one allocation per NUMA node *and* shared memory? One allocation per node means multiple allocations.

Suddenly Linux looks like it’s doing the right thing. Say you select striped allocations – the pages in any given allocation are striped over the NUMA nodes. Okay, it’s not what you really want in your app – you want more fine-grained control – but at least you’re able to *do* something meaningful with NUMA *within a single allocation*.
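On Linux, selecting that striped policy is done from outside the application, with `numactl`; for example (assuming a `./benchmark` binary):

```shell
# run the benchmark with its pages interleaved across all NUMA nodes;
# every allocation it makes, shared memory included, gets striped
numactl --interleave=all ./benchmark

# or restrict interleaving to specific nodes
numactl --interleave=0,1 ./benchmark
```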

On Windows, where an allocation specifies its NUMA node, you just can’t do this.

You could in theory actually still make things work. In the data structure state, you’d have an array, which shows the address ranges for each allocation, and when you get hold of an offset (by popping an element from a freelist, say) you can then figure out *which* address range it is in, and so know the start of that range, and so figure out the actual virtual address represented by that offset.

Here though, obviously, you’re doing an array scan per freelist access, which is really not what you want.

Ironically, it’s on Windows where the position-independent stuff really matters, because the user-mode locking primitives on Windows (critical sections, SRW locks) don’t work cross-process.