Shared memory and NUMA

I’ve been thinking about shared memory and NUMA.

Windows always does things differently to Linux, which is usually bad, because Linux usually gets it right or pretty much right.

I think Linux did a bad job with NUMA. Linux tries to make NUMA go away, in the sense of making it so the developer doesn’t need to think about it. This is done by the OS offering NUMA policies, which control how memory allocations are handled with regard to NUMA – local node, striping across all nodes, etc. Critically, when a page has been paged out and is then paged back in, the page is normally permitted to change which NUMA node it is on (although it might well not do so).
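
For example, here’s a minimal sketch of what the policy approach looks like on Linux – the caller sets a policy once, and ordinary allocations then follow it without mentioning NUMA at all. (This assumes a machine with at least two NUMA nodes, 0 and 1, links against libnuma, and omits error handling.)

```c
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main( void )
{
  unsigned long nodemask = 0x3;  /* nodes 0 and 1 */

  /* from here on, this thread's allocations are striped page-by-page across the mask */
  set_mempolicy( MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8 );

  /* the allocating code itself never mentions NUMA */
  void *block = malloc( 16 * 1024 * 1024 );

  printf( "allocated at %p\n", block );
  free( block );

  return 0;
}
```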

Windows, which went for a more “here are the controls, do the right thing” approach, is more like C. The developer has to handle the matter.

The library supports bare metal platforms so it does not perform memory allocation; rather, the user passes memory in. The same has to be true for the test and benchmark application, so it can be run on bare metal platforms.

So the user allocates memory and passes it in.

But what happens about shared memory, for the position independent data structures?

The user allocates shared memory, rather than normal memory, and passes it in; when the child test processes run, they open that shared memory and use it.
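
Concretely, something like this, using POSIX shared memory (the name and size here are purely illustrative, and error handling is omitted):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME  "/test_shm"
#define SHM_SIZE  (1024 * 1024)

/* parent: create, size and map the shared memory object */
void *parent_create( void )
{
  int fd = shm_open( SHM_NAME, O_CREAT | O_RDWR, 0600 );
  ftruncate( fd, SHM_SIZE );
  return mmap( NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
}

/* child test process: open the same name and map it */
void *child_open( void )
{
  int fd = shm_open( SHM_NAME, O_RDWR, 0600 );
  return mmap( NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
}
```

The mapping can land at a different address in each process, which is exactly why the data structures have to be position independent.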

So that’s okay.

What happens with NUMA?

The user allocates equal memory on each NUMA node and passes it all in.

There’s a function for this on both Windows and Linux, so Windows is fine – but what about Linux moving pages between NUMA nodes when they are paged back in? The only way to stop this is to pin a memory page, so it cannot be paged out.
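
On Windows that function is VirtualAllocExNuma(); on Linux it’s numa_alloc_onnode() from libnuma. Here’s a sketch of the Linux side, with mlock() added so the pages can never be paged out and re-faulted onto a different node (the per-node size is illustrative, error handling is omitted, link with -lnuma):

```c
#include <numa.h>
#include <sys/mman.h>

#define PER_NODE_SIZE (16 * 1024 * 1024)

int main( void )
{
  int node, number_of_nodes = numa_max_node() + 1;

  for( node = 0 ; node < number_of_nodes ; node++ )
  {
    /* memory bound to this node (physically allocated when first touched) */
    void *block = numa_alloc_onnode( PER_NODE_SIZE, node );

    /* pin it - this faults the pages in on the bound node and keeps them resident */
    mlock( block, PER_NODE_SIZE );

    /* ...pass block into the test/benchmark application... */
  }

  return 0;
}
```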

So, okay, I can do this for the tests and benchmarks.

What about shared memory with NUMA?

Well, obviously now I would need to allocate equal blocks of shared memory on each NUMA node and pass them in.

Oh. Problems.

On Windows it’s fine – there’s a function to allocate shared memory on a specific NUMA node.
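
That function is CreateFileMappingNuma(), which takes a preferred NUMA node for the physical pages backing the section. A sketch (the size is illustrative, error handling omitted):

```c
#include <windows.h>

#define SHM_SIZE (16 * 1024 * 1024)

/* create a named shared-memory section, backed by the paging file,
   with its pages taken (preferentially) from the given NUMA node */
void *create_shared_on_node( TCHAR const *name, DWORD node )
{
  HANDLE mapping = CreateFileMappingNuma( INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                          0, SHM_SIZE, name, node );

  return MapViewOfFile( mapping, FILE_MAP_ALL_ACCESS, 0, 0, SHM_SIZE );
}
```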

On Linux, there is no such function. Shared memory is placed on NUMA nodes just as non-shared memory, according to the NUMA policy.

I think I might be able to change the NUMA policy just before creating the shared memory, so that it uses, and only uses, a single NUMA node – the one I want; but shared memory, like all allocations, is really only allocated when its pages are faulted in, so changing the policy at creation time doesn’t by itself *do* anything.

I suspect what I need to do is change the NUMA policy, create the shared memory, pin the memory, then fault every page, then revert the NUMA policy.
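
In code, I imagine the sequence looks something like this, using POSIX shared memory and libnuma (the name, size and node are illustrative; error handling omitted; link with -lnuma):

```c
#include <fcntl.h>
#include <numaif.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_SIZE (16 * 1024 * 1024)

void *create_shared_on_node( char const *name, int node )
{
  unsigned long nodemask = 1UL << node;
  long page_size = sysconf( _SC_PAGESIZE );
  size_t offset;
  char *p;
  int fd;

  /* 1. from here on, new physical pages for this thread come only from `node` */
  set_mempolicy( MPOL_BIND, &nodemask, sizeof(nodemask) * 8 );

  /* 2. create and map the shared memory - virtual only, nothing is allocated yet */
  fd = shm_open( name, O_CREAT | O_RDWR, 0600 );
  ftruncate( fd, SHM_SIZE );
  p = mmap( NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );

  /* 3. pin, so the pages can never be paged out and re-faulted elsewhere */
  mlock( p, SHM_SIZE );

  /* 4. touch every page while the bind policy is in force */
  for( offset = 0 ; offset < SHM_SIZE ; offset += page_size )
    p[offset] = 0;

  /* 5. revert to the default policy for everything else */
  set_mempolicy( MPOL_DEFAULT, NULL, 0 );

  return p;
}
```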

(Another way, says Stack Overflow, is to create the shared memory first, then move its pages to the desired NUMA node.)
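
A sketch of that route, using the move_pages() system call (mbind() with MPOL_MF_MOVE on the mapped range would be another way to express it); the page arrays are built naively, error handling is omitted, link with -lnuma:

```c
#include <numaif.h>
#include <stdlib.h>
#include <unistd.h>

/* migrate an already-mapped, already-faulted block to the desired node */
void move_block_to_node( void *block, size_t size, int node )
{
  long page_size = sysconf( _SC_PAGESIZE );
  size_t count = size / page_size, index;

  void **pages  = malloc( count * sizeof(void *) );
  int   *nodes  = malloc( count * sizeof(int) );
  int   *status = malloc( count * sizeof(int) );

  for( index = 0 ; index < count ; index++ )
  {
    pages[index] = (char *) block + index * page_size;
    nodes[index] = node;
  }

  /* pid 0 means the calling process; status[] reports per-page success or an errno */
  move_pages( 0, count, pages, nodes, status, MPOL_MF_MOVE );

  free( pages );
  free( nodes );
  free( status );
}
```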

Obviously, this all feels wrong.

Am I doing the wrong thing?

Should I just suck it up and let Linux do what it wants to do?

One issue here is comparing like with like.

Actually, it raises the question: what *is* like with like?

If I run the benchmarks on Windows, with low-level NUMA control, and then I run them on Linux, with the same low-level NUMA control, I have like with like.

But if on Linux users are simply using NUMA policy, then I’m comparing apples and oranges… …except that if Linux *is* normally like this, then it really is what you normally get, and so that *is* what you have to compare against.