Monthly Archives: August 2012

More SMR / NUMA

Man, going crazy here.

If you’re writing a library which fully supports pre-allocaton (e.g. does no allocation of its own) then NUMA is a complete, humungous pain in the ass.

Consider SMR. To be NUMA aware and performant, I keep a thread SMR state list and a spare release callback thread SMR state per NUMA node. So the user needs to know how many NUMA nodes there are, then allocate an array of pointers to SMR NUMA state, allocate state once for each NUMA node (on the respective nodes), then pass that into the SMR init function, along with the SMR state, for initialization.

Then later when a user inits a new thread SMR state, he has to indicate which NUMA node it is allocated in, so that it goes into the correct list.

This means both the user and library have to know about NUMA nodes and have to agree on a way of communicating about them. It’s all very well the user making for SMR state init an alloc per NUMA node, but how does he know what NUMA nodes exist and how do I know which allocs are for which NUMA node?

I don’t want to place on users a burden of having to learn about and write up a bunch of NUMA code, to be able to use liblfds.

However… anyone on a non-NUMA system can simply indicate NUMA node 0, always. So that’s okay.

Users who *are* using NUMA will already be prepared to do NUMA work. The issue then is communicating meaningfully about NUMA node numbers. On Windows, NUMA nodes seem to have arbitrary integer IDs. On Linux, there’s a bitmask, so they all have power of two IDs.

So, I guess the answer is; if you don’t have NUMA, indicate NUMA node 0. If you do, then I expect you to be able to know what NUMA nodes you have, to make allocations on them and to indicate to me their node numbers.

That means then that the library doesn’t have to provide an API to find the NUMA node numbers for the user; all it has to do is accept numbers from the user and assume they’re true.

Okay… let’s see how that rolls.

SMR / NUMA / Meh…

Support for pre-allocation makes NUMA aware SMR (at least with the idea of an SMR thread state list per NUMA node) a complete pig.

It’s better to have just one list and note in each thread state which NUMA node it belongs to, even though it means travesing elements on other NUMA nodes. We assume thread init is rare and doesn’t need to be as fast as humanly possible.

The OS when it pages can end up moving page to another node, but since that page has a preferred node, it should on the whole in the end return to the intended node.

…except I’ve just realised I still need a spare SMR thread state per NUMA node for SMR freelists.

Oh man…!

SMR / NUMA

In the process of making SMR NUMA aware.

MS PG

MS processor groups – gahhhh!

More SMR/NUMA

The current idea then is that SMR maintains a list of thread states per NUMA node.

If the user starts a thread and pins it to a core, then the core number is known, which means we can know the NUMA node and the user can allocate from the correct NUMA node and we can put that allocation into the correct list.

But what if the user doesn’t pin the a thread to a core? what if he doesn’t mind which NUMA node the thread runs in?

Then we need to be able to query the OS to find out which core the thread is running on.

Turns out this functionality, for Windows, turned up only in Vista! and even then, only for systems with less than 64 logical cores. For systems with more, you need Windows 7!

Another big drawback is that it requires liblfds proper, rather than just the benchmark library, to have knowledge of CPU topology. That’s fine if you don’t have to port it, but it’s a PITA if you do.

Also though imagine porting to a new platform. A system requirement is knowing which code a thread is running upon. I bet that’s not commonly supported.

But if you want to be NUMA aware, hell, you gotta know which damn NUMA node you’re using!

I think thought it’s reasonable to assume threads are pinned to cores – no high performance design will have floating threads, or more than one thread per logical core.

Still means I need to know CPU topology. That’s not good. I think thought a simplified version could be implemented, where the data is only cores and NUMA – no info on caches. This I can query from the OS for Windows and Linux. For porting to non-NUMA aware platforms, they can lie and state everything belongs to the first NUMA node.

NUMA

Actually, the problem with NUMA is that you’re not guaranteed, if you’ve been paged out and so are now being paged back in, to be paged back into the NUMA node you were originally allocated into.

So you can end up with allocations all over the place, with you thinking they’re in nodes they’re just not in any more.

Also of course as well the OS can move threads across cores – even across cores which cause a change in ideal NUMA node.

This really stinks. We’re trying to compensate in software for something which is a hardware issue and where there are just no guarantees you can use with which to write your software.

SMR and NUMA

Been having a bit of a break. Have ODed on liblfds.

Been working on getting the tests to work, did a bit of work on the SMR API, realised I could have a retirement state present in every SMR thread state, so the user wouldn’t have to mess about allocation for retirement states – an API simplification.

Then though I realised there was an issue. The SMR thread state lists are add-only. When a thread retires, it state cannot be removed from the list, so it’s marked as available. When another thread comes to start, it first tries to claim an available state.

Problem of course is you can end up claiming a state in an inappropriate NUMA node.

So what I’m thinking is this; an SMR state maintains an SMR thread state list per NUMA node. When a thread scans, it’s actually scanning only the list for its NUMA node and when it’s done, it raises a flag private to that NUMA node and increments a shared counter.

If the counter hits the number of NUMA nodes, then the incrementor bumps the generation counter in the SMR state and then (details not defined yet) we get on with the SMR stuff.

This is complicated. The code is getting complex to handle non-abstracted hardware issues.

I’m thinking now as I type this that the data structures and SMR need perhaps internally to be designed more with NUMA in mind, e.g. a freelist will actually consist of a freelist per NUMA node and exhaustion on a particular NUMA node (even if the others have available elements) causes a NULL to be returned.

But this doesn’t make sense for the other data structures. The whole point of them is that centralise data – the stack is ordered, the queue has an entry and an exit, the ringbuffer operates on one set of data.

Makes me wonder if properly distributed versions of these data structures have better characteristics with regard to NUMA. Imagine a freelist with the collision layer in front of it, where poppers can meet with pushers before having to go onto the freelist proper (and contend the top pointer). Here we would need to take into account which NUMA node an element belonged to.

We would then have a freelist per NUMA node underneath.

A stack could do the same with the collision layer, but then you *can’t* care about NUMA node, because the element being pushed, like its NUMA node or not, contains the data you want to pop!

A queue of course inherently acts in this way; you’re transfering data from source to destination. By design, the source and destination can be acting in different NUMA nodes.

With SMR, we can do something useful, which is have essentially an SMR per NUMA node (within the single SMR state), where the only cross-NUMA communication is checking to see if the other NUMA-SMRs are ready to advance the generation counter.

Long hard slog

Getting there, slowly.

Fixing up test code for modified APIs takes a long time, like an hour per test, and there are about twenty tests still to fix.

That’s just the intial fix up, too. Then it has to be made to work.

Wish I had a time machine!

status / queue idea

Main code compiles.

Fixing up tests.

Had an idea. You know the funnel counter, where threads travel up a tree, collecting inc/decs and then finally the total is applied to the counter?

Why not have the same for the enqueue part of a queue. Don’t enqueue one element at a time, enqueue a list of elements.