Monthly Archives: July 2012

Allocator

Who allocates for the allocator…?

Well, you could build up state locally then once you know what you need to know, alloc in the correct place and transfer your state there.

I guess I’m kinda thinking of an alloc state, which carries the NUMA/LC map.

However, in some situations, the only solution is striping, e.g. four NUMA nodes and the user says “ALL” – you cannot have equal performance on all four nodes all the time. You could if he only cared about two of the NUMA nodes, then you’d select a middle-man.

But this now is getting back more or less to what the NUMA API for Linux offers; you basically configure your preferences and then malloc does what you’re asaking for. I don’t need to implement this, it exists already.

This is not the case under Windows, which only offers a function whereby you indicate the NUMA node you wish to allocate from.

Something else has just struck me though; large pages really are for very specific tasks and their controlled environments, because their allocation, relying as it does on contigious physically memory blocks, can only really be guaranteed just after boot.

Maybe I can for now not worry about large pages – just make the library for normal arbitrary post-boot usage…

What it comes down basically is that the user must decide where to store his data (which memory) and where to run his code (which core), where his code will be running on a wide range of systems.

What he really wants of course is locality – but now this brings us not only to controlling where the memory goes, but where the threads go. We would need an allocator, for memory, and a starter, for threads…

…and some generalised (and viable!) method for specifying desired behaviour.

So we would imagine some high-level way of quantifying the relationships between threads in an application – this thread communicates with this thread, etc. But then how much do they communicate? 30% with one thread, 70% with another? it’s hard to imagine this working well – but the problem here is that too much of the machine detail (cores and memory) has come to have to be exposed to developers.

It’s a similar problem with liblfds now. Take SMR for example. A library exists not just to encapsulate functionality but also knowledge. It exists to save users having to *understand* the details of its inner workings. If SMR is say removed fully from the library and made the users responsibility, the user now has to understand SMR. The library has in that sense failed.

Memory allocation

Starting to think about a memory allocation interface – not an allocator as such, but rather an API which offers useful functionality for memory allocation.

Right now, the library is still ugly. Memory allocation (and SMR) are sort of half-in, half-out.

But I don’t want to put memory allocation really into the data structures – it stops them from doing just one thing.

So what about a memory allocator API and having the data structures always just being passed memory.

The MA (memory allocator) API would have some simple NUMA info – which logical cores belong to which NUMA nodes and the latency from a logical core to a given NUMA node.

The user when allocating would indicate which logical cores he cared about with regard to the performance of the allocation, e.g. “ALL”, or a list which is a subset of logical cores. We then work out which is the NUMA node to use.

Open question is large page support. On Windows, it’s easier to use than Linux – it’s a function call, just like malloc. On Linux, you have to do some OS-level configuration before you can issue the function calls to try to allocate large pages.

Perhaps we have an allocation option to obtain large pages, and in the Linux case, the user has to have performed the OS config or of course it will fail at run-time.

Adieu, run-time cache-line alignment

Removed run-time cache-line alignment.

Kinda breaks my heart a little bit to do this, because it’s a fair bit of work, but basically it’s just not C. The language isn’t able to naturally express this useage. You can do it, but it’s just not right.

Cache topology

Hmm.

I thought I couldn’t get detailed cache topology data (associativity, etc) but yes, under Linux I can and under Windows too.

Not so useful to me right now, since the user handles allocations, but potentially of use in future (to avoid cache-line thrashing).

Progress

Busy making it all compile.

Thinking of removing run-time cache-line alignment, going back to compile-time only.

My feeling is that the use case for run-time is rare (how often will the a compatable CPU have different cache line lengths?) and the costs of supporting this (awkward code) affect everyone, all the time.

OTOH, it just struck me – maybe in the mobile world, we see CPUs across Android platforms varying their cache-line widths? but they’re all ARMs… my guess is they’re all 32 byte cache lines. Going to check now. If so, hmm… well, 64 bit ARMs are coming out. They will end up in phones. Then you have 32-bit Android and 64-bit Android. This will be masked from end users, but how? 32 bit binaries for everyone? fat binaries? automatic selection of correct build?

Problemo solvedo

Fortunately, however, you can use the ternary operator in array size definitions.

Learn somethin’ new every day

First time I’ve been using the preprocessor to do some math.

Turns out you can’t use sizeof() in math with preprocessor #if…

“The token sequence that makes up the constant expression undergoes macro replacement, except that names prefixed by defined are not expanded. In this context, the expression defined NAME or defined ( NAME ) evaluates to 1 if NAME is currently defined, 0 if it is not. Any other identifiers in the expression including those that are C keywords are replaced with the value 0. Then the expression is evaluated. The replacement even of keywords means that sizeof can’t be used in these expressions to get the result that you would normally expect.”

Makes sense, now I see it.

Function call stack alignment

On x64, functionn call stack alignment is 16 bytes.

On ARM, 8 bytes.

On x86, 4 bytes.

Contigious double-word CAS on x86 requires 8 byte alignment.

It appears that stack alignment cannot be controlled through the MSVC compiler (recent versions of GCC support this, though).

Accordingly, on x86, either I pass in the address of an already aligned variable and deref, or I pass by value and copy to a local variable which is correctly aligned.

Nope

Ain’t gonna fly – large memory.

User really does have to fully control memory.

Man, high performance memory allocation is a real mess.

Road bump

Realised I’m still using malloc in the add-onlys.

Looking at add-only hash now. It has a table of add-only unbalanced btrees – in other words, it does a lot behind the scenes. At things stand, all those allocs will have to be performed by the user – the hash state, the table of btree states, and I’ll have to expose the hash internal compare/delete functions to the user, so he can use them in the btree init. Ugly.

Maybe I should consider some kind of malloc callback where the library can indicate which NUMA node it wants to be on…