Windows topology API seems to have a design flaw

I may be wrong, but I am really starting to think now that the extended MS API for getting system topology is actually broken – you can’t figure out the actual system topology from the information it gives you.

The problem is that processor groups can span NUMA nodes.

As such, processor groups can also span level three caches, because, to take a typical example, on my Intel the level three cache spans all logical processors in one socket, and that socket holds one NUMA node. If a processor group can span NUMA nodes, it can span level three caches.

Information about NUMA nodes is returned with a single group affinity member – so a NUMA node which has more than one processor group will have more than one record coming back from the Windows topology API. However, NUMA nodes also have an ID in their info structure, so I can tie together these multiple records and know they’re the same NUMA node.
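To make that concrete, here’s roughly how the NUMA records can be knitted back together – a sketch only, with my own struct standing in for the relevant fields of the real NUMA_NODE_RELATIONSHIP record (node ID, group number, group mask); all the names here are invented:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the relevant fields of a Windows NUMA record:
   the node ID plus a single GROUP_AFFINITY (group number and
   64-bit processor mask).  Names are illustrative, not the SDK's. */
struct numa_record {
  uint32_t node_number;   /* unique NUMA node ID - this is what saves us  */
  uint16_t group_number;  /* which processor group this record describes  */
  uint64_t group_mask;    /* the node's logical processors in that group  */
};

#define MAX_GROUPS 4

/* Per-node merged view: one mask per processor group. */
struct numa_node {
  uint32_t node_number;
  uint64_t mask_by_group[MAX_GROUPS];
};

/* Merge the multiple per-group records for each NUMA node into a single
   entry, keyed on node_number.  Returns the number of distinct nodes,
   or -1 if the caller's array is too small. */
int merge_numa_records( struct numa_record const *recs, int num_recs,
                        struct numa_node *nodes, int max_nodes )
{
  int num_nodes = 0;

  for( int r = 0; r < num_recs; r++ )
  {
    struct numa_node *n = NULL;

    /* The unique node ID lets us find an existing entry for this node. */
    for( int i = 0; i < num_nodes; i++ )
      if( nodes[i].node_number == recs[r].node_number )
        n = &nodes[i];

    if( n == NULL )
    {
      if( num_nodes == max_nodes )
        return -1;
      n = &nodes[num_nodes++];
      n->node_number = recs[r].node_number;
      for( int g = 0; g < MAX_GROUPS; g++ )
        n->mask_by_group[g] = 0;
    }

    n->mask_by_group[recs[r].group_number] |= recs[r].group_mask;
  }

  return num_nodes;
}
```

The crucial point is the first loop body: without the unique node_number, there is nothing to key the merge on – which is exactly the situation with cache records.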

HOWEVER! Cache records also have only a single group affinity member – BUT THEY HAVE NO UNIQUE ID.

So a level three cache with more than one processor group will return multiple records – but I will have no way of knowing they are records for the same cache.

That means I cannot actually know the topology of the system – I can’t properly know caches – and I care about that, because caches are fundamental to memory behaviour, and so to the selection of thread sets when benchmarking performance.

I think the same issue may exist for level one and two caches. L1 and L2 are per physical core, but a processor group is made up of logical processors. If a physical core with two logical processors has one in group 0 and the other in group 1, then the L1 and L2 caches for that core will also be in two processor groups.

Oh, processor groups. What a fantastic bloody idea not.

I am not surprised this issue exists. The API to get topology info is appallingly complex, and inadequately documented for its complexity. The processor group concept squared the complexity of the data – and it was complex enough to begin with.

I will need to think more, but if I can’t figure something more out, then the benchmark application cannot support processor groups on Windows due to flaws in the Windows topology information API.

Christ, it gets worse.

Check this:

Windows (as far as I know) tries to minimize the number of processor groups by making them as large as possible (up to their 64-logical-processor limit – 64 cores, 640kb of RAM, you’d think they’d notice). Naturally, if the number of logical cores in the system isn’t divisible by 64, you end up with uneven groups. (It also implies a processor group, which is the basic scheduling entity, can contain multiple NUMA nodes, which implies scheduling is not actually NUMA-aware, but only processor-group-aware – and that’s insane.)

Intel brought out a 10-core CPU with 20 logical cores (bloody inevitable, given that core counts keep creeping up), which triggers exactly this.

So there’s now a hotfix for Windows which modifies how processor groups are formed up, to ensure even groups. So some systems will have this, and some not…
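For illustration – suppose a four-socket box of those CPUs, giving 80 logical processors. The two formation policies produce very different groups. This is a sketch of my understanding of the policies, not Microsoft’s actual code; the “even” split in particular is an assumption about what the hotfix does:

```c
#include <assert.h>

/* Two ways to carve N logical processors into processor groups.
   "Greedy": fill groups of 64 until the processors run out.
   "Even" (post-hotfix, as I understand it): spread the processors as
   evenly as possible over the minimum number of groups. */

#define GROUP_LIMIT 64

/* Returns the number of groups formed; sizes[] receives each group's size. */
int greedy_group_sizes( int num_lps, int sizes[], int max_groups )
{
  int n = 0;
  while( num_lps > 0 && n < max_groups )
  {
    sizes[n] = num_lps > GROUP_LIMIT ? GROUP_LIMIT : num_lps;
    num_lps -= sizes[n++];
  }
  return n;
}

int even_group_sizes( int num_lps, int sizes[], int max_groups )
{
  int num_groups = (num_lps + GROUP_LIMIT - 1) / GROUP_LIMIT;
  if( num_groups > max_groups )
    return -1;
  /* Spread the remainder one processor at a time over the first groups. */
  for( int i = 0; i < num_groups; i++ )
    sizes[i] = num_lps / num_groups + ( i < num_lps % num_groups ? 1 : 0 );
  return num_groups;
}
```

With 80 logical processors, greedy gives groups of 64 and 16; even gives 40 and 40. Same machine, different topology depending on which policy the installed Windows happens to use – which is the problem.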

Bloody Microsoft

I’ve been rewriting the topology code.

This code is partially an abstraction layer (and so that part must be friendly to users, easy to understand and implement). It has the job of representing the system processor topology in a btree, i.e. NUMA nodes, sockets, caches, physical cores, logical cores.

Under Linux, it’s easy.

Under Windows, OMG I don’t even

The problem is Windows has added into the mix a completely redundant, unnecessary and complicating concept called a “processor group”. Logical core IDs only run from 0 to 63 – 64 being the largest number of logical cores in one processor group. So to identify a core, you must identify the processor group too. Processor groups *SPAN NUMA NODES*, so their members are potentially HETEROGENEOUS.

Because the processor group must be specified, when you start using the Windows API for enumerating topology, you can have MULTIPLE entries for the SAME entity, where each entry lists a set of logical cores *for the same processor group*. So a given socket might have say four entries, all the logical processors in processor group 0, then all those in group 1, etc.

Now, remember – I’m representing the topology in a btree. To link an element, in fact I need to know its full set of logical processors, so I can sort it to the correct location in the tree. In Windows, I can only KNOW that I know this BY FULLY ENUMERATING THE ENTIRE TOPOLOGY, i.e. I actually have to iterate over the entire topology for EVERYTHING I FIND, to ensure I have every logical processor which belongs to it.
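The consequence in code: the canonical sort key for a topology entity can only be computed once every one of its per-group masks has been collected, i.e. after full enumeration. A sketch, with invented names – here the key is simply the entity’s lowest (group, logical processor) pair:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_GROUPS 4

/* Canonical ordering key for a topology entity, computed from its
   complete per-group processor masks.  Encodes the lowest logical
   processor owned by the entity as group*64 + bit index; returns
   UINT32_MAX if the entity owns no processors at all.

   Because a later record may add processors in an earlier group,
   this key is only stable once enumeration has finished - which is
   why nothing can be placed in the tree until then. */
uint32_t entity_sort_key( uint64_t const mask_by_group[MAX_GROUPS] )
{
  for( uint32_t g = 0; g < MAX_GROUPS; g++ )
    if( mask_by_group[g] != 0 )
      for( uint32_t bit = 0; bit < 64; bit++ )
        if( mask_by_group[g] & (1ull << bit) )
          return g * 64 + bit;

  return UINT32_MAX;
}
```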

The only reason for this madness is processor groups, which I’m going to utterly ignore, always (except I need to know them to be able to start threads on the logical cores I want).

Test and benchmark

I’ve been thinking quite a bit.

I’m currently (re-re-re-re-)writing the benchmark app.

That’s fine. I’ve gone back to using a library for as much as possible, because I find it easier to think and lay out code properly. When writing libraries these days, I look to avoid memory allocation – I have the user pass in allocations. The problem comes when the size of an allocation cannot reasonably be known beforehand – it’s the result of some complex work inside the library, so the user can’t know in advance how much store to allocate.

One simple, inefficient solution is to ask the user to pass in worst-case allocations. For smaller allocations, where the worst case can easily be calculated, this can be okay, and I’m pretty sure I can use this approach for the benchmark app in its current form. This is because the benchmark app makes no large per-NUMA node allocations; all the per-NUMA stuff can be allocated on the stack, in each benchmark thread, and so naturally (normally – the OS decides, but this is how Windows and Linux work) goes on the NUMA node for that thread.
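A sketch of what the worst-case-allocation API looks like in practice – every name and size here is invented for illustration, not the actual benchmark code:

```c
#include <assert.h>
#include <stddef.h>

/* The library cannot malloc, so it publishes a worst-case store
   requirement which the caller allocates and passes in.  Both
   figures below are made up for the example. */
#define BENCHMARK_MAX_LOGICAL_PROCESSORS  1024
#define BENCHMARK_WORST_CASE_STORE_IN_BYTES \
  ( 4096 + BENCHMARK_MAX_LOGICAL_PROCESSORS * 256 )

struct benchmark_state {
  void *store;       /* caller-supplied memory            */
  size_t store_size; /* total size of that memory         */
  size_t used;       /* bump-allocated so far by the lib  */
};

/* Caller hands over its allocation; the library refuses to start
   with less than the published worst case. */
int benchmark_init( struct benchmark_state *bs, void *store, size_t store_size )
{
  if( store == NULL || store_size < BENCHMARK_WORST_CASE_STORE_IN_BYTES )
    return -1;

  bs->store = store;
  bs->store_size = store_size;
  bs->used = 0;

  return 0;
}
```

The caller side is then trivial: a static array of the worst-case size on hosted platforms, or a reserved chunk of heap on embedded.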

If the benchmark did make large per-NUMA node allocations, the user would have to make them and hand them over…

…which is exactly how things are with the test app. This *does* make large per-NUMA node allocations.

In fact, what ends up emerging from all this is that the benchmark and test both become libraries, and the apps are just stdlib main()-type binaries which call the very high level benchmark/test call (singular), handing over the necessary memory, and getting results back.

This way embedded users who cannot malloc simply reserve heap space and pass it in.

(There’s an abstraction layer for I/O: output goes into a line buffer, which is passed to a user-provided callback – both the normal output and the gnuplot data go through this interface, which on Windows/Linux means stdout – the user pipes to disk.)
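A sketch of that I/O abstraction – invented names, assuming a fixed-size line buffer and a void* of user state:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The library formats one line at a time into a buffer and hands it
   to a user callback.  On Windows/Linux the callback just writes to
   stdout; an embedded user points it at a UART, a ring buffer, etc. */

typedef void (*output_callback_t)( char const *line, void *user_state );

void emit_line( output_callback_t cb, void *user_state, char const *text )
{
  char line_buffer[256];

  snprintf( line_buffer, sizeof(line_buffer), "%s\n", text );
  cb( line_buffer, user_state );
}

/* The hosted default: write to stdout, user redirects to disk. */
void stdout_callback( char const *line, void *user_state )
{
  (void) user_state;
  fputs( line, stdout );
}

/* A test-friendly callback which just records the last line it saw. */
char last_line[256];
void capture_callback( char const *line, void *user_state )
{
  (void) user_state;
  strncpy( last_line, line, sizeof(last_line) - 1 );
}
```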

This almost suggests test and benchmark become integral to the liblfds library – but it isn’t so, for two reasons: firstly, it makes the library bigger, when test and benchmark only need to run for particular reasons rather than being generally available; and secondly, test requires a thread-starter/waiter abstraction, and benchmark requires a thread-starter/waiter and topology abstraction.

The test app generally right now needs to be rewritten from scratch to be something really decent.

The benchmark app is now in its design really decent, but I’m at the point where I need to choose between a malloc or non-malloc approach.


I currently have the Ci20 board doing a brute-force on Tor .onion ‘domain names’ to give me a vanity name which starts with ‘liblfds’ and then is followed by a number.

Once I have this, I’ll move fully to the .onion site, and leave a placeholder on the domain, which explains where the site has gone.

.onion endpoint up

I have created a .onion endpoint for the site.


The blog doesn’t work, as WordPress redirects itself to the URL it is configured to think of as its home – it can’t exist on two URLs at the same time, because it isn’t using relative URLs throughout. I’ll have to think about what to do about this.


Just added per-data structure CAS/DWCAS backoff values. Removed the PRNG part of the backoff, but I think I said that earlier.

Been working on benchmark. Starting pretty much from scratch, but with the right ideas this time – btree for benchmarks, btree for results, abstraction layer for the locks.

The benchmark app has been a tar pit, because I’ve kept NOT starting from scratch, and so been mentally constrained in my thinking by the code which already exists – which meant I was never able to break away from its design flaws.

I think the benchmark will be done, and done *PROPERLY AT LAST FUCK ME* in a week.

That thing has taken up an order of magnitude more time than it should have. Live and learn – tar pits.

Once benchmark is done, and I’ve sorted out power for the Pi and Ci20, I need to write a build and test system. I can’t keep manually making releases, it’s way too time consuming (and error prone).

Jesus, that reminds me, I’ll need to reinstall the Windows 8 trial to get the MSVC 2013 dev environment up… that whole process is so costly, in time and effort and mental agony. I literally dread doing it. It’s a plague upon the library and my personal happiness. I need to find a better way.

First gnuplot


Absolutely fascinating. When placed in graphical form, behaviours otherwise obscured in the mass of numbers become visible to the eye. This particular machine has dual sockets, with a total of 36 logical cores. That’s a strange number for a hypercube, and so I suspect some of the cores are disadvantaged relative to the others when it comes to memory access – and so we see a slow core…

This of course was playing merry hell with the standard deviation numbers.

PRNG ha!

Adding an ABA counter to each queue state doesn’t solve the problem because they all init to 0, so if we have elements bouncing between queues, all those queues to begin with are going to have similar ABA values, i.e. the problem isn’t fixed.

So I need to keep the PRNG code so I can init each per-queue ABA counter to a different value!

loling here 🙂

At least all that PRNG code isn’t wasted!
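So the init path ends up looking something like this – a sketch with invented names, using xorshift64 as a stand-in for whatever the PRNG actually is:

```c
#include <assert.h>
#include <stdint.h>

/* A tiny xorshift64 PRNG, used only to give each queue's ABA counter
   a different starting value, so elements bouncing between queues
   don't meet near-identical counters. */
uint64_t xorshift64( uint64_t *state )
{
  uint64_t x = *state;
  x ^= x << 13;
  x ^= x >> 7;
  x ^= x << 17;
  *state = x;
  return x;
}

struct queue_state {
  uint64_t aba_counter;
};

void queue_init( struct queue_state *qs, uint64_t *prng_state )
{
  /* Init to a random value rather than 0 - two queues initialised
     back-to-back now start their ABA counters far apart. */
  qs->aba_counter = xorshift64( prng_state );
}
```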

Removing PRNG

As subject.

A downside is that PRNG was being used to control the initial counter value for the next pointer in queue elements.

Now I have to use an atomic add, which will I think be the ONLY use of atomic add in the code base, and it’ll mean a bottleneck on that counter in the queue state.

Nothing else for it, except per-thread state, and I can’t justify per-thread state to avoid one atomic add.
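In C11 terms the replacement is something like this – a sketch, not the actual liblfds code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* A single shared counter in the queue state, bumped with an atomic
   add to hand out initial next-pointer counter values.  Correct, but
   every thread claiming a value now contends on this one cache line -
   which is the bottleneck complained about above. */
struct queue_counter {
  _Atomic uint64_t next_initial_counter;
};

void queue_counter_init( struct queue_counter *qc, uint64_t initial )
{
  atomic_init( &qc->next_initial_counter, initial );
}

uint64_t claim_initial_counter( struct queue_counter *qc )
{
  /* Returns the pre-increment value, so each caller gets a unique one. */
  return atomic_fetch_add( &qc->next_initial_counter, 1 );
}
```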