I may be wrong, but I am really starting to think that the extended MS API for getting system topology (GetLogicalProcessorInformationEx) is actually broken – you can’t figure out the actual system topology from the information it gives you.
The problem is that processor groups can span NUMA nodes.
As such, processor groups can also span level three caches. As a typical example, on my Intel the level three cache spans all logical processors in one socket, and that socket has one NUMA node, so whatever happens to the NUMA node happens to the cache too.
Information about NUMA nodes is returned with a single group affinity member – so a NUMA node which has more than one processor group will have more than one record coming back from the Windows topology API. However, NUMA nodes also have an ID in their info structure, so I can tie together these multiple records and know they’re the same NUMA node.
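To make the NUMA case concrete, here is a minimal sketch of tying those records together by ID. The structs are hypothetical, simplified stand-ins for what the API returns (an ID plus one group and one mask per record), not the real Windows types, and MAX_GROUPS is an arbitrary bound picked for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one NUMA node record: an ID plus a
   single (group, mask) affinity, which is the shape described above. */
struct numa_record {
    uint32_t node_number;   /* the ID that ties records together */
    uint16_t group;         /* processor group this record covers */
    uint64_t mask;          /* logical processors within that group */
};

#define MAX_GROUPS 4        /* assumption: small fixed bound for the sketch */

/* Merged view of one NUMA node: one mask per processor group. */
struct numa_node {
    uint32_t node_number;
    uint64_t mask_per_group[MAX_GROUPS];
};

/* Merge records which share a node_number. Returns the number of distinct
   nodes written to out; out must have room for count entries. */
size_t merge_numa_records(const struct numa_record *recs, size_t count,
                          struct numa_node *out)
{
    size_t nodes = 0;

    for (size_t i = 0; i < count; i++) {
        size_t n = 0;

        assert(recs[i].group < MAX_GROUPS);

        /* look for a node already seen with this ID */
        while (n < nodes && out[n].node_number != recs[i].node_number)
            n++;

        if (n == nodes) {   /* first record for this node */
            out[n].node_number = recs[i].node_number;
            for (size_t g = 0; g < MAX_GROUPS; g++)
                out[n].mask_per_group[g] = 0;
            nodes++;
        }

        out[n].mask_per_group[recs[i].group] |= recs[i].mask;
    }

    return nodes;
}
```

Feed it three records where two carry node number 0 (one in group 0, one in group 1) and you get back two nodes, with node 0 holding a mask in both groups. The whole thing only works because the ID exists.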
HOWEVER! cache records also only have a single group affinity member – BUT THEY HAVE NO UNIQUE ID.
So a level three cache with more than one processor group will return multiple records – but I will have no way of knowing they are records for the same cache.
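Here is a sketch of why that is a dead end. Again the struct is a hypothetical, simplified stand-in (level, size, one group, one mask), and the rule that a cache gets at most one record per group is my assumption. The best a comparison can do for two matching records in different groups is shrug.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one cache record. Note what is
   missing: there is no ID field. */
struct cache_record {
    uint8_t  level;         /* 1, 2 or 3 */
    uint32_t size_bytes;
    uint16_t group;         /* processor group this record covers */
    uint64_t mask;          /* logical processors within that group */
};

enum cache_match { CACHE_DIFFERENT, CACHE_SAME, CACHE_UNKNOWN };

/* Decide whether two records describe the same physical cache.
   Assumption: a cache yields at most one record per processor group. */
enum cache_match compare_cache_records(const struct cache_record *a,
                                       const struct cache_record *b)
{
    assert(a != NULL && b != NULL);

    /* differing attributes: certainly not the same cache */
    if (a->level != b->level || a->size_bytes != b->size_bytes)
        return CACHE_DIFFERENT;

    /* same group: the mask tells them apart */
    if (a->group == b->group)
        return a->mask == b->mask ? CACHE_SAME : CACHE_DIFFERENT;

    /* same attributes, different groups: could be one cache split over two
       groups, or two identical caches, and nothing in the data says which */
    return CACHE_UNKNOWN;
}
```

Two 20 MB level three records, one in group 0 and one in group 1, come back as CACHE_UNKNOWN. On a machine where every socket has the same cache, that is exactly the case you need answered.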
That means I cannot actually know the topology of the system – I can’t properly know caches – and I care about that, because caches are fundamental to memory behaviour and so to the selection of thread sets for benchmarking performance.
I think the same issue may exist for level one and two caches. L1 and L2 are per physical processor, but a processor group is made up of logical cores. If a physical processor with two logical cores has one logical core in group 0 and the other in group 1, then the L1 and L2 caches for that physical processor will also be in two processor groups.
Oh, processor groups. What a fantastic bloody idea. Not.
I am not surprised this issue exists. The API to get topology info is appallingly complex and inadequately documented for its complexity. The processor group concept raised the complexity of the data by a power of two – and it’s complex enough to begin with.
I will need to think more, but if I can’t figure out something more, then the benchmark application cannot support processor groups on Windows, due to flaws in the Windows topology information API.
Christ, it gets worse.
Windows (as far as I know) tries to minimize the number of processor groups, by making them as large as possible (up to their 64 core limit – 64 cores, 640kb RAM, you’d think they’d notice). However, naturally, if the number of logical cores in the system isn’t divisible by 64, you end up with uneven groups. (It also implies a processor group, which is the basic scheduling entity, can have multiple NUMA nodes, which implies scheduling is not actually NUMA aware, but only processor group aware – and that’s insane).
Intel brought out a 10 core CPU (bloody obvious, given how slowly core counts go up) with 20 logical cores, and 20 does not divide evenly into 64, which induces these uneven groups.
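The arithmetic is easy to sketch. This is a naive model of "make each group as large as possible", not Windows’ real algorithm (which, as I understand it, also weighs physical locality, so actual group sizes may differ). The function name and the 80 core example (four sockets of 20 logical cores) are mine, for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_LIMIT 64u     /* maximum logical cores per processor group */

/* Naive model: fill each group up to GROUP_LIMIT before starting the next.
   Writes the size of each group to sizes and returns how many groups were
   used; stops early if max_groups is too small. */
size_t fill_groups(unsigned logical_cores, unsigned *sizes, size_t max_groups)
{
    size_t groups = 0;

    assert(sizes != NULL);

    while (logical_cores > 0 && groups < max_groups) {
        unsigned take = logical_cores < GROUP_LIMIT ? logical_cores
                                                    : GROUP_LIMIT;
        sizes[groups++] = take;
        logical_cores -= take;
    }

    return groups;
}
```

Under this model, 80 logical cores come out as one group of 64 and one of 16, while 128 come out as two tidy groups of 64.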
So there’s now a hotfix for Windows which modifies how processor groups are formed up, to ensure even groups. So some systems will have this, and some not…