Windows topology API seems to have a design flaw

I may be wrong, but I am really starting to think now that the extended MS API for getting system topology is actually broken – you can’t figure out the actual system topology from the information it gives you.

The problem is that processor groups can span NUMA nodes.

As such, processor groups can also span level three caches: as a typical example, on my Intel the level 3 cache spans all logical processors in one socket, and that socket has one NUMA node. If a processor group can span NUMA nodes, it can span level three caches.

Information about NUMA nodes is returned with a single group affinity member – so a NUMA node which spans more than one processor group will have more than one record coming back from the Windows topology API. However, NUMA nodes also have an ID in their info structure, so I can tie together these multiple records and know they’re the same NUMA node.

HOWEVER! Cache records also have only a single group affinity member – BUT THEY HAVE NO UNIQUE ID.

So a level three cache which spans more than one processor group will return multiple records – but I will have no way of knowing they are records for the same cache.
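To make the gap concrete, here’s a minimal sketch of the enumeration in question – assuming Windows 7 or later and the Windows-7-era structure layouts – printing what each NUMA node and cache record actually carries. The NUMA node records have NodeNumber; the cache records have nothing equivalent.

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main( void )
{
  char
    *buffer,
    *p;

  DWORD
    length = 0;

  // the first call fails, but tells us the buffer size required
  GetLogicalProcessorInformationEx( RelationAll, NULL, &length );

  buffer = malloc( length );

  if( buffer == NULL || !GetLogicalProcessorInformationEx(RelationAll, (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *) buffer, &length) )
    return EXIT_FAILURE;

  // records are variable length - advance by each record's own Size field
  for( p = buffer ; p < buffer + length ; p += ((SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *) p)->Size )
  {
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX
      *info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *) p;

    if( info->Relationship == RelationNumaNode )
      // a node spanning N groups produces N records, but each carries NodeNumber,
      // so the records can be tied back together
      printf( "NUMA node %lu : group %u, mask 0x%llx\n",
              info->NumaNode.NodeNumber, info->NumaNode.GroupMask.Group,
              (unsigned long long) info->NumaNode.GroupMask.Mask );

    if( info->Relationship == RelationCache )
      // a cache spanning N groups also produces N records, but there is no cache ID -
      // so two same-level records in different groups could be two halves of one
      // cache, or two entirely separate caches
      printf( "L%u cache : group %u, mask 0x%llx\n",
              info->Cache.Level, info->Cache.GroupMask.Group,
              (unsigned long long) info->Cache.GroupMask.Mask );
  }

  free( buffer );

  return EXIT_SUCCESS;
}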

That means I cannot actually know the topology of the system – I can’t properly know caches – and I care about that, because caches are fundamental to memory behaviour and so to the selection of thread sets when benchmarking performance.

I think the same issue may exist for level one and two caches. L1 and L2 are per physical core, but a processor group is made up of logical cores. If a physical core with two logical cores has one logical core in group 0 and the other in group 1, then the L1 and L2 caches for that physical core will also be in two processor groups.

Oh, processor groups. What a fantastic bloody idea not.

I am not surprised this issue exists. The API to get topology info is appallingly complex and inadequately documented for its complexity. The processor group concept raised the complexity of the data by a power of two – and it’s complex enough to begin with.

I will need to think more, but if I can’t figure something more out, then the benchmark application cannot support processor groups on Windows due to flaws in the Windows topology information API.

Christ, it gets worse.

Check this:

https://blogs.msdn.microsoft.com/saponsqlserver/2011/10/08/uneven-windows-processor-groups/

Windows (as far as I know) tries to minimize the number of processor groups by making them as large as possible (up to their 64 core limit – 64 cores, 640KB RAM, you’d think they’d notice). However, naturally, if the number of logical cores in the system isn’t divisible by 64, you end up with uneven groups. (It also implies a processor group, which is the basic scheduling entity, can have multiple NUMA nodes, which implies scheduling is not actually NUMA aware, but only processor group aware – and that’s insane.)

Intel brought out a 10 core CPU (bloody obvious this was coming, given the relatively slow rate at which core counts rise) with 20 logical cores, which induces exactly this.

So there’s now a hotfix for Windows which modifies how processor groups are formed up, to ensure even groups. So some systems will have this, and some not…
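As an illustration of the unevenness (my numbers, not the blog’s): four NUMA nodes of twenty logical processors each is eighty in total; greedily packing nodes into groups of up to 64 gives groups of 60 and 20, while an even split gives 40 and 40. Either way, a quick check – assuming Windows 7 or later – shows which grouping a given box actually ended up with:

#include <windows.h>
#include <stdio.h>

int main( void )
{
  WORD
    group,
    number_of_groups = GetActiveProcessorGroupCount();

  printf( "%u group(s), %lu logical processors in total\n",
          number_of_groups, GetActiveProcessorCount(ALL_PROCESSOR_GROUPS) );

  // with the hotfix these counts come out even, without it they may not
  for( group = 0 ; group < number_of_groups ; group++ )
    printf( "group %u : %lu logical processors\n", group, GetActiveProcessorCount(group) );

  return 0;
}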

Bloody Microsoft

I’ve been rewriting the topology code.

This code is partially an abstraction layer (and so that part must be friendly to users, easy to understand and implement). It has the job of representing the system processor topology in a btree, i.e. NUMA nodes, sockets, caches, physical cores, logical cores.

Under Linux, it’s easy.

Under Windows, OMG I don’t even

The problem is Windows has added into the mix a completely redundant, unnecessary and complicating concept called a “processor group”. Actual logical core IDs only go up to 64 – the largest number of logical cores in one processor group – so to identify a core, you must identify its processor group, too. Processor groups *SPAN NUMA NODES*, so their members are potentially HETEROGENEOUS.

Because the processor group must be specified, when you start using the Windows API for enumerating topology, you can have MULTIPLE entries for the SAME entity, where each entry lists the set of logical cores *in one processor group*. So a given socket might have, say, four entries: all the logical processors in processor group 0, then all those in group 1, etc.

Now, remember – I’m representing the topology in a btree. To link an element, I in fact need to know its full set of logical processors, so I can sort it to the correct location in the tree. On Windows, I can only KNOW that I know this BY FULLY ENUMERATING THE ENTIRE TOPOLOGY, i.e. I actually have to iterate over the entire topology for EVERYTHING I FIND, to ensure I have every logical processor which belongs to it.
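What that means in practice is something like the sketch below – illustrative names, not the actual abstraction layer: a logical processor on Windows is a (group, bit) pair, so an entity’s full processor set has to be a per-group array of masks, and only once every record in the enumeration has been folded into it can the entity be compared against others and sorted into the btree.

#include <windows.h>
#include <string.h>

#define MAX_PROCESSOR_GROUPS  16

struct logical_processor_set
{
  KAFFINITY
    mask[MAX_PROCESSOR_GROUPS];   // one mask per processor group (64 bits on 64-bit Windows)
};

static void logical_processor_set_init( struct logical_processor_set *lps )
{
  memset( lps, 0, sizeof(*lps) );
}

// fold one GROUP_AFFINITY, as found in a topology record, into the set
static void logical_processor_set_add( struct logical_processor_set *lps, GROUP_AFFINITY const *ga )
{
  lps->mask[ga->Group] |= ga->Mask;
}

// total ordering over fully-populated sets, usable as the btree compare function
static int logical_processor_set_compare( struct logical_processor_set const *a, struct logical_processor_set const *b )
{
  return memcmp( a->mask, b->mask, sizeof(a->mask) );
}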

The only reason for this madness is processor groups, which I’m going to utterly ignore, always (except I need to know them to be able to start threads on the logical cores I want).

Test and benchmark

I’ve been thinking quite a bit.

I’m currently (re-re-re-re-)writing the benchmark app.

That’s fine. I’ve gone back to using a library for as much stuff as possible, because I find it easier to think and lay out code properly. When writing libraries these days, I look to avoid memory allocation – I have the user pass in allocations. The problem is when you come to something where the size of the allocation cannot reasonably be known – it’s the result of some complex work inside the library, so the user can’t know beforehand how much store to allocate.

One simple, inefficient solution is to ask the user to pass in worst-case allocations. For smaller allocations, where the worst case can easily be calculated, this can be okay, and I’m pretty sure I can use this approach for the benchmark app in its current form – although this is only because the benchmark app makes no large per-NUMA node allocations; all the per-NUMA stuff can be allocated on the stack, in each benchmark thread, and so naturally (normally – the OS decides, but this is how Windows and Linux behave) goes on the NUMA node for that thread.
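In outline, and with entirely hypothetical names, the two shapes being weighed up look like this – a worst-case size the user can compute up front, versus a query call whose answer the user then allocates; either way the library itself never allocates:

#include <stddef.h>

// style one : a worst case the user can compute at compile time
#define BENCHMARK_STORE_SIZE_WORST_CASE( max_logical_processors )  ( 1024 * (max_logical_processors) )

// style two : the library works out at runtime what this system actually needs
size_t benchmark_query_store_size( void );

// either way, the entry point is handed its store and never mallocs
int benchmark_init( void *store, size_t store_size_in_bytes );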

If the benchmark did make large per-NUMA node allocations, the user would have to make them and hand them over…

…which is exactly how things are with the test app. This *does* make large per-NUMA node allocations.

In fact, what ends up emerging from all this is that the benchmark and test both become libraries, and the apps are just stdlib main()-type binaries which call the very high level benchmark/test call (singular), handing over the necessary memory, and getting results back.

This way embedded users who cannot malloc simply reserve heap space and pass it in.

(There’s an abstraction layer for I/O, where output goes into a line buffer which is passed to a user-provided callback – both the normal output and the gnuplots go through this interface, which on Windows/Linux means stdout – the user pipes to disk.)
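Put together, the shape is roughly as below – hypothetical names again, not the real API: one very high level call, store handed in by the caller, every line of output (normal and gnuplot alike) delivered through a callback, which on a hosted platform just writes to stdout.

#include <stddef.h>
#include <stdio.h>

// every line of output - normal text and gnuplot data alike - arrives here
typedef void (*benchmark_line_callback)( void *user_state, char const *line );

// the single high level call : caller hands over store, results come back via the callback
int benchmark_run( void *store, size_t store_size_in_bytes, benchmark_line_callback callback, void *user_state );

// on Windows/Linux the callback just writes to stdout and the user pipes to disk
static void stdout_line_callback( void *user_state, char const *line )
{
  (void) user_state;

  fputs( line, stdout );
  fputs( "\n", stdout );
}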

This almost suggests test and benchmark should become integral to the liblfds library – but this isn’t so, for two reasons: firstly, it makes the library bigger, where test and benchmark only need to run for particular reasons rather than being generally available; and secondly, test requires a thread-starter/waiter abstraction, while benchmark requires both a thread-starter/waiter and a topology abstraction.

The test app generally right now needs to be rewritten from scratch to be something really decent.

The benchmark app is now in its design really decent, but I’m at the point where I need to choose between a malloc or non-malloc approach.

.onion

I currently have the Ci20 board doing a brute-force on Tor .onion ‘domain names’ to give me a vanity name which starts with ‘liblfds’ and then is followed by a number.

Once I have this, I’ll move liblfds.org fully to the .onion site, and leave a placeholder on the liblfds.org domain, which explains where the site has gone.

OMFG

AHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH.

I have to call the Swedish Tax Agency – apparently there’s a rebate.

I sit on hold for just over 20 minutes.

That’s how I know I’m dealing with the State.

JUST LITERALLY TWO SECONDS before a woman answers, the USB hub disconnects/reconnects and this stops the mic from working.

ahhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh!!!!!!!!!!!!!!!!!

*pause*

AHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!! !!!!!!!!!!!!!!!!

I have to hang up.

This happened on my previous call to Rabobank. On the phone for a bit sorting stuff out, the chap goes off to find something out, he comes back, my mic is dead.

Mic is now directly plugged into the laptop.

You know, all the external USB kit I have – the hub, the external enclosures (both types), NONE OF IT worked/works properly with USB 3.

I can’t order replacements while I’m in the EU/Malta; they’re super, super expensive – 2x the US cost *and in euros as well*. A 30 USD enclosure is 60 euro here. It’s like being asked to pay 20 euro for a loaf of bread. You *could*, but you never would.

Too much tax and regulation – one effect of which is that I will now have spent 40 minutes on hold. There is often a human cost, even in the most mundane of ways.

.onion endpoint up

I have created a .onion endpoint for the site.

http://k263wv5uwidihxl4.onion:8080/

The blog doesn’t work, as WordPress redirects itself to the URL it is configured to think of as its home, so it can’t exist on two URLs at the same time – it isn’t using relative URLs all the time. Have to think about what to do about this.

Television news reporting

Very recently, after a month without, I have begun again to attend a gymnasium.

I do not own and so do not watch television, either broadcast or online. This has been so now for twenty-one years.

Gymnasiums are often fitted with televisions, and so perforce in that environment I encounter television.

Sometimes, a news programme is showing.

This happened today.

A man in the USA took a rifle and attacked a group of strangers. He killed fourteen people before being killed himself.

The motive is unclear but may be ideological or religious, as he was Muslim.

This is horrific, appalling, but it is a kind of horror which we are inured to, and which in its nature seems to originate essentially from a mind being so utterly lost, and unable to reason for itself, and possibly damaged in some way, that it comes to see the world in such a way that this all seems like the right thing to do.

With 300 million people in the USA, there are always going to be, at the extreme end of the curve, a very small number of cases where these unimaginable events occur.

But.

The reporting of the story consisted of video and photographs of the scene while the rescue efforts were still in progress, with ambulances and medical staff all about and the wounded being tended to, the dead being covered up. There was a photo of the head and shoulders of a woman lying on a stretcher trolley, with some kind of oxygen mask on. A photo taken from the air, I think from a drone, of a bloody body covered by a sheet which was soaked through with blood. You get the idea.

These people, who were shot, have experienced the most horrific and traumatic event of their lives. Appalling fear and terror, of people around being shot, screaming and dying, the possibility of being shot, then seeing the gunman turning his gun towards them, bullets firing, unimaginable pain and terror as the bullets actually rip through their bodies, maiming them, the terrible fear of dying, which many of them did.

These people are experiencing the most agonizing trauma of their entire lives.

At this moment, news photographers are dashing in, as fast as they can, to grab as much video and photographic footage of these people at this very moment, to broadcast their faces, their suffering, their wounds, to tens of millions.

It is foul.

These are not people led astray by years of religious ranting delivered to a mind unable to reason for itself; they are everyday people, you and me, normal men in the street – violating the wounded at the most agonizing moment of their lives.

I was profoundly affected by this – far more so than by the shooting itself, as these are acts which we have become inured to, and which come from a cause we can more easily assign to something akin to madness.

It puts me in mind, in fact, of the two Dutch brothers who, for the monetary reward, informed the Gestapo as to the whereabouts of Anne Frank, who was then, as the brothers knew she would be, assaulted and imprisoned, then dispatched to a concentration camp, where after a few months of unimaginable horror and trauma she died, at the young age of fifteen, just as her life was beginning, of malnutrition and disease and no doubt of the incredible trauma of all that she saw and herself experienced.

They knowingly inflicted that upon her, for that which they themselves gained.

It is no different in its nature to photographing these people at this time.

Update

Just added per-data structure CAS/DWCAS backoff values. Removed the PRNG part of the backoff, but I think I said that earlier.
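For the curious, the idea is along these lines – a sketch, not the library’s actual backoff code, using GCC’s __atomic builtins for brevity; the per-data-structure value is the iteration count spun between failed CASes.

#include <stdint.h>

// the per-data-structure tunable : how long to spin after a failed CAS
#define QUEUE_CAS_BACKOFF_ITERATIONS  64u

static void backoff( uint64_t iterations )
{
  volatile uint64_t
    loop;

  // burn a little time before contending again (no PRNG - fixed length)
  for( loop = 0 ; loop < iterations ; loop++ );
}

// atomically add increment to *target, backing off between failed CASes
static uint64_t add_with_backoff( uint64_t *target, uint64_t increment )
{
  uint64_t
    expected = __atomic_load_n( target, __ATOMIC_RELAXED ),
    desired;

  for( ;; )
  {
    desired = expected + increment;

    // on failure, expected is reloaded with the current value of *target
    if( __atomic_compare_exchange_n(target, &expected, desired, 0, __ATOMIC_ACQ_REL, __ATOMIC_RELAXED) )
      return desired;

    backoff( QUEUE_CAS_BACKOFF_ITERATIONS );
  }
}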

Been working on benchmark. Starting pretty much from scratch, but with the right ideas this time – btree for benchmarks, btree for results, abstraction layer for the locks.

The benchmark app has been a tar pit, because I’ve kept NOT starting from scratch, and so been mentally constrained in my thinking by the code which already exists – which meant I was never able to break away from its design flaws.

I think the benchmark will be done, and done *PROPERLY AT LAST FUCK ME* in a week.

That thing has taken up an order of magnitude more time than it should have. Live and learn – tar pits.

Once benchmark is done, and I’ve sorted out power for the Pi and Ci20, I need to write a build and test system. I can’t keep manually making releases, it’s way too time consuming (and error prone).

Jesus, that reminds me, I’ll need to reinstall the Windows 8 trial to get the MSVC 2013 dev environment up… that whole process is so costly, in time and effort and mental agony. I literally dread doing it. It’s a plague upon the library and my personal happiness. I need to find a better way.

First gnuplot

[gnuplot chart: ops_per_lp_one_chart_per_lps_per_benchmark_one_datastructure_36cores_dwcas10only]

Absolutely fascinating. When placed in graphical form, behaviours otherwise obscured in the mass of numbers become visible to the eye. This particular machine has dual sockets, with a total of 36 logical cores. That’s a strange number, I suspect, for a hypercube, and so I suspect some of the cores are disadvantaged with regard to the others when it comes to memory access – and so we see a slow core…

This of course was playing merry hell with the standard deviation numbers.