
2023-04-09

Update

I have spent Easter getting the test/benchmark programme back on its feet.

The last time I worked on the library, many years ago now, I managed to dig myself into a deep hole - the library itself was fine, but I rewrote the test and benchmark programmes because I had added position independent variants of some of the data structures (for use with shared memory mapped at different virtual addresses) and this meant I needed to be able to test processes, not just threads.

Originally, I thought I would test multiple processes, and multiple threads in multiple processes, but after a while I realized it was enough to test wholly with threads and wholly with processes - but by then, it was too late. I'd spent months developing the new test/benchmark programme, it had become too complex, and I still had to rewrite all the tests and benchmarks themselves - and I wanted a break, and to work on something else for a while, which I did.

Getting back into the code has always been difficult because it became too complicated - threads, processes, Windows support, Linux support, benchmark variants using locking mechanisms, and so on.

However, over time, the pressure to get a new release out has been building in the back of my mind; after all, rather a lot of unpublished work exists, and of course the fact that the library does not support arm64 is absolutely insane, and I spent 1300 euro getting hold of a RISC-V dev board which I've not used at all for about three years now…

Getting back into complex code which is close to but not actually working, which you’ve not touched for many years, isn’t really on the cards. You need to start again, although you can re-use the code you can understand.

So I’ve been building a new test and benchmark library, from the code of the old library.

Most of it has been re-usable; I've been able to understand what it does from scratch. There was one improper entanglement of two APIs, which was making the code harder to hold in the mind, which I've sorted out. I'm currently working to get the library to the state where I can run the first test, and then after that I can proceed over the following weeks to rewrite all the tests, as this is needed wherever the new library operates tests differently to the old library.

There is also one more major question, that of platform support.

In particular, I am looking at dropping support in the test and benchmark programme for Windows and the Windows kernel.

There are a certain number of independent reasons for this, some technical, some non-technical.

Technical reasons are that the most recent Windows I have a licence for is Windows 7. I would need to buy a licence for Windows 11 and run that, and I have a problem with doing so, because Windows 11 emits unspecified telemetry which I suspect is properly tantamount to surveillance. Additionally, last time I looked, getting hold of command line compilers was problematic; and I can only use command line compilers, because a UI-based configuration interface is a total failure when you have many thousands of configuration settings to configure by hand - I use gnumake.

(Another reason, which finally went away about four years ago, was that Microsoft's compiler still supported only C89.)

Non-technical reasons are my views on Microsoft’s conduct with regard to forcing upgrades of Windows 10 onto users (the close button being taken to mean “proceed”), the AppGet affair, and my personal experiences with Microsoft over the last two years or so, which were entirely typical for a very large organization, which does not make them acceptable. I will also never forgive Microsoft for introducing top-quoting.

I find it hard, on an emotional level, to pay cash for a licence and then invest a lot of time working for the benefit of a corporation which has behaved objectionably toward me and others. It is also problematic to submit to undefined surveillance.

A lot of work has gone into Windows support, over the years. Throwing it out is difficult - but continuing support requires actions which are actually impossible; there’s no way I’m going to buy a licence, not given my views and feelings about Microsoft - and so then we get what we get.

The library itself, the data structures, is completely platform agnostic - it uses a bare C compiler. The library would no longer ship with build files for Windows and the Windows kernel, but the code should in theory compile just fine (although of course no code compiles just fine on a platform on which it has never been compiled, no matter what ought to happen in theory).

One question now of course is whether to change the C version required by the library. Currently it's C89. I need to examine the later standards (C99, C11, and so on), and see what I would get from them.

So, getting back to the here and now, I’m close to getting the first test running, on threads. I have to sort out the organization of the particular bit of code which starts threads.

Then I need to sort out process support - that will be the second test.

Then we’re good to go for the tests.

I need to add support for collecting results, to get the benchmarks going.

Once the tests and benchmarks are both running, then it’s a case of grinding away, day by day, at rewriting all the tests and benchmarks.

Once those are done, I then have to debug hazard pointers, or omit them from the release, and then a release can be made.

2023-04-10

Progress

So, good progress.

I believe I now have the first test running - but I have realized there is currently no mechanism for information or progress output!

I have also realized something completely obvious, which I had previously overlooked: you cannot run a benchmark without a timer.

Right now, the test suite is set up to run each test for a given number of iterations, so users can test on platforms without timers.

Problem is, you must have a timer if you want to benchmark.

Now, I have code for high resolution timers - that's not a problem. The problem is that both the test and benchmark code should use a timer if one is available, and if not, then the test must use iterations and the benchmark cannot run. That's complicated.
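
To pin the idea down, here's a sketch of what I have in mind - illustrative names only, not the actual libtest API:

/* sketch only - illustrative names, not the actual libtest API */

enum test_run_mode
{
  TEST_RUN_MODE_DURATION,   /* timer available : run for n seconds    */
  TEST_RUN_MODE_ITERATIONS  /* no timer        : run for n iterations */
};

struct test_run_config
{
  enum test_run_mode
    mode;

  unsigned long long int
    duration_in_seconds,
    number_iterations;
};

void test_run_config_init( struct test_run_config *trc,
                           int timer_is_available,
                           unsigned long long int duration_in_seconds,
                           unsigned long long int number_iterations )
{
  if( timer_is_available )
  {
    trc->mode = TEST_RUN_MODE_DURATION;
    trc->duration_in_seconds = duration_in_seconds;
  }
  else
  {
    trc->mode = TEST_RUN_MODE_ITERATIONS;
    trc->number_iterations = number_iterations;
  }
}

A test then consults the mode and runs either for a duration or for an iteration count, while a benchmark checks timer availability up front and simply refuses to run without one.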

I have to think about this.

Ah, I also need to implement a way in which tests/benchmarks return results. I need to collect the results, so I can make gnuplots after a benchmark has completed.
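
Roughly speaking, each benchmark needs to hand back one operation count per logical core, and those then get written out as a gnuplot data file. A sketch, with made-up names:

#include <stdio.h>

/* sketch only - one result per logical core per benchmark run */

struct benchmark_result
{
  unsigned int
    logical_core_number;

  unsigned long long int
    operation_count;
};

/* emit results as a whitespace-separated data file, which gnuplot
   can then read with e.g. plot "freelist.dat" using 1:2 with boxes */

void benchmark_results_emit_gnuplot( char const * const filename,
                                     struct benchmark_result const * const results,
                                     unsigned int const number_results )
{
  FILE
    *f;

  unsigned int
    loop;

  f = fopen( filename, "w" );

  if( f == NULL )
    return;

  for( loop = 0 ; loop < number_results ; loop++ )
    fprintf( f, "%u %llu\n", results[loop].logical_core_number, results[loop].operation_count );

  fclose( f );
}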

2023-04-12

Update

Got a bit of work done, yesterday.

Added timers to the abstraction layer, and now using them in the test.

Added in the first benchmark.

Not sorted out results, yet.

Need now to sort out spawning processes, and using shared memory (which is already in the abstraction layer), so I can make the first position independent test.
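
For reference, on Linux the underlying calls are shm_open and mmap - the abstraction layer wraps this sort of thing (a sketch, error handling omitted):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* minimal POSIX shared memory sketch - error handling omitted;
   mmap returns MAP_FAILED on error, and each process which maps
   the segment will in general get a different virtual address back */

void *map_shared_memory( char const * const name, size_t const size, int const create )
{
  int
    fd;

  void
    *addr;

  if( create )
  {
    fd = shm_open( name, O_CREAT | O_RDWR, 0600 );
    ftruncate( fd, (off_t) size );
  }
  else
    fd = shm_open( name, O_RDWR, 0600 );

  addr = mmap( NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );

  close( fd );

  return addr;
}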

Progress

> ../../bin/test
                       R                        R   = Notional system root     
                       N                        N   = NUMA node                
                       S                        S   = Socket (physical package)
                      L3U                       LnU = Level n unified cache    
      L2U             L2U       L2U L2U L2U L2U LnD = Level n data cache       
 P   P   P   P   P   P   P   P   P   P   P   P  LnI = Level n instruction cache
L1I L1I L1I L1I L1I L1I L1I L1I L1I L1I L1I L1I P   = Physical core            
L1D L1D L1D L1D L1D L1D L1D L1D L1D L1D L1D L1D nnn = Logical core             
011 010 009 008 007 006 005 004 003 002 001 000                                
Segmentation fault
> 

That’s what my Framework laptop looks like, topologically.

I had to debug the single-threaded data structure library to get it working, and that has revealed a previously masked bug, which is the cause of the seg fault.

It’s nice to see that test header output again :-)

Also now remembering all the gdb commands - not used them for six years!

C doesn’t need remembering, that’s built in.

That’s actually a pretty topology. It’s a big.LITTLE system. There are four big cores - you see them on the right - each with its own unified L2 cache, and there are eight little cores, on the left, each set of four sharing a single unified L2 cache.

I’ve turned hyperthreading off, which is why it’s one core per L1I/L1D.

2023-04-13

First Light

> ../../bin/test
                       R                        R   = Notional system root     
                       N                        N   = NUMA node                
                       S                        S   = Socket (physical package)
                      L3U                       LnU = Level n unified cache    
      L2U             L2U       L2U L2U L2U L2U LnD = Level n data cache       
 P   P   P   P   P   P   P   P   P   P   P   P  LnI = Level n instruction cache
L1I L1I L1I L1I L1I L1I L1I L1I L1I L1I L1I L1I P   = Physical core            
L1D L1D L1D L1D L1D L1D L1D L1D L1D L1D L1D L1D nnn = Logical core             
011 010 009 008 007 006 005 004 003 002 001 000                                
                                             1   227410000
                                         1   1   61340000 55380000
                                     1   1   1   71870000 35970000 5490000
                                 1   1   1   1   19850000 34670000 29890000 23830000
                             1   1   1   1   1   35200000 28980000 32200000 10490000 3460000
                         1   1   1   1   1   1   25800000 24780000 24580000 26210000 5930000 1120000
                     1   1   1   1   1   1   1   19310000 30820000 27740000 25630000 2820000 790000 820000
                 1   1   1   1   1   1   1   1   35630000 31020000 29490000 6850000 600000 500000 2760000 500000
             1   1   1   1   1   1   1   1   1   19120000 31140000 25610000 23050000 790000 840000 820000 810000 710000
         1   1   1   1   1   1   1   1   1   1   23200000 27450000 26510000 21290000 630000 680000 610000 570000 680000 660000
     1   1   1   1   1   1   1   1   1   1   1   20130000 22080000 22720000 22020000 750000 760000 810000 820000 840000 830000 850000
 1   1   1   1   1   1   1   1   1   1   1   1   16590000 23930000 22440000 20770000 690000 730000 700000 730000 740000 750000 640000 710000

This is the freelist benchmark, for the nodeallocation, position dependent variant of the data structure.

Release build, on a 12th Gen Intel processor, 4 big cores, 8 little cores.

The little cores really get the short end of the stick, at least with the thread sets currently in use. They’re really getting almost no work done at all.

The thread sets currently in use were designed for symmetric topologies.

Obviously, there is also an interest here in testing the little cores and the big cores on their own.

On the previous laptop, with a 5th Gen Intel, I was getting about 27,000,000 iterations on a single core.

Here, on a big core, I’m getting 227,410,000. That’s 10x, and I can’t help but think something special is going on, rather than this being really genuine.

When I moved to two cores, on the i5 I was getting 10,000,000 iterations per core.

Here, I’m getting 61,340,000. That’s still 6x, though.

I think the laptop began to thermally throttle once it was past the first few tests, or even earlier - the processor had hit, and was being held at, 100C, which is scary.

I need to sort out decent thermal throttling, because 100C is way too hot. I want to throttle at about 60C.
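
On Linux the temperature can be read from sysfs, so the test programme could do something like this between benchmarks - a sketch; which thermal_zone is the package sensor varies between machines:

#include <stdio.h>
#include <unistd.h>

/* sketch: wait until the CPU cools below the given threshold before
   running the next benchmark; /sys/class/thermal reports millidegrees
   Celsius, and thermal_zone0 may not be the package sensor everywhere */

void wait_until_cool( long const threshold_in_millidegrees )
{
  FILE
    *f;

  long
    temperature;

  for( ;; )
  {
    f = fopen( "/sys/class/thermal/thermal_zone0/temp", "r" );

    if( f == NULL )
      return;

    if( fscanf( f, "%ld", &temperature ) != 1 )
      temperature = 0;

    fclose( f );

    if( temperature < threshold_in_millidegrees )
      return;

    sleep( 1 );
  }
}

Called with 60000 (the values are millidegrees), that would hold the benchmark until the processor has cooled to 60C.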

2023-04-16

Processes

So, I’ve been thinking about how to test position independent data structures.

With the usual data structures, which are position dependent, I spin up threads.

With position independence, the expectation is to be using shared memory across multiple processes, each of course with its own virtual memory, and with the shared memory mapped to different virtual addresses.
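
The standard trick here, for the data structure itself, is to store offsets rather than absolute pointers, with each process adding its own mapping base address back in. A sketch of the idea:

#include <stddef.h>

/* sketch of the offset-pointer idea: links are stored as byte offsets
   from the start of the shared memory segment, so the structure remains
   valid no matter what virtual address each process maps it at */

struct pi_element
{
  ptrdiff_t
    next_offset;  /* offset of next element from segment base, or -1 for none */
};

static struct pi_element *offset_to_pointer( void * const segment_base, ptrdiff_t const offset )
{
  if( offset == -1 )
    return NULL;

  return (struct pi_element *) ( (char *) segment_base + offset );
}

static ptrdiff_t pointer_to_offset( void * const segment_base, struct pi_element const * const element )
{
  if( element == NULL )
    return -1;

  return (char const *) element - (char *) segment_base;
}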

So I need now to spin up processes.

Windows doesn’t have fork; all it can do is run a new executable from disk, so in the past, when I was working on this code, on Linux I used execve.

Now that I’m looking at ditching Windows support in the test and benchmark code, I am free to use fork, and it occurred to me I in fact had to, to support embedded systems - such systems would not have executable files for me to execute.

Then I discovered that after a fork the child can run only those functions which are safe to call inside a signal handler (not because they are being so called, but because the constraints imposed by having forked are identical to those of being inside a signal handler) - now, bear in mind here, the child will inherit the shared memory mapping from the parent and I think will then have identical virtual memory addresses, so we’re going to need to close the shared memory and open it again - but we can’t, because the functions to do so are not safe for use inside a signal handler.

So I have to execve anyway.
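
The spawn then ends up looking something like this - a sketch, with error handling and the real argument list omitted (the --child flag here is made up); execve and _exit are themselves on the async-signal-safe list, so this much is legitimate:

#include <sys/types.h>
#include <unistd.h>

/* sketch: spawn a child by fork + execve; between fork and execve only
   async-signal-safe functions may be called, and both execve and _exit
   are on that list */

pid_t spawn_child( char const * const path_to_test_binary )
{
  pid_t
    pid;

  char
    *argv[] = { "test", "--child", NULL },  /* "--child" is illustrative only */
    *envp[] = { NULL };

  pid = fork();

  if( pid == 0 )
  {
    execve( path_to_test_binary, argv, envp );
    _exit( 127 );  /* only reached if execve fails */
  }

  return pid;  /* in the parent: the child's pid, or -1 on failure */
}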

However, by the time I realized this, I had also realized an embedded platform which lacks files is also going to lack processes.

It’s been a voyage of discovery, folks! :-)

2023-04-22

Progress

Been getting the test binary back on its feet, as it’s now needed for execve.

I have a standard way for libraries to report versioning, which I needed to bring up to standard for libstds and libtest, and I needed to rename everything in libtest - I was writing libtest when I should have been writing test. That took a bit of doing!
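
The convention amounts to each library exporting a function which returns a static version string - a sketch, with illustrative names rather than the actual API:

/* sketch only - illustrative names, not the actual API */

#define LIBTEST_VERSION_STRING  "7.2.0"

char const *libtest_query_version_string( void )
{
  return LIBTEST_VERSION_STRING;
}

The test binary then calls each library's query function in turn, one line per library, which is what produces the -v output below.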

Then I needed to get the test binary itself back into shape - it was all over the place; I’ve excluded for now the code for where it runs as a spawned child process - that’s the next task.

So I now have this nice output to show for my efforts:

> ../../bin/test
test -d [n] -h -i [n] -m [n] -r -v
  -h     : help
  -r     : run (causes test to run; present so no args gives help)
  -v     : version
  -s [n] : test duration in integer seconds       (default : 3)
  -i [n] : number of times to run this test suite (default : 1)
  -m [n] : total memory for tests, in megabytes   (default : 256)
> ../../bin/test -v
test 7.2.0 (debug, user-mode)
libstds 1.0.0 (debug, Linux, user-mode, x86_64, GCC)
libtest 7.2.0 (debug, Linux, user-mode, x86_64, GCC)
liblfds 7.2.0 (debug, Linux, user-mode, x86_64, GCC >= 4.7.3)

Just noticed the argument list in the usage line doesn’t properly match up with the help text :-)


