twenty hours of debugging, problem now solved

I have just spent twenty hours (broken up by a night of sleep) debugging.

There was something amiss on aarch64 – benchmarks using DWCAS were segfaulting.

I uncovered one bug with the freelist elimination layer, which I fixed. Found another with thread startup ordering, and another with pointer/counter being the wrong way round still in 7.0.1 and 7.1.1.

Still, this didn’t fix the problem.

The particular example of the problem I focused on was this : the freelist benchmark for 7.0.1 with two threads. There’s two freelist elements, two threads, each pops and then pushes, in a tight loop. Problem is one of the threads was popping a NULL (empty freelist).

This in principle was impossible.

Well, I’ve just figured it out. After twenty hours of doing nothing but working on this, to figure it out. I’m pretty relieved and happy 🙂

So, on aarch64, I implement my own DWCAS, using LL/SC. Have to, because libatomic is totally messed up right now. I don’t use memory barriers in the DWCAS itself – for me, they’re separate, in the code itself, rather than in the DWCAS.

This means though that the DWCAS is issuing “ldxp” to load, and then “stxp” to store.

If any other thread writes to the load/store location after the ldxp, the stxp will fail – so we’re protected against other thread writing between our load and store.

But…

…we are NOT protected against LOADING THE WRONG VALUE IN THE FIRST PLACE.

So we have (say) no memory barriers. We load the original value, this prior to the DWCAS. Someone else then does a write. *We don’t see it*, because we have not issued a load barrier – and then we enter into the DWCAS. We load the wrong value – we’ve not seen their write yet – and then we compare it with the compare value we loaded prior to the DWCAS, and lo and behold, they are the same *and no one has written to the load/store location during the DWCAS*. So we store!! and that messes things up completely.

There has to be a load barrier between the load of the compare and the DWCAS.

It’s plain now – extra plain – I need data structure implementations per processor family.