Monthly Archives: May 2012

So…

Current thought is support everything on Intel and ARM (using LL/SC on ARM) and have partial support for the others (no freelist, no ringbuffer, no freelist-rather-than-SMR variants of queue and stack; so add-onlys, SMR queue and SMR stack).

CPU support

So…

x86/x64 has DCAS – supported
ARM has double-word LL/SC and viable LL/SC – supported
SPARC has CAS only – supported fully only if I have SMR support re-use (rather than just free); partial support (add-onlys, queue and stack; no freelist, no ringbuffer) with SMR supporting only free
MIPS and PowerPC – LL/SC emulating CAS only – so same as SPARC, unless I implement false sharing avoidance for malloc
Alpha – haven’t looked, prolly same as MIPS/PowerPC

LL/SC – so close and yet so far

LL/SC as LL/SC looks like a bust.

You can only use it to emulate CAS.

To use LL/SC as LL/SC you have to be able to do a store to memory between the LL and SC.

I’ve been looking at ARM, and in fact it turns out that the exclusive access “bit” used by LL/SC is in fact a region the size of which varies by platform by ranges from 8 to 2048 bytes.

So if the STR target is in the same block as the LL/SC target, the STR inbetween the LL and SC breaks the LL/SC! so you loop forever.

And in fact the same basic problem happens on MIPS, with it’s LL/SC – it’s “exclusive bit” is a cache line. Now, you can easily ensure your STR target isn’t in the same cache line as the LL/SC target – but you *can’t* so easily ensure your STR target isn’t false sharing the cache line!

You’d have to get internal cache topology from somewhere and fiddle your allocs to ensure you never cache line share with the LL/SC target for that data structure instance.

In fact, that false sharing problem affects Intel with its CAS, but false sharing with CAS is just a performance hit, not an endless loop.

So, ARM can’t use LL/SC, PowerPC has false cache sharing to worry about. SPARC has only single word CAS. That leaves Alpha (about which I don’t care) and PowerPC. I think PowerPC is cache line based as well and will also fail from false sharing.

In fact though… thinking about it, the ARM approach to this is in fact genius. All you have to do for your given data structure instance state is align it and pad it to 2048 bytes!

ARM, I kiss you!

It’s the other CPUs which fail on false sharing which are the bastards!

lols

SMR ATM is doing *two* mallocs per release candidate…! ouch!

LL/SC

Just remembered… SPARC doesn’t support LL/SC. It has only single word CAS. If any of this is to run on SPARC, it has to use SMR.

Wikipedia says on PowerPC if you access the same cache line as the LL/SC target, you break the LL/SC. ARM manual says “physical address”, which implies pointer resolution rather than cache line resolution.

Maybe a problem with LL/SC where with the current LL/SC implementation we do a store to memory between the LL and SC. Problem is if it is cache line granularity, if the location we store to is on the same cache line as the pointer we’re LL/SCing, it’ll break the LL/SC.

With the freelist, the store target is the next pointer of the freelist element to push, which being a seperate alloc is unlikely to be directly on the same cache line, but there could be a cache line collision. We would have to check the address of the freelist state and the address of the freelist element (when we allocate it) and get cache arrangement information so we can manually check for cache line sharing.

Circling the runway

So, plans.

SMR freelists are out. Remove the work done to make SMR support arbitrary re-use rather than free.

Add LL/SC versions of all data structures. Need to do this for SMR, too. That should be straightforward. Freelist has been done, stack is easy, queue isn’t so obvious.

Implement atomic_exchange with LL/SC.

Fix up benchmark to handle multiple data structure variants.

Intel will use DCAS+PAC freelists, everyone else will use LLSC freelists.

Everyone will have SMR for malloc/free.

Improve SMR by having release candidate malloc incorporated into data element malloc.

What to do about exponential backoff for LL/SC?

LL/SC Freelist Benchmark on ARM

Release 7 Lock-Free Freelist Benchmark #1

M
S
P P P P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
1 10209274,1020927,0,1.00

1 10260734,1026073,0,1.00
1 1 19428672,971434,218,0.95
1 1 1 28801894,960063,1565,0.94
1 1 1 1 28722706,718068,479326,0.70

1 1 1 1 29389876,734747,417447,0.72

Release 7 Lock-Free Freelist Benchmark #2

M
S
P P P P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
1 10333676,1033368,0,1.00

1 10443788,1044379,0,1.00
1 1 20580442,1029022,46,0.99
1 1 1 30055000,1001833,1798,0.96
1 1 1 1 29497692,737442,492491,0.71

1 1 1 1 29474256,736856,492047,0.71

There’s something mixed up with logical processor set selection. There’s also something messed up with WordPress. It always was like this, but if you manually went down every line and deleted and replaced the newline, it’d look nromal. That doesn’t work now. I’ll look tomorrow and see if I can find a fix. (While we’re on the subject, Firefox can’t handle large text areas. It frezes for about two seconds every ten or so seconds, while typing… the latest update also broke drag’n'drop. I played a game of Company of Heroes earlier. We lost because of bugs – where you end up not being able to build buildings. I think one of the motivations I have to write decent software is NOT to be like all of these things which frustrate and annoy me).

Performance is about 10% better than DCAS+PAC, which is about how much faster CAS is than DCAS. Scaling is about 5% better. All in all, win! I’m going to drop SMR freelists. This is great news, since element re-use really complicated the SMR code.

SMR Freelists

As things stand, a thread will check for SMR generation advance every 1000 read section exits and every 10 elements added to the release candidate list.

Generation advance is blocked if a thread remains inside a read section. In practise this basically doesn’t happen, so it means in practise you need to allocate for your freelist an extra ten elements per thread.

This is awkward – it requires knowing in advance how many threads may use the freelist. One solution is to have guaranteed pop, whereby if the freelist is empty, malloc() is called. This isn’t useful for people doing audio and similar truly time sensitive tasks.

Really, the answer is to use DCAS/PAC freelists on Intel and LLSC freelists on all other platforms. SMR is useful for avoiding ABA and permitting the use of free(), but we don’t need free() for freelists.

Good progress

Turns out a whole bunch of everything was broken, due to all the code reworking I’ve been doing, including a Grand Argument Re-Ordering.

I’m making my way through the tests, getting everything back into shape.

Danger, danger Will Robinson!

This thing is going to break my heart :-)

In the last hour or so I figured out a bug in the SMR code.

When a thread comes to scan its own release candidate list and finds it can release elements, it calls the callback function for that element. With the freelist, this means pushing the element back onto the freelist. To do this, we need a thread SMR state and we use the one of the thread performing the scan.

The thing is, the function to exit for read section has the potential to enter into the release candidate scan! which means we come back into that code while we’re still executing.

So I fixed that and it has fixed the freelist test bug!

And generated another!!

MY HEART!!

The second freelist test now fails, missing elements. Time to investigate.