Monthly Archives: June 2012

Bye bye freelist

With the move to pre-allocation freelist and stack end up being identical.

When you have library internal alloc, the stack hides its internal book-keeping elements and the freelist will allocate its own elements, so you can separate the APIs and have some API improvement by doing so.

Registration it is

You gotta register with a DSI to use it.

Choice

Okay.

As I see it, I gotta makes a choice.

It’s either SMR per DSI, with a thread having to register with every DSI it uses (and indeed, having a separate thread activity record per SMR), or multiple DSIs per SMR where there’s relatively few SMRs, and the user passing in the thread activity record to every necessary API call and the user getting callbacks on DSI delete to tell him when the delete has actually been completed.

The former, apart from the requirement to register with a DSI, almost hides SMR. The user of course has to allocate the SMR state and pass it in and he has a callback for when elements are released, but that’s it.

The latter, which requires a thread only to register with an SMR where that SMR can then be used with many DSIs, has the complexity and uglyness of it’s DSI delete behaviour.

I dunno. What a choice. I’m feeling the former, thinking the latter. Wish I had a mailing list now to ask users what they want.

more more more SMR

I’m thinking about removing in-API support for SMR.

Leave the SMR API in the library, but don’t try to build it into the data structure APIs.

This would mean the user has to wrap all necessary API calls with enter/exit read_section and also to call release_element where necessary.

Then leave it up to the user to arrange his SMR states, except with the inherent limitations that a DSI must use one and only one SMR state and a thread activity record can be registered with only one SMR state.

Frankly, it’s still complex.

Could go back to having one API for SMR-using and one for not…

More thought

You can’t actually make a thread activity state wholly separate of the SMR state. Each thread needs to indicate, on a per-SMR basis, whether it’s exited a read section. You could have a list of SMR states in the thread activity state, but that means each read section exit requires a (probably short, but still) list walk.

An DSI must use and only use the same SMR state at all times. So you can’t put one in TLS and use that; you have to pass it into the DSI at init time.

So as it stands, if it’s TLS, the a thread can register with only one SMR and a DSI inherently uses and only uses one SMR.

That means we can’t use TLS, because that restriction is too onerous (imagine we have a setup where NUMA nodes are setup as essentially separate machines, each which has it’s own SMR state, using a queue to communicate between the two “machines” – that couldn’t be allowed, because the threads on one machine could dequeue since they can’t register with the other SMR state).

We could have a thread activity state in many SMRs if we were prepared to do the short list walk when we exit the read section. That would permit the use of TLS and so the loss of the thread activity record arguments which currently disfigure all the function calls. But this now would require function calls. That’s not okay.

So, jees, right now we have almost the worst of all worlds. We need to pass in thread activity states to each function call and we have multiple DSIs per SMR state so we have the awful DSI delete behaviour. At least I’ve avoided the mistake of having a single SMR (which would on multiple NUMA machines be in the wrong memory).

Closure

Well, I think I’m getting some closure now.

A thread has to register it’s activity record with an SMR state. There’s no avoiding that unless you have somewhere a check which is performed every single time the thread wants to enter a read section that the activity record has been registered.

So if there’s an SMR per DSI, you have to register with each DSI.

I just can’t see that being okay. It’s too cumbersome; and that means we HAVE to endure the awful delete behaviour induced by SMR where an SMR state is shared between DSIs. It’s awful, but what else is there? the alternative is thread registration with every DSI.

Upon closer examination, a thread activity record cannot currently be part of multiple SMRs. It almost can – but there is a flag in there which the SMR state updates after a scan, e.g. there is per-SMR state in the thread activity record. This could I think be re-arranged without undue grief, which would permit the use of TLS; each thread has its activity record in TLS, a thread registers with the SMR states it wishes to work with.

This has to happen, the current passing around of thread activity records is a concrete eyesore.

And then we have to implement the God-awful delete behaviour under SMR.

No…

Doesn’t save us.

Problem is, you have to register with an SMR state – I mean, there has to be a one-off function call to let the SMR state know about the thread activity record. That or the SMR API on *every* call is going to check a flag in the thread activity record to see if it’s been registered and perform the registration if not and that’s not on the cards.

So if you’re using lots of SMR states, your thread has to register with each and every one – we’re almost back to where we started, with the exception that we now have a single thread activity record in TLS, so you don’t have to pass it around as an argument, and the SMR state is held in the DSI state, so you don’t need to pass the SMR state around either.

But you still need to register each thread using a DSI with that DSI.

I can imagine this being okay for some threading arrangements, e.g. single thread starts, creates all DSIs, starts threads, they all in their init register with those DSIs. In that arrangement it’s hardly painful at all.

Might be onto something…

I might be getting something.

First, you assert that a thread can only register with a single SMR. It’s a loss of flexibility, but it seems okay – you’re only going to want a thread to work with a NUMA local SMR anyway.

The SMR library has single thread SMR state stored in TLS. A thread SMR state now will also store a pointer to it’s SMR state, so the SMR state doesn’t need to be passed around – only the thread SMR state – and that doesn’t need to be passed around, because it’s in TLS.

So this gets you argument-less SMR, where many DSIs share a single SMR.

But then I realised something important; thread SMR states are almost independent from SMR states. By this I mean to say a thread SMR state is really more like an activity record for a thread – it can belong concurrently in any number of SMR states.

(Actually, I’m not 100% sure of this – but fairly close – and I’m running for now with the idea it could be made so).

The only real connection between the SMR state and the (now so named) thread activity record is that an SMR state needs to keep track of all thread activity records registered with it, so they can be scanned. So the thread activity record needs to be present in many add-only atomic linked lists and where we have pre-allocation if we wanted that we’d need to provide an add-only atomic linked list element state for each SMR we register with.

That would then allow us to have a single thread activity record, taken from TLS so no argument passing, where the DSI now selects the SMR state (e.g. it’s own, since there would be one SMR state per DSI) so again no argument passing.

So, questions are; is a thread activity record truely able to be used concurrently in many SMR states? and if so, can we reasonably arrange the list element states?

So what’s this all come to?

Well, well.

After 96 hours thinking and only thinking about (bloody!) SMR, where am I?

Stuck between a pair of equally unacceptable choices.

I can’t have a single global SMR (which would otherwise be lovely and play perfectly with TLS) because of NUMA. There has to be the ability to have an SMR per NUMA node.

When I have an SMR per NUMA node, a thread only has to register once, which is okay enough, although also you have to pass your thread SMR state around as an argument all the time, which sucks, but I can’t see a really performant way of avoiding that.

(There are prolly some things you could do – like asserting a thread can only register with a single SMR, so you can use TLS for the thread SMR state; TLS looks okay – fast enough and available on all platforms and that will mean no passing the thread SMR state around, HUGE hooray!)

But the kicker, the killer, is and always is that with an SMR which has more than one DSI’s elements, when one of those DSI is deleted, you can’t delete the SMR state (someone else is using it) and you can’t get the elements out until they naturally emerge, because the per-thread lists those elements are in are single-threaded and are being used by the owning thread.

TLS / NUMA

If we’re using __thread, which limits SMR to Windows and Linux, but prolly a fair few Linux platforms after all, I’m thinking that although the variable will be per-thread and we imagine a NUMA node per SMR, it will be allocated from the local node, so it’ll be okay.

That makes SMR per NUMA much better for the API (no passing around of the thread SMR state) but we still have the DSI delete problem.

Note also with __thread we can store the whole thread SMR state and access it directly; no need to make a function call, then deref a pointer.

(OTOH, the POSIX API is I imagine supported on most all platforms).

Hmm, hunting around on-line I get the impression modern GCC supports __thread for all the CPUs which support atomic ops. I presume this is OS independent. This means no loss of embedded support. (Also that reminds me that I need to add explicit support for Solaris).