Memory Barriers (part 1)
Understanding memory barriers is a bit like receiving a multi-stage immunization :-) it's a lot to adjust to, so there is a first article which takes the reader part of the way, and a second article which finishes the job. The first article usually needs two or three days to be properly mentally digested. Recommended reading is the first article, and then waiting until you find certain questions naturally coming to mind - the second article answers those questions.
Sockets, Physical Packages, Physical Cores, Logical Cores and Threads
A socket holds a single physical package, which holds any number of physical cores. A physical core may hold any number of logical cores. A thread runs on a logical core. The problems described herein occur when the reader and writer threads are on logical cores in separate physical cores. A single physical core knows everything that's going on in its own logical cores, such that the problems described herein simply do not happen.
To throw into stark relief the nature of the problem memory barriers exist to address, I can say this about multi-physical-core systems : a write to memory by a given physical core is not guaranteed to ever be seen by any other physical core in the system.
So if you have a C program with say some global variables, and a thread writes to one of them, it can be that no other thread ever sees the write that was made.
Similarly, if one thread declares a variable locally, or allocates some store from the heap, and then shares that store with another thread, any writes made by either thread may never be seen by the other thread.
This is due to a processor hardware concept known as store buffers. What these amount to is that a physical core may locally hold on to a write for an indefinite period, before it even becomes visible to the cache coherency protocol (and so to other physical cores). Additionally, even when writes do haphazardly make their way out into the system, the order in which they do so is completely arbitrary.
Clearly, there is a need in multi-physical-core systems to be able to guarantee that writes from one physical core will be seen by another physical core, and in a particular order.
The solution consists of two components; memory barriers, and atomic operations.
There are three types of memory barrier. There is the load barrier, the store barrier, and the full barrier.
To begin with, we need only consider the store barrier.
All barriers are placed in the flow of code, much like any statement. In the code snippet below, STORE_BARRIER is a macro which expands to whatever magic rune the given platform uses to specify a store barrier.
  int a, b;

  a = 5;

  STORE_BARRIER;

  b = 10;
Now, the initial, natural, intuitive and absolutely and diabolically incorrect view of this is that the store barrier will force all writes which have occurred before it out to memory (or at least out to the cache coherency protocol, which amounts to the same thing), such that other physical cores will see them.
This is not the case. Be absolutely clear about this. A store barrier never causes any writes to memory.
The one and only thing a store barrier does is make it such that when the physical core finally does make a write to memory (which it may still never do, and so all of its writes may still never be seen by any other physical core), all those store operations issued before the store barrier will be published to memory before those store operations which were issued after it.
So in the example snippet, neither the write to a nor the write to b may ever be seen by any other physical core - but when the physical core does finally make a write (if it does!), the write to a will be published before the write to b.
However, this functionality is vital. It addresses the ordering problem. Consider a linked list, where an element in the list has a payload of a void pointer, which the user can set as he wishes before the element is linked into the list.
When the element is linked to the list, it must be that the void pointer of user data is published to memory before the pointer adjustment which links the element into the list. Otherwise, we will have an element which is now in the list, and by being in the list is visible to other physical cores, but where the void pointer of user data is not yet published to memory - so other physical cores will be reading some random value for the void pointer of user data. Catastrophe/hilarity ensue.
So we now have an ability through memory barriers to control the ordering of writes - but they only take effect when the physical core actually makes a write to memory. The second part of the solution is the ability to induce the physical core to make a write to memory.
Atomic operations, such as an atomic add, always force a write to memory.
This in turn brings all the previously issued store barriers into effect.
So if we consider the linked list example, when we come to add a new element, we set the user data, issue a store barrier, and then use an atomic operation to link the new element into the list - and this atomic operation, by writing to memory, forces the earlier store barrier to be honoured, which ensures the user data is written to memory before the element is joined to the list by the atomic operation.
There is one final, vital matter to be aware of.
Even when a physical core has actually written a value to memory - actually, really, genuinely has done so, because we cast the necessary magic runes with store barriers and atomic writes to guarantee it - this still does not mean other physical cores will actually be using the values which were written.
The problem is that other physical cores, in a load-side mirror image of store buffers, are never guaranteed to actually be reading the values which are currently in memory. They may continue to use values they read earlier and have cached, for an indefinite period of time.
It's all fun in processor land.
This behaviour is caused by the invalidation queue. When a write to memory occurs, any physical core which has in its caches local copies of the values in memory which have been written to receives a message, an invalidation request, telling it that its local copies are now out of date - but these messages may not be processed in a timely manner (they queue up, in the invalidation queue), and so the physical core can continue to use the earlier, incorrect copy that it already holds for an indefinite period of time.
If we consider the linked list example, it would mean that even though we set the user data in our new element, then issued a store barrier, then used an atomic operation to link the new element into the list, another physical core might already hold an earlier copy of the user data (perhaps we used that memory for something else earlier on, or maybe the OS cached some data in the background we didn't know about). When it comes to access the new element in the list and the user data in that element, it will end up using that earlier copy of the user data, because it has not yet noticed the invalidation request telling it that its local copy is out of date. Catastrophe/hilarity, etc.
This is where load barriers come in.
The one and only thing a load barrier does is make it such that when the physical core finally does complete a load from memory (which it may still never do, so none of these loads necessarily happen), all those load operations issued before the load barrier will be completed (which is to say, the invalidation request queue will be fully processed, to ensure this is so) before those load operations which were issued after the load barrier.
In short, issuing a load barrier ensures that when a load finally completes after the barrier, the invalidation request queue will have been serviced prior to that completion, which in turn ensures the physical core really will see whatever completed stores have been published by other cores.
So before we as a thread reading the linked list access the void pointer of user data, we must always issue a load barrier; this forces the invalidation request queue to be serviced before we read the user data.
(It's worth remembering that processors are much better at deferring stores than loads; a store can be held in the store buffers and the processor can keep going, but a load means the processor really has to go and find out what that value is before it can do any work which uses that value. This means that if you issued a load barrier beforehand, the processor will end up having to honour it - so there is no need, as there is with store barriers, to perform an atomic operation to force the barrier to be honoured.)
A full barrier is a combination of a load barrier and a store barrier. They exist mainly for API completeness, I think - I believe I've used one once, as I normally need only a load or a store barrier, depending on what I'm doing in the code.