Category Archives: concurrency

hardware memory models for programmers

Nearly all computer systems now a days have hardware which have multi core chips and shared memory. in this post I will provide a high level overview of TSO (total store order) model which is most common on all x86 processors. It ends with an abstract model software developers can use to reason about low-level concurrency code running on modern multi-core shared memory systems.

Hardware designs have advanced at an incredible pace in last 10 yrs, modern multi-core systems have numerous performance optimizations that have observable consequence for multi-threaded concurrent programs. At a very high level a multi-core chip could look like the diagram below

microarchitecture

Each core modern processor have multiple execution units, which allow instruction level parallelism (ILP) techniques such as pipe-lining  superscalar execution and branch prediction to overlap the execution of many instructions.  CPU speeds have out paced memory system by orders of magnitude, to amortize the cost to access memory each core has it private caches and additional cache which is shared among all cores on the chip. The internal details of each core is complex as shown in the image below

SandyBridgeMicroarchitecture

From a correctness point of  view most of the internal components are invisible or cannot be observed by software, one of the components to keep a note of is the store buffers of each core. Even with the hierarchy of fast caches to main memory , the core often stall while accessing memory, to further hide this latency each core has it own private load and store buffer which basically a FIFO queue which buffer pending writes and reads from memory. We will see later that store buffer does have observable consequences for multi-threaded programs.

why memory model

Memory model allow programmers to reason about the correctness of their programs, it also help programmer get most out the performance optimizations modern multi-core systems  can make.   In a multi-core shared memory system, the reads and writes of multiple threads is non-deterministic and allow many correct execution while disallowing some. What is allowed and what is disallowed varies across processor implementations. Here we will simply focus on x86 processors, from a software perceptive we can observe what is allowed behavior by analysing the results of memory reads and writes. Lets now explore the allowed globally visible states from program below running on a x86 intel processor (intel core i3)

// x and y are initialised to 0
Core-0 Core-1
x = 1 y = 1
r0 = y r1 = x

The interesting thing in the example above is there is only a writer to any given memory location at any time. On a multi-core processor what could the end results of x and y be?  Intuitively we can consider all possible in program-order executions,   see that any of the values below could be expected.

step-1 step-2 step-3 step-4 (r0, r1)
core-0: x=1 core-0: r0=y core-1: y=1 core-1: r1=x (0,1)
core-1: y=1 core-1: r1=x core-0: x=1 core-0: r0=y (1,0)
core-1: y=1 core-0: x=1 core-1: r1=x core-0: r0=y (1,1)
core-1: y=1 core-0: x=1 core-0: r1=y core-1: r0=x (1,1)
core-0: x=1 core-1: y=1 core-0: r0=y core-1: r1=x (1,1)
core-0: x=1 core-1: y=1 core-1: r1=x core-0: r0=y (1,1)

The sequential semantics that is intuitive for most us is called sequential consistency model.
Sequential consistency: the result of any execution is the same as if the read and write operations by all processes were executed in some sequential order and the operations of each individual process appear in this sequence in the order specified by its program [Lamport, 1979].

As we can see it the table above , assuming each processor executed in-program order , There is no interleaving execution that could produce a state where (r0,r1) = (0,0). If the state were to occur (r0,r1) = (0,0) , it would appear from a software point of view as if either core-0 or core-1 effectively executed their instructions out of sequence, clearly this would not satisfy sequential consistency conditions. As we will as demonstrated in the program below on x86 processor are not sequentially consistent as the state (r1,r2) = (0, 0) is in fact valid execution !

[gist https://gist.github.com/isaiah-perumalla/5809246 /]

Since we are focusing on the behavior of the hardware, we want to ensure the gcc compiler does not reorder any of the instruction,

 asm volatile ("" ::: "memory");

directive on lines 10 and 18 ensure compiler optimization will not reorder the statements.

Compiling the above code with gcc

 gcc -O2 x86mmodel.c -o x86mmodel -pthread 

running the code we get the result below
Screenshot-mmodel7

A memory model is a specification that shows the allowed behavior of multi-threaded programs running on shared memory , multi core processors so programmer know what to expect when writing programs on these systems, the behavior is defined simply  in terms of memory reads and writes of memory locations  without any references to caches or memory subsystems. From a correctness point of view cache coherence protocol on the hardware make the cache subsystem functionally invisible to the programmer. That is by simply analysing the results of memory read and memory writes if a system has caches or not. (Note from a performance perspective its crucial to know the memory access patterns and how the caches are utilized by the program, you can infer cache structure by timing your programs which use different memory access patterns).  What is visible to the programmer however is the effects of private store buffers of each core as we demonstrated in the example the write to memory by each core are queued its private store buffer, while each core can read its most recent buffered write ,  the writes in the buffer are invisible to other cores. So before each core’s buffered writes have propagated to the memory subsystem, each core could already executed the next read instruction,  this effectively has the same effect as a memory read instruction getting ahead of an older memory write instruction, according to the x86 TSO memory model this is valid execution. So it clear from this example a programmer model of the hardware must account for the store buffer. To understand and reason about the x86 memory model the following simpler hardware model is sufficient for a programmer.
Memorymodel

In the abstract model above the programmer can simply view each core (hardware thread in this context) as having it own private store FIFO buffers, all writes get queued into the store buffer and each core can ‘snoop its own buffer’ , that is read its most recent buffered write directly from its buffer (other cores will not see the buffered writes until it is propagated to memory). As the example demonstrated the x86 memory model in a mulit-core execution the store buffers are made visible to the programmer, the consequence of this is the x86 model, has slightly weaker memory model than sequential consistency called the Total store order. In a TSO model following operations must be visible to all cores as if they were executed sequentially

  • Load  are not reordered with other loads
  • Store  are not reordered with other stores as every write is queued in a FIFO buffer
  • Loads can be reordered with earlier stores to different memory location. Load and stores to same memory location will not be reordered
  • Memory ordering obeys causality (ie stores are seen in consistent order by other processors)

What can programmers do to strengthen the model ?

Since a Store followed by a Load can appear out of order to other cores, however programmers may wish to ensure load and store instructions to be ordered across all cores , the x86 processor have a ‘mfence’ (memory fence) instruction. the mfence instruction will ensure a core will not only reorder the instructions, but also will ensure that any buffered stores in the cores private store buffer are propagated to the memory subsystem before executing instructions after mfence. with this instruction we can strengthen the memory consistency model when required by the software. By adding the mfence instruction to our previous example , it should never be possible to get a state where (r1,r2) = (0,0)

[gist https://gist.github.com/isaiah-perumalla/5815957/]

The above program should not terminate, as there can never be a case where (r1,r2) = (0,0).

x86 processor also have a number of atomic instructions, such as XCHG which swaps the data between two memory locations atomically at the hardware . All atomic operations implicitly also drain the store buffer and prevents core from reordering instructions. In the next post we will go through practical examples of using memory models to reason about correctness and improve performance.