Hardware memory models for programmers

Nearly all computer systems nowadays ship multi-core chips with shared memory. In this post I will give a high-level overview of the TSO (total store order) model, which is used by all x86 processors. It ends with an abstract model software developers can use to reason about low-level concurrent code running on modern multi-core shared-memory systems.

Hardware designs have advanced at an incredible pace in the last ten years; modern multi-core systems have numerous performance optimizations that have observable consequences for multi-threaded concurrent programs. At a very high level, a multi-core chip could look like the diagram below.


Each core of a modern processor has multiple execution units, which allow instruction-level parallelism (ILP) techniques such as pipelining, superscalar execution, and branch prediction to overlap the execution of many instructions. CPU speeds have outpaced the memory system by orders of magnitude; to amortize the cost of accessing memory, each core has its own private caches, plus an additional cache that is shared among all cores on the chip. The internal details of each core are complex, as shown in the image below.


From a correctness point of view, most of these internal components are invisible to software; one component worth noting, however, is the store buffer of each core. Even with the hierarchy of fast caches in front of main memory, a core often stalls while accessing memory. To further hide this latency, each core has its own private load and store buffers, which are essentially FIFO queues that hold pending writes to and reads from memory. We will see later that the store buffer does have observable consequences for multi-threaded programs.

Why a memory model ?

A memory model allows programmers to reason about the correctness of their programs; it also helps them get the most out of the performance optimizations modern multi-core systems can make. In a multi-core shared-memory system, the interleaving of reads and writes by multiple threads is non-deterministic, and the model allows many correct executions while disallowing others. What is allowed and what is disallowed varies across processor implementations. Here we will simply focus on x86 processors; from a software perspective we can observe what behavior is allowed by analysing the results of memory reads and writes. Let's now explore the allowed globally visible states of the program below running on an x86 Intel processor (Intel Core i3).

// x and y are initialised to 0
Core-0        Core-1
x = 1         y = 1
r0 = y        r1 = x

The interesting thing in the example above is that there is only one writer to any given memory location. On a multi-core processor, what could the final values of (r0, r1) be? Intuitively we can enumerate all possible program-order executions and see that any of the values below could be expected.

step-1        step-2        step-3        step-4        (r0, r1)
core-0: x=1   core-0: r0=y  core-1: y=1   core-1: r1=x  (0,1)
core-1: y=1   core-1: r1=x  core-0: x=1   core-0: r0=y  (1,0)
core-1: y=1   core-0: x=1   core-1: r1=x  core-0: r0=y  (1,1)
core-1: y=1   core-0: x=1   core-0: r0=y  core-1: r1=x  (1,1)
core-0: x=1   core-1: y=1   core-0: r0=y  core-1: r1=x  (1,1)
core-0: x=1   core-1: y=1   core-1: r1=x  core-0: r0=y  (1,1)

The sequential semantics that is intuitive for most of us is called the sequential consistency model.
Sequential consistency: the result of any execution is the same as if the read and write operations by all processes were executed in some sequential order and the operations of each individual process appear in this sequence in the order specified by its program [Lamport, 1979].

As we can see in the table above, assuming each processor executes in program order, there is no interleaved execution that could produce the state (r0, r1) = (0, 0). If that state were to occur, it would appear from a software point of view as if either core-0 or core-1 had effectively executed its instructions out of sequence; clearly this would not satisfy the sequential consistency conditions. Yet, as demonstrated by the program below, x86 processors are not sequentially consistent: the state (r1, r2) = (0, 0) is in fact a valid execution!

[gist https://gist.github.com/isaiah-perumalla/5809246 /]

Since we are focusing on the behavior of the hardware, we want to ensure the gcc compiler does not reorder any of the instructions. The

 asm volatile ("" ::: "memory");

directive on lines 10 and 18 ensures compiler optimizations will not reorder the statements.

Compiling the above code with gcc

 gcc -O2 x86mmodel.c -o x86mmodel -pthread 

Running the code, we get the result below.

A memory model is a specification of the allowed behaviors of multi-threaded programs running on shared-memory, multi-core processors, so that programmers know what to expect when writing programs on these systems. The behavior is defined simply in terms of reads and writes of memory locations, without any reference to caches or memory subsystems. From a correctness point of view, the cache coherence protocol in the hardware makes the cache subsystem functionally invisible to the programmer; that is, by simply analysing the results of memory reads and writes you cannot tell whether a system has caches or not. (From a performance perspective it is crucial to know the memory access patterns and how the caches are utilized by a program; you can infer cache structure by timing programs that use different memory access patterns.)

What is visible to the programmer, however, is the effect of each core's private store buffer. As we demonstrated in the example, writes to memory by each core are queued in its private store buffer; while each core can read its own most recent buffered write, the buffered writes are invisible to the other cores. So before a core's buffered writes have propagated to the memory subsystem, the core may already have executed the next read instruction. This has the same effect as a memory read instruction getting ahead of an older memory write instruction, and according to the x86 TSO memory model this is a valid execution. It is clear from this example that a programmer's model of the hardware must account for the store buffer. To understand and reason about the x86 memory model, the following simpler hardware model is sufficient for a programmer.

In the abstract model above, the programmer can simply view each core (hardware thread in this context) as having its own private FIFO store buffer: all writes get queued into the store buffer, and each core can 'snoop its own buffer', that is, read its most recent buffered write directly from its buffer (other cores will not see a buffered write until it has propagated to memory). As the example demonstrated, in a multi-core execution the store buffers are made visible to the programmer. The consequence is that x86 has a memory model slightly weaker than sequential consistency, called total store order (TSO). In the TSO model the following orderings must be visible to all cores, as if the operations were executed sequentially:

  • Loads are not reordered with other loads
  • Stores are not reordered with other stores, as every write is queued in a FIFO buffer
  • Loads can be reordered with earlier stores to different memory locations; a load and a store to the same memory location will not be reordered
  • Memory ordering obeys causality (i.e. stores are seen in a consistent order by other processors)

What can programmers do to strengthen the model ?

Since a store followed by a load can appear out of order to other cores, programmers may wish to ensure that load and store instructions are seen in order by all cores. For this, x86 processors have an 'mfence' (memory fence) instruction. The mfence instruction ensures not only that a core will not reorder instructions across it, but also that any buffered stores in the core's private store buffer are propagated to the memory subsystem before any instruction after the mfence executes. With this instruction we can strengthen the memory consistency model when required by the software. By adding the mfence instruction to our previous example, it should never be possible to get a state where (r1, r2) = (0, 0).

[gist https://gist.github.com/isaiah-perumalla/5815957/]

The above program should never terminate, as there can never be a case where (r1, r2) = (0, 0).

x86 processors also have a number of atomic instructions, such as XCHG, which atomically exchanges the contents of a register and a memory location. All atomic operations also implicitly drain the store buffer and prevent the core from reordering instructions around them. In the next post we will go through practical examples of using memory models to reason about correctness and improve performance.

Do mocks create high coupling ?

In a recent discussion, a colleague brought up the age-old myth that mock objects are bad because they cause brittle tests, and pointed me to this post. My view is that if using mock objects is causing brittle tests, it is a hint that there is an issue with the design of the collaborating objects, which needs to be addressed. Mock objects are a great tool that aids the design of OO systems, if you subscribe to the “mystical view” of object-oriented programming as described in the books RDD and GOOS. Ralph Johnson notes: “The mystical view is that an OO system is one that is built out of objects that communicate by sending messages to each other, and computation is the messages flying from object to object.”

Let's start by looking at the example in this post.

module Codebreaker
  describe Game do
    describe "#start" do
      it "sends a welcome message" do
        output = double('output')
        game = Game.new(output)
        output.should_receive(:puts).with('Welcome to Codebreaker!')
        game.start
      end
      it "prompts for the first guess"
    end
  end
end

with the implementation code as shown below.

module Codebreaker
  class Game
    def initialize(output)
      @output = output
    end

    def start
      @output.puts("Welcome to Codebreaker!")
    end
  end
end

It is very clear in the above example that the implementation code simply duplicates the test code, which severely inhibits any refactoring without breaking the tests. As the author of the blog post points out, any of the following implementations are valid, yet they would break the test.

@output.write("Welcome to Codebreaker!\n")

@output.print("Welcome to Codebreaker!\n")

"Welcome to Codebreaker!".split.each{|c| output.write(c)}; output.write "\n"

@output << "Welcome to Codebreaker!\n"

The key here is to “listen to the test”: there is duplication between the test and the implementation, and the test is highlighting a serious flaw in the inter-object protocol. One of the things I have picked up over the years in various discussions at the London Xtreme Tuesday group is that an object should send messages to its collaborators in terms of its own domain language. In the example above, an instance of the Game object sends the message

@output.puts("Welcome to Codebreaker!")
to its collaborator. Now :puts is a meaningless message in terms of the Game. In the mystical style of OO, the domain model is in the messages flying between the objects.
One way to fix the above code is as follows: start with an end-to-end test to verify the end result for a given scenario.

The unit test would be as follows.

module Codebreaker
  describe Game do
    describe "#start" do
      it "notifies game has started" do
        gameEventListener = double('gameEventListener')
        game = Game.new(gameEventListener)
        gameEventListener.should_receive(:new_game_started)
        game.start
      end
      it "prompts for the first guess"
    end
  end
end

The key thing to note here is that the message :new_game_started is meaningful in terms of the Game object's domain. The role (gameEventListener) of its collaborator is now made explicit in the test. We can now have an implementation of this role which writes out the appropriate welcome message when it receives :new_game_started. While this is a very simplistic example and the impact may not seem huge here, in larger systems I find that well-designed inter-object protocols lead to simpler, more flexible systems where behaviour can easily be changed by composing different objects.

In chapter 6 of the RDD book the authors describe different control styles in a system. I find that mock objects are a tool that guides our design towards a “delegated control style”; one of the key characteristics of this style is

“Dialogs between objects are at a higher level. Collaborations between a coordinator and the objects it coordinates are typically higher-level requests for work and not simple requests to store or retrieve data. Communications are more abstract and more powerful. Instead of asking objects to do small chores or divulge small bits of information, coordinators typically delegate larger-grained requests.”

At the same time, if we distribute responsibilities across too many different objects we get the following side effects: “Complicated collaborations between delegator and delegates. This can happen when not enough context is passed along with a delegated request. Lots of collaborations but not much work getting done.”

When using mock objects, the tests will be very sensitive to these issues and will highlight them, either by being tightly coupled to implementation details of the object or by becoming awkward.

To summarize, the main points are:

  • We need to understand the philosophies and ideas of OO to effectively use and appreciate mock objects.
  • Using mock objects is a great feedback tool for designing high-level dialogues between objects; brittle tests are usually an indication that the object design needs to be reviewed.
  • In an OO system, the domain model is in fact in the messages sent between objects.
  • Objects must send messages to their peers in terms of their domain language.

Designing inter-object protocols using mocks

The intent of my recent paper was to demonstrate how mock objects can be used to discover roles and design the communication protocols between objects. The following content did not make it into the final version; I think these are important points, so I will describe them here.
Distinguish between an Object’s Internals and its Peers
It is important to distinguish what the internals of an object are and who the peers it communicates with are. An object's internals are usually created and used within the object. In my example, the fact that the Register object collaborates with the Sale object to calculate the receipt total is hidden. Peer objects, on the other hand, aren't hidden inside the object; they are passed in to the object, either through the constructor as a dependency or through a setter (if the peer object is a policy or notification object).
All interactions internal to an object should be hidden from the client code that uses the object; likewise a unit test should not be concerned with the internal details of the object under test.
Exposing an object's internal details simply to test every aspect of a class in pure isolation causes the tests to be overly coupled to the implementation details of the object. You will find that tests using mock objects highlight these design issues very quickly; one hint is when you find that the production code simply mirrors the expectations you wrote in the test code. This obviously makes the tests very brittle, since they are overly coupled to the implementation details of the object under test.
This could be addressed by changing the behavior of the mock framework to ignore all calls between the object under test and its collaborators unless explicitly specified. However, this does not address the underlying weakness in the design of the protocol between the object under test and its collaborators; in fact it simply hides all the complex inter-object protocols.
Nat Pryce coined the phrase “a composite object should be simpler than the sum of its parts.” An object should provide a simpler API to its clients, rather than simply exposing all its internal objects through its API. I find this is a good heuristic to follow when deciding what should be internal to an object and what its peers should be. On the other hand, we should ensure we are not hiding the wrong information inside an object. Well-designed objects should be context independent and not tied to their environment; objects tightly coupled to their environment will be difficult to instantiate in a unit test. I find the rapid feedback provided by tests invaluable when designing objects.

Why did I choose to use NMock ?

A lot of people ask me why I chose NMock for this introductory article. I felt NMock 2 was the best choice of API because it works best within the context of the article. It has an expectation-based API, which makes designing object communication the focus of testing. Its expectation API acts as an embedded domain-specific language that clearly, precisely, and declaratively describes object communication. I also find that NMock 2 provides a highly detailed explanation of what has failed, which allows for quick error diagnosis and resolution.

I’m sure many people will have different opinions on this, but I have avoided mock frameworks that use the popular Arrange, Act, Assert (AAA) style because I find they do not start you off thinking about the contracts or communication protocols between objects. With AAA-style mocks I find it’s easy to overlook design flaws in your inter-object communication.

The main drawback of NMock is that its use of strings to identify expected methods makes refactoring harder. This becomes less of an issue when a code base is designed around well-defined roles and responsibilities, with narrow role-based interfaces that are used locally.

Why GemStone/S

Most business software systems are built using the object-oriented paradigm. One of the key concepts of OO is encapsulation: data and behavior are packaged together, and the distinction between the two is hidden within the object. At every company and client I have worked for, one of the biggest assumptions made is that programs and data should be stored separately. Storing data separately violates encapsulation at the most fundamental level and greatly limits the adaptability of a system; we aren’t able to fully leverage the benefits of object orientation.

“However, doing encapsulation right is a commitment not just to abstraction of state, but to eliminate state oriented metaphors from programming.” – Alan Kay

Now, to eliminate the distinction between databases and programs, we need a technology that seamlessly integrates the capabilities of programs with the storage and multi-user capabilities of databases. This is exactly what GemStone/S is: a Smalltalk dialect with object persistence and transactional capabilities baked into the virtual machine! A GemStone/S repository holds living objects (there is no separation between code and data; code consists of objects too). GemStone/S is not simply an object-oriented database, because it can not only store objects but also execute them in a multi-user environment. GemStone/S is a true object server.

So what does this buy us ?

  • No more object-relational mapping (no need for ActiveRecord, Hibernate, or GLORP-like ORMs).
  • It makes the software incredibly adaptable. Normally, if we change the structure of the data defined in a class, we also need to modify the structure of the database, mapping files, etc. to reflect the change. GemStone/S eliminates the need to synchronize code and data, because there is no distinction between the two.
  • Objects can evolve as the business needs change; it is quite simple to change the shape of a class in a GemStone/S repository.
  • It gives us the ability to create executable business models and objects that cut across applications, so pure objects can be shared across multiple applications. An object can expose a specific interface with a set of messages to each application. This way there is a single point of reference and a single point of change for a particular object. (Note that none of the data is ever exposed to the app; everything is done through the object interface by sending messages.)
  • Check out this case study where a large shipping company used GemStone/Smalltalk to create a pure OO solution.