Introduction

This article explains some of the concepts that every developer working on a multi-processor architecture should understand. These concepts are especially relevant to those targeting IA64 (a weak memory model) or writing multi-core code with libraries such as the Task Parallel Library.

Concepts

Re-ordering

The compiler or the processor can optimize code segments and re-order them differently than intended. For example, for the given sequence of statements:

x = 1;
y = 1;

the above could instead be executed in the following order:

y = 1;
x = 1;

This does not create a problem on a single core, but on a multi-core/multi-processor machine it might, especially if another thread depends on the code executing in a particular sequence. Unless we use some sort of memory barrier, developers should not assume that the code will execute in the same sequence as it was written.
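As an illustration (this sketch is mine, not from the original text; the field names data and ready are made up), here is how re-ordering can break an assumption shared between two threads:

// Sketch only: assumes a weak memory model where the two writes in Publish() may be re-ordered.
class ReorderingExample
{
    private int data;
    private bool ready;

    public void Publish()          // runs on thread 1
    {
        data = 42;                 // (1) write the payload
        ready = true;              // (2) publish the flag; may become visible before (1)
    }

    public int Consume()           // runs on thread 2
    {
        while (!ready) { }         // spin until the flag is visible
        return data;               // may still read 0 if the writes were re-ordered
    }
}

On a weak memory model the write to ready may become visible before the write to data, so Consume() can return 0 even though the flag was observed as true; the memory barrier described next prevents this.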

Memory Barrier or Fence

A memory barrier guarantees a flush and prevents re-ordering. For example, if we introduce a barrier between the above two statements...

x = 1;
Thread.MemoryBarrier();   // barrier (the Thread.MemoryBarrier() API in .NET)
y = 1;

...then no statement above or below the barrier can cross it. In other words, it is guaranteed that any statement before the barrier stays before the barrier, and similarly any statement after the barrier stays after it; hence no re-ordering.

The Thread.MemoryBarrier() API is a full fence by default, i.e. it has both release and acquire semantics. If you only need acquire or release semantics, you can use the Thread.VolatileRead (acquire) or Thread.VolatileWrite (release) APIs in .NET.
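As a minimal sketch (assuming a shared int field named ready; the names are illustrative, not from the article), the half-fence APIs would be used like this:

// using System.Threading; "ready" is assumed to be an int field shared between threads.
Thread.VolatileWrite(ref ready, 1);              // write with release semantics:
                                                 // earlier stores cannot move past this write
int observed = Thread.VolatileRead(ref ready);   // read with acquire semantics:
                                                 // later loads cannot move before this read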

A full fence memory barrier ensures that the values are flushed to all CPU caches.

Cache Coherency

x86 and x64 have a strong(er) memory model: any update/write to a cache invalidates the duplicate cached copies on the other available cores/processors. Therefore the volatile keyword is largely redundant on x86 and x64 machines (unless you want to prevent re-ordering; see the volatile explanation below).

On IA64, cache coherency on write operations is not automatic, and therefore explicit memory barriers are needed for writes to become visible to the other cores' caches (this is where volatile comes in handy).

Note: The .NET 2.0 memory model for IA64 has release semantics for write operations, but this is not required by the CLI as it is not mentioned in the ECMA specs, so I am not sure whether the same is implemented in Mono etc. (maybe someone can confirm this?). If you want to port code across different memory model implementations, use a memory barrier to be on the safe side.

Volatile and Locks

The volatile keyword's definition on MSDN is not exactly right. Volatile (and locks, for that matter) are implemented using memory barriers: not only do they use release and acquire semantics (see http://msdn.microsoft.com/en-us/library/aa490209.aspx), they also do not allow re-ordering across them. Volatile reads are optimized to use a read (acquire) barrier and volatile writes a write (release) barrier.
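Here is a minimal sketch of a volatile field used as a publication flag (the class and field names are my own, for illustration only, not from the article):

// using System.Threading;
class VolatileFlagExample
{
    private int data;
    private volatile bool ready;   // volatile: reads acquire, writes release

    public void Publish()
    {
        data = 42;     // ordinary store
        ready = true;  // volatile store (release): the data store cannot move below it
    }

    public bool TryConsume(out int value)
    {
        if (ready)     // volatile load (acquire): the data load cannot move above it
        {
            value = data;
            return true;
        }
        value = 0;
        return false;
    }
}

The volatile write acts as the release and the volatile read as the acquire, which is exactly the pair of half fences described above.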

Store

Following is an example of a store, where the value 5 is assigned to the variable (or placeholder) "x":

x = 5;

Load and Store

lock (syncroot) { x = new someObject(); }

Here someObject is created and assigned to x. This could create a problem (though with a very small chance) when x is accessed concurrently: x might be seen un-initialized because of a race condition where the constructor's writes could be delayed past the publishing write (see here for more explanation). To fix this, use a memory barrier right after the assignment statement on a weak memory model (see the store re-ordering section for more explanation).

lock (syncroot) { x = new someObject(); Thread.MemoryBarrier(); }

Store Reordering

This can be explained with a simple singleton implementation:

public sealed class Singleton {
    private static Singleton instance;
    private static bool initialized;
    private static readonly object slock = new object();

    private Singleton() { }

    public static Singleton Instance {
        get {
            if (!initialized) {
                lock (slock) {
                    if (!initialized) {
                        instance = new Singleton();
                        initialized = true;
                    }
                }
            }
            return instance;
        }
    }
}

On x86 and x64, writes are not re-ordered, i.e. "initialized = true" cannot come before the "instance = new Singleton()" statement. If that were to happen, there would be a race condition: another thread could see "initialized" set to true and access an un-initialized "instance" object.

On the IA64 memory model implementation, writes can be re-ordered, so it is possible for "initialized = true" to execute before the "instance" initialization statement. To fix this, all we need is a memory barrier to prevent the re-ordering:

instance = new Singleton();
Thread.MemoryBarrier();
initialized = true;

In the load and store example above, it is possible that a thread on another CPU sees an un-initialized x (this could be because of a broken memory model implementation, as explained here and here).

Therefore, if store ordering is important, or if you are unsure about the underlying memory model, use a memory barrier (or better, a volatile write) after the assignment operation to make the value visible to the other CPUs.

Load Re-ordering

As explained by Joe Duffy, under rare circumstances (at least theoretically) pending writes may not be seen by other processors; there might be a delay before they are made visible to them (could this happen under heavy stress?). So loads might effectively be re-ordered because of this side effect (cache coherency is not 100% immediate and may involve a race condition). If you really want to be sure, use memory barriers or VolatileWrite/VolatileRead to flush the data to the other caches before you proceed.
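A hedged sketch of the classic store-then-load pattern where this matters (the flag names are illustrative, not from the article): each thread writes its own flag and then reads the other's; without a full fence between the store and the load, both threads can observe 0.

// using System.Threading;
static class LoadReorderingExample {
    static int flag1, flag2;

    static void Thread1() {
        flag1 = 1;
        Thread.MemoryBarrier();   // full fence between the store and the following load
        int seen = flag2;         // without the fence, both threads could read 0
    }

    static void Thread2() {
        flag2 = 1;
        Thread.MemoryBarrier();
        int seen = flag1;
    }
}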

Singleton Pattern

The famous double check locking technique:

if (instance == null) {
    lock (slock) {
        if (instance == null) {
            instance = new Singleton();
        }
    }
}
return instance;

As explained, the above code may not work as expected on IA64 machines because there are no implicit release semantics on store/write operations on IA64. Another CPU can see "instance" as not equal to null, but it may contain a junk value because the constructor's stores have not yet been flushed to that CPU's cache. (In the .NET 2.0 memory model the above is not true because write operations have release semantics, but this is not required by the CLI, so if you are using Mono/Rotor etc., take care and use explicit release semantics.)

We can fix the above by declaring "instance" as "volatile", but then there may be a performance penalty, since every access to the "instance" variable becomes a volatile read, which is redundant and expensive here. All we need volatile semantics for is the "store" operation, so that it flushes the value to all the caches and prevents re-ordering.

One way to fix this is by adding a memory barrier right after the load and store assignment:

if (instance == null) {
    lock (slock) {
        if (instance == null) {
            instance = new Singleton();
            Thread.MemoryBarrier();
        }
    }
}
return instance;

Again, a full fence memory barrier has both read and write semantics. This can be optimized further by using lazy instantiation via the Interlocked API (see Joe's blog for more detail). The new LazyInit class in .NET 4 uses a similar implementation:

public T Value {
    get {
        if (m_value == null) {
            T newValue = m_init();
            if (Interlocked.CompareExchange(ref m_value, newValue, null) != null &&
                newValue is IDisposable) {
                ((IDisposable)newValue).Dispose();
            }
        }
        return m_value;
    }
}

Interlocked APIs are atomic and are implemented using memory barriers, achieving the best performance (locks are slower than Interlocked operations or memory barrier calls). Interlocked.CompareExchange is a lot faster because it compiles down to a single interlocked CPU instruction, while locks are expensive (if I read correctly, taking a contended lock can require a kernel-mode transition). Spinlocks, too, are implemented using the Interlocked API.
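As a hedged sketch (the Widget and LazyWidget names are mine, for illustration only), lazy initialization with Interlocked.CompareExchange looks roughly like this:

// using System.Threading;
public class Widget { }   // illustrative payload type

public static class LazyWidget {
    private static Widget instance;   // no lock needed; CompareExchange is atomic and a full fence

    public static Widget Instance {
        get {
            if (instance == null) {
                Widget candidate = new Widget();
                // Atomically publish the first candidate; later candidates are simply discarded.
                Interlocked.CompareExchange(ref instance, candidate, null);
            }
            return instance;
        }
    }
}

Under a race more than one Widget may be constructed, but CompareExchange guarantees that only the first published instance is ever returned; the losers simply become garbage (LazyInit additionally disposes them if they implement IDisposable).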

If you are still confused about the memory model, read this.

References

  • Book: Concurrent Programming On Windows
  • Blogs: http://www.bluebytesoftware.com/blog/CommentView,guid,8bb27ef2-d53a-4e9f-b4e4-cc1607d06f37.aspx
  • http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx