Introduction
This article explains some of the concepts that every developer working on a multi-processor architecture should understand. These concepts matter both for those targeting IA64 (a weak memory model) and for those writing multi-core code with libraries such as the Task Parallel Library.
Concepts
Re-ordering
The compiler or the processor can optimize code segments and re-order them differently than intended. For example, for the given sequence of statements:
x = 2;
y = 1;
the above could instead be executed in the following order:
y = 1;
x = 2;
This does not create a problem on a single core, but on a multi-core/multi-processor machine it can be an issue, especially if a second thread requires the code to execute in a particular sequence. Unless some sort of memory barrier is used, developers should not assume that the code will execute in the same sequence as it was written.
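To make the risk concrete, here is a minimal sketch (the fields data and ready and the Producer/Consumer methods are hypothetical, not part of the original example) of a second thread that depends on the write order:

using System;

class ReorderingExample
{
    static int data;
    static bool ready;

    // Thread 1: publish the payload, then set the flag.
    static void Producer()
    {
        data = 42;      // (1) write the payload
        ready = true;   // (2) signal that the payload is ready
    }

    // Thread 2: depends on (1) having happened before (2).
    static void Consumer()
    {
        if (ready)                    // may observe the flag from (2)...
            Console.WriteLine(data);  // ...yet read a stale 'data' if (1) was re-ordered or delayed
    }
}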
Memory Barrier or Fence
A memory barrier essentially means a guaranteed flush and prevents re-ordering. For example, if we introduce a barrier between the above two statements...
Barrier (Thread.MemoryBarrier() API in .NET)
...then no statement above or below the barrier can cross it. In other words, it is guaranteed that statements before the barrier stay before it, and statements after the barrier stay after it, hence no re-ordering.
The Thread.MemoryBarrier() API is a full fence by default, i.e. it has both release and acquire semantics. If you want only acquire or only release semantics, you can use the Thread.VolatileRead or Thread.VolatileWrite API respectively in .NET.
A full-fence memory barrier ensures that the written values are flushed and become visible to the other CPUs' caches.
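As a rough sketch (re-using the hypothetical Producer from the re-ordering example above), placing a full fence between the two writes keeps them in order:

static void Producer()
{
    data = 42;
    Thread.MemoryBarrier();  // full fence: the write to 'data' cannot move past this point,
                             // and it is made visible before 'ready' is set
    ready = true;
}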
Cache Coherency
x86 and x64 have a strong(er) memory model: any update/write to a cache line invalidates duplicate cached copies on the other available cores/processors. Therefore the volatile keyword is redundant on x86 and x64 machines (unless you want to avoid re-ordering; see the volatile explanation below).
On IA64, cache coherency on write operations is not automatic, so explicit memory barriers are needed for writes to be flushed to the other cores' caches (this is where volatile comes in handy).
Note: The .NET 2.0 memory model gives write operations release semantics on IA64, but this is not guaranteed by the ECMA CLI specs, so I am not sure whether the same is implemented in Mono etc. (maybe someone can confirm this?). If you want to port code across different memory model implementations, do use memory barriers to be on the safe side.
Volatile and Locks
The volatile keyword's definition on MSDN is not exactly right. Volatile (and locks, for that matter) are implemented using memory barriers, so not only do they carry release and acquire semantics (see http://msdn.microsoft.com/en-us/library/aa490209.aspx), they also prevent re-ordering. Volatile-declared fields are optimized to use read (acquire) barriers for read operations and write (release) barriers for write operations.
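For illustration, here is a hedged sketch (same hypothetical fields as above) showing how declaring the flag volatile gives the write release semantics and the read acquire semantics, so the earlier re-ordering example works without an explicit barrier:

using System;

class VolatileFlagExample
{
    static int data;
    static volatile bool ready;   // volatile: writes get release semantics, reads get acquire semantics

    static void Producer()
    {
        data = 42;
        ready = true;             // volatile write: the write to 'data' cannot be moved after it
    }

    static void Consumer()
    {
        if (ready)                      // volatile read: the read of 'data' cannot be moved before it
            Console.WriteLine(data);
    }
}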
Store
The following is an example of a store, where the value 5 is assigned to the variable or placeholder "x":
x = 5;
Load and Store
lock (syncroot) { x = new someObject(); }
Here someObject is created and assigned to x. This could create a problem (though the chance is very small) when x is accessed concurrently, because x might be observed un-initialized due to a race condition where the write operation is delayed. (See here for more explanation.) To fix this on a weak memory model, use a memory barrier right after the assignment statement (see store reordering below for more explanation).
lock (syncroot) { x = new someObject(); Thread.MemoryBarrier(); }
Store Reordering
This can be explained with a simple singleton implementation:
private static Singleton instance;
private static bool initialized;
private static readonly object slock = new object();

private Singleton() { }

public static Singleton Instance {
    get {
        if (!initialized) {
            lock (slock) {
                if (!initialized) {
                    instance = new Singleton();
                    initialized = true;
                }
            }
        }
        return instance;
    }
}
On x86 and x64, writes are not re-ordered, i.e. "initialized = true" cannot come before the "instance = new Singleton()" statement. If that re-ordering were to happen, there could be a race condition where another thread sees "initialized" set to true and accesses the un-initialized "instance" object.
On an IA64 memory model implementation, writes could be re-ordered, so it is possible that "initialized = true" is executed before the "instance" initialization statement. To fix this, all we need is a memory barrier to prevent the re-ordering:
instance = new Singleton();
Thread.MemoryBarrier();
initialized = true;
In the load and store example above, it is possible that a thread on another CPU sees an un-initialized x (this can happen because of a broken memory model implementation, as explained here and here).
Therefore, if store ordering is important or if you are unsure about the underlying memory model, do use a memory barrier (or better, a volatile write) after the assignment operation to sync the value with other CPUs.
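For example, a minimal sketch of publishing with a volatile write instead of a full fence (the field x is declared as object here because Thread.VolatileWrite only has overloads for primitive types and object; someObject and syncroot are placeholders from the load-and-store example above):

using System.Threading;

class someObject { }

class PublishWithVolatileWrite
{
    static object x;                                  // declared as object so the VolatileWrite overload applies
    static readonly object syncroot = new object();

    static void Publish()
    {
        lock (syncroot)
        {
            // Release semantics only: cheaper than a full Thread.MemoryBarrier(),
            // still guarantees the object is fully constructed before 'x' becomes visible.
            Thread.VolatileWrite(ref x, new someObject());
        }
    }
}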
Load Re-ordering
As explained by Joe Duffy, under rare circumstances (at least theoretically) pending writes may not be seen by other processors immediately; there might be a delay before they are made available to other processors (could this be under heavy stress?). So there is a possibility that loads appear re-ordered because of this side effect (cache coherency is not 100% immediate and might involve a race condition). If you really want to be sure, use memory barriers or VolatileWrite/VolatileRead to flush the data to the other caches before you proceed.
Singleton Pattern
The famous double-checked locking technique:
if (instance == null) {
    lock (slock) {
        if (instance == null)
            instance = new Singleton();
    }
}
return instance;
As explained, the above code may not work as expected on IA64 machines because there are no implicit release semantics on store/write operations on IA64. Another CPU can see "instance" as not equal to null, yet it may contain a partially constructed (junk) value because the stores are not yet flushed to that CPU's cache. (In the .NET 2.0 memory model the above is not true because write operations have release semantics, but this is not guaranteed by the CLI spec, so if you are using Mono/Rotor etc., take care and do use explicit release semantics.)
We can fix the above by declaring "instance" as "volatile", but then there may be a performance penalty, since every time we access the "instance" variable there will be a volatile read, which is redundant and expensive. All we need the volatile semantics for is the "store" operation, so that it flushes the value to all caches and prevents re-ordering.
One way to fix this is to add a memory barrier right after the load-and-store assignment:
if (instance == null) {
    lock (slock) {
        if (instance == null) {
            instance = new Singleton();
            Thread.MemoryBarrier();
        }
    }
}
return instance;
Again, a full-fence memory barrier has both read and write semantics; this could be optimized further by using lazy instantiation via the Interlocked API (see Joe's blog for more detail). The new LazyInit class in .NET 4 uses a similar implementation:
public T Value {
    get {
        if (m_value == null) {
            T newValue = m_init();
            if (Interlocked.CompareExchange(ref m_value, newValue, null) != null &&
                newValue is IDisposable) {
                ((IDisposable)newValue).Dispose();
            }
        }
        return m_value;
    }
}
The Interlocked APIs are atomic and are implemented using memory barriers, thus achieving the best performance (locks are slower than Interlocked or memory barrier calls). Interlocked.CompareExchange is a lot faster because it compiles down to an atomic processor instruction, while locks are expensive (if I read correctly, they may require a kernel-mode transition). Spinlocks, too, are implemented using the Interlocked API.
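As an illustration of that last point, here is a hedged sketch of a minimal spin lock built on Interlocked.CompareExchange (this is not the actual implementation of System.Threading.SpinLock, just a simplified example):

using System.Threading;

struct SimpleSpinLock
{
    private int taken;   // 0 = free, 1 = held

    public void Enter()
    {
        // Atomically flip 'taken' from 0 to 1; spin until we win the race.
        // CompareExchange is a full fence, so it also acquires memory visibility.
        while (Interlocked.CompareExchange(ref taken, 1, 0) != 0)
            Thread.SpinWait(1);
    }

    public void Exit()
    {
        // Exchange is also a full fence, so writes made inside the critical
        // section are visible to the next thread that acquires the lock.
        Interlocked.Exchange(ref taken, 0);
    }
}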
If you are still confused about the memory model, read this.
References
Book: Concurrent Programming On Windows
Blogs:
http://www.bluebytesoftware.com/blog/CommentView,guid,8bb27ef2-d53a-4e9f-b4e4-cc1607d06f37.aspx
http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx
What happens if, say, 4 threads on 4 cores run this code simultaneously? 4 instances of the singleton will get created, probably not what was intended. You should delete this paragraph altogether; the code shown does not do what you think it does, and you cannot remove the lock.
I suggest that you read the following paper: An Introduction to Programming with C# Threads (Microsoft Research).
Thanks for the link.
Though 4 different instances may be created, only one instance will be shared amongst all those 4 threads on the 4 cores. This is actually the mechanism the LazyInit class in .NET 4 is based on (the only difference is that it disposes of the duplicate objects). Maybe I should update the article.
http://www.bluebytesoftware.com/blog/PermaLink,guid,a2787ef6-ade6-4818-846a-2b2fd8bb752b.aspx
public struct LazyInit<T> where T : class {
    private Initializer<T> m_init;
    private T m_value;

    public LazyInit(Initializer<T> init) {
        m_init = init;
        m_value = null;
    }

    public T Value {
        get {
            if (m_value == null) {
                T newValue = m_init();
                if (Interlocked.CompareExchange(ref m_value, newValue, null) != null &&
                    newValue is IDisposable) {
                    ((IDisposable)newValue).Dispose();
                }
            }
            return m_value;
        }
    }
}
There are two errors in the Interlocked example.
First: Interlocked is the class; you forgot to call the method. I suppose it is CompareExchange.
Second:
if(instance == null)
System.Threading.Interlocked.CompareExchange(ref instance, new Singleton(), null);
return instance;
In this case, a new Singleton() will be created if instance == null. But, as there is no lock, between the if (instance == null) check and the CompareExchange call, another Singleton instance can be created in between. Of course, this is not very common, and CompareExchange will prevent the first one from being replaced by the second, but to me it is almost the same error as doing:
if (instance == null)
instance = new Singleton();
My bad, yes it's CompareExchange. I will update the article. Thanks for pointing it out.
Regarding multiple objects being created via the Interlocked API, I think I need to explain why and where to use CompareExchange, as this was pointed out by someone else in the comments section as well. The idea is to lazily load objects and, in the event of a race, call Dispose on the duplicate objects; in the end only one object will be shared by all threads across the CPUs.
public struct LazyInit<T> where T : class {
    private Initializer<T> m_init;
    private T m_value;

    public LazyInit(Initializer<T> init) {
        m_init = init;
        m_value = null;
    }

    public T Value {
        get {
            if (m_value == null) {
                T newValue = m_init();
                if (Interlocked.CompareExchange(ref m_value, newValue, null) != null &&
                    newValue is IDisposable) {
                    ((IDisposable)newValue).Dispose();
                }
            }
            return m_value;
        }
    }
}
http://www.bluebytesoftware.com/blog/PermaLink,guid,a2787ef6-ade6-4818-846a-2b2fd8bb752b.aspx
You say that 'X' would be invalid or garbage, but I'm not quite clear why that would be the case. I recognize that on weak-cache-coherency architectures it's possible that not all processors will 'see' a write simultaneously, but I would expect that all object references would be aligned to fit in a single cache word, and that at any given moment any particular processor would either see a write as having occurred or not occurred (and definitely not see a pointer comprised of some bits of 'old' value and some bits of 'new' value).
Certainly, it would be possible for a thread to copy and null out Y, see that X doesn't hold the old value of Y, and start using the referenced object for its own purposes despite another thread being in the process of copying Y to X, but that could happen under any of the architectures described here.
Out of curiosity, how much performance penalty would there be in practice if compilers were to protect the entry and exit of every function with a memory barrier sufficient to ensure that if globally-addressable memory could be accessed by both the function and the caller, all access before the function call would occur before any accesses in the function, and all access in the function would occur before any accesses afterward? For inline functions, the required semantics would be weaker than those imposed by strict memory barriers, and I would think requiring compilers to honor such semantics would avoid the need to have programmers specify explicit memory barriers whose performance penalty might be worse.
For your first question, the answer is both yes and no. This was broken in the old .NET ECMA model and fixed in the new .NET 2.0 memory model.
From http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx:
This is a common technique for avoiding a lock on the read of ‘a’ in the typical case. It works just fine on X86. But it would be broken by a legal but weak implementation of the ECMA CLI spec. It’s true that, according to the ECMA spec, acquiring a lock has acquire semantics and releasing a lock has release semantics. However, we have to assume that a series of stores have taken place during construction of ‘a’. Those stores can be arbitrarily reordered, including the possibility of delaying them until after the publishing store which assigns the new object to ‘a’. At that point, there is a small window before the store.release implied by leaving the lock. Inside that window, other CPUs can navigate through the reference ‘a’ and see a partially constructed instance.
Also, some of what I have mentioned is not so true, because the new .NET 2.0 memory model has release semantics when doing write operations (IA64), so a memory barrier etc. is not required. But then this new memory model behaviour is not guaranteed by the CLI spec, so I wonder how it is implemented in Mono etc., and that is why I suggested using a memory barrier if in doubt. Remember the default memory barrier has both release and acquire semantics; if performance is an issue, use VolatileRead or VolatileWrite instead of a full fence: http://www.bluebytesoftware.com/blog/PermaLink,guid,543d89ad-8d57-4a51-b7c9-a821e3992bf6.aspx
Performance can be a big issue here because memory barriers require CPU communication, CPU bus intervention etc., and performance can degrade in proportion to the number of cores you have.
rohits1979 wrote:
Those stores can be arbitrarily reordered, including the possibility of delaying them until after the publishing store which assigns the new object to ‘a’.
If the compiler were required to ensure that all publicly-visible effects within the function would occur before any publicly-visible effects "after" it, the scenario you pose could not occur, since the store to "a" is semantically after the return from the New() function.
BTW, I find myself somewhat dubious about multi-processing architectures which don't automatically maintain cache coherency, since there is almost certainly a lot of code out there which would be thread-safe if things run in order, but isn't quite safe when running threads with multiple CPU's with separate caches. I would expect that random slight flakiness might become even more common/'normal' than it is today.
Sign In
·
View Thread
It is all based on the memory model implementation. As I explained, the .NET 2.0 memory model for IA64 has release semantics for write operations (something like what you mentioned), but it is not guaranteed that every ECMA-based memory model implementation would do the same, as ECMA does not dictate anything like that. So yes, the problem started with the ECMA specification and therefore was broken.
It is because of this that developers should understand the internals of, and issues with, the memory model implementation and the architectural differences.
I guess part of my question would be what advantage there would be to building machines with an architecture whose design favors the creation of bugs that are essentially impossible to model or test for, but which may strike at random with arbitrary and unforeseeable consequences. I can understand that it's easier to make machines faster when there are looser cache coherency requirements than when there are tighter ones, but I find it hard to see any architectural benefit from allowing threads running on separate processors to both have free read/write access to 'kinda sorta shared' memory. If one is coding in .NET, will the system at least be nice enough to set up default processor affinity such that programs that don't explicitly declare that they can handle the memory semantics of multi-processor execution on IA64 will be limited to a single CPU?
BTW, on a somewhat related note, I sometimes wish Microsoft could be a little more specific in its comments about thread safety or lack thereof. For example, I don't believe anything in the documentation specifies that it is safe to have one thread reading from a socket while another thread is writing to it, but in many systems such behavior is essentially required; it would be nice to have an official blessing that specified what combinations of operations could safely be performed simultaneously. Leaving such questions unanswered tends to encourage coding practices that will 'usually' work, but may fail randomly and/or in future implementations.
The lock statement in C# creates a full memory fence both on entering and on exiting the lock, according to this article:
http://www.albahari.com/threading/part4.aspx
Assuming the article is correct, there is no need for any further explicit memory barrier calls in double-checked locking.
Can somebody correct me if I'm wrong?