Why ordering constraints are never limited to loads or stores

Many modern architectures provide fence instructions that are specific to only loads or stores. For example, SPARC provides an assortment of such fence instruction variants, as does Alpha. We have not proposed to provide similar facilities in the C++ memory model or atomic operations library.

Although there are cases in which load-only or store-only ordering is useful, we believe that the correctness constraints are exceedingly subtle, even relative to the other constructs we have been discussing. Hence we believe it makes sense to consider support for these only if there is evidence that such restricted ordering is likely to be much cheaper on a reasonable selection of processors. (This is apparently not the case on PowerPC, ARM, x86, MIPS, or Itanium. I have not seen measurements for SPARC, where it might matter. I think wmb is a bit cheaper than mb on most Alphas, but that's of limited interest.)

To see why read-only and write-only fences are tricky to use correctly, consider the canonical use case for acquire/release ordering constraints. We use an atomic flag variable x_init to hand off data (an ordinary variable x) from one thread to another:

Thread 1:
<Initialize x>
x_init.store_release(true);

Thread 2:
if (x_init.load_acquire())
    <use x>

At first glance, this should be the canonical use case that requires only store ordering in thread 1. However, that is actually safe only under very limited conditions. To explain why, we consider two cases, depending on whether the uses of x are entirely read-only or not.
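For comparison, the handoff can be sketched with the std::atomic interface of C++11 and later, which spells the operations store(..., std::memory_order_release) and load(std::memory_order_acquire) rather than store_release and load_acquire. The Payload type, the value 17, and the name run_handoff are illustrative assumptions, not part of the text above. Thread 2 spins on the flag instead of testing it once, only so that the outcome is deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Payload { int a = 0; };   // hypothetical stand-in for the ordinary variable x

// Runs the handoff once and returns the value thread 2 observed in x.a.
int run_handoff() {
    Payload x;                          // ordinary, non-atomic data
    std::atomic<bool> x_init{false};    // atomic flag guarding x
    int seen = -1;

    std::thread t1([&] {
        x.a = 17;                                        // <Initialize x>
        x_init.store(true, std::memory_order_release);   // publish x
    });
    std::thread t2([&] {
        while (!x_init.load(std::memory_order_acquire)) {
            std::this_thread::yield();                   // wait for the flag
        }
        assert(x.a == 17);   // <use x>: the initialization is visible here
        seen = x.a;
    });
    t1.join();
    t2.join();
    return seen;
}
```

Both orderings here are the full release and acquire forms; the rest of this section is about why weakening either one is hazardous.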

Recipient writes to object

This case is highly problematic.

Clearly, and unsurprisingly, it is unsafe to replace the load_acquire with a version that restricts only load ordering in this case. That would allow the store to x in thread 2 to become visible before the initialization of x by thread 1 is complete, possibly losing the update, or corrupting the state of x during initialization.

More interestingly, it is also generally unsafe to restrict the release ordering constraint in thread 1 to only stores. To see this, consider what happens if the initialization of x also reads x, as in

x.a = 0; x.a++;
x_init.store_write_release(true);

and the code that uses x in thread 2 updates it, e.g. with

if (x_init.load_acquire())
    x.a = 42;

If the release store in thread 1 were restricted to only ensure completion of prior stores (writes), the load of x.a in thread 1 (part of x.a++) could effectively be reordered with the assignment to x_init, and could hence see a value of 42, causing x.a to be initialized to 43.

This admittedly appears rather strange, since the assignment of 43 to x.a would have to become visible before the load, on which it depends, is completed. However, we've argued separately that it doesn't make sense to enforce ordering based on dependencies, since they cannot be reliably defined at this level. And machine architectures do not even consistently guarantee ordering based on dependencies through memory. Thus this is an unlikely and very surprising result, but not one we could easily (or apparently even reasonably) disallow in the presence of a store-only fence. And clearly, it is surprising enough that it is important to disallow.
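The std::atomic interface of C++11 and later has no store-only release, so the problematic variant cannot be written portably. As a contrast, the sketch below runs the same two threads with a full release store and acquire load. Under those constraints the load inside x.a++ is ordered before the flag store, so it cannot observe 42, and the final value is always 42 rather than 43. The names X and run_init_then_update are illustrative assumptions, and thread 2 spins on the flag so that it always performs its update.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct X { int a; };   // hypothetical object being handed off

// Returns the final value of x.a after both threads have finished.
int run_init_then_update() {
    X x{};
    std::atomic<bool> x_init{false};

    std::thread t1([&] {
        x.a = 0;
        x.a++;   // loads x.a, then stores the incremented value
        // Full release: orders the prior load AND the prior stores
        // before the flag becomes visible.
        x_init.store(true, std::memory_order_release);
    });
    std::thread t2([&] {
        while (!x_init.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        x.a = 42;   // recipient writes to the object
    });
    t1.join();
    t2.join();
    assert(x.a != 43);   // the surprising outcome is excluded
    return x.a;
}
```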

One might argue that there are nonetheless interesting cases in which the initial operations are known to involve only stores, and hence none of this applies. There are three problems with this argument:

  1. In most cases any actual object initialization or the like will involve calls that cross abstraction boundaries. Interface specifications however do not normally specify whether a constructor call, for example, performs a load. Hence the programmer typically has no way to know whether or not (s)he is dealing with such a case.
  2. A number of operations, such as bit-field stores, perform loads that are not apparent to all but the most sophisticated programmers, and can be affected by this.
  3. The memory model otherwise allows transformations to, say, load a common subexpression value from a struct field if it had to be spilled. Thus not all such loads will even be programmer visible. (The offending transformations might of course be applied to a compilation unit that doesn't reference atomic operations, and was compiled by a third party.)
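The bit-field case in item 2 can be made concrete with a small sketch. On hardware without bit-granular stores, assigning to one bit-field is typically compiled as a load of the containing storage unit, a mask-and-merge, and a store of the whole unit. The struct Packed, the helper names, and the assumption that ready occupies bit 0 are all hypothetical and chosen only for illustration; actual bit-field layout is implementation-defined.

```cpp
#include <cassert>
#include <cstdint>

struct Packed {
    unsigned ready : 1;   // the field being "just stored"
    unsigned count : 7;   // neighbour sharing the same storage unit
};

// What the programmer writes: it looks like a pure store.
void set_ready(Packed& p) { p.ready = 1; }

// Roughly what is typically emitted for it, spelled out on a raw byte
// (assumes, for illustration only, that `ready` occupies bit 0).
std::uint8_t set_ready_by_hand(std::uint8_t unit) {
    std::uint8_t old = unit;                               // hidden load
    return static_cast<std::uint8_t>((old & 0xFEu) | 1u);  // merge, then store
}

// The neighbouring field survives the "store"; without bit-granular
// stores that means it was read and written back.
void check_neighbour_preserved() {
    Packed p{0, 93};
    set_ready(p);
    assert(p.ready == 1 && p.count == 93);
}
```

If such an assignment appears in the initialization code before a store-only release, the hidden load is exactly the kind of load that the release would fail to order.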

Recipient only reads x

In this case, it appears to be safe to limit the release and acquire constraints to stores and loads respectively.

However, I believe the argument here requires more than that the operation be logically read-only; it must truly not write to the object. Since I believe we generally allow library classes (e.g. containers) to avoid synchronization during construction, and then lock "under the covers" for read-only operations (e.g. for caching), this appears to be difficult to guarantee.
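A minimal sketch of such a class, with every name (Samples, sum, cache_fills) invented for illustration: the query is const and therefore logically read-only, yet the first call takes a lock and writes cached state into the object. A recipient that calls only sum() would look like a pure reader while in fact falling under the "recipient writes to object" case.

```cpp
#include <cassert>
#include <mutex>
#include <utility>
#include <vector>

class Samples {
public:
    explicit Samples(std::vector<int> v) : values_(std::move(v)) {}

    // Logically read-only: callers see a const query.
    int sum() const {
        std::lock_guard<std::mutex> lock(mu_);   // lock "under the covers"
        if (!cached_) {
            int total = 0;
            for (int v : values_) total += v;
            sum_ = total;      // physical write inside a const member
            cached_ = true;    // another write to the object
            ++fills_;
        }
        return sum_;
    }

    // Number of times the cache was filled, i.e. how often sum() wrote.
    int cache_fills() const {
        std::lock_guard<std::mutex> lock(mu_);
        return fills_;
    }

private:
    std::vector<int> values_;
    mutable std::mutex mu_;
    mutable bool cached_ = false;
    mutable int sum_ = 0;
    mutable int fills_ = 0;
};

// A const object, used only through const members, is still written to.
void check_const_read_writes() {
    const Samples s(std::vector<int>{1, 2, 3});
    assert(s.sum() == 6);
    assert(s.cache_fills() == 1);   // first call wrote the cache
    assert(s.sum() == 6);
    assert(s.cache_fills() == 1);   // second call only read it
}
```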