Should atomic operations be ordered based on dependencies?

Consider the following simple program snippet, where x is an "atomic pointer to integer" variable, r1 is a local int * variable, and r2 is a local int variable:
r1 = x.load_relaxed();
r2 = *r1;

Most processors guarantee that, for a naive translation of this code, the second load becomes visible after the first, since there is a data dependency between the two. Indeed Java almost requires such a guarantee in order to allow efficient implementation of final fields.

Such a guarantee is occasionally useful. It would ensure that, in the above example, if another thread initialized an object and then stored a pointer to it into a previously null x in a way that ordered the two stores, we could not see a non-null r1 together with an uninitialized value in r2.
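
For concreteness, the initializing thread might look like the following sketch, using the store_release operation to order the two stores (the value 17 is illustrative):

int *p = new int;
*p = 17;             // initialize the object ...
x.store_release(p);  // ... then publish it; the release store orders the
                     // initialization before the pointer becomes visible

The reader-side snippet above could then rely on the data dependency, rather than a load_acquire, to guarantee that r2 contains the initialized value.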

Note however that all such guarantees become vacuous if only load_acquire and store_release are used. In the most interesting cases, dependency-based ordering allows us to use load_relaxed instead of load_acquire. The resulting performance impact varies from none on X86, SPARC, or PA-RISC to that of a full memory fence on Alpha. PowerPC and Itanium fall in between.

Our most recent C++ atomics (N2145) and memory model (N2171) proposals provide no dependency-based ordering guarantees. Java provides no dependence-based guarantees for non-volatile (unordered) accesses, except for final field guarantees, which rely on dependence-based reasoning at most in the implementation.

The point of this note is to discuss the issues that bear on the C++ decision. Much of this relies on an earlier discussion led by Bill Pugh and Jeremy Manson in the context of the Java memory model, some of which is reproduced in the POPL 2005 paper by Manson, Pugh, and Adve. However, the Java discussion is in the context of ordinary variable accesses, and the motivation there relies heavily on not impeding compiler optimizations. Since the C++ rules apply only to "relaxed" (unordered) accesses to atomic variables, which are expected to be infrequent, these arguments are less convincing for C++.

Dependency-based ordering only makes sense for unordered atomics

The discussion here only makes sense if at least one of the operations involved in a dependency is an unordered atomic operation. If only ordinary operations are involved, ordering is not observable without introducing a data race, which would result in undefined semantics. In fact, if B depends on A, any implied ordering between them is observable only if at least one of the two is an atomic operation; observing it through ordinary variables alone would require a conflicting concurrent access, i.e. a data race.

Static dependencies cannot enforce ordering

Consider the following variant of our original example, where a is an ordinary integer variable:
r1 = x.load_relaxed();
r3 = &a + (r1 - r1);
r2 = *r3;
Statically, the value of r3 appears to depend on r1. However, many compilers will optimize this to at least
r1 = x.load_relaxed();
r3 = &a;
r2 = *r3;
removing the dependency in the process. As a result, much common hardware (e.g. PowerPC, ARM, Itanium) will no longer guarantee that the first and last load appear to be executed in order. And it is unclear that even later stages of the compiler should be required to preserve the order; one of the reasons for introducing load_relaxed was to allow compiler reordering around the operation.

As we stated earlier, we expect load_relaxed to be infrequent in practice. Thus it might not be unreasonable to disallow optimizations involving it. But note that the above optimization is performed entirely on the middle statement, which does not mention load_relaxed, and which might in fact occur in a function whose implementation resides in a separate translation unit.

Thus requiring memory ordering based on static dependencies would impede optimization in other translation units, which might not even mention load_relaxed. This is clearly unacceptable.
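
To make this concrete, suppose the address computation is hidden behind a hypothetical helper in another translation unit (the name add_zero is illustrative):

// other translation unit, compiled with no knowledge of atomics:
int *add_zero(int *base, int *r1) {
    return base + (r1 - r1);   // may legally be folded to: return base;
}

// our code:
r1 = x.load_relaxed();
r3 = add_zero(&a, r1);
r2 = *r3;

The compiler of the helper's translation unit has every reason to perform the folding, and in doing so silently eliminates the static dependency seen by the caller.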

Data vs. control dependencies

Some processors enforce ordering constraints only for certain kinds of dependencies, e.g. for data dependencies, or they have even more specific rules about which kinds of dependencies enforce ordering. Although this is reasonable at the hardware level, these notions make little sense at the programming language level, since compilers routinely convert between different types of dependencies.

For example,

r1 = x.load_relaxed();
if (r1 == &a)
  r2 = *r1;
else
  r2 = *(r1 + 1);
might be transformed to
r1 = x.load_relaxed();
if (r1 == &a)
  r3 = r1;
else
  r3 = r1 + 1;
r2 = *r3;
Again, the transformations that change or break such dependencies may happen in a separate compilation unit. Thus restricting them appears impractical.
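
For example, the selection might be hidden behind a hypothetical out-of-line function (select_elem is an illustrative name):

// other translation unit:
int *select_elem(int *r1) {
    if (r1 == &a)
        return r1;
    else
        return r1 + 1;
}

// our code:
r1 = x.load_relaxed();
r2 = *select_elem(r1);

The compiler of that translation unit is free to generate either the control-dependent or the data-dependent form; the author of the calling code cannot know which.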

Dynamic dependencies

Based on a suggestion by Peter Dimov, we have instead pursued a notion of dependency that
  1. Is based purely on observable dynamic behavior, and is therefore not subject to breakage as a result of compiler transformations, and
  2. Does not distinguish between control and data dependencies, for the same reason.
We'll say that a memory operation B depends on another memory operation A if either the value read or written by B, or whether B is executed at all, depends on the value produced by A. This has the advantage that such dependencies are a property of the actual execution rather than of the program text: under this definition the earlier example involves no dependency at all, since the value of r3 never in fact varies with r1, and hence the optimization that removes the apparent dependency is harmless.

Different instances must be distinct

Unfortunately, this definition is still incomplete for somewhat subtle reasons. The open question is what constitutes "a memory operation". Consider
r1 = x.load_relaxed();
if (r1) {
    r2 = y.a;
} else {
    r2 = y.a;
}
Either of the loads of the ordinary variable y is clearly executed only conditionally.

If we identify the two loads of y, i.e. treat them as a single operation because they both perform the same action, we end up with what is almost certainly an unusable programming model. This would mean that these loads do not depend on the initial load_relaxed, and hence are not ordered with respect to it.

To see why this is unusable, consider instead the variant that involves function calls:

r1 = x.load_relaxed();
if (r1) {
    f(&y);
} else {
    g(&y);
}
Assume that the implementations of f() and g() are defined elsewhere and not visible to the author of this code. If we identified the above loads in the original example, and both f() and g() happen to perform the same loads, then presumably we would have to say that the accesses to y by f() and g() are also unordered with respect to the load_relaxed. On the other hand, if they access different fields, the accesses would be ordered.
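
For instance, suppose both callees happened to be implemented identically (these bodies are illustrative):

int f(int *p) { return *p; }
int g(int *p) { return *p; }    // the very same load as in f()

Identifying the two loads would make them independent of, and hence unordered with respect to, the load_relaxed; change g() to return *(p + 1) instead, and the accesses would suddenly become ordered.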

This leaves us in a position in which the author of the client code cannot tell whether memory accesses are ordered without inspecting the implementation of called functions. Even worse, ordering may be only partially guaranteed because the two functions happen to perform one common memory access. We do not believe this is viable.

By a similar argument, we need to treat two memory operations as distinct if they correspond to the same source statement, but are reached via different call chains. For example, if the functions f() and g() both call h(y), which is implemented as

int h(int *y) {
    return *y;
}
then the load is always performed by the same statement, but nothing substantive has changed.
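
For instance, if both callers simply forward to h() (illustrative bodies again):

int f(int *y) { return h(y); }
int g(int *y) { return h(y); }

the load inside h() is textually unique, yet when reached via f() rather than g() it must still count as a distinct operation, by the same argument as above.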

Why this seriously breaks optimizations

If we reconsider the example from above
r1 = x.load_relaxed();
if (r1) {
    r2 = y.a;
} else {
    r2 = y.a;
}
and treat the two loads of y.a as distinct (as we must), then they depend on, and are ordered with respect to, the load_relaxed. And for a naively compiled version, most hardware would implicitly enforce that ordering.

However, if the above were compiled to

r1 = x.load_relaxed();
r2 = y.a;
as many compilers would, then the dependence has been removed, and the two loads are no longer ordered. Hence such optimizations would have to be disallowed.

It quickly becomes clear that this is intolerable when such code is again distributed among several compilation units. Consider:

int h(int r1) {
    if (r1) {
        return y.a;
    } else {
        return y.a;
    }
}

int f() {
    ...
    r1 = x.load_relaxed();
    r3 = h(r1); 
    ...
}
The dependency, and hence ordering, between the load_relaxed and the y.a loads is broken if h is optimized in the obvious way to just unconditionally return y.a. But h() could easily reside in a separate compilation unit, possibly one that hasn't been recompiled since the addition of atomics.

Our conclusion is that this approach is also not viable.

What might work?

It is possible to construct similar examples in which the dependent operation is a store, as in the sketch below. Restricting the dependent operation to another atomic operation doesn't improve matters much either; we can construct examples in which the offending optimization occurs somewhere in the middle of the call chain between the two operations, and hence could still occur in a compilation unit that doesn't mention atomics.
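
For instance, the store analog of the earlier conditional-load example is (the value 17 is illustrative):

r1 = x.load_relaxed();
if (r1) {
    y.a = 17;
} else {
    y.a = 17;
}

A compiler may legitimately replace this with an unconditional y.a = 17;, removing the dependency on the load, with the same separate-compilation complications as before.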

It may be feasible to restrict dependency-based ordering to the important special case in which the dependent operation is a load or store to a field of an object pointed to by the result of an initial load_relaxed. But even this seems to run afoul of some admittedly much more obscure optimizations, which would otherwise be legal, and which could be carried out in separate compilation units. Consider:

r2 = x.load_relaxed();
r3 = r2 -> a;
If we have some reason to believe, e.g. based on profiling, that the resulting value in r2 is usually the same as a value already available in some register r1, it may well be profitable to transform this to
r2 = x.load_relaxed();
r3 = r1 -> a;
if (r1 != r2) r3 = r2 -> a;
since it usually (namely, whenever r1 == r2) breaks the dependency between the two loads, allowing them to be overlapped. But by doing so, it also breaks the hardware-based ordering guarantee.

Thus even this case doesn't appear very clear-cut.

If we tried to extend this to dependencies through array subscripts, things become more dubious still. If we have

r1 = x.load_relaxed();
r2 = a[r1 -> index % a_size];
and assume that some other thread t' initializes an object p, sets a[p -> index % a_size], and then performs a store_release to set x to p, are we guaranteed that r2 will contain the value assigned by t' to the array element, instead of an earlier value?

Usually the answer would be "yes". However, if the compiler happens to be able to prove that a has only a single element, e.g. because the program is built for a particularly small configuration, it may fold the subscript expression to the constant zero, eliminating the dependency, and the answer unexpectedly becomes "no".
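
That is, the code may legitimately be compiled as

r1 = x.load_relaxed();
r2 = a[0];    // anything % 1 is 0, so the subscript no longer depends on r1

leaving the hardware free to reorder the two loads.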

We could easily add a templatized library function that loads and dereferences an atomic pointer, guaranteeing only that the pointed-to value is read after the pointer itself. But that is of somewhat limited utility, and error-prone.
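
A minimal sketch of such a function, assuming an atomic<T *> template with the load_acquire operation used above (the name load_and_deref is illustrative; load_acquire serves here as the conservative implementation, and could be replaced by a cheaper dependency-preserving load on architectures that already honor address dependencies):

template<class T>
T load_and_deref(atomic<T *> &ap) {
    T *p = ap.load_acquire();   // ensures *p below is read after the pointer
    return *p;
}

Even so, every dependent access must be funneled through such a function by hand, which is the source of the limited utility and the potential for error noted above.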