Treatment of out-of-thin-air values

Simple formulations of programming language memory models often allow "out-of-thin-air" values to be produced. To see how this might happen, consider the following canonical example (Example 1), where variables x and y are initially zero and atomic, variables starting with "r" are local ("register") variables, and assignment denotes unordered (raw, relaxed) loads or stores:

Thread 1:
r1 = x;
y = r1;

Thread 2:
r2 = y;
x = r2;

The code executed by each thread simply copies one global to another; we've just made the intermediate temporaries explicit to emphasize that both a load and a store is involved.

The question is whether each load can "see" the value written by the other thread's store. If this is possible, then we can get an apparently consistent execution in which each store stores a value of 42, which is then read by the load in the other thread, justifying the stored value.

At some level this appears "obviously absurd". But here are two arguments that, although we probably want to disallow the behavior, it is not as absurd as might appear at first glance:

If the hardware uses naively implemented "load value prediction", and does not go out of its way to avoid this scenario and, e.g. based on different past executions, predicts that x and y contain 42, the prediction might be successfully confirmed, and actually result in this outcome. (We know of no hardware that actually does this. I believe that even on Alpha which for other reasons does not enforce memory ordering based on data dependence, this cannot actually happen. This is only a vague plausibility argument.)
If we change the above example slightly, the analogous behavior needs to be allowed. Consider instead Example 2:
```
Thread 1:
r1 = x;
y = r1;

Thread 2:
r2 = y;
x = 42;
```
Since the assignments represent unordered operations, we would like to allow either the compiler or hardware to reorder the two statements in the second thread. Once that happens, we get the questionable behavior if the first thread executes entirely between the two statements. This corresponds to exactly the same visibility of stores as in the original example. The only difference is the "data dependency" between the two statements in the second thread.

Unfortunately, this already makes it clear that this issue is tied to dealing with dependencies in the memory model, which we show elsewhere is difficult to do. In particular, the approach originally used in N2052 doesn't seem to stand up to close scrutiny, and has thus been dropped from the N2171 revision.

(Adding the dependency based ordering to only the "precedes" relation, as suggested earlier, doesn't help, as becomes apparent below.)

To see why things are so bad, consider another variant (example 3) of the above example:

Thread 1:
r1 = x;
y = r1;

Thread 2:
r2 = y;
if (r2)
  x = 42;
else
  x = 42;

This differs from the preceding version in that each assignment to x is fairly clearly dependent on y, for any reasonable definition of what that means. (Recall that considering only data dependencies doesn't work, since a data dependent assignment could be transformed to a switch statement with many constant assignments. Somehow looking at "the next assignment to x" no matter where it occurs syntactically also doesn't work, for the reasons given in the dependency discussion, and since we could easily replace x = 42; by x = 41; x = 42;.)

The problem of course is that example 3 can be easily optimized to example 2, and most people would expect a good compiler to do so. Allowing r1 = r2 = 42 in example 2, but not example 3, seems untenable and, based on the previous dependency discussion, probably infeasible.

As a result, we now have a second example, which has basically the same structure as example 1 and the same dependency pattern. The difference is really just that in one case the dependency can be "optimized away", and in the other case it cannot.

A different proposal

This seems to leave us several different options:

Provide a more elaborate dynamic definition of which stores might be visible early, along the lines of the corresponding Java definition. This doesn't fit well here, and is arguably the least well understood part of the Java memory model specification.
Possibly switch to something closer to Vijay Saraswat et al's work. Converting this to a form in which it would be palatable as an ISO standard also seems like a daunting task.
Allow "out-of-thin-air" results. Unlike Java, this is probably possible. But it seems undesirable, since it makes it harder to reason about even perfectly well-behaved and well-defined programs. I believe it gives implementations flexibility to produce needlessly counterintuitive results, with no significant performance benefit. I know of no architectures or useful program transformations that are otherwise safe (i.e. don't store speculative values), and generate "out-of-thin-air" results.
Try for a different approach to the problem that at least partially sidesteps the hardest issues.

Here I propose a combination of the last two:

Simplify the main memory model description, so that it allows "out-of-thin-air" values.
In the definition of unordered atomics, which is the only context in which "out-of-thin-air" results can be generated, explicitly prohibit them, as best we can.

I suggest doing the latter with words along the following lines:

An unordered atomic store shall only store a value that has been computed from constants and program input values by a finite sequence of program evaluations, such that each evaluation observes the values of variables as computed by the last prior assignment in the sequence. The ordering of evaluations in this sequence must be such that

If an evaluation B observes a value computed by A in the same thread, then B must not be sequenced before A.
If an evaluation B observes a value computed by A in a different thread, then B must not happen before A.
If an evaluation A is included, then all evaluations that assign to the same variable and are sequenced before or happen-before A must be included.

[Note: This requirement disallows "out-of-thin-air", or "speculative" stores of atomics. Since unordered operations are involved, evaluations may appear in this sequence out of thread order. For example, with x and y initially zero,

Thread 1: r1 = y.load_relaxed(); x.store_relaxed(r1);

Thread 2: r2 = x.load_relaxed(); y.store_relaxed(42);

is allowed to produce r1 = r2 = 42. The sequence of evaluations justifying this starts consists of:

y.store_relaxed(42); r1 = y.load_relaxed(); x.store_relaxed(r1); r2 = x.load_relaxed();

On the other hand,

Thread 1: r1 = y.load_relaxed(); x.store_relaxed(r1);

Thread 2: r2 = x.load_relaxed(); y.store_relaxed(r2);

may not produce r1 = r2 = 42, since there is no sequence of evaluations that results in the computation of 42.]

This has the large advantage that we simplify the core memory model and move the complexity to the section on unordered atomics, where it belongs.

It has the potential disadvantage that my characterization above of what constitutes an "out-of-thin-air" results may well again be wrong. On the other hand, even if we discover that it is, and there is no adequate replacement, I think this rule is sufficiently esoteric that it may be acceptable just to state it as "no speculative unordered atomic stores", and then leave the note.

Jeremy Manson has already pointed out that the above rule fails to preclude

Thread 1:
r1 = x;
if (r1 == 42)
  y = r1;

Thread 2:
r2 = y;
if (r2 == 42)
  x = 42;

from assigning 42 to r1 and r2, in spite of the fact that the initial 42 still comes "out-of-thin-air". This is clearly unacceptable if x and y are ordinary variables, since this code is data-race free, and should thus be sequentially consistent. And we guarantee that r1 = r2 = 0 for ordinary variables.

Whether this is acceptable for relaxed atomics, or whether we should specifically state that relaxed atomics will give sequentially consistent results if the relaxed atomics do not "race" is unclear to me. My impression is that users will not care because there is no reason to use relaxed atomics unless there is a race, and implementors will not care, since no reasonable implementation of atomics will exhibit weaker properties than for ordinary variables.

N2171 incorporates the first half of the change we peoposed here, i.e. the last trace of dependency-based ordering was dropped from N2052.