Why Undefined Semantics for C++ Data Races?
Our proposed C++ memory model gives completely undefined semantics
to programs with data races. Many people have questioned whether this is
necessary, or whether we can provide some guarantees even in this case.
Here we give the arguments for completely undefined semantics.
Some arguments are already given in the introductory "Rationale" section
of committee paper
N1942. This is a more detailed exploration.
The basic arguments for undefined data race semantics in C++ are:
- Unlike Java or .NET, we can get away with it. We do not have to
worry about untrusted sandboxed code exploiting such "undefined behavior" as
a security hole. We also do not currently have to worry about preserving
type-safety in the presence of data races, since we don't have it anyway.
Other inherent "holes" in the type system, such as C-style arrays,
are far more likely to result in type-safety violations in practice.
- I believe undefined race semantics are usually the status quo. Pthreads
states "Applications shall ensure that
access to any memory location by more than one thread of control (threads or
processes) is restricted such that no thread of control can read or modify a
memory location while another thread of control may be modifying it." I've
seen no evidence that the original intent of Win32 threads was any different,
though recent Microsoft tools may enforce stronger implementation properties.
- I am unconvinced that we are doing the programmer a favor by
providing stronger semantics. Data races among ordinary variables
are a bug. If the programmer intended for there to be a data race,
we will provide atomic variables, which convey that
intent to both the compiler and any human readers of the code. They
also allow the programmer to be explicit about ordering assumptions, which
typically require careful consideration to ensure correctness. (A minimal
sketch of the intended usage follows this list.)
- By leaving data race semantics undefined, we allow implementations to
generate error diagnostics if a race is encountered. I believe this
would be highly desirable. Constructing such tools that can operate outside
a debug environment on current hardware is clearly challenging, but
may be feasible in some cases, and hardware support seems at least conceivable
here.
- Perhaps more importantly, by requiring programmers to avoid
unannotated races, we reduce false positives in more practical debug-time
race detection tools.
- It appears likely that we have to leave the semantics of data races
undefined in at least some cases. Consider the case in which
I declare a function-scope object P
with a vtable pointer, store a pointer to P in a global x,
another thread reads x, and then calls one of P's virtual member
functions. On many architectures (e.g. PowerPC, ARM), a memory
fence would be required as part of object construction in order to
preclude a wild branch as a result of the virtual function call.
There
appears to be a consensus that we do not want to incur the fence overhead
in such cases. This means that we need to allow wild branches in
response to some kinds of data races (the scenario is sketched after
this list). Allowing this in some
cases, but not allowing it in all cases, would undoubtedly complicate
the specification.
- Current compiler optimizations often assume that objects do not
change unless there is an intervening assignment through a potential alias.
Violating such a built-in assumption can cause very complicated
effects that will be very hard to explain to a programmer, or
to delimit in the standard.
And I believe this assumption is sufficiently ingrained in current
optimizers that it would be very difficult to effectively remove it.
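To make the point about atomic variables concrete, here is a minimal
sketch of the intended idiom. For concreteness it is written against the
std::atomic interface in the form it takes in C++11; treat the exact names
as illustrative rather than as part of this proposal's text:

#include <atomic>
#include <thread>

int data;                        // ordinary variable: any race on it is a bug
std::atomic<bool> ready(false);  // intentionally shared: intent is explicit

void producer() {
    data = 42;          // ordinary write
    ready.store(true);  // sequentially consistent by default, so the write
                        // to data is visible before the flag is
}

void consumer() {
    while (!ready.load())  // sequentially consistent load
        ;                  // spin until the producer publishes
    // data is guaranteed to be 42 here; there is no data race
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}

The atomic declaration tells both the compiler and the reader that ready is
deliberately accessed concurrently, and the default ordering makes the
publication of data explicit.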
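The vtable scenario from the list can be sketched directly as well. The
program below deliberately contains the race in question; all names are
illustrative:

#include <thread>

struct T {
    virtual void f() { /* ... */ }
};

T* x = 0;  // ordinary global pointer, accessed without synchronization

void writer() {
    static T p;  // function-scope object; construction stores p's
                 // vtable pointer
    x = &p;      // racy publication: no release fence separates the
                 // vtable-pointer store from this store
}

void reader() {
    T* q = x;    // racy read of x
    if (q)
        q->f();  // on PowerPC or ARM, the store of q's vtable pointer may
                 // not yet be visible, so this indirect call can branch
                 // wildly
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}

Avoiding the wild branch here would require a fence at the end of the
constructor, or before the store to x, which is exactly the overhead the
consensus view wants to avoid.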
As an example of the last phenomenon, consider the following
relatively simple code, which does not include any synchronization.
Assume x is a shared global and everything else is local:
unsigned i = x;
if (i < 2) {
  foo: ...
  switch (i) {
    case 0:
      ...;
      break;
    case 1:
      ...;
      break;
    default:
      ...;
  }
}
Assume that the code at label foo is fairly complex, forcing i
to be spilled, and that the switch statement is implemented with a branch
table.
(If you don't believe the latter is a reasonable assumption here, assume that
"2" is replaced by a larger number, and that there are more explicit cases in
the switch statement.)
The compiler now performs the following reasonable
(and common?) optimizations:
- It notices that i and x (in its view of the world)
contain the same value. Hence when i is spilled, there is no need
to store it; the value can just be reloaded from x.
- It notices that the switch expression is either 0 or 1, and hence
eliminates the bounds check on the branch table access and the branch to the
default case. (The sketch below shows the resulting code.)
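To see where this leads, here is a hand-written sketch of the code after
both transformations. It uses the labels-as-values extension of GCC and
Clang to spell out the branch table, and it mirrors plausible compiler
output rather than the actual output of any particular compiler:

unsigned x = 0;  // shared global; imagine another thread racily modifies it

void example() {
    unsigned i = x;
    if (i < 2) {
        // ... complex code at foo; rather than spilling i, the compiler
        // plans to reload its value from x ...
        static void* table[] = { &&case0, &&case1 };  // two entries only
        goto *table[x];  // i "reloaded" from x; no bounds check and no
                         // default branch, since i was known to be 0 or 1.
                         // If x has meanwhile changed to 5, table[5] is
                         // garbage and this is a wild branch.
    case0:
        /* ... */ goto done;
    case1:
        /* ... */ ;
    done: ;
    }
}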
Now consider the case in which, unbeknownst to the compiler, there is
actually a race on x, and its value changes to 5 during the execution
of the code labeled by foo. The results are:
- When i is reloaded for the evaluation of the switch
expression, it gets the value 5 instead of its original value.
- The branch table is accessed with an out-of-bounds index of 5,
resulting in a garbage branch target.
- We take a wild branch, say to code that happens to implement
rogue-o-matic (the canonical form of C++ "undefined behavior").
Although it appears difficult to get existing compilers to exhibit this
kind of behavior for a variety of reasons, I think compiler writers will
generally refuse to promise that it cannot happen. It can be the result
of very standard optimization techniques. As far as I know, precluding
this behavior with certainty would require significant optimizer redesign.
Even if we were to prohibit reloading
of a local from a global shared variable, we would still have to explain
similar behavior for the analogous program which tests x and then
uses x as the switch expression (sketched below). That version is slightly
easier to explain, but it still seems very confusing.
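For concreteness, that analogous fragment would look like this, with the
bodies elided as before:

unsigned x = 0;  // shared global; imagine another thread racily modifies it

void variant() {
    if (x < 2) {      // first racy read of x
        // ... code at foo ...
        switch (x) {  // second racy read: may observe a different value
                      // than the test above did
            case 0: /* ... */ break;
            case 1: /* ... */ break;
            default: /* ... */ break;
        }
    }
}

Both reads of x are now explicit in the source, which is what makes the
behavior slightly easier to explain; the outcome can be equally surprising.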
Some such redesign is necessary anyway
to avoid speculative writes. But I think that is usually limited
to a few transformations which can currently introduce such writes.
I suspect the redesign required here would be significantly
more pervasive. And I do not see how to justify the cost.
The one benefit of providing better-defined race semantics seems to
be somewhat easier debuggability of production code that encounters a
race. But I'm not sure
that outweighs the negative impact on race detection tools.