Transactional Memory Should Be an Implementation Technique, Not a Programming Interface
Hans-J. Boehm
Transactional memory is often advocated as an easier-to-use replacement for locks that avoids any possibility of deadlock. Recently, as more care has been exercised in precisely specifying its semantics, a number of researchers have observed that probably the most attractive semantics for transactional memory systems is based on “single global lock atomicity”, i.e. on the semantics of a single global lock. We argue that this should be taken one step further: The synchronization operations seen by the programmer should really just be locks, possibly with some syntactic sugar for easier programming with a single global lock.
Use as a deadlock-free lock replacement does not require a rollback primitive, or any other construct that exposes properties of the implementation, and such extensions appear to add considerable complexity. Instead, the implementation should strive to optimize coarse-grain locks, for example by implementing them using transactional techniques.
Probably the most widely used approach to parallelizing applications in shared memory environments is to write programs as multiple threads, sharing a complete address space. This approach is neither universally optimal, nor likely to disappear from the scene. It is probably unique in allowing relatively painless incremental parallelization of hot sections of existing sequential code.
Existing multithreaded applications typically use locks to ensure mutual exclusion during access to concurrent data structures. There is a fairly simple way to view threads and shared variables in this environment [2, 8, 5]: threads behave as though their operations were simply interleaved, i.e. sequentially consistent, provided the program contains no data races. Consider, for example, two threads that each increment a shared counter c, initially zero, while holding a lock l. One possible sequentially consistent execution is:

T1: lock(l);
T1: tmp1 = c;      (sees 0)
T1: c = tmp1 + 1;  (stores 1)
T1: unlock(l);
T2: lock(l);
T2: tmp2 = c;      (sees 1)
T2: c = tmp2 + 1;  (stores 2)
T2: unlock(l);
If instead T2 did not acquire the lock, but simply incremented c, a possible sequentially consistent execution would be:

T1: lock(l);
T1: tmp1 = c;      (sees 0)
T2: tmp2 = c;      (sees 0)
T1: c = tmp1 + 1;  (stores 1)
T2: c = tmp2 + 1;  (stores 1)
T1: unlock(l);

Here T2 accesses c without holding the lock, concurrently with T1's accesses, so the program contains a data race, and one of the increments is lost.
We repeat this presentation here to emphasize that it is surprisingly simple, especially in light of the confusion that has historically surrounded the basic programming rules for threads and locks [6, 7, 4, 8]. All mainstream language specifications now appear to be converging on this approach [30, 23, 8, 18].
In spite of its relative simplicity, it appears to be widely believed that this model is already pushing the complexity envelope for the kind of mainstream parallel programming we will need with ubiquitous multicore processors, and our emphasis should be on simplifying it further, rather than complicating it.
The use of locks for mutual exclusion unfortunately has a well-known downside: if two threads acquire the same two locks in opposite orders, they can deadlock, with each thread holding exactly one of the locks. Although, at first glance, this may appear easily avoidable by insisting on a fixed lock ordering, that is hard to do in practice.
Consider a piece of C++ code that performs a pointer assignment x = y while holding a lock l. It is increasingly common to use reference counted shared pointers, such as shared_ptr [10] from the Boost library or the committee draft for C++0x, the next C++ standard [19]. In that case, the assignment operator invokes user-defined destructor operations on the old value of x. These in turn may recursively invoke further destructors for objects indirectly referenced from the old value of x. The code containing the original pointer assignment generally has no knowledge of the types of these objects or the locks their destructors may acquire. Yet it must somehow ensure that all of these locks follow l in the lock ordering.
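A minimal C++ sketch of the hazard (our illustration; Node, log_mutex, and the logging destructor are hypothetical, and we use C++0x library facilities for concreteness):

    #include <memory>
    #include <mutex>

    std::mutex l;          // the lock held around the pointer assignment
    std::mutex log_mutex;  // hypothetical lock acquired by a destructor

    struct Node {
        std::shared_ptr<Node> next;
        ~Node() {
            // A user-defined destructor may acquire further locks; the
            // code holding l generally has no knowledge of this.
            std::lock_guard<std::mutex> g(log_mutex);
            // ... record the deallocation ...
        }
    };

    void assign(std::shared_ptr<Node>& x, const std::shared_ptr<Node>& y) {
        std::lock_guard<std::mutex> g(l);
        // If this drops the last reference to the old value of x, the
        // assignment recursively runs ~Node() for it (and for objects it
        // references), acquiring log_mutex while l is still held. Any
        // global lock ordering must therefore place log_mutex after l.
        x = y;
    }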
A more elaborate Java-based illustration of the problem is the observer pattern example from [21].
It is well-known that this problem can be avoided by replacing lock-based mutual exclusion with atomic sections, which ensure that code inside an atomic section cannot interfere with atomic section code in other threads. Unlike lock-based synchronization, atomic sections do not name a specific lock object; mutual exclusion is provided across all atomic sections in the program. In Java, one might write “atomic { S }” instead of “synchronized ( lock_obj ) { S }”.
Constructs along these lines can be implemented with widely different techniques. They have been studied primarily in the context of transactional memory systems (cf. [17, 29, 16, 1]) that allow multiple threads to proceed concurrently into atomic sections, possibly rolling back one or more of them when conflicting accesses are discovered.
But atomic sections may also be implemented in other ways. For example, they could be implemented in Java simply by declaring a globally accessible object the_lock, shared by all code in a program, and then replacing every atomic section “atomic { S }” by “synchronized (the_lock) { S }”. This is of course not likely to result in a scalable implementation, and there has been a significant amount of research on either semi-automatically [24] or fully automatically [13, 31, 9] assigning distinct finer-grained locks to atomic sections. Although it is not always easy to reproduce the scalability of transactional memory techniques with these approaches, they do reduce the contention-free cost back down to that of locks.
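As a concrete, deliberately naive sketch of this rewriting in C++ (the_lock and the deposit example are our illustration):

    #include <mutex>

    // A single global, reentrant lock shared by all "atomic sections".
    std::recursive_mutex the_lock;

    // "atomic { S }" becomes a block that holds the_lock around S:
    void deposit(long& balance, long amount) {
        std::lock_guard<std::recursive_mutex> g(the_lock);
        balance += amount;  // cannot interleave with any other atomic section
    }

The recursive mutex mirrors the reentrancy of Java's synchronized blocks, so nested atomic sections remain harmless.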
Although atomic sections address an important problem with locks, they have so far enjoyed only minimal use in real applications, for several fundamental reasons.
Here we focus on the first of these issues, the difficulty of giving atomic sections simple and precise semantics, though we are clearly also motivated by the other two.
Since our goal is to introduce atomic sections as a simplification of locks, it makes little sense to significantly complicate the semantic model in the process.
The “sequential consistency for data-race-free programs” approach appears to be fundamentally as simple as possible. Sequential consistency has long been viewed as the most natural model of parallel program execution [20]. Any attempt to extend it to programs containing data races (as [20] originally proposed) imposes expensive constraints on both hardware and compiler, since they must ensure that memory references to possibly-non-local locations are performed in order, requiring memory fence instructions and inhibiting common compiler optimizations, such as common subexpression elimination [5, 8]. These restrictions are rarely useful to portable code, in part because the resulting language semantics would depend on hardware access granularity.
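To make the compiler constraint concrete, consider this small sketch (our illustration; g1 and g2 stand for variables the compiler cannot prove thread-private):

    int g1, g2;  // potentially shared with other threads

    int f() {
        // Under sequential consistency for data-race-free programs, a
        // compiler may compute g1 + g2 once and reuse the value: a
        // concurrent modification would be a data race, so the reuse is
        // unobservable to correct programs. Under sequential consistency
        // for arbitrary programs, both lines below must reload g1 and g2,
        // in order, since another thread could legally modify them in
        // between.
        int a = g1 + g2;
        int b = g1 + g2;  // candidate for common subexpression elimination
        return a * b;
    }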
If we view an atomic section as the acquisition of a single global lock, i.e. as having “single global lock atomicity” semantics (cf. [32, 25]), then it is trivial to extend this model to atomic sections. Effectively, we apply the same rules as above, with the following additional restriction on the interleaving of steps in a sequentially consistent execution: An atomic section entry may not appear in the interleaving until all prior atomic section entries by other threads also have matching exits in the interleaving. This is essentially the same rule we would use for a single global reentrant lock.
Note that this addresses all of the other semantic issues normally discussed in connection with transactional memory semantics. For example, it addresses all of the safety properties cataloged in [32], without requiring them to be specifically addressed, or even mentioned. It provides no guarantees for those properties that make assertions about programs with data races (“repeatable reads”, “intermediate updates”, “intermediate reads”). It does provide the other guarantees, such as “publication safety” and “privatization safety”, since we promise sequential consistency in those cases.
This approach also has the advantage (also enjoyed by, and pointed out in, [32]) that the interaction between atomic sections and conventional synchronization mechanisms, such as locks, condition variables, Java volatiles, and C++0x atomic<T> variables, is perfectly well defined, and combinations are useful.
Most transactional memory systems (e.g. [16] or the Intel STM [1, 26]) provide additional facilities that leverage the ability of traditional transactional memory implementations to roll back partially executed transactions. Typically these take the form of a retry and/or explicit abort statement. We argue here that although they are undoubtedly useful at times, they do not appear to belong in an interface designed to provide basic mutual exclusion.
We basically see two kinds of applications for these facilities, which we address in turn:
It appears to be fairly common to use explicit transaction aborts to recover from failure conditions detected during the transaction. Rolling back the transaction is an easy way to, for example, restore data structure invariants. Effectively the transaction rollback facility provides a convenient mechanism for “exception safety”.
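For illustration, a sketch using a hypothetical atomic/abort syntax (no particular system's; tree and allocation_failed are invented names):

    atomic {
        tree.insert(k, v);        // may leave the tree's invariants
        if (allocation_failed())  // temporarily broken on failure
            abort;                // roll back: invariants are restored
    }                             // automatically, a convenient form of
                                  // "exception safety"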
Although it is useful to roll back a computation in the event of failure, this use appears completely unrelated to the use of transactions for isolation between threads, i.e. as a lock replacement. In particular, it is clearly equally useful, and implementable at less cost, in single-threaded environments, or when operating on data structures private to a thread. Rollback for failure atomicity also introduces a number of other tricky issues.
Although we agree that it would be nice to provide failure atomicity by allowing rollback initiated with an explicit abort action, it appears sufficiently distinct from the concurrency control aspects of transactional memory, and introduces enough added difficulties, that we do not consider it further here.
Some transactional memory systems (e.g. [16]) provide a retry construct that allows transactions to be used for thread communications. When retry is executed, the transaction is aborted, and re-executed once one of the previously read values changes. This effectively allows a thread to wait until a condition is satisfied as a result of actions by another thread.
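For example, a thread might wait for a nonempty queue as follows (hypothetical syntax in the style of [16]; queue is an assumed shared data structure):

    atomic {
        if (queue.empty())
            retry;             // abort; re-execute once a previously read
                               // value (here, the queue state) changes
        item = queue.pop();
    }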
This facility effectively provides a replacement for the condition variables normally used with locks. Unfortunately, it does so in a way that appears to destroy the composition properties that are often used to justify transactional memory to start with.
For example, a call to a function f() may need to perform such communication when it needs to log an error. In a lock-based program, it is usually acceptable to call f() while an unrelated lock is held. In a system based on atomic sections, it is unclear how this can work. If we were to write
atomic { f(); }

f() {
    do_something();
    if (error) {
        request_logging_by_other_thread();
        wait_for_other_thread();
    }
}
then the other thread should not see the request until the atomic section completes. But the atomic section can’t complete until the other thread sees and processes the request. The communication here fundamentally violates the notion that atomic sections should appear indivisible, and that no other thread should see an intermediate state.
Interestingly, if we use the single global lock interpretation of atomic sections, but implement request_logging_by_other_thread() using conventional locks and condition variables, with a different lock, such code continues to work correctly, as it does in a purely lock-based environment.
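A hedged sketch of what we have in mind (our names, using C++0x threading facilities): the logging request uses its own mutex and condition variable, entirely independent of the single global atomic-section lock.

    #include <condition_variable>
    #include <mutex>
    #include <string>

    std::mutex log_lock;               // distinct from the global atomic-section lock
    std::condition_variable log_cond;
    std::string pending_msg;
    bool request_pending = false;
    bool request_done = false;

    void request_logging_by_other_thread(const std::string& msg) {
        std::lock_guard<std::mutex> g(log_lock);
        pending_msg = msg;
        request_pending = true;
        log_cond.notify_all();         // wake the logging thread
    }

    void wait_for_other_thread() {
        std::unique_lock<std::mutex> g(log_lock);
        while (!request_done)          // classic condition-variable wait loop
            log_cond.wait(g);
        request_done = false;
    }

Because the logging thread never touches the global atomic-section lock, it can observe and process the request even while the requesting thread is inside an atomic section.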
Thus it appears to us that retry is of limited utility: it is essentially only useful in code that “knows” it will never be called from inside an enclosing transaction. This is at odds with normal coding practices, in which most code is part of data abstractions that can be used either for thread-private data, or for shared data for which the client ensures isolation/mutual exclusion.
We expect that more general purpose code requiring explicit thread communication will continue to be written with locks and condition variables. We see the primary benefit of atomic sections as simplifying the large amounts of code, often written by less experienced programmers, that do not directly perform this kind of inter-thread communication, though they may rely on a library to do so.
Thus we do not believe that it makes sense to complicate an atomic section interface with a facility like retry that is really aimed at problems faced by more expert programmers, but does not give them the tools to produce a complete solution.
As is pointed out in [22], the atomic section semantics we advocate differ from what transactional memory experts might expect, in that they may block progress of a thread where a transactional implementation of atomic sections would not. Consider the following example, closely related to one in [22]:
Thread 1: atomic { while(1) {} }

Thread 2: atomic {}
          print "Hello";
In our semantics, this is allowed to never print “Hello”, since thread 1 may start running first, acquire the global lock, and thus prevent thread 2 from ever completing its atomic section.
Although this may be surprising, it actually reflects an intentional lack of progress guarantees in the existing Java memory model, which was motivated by completely different considerations. A non-preemptive uniprocessor scheduler would also fail to make progress here, and we concluded during the Java memory model discussions that such implementations should be allowed.
There is no apparent way to modify the preceding example to ensure that “Hello” can only be printed in a transactional, and not a lock-based, implementation, unless we either introduce nested synchronization or have some way to force a fair execution. In the absence of those, it appears impossible to programmatically distinguish the two implementation styles. In our example, thread 2 cannot test whether the infinite loop in thread 1 has started without accessing a variable set inside the atomic section containing the loop. But if we introduced such a variable, the access would introduce a data race, nested synchronization, or a conflicting atomic section that could be delayed indefinitely even with a transactional implementation.
In the presence of nested synchronization, our semantics do assign nontrivial meaning to empty atomic sections. Assume v is a synchronization variable, e.g. a Java volatile variable, and x is an ordinary variable, and both are initially zero in:
Thread 1: atomic { v = 1; x = 1; }

Thread 2: while (!v) {}
          atomic {}
          x = 2;
Under our semantics, this does not contain a data race, since x = 2 cannot be executed until after the atomic section in thread 1 completes. And indeed a lock-based implementation will guard against the race. However, a transactional implementation that treats an empty atomic section as a no-op would not prevent the race.
However, as we illustrated earlier, even transactional implementations generally have to revert to locking in the presence of nested thread communication, as in this example. Both the Intel C++ STM and the system in [32] do so. We expect that this mechanism can be used to ensure lock-like semantics in such cases.
(Note that thread creation is viewed as synchronization for this purpose. If a thread is created inside an atomic section with single global lock semantics, an empty atomic section in the child is guaranteed to wait for the parent to leave its atomic section. This appears to be a moderately common idiom with locks, which assigns nontrivial meaning to empty critical sections.)
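A minimal sketch of this idiom under the single-global-lock reading (C++0x facilities; the_lock as in the naive implementation sketched earlier):

    #include <mutex>
    #include <thread>

    std::recursive_mutex the_lock;  // the single global atomic-section lock

    void parent() {
        std::unique_lock<std::recursive_mutex> g(the_lock);  // enter atomic section
        std::thread child([] {
            // Empty atomic section: blocks until the parent leaves its
            // atomic section, so everything after this point sees the
            // parent's initialization.
            { std::lock_guard<std::recursive_mutex> g2(the_lock); }
            // ... child work that must follow the parent's atomic section ...
        });
        // ... initialization the child must not observe partially done ...
        g.unlock();                 // leave atomic section; the child may proceed
        child.join();
    }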
Thus, in all cases, the single global lock semantics seem fundamentally compatible with both lock-based and transactional memory implementations.
However, all of this clearly needs further investigation, and precise statements and proofs.
Atomic sections, as we have defined them here, add very little to the interface seen by programmers. The same code could have been written with an explicit global reentrant lock. They effectively provide a convention and a bit of “syntactic sugar”.
However, by encouraging a programming style that relies on such a convention, essentially the ultimate coarse-grain locking convention, we are clearly relying on clever implementations to recover reasonable scalability. It’s not yet clear to us what the best implementations will look like, but certainly transactional implementations, presumably with a fall-back to locks to handle I/O and nested synchronization, are likely to be part of the solution. To our knowledge, this is consistent with the approach taken in [27] and with Azul Systems’ use of transactional memory as an optimization for Java locks.
In addition to comments on earlier drafts from the anonymous reviewers and Terence Kelly, this paper benefitted from discussions with many others, including Ali-Reza Adl-Tabatabai, Dhruva Chakrabarti, Pramod Joisha, Tatiana Shpeisman, and Adam Welc, most of whom are believed to disagree with parts of it.