The following paper was originally published in the Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation (OSDI '96), Seattle, Washington, October 28-31, 1996.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

=============================================================================

Online Data-Race Detection via Coherency Guarantees

Dejan Perkovic and Peter J. Keleher
Department of Computer Science
University of Maryland
College Park, MD 20742-3255
{keleher,dejanp}@cs.umd.edu

Abstract

We present the design and evaluation of an on-the-fly data-race-detection technique that handles applications written for the lazy release consistent (LRC) shared memory model. We require no explicit association between synchronization and shared memory. Hence, shared accesses have to be tracked and compared at the minimum granularity of data accesses, which is typically a single word. The novel aspect of this system is that we are able to leverage information used to support the underlying memory abstraction to perform on-the-fly data-race detection, without compiler support. Our system consists of a minimally modified version of the CVM distributed shared memory system, and instrumentation code inserted by the ATOM code rewriter.

We present an experimental evaluation of our technique by using our system to look for data races in four unaltered programs. Our system correctly found read-write data races in a program that allows unsynchronized read access to a global tour bound, and a write-write race in a program from a standard benchmark suite. Overall, our mechanism reduced program performance by approximately a factor of two.

Introduction

While potentially very useful, data-race detection mechanisms have yet to become widespread. Part of the problem is surely the restricted domain in which most such mechanisms operate, i.e. parallelizing compilers. Such restrictions are deemed necessary because data-race detection is NP-complete in general, and exponential searches over a domain the size of the number of shared accesses in a program execution are prohibitively expensive.

This paper presents the design and evaluation of an online data-race detection technique for explicitly parallel shared-memory applications. The technique is applicable to shared-memory programs written for the lazy-release-consistent (LRC) [92] memory model. Our work differs from previous work [87, 91, 90, 90, 91, 92] in that data-race detection is performed both on-the-fly and without compiler support. In common with other dynamic systems, we address only the problem of detecting data races that occur in a given execution, not the more general problem of detecting all races allowed by program semantics.

Our general approach is to run applications on a modified version of the Coherent Virtual Memory (CVM) [95] system, a distributed shared memory (DSM) system that supports LRC.
DSMs support the abstraction of shared memory for parallel applications running on CPUs connected by general-purpose interconnects, such as networks of workstations or distributed-memory machines like the IBM SP-2. The key intuition of this work is the following: LRC implementations already maintain enough ordering information to make a constant-time determination of whether any two accesses are concurrent. Hence, a DSM that implements LRC can perform the entire race-detection process on-the-fly with acceptable overhead.

Modifying CVM to implement data-race detection consisted of (i) adding instrumentation to detect read accesses, (ii) integrating this information into existing CVM structures that already contain analogous information about write accesses, and (iii) running a simple race-detection algorithm at existing global synchronization points. The last task is made much easier by leveraging ordering information already maintained to support consistency guarantees.

We used this technique to check for data races in implementations of four common parallel applications. Our system correctly found races in two. TSP, a program that solves the Traveling Salesman Problem, has a large number of data races that result from unsynchronized read accesses to a global tour bound. The reads are left unsynchronized to improve performance; out-of-date tour bounds may cause redundant work to be performed, but do not violate correctness. Water-Nsquared, from the Splash2 benchmark suite, had a data race that constituted a real bug. This bug has been reported to the Splash authors and fixed in their current version.

While overhead is still potentially exponential, we describe a variety of techniques that greatly reduce the number of comparisons that have to be made. Those portions of the race-detection procedure that have the largest theoretical complexity are only the third- or fourth-most expensive portion of the overall technique for the applications that we studied. Specifically, we show that (i) we can statically eliminate over 99% of all load and store instructions as potential race participants, (ii) we dynamically eliminate over 70% of all program execution from consideration by using LRC ordering information, and (iii) the slowdown from using data-race detection in our system is approximately a factor of two for the applications studied. While this overhead is clearly too high for the system to be used all of the time, it is low enough for use when traditional debugging techniques are insufficient.

Problem Definition

The goal of this work is to create a system that detects race conditions online. Since our strategy relies on LRC consistency, our system is clearly applicable only to applications that will run properly on release-consistent systems, i.e. properly-labeled [90] or DRF1 [90] applications. The following definitions are assumed throughout the rest of the paper.

A data race is defined as a pair of memory accesses in some execution, such that:
- Both access the same shared variable,
- At least one is a write,
- The accesses are not ordered by system-visible synchronization or program order.

(A predicate form of this definition is sketched below.)

In the sense discussed by Netzer, the races found by our system are actual data races, i.e. they are races that occur while the program is running on our system. In common with most other implemented systems, both with and without compiler support, we make no claim to detect all feasible data races, i.e. all data races allowed by the semantics of the program.
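As a concrete restatement, the definition above can be written as a simple predicate. The following is an illustrative sketch only; the Access type and the ordered() placeholder are ours and are not part of CVM. Later sections refine ordered() into the happens-before-1 relation.

    #include <cstdint>

    struct Access {
        std::uintptr_t addr;     // word address of the shared location
        bool           is_write; // true for a store, false for a load
        int            proc;     // id of the issuing processor
    };

    // Placeholder ordering check: program order only. A real implementation
    // would also consult system-visible synchronization.
    bool ordered(const Access& a, const Access& b) {
        return a.proc == b.proc;
    }

    bool is_data_race(const Access& a, const Access& b) {
        return a.addr == b.addr                  // same shared variable
            && (a.is_write || b.is_write)        // at least one is a write
            && !ordered(a, b) && !ordered(b, a); // not ordered either way
    }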
A program that runs to completion on our system without data races is therefore not guaranteed to be free of data races in subsequent executions. In practice, however, we expect most data races to reveal themselves when given an appropriate input set.

Figure shows a portion of a single execution in which two processes access a shared variable and synchronize through a synchronization variable. Two of the access pairs in the figure are feasible data races, assuming we have no knowledge of the value of flag. However, if flag is equal to zero during an execution, only one of those pairs is an actual data race, i.e. the only data race that occurs and would therefore be caught by our system. The other access pair is still potentially a bug for some other execution, but would not be flagged by CVM during this execution. The remaining pair of shared accesses does not constitute a data race, as the accesses are ordered by one process's unlock and the other process's lock.

In order for our system to distinguish between the two feasible races in Figure , the system must be able to detect and understand the semantics of all synchronization used by the programs. In practice, this requirement means that programs must use only system-provided synchronization. Any synchronization implemented on top of the shared memory abstraction is invisible to the system, and could result in spurious race warnings. However, the above requirement is no stricter than that of the underlying DSM system. Programs must use system-visible synchronization in order to run on any release-consistent system. Our data-race detection system imposes no additional consistency or synchronization constraints. Given the above definition of data races, our system will detect all data races that occur during a given execution.

Lazy Release Consistency and Data Races

Lazy Release Consistency

Lazy release consistency [92] is a variant of eager release consistency (ERC) [90], a relaxed memory consistency model that allows the effects of shared memory accesses to be delayed until selected synchronization accesses occur. Simplifying matters somewhat, shared memory accesses are labeled either as ordinary or as synchronization accesses, with the latter category further divided into acquire and release accesses. Acquires and releases may be thought of as conventional synchronization operations on a lock, but other synchronization mechanisms can be mapped onto this model as well. Essentially, ERC requires ordinary shared memory accesses to be performed remotely only by the time a subsequent release by the same processor is performed. ERC implementations can delay the effects of shared memory accesses as long as they meet this constraint. Under LRC protocols, processors further delay performing modifications remotely until subsequent acquires by other processors, and the modifications are then performed only at the processor that performed the acquire.

The central intuition of LRC is that competing accesses to shared locations in correct programs will be separated by synchronization. By deferring coherence operations until synchronization is acquired, consistency information can be piggybacked on existing synchronization messages. To do so, LRC divides the execution of each process into intervals, each identified by an interval index. For example, Figure shows an execution of two processors, each of which has two intervals. The first interval of one processor contains a release synchronization access and a write to a shared variable. Each time a process executes a release or an acquire, a new interval begins and the current interval index is incremented.
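To make the interval bookkeeping concrete, the following is a minimal sketch under our own naming (Interval, Process, on_acquire_or_release); CVM's actual structures differ.

    #include <set>
    #include <vector>

    struct Interval {
        int index = 0;               // interval index of the creating process
        std::set<int> pages_written; // becomes the interval's write notices
        std::set<int> pages_read;    // becomes the interval's read notices
    };

    struct Process {
        std::vector<Interval> intervals;  // totally ordered by program order

        Process() { intervals.push_back(Interval()); }  // interval 0

        // Every release or acquire ends the current interval and begins a
        // new one with the next interval index.
        void on_acquire_or_release() {
            Interval next;
            next.index = intervals.back().index + 1;
            intervals.push_back(next);
        }

        void record_write(int page) { intervals.back().pages_written.insert(page); }
        void record_read(int page)  { intervals.back().pages_read.insert(page); }
    };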
Intervals of different processes are related by the happens-before-1 partial ordering [90]:
- intervals on a single processor are totally ordered by program order,
- an interval of one processor precedes an interval of another processor if the latter begins with the acquire corresponding to the release that concluded the former, and
- intervals are ordered by the transitive closure of the above.

LRC protocols append consistency information to all synchronization messages. This information consists of structures describing intervals seen by the releaser but not the acquirer. For example, the message granting the lock in Figure contains information about all intervals seen by the releaser at the time of the release that had not yet been seen by the acquirer.

Data Race Detection in an LRC System

The happens-before-1 relation orders intervals, and by implication, accesses within intervals. Since happens-before-1 is a combination of synchronization order (the release by one processor precedes the corresponding acquire by the other) and program order, it is clear that the write in the releasing processor's interval in Figure precedes (via the happens-before-1 relation) the write in the acquiring processor's subsequent interval. We can now restate the definition of a data race as follows:

A data race is defined as a pair of memory accesses in some execution, such that:
- Both access the same shared variable,
- At least one is a write,
- The accesses are not ordered with respect to happens-before-1.

More informally, a data race is a pair of accesses that do not have intervening synchronization, such that at least one of the accesses is a write. In Figure , if the second write of one processor were to the variable written in the other processor's concurrent interval, it would constitute a data race, because the two intervals are concurrent (not ordered).

In general, detecting data races requires comparing each access against every other access. With an LRC system, however, we can limit comparisons to accesses in pairs of concurrent intervals. For example, one interval pair in Figure (among others) is not concurrent, and so we do not have to check further in order to determine whether there is a data race formed by accesses of those intervals. Furthermore, for each concurrent interval pair, we only perform word-level comparisons if we have first verified that the pages accessed by the two intervals overlap. For example, assume that the second write in Figure is to a different variable that is located on the same page. A comparison of the pages accessed by the two concurrent intervals would reveal that they access overlapping pages, and hence we would need to perform a bitmap comparison in order to determine whether the accesses constitute false sharing or true sharing (i.e. a data race). In this case, the answer would be false sharing because the accesses are to different locations. However, if the second write were instead to a variable on a completely different page, our comparison of the pages accessed by the two intervals would reveal no overlap. No bitmap comparison would be performed, even though the intervals are concurrent.

Implementation

We implemented our data-race detection on top of CVM [95], a software DSM that supports multiple protocols and consistency models. Like commercially available systems such as TreadMarks [94], CVM is written entirely as a user-level library and runs on most UNIX-like systems. Unlike TreadMarks, CVM was created specifically as a platform for protocol experimentation. The system is written in C++, and opaque interfaces are strictly enforced between the different functional units of the system whenever possible. The base system provides a set of classes that implement a generic protocol, lightweight threads, and network communication.
The latter functionality consists of efficient, end-to-end protocols built on top of UDP. New shared memory protocols are created by deriving classes from the base Page and Protocol classes. Only those methods that differ from the base class's methods need to be defined in the derived class. The underlying system calls protocol hooks before and after page faults, synchronization, and I/O events take place. Since many of the methods are inlined, the resulting system is able to perform within a few percent of a heavily optimized system, TreadMarks, running a similar protocol. However, CVM was designed to take advantage of generalized synchronization interfaces, as well as to use multi-threading for latency tolerance. We therefore expect the performance of the fully functional system to improve over the existing base. In order to simplify the comparison process, however, we do not use either of these techniques in this study.

We made only three modifications to the basic CVM implementation: (i) we added instrumentation to collect read and write access information, (ii) we added lists of pages read (read notices) to message types that already carry analogous information about pages written, and (iii) we added an extra message round at barriers in order to retrieve word-level access information, if necessary.

We use the ATOM code rewriter to instrument shared accesses with calls to analysis routines. ATOM allows executable binaries to be analyzed and modified. We use ATOM to identify and instrument all loads and stores that may access shared memory. Although ATOM is available only for DEC Alpha systems, similar tools are becoming more common. EEL provides similar support for SPARC and MIPS systems, and several machine vendors are working on such tools as well. The actual instrumentation consists of a procedure call to an analysis routine that sets a bit in a per-page bitmap if the instruction accesses shared memory. Information about which pages were accessed, together with the bitmaps themselves, is placed in known locations for CVM to use during the execution of the application. All data structures, including bitmaps, are statically allocated in order to reduce runtime cost.

The overall procedure for detecting data races is the following. CVM synchronization messages carry information about process intervals. Each interval contains one or more write notices that specify pages written during that interval. We augmented interval structures to also carry read notices, or lists of pages read during that interval. Interval structures also contain version vectors that identify the logical time associated with the interval, and permit checks for concurrency. Worker processes in any LRC system append consistency information describing all local intervals to barrier arrival messages. At each barrier, therefore, the barrier master has complete and current information on all intervals in the entire system. This information is sufficient for the master to locally determine the set of all pairs of concurrent intervals. Although the algorithm must potentially compare the version vector of each interval of a given processor with the vector of each interval of every other processor, synchronization and program order allow many of the comparisons to be bypassed. Version vector comparison is a constant-time operation, requiring only two integer comparisons. For each pair of concurrent intervals, the read and write notices are checked for overlap.
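Concretely, the two-comparison concurrency check can be sketched as follows, assuming each interval carries a vector timestamp whose entry for processor q records the most recent interval of q visible when this interval began. The IntervalStamp structure and function names are ours, not CVM's.

    #include <vector>

    struct IntervalStamp {
        int proc;               // creating processor
        int index;              // interval index on that processor
        std::vector<int> vec;   // vec[q] = latest interval of processor q
                                // seen when this interval was created
    };

    // a happened-before-1 b iff b had already "seen" interval a when b began.
    bool hb1(const IntervalStamp& a, const IntervalStamp& b) {
        return b.vec[a.proc] >= a.index;
    }

    // Two intervals are concurrent iff neither has seen the other:
    // exactly two integer comparisons.
    bool concurrent(const IntervalStamp& a, const IntervalStamp& b) {
        return !hb1(a, b) && !hb1(b, a);
    }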
A data race might exist on any page that is either written in two concurrent intervals, or read in one interval and written in the other. Such interval pairs, together with a list of overlapping pages, are placed on the check list. Barrier release messages carry the check list to all system processes. Each read or write notice has a corresponding bitmap that describes precisely which words of the page were accessed. These bitmaps are returned to the barrier master for each page and interval on the check list. The barrier master compares bitmaps from overlapping pages in concurrent intervals. Bitmap comparison is a constant-time operation, dependent only on page size. In the case of a read-write or write-write overlap, the algorithm has determined that a data race exists, and prints the address of the affected variable.

We currently use a very simple interval comparison algorithm to find pairs of concurrent intervals, primarily because the major system overhead lies elsewhere. The upper bound on the number of interval comparisons per processor pair is n^2, where n is the maximum number of intervals created by a single processor since the last barrier. The algorithm needs only to examine intervals created during the last barrier epoch. By definition, these intervals are separated from intervals in previous epochs by synchronization, and are therefore ordered with respect to them. Since each interval potentially needs to be compared against every other interval of another process in the current epoch, the total comparison time per barrier is bounded by n^2 * p^2, where p is the number of processes. In practice, however, the number of comparisons is usually quite small. Applications that use only barriers have two intervals per process per barrier epoch. More than two intervals per barrier are only created through additional peer-to-peer synchronization, such as exclusive locks. However, peer-to-peer synchronization also imposes ordering on intervals of the synchronizing processes. For example, a lock release and subsequent acquire order intervals prior to the release with respect to those subsequent to the acquire. Since an ordered pair of intervals cannot be concurrent, the same act that creates intervals also removes many interval pairs from consideration for data races. Hence, programs with many intervals between barriers usually also have ordering constraints that reduce the number of concurrent intervals.

Performance

We evaluated the performance of our prototype by searching for data races in implementations of four common shared-memory applications: FFT (Fast Fourier Transform), SOR (Jacobi relaxation), TSP (branch-and-bound traveling salesman problem), and Water (a molecular dynamics simulation from the Splash2 benchmark suite). All applications were run on DECstations with four 250 MHz Alpha processors, connected by a 155 Mbit/s ATM network. We used only a single processor per machine in order to avoid bus contention.

Table summarizes the application inputs, synchronization types, the number of intervals per barrier, and the overall slowdown for eight-processor runs. ``Memory size'' is the size of the shared data segment. ``Intervals Per Barrier'' is the average number of intervals created between barriers. As the number of interval comparisons is potentially proportional to the number of intervals squared, this metric gives a rough idea of the worst-case cost of running the comparison algorithm.
This number is greater than 1 for FFT and SOR because our barrier implementation requires two interval structures per barrier. Other synchronization mechanisms require only a single interval per synchronization. As the next section will show, the comparison algorithm is at most only the third most costly form of overhead in our applications. ``Slowdown'' is the runtime slowdown for each of the applications, compared with an uninstrumented version of the application running on an unaltered version of CVM. Over the four applications, execution time slows only by an average factor of 2.2. This number compares quite favorably even with systems that exploit extensive compiler analysis.

Figure shows the overhead added by the race-detection mechanism relative to the running time of the unaltered binary, for each application. For example, the execution time of the instrumented FFT binary is 108% longer than that of the uninstrumented binary. ``CVM Mods'' is the overhead added by the modifications to CVM, primarily setting up the data structures necessary for proper data-race detection and the additional bandwidth used by the read notices. ``Proc Call'' is the procedure call overhead for our instrumentation. ATOM will not currently inline instrumentation; only procedure calls can be inserted into existing code. The ATOM team is working to eliminate this restriction, and the ``Proc Call'' component shows how much of the total overhead could be eliminated as a result. ``Access Check'' is the additional time spent inside the procedure call determining whether an access is to shared memory, and setting the proper bit if so. ``Intervals'' refers to the time spent using the interval comparison algorithm to identify concurrent interval pairs with overlapping page accesses. ``Bitmaps'' describes the overhead of the extra barrier round required to retrieve bitmaps, together with the cost of the bitmap comparisons.

The two largest components of the overhead are the access checks and the modifications to CVM. The overheads of the interval comparison algorithm and the bitmap checks are usually fairly small. As we will see in the next section, TSP has a higher rate of calls to the runtime analysis routines than the other applications, hence the higher instrumentation overhead. The comparison algorithm adds more overhead for Water than for the other applications because of the large degree of fine-grained synchronization. The following subsections describe the above overheads in more detail.

Instrumentation Costs

We instrumented each load and store that could potentially be involved in a data race. The instrumentation consists of a procedure call to an analysis routine, and hence adds ``Proc Call'' and ``Access Check'' overheads. By summing these columns from Figure , we can see that instrumentation accounts for an average of 68% of the total race-detection overhead.

This overhead can be reduced by instrumenting fewer instructions. This goal is difficult because shared and private data are all accessed using the same addressing modes, and even share some base registers. However, we eliminate most stack accesses by checking for use of the frame pointer as a base register. The fact that all shared data in our system is dynamically allocated allows us to eliminate any instructions that access private data by indirection through the base register that points to statically allocated data. Finally, we do not instrument any instructions in shared libraries because none of our applications pass segment pointers to any libraries.
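The static filter just described can be summarized as follows. This is a minimal sketch over a hypothetical instruction descriptor; the real filter is expressed against ATOM's instruction-query interface, and the register numbers shown are only illustrative of the Alpha convention.

    // Decide, at instrumentation time, whether a load/store could reach
    // shared memory. MemInst and the register constants are hypothetical.
    enum class Action { Skip, Instrument };

    struct MemInst {
        int  base_reg;      // base register used by the load/store
        bool in_library;    // instruction lives in a shared library
        bool in_cvm;        // instruction is part of the CVM runtime itself
    };

    const int FP = 15;      // frame pointer: stack (private) data
    const int GP = 29;      // global pointer: statically allocated data

    Action classify(const MemInst& m) {
        if (m.in_library || m.in_cvm) return Action::Skip; // no shared pointers passed
        if (m.base_reg == FP)         return Action::Skip; // stack data is private
        if (m.base_reg == GP)         return Action::Skip; // static data; all shared
                                                           // data is dynamically allocated
        return Action::Instrument;                         // may touch shared memory
    }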
Not passing shared-segment pointers to library code is typical of the scientific programs for which data-race detection is most important. However, we can easily instrument ``dirty'' library functions, if necessary.

Table breaks down load and store instructions into the categories that we are able to statically distinguish. The first four columns show the number of loads and stores that are not instrumented because they access the stack or statically-allocated data, or are in library routines, including CVM itself. The fifth column shows the remainder. These instructions could not be eliminated as possible data-race participants and are therefore instrumented by ATOM to make a procedure call to an analysis routine each time the memory access is executed. On average, we are able to statically determine that over 99% of the loads and stores in our applications are to non-shared data.

As an example, the FFT binary contained 131,668 load and store instructions. Of these, 124,716 instructions are in libraries. A further 1285 instructions access data through the frame pointer, and hence reference stack data. Another 3910 are in the CVM system itself. Finally, 1496 instructions access data through a register pointing to the base of statically allocated global memory. We can eliminate these instructions as well, since CVM allocates all shared memory dynamically. In the entire binary, only 261 memory access instructions remain that could possibly reference shared memory, and hence form part of a data race.

Nonetheless, the last two columns of Table show that the majority of run-time calls to our analysis routines are for private, not shared, data. ``Inst. Accesses Per Second'' refers to the number of instrumented loads and stores executed per second, and the number of these calls to our instrumentation routines that turn out to be for shared or private data. The high rate of instrumented accesses for TSP explains the large ``Access Check'' overhead for TSP in Figure . Accesses to shared data are distinguished from accesses to private data by comparing the address with that of the shared data segments.

The Cost of the Comparison Algorithm

The comparison algorithm has three tasks. First, the set of concurrent interval pairs must be found. Second, this list must be winnowed down to those interval pairs for which an overlap of pages is found (i.e. one interval of a pair reads from a page and the other interval in the pair writes to the same page). Each such pair of concurrent intervals exhibits unsynchronized sharing. However, the sharing may be either false sharing, i.e. the loads and stores to the page are to different locations (not a data race), or true sharing, in which the loads and stores overlap in at least one location (a data race). Third, for each remaining pair, the corresponding access bitmaps must be compared in order to distinguish false from true sharing.

The first column of Table shows the percentage of intervals that are involved in at least one such concurrent interval pair. This number ranges from zero for SOR, where there is no unsynchronized sharing (true or false), to 93% for TSP, where there is a large amount of both true and false sharing. Note that the number of possible interval pairs is quadratic with respect to the number of intervals, so even if this stage eliminates only 7% of all intervals, as it does for TSP, we may be eliminating a much higher percentage of interval pairs. The second column of Table shows that an average of only 6% of all bitmaps must be retrieved from constituent processors in order to identify data races by distinguishing false from true sharing.
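To make the access bitmaps and the true/false-sharing distinction concrete, the following is a minimal sketch; the constants and names (record_access, true_sharing, PAGE_SIZE) are ours, and CVM's actual structures differ.

    #include <bitset>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    const std::size_t PAGE_SIZE      = 8192;           // bytes per page (illustrative)
    const std::size_t WORDS_PER_PAGE = PAGE_SIZE / 4;  // one bit per 32-bit word

    struct PageAccess {
        std::bitset<WORDS_PER_PAGE> reads;
        std::bitset<WORDS_PER_PAGE> writes;
    };

    // Runtime analysis routine: called for every instrumented load/store
    // that might touch shared memory.
    void record_access(std::uintptr_t addr, bool is_write,
                       std::uintptr_t shared_base, std::uintptr_t shared_limit,
                       std::vector<PageAccess>& pages) {
        if (addr < shared_base || addr >= shared_limit) return;  // private data
        std::size_t page = (addr - shared_base) / PAGE_SIZE;
        std::size_t word = ((addr - shared_base) % PAGE_SIZE) / 4;
        if (is_write) pages[page].writes.set(word);
        else          pages[page].reads.set(word);
    }

    // Barrier-master comparison for one page accessed in two concurrent
    // intervals: any read-write or write-write overlap is a data race;
    // disjoint bits are only false sharing.
    bool true_sharing(const PageAccess& a, const PageAccess& b) {
        return (a.writes & b.writes).any() ||
               (a.writes & b.reads).any()  ||
               (a.reads  & b.writes).any();
    }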
As page access lists of concurrent intervals will only overlap in cases of false sharing or actual data races, the percentage of intervals and bitmaps involved in comparisons is fairly small.

The Cost of CVM Modifications

Figure shows that almost 22% of our overhead comes from ``CVM Mods'', or modifications made to the CVM system in order to support the race-detection algorithm. This overhead consists of the cost of setting up additional data structures for data-race detection, and the cost of the additional bandwidth consumed by read notices. The third column of Table shows the bandwidth overhead of adding read notices to synchronization messages. Individual read and write notices are the same size, but there are typically at least five times as many reads as writes, and read notices therefore consume a proportionally larger amount of space than write notices. The bandwidth overhead for Water is much larger than for the other applications because of the fine-grained synchronization, and hence the large number of intervals. The bandwidth consumed by read notices prevents us from running larger input sets because current message sizes are already at the system maximum. Our communication code can, and eventually will, be modified in order to accommodate larger messages.

Discussion

Reference Identification

The system currently prints the shared segment address for each detected race condition, together with the interval indexes. In combination with symbol tables, this information can be used to identify the exact variable and synchronization context. Identifying the specific instructions involved in a race is more difficult because it requires retaining program counter information for each shared memory access. This information is available at runtime, but such a scheme would require saving program counters for each shared access until a future barrier analysis phase determined that the access was not involved in a race. The storage requirements for retaining this information would generally be prohibitive, and would also add runtime overhead.

A second approach is to use the conflicting address and corresponding barrier epoch from an initial run of the program as input to a second run. During the second run, program counter information can be gathered for only those accesses to the conflicting address that originate in the barrier epoch determined to involve the data race. While runtime overhead and storage requirements can thereby be drastically reduced, the data race must occur in the second run exactly as in the first. This will happen if the application has no general races, i.e. if synchronization order is deterministic. This is not the case in either of the two applications for which we found data races. A solution is to modify CVM so as to save synchronization ordering information from the first run, and to enforce the same ordering in the second run.

Scalability

Figure shows runtime slowdown versus number of processors. Slowdown actually decreases as we increase the number of processors. This seemingly anomalous result has two causes. First, interval and bitmap comparison overhead is serialized at the master process, and hence observable overhead from these sources remains constant as system size is increased. Instrumentation costs, however, occur in parallel with the shared accesses. As system size increases, therefore, per-process computation and observable instrumentation overhead decrease.
Second, the combination of modest problem sizes, fast processors, and the large page size of the DECstations results in our applications getting very modest speedups even with the unmodified version of the single-writer protocol used in this study. Hence, at least some of the overhead of the race-detection algorithm is probably masked by DSM overhead.

However, none of these limitations are intrinsic to our approach. Our problem sizes are small because of message size limitations. We are modifying the underlying communication layer to alleviate this problem. The large page size exacerbates the problems of false sharing associated with single-writer protocols. We based our prototype on CVM's single-writer protocol in order to minimize complexity, but our algorithm will work identically with CVM's multi-writer protocol. Finally, comparison to determine if two intervals are concurrent is a constant-time operation, as each interval is marked with a vector timestamp [89, 94]. Comparison of two concurrent intervals to determine whether their page lists overlap is currently quadratic in the size of the lists, but the lists are usually very small (i.e. fewer than ten pages). If we encountered applications where these lists grew large, we could perform the comparison in time linear with respect to the number of pages in the system by implementing page lists using bitmaps.

Global Synchronization

The interval comparison algorithm is currently run only at global synchronization operations, i.e. barriers. The applications and input sets in this study use barriers frequently enough, or otherwise synchronize infrequently enough, that the number of intervals to be compared at barriers is quite manageable. Nonetheless, there certainly exist applications for which global synchronization is not frequent enough to keep the number of interval comparisons small. Ideally, the system would be able to incrementally discard data races without global cooperation, but such mechanisms would increase the complexity of the underlying consistency protocol [94]. If global synchronization is either not used, or not used often enough, we can exploit CVM routines that allow global state to be consolidated between synchronizations. Currently, this mechanism is only used in CVM for garbage collection of consistency information in long-running, barrier-free programs.

Accuracy

Adve [91] discusses three potential problems in the accuracy of race-detection schemes used in concert with weak memory systems, i.e. systems that support memory models such as lazy release consistency. The first issue is whether to report all data races, or only those that would also occur during sequentially-consistent executions of the program. Their example (somewhat simplified) is shown in Figure , where the notation op(loc) val indicates a read or write operation performed on location loc that respectively reads or writes value val. If the missing synchronization operations were present, there would not be any races: the read of qPtr would return 100 instead of some older value (37 in this case), and the subsequent writes would be to locations 100 and above. However, given that the synchronization is not present, only the qPtr and qEmpty races would have occurred on sequentially consistent hardware. If the new value of qPtr had propagated to the reading process, any sequentially consistent system must also have propagated the writes to locations 100 and above. This is not the case with weak memory systems, which can usually reorder the effects of write operations between synchronization points at will.
Hence, the races on location 37 and the other data locations only occur on weak memory systems. This is an instance of a more general problem, i.e. whether to return all data races, or only ``first'' data races, those that are not affected or caused by any prior race. Our system currently reports all races, but could be modified to report only first races without requiring more information to be gathered. Determining whether one race is affected by another effectively consists of deciding whether a happens-before-1 relationship exists between any of the operations in one race and any of the operations in another. Since barrier operations are semantically equivalent to releases by all arriving processors to the barrier master, followed by the barrier master releasing to all other processors, any race in a prior barrier epoch must necessarily affect all races in subsequent epochs. Hence, all ``first'' races must occur in the same barrier epoch. Modifying our system to perform this check online is a trivial extension.

The second problem with the accuracy of dynamic race-detection algorithms is the reliability of ordering information in the presence of races. Race conditions could cause wild accesses to random memory locations, potentially corrupting interval ordering information or access bitmaps. This problem exists in any dynamic race-detection algorithm, but we expect it to occur infrequently.

A final accuracy problem identified by Adve is that of systems that attempt to minimize space overhead by buffering only limited trace information, possibly resulting in some races remaining undetected. Our system only discards trace information after it has been checked for races, and hence does not suffer this limitation.

Further Performance Enhancements

There are several ways that overhead can be further reduced. First, the ATOM team has promised a new version that allows instrumentation code to be inlined. The Shasta project has already demonstrated a version of ATOM with this feature. Figure shows that an average of 6.7% of our overhead is caused by the procedure call. This overhead will be eliminated when we get the new version of ATOM.

Second, we currently instrument both load and store instructions. This is necessary because our system is currently built on top of a single-writer LRC protocol. Converting our system to use the multi-writer protocol would allow us to exploit existing diffs, which summarize per-page modifications, to extract write accesses (a sketch appears below). We would then be able to dispense with the monitoring of store instructions. Since approximately 68% of the overhead is from instrumentation, and 25% of all data accesses are stores, we should be able to eliminate at least 17% of the overall overhead. A disadvantage of this approach is a slightly weaker correctness guarantee. Diffs only contain modifications to shared data. If a shared value is overwritten with the same value, the data location will not be in the diff, and any data race involving this location may not be detected.

Finally, Table shows that nearly 68% of the calls to our instrumentation routines turn out to be for private data. Our current analysis tracks references only through the same basic block. If a value defined before that point is used to reference an unknown data location, we conservatively assume that the location is shared, and hence instrument the access. Inter-procedural analysis would allow us to eliminate many of these ``false'' instrumentations, and reduce overall overhead.
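The diff-based alternative mentioned above can be sketched as follows, assuming a multi-writer protocol that keeps a ``twin'' copy of each page made before its first write; the names are ours. Note the caveat discussed above: a word rewritten with the same value leaves no trace in the diff.

    #include <bitset>
    #include <cstddef>
    #include <cstdint>

    const std::size_t PAGE_WORDS = 8192 / 4;   // illustrative page size, in words

    // Recover a per-page write bitmap by comparing the current page contents
    // against the twin, word by word, at diff-creation time.
    std::bitset<PAGE_WORDS> write_bitmap(const std::uint32_t* page,
                                         const std::uint32_t* twin) {
        std::bitset<PAGE_WORDS> written;
        for (std::size_t w = 0; w < PAGE_WORDS; ++w) {
            if (page[w] != twin[w])   // word changed since the twin was made
                written.set(w);
        }
        return written;
    }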
Such inter-procedural analysis can be done with the current ATOM system, but it will be much easier with a version promised in the near future.

Related Work

There has been a great deal of published work in the area of data-race detection. However, as previously mentioned, most prior work has dealt with applications and systems in more specialized domains. Bitmaps have been used to track shared accesses before [90], but we know of no other implementation of on-the-fly data-race detection for explicitly-parallel, shared-memory programs without compiler support.

Our work is closely related to work already alluded to in Section , a technique described (but not implemented) by Adve et al. [91]. The authors describe a post-mortem technique that creates trace logs containing synchronization events, information allowing their relative execution order to be derived, and computation events. Computation events correspond roughly to CVM's intervals. Computation events also have READ and WRITE attributes that are analogous to the read and write page lists and bitmaps that describe the shared accesses of an interval. These trace files are used off-line to perform essentially the same operations as in our system. We differ in that our minimally-modified system leverages the LRC memory model in order to abstract this synchronization ordering information online. We are therefore able to perform all of the analysis online as well, and do away with trace logs, post-mortem analysis, and much of the overhead.

We have also just become aware of unpublished work on execution replay in TreadMarks that could be used to implement race-detection schemes. The approach of the Reconstruction of Lamport Timestamps (ROLT) technique is similar to the technique we described in Section for identifying the instructions involved in races. Minimal ordering information saved during an initial run is used to enforce exactly the same interleaving of shared accesses and synchronization in a second run. During the second run, a complete address trace can be saved for post-mortem analysis, although the authors do not discuss race detection in detail. The advantage of this approach is that the initial run incurs minimal overhead, ensuring that the tracing mechanism does not perturb the normal interleaving of shared accesses. The ROLT approach is complementary to the techniques described in this paper. The primary thrust of our work is in using the underlying consistency mechanism to prune enough information online that post-mortem analysis is not necessary. As such, our techniques could be used to improve the performance of the second phase of the ROLT approach. Similarly, our system could be augmented to include an initial synchronization-tracing phase, allowing us to reduce our perturbation of the parallel computation.

Conclusions

We have presented a new technique that performs on-the-fly data-race detection in explicitly-parallel, shared-memory programs. Our technique abstracts synchronization ordering from consistency information already maintained by lazy-release-consistent DSM systems. We are able to use this information to eliminate most access comparisons, and to perform the entire data-race detection online. The primary costs of data-race detection in our system are in tracking shared data accesses. We use ATOM to instrument load and store instructions with calls to our library.
We are able to statically eliminate more than 99% of all loads and stores in our binaries by identifying accesses to stack variables and statically-allocated global variables. Nonetheless, the majority of the runtime calls to our library are for non-shared accesses. Overall, our applications slow down by an average factor of approximately two.

We used our system to analyze four shared-memory programs, finding data races in two. One of the programs, TSP, allows data races in order to improve performance without violating correctness. The data race in the other program, which was from a standard benchmark suite, was a bug. We believe that the utility of our techniques, in combination with the generality of the programming model that we support, can help data-race detection to become more widely used.

Additional information on CVM is available at: https://www.cs.umd.edu/projects/cvm.html.

Acknowledgments

We would like to thank Carla Ellis and the anonymous referees for all their helpful suggestions and feedback on earlier drafts of this paper.