Workshop Program

All sessions will take place in Club Regent unless otherwise noted.

The full papers published by USENIX for the workshop are available for download, as an archive or individually below, to workshop registrants immediately and to everyone beginning June 24, 2013. Everyone can view the abstracts immediately. Copyright to the individual works is retained by the author(s).


Monday, June 24, 2013

8:30 a.m.–9:00 a.m. Monday

Continental Breakfast

Market Street Foyer

9:00 a.m.–10:00 a.m. Monday

Panel

Tools in the Real World

Panelists: Niall Dalton, Calxeda; Brandon Lucia, University of Washington and Microsoft Research; Tipp Moseley, Google; Paul Petersen, Intel Corporation

10:00 a.m.–10:15 a.m. Monday

Break with Refreshments

Market Street Foyer

10:15 a.m.–11:15 a.m. Monday

Performance

Parallelization Primitives for Dynamic Sparse Computations

Tsung-Han Lin, Stephen J. Tarsa, and H.T. Kung, Harvard University

We characterize a general class of algorithms common in machine learning, scientific computing, and signal processing, whose computational dependencies are both sparse and dynamically defined throughout execution. Existing parallel computing runtimes, like MapReduce and GraphLab, are a poor fit for this class because they assume statically defined dependencies for resource allocation and scheduling decisions. As a result, changing load characteristics and straggling compute units degrade performance significantly. However, we show that the sparsity of computational dependencies and these algorithms' natural error tolerance can be exploited to implement a flexible execution model with large efficiency gains, using two simple primitives: selective push-pull and statistical barriers. With reconstruction for compressive time-lapse MRI as a motivating application, we deploy a large Orthogonal Matching Pursuit (OMP) computation on Amazon's EC2 cluster to demonstrate a 19x speedup over current static execution models.
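
The statistical-barrier idea (release waiting workers once a quorum has arrived, rather than waiting for every straggler) can be sketched in a few lines. The class name and the `fraction` parameter below are illustrative choices of ours, not taken from the paper:

```python
import threading

class StatisticalBarrier:
    """Barrier that opens once a fraction of the parties have arrived,
    so straggling workers cannot hold back the rest."""

    def __init__(self, parties, fraction=0.75):
        self.needed = max(1, int(parties * fraction))  # quorum size
        self.count = 0
        self.generation = 0
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            gen = self.generation
            self.count += 1
            if self.count >= self.needed:
                # quorum reached: open the barrier for this generation
                self.generation += 1
                self.count = 0
                self.cond.notify_all()
            else:
                while gen == self.generation:
                    self.cond.wait()
```

A late straggler simply joins the next generation; the natural error tolerance of algorithms like OMP is what makes skipping its contribution acceptable.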


Three Fingered Jack: Tackling Portability, Performance, and Productivity with Auto-Parallelized Python

David Sheffield, Michael Anderson, and Kurt Keutzer, University of California, Berkeley

We present Three Fingered Jack, a highly productive approach to writing high-performance software in Python. Our system applies traditional dependence analysis and reordering transformations to a restricted set of Python loop nests. It does this to uncover and exploit vector and thread parallelism. Then it generates native code for multiple platforms. On four benchmark kernels and two applications, our system achieves 0.4–13.5x and 0.97–113.3x the performance of hand-written but not highly optimized C++ on Intel and ARM CPUs, respectively.

11:15 a.m.–11:30 a.m. Monday

Break

Market Street Foyer

11:30 a.m.–12:30 p.m. Monday

Programming Models

Collection-focused Parallelism

Micah J. Best and Daniel Jacobsen, University of British Columbia; Nicholas Vining, Gaslamp Games; Alexandra Fedorova, Simon Fraser University

Constructing parallel software is, in essence, the process of associating ‘work’ with computational units. The definition of work is dependent upon the model of parallelism used, and our choice of model can have profound effects on both programmer productivity and run-time efficiency. Given that the movement of data is responsible for the majority of parallelism overhead, and accessing data is responsible for the majority of parallelism errors, data items should be the basis for describing parallel work. As data items rarely exist in isolation and are instead parts of larger collections, we argue that subsets of collections should be the basic unit of parallelism. This requires a semantically rich method of referring to these sub-collections. Sub-collections are not guaranteed to be disjoint, and so an efficient run-time mechanism is required to maintain correctness. With a focus on complex systems, we present some of the challenges inherent in this approach and describe how we are extending Synchronization via Scheduling (SvS) and other techniques to overcome these difficulties. We discuss our experiences incorporating these techniques into a modern video game engine used in an in-development title.


Constrained Data-Driven Parallelism

Tim Harris, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Mark Moir, Oracle Labs

In data-driven parallelism, changes to data spawn new tasks, which may change more data, spawning yet more tasks. Computation propagates until no further changes occur. Benefits include increasing opportunities for fine-grained parallelism, avoiding redundant work, and supporting incremental computations on large data sets. Nonetheless, data-driven parallelism can be problematic. For example, convergence times of data-driven single-source shortest paths algorithms can vary by two orders of magnitude depending on task execution order. We propose constrained data-driven parallelism, in which programmers can impose ordering constraints on tasks. In particular, we propose new abstractions for defining groups of tasks and constraining the execution order of tasks within each group. We sketch an initial implementation and present experimental results demonstrating that our approach enables new efficient data-driven implementations of a variety of graph algorithms.
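
The shortest-paths example gives a feel for why ordering constraints matter: each change to a vertex's distance spawns tasks for its neighbors, and processing the worklist in tentative-distance order (rather than first-in, first-out) avoids most redundant re-relaxations. A sequential sketch of such a constrained worklist (the function and variable names are ours, not the paper's abstractions):

```python
import heapq

def sssp(graph, source):
    """Data-driven single-source shortest paths: a change to dist[v]
    spawns tasks for v's neighbors; ordering the worklist by tentative
    distance constrains which task runs next."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    work = [(0, source)]                 # priority-ordered worklist
    while work:
        d, u = heapq.heappop(work)
        if d > dist[u]:
            continue                     # stale task: data already improved
        for v, w in graph[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd             # a change to data...
                heapq.heappush(work, (nd, v))  # ...spawns a new task
    return dist
```

With a FIFO worklist the same code still converges, but vertices may be relaxed many times; the ordering constraint is what bounds the redundant work.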


Towards Performance-Portable, Scalable, and Convenient Linear Algebra

Philippe Tillet, Technische Universität Wien; Karl Rupp, Argonne National Laboratory; Siegfried Selberherr, Technische Universität Wien; Chin-Teng Lin, National Chiao Tung University

The rise of multi- and many-core architectures also gave birth to a plethora of new parallel programming models. Among these, the open industry standard OpenCL addresses this heterogeneity of programming environments by providing a unified programming framework. The price to pay, however, is that OpenCL requires additional low-level boilerplate code, when compared to vendor-specific solutions, even if only simple operations are to be performed. Also, the unified programming framework does not automatically provide any guarantees on performance portability of a particular implementation. Thus, device-specific compute kernels are still required for obtaining good performance across different hardware architectures.

We address both issues, programmability and portable performance, in this work. On the one hand, a high-level programming interface for linear algebra routines allows for the convenient specification of the operations of interest without having to go into the details of the underlying hardware. On the other hand, we discuss the underlying generator, which produces device-specific OpenCL kernels at runtime and is supplemented by an auto-tuning framework for portable performance as well as by work partitioning and task scheduling for multiple devices.

Our benchmark results show portable performance across hardware from major vendors. In all cases, at least 75 percent of the performance of the respective vendor-tuned library was obtained, while in some cases we even outperformed the reference. We further demonstrate the convenient and efficient use of our high-level interface in a multi-device setting with good scalability.

12:30 p.m.–2:00 p.m. Monday

FCW Luncheon

Imperial Ballroom

2:00 p.m.–2:45 p.m. Monday

Data Structures

Parallel Synchronization-Free Approximate Data Structure Construction

Martin Rinard, MIT CSAIL

We present approximate data structures with construction algorithms that execute without synchronization. The data races present in these algorithms may cause them to drop inserted or appended elements. Nevertheless, the algorithms 1) do not crash and 2) may produce a data structure that is accurate enough for its clients to use successfully. We advocate an approach in which the approximate data structures are composed of basic tree and array building blocks with associated synchronization-free construction algorithms. This approach enables developers to reuse the construction algorithms, which have been engineered to execute successfully in parallel contexts despite the presence of data races, without having to understand the details of why they execute successfully.

We evaluate the end-to-end accuracy and performance consequences of our approach by building a space-subdivision tree for the Barnes-Hut N-body simulation out of our presented tree and array building blocks. The resulting approximate data structure construction algorithm eliminates synchronization overhead and anomalies such as excessive serialization and deadlock. The algorithm exhibits good performance (running 14 times faster on 16 cores than the sequential version) and good accuracy (the accuracy loss is four orders of magnitude less than the accuracy gain associated with increasing the accuracy of the Barnes-Hut center of mass approximation by 20%).
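
As a flavor of the approach, here is a deliberately synchronization-free, fixed-capacity array builder in the spirit of the paper's building blocks (our own illustrative sketch, not the authors' implementation): concurrent appends race on the index, so an element may occasionally be dropped or overwritten, but no interleaving can crash or write out of bounds.

```python
import threading

class ApproxArray:
    """Fixed-capacity array with a deliberately unsynchronized append."""

    def __init__(self, capacity):
        self.items = [None] * capacity
        self.n = 0  # no lock: reads and writes of the index may race

    def append(self, x):
        i = self.n                   # racy read of the index
        if i < len(self.items):      # bounds check keeps us memory-safe
            self.items[i] = x        # two threads may target the same slot
            self.n = i + 1           # a lost update here drops an element

    def snapshot(self):
        return [v for v in self.items[:self.n] if v is not None]

# racy parallel construction: never crashes, may drop a few elements
arr = ApproxArray(capacity=400)
workers = [threading.Thread(target=lambda: [arr.append(1) for _ in range(100)])
           for _ in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

`len(arr.snapshot())` is at most 400 and in practice close to it; clients that tolerate small inaccuracies, like the Barnes-Hut approximation above, can use the result directly.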


Using Elimination and Delegation to Implement a Scalable NUMA-Friendly Stack

Irina Calciu, Brown University; Justin Gottschlich, Intel Labs; Maurice Herlihy, Brown University

Emerging cache-coherent non-uniform memory access (cc-NUMA) architectures provide cache coherence across hundreds of cores. These architectures change how applications perform: while local memory accesses can be fast, remote memory accesses suffer from high access times and increased interconnect contention. Because of these costs, performance of legacy code on NUMA systems is often worse than their uniform memory counterparts despite the potential for increased parallelism.

We explore these effects on prior implementations of concurrent stacks and propose the first NUMA-friendly stack design that improves data locality and minimizes interconnect contention. We achieve this by using a dedicated server thread that performs all operations requested by the client threads. Data is kept in the cache local to the server thread thereby restricting cache-to-cache traffic to messages exchanged between the clients and the server. In addition, we match reciprocal operations (pushes and pops) by using the rendezvous elimination algorithm before sending requests to the server. This has the dual effect of avoiding unnecessary interconnect traffic and reducing the number of operations that change the data structure. The result of combining elimination and delegation is a scalable and highly parallel stack that outperforms all previous stack implementations on NUMA systems.
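
The delegation half of the design is easy to sketch: one server thread owns the stack outright, so pushes and pops become messages and the stack itself needs no locks. The sketch below uses our own naming, omits the elimination layer, and leans on Python's thread-safe queues as the request channel; it illustrates the structure, not the paper's NUMA-level optimizations.

```python
import queue
import threading

class DelegationStack:
    """A stack owned by a single server thread; clients delegate
    operations by sending request messages."""

    def __init__(self):
        self.requests = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        stack = []  # touched only by the server thread: no locks needed
        while True:
            op, arg, reply = self.requests.get()
            if op == "push":
                stack.append(arg)
                reply.put(None)
            else:  # pop
                reply.put(stack.pop() if stack else None)

    def push(self, x):
        reply = queue.Queue(1)
        self.requests.put(("push", x, reply))
        reply.get()  # wait for the server to acknowledge

    def pop(self):
        reply = queue.Queue(1)
        self.requests.put(("pop", None, reply))
        return reply.get()
```

Keeping the stack in one thread's cache is the point: cache-to-cache traffic is reduced to the request and reply messages, and elimination (matching a push with a concurrent pop before either reaches the server) removes even those.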

2:45 p.m.–3:00 p.m. Monday

Break with Refreshments

Market Street Foyer

3:00 p.m.–4:00 p.m. Monday

Keynote Address

Parallelism in the Cloud

Eric Brewer, University of California, Berkeley, and Google

Parallelism in the cloud comes in two basic flavors: low-latency queries and batch processing. These flavors are both fundamental but quite different in their needs.  By separating the two we can build simpler systems that support both well. In particular, we cover the design of a new OS for cloud computing that brings these concepts all the way down to the metal, where the job shifts from virtualizing resources for typically local human users to more direct support for these two classes of parallelism under the remote control of a cluster-wide scheduler.


5:00 p.m.–6:30 p.m. Monday

Monday Happy Hour and HotPar ’13 Poster Session

Whether this is your first time at a USENIX event or your tenth, please stop by the Monday Happy Hour to start your week off right. Take advantage of this opportunity to meet your peers while enjoying refreshments and hors d'oeuvres and checking out the HotPar '13 posters. Each HotPar '13 paper will be presented, as well as additional posters listed on the Poster Session page.

 

Tuesday, June 25, 2013

8:15 a.m.–9:15 a.m. Tuesday

Continental Breakfast

Market Street Foyer

9:00 a.m.–10:00 a.m. Tuesday

Keynote Address

Challenges of Future Computing

William J. Dally, Stanford University and NVIDIA

Future computers of all sizes, from cell phones to supercomputers, share challenges of power and programmability to realize their potential. The end of Dennard scaling has made all computing power limited, so that performance is determined by energy efficiency. With improvements in process technology offering little increase in efficiency, innovations in architecture and circuits are required to maintain the expected performance scaling. The large-scale parallelism and deep storage hierarchy of future machines pose programming challenges. This talk will discuss these challenges in more detail and introduce some of the technologies being developed to address them.


10:00 a.m.–10:15 a.m. Tuesday

Break with Refreshments

Market Street Foyer

10:15 a.m.–11:15 a.m. Tuesday

Bugs

Characterizing Real World Bugs Causing Sequential Consistency Violations

Mohammad Majharul Islam and Abdullah Muzahid, University of Texas at San Antonio

With the ubiquitous availability of parallel architectures, the burden falls on programmers' shoulders to write correct parallel programs that have high performance and portability across different systems. One of the major issues that complicate this task is the intricacy of the underlying memory consistency models. Sequential Consistency (SC) is the simplest and most intuitive memory model. Therefore, programmers usually assume SC when writing parallel programs. However, various concurrency bugs can lead to violations of SC. These subtle bugs make the program difficult to reason about and virtually always lead to incorrectness. This paper provides the first (to the best of our knowledge) comprehensive study of the characteristics of SC violation bugs that appear in real-world codebases.

We have carefully examined the pattern, manifestation, effect, and fix strategy of 20 SC violation bugs. These bugs were selected from 127 randomly chosen concurrency bugs from a variety of open source programs and libraries (e.g., Mozilla, Apache, MySQL, GCC, Cilk, Java, and Splash2). Our study reveals interesting findings and provides useful guidance for future research in this area.


But How Do We Really Debug Transactional Memory Programs?

Justin E. Gottschlich, Rob Knauerhase, and Gilles Pokam, Intel Labs

With recent announcements of hardware transactional memory (HTM) systems from IBM and Intel, HTM will soon be available for widescale adoption. Such platforms, combined with tested and stable software transactional memory systems, are likely to make real transactional memory (TM) systems available for the first time, which promises to be a more attractive alternative than lock-based parallel programming in terms of programmability and performance.

With these first-ever real systems come several open questions. Perhaps one of the most obvious is, "how does one debug a TM program that uses real hardware?" While prior research in this area exists, there are, to the best of our knowledge, no commercially-available TM debuggers and only a handful of research projects exploring such possibilities, many of which use simulated HTMs that may utilize unrealistic hardware. In this paper, we motivate the need for more than traditional and ad hoc debugging support. We then propose a novel record-and-replay system, based on existing research prototypes and demonstrate its usefulness by reviewing several use cases in TM programming, restricting use to real features in IBM and Intel’s HTMs.


Property-Driven Cooperative Logging for Concurrency Bugs Replication

Nuno Machado, Paolo Romano, and Luís Rodrigues, INESC-ID, Instituto Superior Técnico, Universidade Técnica de Lisboa

This paper presents property-driven partial logging, a novel information partition scheme that takes into account load balancing and shared variable properties to optimize the replay of Java concurrency bugs through partial log combination. Preliminary evaluation with standard benchmarks and a real-world application provides initial evidence of the feasibility of our approach.

11:15 a.m.–11:30 a.m. Tuesday

Break

Market Street Foyer

11:30 a.m.–12:30 p.m. Tuesday

Panel

Practitioners vs. Academics

Panelists: Niko Matsakis, Mozilla; Russell Williams, Adobe Systems; Martin Rinard, Massachusetts Institute of Technology

12:30 p.m.–2:00 p.m. Tuesday

FCW Luncheon

Imperial Ballroom

2:00 p.m.–3:00 p.m. Tuesday

Semantics

Determinism Is Overrated: What Really Makes Multithreaded Programs Hard to Get Right and What Can Be Done About It

Junfeng Yang, Heming Cui, and Jingyue Wu, Columbia University

Our accelerating computational demand and the rise of multicore hardware have made parallel programs, especially shared-memory multithreaded programs, increasingly pervasive and critical. Yet, these programs remain extremely difficult to write, test, analyze, debug, and verify. Conventional wisdom has attributed these difficulties to nondeterminism, and researchers have recently dedicated much effort to bringing determinism into multithreading. In this paper, we argue that determinism is not as useful as commonly perceived: it is neither sufficient nor necessary for reliability. We present our view on why multithreaded programs are difficult to get right, describe a promising approach we call stable multithreading to dramatically improve reliability, and summarize our last four years' research on building and applying stable multithreading systems.


…And Region Serializability for All

Jessica Ouyang, Peter M. Chen, Jason Flinn, and Satish Narayanasamy, University of Michigan

A desirable concurrency semantics to provide for programs is region serializability. This strong semantics guarantees that all program regions between synchronization operations appear to execute in some global and serial order consistent with program order. Unfortunately, this guarantee is currently provided only to programs that are free of data races. For programs with data races, system designers currently face a difficult trade-off between losing all semantic guarantees and hurting performance. In this paper, we argue that region serializability should be guaranteed for all programs, including those with data races. This allows programmers, compilers, and other tools to reason about a program execution as an interleaving of code regions rather than memory instructions. We show one way to provide this guarantee with an execution style called uniparallelism and simple compiler support. The cost of the system is a 100% increase in utilization. However, if there are sufficient spare cores on the computer, the system adds a median overhead of only 15% for 4 threads.


Durability Semantics for Lock-based Multithreaded Programs

Dhruva R. Chakrabarti and Hans-J. Boehm, Hewlett-Packard Laboratories

Non-volatile storage connected as memory (NVRAM) offers promising opportunities for simplifying and accelerating manipulation of persistent data. Load and store latency is potentially comparable to that of ordinary memory. The challenge is to ensure that the persisted data remains consistent if a failure occurs during execution, especially in a multithreaded programming environment. In this paper, we provide semantics for identifying a globally consistent state for a lock-based multithreaded program. We show how to conveniently ensure that programs are returned to such a globally consistent state after a crash. We discuss challenges and opportunities along the way, and explain why adding durability to transactional programs may be less expensive.

3:00 p.m.–4:00 p.m. Tuesday

Debate

Determinism: Blessing or Curse?

Panelists: Junfeng Yang, Columbia University; Bryan Ford, Yale University; Fil Pizlo, Apple Inc.

4:00 p.m.–4:30 p.m. Tuesday

Networking Break with Refreshments

Market Street Foyer

4:30 p.m.–5:30 p.m. Tuesday
5:30 p.m.–5:35 p.m. Tuesday

Closing Remarks

Program Co-Chairs: Emery Berger, University of Massachusetts Amherst, and Kim Hazelwood, Google