8:30 a.m.–9:00 a.m., Monday
Continental Breakfast
Market Street Foyer
9:00 a.m.–10:00 a.m., Monday
Panelists: Niall Dalton, Calxeda; Brandon Lucia, University of Washington and Microsoft Research; Tipp Moseley, Google; Paul Peterson, Intel Corporation
10:00 a.m.–10:15 a.m., Monday
Break with Refreshments
Market Street Foyer
10:15 a.m.–11:15 a.m., Monday
Tsung-Han Lin, Stephen J. Tarsa, and H.T. Kung, Harvard University
We characterize a general class of algorithms common in machine learning, scientific computing, and signal processing, whose computational dependencies are both sparse and dynamically defined throughout execution. Existing parallel computing runtimes, like MapReduce and GraphLab, are a poor fit for this class because they assume statically defined dependencies when making resource allocation and scheduling decisions. As a result, changing load characteristics and straggling compute units degrade performance significantly. However, we show that the sparsity of computational dependencies and these algorithms’ natural error tolerance can be exploited to implement a flexible execution model with large efficiency gains, using two simple primitives: selective push-pull and statistical barriers. With reconstruction for compressive time-lapse MRI as a motivating application, we deploy a large Orthogonal Matching Pursuit (OMP) computation on Amazon’s EC2 cluster to demonstrate a 19x speedup over current static execution models.
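A minimal Python sketch of what a statistical barrier could look like (our illustration, not the paper's runtime; the 90 percent quorum and class name are assumptions): workers check in as they finish an iteration, and the barrier releases once a configurable fraction of them have arrived, so stragglers cannot stall the whole computation.

```python
import threading

class StatisticalBarrier:
    """Barrier that releases once a fraction of workers have arrived.

    Illustrative only: a real runtime would also track which updates
    were missed so the error-tolerant algorithm can compensate later.
    """
    def __init__(self, n_workers, fraction=0.9):
        self.required = max(1, int(n_workers * fraction))
        self.arrived = 0
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            self.arrived += 1
            if self.arrived >= self.required:
                self.cond.notify_all()   # quorum reached: release everyone
            else:
                # Block until enough workers have checked in; late
                # stragglers fall through on their own increment.
                self.cond.wait_for(lambda: self.arrived >= self.required)
```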
David Sheffield, Michael Anderson, and Kurt Keutzer, University of California, Berkeley
We present Three Fingered Jack, a highly productive approach to writing high-performance software in Python. Our system applies traditional dependence analysis and reordering transformations to a restricted set of Python loop nests in order to uncover and exploit vector and thread parallelism, and then generates native code for multiple platforms. On four benchmark kernels and two applications, our system achieves 0.4–13.5x and 0.97–113.3x of hand-written but not highly optimized C++ performance on Intel and ARM CPUs, respectively.
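As a hedged illustration of the kind of restricted loop nest such a system targets (the decorator and its behavior below are hypothetical stand-ins, not Three Fingered Jack's actual interface), consider a simple affine kernel written as plain Python loops whose iterations a dependence analyzer could prove independent and then vectorize and thread:

```python
import numpy as np

def parallelize(fn):
    # Hypothetical stand-in for a Python-to-native compiler pass; a real
    # system would analyze the loop nest, prove iterations independent,
    # and emit vectorized, multithreaded native code. Here: identity.
    return fn

@parallelize
def saxpy(a, x, y, out):
    # No loop-carried dependences across i, so every iteration can run
    # in parallel and be vectorized.
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]

x = np.arange(4.0)
y = np.ones(4)
out = np.empty(4)
saxpy(2.0, x, y, out)   # -> [1., 3., 5., 7.]
```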
11:15 a.m.–11:30 a.m., Monday
Break
Market Street Foyer
11:30 a.m.–12:30 p.m., Monday
Micah J. Best and Daniel Jacobsen, University of British Columbia; Nicholas Vining, Gaslamp Games; Alexandra Fedorova, Simon Fraser University
Constructing parallel software is, in essence, the process of associating ‘work’ with computational units. The definition of work is dependent upon the model of parallelism used, and our choice of model can have profound effects on both programmer productivity and run-time efficiency. Given that the movement of data is responsible for the majority of parallelism overhead, and accessing data is responsible for the majority of parallelism errors, data items should be the basis for describing parallel work. As data items rarely exist in isolation and are instead parts of larger collections, we argue that subsets of collections should be the basic unit of parallelism. This requires a semantically rich method of referring to these sub-collections. Sub-collections are not guaranteed to be disjoint, and so an efficient run-time mechanism is required to maintain correctness. With a focus on complex systems, we present some of the challenges inherent in this approach and describe how we are extending Synchronization via Scheduling (SvS) and other techniques to overcome these difficulties. We discuss our experiences incorporating these techniques into a modern video game engine used in an in-development title.
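A rough Python sketch of the core idea (our illustration, not the authors' SvS implementation; the scheduler below is a naive greedy stand-in for schedule-time signatures): tasks declare the sub-collection of items they touch, and two tasks run concurrently only when their sub-collections are disjoint.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_with_disjointness(tasks, max_workers=4):
    """tasks: list of (item_ids, fn) pairs, where item_ids is the
    sub-collection of a larger collection that fn reads or writes.
    Start a task only if its items do not overlap those of any task
    currently in flight."""
    pending = list(tasks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        in_flight = {}   # future -> frozenset of item ids
        while pending or in_flight:
            busy = set().union(*in_flight.values()) if in_flight else set()
            deferred = []
            for items, fn in pending:
                if busy.isdisjoint(items):
                    fut = pool.submit(fn)
                    in_flight[fut] = frozenset(items)
                    busy |= set(items)
                else:
                    deferred.append((items, fn))   # conflicts: try later
            pending = deferred
            if in_flight:
                # Wait for at least one task to finish before rescheduling.
                done = next(iter(as_completed(in_flight)))
                del in_flight[done]
```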
Tim Harris, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Mark Moir, Oracle Labs
In data-driven parallelism, changes to data spawn new tasks, which may change more data, spawning yet more tasks. Computation propagates until no further changes occur. Benefits include increasing opportunities for fine-grained parallelism, avoiding redundant work, and supporting incremental computations on large data sets. Nonetheless, data-driven parallelism can be problematic. For example, convergence times of data-driven single-source shortest paths algorithms can vary by two orders of magnitude depending on task execution order. We propose constrained data-driven parallelism, in which programmers can impose ordering constraints on tasks. In particular, we propose new abstractions for defining groups of tasks and constraining the execution order of tasks within each group. We sketch an initial implementation and present experimental results demonstrating that our approach enables new efficient data-driven implementations of a variety of graph algorithms.
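To make the ordering point concrete, here is a small sequential Python sketch (ours, not the paper's abstractions) of data-driven single-source shortest paths in which the "task group" is constrained to run tasks in order of tentative distance; with an unconstrained FIFO or random order, the same data-driven computation can re-relax nodes many times.

```python
import heapq

def sssp(graph, source):
    """graph: dict node -> list of (neighbor, weight) pairs.
    Each task relaxes one node's outgoing edges; a changed distance
    spawns a new task for the affected node. The heap plays the role
    of an ordering constraint on the task group: lowest tentative
    distance first."""
    dist = {source: 0}
    tasks = [(0, source)]                   # ordered task group
    while tasks:
        d, u = heapq.heappop(tasks)
        if d > dist.get(u, float("inf")):
            continue                        # stale task: data already improved
        for v, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w                      # change to data ...
                heapq.heappush(tasks, (d + w, v))    # ... spawns a task
    return dist
```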
Philippe Tillet, Technische Universität Wien; Karl Rupp, Argonne National Laboratory; Siegfried Selberherr, Technische Universität Wien; Chin-Teng Lin, National Chiao Tung University
The rise of multi- and many-core architectures has given birth to a plethora of new parallel programming models. Among these, the open industry standard OpenCL addresses this heterogeneity of programming environments by providing a unified programming framework. The price to pay, however, is that OpenCL requires additional low-level boilerplate code compared to vendor-specific solutions, even if only simple operations are to be performed. Moreover, the unified programming framework does not automatically provide any guarantees on the performance portability of a particular implementation, so device-specific compute kernels are still required to obtain good performance across different hardware architectures.
In this work we address both issues, programmability and portable performance: on the one hand, a high-level programming interface for linear algebra routines allows for the convenient specification of the operations of interest without having to go into the details of the underlying hardware. On the other hand, we discuss the underlying generator, which produces device-specific OpenCL kernels at runtime and is supplemented by an auto-tuning framework for portable performance as well as by work partitioning and task scheduling for multiple devices.
Our benchmark results show portable performance across hardware from major vendors. In all cases, at least 75 percent of the performance of the respective vendor-tuned library was obtained, while in some cases we even outperformed the reference. We further demonstrate the convenient and efficient use of our high-level interface in a multi-device setting with good scalability.
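The flavor of the approach can be sketched in a few lines of Python (an illustration under our own assumptions, not the authors' generator): a device-specific OpenCL kernel is produced at runtime from a template parameterized by tuning knobs, and an auto-tuner benchmarks candidate configurations to pick the fastest one for the device at hand. The `benchmark` callback is a hypothetical stand-in for building and timing the kernel with a real OpenCL runtime.

```python
import itertools

KERNEL_TEMPLATE = """
__kernel void axpy(__global const float *x, __global float *y, float a) {{
    int i = get_global_id(0) * {elems_per_thread};
    for (int k = 0; k < {elems_per_thread}; ++k)
        y[i + k] += a * x[i + k];
}}
"""

def generate_kernel(elems_per_thread):
    # Device-specific OpenCL source generated at runtime from the template.
    return KERNEL_TEMPLATE.format(elems_per_thread=elems_per_thread)

def autotune(benchmark, work_group_sizes=(64, 128, 256), elems=(1, 2, 4, 8)):
    """benchmark(src, wg) -> runtime in seconds (hypothetical: would build
    the program and time the kernel on the target device). Returns the
    fastest (work-group size, kernel source) configuration found."""
    best = None
    for wg, e in itertools.product(work_group_sizes, elems):
        src = generate_kernel(e)
        t = benchmark(src, wg)
        if best is None or t < best[0]:
            best = (t, wg, src)
    return best[1], best[2]
```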
12:30 p.m.–2:00 p.m., Monday
FCW Luncheon
Imperial Ballroom
2:00 p.m.–2:45 p.m., Monday
We present approximate data structures with construction algorithms that execute without synchronization. The data races present in these algorithms may cause them to drop inserted or appended elements. Nevertheless, the algorithms 1) do not crash and 2) may produce a data structure that is accurate enough for its clients to use successfully. We advocate an approach in which the approximate data structures are composed of basic tree and array building blocks with associated synchronization-free construction algorithms. This approach enables developers to reuse the construction algorithms, which have been engineered to execute successfully in parallel contexts despite the presence of data races, without having to understand the details of why they execute successfully.
We evaluate the end-to-end accuracy and performance consequences of our approach by building a space-subdivision tree for the Barnes-Hut N-body simulation out of our presented tree and array building blocks. The resulting approximate data structure construction algorithm eliminates synchronization overhead and anomalies such as excessive serialization and deadlock. The algorithm exhibits good performance (running 14 times faster on 16 cores than the sequential version) and good accuracy (the accuracy loss is four orders of magnitude less than the accuracy gain associated with increasing the accuracy of the Barnes-Hut center of mass approximation by 20%).
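A heavily simplified Python sketch of the idea (ours, and only illustrative: CPython's global interpreter lock hides most of the races this technique is designed to tolerate in native code): insertion into a shared tree uses no locks, and a lost race simply drops an element rather than corrupting the structure or crashing.

```python
import threading

class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert_unsynchronized(root, key):
    """Lock-free, best-effort insert: if two threads race on the same
    empty child slot, one write can be overwritten and that element is
    dropped, but the result remains a valid tree."""
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)      # racy write: may be overwritten
                return
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)     # racy write: may be overwritten
                return
            node = node.right

root = Node(50)
threads = [threading.Thread(target=insert_unsynchronized, args=(root, k))
           for k in (10, 70, 30, 90)]
for t in threads: t.start()
for t in threads: t.join()
```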
Irina Calciu, Brown University; Justin Gottschlich, Intel Labs; Maurice Herlihy, Brown University Emerging cache-coherent non-uniform memory access (cc-NUMA) architectures provide cache coherence across hundreds of cores. These architectures change how applications perform: while local memory accesses can be fast, remote memory accesses suffer from high access times and increased interconnect contention. Because of these costs, performance of legacy code on NUMA systems is often worse than their uniform memory counterparts despite the potential for increased parallelism.
We explore these effects on prior implementations of concurrent stacks and propose the first NUMA-friendly stack design that improves data locality and minimizes interconnect contention. We achieve this by using a dedicated server thread that performs all operations requested by the client threads. Data is kept in the cache local to the server thread, thereby restricting cache-to-cache traffic to messages exchanged between the clients and the server. In addition, we match reciprocal operations (pushes and pops) by using the rendezvous elimination algorithm before sending requests to the server. This has the dual effect of avoiding unnecessary interconnect traffic and reducing the number of operations that change the data structure. The result of combining elimination and delegation is a scalable and highly parallel stack that outperforms all previous stack implementations on NUMA systems.
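A compact Python sketch of the delegation structure (our approximation of the design, not the authors' code; a real implementation targets cc-NUMA hardware with lock-free request slots rather than Python queues, and the elimination layer that matches concurrent push/pop pairs before delegation is omitted here for brevity): clients post requests to a single server thread that owns the stack and keeps it in its local cache.

```python
import queue
import threading

class DelegationStack:
    """Single server thread owns the data; clients delegate operations."""
    def __init__(self):
        self.requests = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        data = []                               # stays local to the server
        while True:
            op, arg, reply = self.requests.get()
            if op == "push":
                data.append(arg)
                reply.put(None)
            else:                               # "pop"
                reply.put(data.pop() if data else None)

    def push(self, value):
        reply = queue.Queue(maxsize=1)
        self.requests.put(("push", value, reply))
        return reply.get()

    def pop(self):
        reply = queue.Queue(maxsize=1)
        self.requests.put(("pop", None, reply))
        return reply.get()
```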
2:45 p.m.–3:00 p.m., Monday
Break with Refreshments
Market Street Foyer
3:00 p.m.–4:00 p.m., Monday
Eric Brewer, University of California, Berkeley, and Google
Parallelism in the cloud comes in two basic flavors: low-latency queries and batch processing. These flavors are both fundamental but quite different in their needs. By separating the two we can build simpler systems that support both well. In particular, we cover the design of a new OS for cloud computing that brings these concepts all the way down to the metal, where the job shifts from virtualizing resources for typically local human users to more direct support for these two classes of parallelism under the remote control of a cluster-wide scheduler.
5:00 p.m.–6:30 p.m., Monday
Whether this is your first time at a USENIX event or your tenth, please stop by the Monday Happy Hour to start your week off right. Take advantage of this opportunity to meet your peers while enjoying refreshments and hors d’oeuvres and checking out the HotPar ’13 posters. Each HotPar ’13 paper will be presented, as well as the additional posters listed on the Poster Session page.