Challenges in Real-Time Synchronization
Upcoming multi-core processors force developers of real-time systems to meet the challenges of real-time synchronization. This paper sketches some results of a novel wait-free, linearizable, and disjoint-access-parallel NCAS library, called RTNCAS. It focuses on use in massively parallel real-time systems and enables developers to implement arbitrarily complex data structures easily, while ensuring linearizability, wait-freedom, and disjoint-access parallelism. It furthermore allows developers to reuse their sequential algorithms without any modifications and without having to reason about concurrency. The degree of disjoint-access parallelism can thereby be used to trade lower jitter for higher average-case performance.
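An NCAS (N-word compare-and-swap) atomically updates several memory words at once. The sketch below gives only the sequential reference semantics a developer programs against; it uses a single lock, whereas the RTNCAS library described above realizes the same semantics wait-free and disjoint-access parallel. Class and method names here are illustrative, not the library's API.

```python
import threading

class NCAS:
    """Sequential reference semantics of an N-word compare-and-swap.
    A single lock stands in for the wait-free implementation."""

    def __init__(self):
        self._lock = threading.Lock()

    def ncas(self, memory, addresses, expected, new_values):
        # Atomically: if every word still holds its expected value,
        # install all new values; otherwise change nothing.
        with self._lock:
            if all(memory[a] == e for a, e in zip(addresses, expected)):
                for a, v in zip(addresses, new_values):
                    memory[a] = v
                return True
            return False

mem = {0: 'A', 1: 'B'}
op = NCAS()
op.ncas(mem, [0, 1], ['A', 'B'], ['X', 'Y'])  # succeeds: both words matched
op.ncas(mem, [0, 1], ['A', 'B'], ['P', 'Q'])  # fails: memory has changed
```

A sequential algorithm can thus be reused unchanged: each of its multi-word updates is expressed as one NCAS, and the library makes that step atomic.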
Dynamic Prioritization for Parallel Traversal of Irregularly Structured Spatio-Temporal Graphs
We introduce a dynamic prioritization scheme for parallel traversal of an irregularly structured spatio-temporal graph by multiple collaborative agents. We are concerned in particular with parallel execution of a sparse or fast algorithm for, but not limited to, an all-to-all transformation by multiple threads on a multi-core or many-core processor. Such fast algorithms play an escalating role as the basic modules that constitute, at increasingly large scale, computational solutions and simulations in scientific inquiry and engineering design. We describe certain typical features of a spatio-temporal graph, not necessarily a tree, that describes a sparse or fast algorithm in execution. We show how the proposed dynamic prioritization scheme guarantees the shortest traversal time subject to a constraint on the number of agents, in both ideal and practical situations. We present experimental results from applying this scheme to the celebrated fast multipole method.
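The core idea of a prioritized traversal can be sketched as a ready-set driven walk of a task DAG: a node becomes ready when all its predecessors have completed, and the ready set is drained in priority order. The `priority` function below is a hypothetical stand-in for the paper's dynamic prioritization rule; the single-agent loop only illustrates the scheduling discipline, not the multi-agent implementation.

```python
import heapq

def prioritized_traversal(graph, roots, priority):
    """Traverse a task DAG, always executing the ready node with the
    highest priority. `graph` maps node -> list of successors; `roots`
    are the initially ready nodes (no predecessors); `priority` is a
    stand-in for a dynamic prioritization rule, e.g. longest remaining
    critical path first."""
    indegree = {n: 0 for n in graph}
    for succs in graph.values():
        for s in succs:
            indegree[s] += 1
    # Max-heap via negated priorities.
    heap = [(-priority(n), n) for n in roots]
    heapq.heapify(heap)
    order = []
    while heap:
        _, n = heapq.heappop(heap)
        order.append(n)           # "execute" the task
        for s in graph[n]:
            indegree[s] -= 1
            if indegree[s] == 0:  # all predecessors done: s is ready
                heapq.heappush(heap, (-priority(s), s))
    return order
```

With multiple agents, each agent pops from a shared (or distributed) ready heap; the prioritization rule decides which ready subtasks the limited pool of agents should take next.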
Support of Collective Effort Towards Performance Portability
Performance portability, in the sense that a single source can run with good performance across a wide variety of parallel hardware platforms, is strongly desired by industry and actively being researched. However, evidence is mounting that performance portability cannot be realized at just the toolchain level, just the runtime level, or just the hardware-abstraction level.
This is a position paper that suggests how the groups involved can more efficiently solve the performance portability problem together. We do not propose a solution itself, but rather a support system that lets the players self-organize and collectively find one. The support system is based on a new extensible virtualization mechanism called VMS (Virtualized Master-Slave), which fulfills the needs of an organizing principle and provides focus that may increase research efficiency. The difficult work will be the ongoing research efforts on parallel language design, compilers, source-to-source transformation tools, binary optimization, run-time schedulers, and hardware support for parallelism. Although it does not in itself solve the problem, such an organizing principle may be a valuable step towards a solution: the problem may be too complex, and require the cooperation of too many real-world entities, for a single-entity solution.
We briefly review VMS and illustrate how it could be used to give rise to an ecosystem in which performance portability is collectively realized. To support the suggestion, we give measurements of the time to implement three parallelism-construct libraries, performance numbers for them, and measurements of the basic overhead of VMS.
Efficient and Correct Transactional Memory Programs Combining Snapshot Isolation and Static Analysis
Using the Snapshot Isolation (SI) level in Transactional Memory (TM) eliminates the need to track memory read accesses, reducing run-time overhead and speeding up the commit phase. By detecting only write-write conflicts, SI allows many memory transactions to succeed that would otherwise abort if serialized. This higher commit rate comes at the expense of introducing anomalous behaviors, by allowing some genuinely conflicting transactions to commit. We aim to improve the performance of TM systems by running programs under SI while guaranteeing serializable semantics. This is achieved by static analysis of TM programs, using Separation Logic to detect possible anomalies when running under SI. To guarantee correct behavior, the program code can be automatically modified to avoid these anomalies. Our approach can have an important impact on the performance of single multi-core node TM systems, and also of distributed TM systems, by considerably reducing the required network traffic.
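The key property SI exploits can be shown in a few lines: commit validates only the write set (first-committer-wins on write-write conflicts), while reads are served from a snapshot and never tracked or validated. This is an illustrative toy, with assumed names, not the paper's system.

```python
class SIStore:
    """Toy software TM at snapshot isolation: only write-write
    conflicts abort; read accesses are never validated."""
    def __init__(self):
        self.data = {}     # loc -> value
        self.version = {}  # loc -> number of commits to loc

    def begin(self):
        return SITx(self)

class SITx:
    def __init__(self, store):
        self.store = store
        self.writes = {}
        self.snap_data = dict(store.data)        # value snapshot
        self.snap_version = dict(store.version)  # version snapshot

    def read(self, loc):
        # Read from own writes or the snapshot; no read-set tracking.
        return self.writes.get(loc, self.snap_data.get(loc))

    def write(self, loc, value):
        self.writes[loc] = value

    def commit(self):
        # Abort only if another transaction committed to a written loc.
        for loc in self.writes:
            if self.store.version.get(loc, 0) != self.snap_version.get(loc, 0):
                return False
        for loc, v in self.writes.items():
            self.store.data[loc] = v
            self.store.version[loc] = self.store.version.get(loc, 0) + 1
        return True
```

Because reads go unvalidated, a transaction that reads a location another transaction concurrently writes still commits; this is exactly the class of potential anomaly the paper's static analysis detects and repairs.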
Coding Stencil Computations Using the Pochoir Stencil-Specification Language
Pochoir is a compiler for a domain-specific language embedded in C++ that produces excellent code from a simple specification of a desired stencil computation. Pochoir allows a wide variety of boundary conditions to be specified, and it automatically parallelizes the computation and optimizes cache performance. Benchmarks of Pochoir-generated code demonstrate a performance advantage of 2–10 times over standard parallel loop implementations. This paper describes the Pochoir specification language and shows how a wide range of stencil computations can be easily specified.
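For readers unfamiliar with stencils, the sketch below shows the kind of computation being specified: each interior grid point is updated from its neighbors' previous values, with fixed boundary cells. This is the straightforward loop-nest baseline that cache-oblivious generated code is compared against; Pochoir's actual C++ specification syntax is not reproduced here.

```python
def jacobi_step(u):
    """One step of a 4-neighbor averaging stencil (2D heat/Jacobi)
    over a list-of-lists grid. Boundary cells are fixed (Dirichlet):
    only interior points are updated, all from the old grid."""
    n, m = len(u), len(u[0])
    v = [row[:] for row in u]  # copy so all reads see the old values
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            v[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                              + u[i][j-1] + u[i][j+1])
    return v
```

A stencil compiler takes exactly this kind of per-point update rule plus a boundary condition, and generates a parallel, cache-efficient traversal of the space-time grid in its place.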
Feasibility of Dynamic Binary Parallelization
This paper proposes DBP, an automatic technique that transparently parallelizes a sequential binary executable while it is running. A prototype implementation in simulation was able to increase sequential execution speeds by up to 1.96x, averaged over three benchmark suites.
Automated Fingerprinting of Performance Pathologies Using Performance Monitoring Units (PMUs)
Modern architectures provide access to many hardware performance events, which can reveal architectural performance bottlenecks throughout the core and memory hierarchy. These events can give programmers unique and powerful insight into the causes of performance problems in their programs, but interpreting them has been a significant challenge. We describe a technique that uses data mining to automatically fingerprint a program's performance problems, permitting programmers to reap the architectural insights the events make possible while shielding them from the onerous task of interpreting raw events. We use a decision tree algorithm on a set of micro-benchmarks to construct a model of performance problems. This extracted model is able to divide a profiled application into program phases and label each phase with its pattern of hardware bottlenecks. Our framework provides programmers with a detailed map of what to optimize in their code, sparing them the need to interpret raw events.
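The shape of such a fingerprinting model can be illustrated with a hand-written stand-in for the mined decision tree: a cascade of threshold tests on normalized PMU event rates that maps a phase to a bottleneck label. The event names, thresholds, and labels below are assumptions for illustration, not the paper's learned model.

```python
def fingerprint(events):
    """Toy stand-in for a decision tree mined from micro-benchmarks:
    classify a program phase from normalized PMU event rates.
    All event names and thresholds are illustrative."""
    if events['llc_misses_per_kinst'] > 20:
        return 'memory-bound: LLC misses'
    if events['branch_miss_rate'] > 0.05:
        return 'frontend-bound: branch mispredictions'
    if events['ipc'] < 1.0:
        return 'core-bound: low ILP'
    return 'no dominant bottleneck'
```

The point of learning the tree from micro-benchmarks with known pathologies, rather than writing it by hand, is that the thresholds and event combinations differ per microarchitecture and are hard to guess correctly.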
PACORA: Performance Aware Convex Optimization for Resource Allocation
Resource allocation is the dynamic allocation and de-allocation of processor cores, memory pages, and various categories of bandwidth to client subcomputations (e.g., processes within an operating system) that compete for those resources. Historically, resource allocation has been rather unsystematic, and the assumptions underlying the traditional strategies no longer hold. The existence of parallel software and multiple heterogeneous processor cores, the requirement to maintain quality-of-service agreements and responsiveness, and the need to limit power and energy consumption (especially in mobile systems) are inadequately addressed by the status quo. A new approach to resource allocation is described that addresses all of these needs and that can be framed as a convex optimization problem; as a result, an optimal solution exists, is unique, and has no local extrema, making rational, fast, and fully incremental solutions to the resource allocation problem feasible.
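Why convexity helps can be seen in a minimal instance: maximize a sum of concave per-client utilities subject to a linear capacity constraint. With the classic weighted-log utility (an assumed toy objective, far simpler than PACORA's response-time model), the unique optimum even has a closed form: allocations proportional to the weights.

```python
def allocate(weights, capacity):
    """Solve: maximize sum_i w_i * log(x_i)
              subject to sum_i x_i = capacity, x_i > 0.
    Concave objective + linear constraint => one global optimum,
    no local extrema; Lagrangian stationarity w_i / x_i = lambda
    gives x_i = capacity * w_i / sum(w) (proportional fairness)."""
    total = sum(weights)
    return [capacity * w / total for w in weights]
```

Because the optimum is unique and varies smoothly with the inputs, the allocator can re-solve incrementally as weights (client importance) or capacity change, rather than restarting a heuristic search.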
Schedule Data, Not Code
In this paper we argue that the scheduler, as the intermediary between hardware and software, needs to be fully data-aware. The old paradigm of envisioning tasks as amorphous blobs of 'work' to be assigned to processors is incomplete and needs to be expanded. Some techniques and projects have emerged that implicitly use this idea, but they either focus on a small aspect of data or are targeted at optimizing specific problems. We argue for more general solutions.
Data Parallel Programming for Irregular Tree Computations
The adoption of data parallel primitives to increase multicore utilization presents an opportunity for aggressive compiler optimization. We examine computations over the tree abstract datatype (ADT) in particular.
For better utilization than approaches like flattening, we argue that transformations should specialize for common data and computation regularities. For example, we demonstrate a novel pattern that exploits regularity in node labeling as a SIMD parallelism opportunity, which we call SIMTask parallelism. For better applicability, we argue for better linguistic support of irregularity. For example, we show how the future primitive might be used to tolerate occasional data dependencies in an otherwise associative computation.
We validate our approach on two tree computations: RepMin, popular in the functional programming community, and intrinsic widths, a stage of webpage layout in a prototype web browser. We show speedups over traditional versions of these computations of 124X and 10X, respectively.
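The RepMin benchmark mentioned above is easy to state: replace every leaf of a tree with the tree's global minimum. The sketch below uses two sequential passes over a tuple-encoded binary tree; the functional-programming classic fuses them into one lazy pass, and the paper's contribution is running such traversals data-parallel. The encoding here is an assumption for illustration.

```python
def repmin(tree):
    """RepMin over a binary tree encoded as nested 2-tuples, with a
    bare value as a leaf. Pass 1 folds to the global minimum (an
    associative reduction); pass 2 maps it back over the structure."""
    def tree_min(t):
        if isinstance(t, tuple):
            return min(tree_min(t[0]), tree_min(t[1]))
        return t

    def rebuild(t, m):
        if isinstance(t, tuple):
            return (rebuild(t[0], m), rebuild(t[1], m))
        return m

    return rebuild(tree, tree_min(tree))
```

The downward pass is a pure map and the upward pass an associative reduction, which is what makes the computation amenable to the data-parallel primitives and SIMD-style specialization the paper discusses.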
Are Database-style Transactions Right for Modern Parallel Programs?
The early transactional memory (TM) programming model and semantics were inspired by ideas and semantics from database transactions, and today's state-of-the-art TM systems are not far removed from them. But today's TM systems are being used in modern and emerging parallel applications that are very different from the database programs those transaction semantics and programming model were designed for. The differences include the nature of the synchronization itself, the amount of flexibility demanded from the programming model, and the amount of programmer control desired in the programming process. While today's TM systems offer an elegant and concise conceptual interface, they typically offer a rigid set of semantics to the programmer and are intended to be used only as black-box components when writing parallel programs. They provide guarantees of properties such as strict atomicity and isolation irrespective of whether a program's semantics require them, and usually at a significant performance cost. In this paper we argue that this rigid view of memory transactions limits the range of expressibility necessary for supporting some important parallel programming idioms in some soft-computing programs. Using three specific behaviors as examples, we also argue that relaxing some of these guarantees, and maintaining the flexibility to support the specific style of synchronization a program needs, can substantially improve expressibility and parallel performance, especially in programs with long-running transactions that discard significant amounts of work when they abort.