Sunday, October 5, 2014 |
8:30 a.m.–9:00 a.m. |
Sunday |
Continental Breakfast
Centennial Foyer |
9:00 a.m.–10:30 a.m. |
Sunday |
Welcome to TRIOS '14
Program Chair: Ken Birman, Cornell University
Kishore Kumar Pusukuri, Oracle Inc.
Dongyou Seo, Hyeonsang Eom, and Heon Y. Yeom, Seoul National University Most current CPUs are not single-core but multicore processors integrated in the Symmetric MultiProcessing (SMP) architecture, whose cores share resources such as the Last Level Cache (LLC) and the Integrated Memory Controller (IMC). On SMP platforms, contention for these resources can lead to severe performance degradation. Various methods have been developed to mitigate this contention; most focus on deciding which tasks should share the same resource, assuming that each task is the sole owner of a CPU core. Less attention has been paid to the multitasking case. To mitigate contention for memory subsystems, we have devised a new load balancing method, Memory-aware Load Balancing (MLB). MLB dynamically recognizes contention by using simple contention models and performs inter-core task migration. We have evaluated MLB on an Intel i7-2600 and a Xeon E5-2690, and found that our approach can be applied in an adaptive manner, leading to noticeable performance improvements for memory-intensive tasks on both CPU platforms. We have also compared MLB with the state-of-the-art method for the multitasking case, Vector Balancing & Sorted Co-scheduling (VBSC), finding that MLB can improve performance over VBSC without modifying the timeslice mechanism and is more effective in allowing I/O-bound applications to make progress. Unlike VBSC, MLB also handles the worst case, where many memory-intensive tasks become co-scheduled as non-memory-intensive ones terminate. In addition, MLB can achieve performance improvements in CPU-GPU communication in discrete GPU systems.
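The balancing idea the abstract describes can be illustrated with a toy sketch (this is not the paper's implementation; the `llc_mpki` metric and the threshold are hypothetical stand-ins for MLB's contention models): classify tasks by memory intensity, then migrate the memory-intensive ones so they are spread across cores rather than piled onto one.

```python
# Illustrative memory-aware rebalancing sketch. "llc_mpki" (LLC misses
# per kilo-instruction) is a hypothetical per-task contention metric;
# the threshold separating memory-intensive ("hot") tasks is arbitrary.

def mlb_rebalance(cores, threshold=10.0):
    """cores: list (one entry per core) of lists of (task, llc_mpki).
    Returns a new placement with memory-intensive tasks spread
    round-robin across cores to reduce LLC/IMC contention."""
    hot, cold = [], []
    for core in cores:
        for task, mpki in core:
            (hot if mpki >= threshold else cold).append((task, mpki))
    new_cores = [[] for _ in cores]
    # Spread the contenders first, one per core in turn...
    for i, t in enumerate(hot):
        new_cores[i % len(new_cores)].append(t)
    # ...then fill in the non-intensive tasks.
    for i, t in enumerate(cold):
        new_cores[i % len(new_cores)].append(t)
    return new_cores
```

On a real system, the migration step would use the kernel's affinity interfaces rather than returning a list; the sketch only shows the placement decision.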
Andrew Baumann and Chris Hawblitzel, Microsoft Research; Kornilios Kourtis, ETH Zürich; Tim Harris, Oracle Labs; Timothy Roscoe, ETH Zürich This paper tackles the problem of providing familiar OS abstractions for I/O (such as pipes, network sockets, and a shared file system) to applications on heterogeneous cores including accelerators, co-processors, and offload engines. We aim to isolate the implementation of these facilities from the details of a platform's memory architecture, which is likely to include a combination of cache-coherent shared memory, non-cache-coherent shared memory, and non-shared memory, all in the same system.
We propose coherence-oblivious sharing (Cosh), a new OS abstraction that provides inter-process sharing with
clear semantics on such diverse hardware. We have implemented a prototype of Cosh for the Barrelfish multikernel. We describe how to build common OS functionality using Cosh, and evaluate its performance on a heterogeneous system consisting of commodity cache-coherent CPUs and prototype Intel many-core co-processors.
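One way to picture "sharing with clear semantics" on hardware that may or may not be coherent is ownership transfer: a sketch along these lines (the names are hypothetical, not Cosh's actual API) shows a buffer whose sender loses access on transfer, which leaves the OS free to implement the move as a page remap on coherent memory or as a copy on non-coherent memory without changing program behavior.

```python
# Illustrative ownership-transfer buffer. Real coherence-oblivious
# sharing would involve OS-managed mappings or DMA; here the "move"
# is simulated by invalidating the source and handing out a copy.

class CoshBuffer:
    def __init__(self, data):
        self._data = bytearray(data)
        self._valid = True

    def transfer(self):
        """Move ownership: the source loses access; the destination
        gets exclusive read/write access to the data."""
        if not self._valid:
            raise RuntimeError("buffer already transferred")
        self._valid = False
        # On coherent hardware this could be a zero-copy remap; on
        # non-coherent hardware, a copy. Semantics are identical.
        return CoshBuffer(bytes(self._data))

    def read(self):
        if not self._valid:
            raise RuntimeError("access after transfer")
        return bytes(self._data)
```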
10:30 a.m.–11:00 a.m. |
Sunday |
Break with Refreshments
Centennial Foyer |
11:00 a.m.–12:30 p.m. |
Sunday |
Thanumalayan Sankaranarayana Pillai, Remzi H. Arpaci-Dusseau, and Andrea C. Arpaci-Dusseau, University of Wisconsin—Madison We introduce Fracture, a novel framework that transforms and modernizes the basic process abstraction. By “fracturing” an application into logical modules, Fracture enables powerful and novel run-time configurations that improve run-time testing, application availability, and general robustness, all in a generic and incremental manner. We demonstrate the utility of fracturing via in-depth case studies of a chat client, a web server, and two user-level file systems. Through these examples, we show that Fracture enables applications to transparently tolerate memory leaks and buffer overflows and to isolate subsystem crashes, with little change to source code; through intelligent fracturing, we can achieve low overhead as well, thus enabling deployment.
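The fault-isolation benefit described above can be sketched in miniature (this is an illustration of the general idea, not Fracture's mechanism, which isolates modules at the process level): a module runs behind a boundary that catches its failures and restarts it from a clean state, so a crash in one subsystem does not take down the application.

```python
# Toy module-isolation boundary: calls into the module are trapped,
# and a crashing module is replaced by a fresh instance.

class Fractured:
    def __init__(self, factory):
        self._factory = factory          # how to (re)build the module
        self._module = factory()
        self.restarts = 0

    def call(self, method, *args):
        try:
            return getattr(self._module, method)(*args)
        except Exception:
            self._module = self._factory()   # crash: restart module
            self.restarts += 1
            return None
```

In-process exception handling stands in here for the address-space isolation a fractured application would actually use.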
Naila Farooqui, Georgia Institute of Technology; Christopher Rossbach and Yuan Yu, Microsoft Research; Karsten Schwan, Georgia Institute of Technology Parallel architectures like GPUs are a tantalizing compute fabric for performance-hungry developers. While GPUs enable order-of-magnitude performance increases in many data-parallel application domains, writing efficient codes that can actually manifest those increases is a non-trivial endeavor, typically requiring developers to exercise specialized architectural features exposed directly in the programming model. Achieving good performance on GPUs involves effort-intensive tuning, typically requiring the programmer to manually evaluate multiple code versions in search of an optimal combination of problem decomposition with architecture- and runtime-specific parameters. For developers struggling to apply GPUs to more general-purpose computing problems, the introduction of irregular data structures and access patterns serves only to exacerbate these challenges, and only increases the level of effort required.
This paper proposes to automate much of this effort using dynamic instrumentation to inform dynamic, profile-driven optimizations. In this vision, the programmer expresses the application using higher-level front-end programming abstractions such as Dandelion, allowing the system, rather than the programmer, to explore the implementation and optimization space. We argue that such a system is both feasible and urgently needed. We present the design for such a framework, called Leo. For a range of benchmarks, we demonstrate that a system implementing our design can achieve a 1.12x to 27x speedup in kernel runtimes, which translates to a 7–40% improvement in end-to-end performance.
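The core loop of profile-driven optimization can be sketched very simply (illustrative only; Leo works on GPU kernels via dynamic instrumentation, not by timing Python callables): the system, not the programmer, measures candidate implementations and selects the best one.

```python
# Toy profile-driven variant selection: time each candidate on a
# sample input and keep the fastest.

import time

def pick_best(variants, sample):
    """variants: callables implementing the same computation.
    Returns the variant with the lowest measured runtime."""
    best, best_t = None, float("inf")
    for fn in variants:
        t0 = time.perf_counter()
        fn(sample)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = fn, dt
    return best
```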
Timo Hönig, Heiko Janker, Christopher Eibel, and Wolfgang Schröder-Preikschat, Friedrich-Alexander-Universität Erlangen-Nürnberg; Oliver Mihelic and Rüdiger Kapitza, Technische Universität Braunschweig Optimization of application and system software for energy efficiency is of ecological, economical, and technical importance—and still challenging. Deficient tooling support is a major issue. The few tools available (i.e., measurement instruments and energy profilers) have poorly conceived interfaces, and their integration into widely used development processes is missing. This makes measurements and profiling runs time-consuming and tedious, and hampers, if not prevents, the development of energy-efficient software.
We present PEEK, a systems approach to proactive energy-aware programming. PEEK fully automates energy measurement tasks and suggests program-code improvements at development time by providing automatically generated energy optimization hints. Our approach is based on a combined software and hardware infrastructure to automatically determine the energy demand of program code and pinpoint energy faults, thereby integrating seamlessly into existing software development environments. As part of PEEK we have designed a lightweight, yet powerful electronic measuring device capable of taking automated, analog energy measurements. Results show up to an 8.4-fold speed-up of energy analysis when using PEEK, while the energy consumption of the analyzed code was improved by 25.3%.
12:30 p.m.–2:00 p.m. |
Sunday |
Workshop Luncheon
Interlocken B |
2:00 p.m.–3:30 p.m. |
Sunday |
Robert Kiefer, Erik Nordstrom, and Michael J. Freedman, Princeton University Mobile devices regularly move between feast and famine—environments that differ greatly in the capacity and cost of available network resources. Managing these resources effectively is an important aspect of a user’s mobile experience. However, preferences for resource management vary across users, time, and operating conditions, and user and application interests may not align. Furthermore, today’s mobile OS mechanisms are typically coarse-grained, inflexible, and scattered across system and application settings. Users must adopt a "one size fits all" solution or micro-manage their devices.
This paper introduces Tango, a platform for managing network resource usage through a programmatic model that expresses user and app interests (“policies”). Tango centralizes policy expression and enforcement in a controller process that monitors device state and adjusts network usage according to a user’s (potentially dynamic) interests. To align interests and leverage app-specific knowledge, Tango uses a constraint model that informs apps of network limitations so they can optimize their usage. We evaluate how to design policies that account for data limits, user experience, and battery life. We demonstrate how Tango improves individual network-intensive apps like music streaming, as well as conditions when multiple apps compete for limited resources.
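The policy model the abstract describes can be illustrated with a small sketch (the function, state fields, and limits are hypothetical, not Tango's actual API): a controller observes device state and emits per-app constraints that apps can then adapt their usage to.

```python
# Illustrative Tango-style policy: constrain background apps on
# cellular networks, and block them entirely near the data cap.
# All field names and thresholds are made up for the example.

def policy(state):
    """state: dict with 'network', 'data_used_mb', 'data_cap_mb',
    'foreground_app', and 'apps'. Returns {app: max_kbps};
    None means unconstrained."""
    constraints = {}
    on_cellular = state["network"] == "cellular"
    near_cap = state["data_used_mb"] > 0.9 * state["data_cap_mb"]
    for app in state["apps"]:
        if app == state["foreground_app"]:
            constraints[app] = None      # user-facing app: unconstrained
        elif on_cellular and near_cap:
            constraints[app] = 0         # block background use
        elif on_cellular:
            constraints[app] = 64        # throttle background apps
        else:
            constraints[app] = None      # Wi-Fi: no limit
    return constraints
```

The point of informing apps of their constraints, rather than silently dropping traffic, is that an app such as a music streamer can respond intelligently, for instance by lowering its bitrate.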
Tomas Hruby, Teodor Crivat, Herbert Bos, and Andrew S. Tanenbaum, Vrije Universiteit Amsterdam Traditionally, applications use sockets to access the network. The socket API is well understood and simple to use. However, its simplicity has also limited its efficiency in existing implementations. Specifically, the socket API requires the application to execute many system calls like select, accept, read, and write. Each of these calls crosses the protection boundary between user space and the operating system, which is expensive. Moreover, the system calls themselves were not designed for high concurrency and have become bottlenecks in modern systems where processing simultaneous tasks is key to performance. We show that we can retain the original socket API without the current limitations. Specifically, our sockets almost completely avoid system calls on the "fast path". We show that our design eliminates up to 99% of the system calls under high load. Perhaps more tellingly, we used our sockets to boost NewtOS, a microkernel-based multiserver system, so that the performance of its network I/O approaches, and sometimes surpasses, the performance of the highly-optimized Linux network stack.
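The fast-path idea can be sketched as follows (a simulation of the concept, not the paper's implementation): instead of one kernel crossing per operation, the application appends to a shared ring and only "kicks" the network stack when the ring transitions from empty to non-empty, so a burst of sends costs one notification.

```python
# Simulated fast-path socket: sends are plain memory writes into a
# shared ring; the kernel crossing (here just a counter) happens only
# when the ring goes from empty to non-empty.

from collections import deque

class FastPathSocket:
    def __init__(self):
        self.ring = deque()
        self.syscalls = 0          # simulated kernel crossings

    def send(self, data):
        was_empty = not self.ring
        self.ring.append(data)     # no syscall: just a memory write
        if was_empty:
            self._kick()           # one notification wakes the stack

    def _kick(self):
        self.syscalls += 1         # stands in for a real system call

    def drain(self):
        """Consume everything queued (the network stack's side)."""
        out = list(self.ring)
        self.ring.clear()
        return out
```

Under high load the ring is rarely empty, which is why the vast majority of per-call crossings disappear.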
Qingyang Wang, Georgia Institute of Technology; Yasuhiko Kanemasa, Fujitsu Laboratories Ltd.; Jack Li, Chien-An Lai, and Chien-An Cho, Georgia Institute of Technology; Yuji Nomura, Fujitsu Laboratories Ltd.; Calton Pu, Georgia Institute of Technology In this paper, we describe an experimental study of very long response time (VLRT) requests in the latency long tail problem. Applying micro-level event analysis on fine-grained measurement data from n-tier application benchmarks, we show that very short bottlenecks (from tens to hundreds of milliseconds) can cause queue overflows that propagate through an n-tier system, resulting in dropped messages and VLRT requests due to timeout
and retransmissions. Our study shows that even at moderate CPU utilization levels, very short bottlenecks arise from several system layers, including Java garbage collection, anti-synchrony between workload bursts and DVFS clock rate adjustments, and statistical workload interferences among co-located VMs.
As a simple model for a variety of causes of VLRT requests, very short bottlenecks form the basis for a discussion of general remedies for VLRT requests, regardless of their origin. For example, methods that reduce or avoid queue amplification in an n-tier system result in non-trivial trade-offs among system components and their configurations. Our results show interesting challenges remain in both causes and effective remedies of very short bottlenecks.
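The mechanism can be reproduced in a toy discrete-time simulation (illustrative only, not the paper's measurement methodology): requests arrive steadily at a bounded queue, the server stalls briefly (standing in for a GC pause or a DVFS adjustment), the queue overflows, and the dropped requests become the timeout-and-retransmit VLRT requests.

```python
# Toy simulation of a very short bottleneck: a brief server stall
# against a bounded queue produces drops even though average
# utilization is moderate.

def simulate(ticks, arrival_per_tick, service_per_tick,
             queue_cap, stall_from, stall_to):
    """Returns the number of requests dropped due to queue overflow.
    The server does no work during ticks [stall_from, stall_to)."""
    queue, dropped = 0, 0
    for t in range(ticks):
        queue += arrival_per_tick
        if queue > queue_cap:                  # overflow: drop excess
            dropped += queue - queue_cap
            queue = queue_cap
        if not (stall_from <= t < stall_to):   # serve unless stalled
            queue = max(0, queue - service_per_tick)
    return dropped
```

With service capacity matching the arrival rate, no requests are ever dropped without the stall; a stall of a few ticks is enough to overflow the queue.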
3:30 p.m.–4:00 p.m. |
Sunday |
Break with Refreshments
Centennial Foyer |
4:00 p.m.–5:00 p.m. |
Sunday |
Andy Sayler and Dirk Grunwald, University of Colorado, Boulder In the age of cloud computing, securely storing, tracking, and controlling access to digital “secrets” (e.g. private cryptographic keys, hashed passwords, etc.) is a major challenge for developers, administrators, and end-users alike. Yet, the ability to securely store such secrets is critical to the security of the web-connected applications on which we rely. We believe many of the traditional challenges to the secure storage of digital secrets can be overcome through the creation of a dedicated “Secret Storage as a Service” (SSaaS) interface. Such an interface allows us to separate secure secret storage and access control from the applications that require such services. We present Custos: an SSaaS prototype. We describe the Custos design principles and architecture. We also discuss a range of applications in which Custos can be leveraged to store secrets such as cryptographic keys. We compare Custos-backed versions of such applications to the existing alternatives and discuss how Custos and the SSaaS model can improve the security of such applications while still supporting the wide range of features (e.g. multi-device syncing, multi-user sharing, etc.) we have come to expect in the age of the Cloud.
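The separation the SSaaS model argues for can be sketched in a few lines (a hypothetical in-memory store, not Custos's actual interface, which is a networked service): secrets live behind a dedicated store that enforces its own access control, so applications never embed secret storage or ACL logic themselves.

```python
# Minimal secret-store sketch: values are guarded by a per-secret
# access-control list, checked on every read.

class SecretStore:
    def __init__(self):
        self._secrets = {}   # name -> (value, set of authorized users)

    def put(self, owner, name, value):
        """Store a secret; only the owner can read it initially."""
        self._secrets[name] = (value, {owner})

    def grant(self, name, user):
        """Authorize another user to read the secret."""
        self._secrets[name][1].add(user)

    def get(self, user, name):
        value, acl = self._secrets[name]
        if user not in acl:
            raise PermissionError(f"{user} may not read {name}")
        return value
```

A real SSaaS deployment would add authentication, auditing of every access, and encrypted storage; the sketch shows only the structural separation of secrets from applications.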
David Wolinsky, Daniel Jackowitz, and Bryan Ford, Yale University Despite the attempts of well-designed anonymous communication tools to protect users from tracking or identification, flaws in surrounding software (such as web browsers) and mistakes in configuration may leak the user’s identity. We introduce Nymix, an anonymity-centric operating system architecture designed “top-to-bottom” to strengthen identity- and tracking-protection. Nymix’s core contribution is OS support for nymbrowsing: independent, parallel, and ephemeral web sessions. Each web session, or pseudonym, runs in a unique virtual machine (VM) instance evolving from a common base state, with support for long-lived sessions which can be anonymously stored to the cloud, avoiding de-anonymization despite potential confiscation or theft. Nymix allows a user to safely browse the Web using various transports simultaneously through a pluggable communication model that supports Tor, Dissent, and a private browsing mode. In evaluations, Nymix consumes 600 MB per nymbox and loads within 15 to 25 seconds.