8:30 a.m.–9:00 a.m., Thursday
Continental Breakfast
Market Street Foyer
9:00 a.m.–10:30 a.m., Thursday
Session Chair: Calton Pu, Georgia Institute of Technology
The World-Wide Web contains vast quantities of structured data on a variety of domains, such as hobbies, products and reference data. Moreover, the Web provides a platform that can encourage publishing more data sets from governments and other public organizations and support new data management opportunities, such as effective crisis response, data journalism and crowd-sourcing data sets. For the first time since the emergence of the Web, structured data is being used widely by search engines and is being collected via a concerted effort.
I will describe some of the efforts we are conducting at Google to collect structured data, filter the high-quality content, and serve it to our users. These efforts include providing Google Fusion Tables, a service for easily ingesting, visualizing and integrating data, mining the Web for high-quality HTML tables, and contributing these data assets to Google's other services.
Alon Halevy heads the Structured Data Management Research group at Google. Prior to that, he was a professor of Computer Science at the University of Washington in Seattle, where he founded the database group. In 1999, Dr. Halevy co-founded Nimble Technology, one of the first companies in the Enterprise Information Integration space, and in 2004, Dr. Halevy founded Transformic, a company that created search engines for the deep web, and was acquired by Google. Dr. Halevy is a Fellow of the Association for Computing Machinery, received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000, and was a Sloan Fellow (1999-2000). He received his Ph.D. in Computer Science from Stanford University in 1993 and his Bachelor's degree from the Hebrew University in Jerusalem. Halevy is also a coffee culturalist; he published the book The Infinite Emotions of Coffee in 2011 and is a co-author of the book Principles of Data Integration, published in 2012.
10:30 a.m.–11:00 a.m., Thursday
Break with Refreshments
Market Street Foyer
11:00 a.m.–12:30 p.m., Thursday
Session Chair: Christopher Stewart, The Ohio State University
Yanfei Guo, Jia Rao, and Xiaobo Zhou, University of Colorado, Colorado Springs Awarded Best Paper! Hadoop is a popular implementation of the MapReduce framework for running data-intensive jobs on clusters of commodity servers. Although Hadoop automatically parallelizes job execution with concurrent map and reduce tasks, we find that shuffle, the all-to-all input data fetching phase in a reduce task, can significantly affect job performance. We attribute the delay in job completion to the coupling of the shuffle phase and reduce tasks, which leaves the potential parallelism between multiple waves of map and reduce unexploited, fails to address data distribution skew among reduce tasks, and makes task scheduling inefficient. In this work, we propose to decouple shuffle from reduce tasks and convert it into a platform service provided by Hadoop. We present iShuffle, a user-transparent shuffle service that pro-actively pushes map output data to nodes via a novel shuffle-on-write operation and flexibly schedules reduce tasks considering workload balance. Experimental results with representative workloads show that iShuffle reduces job completion time by as much as 30.2%.
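The workload-balancing aspect of such a shuffle service can be pictured with a small greedy heuristic: each map-output partition is assigned to the currently least-loaded reduce node. The sketch below is only an illustration under assumed partition sizes and node names; it is not the authors' iShuffle implementation.

```python
# Illustrative sketch: greedy, balance-aware placement of map-output partitions
# onto reduce nodes, in the spirit of the workload balancing an independent
# shuffle service can perform. Partition sizes and node names are hypothetical.
import heapq

def place_partitions(partition_sizes, nodes):
    """Assign each partition to the currently least-loaded node."""
    heap = [(0, node) for node in nodes]      # (bytes assigned so far, node)
    heapq.heapify(heap)
    placement = {}
    # Placing the largest partitions first keeps the maximum node load low.
    for pid, size in sorted(partition_sizes.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        placement[pid] = node
        heapq.heappush(heap, (load + size, node))
    return placement

if __name__ == "__main__":
    sizes = {0: 900, 1: 300, 2: 250, 3: 800, 4: 100}   # MB of map output per partition
    print(place_partitions(sizes, ["node-1", "node-2", "node-3"]))
```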
João Paiva, Pedro Ruivo, Paolo Romano, and Luís Rodrigues, INESC-ID Lisboa, Instituto Superior Técnico, and Universidade Técnica de Lisboa Best Paper Award Finalist This paper addresses the problem of autonomic data placement in replicated key-value stores. The goal is to automatically optimize replica placement in a way that leverages locality patterns in data accesses, such that inter-node communication is minimized. To do this efficiently is extremely challenging, as one needs not only to find lightweight and scalable ways to identify the right data placement, but also to preserve fast data lookup. The paper introduces new techniques that address each of the challenges above. The first challenge is addressed by optimizing, in a decentralized way, the placement of the objects generating most remote operations for each node. The second challenge is addressed by combining the usage of consistent hashing with a novel data structure, which provides efficient probabilistic data placement. These techniques have been integrated in Infinispan, a popular open-source key-value store. The performance results show that the throughput of the optimized system can be 6 times better than a baseline system employing the widely used static placement based on consistent hashing.
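The lookup path behind such a scheme can be sketched as consistent hashing plus a small table of per-key overrides for relocated, frequently accessed objects. The code below is a minimal illustration with invented node and key names; it is not Infinispan's code or the paper's probabilistic data structure.

```python
# Minimal sketch of combining consistent hashing with a small override table
# for objects that have been moved closer to the node accessing them most.
import bisect
import hashlib

def _h(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class PlacementDirectory:
    def __init__(self, nodes, vnodes=64):
        # Consistent-hash ring: each node owns several virtual points.
        self.ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]
        self.overrides = {}                    # hot keys moved off their default node

    def default_owner(self, key):
        idx = bisect.bisect(self.keys, _h(key)) % len(self.ring)
        return self.ring[idx][1]

    def relocate(self, key, node):
        """Record that a hot key now lives on the node that accesses it most."""
        self.overrides[key] = node

    def lookup(self, key):
        return self.overrides.get(key, self.default_owner(key))

if __name__ == "__main__":
    d = PlacementDirectory(["A", "B", "C"])
    print(d.lookup("user:42"))                 # default, hash-based placement
    d.relocate("user:42", "B")                 # locality-driven move
    print(d.lookup("user:42"))
```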
Seokyong Hong, Padmashree Ravindra, and Kemafor Anyanwu, North Carolina State University MapReduce data processing workflows often consist of multiple cycles where each cycle hosts the execution of some data processing operators (e.g., join) defined in a program. A common situation is that many data items that are propagated along a workflow end up being "fruitless," i.e., they do not contribute to the final output. Given that the dominant costs associated with MapReduce processing (I/O, sorting and network transfer) are very sensitive to the size of intermediate states, such fruitless data items contribute unnecessarily to workflow costs. Consequently, it may be possible to improve the performance of MapReduce data processing workflows by eliminating fruitless data items as early as possible. Achieving this requires maintaining extra information about the state (output) of each operator and then passing this information to descendant operators in the workflow. The descendant operators can use this state information to prune fruitless data items from their other inputs. However, this process is not without overhead, and in some cases its costs may outweigh its benefits. Consequently, a technique for adaptively selecting Information Passing as part of an execution plan is needed. This adaptivity needs to be determined by a cost model that accounts for MapReduce's partitioned execution model as well as its restricted model of communication between operators. These nuances of MapReduce impose limitations on the applicability of information passing techniques developed for traditional database systems.
In this paper, we propose an approach for implementing Adaptive Information Passing for MapReduce platforms. Our proposal includes a benefit estimation model and an approach for collecting the data statistics needed for benefit estimation, which piggybacks on operator execution. Our approach has been integrated into Apache Hive, and we present a comprehensive empirical evaluation.
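At its core, information passing amounts to one operator publishing a summary of the keys it actually produced, and a descendant operator using that summary to drop records that cannot reach the final output, guarded by a simple benefit check. The sketch below is a hypothetical illustration of that idea in plain Python; the field names, threshold, and helper functions are assumptions, not the Hive integration described in the paper.

```python
# Illustrative sketch of information passing between workflow stages: an
# upstream operator exports the set of join keys it emitted, and a downstream
# operator prunes records whose keys cannot contribute to the final output,
# unless the estimated saving is too small to be worth the extra pass.

def summarize_output(records, key_field):
    """State an operator passes downstream: the distinct keys it emitted."""
    return {rec[key_field] for rec in records}

def prune_input(records, key_field, upstream_keys, selectivity_threshold=0.8):
    """Drop records whose key is absent upstream, if the saving is worthwhile."""
    keep = [rec for rec in records if rec[key_field] in upstream_keys]
    if len(keep) / max(len(records), 1) > selectivity_threshold:
        return records        # too few fruitless records; skip the extra work
    return keep

if __name__ == "__main__":
    orders = [{"cust": 1, "amt": 10}, {"cust": 3, "amt": 7}]
    customers = [{"cust": 1}, {"cust": 2}, {"cust": 3}, {"cust": 4}]
    keys = summarize_output(orders, "cust")
    print(prune_input(customers, "cust", keys))   # customers 2 and 4 are fruitless
```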
12:30 p.m.–2:00 p.m., Thursday
FCW Luncheon
Market Street Foyer
2:00 p.m.–3:45 p.m., Thursday
Session Chairs: Karsten Schwan, Georgia Institute of Technology, and Vanish Talwar, HP Labs
Nathan D. Mickulicz, Priya Narasimhan, and Rajeev Gandhi, YinzCam, Inc., and Carnegie Mellon University YinzCam is a cloud-hosted service that provides sports fans with real-time scores, news, photos, statistics, live radio, streaming video, etc., on their mobile devices. YinzCam’s infrastructure is currently hosted on Amazon Web Services (AWS) and supports over 7 million downloads of the official mobile apps of 40+ professional sports teams and venues. YinzCam’s workload is necessarily multi-modal (e.g., pre-game, in-game, post-game, game-day, non-game-day, in-season, off-season) and exhibits large traffic spikes due to extensive usage by sports fans during the actual hours of a game, with game-time traffic being twenty times that of non-game days.
We discuss the system’s performance in the three phases of its evolution: (i) when we initially deployed the YinzCam infrastructure and our users experienced unpredictable latencies and a large number of errors, (ii) when we enabled AWS’ Auto Scaling capability to reduce the latency and the number of errors, and (iii) when we analyzed the YinzCam architecture and discovered opportunities for architectural optimization that allowed us to provide predictable performance with lower latency, fewer errors, and lower cost, compared with enabling Auto Scaling.
Alina Beygelzimer, Anton Riabov, Daby Sow, Deepak S. Turaga, and Octavian Udrea, IBM T. J. Watson Research Center Large-scale data exploration using Big Data platforms requires the orchestration of complex analytic workflows composed of atomic analytic components for data selection, feature extraction, modeling and scoring. In this paper, we propose an approach that uses a combination of planning and machine learning to automatically determine the most appropriate data-driven workflows to execute in response to a user-specified objective. We combine this with orchestration mechanisms and automatically deploy, adapt and manage such workflows across Big Data platforms. We present results of this automated exploration in real settings in healthcare.
Shekhar Gupta, Christian Fritz, Bob Price, Roger Hoover, and Johan DeKleer, Palo Alto Research Center; Cees Witteveen, Delft University of Technology Hadoop is the de-facto standard for big data analytics applications. Presently available schedulers for Hadoop clusters assign tasks to nodes without regard to the capability of the nodes. We propose ThroughputScheduler, which reduces the overall job completion time on a cluster of heterogeneous nodes by actively scheduling tasks on nodes based on optimally matching job requirements to node capabilities. Node capabilities are learned by running probe jobs on the cluster. ThroughputScheduler uses a Bayesian active learning scheme to learn the resource requirements of jobs on-the-fly. An empirical evaluation on a set of sample problems demonstrates that ThroughputScheduler can reduce total job completion time by almost 20% compared to the Hadoop FairScheduler and 40% compared to FIFOScheduler. ThroughputScheduler also reduces average mapping time by 33% compared to either of these schedulers.
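The core matching step in such a capability-aware scheduler can be pictured as scoring each free node by how quickly it could run a task, given the job's learned resource profile. The sketch below uses a made-up two-dimensional (CPU, disk) profile and a simple ratio-based score; it is an assumption for illustration only, not ThroughputScheduler's actual model.

```python
# Illustrative sketch: assign a pending task to the free node whose measured
# capabilities best match the job's learned resource demands. Profiles and the
# scoring rule are hypothetical.

def match_score(job_demand, node_capability):
    """Estimate task duration as demand divided by capability, per resource."""
    return sum(d / c for d, c in zip(job_demand, node_capability))

def pick_node(job_demand, free_nodes):
    """Choose the free node with the lowest estimated task duration."""
    return min(free_nodes, key=lambda node: match_score(job_demand, free_nodes[node]))

if __name__ == "__main__":
    demand = (8.0, 2.0)                        # (CPU work, disk I/O) units for one task
    nodes = {"fast-cpu": (4.0, 1.0), "fast-disk": (2.0, 4.0)}
    print(pick_node(demand, nodes))            # the CPU-heavy task goes to "fast-cpu"
```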
Vinay Deolalikar, HP-Autonomy Research The past decade has witnessed an astonishing growth in unstructured information in enterprises. The commercial value locked in enterprise unstructured information is being increasingly recognized. Accordingly, a range of textual document analytics (clustering, classification, taxonomy generation, provenance, etc.) has taken center stage as a potential means to manage this explosive growth in unstructured enterprise information and unlock its value.
Several analytics are time-intensive: the time taken to finish processing the increasingly large volumes of data is far longer than real time. However, users are increasingly demanding real-time services that rely on such time-intensive analytics. There is clearly a tension between these two developments.
In light of this tension, vendors increasingly realize that while an analytic may take a long time to converge, they need to extract useful information from it in real time. Furthermore, this information has to be application-driven. In other words, it is often not an option to simply "wait until the analytic has finished running": they must start providing the user with information while the analytic is still running. In summary, there is an emerging emphasis in Enterprise Information Management (EIM) on extracting application-driven, real-time information from time-intensive analytics.
A priori, it is not clear what could be extracted from an analytic that has yet to complete, or whether any such information would be useful. At present, there is little or no research literature on this problem: it is generally assumed that all of the information from an analytic will be available upon its completion.
We present an approach to this problem that is based on decomposing the objective function of the analytic, which is a global function that determines the progress of the analytic, into multiple local, user-centric functions. How can we construct meaningful local functions? How can such functions be measured? How do these functions evolve with time? Do these functions encode useful information that can be obtained in real time? These are the questions we address in this paper.
We demonstrate our approach using local functions on document clustering with the de facto standard algorithm, k-means; a small sketch of the idea follows this abstract. In this case, the multiple local user-centric functions transform k-means into a flow algorithm, with each local function measuring a flow. Our results show that these flows evolve very differently from the global objective function, and in particular, may often converge quickly at many local sites. Using this property, we are able to extract useful information considerably earlier than the time taken by k-means to converge to its final state.
We believe that such pragmatic approaches will have to be taken in order to manage systems performing analytics on large volumes of unstructured data.
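As an illustration of watching per-cluster ("local") objective values while the global k-means objective is still converging, the following minimal sketch tracks each cluster's within-cluster sum of squares and reports clusters whose local value has stabilized. The convergence test, tolerance, and data are assumed for illustration; this is not the flow formulation from the paper.

```python
# Minimal sketch: monitor per-cluster objective values during k-means and flag
# clusters that have stabilized before the global objective converges.
import numpy as np

def kmeans_with_local_monitoring(X, k, iters=50, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    prev_local = np.full(k, np.inf)
    for it in range(iters):
        # Assignment step.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Local objective: within-cluster sum of squared distances, per cluster.
        local = np.array([((X[labels == j] - centers[j]) ** 2).sum() for j in range(k)])
        settled = np.abs(prev_local - local) < tol * (1 + local)
        print(f"iter {it}: clusters already stable: {np.flatnonzero(settled).tolist()}")
        # Update step.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
        if settled.all():
            break
        prev_local = local
    return centers, labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(m, 0.2, (100, 2)) for m in ([0, 0], [3, 3], [0, 3])])
    kmeans_with_local_monitoring(X, k=3)
```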
Zhuoyao Zhang, University of Pennsylvania; Ludmila Cherkasova, Hewlett-Packard Labs; Boon Thau Loo, University of Pennsylvania An increasing number of MapReduce applications are written using high-level SQL-like abstractions on top of MapReduce engines. Such programs are translated into MapReduce workflows where the output of one job becomes the input of the next job in a workflow. A user must specify the number of reduce tasks for each MapReduce job in a workflow. The reduce task setting may have a significant impact on the execution concurrency, processing efficiency, and completion time of the workflow. In this work, we outline an automated performance evaluation framework, called AutoTune, for guiding user efforts in tuning the reduce task settings of MapReduce sequential workflows while achieving performance objectives. We evaluate the performance benefits of the proposed framework using a set of realistic MapReduce applications: TPC-H queries and custom programs mining a collection of enterprise web proxy logs.
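The tuning problem itself can be illustrated with a toy greedy search that picks, job by job, the candidate reduce-task count giving the shortest measured workflow completion time. Everything in the sketch (the run_workflow callback, the candidate values, the toy cost model) is a hypothetical stand-in, not the AutoTune framework.

```python
# Hypothetical illustration of the tuning problem: choose a reduce-task count
# for each job in a sequential MapReduce workflow by greedily trying candidate
# settings and keeping the one with the shortest measured completion time.

def tune_reduce_settings(num_jobs, candidates, run_workflow):
    """Greedy per-job search over candidate reduce-task counts."""
    settings = [candidates[0]] * num_jobs
    for job in range(num_jobs):
        best_time = None
        for r in candidates:
            trial = settings[:job] + [r] + settings[job + 1:]
            t = run_workflow(trial)            # measured completion time (seconds)
            if best_time is None or t < best_time:
                best_time, settings[job] = t, r
    return settings

if __name__ == "__main__":
    # Toy cost model standing in for real measurements of a 2-job workflow.
    def fake_run(settings):
        return sum(1000.0 / r + 3.0 * r for r in settings)
    print(tune_reduce_settings(2, [4, 8, 16, 32], fake_run))
```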
3:45 p.m.–4:00 p.m., Thursday
Break with Refreshments
Market Street Foyer
4:00 p.m.–5:45 p.m., Thursday
Session Chair: Levent Gürgen, CEA-Leti
Amine Dhraief, HANA Research Group, University of Manouba; Khalil Drira, LAAS-CNRS, University of Toulouse; Abdelfettah Belghith, HANA Research Group, University of Manouba; Tarek Bouali and Mohamed Amine Ghorbali, HANA Research Group, University of Manouba, and LAAS-CNRS, University of Toulouse The Machine-to-Machine (M2M) paradigm is a novel communication technology under standardization at both the ETSI and the 3GPP. It involves a set of sensors and actuators (M2M devices) communicating with M2M applications via M2M gateways, with no human intervention. For M2M communications, trust and privacy are key requirements. This drove us to propose a host identity protocol (HIP) based M2M overlay network, called HBMON, in order to ensure private communications between M2M devices, M2M gateways and M2M applications. In this paper, we first propose to add self-healing capabilities to the M2M gateways. We enable at the M2M gateway level the REAP protocol, a failure detection and locator-pair exploration protocol for IPv6 multihoming nodes. We also add mobility management capabilities to the M2M gateway in order to handle M2M device mobility. Furthermore, we add self-optimization capabilities to the M2M gateways: we modify the REAP protocol to continuously monitor the overlay paths in order to always select the best available one in terms of RTT. We implement our solution in the OMNeT++ network simulator. Results highlight the novel gateway capabilities: it recovers from failures, handles mobility, and always selects the best available path.
Shruti Devasenapathy, Vijay S. Rao, R. Venkatesha Prasad, and Ignas Niemegeers, Delft University of Technology; Abdur Rahim, CreateNet Devices in the future Internet of Things (IoT) will be scavenging energy from the ambiance for all their operations. They face challenges in various aspects of network organization and operation due to the nature of ambient energy sources such as solar insolation, vibration, and motion. In this paper we analyze the classical two-way algorithm for neighbor discovery (ND) in an energy-harvesting IoT. Through analysis, we outline the parameters that play an important role in ND performance, such as node density, duty cycle, beamwidth, and energy profile. We also provide simulation results to understand the impact of the energy storage element of energy-harvesting devices on the ND process. We demonstrate that there exist trade-offs in choices for antenna beamwidth and node duty cycle, given node density and energy arrival rate. We show that variations in energy availability impact ND performance. We also demonstrate that the right size of storage buffer can smooth the effects of energy variability.
Sylvain Frey, EDF R&D and Télécom ParisTech, CNRS LTCI; Ada Diaconescu, Télécom ParisTech, CNRS LTCI; David Menga, EDF R&D; Isabelle Demeure, Télécom ParisTech, CNRS LTCI Autonomic control is vital to the success of large-scale distributed and open IoT systems, which must simultaneously cater for the interests of several parties. However, developing and maintaining autonomic controllers is highly difficult and costly. To illustrate this problem, this paper considers a system that could be deployed in the future, integrating smart homes within a smart microgrid. The paper addresses this problem from a Software Engineering perspective, building on the authors' experience with devising autonomic systems and including recent work on integration design patterns. The contribution focuses on a generic architecture for multi-goal, adaptable and open autonomic systems, exemplified via the development of a concrete autonomic application for the smart micro-grid. Our long-term goal is to progressively identify and develop reusable artefacts, such as paradigms, models and frameworks for helping the development of autonomic applications, which are vital for reaching the full potential of IoT systems.
Arun kishore Ramakrishnan, Nayyab Zia Naqvi, Zubair Wadood Bhatti, Davy Preuveneers, and Yolande Berbers, KU Leuven The Internet of Things (IoT) is the next big wave in computing, characterized by a large-scale, open-ended, heterogeneous network of things with varying sensing, actuating, computing and communication capabilities. Compared to the traditional field of autonomic computing, the IoT is characterized by an open-ended and highly dynamic ecosystem with variable workload and resource availability. These characteristics make it difficult to implement the self-awareness capabilities the IoT needs to manage and optimize itself. In this work, we introduce a methodology to explore and learn the trade-offs of different deployment configurations in order to autonomously optimize the QoS and other quality attributes of IoT applications. Our experiments demonstrate that our proposed methodology can automate the efficient deployment of IoT applications in the presence of multiple optimization objectives and variable operational circumstances.
4:00 p.m.–5:30 p.m., Thursday
Empire Room
Moderators: Karsten Schwan, Georgia Institute of Technology; Vanish Talwar, HP Labs
Panelists: Lucy Cherkasova, HP Labs; Gregory Eitzmann, Google; Sameh Elnikety, Microsoft Research; Krishna Gade, Twitter; Nagapramod Mandagere, IBM Research; Sambavi Muthukrishnan, Facebook; Priya Narasimhan, Carnegie Mellon University; Dilma da Silva, Qualcomm
6:30 p.m.–8:00 p.m., Thursday
Session Chair: Rean Griffith, VMware
Daniela Loreti, University of Bologna
Lei Lu, The College of William and Mary
Feng Yan, The College of William and Mary
Christian Krupitzer, University of Mannheim, Germany
Murali Emani, University of Edinburgh