Adaptive Information Passing for Early State Pruning in MapReduce Data Processing Workflows

Authors:

Seokyong Hong, Padmashree Ravindra, and Kemafor Anyanwu, North Carolina State University

Abstract:

MapReduce data processing workflows often consist of multiple cycles, where each cycle hosts the execution of some data processing operators (e.g., join) defined in a program. A common situation is that many data items propagated along a workflow end up being "fruitless", i.e., they do not contribute to the final output. Given that the dominant costs associated with MapReduce processing (I/O, sorting, and network transfer) are very sensitive to the size of intermediate states, such fruitless data items contribute unnecessarily to workflow costs. Consequently, it may be possible to improve the performance of MapReduce data processing workflows by eliminating fruitless data items as early as possible. Achieving this requires maintaining extra information about the state (output) of each operator and then passing this information to descendant operators in the workflow. The descendant operators can use this state information to prune fruitless data items from their other inputs. However, this process is not without overhead, and in some cases its costs may outweigh its benefits. Consequently, a technique for adaptively selecting Information Passing as part of an execution plan is needed. This adaptivity needs to be determined by a cost model that accounts for MapReduce's partitioned execution model as well as its restricted model of communication between operators. These nuances of MapReduce impose limitations on the applicability of information passing techniques developed for traditional database systems.

In this paper, we propose an approach for implementing Adaptive Information Passing for MapReduce platforms. Our proposal includes a benefit estimation model and an approach for collecting the data statistics needed for benefit estimation, which piggybacks on operator execution. Our approach has been integrated into Apache Hive, and a comprehensive empirical evaluation is presented.
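The idea described in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation and does not use the Hive or Hadoop APIs. The record shape, the function names (`summarize_keys`, `prune`, `estimated_benefit`, `run`), and the linear benefit formula are all assumptions made for illustration; the paper's actual benefit estimation model is not reproduced here. An exact key set stands in for whatever state summary a real system would pass between operators.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Tuple[str, str]  # (join_key, payload): hypothetical record shape


def summarize_keys(rows: Iterable[Row]) -> Set[str]:
    """Collect the join keys an upstream operator actually produced."""
    return {key for key, _ in rows}


def prune(rows: Iterable[Row], live_keys: Set[str]) -> List[Row]:
    """Drop rows whose key cannot match the upstream output (fruitless rows)."""
    return [row for row in rows if row[0] in live_keys]


def estimated_benefit(n_rows: int, est_selectivity: float,
                      cost_per_row: float, passing_overhead: float) -> float:
    """Toy estimate (an assumption, not the paper's model): downstream cost
    saved by the rows removed, minus the cost of building and passing the
    summary. est_selectivity is the estimated fraction of rows that survive."""
    rows_removed = n_rows * (1.0 - est_selectivity)
    return rows_removed * cost_per_row - passing_overhead


def join(left: Iterable[Row], right: Iterable[Row]) -> List[Tuple[str, str, str]]:
    """Plain hash join on the key, used as the descendant operator."""
    index: Dict[str, List[str]] = {}
    for key, payload in left:
        index.setdefault(key, []).append(payload)
    return [(key, l, r) for key, r in right for l in index.get(key, [])]


def run(upstream_output: List[Row], other_input: List[Row],
        est_selectivity: float, cost_per_row: float = 1.0,
        passing_overhead: float = 2.0):
    """Apply information passing only when the toy estimate is positive."""
    rows = list(other_input)
    use_passing = estimated_benefit(
        len(rows), est_selectivity, cost_per_row, passing_overhead) > 0
    if use_passing:
        rows = prune(rows, summarize_keys(upstream_output))
    # Return the join result, the number of rows that reached the join,
    # and whether information passing was selected.
    return join(upstream_output, rows), len(rows), use_passing
```

Calling `run` with a low estimated selectivity selects pruning, so fewer rows reach the join; with a high estimated selectivity the overhead term dominates and the input is passed through unchanged. The join result is the same either way, which is the point: pruning only removes rows that could not have contributed to the output.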

BibTeX

@inproceedings {180157,
author = {Seokyong Hong and Padmashree Ravindra and Kemafor Anyanwu},
title = {Adaptive Information Passing for Early State Pruning in {MapReduce} Data Processing {Workflows}},
booktitle = {10th International Conference on Autonomic Computing (ICAC 13)},
year = {2013},
isbn = {978-1-931971-02-7},
address = {San Jose, CA},
pages = {133--143},
url = {https://www.usenix.org/conference/icac13/technical-sessions/presentation/hong},
publisher = {USENIX Association},
month = jun
}
