sponsors
usenix conference policies
You are here
Optimizing Data Shuffling in Data-Parallel Computation by Understanding User-Defined Functions
Jiaxing Zhang and Hucheng Zhou, Microsoft Research Asia; Rishan Chen, Microsoft Research Asia and Peking University; Xuepeng Fan, Microsoft Research Asia and Huazhong University of Science and Technology; Zhenyu Guo and Haoxiang Lin, Microsoft Research Asia; Jack Y. Li, Microsoft Research Asia and Georgia Institute of Technology; Wei Lin and Jingren Zhou, Microsoft Bing; Lidong Zhou, Microsoft Research Asia
Map/Reduce style data-parallel computation is characterized by the extensive use of user-defined functions for data processing and relies on data-shuffling stages to prepare data partitions for parallel computation. Instead of treating user-defined functions as “black boxes”, we propose to analyze those functions to turn them into “gray boxes” that expose opportunities to optimize data shuffling. We identify useful functional properties for user-defined functions, and propose SUDO, an optimization framework that reasons about data-partition properties, functional properties, and data shuffling. We have assessed this optimization opportunity on over 10,000 data-parallel programs used in production SCOPE clusters, and designed a framework that is incorporated it into the production system. Experiments with real SCOPE programs on real production data have shown that this optimization can save up to 47% in terms of disk and network I/O for shuffling, and up to 48% in terms of cross-pod network traffic.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Jiaxing Zhang and Hucheng Zhou and Rishan Chen and Xuepeng Fan and Zhenyu Guo and Haoxiang Lin and Jack Y. Li and Wei Lin and Jingren Zhou and Lidong Zhou},
title = {Optimizing Data Shuffling in {Data-Parallel} Computation by Understanding {User-Defined} Functions},
booktitle = {9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)},
year = {2012},
isbn = {978-931971-92-8},
address = {San Jose, CA},
pages = {295--308},
url = {https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zhang},
publisher = {USENIX Association},
month = apr
}
connect with us