Muthian Sivathanu, Midhul Vuppalapati, Bhargav S. Gulavani, Kaushik Rajan, and Jyoti Leeka, Microsoft Research India; Jayashree Mohan, Univ. of Texas Austin; Piyus Kedia, IIIT Delhi
We present the design, implementation, and evaluation of Instalytics, a co-designed stack of a cluster file system and the compute layer, for efficient big data analytics in large-scale data centers. Instalytics amplifies the well-known benefits of data partitioning in analytics systems; instead of traditional partitioning on one dimension, Instalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, enabling a larger fraction of queries to benefit from partition filtering and joins without network shuffle.
To achieve this, Instalytics uses compute-awareness to customize the 3-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables Instalytics to preserve the same recovery cost and availability as traditional replication. Instalytics also uses compute-awareness to expose a new sliced-read API that improves performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via co-ordinated request scheduling and selective caching at the storage nodes.
We have implemented Instalytics in a production analytics stack, and show that recovery performance and availability are similar to physical replication, while providing significant improvements in query performance, suggesting a new approach to designing cloud-scale big-data analytics systems.
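As a rough illustration of the partition-filtering idea described in the abstract (not the paper's implementation), the sketch below assumes each replica of a dataset is hash-partitioned on a different column. A query with an equality filter is routed to the replica partitioned on that column, so only one partition is scanned; a filter on any other column falls back to a full scan. All names here (`Replica`, `build_replicas`, `choose_replica`, `run_filter`, and the column names in the usage note) are hypothetical.

```python
import zlib
from dataclasses import dataclass, field


def bucket_of(value, num_partitions):
    # Deterministic hash so the example behaves the same on every run.
    return zlib.crc32(repr(value).encode("utf-8")) % num_partitions


@dataclass
class Replica:
    """One copy of the data, hash-partitioned on a single column (assumption)."""
    partition_column: str
    num_partitions: int
    partitions: dict = field(default_factory=dict)

    def load(self, rows):
        for row in rows:
            b = bucket_of(row[self.partition_column], self.num_partitions)
            self.partitions.setdefault(b, []).append(row)

    def rows_for(self, value):
        # Only the partition that can contain `value` is read.
        b = bucket_of(value, self.num_partitions)
        return self.partitions.get(b, [])

    def all_rows(self):
        return [r for part in self.partitions.values() for r in part]


def build_replicas(rows, columns, num_partitions):
    # Every replica stores all rows; only the partitioning column differs.
    replicas = []
    for column in columns:
        replica = Replica(column, num_partitions)
        replica.load(rows)
        replicas.append(replica)
    return replicas


def choose_replica(replicas, filter_column):
    # Pick the replica whose partitioning matches the filter, if any.
    for replica in replicas:
        if replica.partition_column == filter_column:
            return replica
    return None


def run_filter(replicas, filter_column, value):
    """Return (matching_rows, rows_scanned) for an equality filter."""
    replica = choose_replica(replicas, filter_column)
    if replica is not None:
        candidates = replica.rows_for(value)
    else:
        # No replica is partitioned on this column: scan everything.
        candidates = replicas[0].all_rows()
    matches = [r for r in candidates if r[filter_column] == value]
    return matches, len(candidates)
```

For example, with replicas built by `build_replicas(rows, ["user", "region"], 4)`, a call such as `run_filter(replicas, "region", "eu")` reads a single partition of the region-partitioned replica, while a filter on a column that no replica is partitioned on scans every row. The more dimensions the replicas cover, the larger the fraction of queries that can take the cheaper path.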
FAST '19 Open Access Sponsored by NetApp
@inproceedings{sivathanu2019instalytics,
author = {Muthian Sivathanu and Midhul Vuppalapati and Bhargav S. Gulavani and Kaushik Rajan and Jyoti Leeka and Jayashree Mohan and Piyus Kedia},
title = {{INSTalytics}: Cluster Filesystem Co-design for Big-data Analytics},
booktitle = {17th USENIX Conference on File and Storage Technologies (FAST 19)},
year = {2019},
isbn = {978-1-939133-09-0},
address = {Boston, MA},
pages = {235--248},
url = {https://www.usenix.org/conference/fast19/presentation/sivathanu},
publisher = {USENIX Association},
month = feb
}