Bin Yang, Shandong University, National Supercomputing Center in Wuxi; Xu Ji, Tsinghua University, National Supercomputing Center in Wuxi; Xiaosong Ma, Qatar Computing Research Institute, HBKU; Xiyang Wang, National Supercomputing Center in Wuxi; Tianyu Zhang and Xiupeng Zhu, Shandong University, National Supercomputing Center in Wuxi; Nosayba El-Sayed, Emory University; Haidong Lan and Yibo Yang, Shandong University; Jidong Zhai, Tsinghua University; Weiguo Liu, Shandong University, National Supercomputing Center in Wuxi; Wei Xue, Tsinghua University, National Supercomputing Center in Wuxi
This paper presents an effort to overcome the complexities of production-use I/O performance monitoring. We design Beacon, an end-to-end I/O resource monitoring and diagnosis system, for the 40,960-node Sunway TaihuLight supercomputer, currently ranked No. 3 in the world. It simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Higher-level per-application I/O performance behaviors are reconstructed from system-level monitoring data to reveal correlations between system performance bottlenecks, utilization symptoms, and application behaviors. Beacon further provides query, statistics, and visualization utilities to users and administrators, allowing comprehensive and in-depth analysis without requiring any code/script modification.
Drawing on around 18 months of deployment on TaihuLight, we demonstrate Beacon's effectiveness with a collection of real-world use cases for I/O performance issue identification and diagnosis. It has successfully helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, and others are currently being addressed. In addition, we demonstrate Beacon's generality by its recent extension to monitor interconnection networks, another contention point on supercomputers. Finally, both the code and the collected data are to be released.
NSDI '19 Open Access Sponsored by NetApp
author = {Bin Yang and Xu Ji and Xiaosong Ma and Xiyang Wang and Tianyu Zhang and Xiupeng Zhu and Nosayba El-Sayed and Haidong Lan and Yibo Yang and Jidong Zhai and Weiguo Liu and Wei Xue},
title = {End-to-end {I/O} Monitoring on a Leading Supercomputer},
booktitle = {16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19)},
year = {2019},
isbn = {978-1-931971-49-2},
address = {Boston, MA},
pages = {379--394},
url = {https://www.usenix.org/conference/nsdi19/presentation/yang},
publisher = {USENIX Association},
month = feb
}