Bin Yang, Tsinghua University and National Supercomputer Center in Wuxi; Hao Wei, Tsinghua University; Wenhao Zhu, Shandong University and National Supercomputer Center in Wuxi; Yuhao Zhang, Tsinghua University; Weiguo Liu, Shandong University; Wei Xue, Tsinghua University, Qinghai University and Intelligent Computing and Application Laboratory of Qinghai Province, and National Supercomputer Center in Wuxi
The system architecture of contemporary supercomputers is growing increasingly intricate with the ongoing evolution of system-wide network and storage technologies, making it challenging for application developers and system administrators to manage and utilize these systems effectively. Moreover, the limited experience of application developers and system administrators in conducting insightful analyses of diverse High-Performance Computing (HPC) workloads and the resulting array of resource utilization characteristics exacerbates the challenge. To address this issue, we undertake a comprehensive analysis of 40 TB of data collected over six years (comprising I/O performance data and job running information) from Sunway TaihuLight, which has 41508 nodes and is currently ranked as the world's 11th-fastest supercomputer. Our study provides valuable insights into operational management strategies for HPC systems (e.g., job hanging caused by heavy-load benchmark testing, job starvation caused by aggressive scheduling policies) and I/O workload characteristics (e.g., spikes in getattr operations caused by massive access to grid files, a large number of files accessed by many applications in a short period), shedding light on both challenges and opportunities for improvements in the HPC environment. This paper delineates our methodology, findings, and the significance of this study. Additionally, we discuss the potential of our research for future studies and practice within this domain.
author = {Bin Yang and Hao Wei and Wenhao Zhu and Yuhao Zhang and Weiguo Liu and Wei Xue},
title = {Full Lifecycle Data Analysis on a Large-scale and Leadership Supercomputer: What Can We Learn from It?},
booktitle = {2024 USENIX Annual Technical Conference (USENIX ATC 24)},
year = {2024},
isbn = {978-1-939133-41-0},
address = {Santa Clara, CA},
pages = {917--933},
url = {https://www.usenix.org/conference/atc24/presentation/yang},
publisher = {USENIX Association},
month = jul
}