Full Lifecycle Data Analysis on a Large-scale and Leadership Supercomputer: What Can We Learn from It?

Authors: 

Bin Yang, Tsinghua University and National Supercomputer Center in Wuxi; Hao Wei, Tsinghua University; Wenhao Zhu, Shandong University and National Supercomputer Center in Wuxi; Yuhao Zhang, Tsinghua University; Weiguo Liu, Shandong University; Wei Xue, Tsinghua University, Qinghai University and Intelligent Computing and Application Laboratory of Qinghai Province, and National Supercomputer Center in Wuxi

Abstract: 

The system architecture of contemporary supercomputers is growing increasingly intricate with the ongoing evolution of system-wide network and storage technologies, making it challenging for application developers and system administrators to manage and utilize the escalating complexity of supercomputers effectively. Moreover, the limited experience of application developers and system administrators in conducting insightful analyses of diverse High-Performance Computing (HPC) workloads and the resulting array of resource utilization characteristics exacerbate the challenge. To address this issue, we undertake a comprehensive analysis of six years' worth of 40 TB data (comprising I/O performance data and job running information) from Sunway TaihuLight, boasting 41508 nodes and currently ranked as the world's 11th-fastest supercomputer. Our study provides valuable insights into operational management strategies for HPC systems (i.e., job hanging caused by heavy-load benchmark testing, job starvation caused by aggressive scheduling policies) and I/O workload characteristics (i.e., getattr operations spiking caused by massive access to grid files, a large number of files accessed by many applications in a short period), shedding light on both challenges and opportunities for improvements in the HPC environment. This paper delineates our methodology, findings, and the significance of this study. Additionally, we discuss the potential of our research for future studies and practice within this domain.