Evolution of Aegis: Fault Diagnosis for AI Model Training Cloud Service in Production (Experience Track)

Authors:

Jianbo Dong, Kun Qian, Pengcheng Zhang, Zhilong Zheng, Liang Chen, Fei Feng, Yikai Zhu, Gang Lu, Zhihui Ren, Xue Li, Zhicheng Wang, Bin Luo, Shuai Peng, Yang Liu, Yichi Xu, Yanqing Chen, Yu Guan, Weicheng Wang, Hanyu Zhao, Xianlong Zeng, Zhiping Yao, Ennan Zhai, Binzhang Fu, and Dennis Cai, Alibaba Cloud