Qian Ding, Ant Group
The rapid advancement of AI has fundamentally transformed the technological landscape. As AI models grow in complexity and scale, the challenges of managing the underlying infrastructure have grown with them. This presentation explores the unique demands of AI infrastructure and how SREs can adapt to this evolving environment.
We'll delve into the specific challenges of managing GPU-accelerated clusters, including anomaly detection, node lifecycle management, and the distinctive requirements of AI workloads. By sharing real-world experiences and lessons learned, we aim to provide valuable insights into how SREs can effectively navigate this new frontier, ensuring the reliability, scalability, and performance of AI infrastructure.

Qian is a staff engineer at Ant Group, specializing in site reliability engineering. He leads the infrastructure SRE team, applying SRE principles to manage AI infrastructure. His expertise spans heterogeneous cluster management, xPU maintenance, and leveraging observability to enhance the team's capability in diagnosing model training and inference issues. With a wealth of experience in infrastructure management, Qian is currently exploring the evolving skill set required for SRE professionals in the era of large language models. His goal is to adapt and grow in this rapidly changing technological landscape, ensuring that SRE practices remain at the forefront of AI infrastructure management.

author = {Qian Ding},
title = {Transformers in {SRE} Land: Evolving to Manage {AI} Infrastructure},
year = {2025},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = mar
}