Building a 5-Exaflop Supercomputer for Meta-AI Research and Supporting Large-Scale Model Training with a Small Distributed Software and Production Engineering Team

Tuesday, 10 October, 2023 - 09:45–10:30
Kalyan Saladi and Chris Bray, Meta Platforms Inc.
Abstract: 

Learn how Meta's latest AI Research SuperCluster with 16,000 GPUs was architected, built and operated by a small geographically distributed team of Software and Production Engineers (SRE) working closely together as one team.

We share insights from operating one of the biggest AI supercomputers, where 5 exaflops of compute power, an InfiniBand interconnect, and a high-performance storage system come together to train leading-edge AI models at Meta. We also cover the monitoring and observability needs that emerged from supporting large-scale model training (including the recently released Llama series).

Kalyan Saladi, Meta Platforms

Kalyan is a software engineer at Meta working in the AI Research Infrastructure team, with experience in production ML Infrastructure (FBLearner) and in the reliability and performance of large-scale services. Before that, they worked at VMware on virtualization, scheduling, and distributed systems.

Chris Bray, Meta Platforms, Inc.

Chris has been a Production Engineer at Meta for over a decade, with experience ranging from user-facing products like Graph Search and Instagram, through acquisitions such as Oculus and WhatsApp, to Meta's various AI Infrastructure platforms for Training, Inference, and, most recently, Research.

BibTeX
@conference {292113,
author = {Kalyan Saladi and Chris Bray},
title = {Building a 5-Exaflop Supercomputer for {Meta-AI} Research and Supporting {Large-Scale} Model Training with a Small Distributed Software and Production Engineering Team},
year = {2023},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
