Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer

Authors: 

Yazhou Zu, Alireza Ghaffarkhah, Hoang-Vu Dang, Brian Towles, Steven Hand, Safeen Huda, Adekunle Bello, Alexander Kolbasov, Arash Rezaei, Dayou Du, Steve Lacy, Hang Wang, Aaron Wisner, Chris Lewis, and Henri Bahini, Google

Abstract: 

TPUv4 (Tensor Processing Unit) is Google’s 3rd generation accelerator for machine learning training, deployed as a 4096-node supercomputer with a custom 3D torus interconnect. In this paper, we describe our experience designing and operating the software infrastructure that allows TPUv4 supercomputers to operate at scale, including features for automatic fault resiliency and hardware recovery. We adopt a software-defined networking (SDN) approach to manage TPUv4’s high-bandwidth inter-chip interconnect (ICI) fabric, using optical circuit switching to dynamically configure routes to work around machine, chip and link failures. Our infrastructure detects failures and automatically triggers reconfiguration to minimize disruption to running workloads, as well as initiating remediation and repair workflows for the affected components. Similar techniques interface with maintenance and upgrade workflows for both hardware and software. Our dynamic reconfiguration approach allows our TPUv4 supercomputers to achieve 99.98% system availability, gracefully handling hardware outages experienced by ~1% of the training jobs.

NSDI '24 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {295551,
author = {Yazhou Zu and Alireza Ghaffarkhah and Hoang-Vu Dang and Brian Towles and Steven Hand and Safeen Huda and Adekunle Bello and Alexander Kolbasov and Arash Rezaei and Dayou Du and Steve Lacy and Hang Wang and Aaron Wisner and Chris Lewis and Henri Bahini},
title = {Resiliency at Scale: Managing {Google{\textquoteright}s} {TPUv4} Machine Learning Supercomputer},
booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)},
year = {2024},
isbn = {978-1-939133-39-7},
address = {Santa Clara, CA},
pages = {761--774},
url = {https://www.usenix.org/conference/nsdi24/presentation/zu},
publisher = {USENIX Association},
month = apr
}