Todd Underwood, Anthropic, and Brendan Burns, Microsoft
Format: Breakout Group Discussion
Running ML systems is a major new area for many SRE organizations. This session will dive into the differences between running reliable software services in general and ML systems: infrastructure considerations, monitoring, rollouts, performance and cost management, and more.

Todd Underwood leads reliability at Anthropic, a company working to create AI systems that are safe, reliable, and beneficial to society.
Prior to that, he led reliability for the Research Platform at OpenAI. Before that, he was a Senior Engineering Director at Google, leading ML capacity engineering at Alphabet. Before that, he founded and led ML Site Reliability Engineering, a set of teams that build and scale internal and external AI/ML services. He was also the Site Lead for Google’s Pittsburgh office. Along with several colleagues, he published Reliable Machine Learning: Applying SRE Principles to ML in Production (O’Reilly Press, 2022).

Brendan Burns is Corporate Vice President for Azure Cloud Native Open Source and Management Platform. He is also a co-founder of the Kubernetes open source project. Before joining Microsoft Azure, he spent eight years at Google, where he worked on search infrastructure and the Google Cloud Platform. Prior to Google, he was a Professor of Computer Science at Union College in Schenectady, NY. He has a PhD in Computer Science from the University of Massachusetts Amherst and a BA in Computer Science and Studio Art from Williams College in Williamstown, MA.

author = {Todd Underwood and Brendan Burns},
title = {Running {ML} in Production},
year = {2025},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = mar
}