Brendan Burns, Microsoft
More and more online services and systems depend on artificial intelligence and large language models to implement core user experiences. Consequently, the safe and reliable rollout of new models and new prompts is critical to maintaining the reliability and performance of the overall system. However, unlike traditional systems, releases rarely produce a clean "working" or "broken" signal. Instead, the performance of a new model or prompt is assessed by probabilistic evaluation across many different user inputs. Any change to a model or prompt may make some responses better and others worse, so we need to measure in aggregate across many experiences to determine whether there is a regression that needs to be fixed or rolled back. This talk will be a hands-on introduction to the approaches that we took during the development of Azure Copilot. It will describe the problem of reliability in the world of AI models and present real-world applications that are in use in production today.
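As a rough illustration of what "measuring in aggregate" can look like, the sketch below compares per-input quality scores for a baseline and a candidate model on the same inputs, then uses a paired bootstrap to decide whether the mean change is a likely regression. This is not the method used for Azure Copilot, which the abstract does not specify. The function names, the 0-to-1 scores, the `tolerance` parameter, and the three-way decision are all assumptions made for the example.

```python
import random
from statistics import mean


def bootstrap_mean_diff_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean_diff, low, high) for candidate - baseline over paired scores.

    Hypothetical helper: each list holds one quality score per evaluation input,
    in the same order, so scores can be compared input by input.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty, equal-length score lists")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    n = len(diffs)
    # Resample the per-input differences with replacement and record each mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), low, high


def rollout_decision(baseline, candidate, tolerance=0.0):
    """Map the confidence interval to an assumed three-way rollout decision."""
    _, low, high = bootstrap_mean_diff_ci(baseline, candidate)
    if high < -tolerance:
        return "rollback"      # whole interval is below the allowed drop
    if low >= -tolerance:
        return "promote"       # whole interval is within the allowed drop
    return "inconclusive"      # interval straddles the threshold: gather more data


if __name__ == "__main__":
    old_scores = [0.8] * 50
    new_scores = [0.5] * 50
    print(rollout_decision(old_scores, new_scores))
```

A single worse response does not trigger a rollback here. Only a consistent shift across the whole evaluation set does, which reflects the point that individual responses may move in either direction.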

Brendan Burns is Corporate Vice President for Azure Cloud Native Open Source and Management Platform. He is also a co-founder of the Kubernetes open source project. Before joining Microsoft Azure, he spent eight years at Google, where he worked on search infrastructure and the Google Cloud Platform. Prior to Google, he was a Professor of Computer Science at Union College in Schenectady, NY. He has a PhD in Computer Science from the University of Massachusetts Amherst and a BA in Computer Science and Studio Art from Williams College in Williamstown, MA.

author = {Brendan Burns},
title = {Safe Evaluation and Rollout of {AI} Models},
year = {2025},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = mar
}