Venue
DoubleTree by Hilton Dublin - Burlington Road
Leeson Street Upper
Dublin 4, Ireland
Production Improvement Review: Taking a Bite Out of Repair Debt
Martin Check, Microsoft
Azure SRE works with services of widely varying maturity, ranging from fully federated DevOps teams to fully tiered IT/Ops teams, and everything in between. The one thing all of these services have in common is that they have outages. While each team responds and recovers in its own way, SRE has to collect and leverage data in a common manner across all services to prevent outages and drive reliability up consistently. In this talk we’ll discuss how SRE leverages diverse data sets to drive improvements across this heterogeneous set of services. SRE ensures that teams are rigorously completing post-incident reviews and addressing their live site debt. We look not only at the actual repair debt; we’ve also introduced a new concept called “virtual debt”, which shows where a service’s incident response faltered but no appropriate repair was logged. Virtual debt is affectionately referred to as “PacMan debt” due to the appearance of the chart. The greater the virtual debt, the bigger the bite.
We’ll also discuss how we expose the data in near-real-time dashboards that allow team members, from the director all the way down to the IC, to see relevant views and take the appropriate action. ICs can find incomplete postmortems they need to work on, a service director can view accumulated debt to prioritize resources, and a dev manager can review virtual debt to ensure the team is conducting rigorous postmortems. By analyzing historical outages, we’ve found that missed detection leads to an exponential increase in mitigation times. We’ve collected many other insights by mining historical outage data and surfacing them through charts and creative visualizations, including surprising proxy metrics that influence uptime.
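As a rough illustration of the “virtual debt” idea described above, the sketch below counts response stages that faltered during an incident but have no repair item logged against them. The abstract does not specify a data model, so everything here is an assumption for illustration: the `Incident` record, the stage names, and the per-service tally are hypothetical, not the Azure SRE implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    # Hypothetical record shape; the talk does not describe the real schema.
    incident_id: str
    service: str
    # Response stages where the incident response faltered.
    faltered_stages: set = field(default_factory=set)
    # Response stages that have a repair item logged in the postmortem.
    repaired_stages: set = field(default_factory=set)


def virtual_debt(incident):
    """Return the stages that faltered but have no logged repair item."""
    return incident.faltered_stages - incident.repaired_stages


def virtual_debt_by_service(incidents):
    """Sum the number of unaddressed faltered stages per service."""
    totals = {}
    for inc in incidents:
        totals[inc.service] = totals.get(inc.service, 0) + len(virtual_debt(inc))
    return totals


# Made-up sample data, purely for demonstration.
incidents = [
    Incident("INC-1", "storage", {"detection", "mitigation"}, {"mitigation"}),
    Incident("INC-2", "storage", {"escalation"}, set()),
    Incident("INC-3", "compute", {"detection"}, {"detection"}),
]

print(virtual_debt_by_service(incidents))  # {'storage': 2, 'compute': 0}
```

Under these assumptions, a service with a high count is one whose postmortems are missing repairs for known response failures, which is the kind of signal a dev manager could review on a dashboard.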
Martin Check is a Site Reliability Engineer on the Microsoft Azure team. He has worked on large-scale services at Microsoft for 12 years in a variety of roles, ranging from service design and implementation, to crisis response, to leading teams through DevOps/SRE transitions. He currently works on Problem Management efforts for Azure, identifying and resolving problems that stand in the way of service uptime through data analysis, surfacing insights, and engineering solutions.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Martin Check},
title = {Production Improvement Review: Taking a Bite Out of Repair Debt},
year = {2016},
address = {Dublin},
publisher = {USENIX Association},
month = jul
}