Persistent SRE Antipatterns: Pitfalls on the Road to Creating a Successful SRE Program Like Netflix and Google

Wednesday, November 01, 2017 - 4:45 pm–5:30 pm

Blake Bisset; Jonah Horowitz, Stripe

Abstract: 

People aren't just wrong on the internet. Sometimes they bring it back to the office. We're here to debunk the biggest traps we've stepped in, spent good drink money learning about from other people who'd stepped in them, or seen someone who hadn't stepped in them yet propose as good practice. Save yourself some pain. Or just laugh at ours. The talk addresses specific anti-patterns we've seen in building teams and systems to manage service delivery for very large scale operations, and more appropriate ways to approach those issues.

Blake Bisset

Blake Bisset got his first legal tech job at 16. He won’t say how long ago, except that he’s legitimately entitled to make shakeyfists while shouting “Get off my LAN!” He’s done 3 start-ups (a joint venture of Dupont/ConAgra, a biotech spinoff from the U.W., and this other time a bunch of kids were sitting around New Year’s Eve, wondering why they couldn’t watch movies on the Internet), only to end up spending a half-decade as an SRM at YouTube and Chrome, where his happiest accomplishment was holding the go/bestpostmortem link for several years.

Jonah Horowitz, Stripe

Jonah Horowitz is a Site Reliability Engineer with Stripe. He works with all of the individual engineering teams at Stripe to drive reliability efforts, including monitoring, alerting, deployment pipelines, and chaos resiliency. Before coming to Stripe he worked at several companies around the Bay Area, including Netflix; Quantcast, a leading ad-tech startup where he grew their network to process over 3 million events per second; Looksmart, a contextual advertising company; and Wal-Mart.com (now Walmart Labs), where he was on the founding team and built out their software deployment pipelines and their product image management systems.

BibTeX
@conference {207153,
author = {Blake Bisset and Jonah Horowitz},
title = {Persistent {SRE} Antipatterns: Pitfalls on the Road to Creating a Successful {SRE} Program Like Netflix and Google},
year = {2017},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = oct
}

Take back to work: 
What isn’t Site Reliability Engineering? Does your NOC escalate outages to your DevOops Engineer, who in turn calls your Packaging and Deployment Team? Did your Chef just sprinkle some Salt on your Ansible Red Hat and call it SRE? Lots of companies claim to have SRE teams, but some don’t quite understand the full value proposition, or which shiny technologies and organizational structures will hurt their operations rather than empower their teams to accomplish their mission. You’ll hear stories about anti-patterns in Monitoring, Incident Response, Configuration Management, and more that we’ve tripped over in our own teams, seen actually proposed as good practice in talks at other conferences, and heard about while talking to peers scattered around the industry. We'll also discuss how Google and Netflix each view the role of the SRE, and how it differs from the most recent incarnations of the Systems Administrator role. The talk also explains why freedom and responsibility are key, why trust is required, and when chaos is your friend. The audience will leave with specific principles and directives for avoiding traps in these problem areas that can limit their overall growth or force them into costly retooling and retraining later on.