All sessions will be held at the Hyatt Regency San Francisco.
Note: There are 5-minute gaps between sessions.
Monday, March 13, 2017
7:30 am–8:30 am
Continental Breakfast
Grand and Market Street Foyers
8:30 am–9:25 am
Plenary Session
Grand Ballroom
So You Want to Be a Wizard
Julia Evans, Stripe
I don't always feel like a wizard. Like many of you, I've been doing operations for a couple of years, and I still have a TON TO LEARN about how to do this "SRE" job.
But along the way, I have learned a few ways to debug tricky problems, get the information I need from my colleagues, and get my job done. We're going to talk about
- how asking dumb questions is actually a superpower
- how you can read the source code to the Linux kernel when all else fails
- debugging tools that make you FEEL like a wizard
- how understanding what your _organization_ needs can make you amazing
At the end, we'll have a better understanding of how you can get a lot of awesome stuff done even when you're not the highest level wizard on your team.
Julia Evans, Stripe
Julia Evans is a developer who works on infrastructure at Stripe. She likes making programs go fast and learning how to debug weird problems. She thinks you can be a wizard programmer.
9:25 am–9:55 am
Break with Refreshments
Grand and Market Street Foyers
9:55 am–10:50 am
Grand Ballroom C
TrafficShift: Avoiding Disasters at Scale
Michael Kehoe and Anil Mallapur, LinkedIn
LinkedIn has evolved from serving live traffic out of one data center to four data centers spread geographically. Serving live traffic from four data centers at the same time has taken us from a disaster recovery model to a disaster avoidance model, where we can take an unhealthy data center out of rotation and redistribute its traffic to the healthy data centers within minutes, with virtually no visible impact to users.
As we transitioned from a big monolithic application to micro-services, we felt the pain of determining the capacity constraints of individual services to handle extra load during disaster scenarios. Stress testing individual services using artificial load in a complex micro-services architecture wasn't sufficient to provide enough confidence in a data center's capacity. To solve this problem, we at LinkedIn leverage live traffic to stress services site-wide by shifting traffic to simulate a disaster load.
This talk provides details on how LinkedIn uses traffic shifts to mitigate user impact by migrating live traffic between its data centers and to stress test site-wide services for improved capacity handling and member experience.
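To illustrate the core idea, here is a minimal Python sketch (not LinkedIn's actual tooling) of draining an unhealthy data center and redistributing its share of traffic in proportion to the remaining headroom of the healthy sites; the weights, capacities, and data-center names are invented for the example.

```python
# Hypothetical sketch of the traffic-shift idea: drain an unhealthy data
# center and give its share of traffic to the healthy ones in proportion
# to their remaining capacity. Not LinkedIn's actual tooling.

def shift_traffic(weights, capacities, unhealthy):
    """weights: current fraction of traffic per DC, summing to 1.0
    capacities: maximum fraction each DC can absorb
    unhealthy: name of the DC to drain"""
    weights = dict(weights)                # don't mutate the caller's copy
    drained = weights.pop(unhealthy)
    headroom = {dc: capacities[dc] - w for dc, w in weights.items()}
    total_headroom = sum(headroom.values())
    if drained > total_headroom:
        raise RuntimeError("healthy data centers lack capacity for failover")
    # Give each healthy DC extra load in proportion to its headroom.
    return {dc: w + drained * headroom[dc] / total_headroom
            for dc, w in weights.items()}

print(shift_traffic(
    weights={"dc1": 0.25, "dc2": 0.25, "dc3": 0.25, "dc4": 0.25},
    capacities={"dc1": 0.40, "dc2": 0.40, "dc3": 0.40, "dc4": 0.40},
    unhealthy="dc4",
))
# {'dc1': 0.333..., 'dc2': 0.333..., 'dc3': 0.333...}
```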
Michael Kehoe, LinkedIn
Michael Kehoe, Staff Site Reliability Engineer in the Production-SRE team, joined the LinkedIn operations team as a new college graduate in January 2014. Prior to that, Michael studied Engineering at the University of Queensland (Australia) where he majored in Electrical Engineering. During his time studying, he interned at NASA Ames Research Center working on the PhoneSat project.
Grand Ballroom B
Ten Persistent SRE Antipatterns: Pitfalls on the Road to a Successful SRE Program Like Netflix and Google
Jonah Horowitz, Netflix, and Blake Bisset
What isn’t Site Reliability Engineering? Does your NOC escalate outages to your DevOops Engineer, who in turn calls your Packaging and Deployment Team? Did your Chef just sprinkle some Salt on your Ansible Red Hat and call it SRE? Lots of companies claim to have SRE teams, but some don’t quite understand the full value proposition, or what shiny technologies and organizational structures will negatively impact your operations, rather than empowering your team to accomplish your mission.
You’ll hear stories about anti-patterns in Monitoring, Incident Response, Configuration Management, Automation, managing your relationship with service developers, and more that we’ve tripped over in our own teams, seen actually proposed as good practice in talks at other conferences, and heard as we speak to peers scattered around the industry. We'll also discuss how Google and Netflix each view the role of the SRE, and how it differs from the traditional Systems Administrator role. The talk also explains why freedom and responsibility are key, trust is required, and when chaos is your friend.
Jonah Horowitz, Netflix
Jonah Horowitz is a Senior Site Reliability Architect with over 20 years of experience keeping servers and sites online. He started with a 2-line, 9600 baud BBS and has worked at both large and small tech companies including Netflix, Walmart.com, Looksmart, and Quantcast.
Blake Bisset
Blake Bisset got his first legal tech job at 16. He won’t say how long ago, except that he’s legitimately entitled to make shakeyfists while shouting “Get off my LAN!” He’s done 3 start-ups (a joint venture of Dupont/ConAgra, a biotech spinoff from the U.W., and this other time a bunch of kids were sitting around New Year’s Eve, wondering why they couldn’t watch movies on the Internet), only to end up spending a half-decade as an SRM at YouTube and Chrome, where his happiest accomplishment was holding the go/bestpostmortem link for two years.
Grand Ballroom A
Keep Calm and Carry On: Scaling Your Org with Microservices
Charity Majors, Honeycomb, and Bridget Kromhout, Pivotal
Ask people about their experience rolling out microservices, and one theme dominates: engineering is the easy part, people are super hard! Everybody knows about Conway's Law, everybody knows they need to make changes to their organization to support a different product model, but what are those changes? How do you know if you're succeeding or failing, if people are struggling and miserable or just experiencing the discomfort of learning new skills? We'll talk through real stories of pain and grief as people modernize their team and their stack.
Charity Majors, Honeycomb
CEO/cofounder of honeycomb.io. Previously ran operations at Parse/Facebook, managing a massive fleet of MongoDB replica sets as well as Redis, Cassandra, and MySQL. Worked closely with the RocksDB team at Facebook to develop and roll out the world's first Mongo+Rocks deployment using the pluggable storage engine API. Likes single malt scotch; kind of a snob about it.
Bridget Kromhout, Pivotal
Bridget Kromhout is a Principal Technologist for Cloud Foundry at Pivotal. Her CS degree emphasis was in theory, but she now deals with the concrete (if 'cloud' can be considered tangible). After 15 years as an operations engineer, she traded being on call for being on a plane. A frequent speaker and program committee member for tech conferences the world over, she leads the devopsdays organization globally and the devops community at home in Minneapolis. She podcasts with Arrested DevOps, blogs at bridgetkromhout.com, and is active in a Twitterverse near you.
10:55 am–11:50 am
Grand Ballroom C
Automated Debugging of Bad Deployments
Joe Gordon, Pinterest
Debugging a bad deployment can be tedious, from identifying new stack traces to figuring out who introduced them. At Pinterest we have automated most of these processes, using ElasticSearch to identify new stack traces and git-stacktrace to figure out who caused them. Git-stacktrace parses the stack trace and looks for related git changes. This has reduced the time needed to figure out who broke the build from minutes to just a few seconds.
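As a rough illustration of the approach (this is a sketch of the idea, not Pinterest's git-stacktrace code), one can pull file paths out of a stack trace and ask git which recent commits touched those files:

```python
# Simplified sketch of the idea behind git-stacktrace: extract file paths
# from a stack trace, then list recent commits that touched those files
# to suggest likely culprits.
import re
import subprocess

def suspect_commits(stack_trace, since="1 day ago"):
    # Python-style frames look like:  File "app/handlers/feed.py", line 42
    paths = set(re.findall(r'File "([^"]+)", line \d+', stack_trace))
    suspects = set()
    for path in paths:
        out = subprocess.run(
            ["git", "log", f"--since={since}", "--format=%h %an %s", "--", path],
            capture_output=True, text=True, check=False,
        ).stdout
        suspects.update(line for line in out.splitlines() if line)
    return sorted(suspects)

if __name__ == "__main__":
    trace = '''Traceback (most recent call last):
  File "app/handlers/feed.py", line 42, in render
  File "app/lib/cache.py", line 17, in get
KeyError: 'feed:v2'
'''
    for commit in suspect_commits(trace):
        print(commit)
```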
Joe Gordon, Pinterest
Joe is an SRE at Pinterest, where he works on search and performance. He has previously spoken at numerous conferences such as EuroPython, LinuxCon, LCA (Linux Conference Australia).
Deployment Automation: Releasing Quickly and Reliably
Sebastian Yates, Uber SRE
We’ve always encouraged engineers to push code quickly to production. Developers at Uber have had full control over how they deploy (which instances, datacenters, etc. they deploy to) and a full production upgrade could complete in minutes. This helped our business achieve incredible growth but has impacted reliability. As a business we need to remain fast but take smarter risks and reduce the potential impact of changes.
We set out building automated deployment workflows that all services could use to make deploying code safer. As we on-boarded services, the biggest battle wasn’t the technical challenges but convincing teams that for our most critical services we needed to trade some deployment velocity for availability. Uber emphasizes moving fast, so slowing down deployments was controversial.
Collecting TTD and TTM statistics for our most significant outages allowed us to make principled decisions about ideal deploy lengths and enabled principled discussions. We added features to the workflows to improve the deployment experience: automated canaries, deployment batch metrics, automated rollbacks, continuous deployment, and load testing.
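A hedged sketch of what an automated canary gate of this kind can look like; deploy, rollback, and error_rate are hypothetical hooks into your own deployment and metrics systems, not Uber's.

```python
# Sketch of an automated canary gate: deploy to a small canary batch,
# compare its error rate against the stable baseline, and roll back
# automatically if it regresses. The deploy/rollback/error_rate callables
# are stand-ins for your own deploy and metrics systems.
import time

def canary_gate(service, new_version, deploy, rollback, error_rate,
                soak_seconds=600, max_regression=0.005):
    deploy(service, new_version, batch="canary")
    time.sleep(soak_seconds)                      # let metrics accumulate
    canary = error_rate(service, batch="canary")
    baseline = error_rate(service, batch="stable")
    if canary - baseline > max_regression:
        rollback(service, batch="canary")
        raise RuntimeError(
            f"canary error rate {canary:.3%} exceeds baseline {baseline:.3%}")
    deploy(service, new_version, batch="all")     # proceed with the rollout
```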
Changing the way we deploy has improved reliability at Uber. It’s required a culture change, and we hope the lessons we learned about ourselves and our systems will benefit your work.
Sebastian Yates, Uber SRE
An SRE at Uber for 2 years, responsible for the end-to-end availability of Uber's marketplace systems. Loves all things distributed and should never be allowed to name systems of any kind.
Grand Ballroom B
SRE and Presidential Campaigns
Nat Welch, Hillary for America
Hillary for America was the organization created for Hillary Clinton's 2016 presidential campaign. By election day, the campaign employed 82 software developers. A team of four SREs and two Security Engineers helped protect and scale close to a hundred backend services against a constant stream of DDoS attacks, looming deadlines, and constantly shifting priorities.
This team helped build an environment that promoted building new projects in a week to meet immovable deadlines, raising billions of dollars from donors, and withstanding spiky, semi-unpredictable traffic. We will walk through how they did this, the unusual schedule of a presidential campaign, and how that affects a growing tech team.
Nat Welch, Hillary for America
Nat Welch has been writing software professionally for over a decade. He was an early SRE on Google Compute Engine and at Hillary for America, along with doing full stack engineering for a variety of startups. He currently lives in Brooklyn, NY.
The Service Score Card—Gamifying Operational Excellence
Daniel Lawrence, LinkedIn
What makes a “good” service is a moving target. Technologies and requirements change over time. It can be impossible to ensure that none of your services have been left behind.
The Service ScoreCard approach is to have a small check for each service initiative we have. A check can be anything measurable: deployment frequency, whether everyone on the on-call team has a phone, or whether the service runs the latest version of the JVM.
The Service ScoreCard gives each service a grade from 'F' to 'A+' based on which checks it passes or fails. As soon as anyone sees a service's grade slipping, everyone rallies to improve it.
We can then set up rules based on the grades: "only services graded B or above can deploy 24/7," "moratorium on services without an A+," or "no SRE support for services below a C grade."
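A toy illustration of the grading mechanism (not LinkedIn's implementation): each check is a predicate over a service, and the fraction of checks passed maps onto a letter grade that rules like the ones above can key off.

```python
# Toy scorecard grading: grade a service by the fraction of checks passed.
GRADES = [(0.97, "A+"), (0.90, "A"), (0.80, "B"), (0.70, "C"), (0.60, "D")]

def grade(service, checks):
    passed = sum(1 for check in checks if check(service))
    ratio = passed / len(checks)
    return next((letter for cutoff, letter in GRADES if ratio >= cutoff), "F")

checks = [
    lambda s: s["deploys_per_week"] >= 1,     # deployment frequency
    lambda s: s["oncall_has_phone"],          # on-call team all have phones
    lambda s: s["jvm_version"] >= 8,          # latest JVM
]
service = {"deploys_per_week": 3, "oncall_has_phone": True, "jvm_version": 7}
print(grade(service, checks))  # "D" -- below B, so no 24/7 deploys under the rules above
```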
Daniel Lawrence, LinkedIn
Daniel will fix anything with Python, even if it's not broken. He is an Aussie on loan to LinkedIn as an SRE, looking after the jobs and recruiting services. When he is not working on tricky problems for LinkedIn, he plays a _lot_ of video games and is currently exploring this side of the planet.
Grand Ballroom A
How Do Your Packets Flow?
Leslie Carr, Clover Health
As more of us move to the "cloud" we lose sight of how packets flow from our application to the end user. This is an introduction to how your network traffic flows.
I come from a network engineering background. Talking with many SRE/Devops folks, I've realized that many don't actually understand how network traffic flows. This will be a 20 minute introduction to network traffic - concepts like peering and transit will be introduced, as well as DWDM and awesome cable maps. This will also introduce some of the monetary concepts in traffic - so that people have a better understanding of large fights between providers - like the Netflix/Comcast fight of a few years back.
Leslie Carr, Clover Health
Leslie Carr is a Devops Engineer at Clover Health and a Board Member of SFMIX.
In her past life, Leslie most recently worked at Cumulus Networks in devops, helping to push automation in the network world. Prior to that, she was on the production side of the world at many large websites, such as Google, Craigslist, and Wikimedia.
Leslie is a lover and user of open source and automation. She dreams of robots taking over all of our jobs one day.
Spotify's Love-Hate Relationship with DNS
Lynn Root, Spotify
Spotify has a history of loving "boring" technologies, with DNS being one of them. DNS deployments used to be manual and hand-edited in a subversion repo. To make sure there were no surprises, you had to yell "DNS DEPLOY" in the #sre channel on IRC before pushing the button. Now, with proper automation and far fewer hands editing records, we've seen just how far we can push DNS. With DNS, we got a stable query interface, free caching, and service discovery. We also learned just how often it _is_ the root of a problem. This talk will walk through Spotify's "coming of age" story: how we pushed DNS to its limits, and all the weird intricacies we discovered along the way.
Lynn Root, Spotify
Based in NYC, Lynn Root is an insomniac Site Reliability Engineer & the FOSS evangelist for Spotify. She is also a global leader of PyLadies, an international mentorship group for women and friends in the Python community, and the founder & former leader of the San Francisco PyLadies. When her hands are not on a keyboard, they are usually holding a pair of knitting needles.
11:55 am–12:50 pm
Grand Ballroom C
A Million Containers Isn't Cool
Chris Sinjakli, SRE at GoCardless
You know what's cool? A hundred containers.
A lot of us ship software multiple times a day—but what goes into that, and how do we make it happen reliably?
In this talk, we'll look at the deployment of a typical web app/API. We'll focus on build artifacts - the things we actually ship to production - and why it's helpful to make their build and deployment processes consistent.
From there, we'll move on to containers—Docker in particular—with a focus on container images and how they can get us to that goal.
We'll deliberately sidestep the world of distributed schedulers—Mesos, Kubernetes, and friends. They're great tools when you need to manage a growing fleet of computers, but running them doesn't come without an operational cost.
By following the example of a production system that's built this way—containerised apps without a distributed scheduler—we'll explore what it takes to move apps into containers, and how doing so might shape your infrastructure.
To wrap up, we'll look at some alternatives that could serve you well if Docker isn't the right fit for your organisation.
Chris Sinjakli, SRE at GoCardless
Chris enjoys all the weird bits of computing that fall between building software users love and running distributed systems reliably.
All his programs are made from organic, hand-picked, artisanal keypresses.
Grand Ballroom B
Java Hates Linux. Deal with It.
Greg Banks, LinkedIn
At LinkedIn we run lots of Java services on Linux boxes. Java and Linux are a perfect pair. Except when they're not; then there's fireworks. This talk describes 5 situations we encountered where Java interacted with normal Linux behavior to create stunningly sub-optimal application behavior like minutes-long GC pauses. We'll deep dive to show What Java Got Wrong, why Linux behaves the way it does, and how the two can conspire to ruin your day. Finally we'll examine actual code samples showing how we fixed or hid the problems.
Greg Banks, LinkedIn
Greg spent twenty-three years as a professional C/C++ developer working on projects as diverse as airspace simulation and the Linux kernel. One highlight of his career was shutting down his then-employer's entire company for 72 hours with a single misplaced comma...which triggered a chain of five pre-existing cascading error conditions. As penance for sins such as these, he is now on the receiving end as a Data SRE for LinkedIn. He lives in San Francisco and spends his spare time being surprised to have any spare time.
Grand Ballroom A
From Combat to Code: How Programming Is Helping Veterans Find Purpose
Jerome Hardaway, Vets Who Code
A talk about how veterans are using software engineering to move on to life after the military, common misconceptions about veterans, and how to integrate them into your teams.
Jerome Hardaway, Vets Who Code
Jerome Hardaway is a Memphis native currently residing in Nashville. He is the Executive Director of Vets Who Code, a 501(c)(3) that trains early stage transitioning veterans in Web Development and helps them find gainful employment in the software industry.
His work has been featured in the Huffington Post, and he has been invited to the White House, Dreamforce, and Facebook for his work with veterans.
12:50 pm–1:50 pm
Lunch
Atrium
1:50 pm–2:45 pm
Grand Ballroom C
A Practical Guide to Monitoring and Alerting with Time Series at Scale
Jamie Wilkinson, Google
Monitoring is the foundational bedrock of site reliability yet is the bane of most sysadmins’ lives. Why? Monitoring sucks when the cost of maintenance scales proportionally with the size of the system being monitored. Recently, tools like Riemann and Prometheus have emerged to address this problem by scaling out monitoring configurations sublinearly with the size of the system.
In a talk complementing the Google SRE book chapter “Practical Alerting from Time Series Data,” Jamie Wilkinson explores the theory of alert design and time series-based alerting methods and offers practical examples in Prometheus that you can deploy in your environment today to reduce the amount of alert spam and help operators keep a healthy level of production hygiene.
Jamie Wilkinson, Google
Jamie Wilkinson works as a site reliability engineer in Google’s storage infrastructure group, on a Globally Replicated Eventually Consistent High Availability Low Latency Key Value Buzzword Store, but focusses primarily on automation, monitoring and devops.
Grand Ballroom B
From Engineering Operations to Site Reliability Engineering
Nathan Mische, Comcast
Over the past three years my team has transformed from an operations team composed of System Administrators who ran services developed by other teams or vendors into a DevOps team composed of SREs who work closely with other engineering teams to create exceptional entertainment and online experiences. This presentation will share some of the strategies that helped make this transition a success, including having a strong vision for the team, fostering management and HR support, strategic performance management, a refined interview process, internal and external networking, playing "Moneyball", and using internal programs to grow the team.
Nathan Mische, Comcast
Nathan Mische leads an SRE team at Comcast where he helps software engineering teams transition from bare metal and VMware data centers to various PaaS and IaaS cloud offerings. The team operates several shared services to enable the cloud transition, from service discovery to a time series monitoring and alerting platform. The team also develops tools and processes that allow teams to take over routine operational duties for their services.
Grand Ballroom A
I’m an SRE Lead! Now What? How to Bootstrap and Organize Your SRE Team
Ritchie Schacher, SRE Architect, IBM Bluemix DevOps Services, and Rob Orr, IBM
So, your organization has decided to build an SRE team, and you are one of the leaders. You have multiple stakeholders with different expectations and agendas. Your executive team expects early and immediate results. They want checklists, goals, and roadmaps, and they want it now. Peer dev managers may believe that you now have a magic wand that will make their headaches disappear, or worse, may believe that they can offload all the dev team’s dirty work to you.
Are you overwhelmed, and wondering how to get started? How will you organize? How will you manage expectations? What management style should you use? How will you define and measure success?
In this session, we will cover a phased approach to organize your SRE team based on our real-world experience operating our cloud-based SaaS DevOps toolset. We will also cover some of our lessons learned and pitfalls for you to avoid to help you on your SRE journey.
Ritchie Schacher, SRE Architect, IBM Bluemix DevOps Services
Ritchie Schacher is a Senior Technical Staff Member with IBM® Bluemix® DevOps Services, and lead architect for SRE. He has a background in developer tools and team collaboration. He has broad experience in all aspects of the software development lifecycle, including product conception, planning and managing, design, coding, testing, deploying, hosting, releasing, and supporting products. For the last 3 years he has been a part of an exciting team developing and supporting cloud-based SaaS offerings for Bluemix®. In his spare time, Ritchie enjoys playing classical guitar.
Rob Orr, IBM
Rob Orr is a Program Director at IBM Cloud and leads the SRE team responsible for Bluemix DevOps Services. His current projects include automation tooling, developing SLIs, and coming up with new ways to make SRE the cool team. Previous to this role, Rob was a Sr. Development Manager on IBM Cloud automation products. In his spare time, Rob enjoys photography and scotch.
2:50 pm–3:45 pm
Grand Ballroom C
BPerf—Bing.com Cloud Profiling on Production
Mukul Sabharwal, Microsoft
BPerf is Bing.com's Production Profiler system. It is built on top of Event Tracing for Windows, a low-overhead logging facility that is able to collect events from the operating system and other participating components like Microsoft's Common Language Runtime (.NET).
BPerf is able to visualize this low-level data, associate it with a search query and present this performance analysis investigation with source code right there for the SRE to analyze.
The talk will cover architecture details of our production profiling system, present the data collection and storage challenges of operating at Bing scale, and conduct an investigation with a demo of the real system in action.
We'll also touch on writing high-performance, latency-sensitive managed code (focusing on .NET) and why a production profiling system is an important tool in the SRE toolkit, a perfect complement to the high-level monitoring tools that are a staple in the SRE world.
Mukul Sabharwal, Microsoft
Mukul Sabharwal is a Principal Technical Lead at Microsoft working on the Bing.com Frontend. Responsible for overall server performance and system health, he is a key member of the frontline SRE team and the performance architect for many low-level telemetry systems, including BPerf. He is an expert in .NET internals and a key contributor to the open source .NET CoreCLR project.
How Robust Monitoring Powers High Availability for LinkedIn Feed
Rushin Barot, LinkedIn
It is common practice for social networking services like LinkedIn to introduce a new feature to a small subset of users before deploying it to all members. Even with rigorous testing and tight processes, bugs are inevitably introduced with a new deploy. Unit and integration tests in a development environment cannot completely cover all use cases. This results in an inferior experience for the users subject to this treatment. In addition, these bugs cause service outages, impacting service availability and health metrics and increasing the on-call burden.
Although frameworks exist for testing new features in development clusters, it is not possible to completely simulate all aspects of a production environment. Read requests for the Feed service have enough variation to trigger large classes of unforeseen errors. So we introduced a production instance of the service, called a 'Dark Canary', that receives live production traffic teed from a production source node but does not return any responses upstream. New experimental code is deployed to the dark canary and compared against the current version on the production node for health failures. This helps us reduce the number of bad sessions for users and also provides a better understanding of the service's capacity requirements.
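The dark-canary pattern can be sketched roughly as follows (this is not LinkedIn's implementation; the hostnames and the third-party requests HTTP client are assumptions for the example): production serves every request as usual, while a copy is teed to the dark instance and its response is recorded for comparison but never returned upstream.

```python
# Minimal sketch of the dark-canary pattern: tee each request to a dark
# instance whose response is recorded but never returned to the user.
import threading
import requests  # third-party HTTP client, used here for brevity

PROD = "http://feed-prod.internal:8080"          # hypothetical hostnames
DARK = "http://feed-dark-canary.internal:8080"

def record_dark_result(path, outcome):
    print("dark-canary", path, outcome)          # stand-in for real metrics

def tee_to_dark(path, headers):
    try:
        r = requests.get(DARK + path, headers=headers, timeout=2)
        record_dark_result(path, r.status_code)  # compare offline vs. prod
    except requests.RequestException as exc:
        record_dark_result(path, f"error: {exc}")

def handle(path, headers):
    # Fire-and-forget copy; the user only ever sees production's answer.
    threading.Thread(target=tee_to_dark, args=(path, headers), daemon=True).start()
    return requests.get(PROD + path, headers=headers, timeout=2)
```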
Rushin Barot, LinkedIn
Rushin Barot is a Site Reliability Engineer at LinkedIn, where he contributes to Feed infrastructure. Prior to working at LinkedIn, he worked at Yahoo on Search infrastructure. He holds an MS in Computer Science from San Jose State University.
Grand Ballroom B
Tracking Service Infrastructure at Scale
John Arthorne, Shopify
Over the past two years Shopify built up a strong SRE team focused on the critical systems at the core of the company's business. However, at the same time the company was steadily growing a list of secondary production services, to the point where we had several hundred running applications, most of which had poorly defined ownership and ad-hoc infrastructure. These applications had a patchwork of tooling for build and deployment automation, monitoring, alerting, and load testing, but with very little consistency and lots of gaps.
Shopify avoided this looming disaster through automation. We built an application that tracked every production service, and linked it to associated owners, source code, and supporting infrastructure. This application, called Services DB, provides a central place where developers can discover and manage their services across various runtime environments. Services DB also provides a platform for measuring progress on infrastructure quality and building out additional tooling to automate manual steps in infrastructure management. Today, Shopify SREs aren’t worried about being woken up in the middle of the night because of failures in poorly maintained applications. Instead, SREs can focus on building automation, and use Services DB to apply it across large groups of services.
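A sketch of the kind of record a catalog like Services DB might keep; the fields and helper functions here are illustrative assumptions, not Shopify's actual schema. Knowing the owner, repo, and runtime for every application is what makes it possible to measure infrastructure quality and apply tooling across the whole fleet.

```python
# Toy service catalog: every production service gets a record with owners,
# source, runtime, and quality checks that tooling can query.
from dataclasses import dataclass, field

@dataclass
class Service:
    name: str
    owners: list
    repo: str
    runtime: str                                 # e.g. "kubernetes", "vm"
    checks: dict = field(default_factory=dict)   # e.g. {"has_alerts": True}

CATALOG = {}

def register(service):
    CATALOG[service.name] = service

def unowned_services():
    return [s.name for s in CATALOG.values() if not s.owners]

register(Service("shop-importer", ["team-imports"], "git@...:shop-importer", "kubernetes"))
register(Service("legacy-reports", [], "git@...:legacy-reports", "vm"))
print(unowned_services())  # ['legacy-reports'] -- the next cleanup target
```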
John Arthorne, Shopify
John works on the Shopify Production Engineering team, with a specific focus on creating developer tooling to accelerate application delivery. John is a frequent speaker at technical conferences in both Europe and North America, serves on conference program committees, is a JavaOne Rock Star, and frequently writes blogs and articles on technical topics. His current interests are in tools and practices for infrastructure automation, and in highly scalable cloud architectures. Before joining Shopify, John led a team building cloud-based developer tooling for IBM Bluemix, and was a prominent leader within the Eclipse open source community.
I'm Putting Sloths on the Map
Preetha Appan, Indeed.com
At Indeed, we strive to build systems that can withstand problems with an unreliable network. We want to anticipate and prevent failures, rather than just reacting to them. Our applications run on the private cloud, sharing infrastructure with other services on the same host. The interconnectedness of our system and resource infrastructure introduces challenges when inducing failures that simulate a slow or lossy network. We need the ability to slow down the network for one service or data source and test how this impacts other applications that use it—without causing side effects on applications in the same host.
In this talk, we'll describe Sloth, a Go tool for inducing network failures. Sloth is a daemon that runs on every host in our infrastructure, including database and index servers. Sloth works by adding and removing complex traffic shaping rules via Linux's tc and iptables. Sloth is implemented with access control and audit logging to ensure its usability without compromising security. It provides a web UI for manual testing and offers an API to embed destructive testing into integration tests. We will discuss specific examples of how, using Sloth, we discovered and fixed problems in monitoring, graceful degradation, and usability.
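The kind of traffic shaping Sloth automates can be approximated with tc's netem discipline. The sketch below is not Sloth's code: it degrades a whole interface rather than a single service (per-service shaping needs extra classification rules) and assumes root privileges.

```python
# Sketch of tc-based network degradation: add latency and packet loss on
# an interface with netem, then remove the qdisc to restore normal behavior.
import subprocess

def degrade(interface, delay_ms=200, loss_pct=1.0):
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root", "netem",
         "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"],
        check=True)

def restore(interface):
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root", "netem"],
                   check=True)

if __name__ == "__main__":
    degrade("eth0", delay_ms=500, loss_pct=2.0)   # simulate a slow, lossy link
    # ... run integration tests against the degraded network ...
    restore("eth0")
```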
Preetha Appan, Indeed.com
Preetha Appan is a principal software engineer at Indeed, and has expertise in building performant distributed systems for recommendations and search. Her past contributions to Indeed's job and resume search engines include text segmentation improvements, query expansion features, and other major infrastructure and performance improvements. She loves the SRE philosophy and embraces destructive testing, breaking everyone's applications to improve their resilience.
Grand Ballroom A
Lightning Talks 1
- Don't Wait to Ask for Help—Joseph Schneider, DroneDeploy
- Data Center Automation at Shopify—David Radcliffe, Shopify
- Issuing Certificates at Scale—Joel Goguen, Facebook
- Why Do SRE Teams Fail?—Igor Ebner de Carvalho, Microsoft
- Durability Engineering: Death before Data Loss—Tammy Butow, Dropbox
- No Haunted Graveyards—John Reese, Google
3:50 pm–4:45 pm
Grand Ballroom C
Breaking Things on Purpose
Kolton Andrus, Gremlin Inc.
Failure Testing prepares us, both socially and technically, for how our systems will behave in the face of failure. By proactively testing, we can find and fix problems before they become crises. Practice makes perfect, yet a real calamity is not a good time for training. Knowing how our systems fail is paramount to building a resilient service.
At Netflix and Amazon, we ran failure exercises on a regular basis to ensure we were prepared. These experiments helped us find problems and saved us from future incidents. Come and learn how to run an effective “Game Day” and safely test in production. Then sleep peacefully knowing you are ready!
Kolton Andrus, Gremlin Inc.
Kolton Andrus is the founder and CEO of Gremlin Inc., which provides 'Failure as a Service' to help companies build more resilient systems. Previously he was a Chaos Engineer at Netflix, improving streaming reliability and operating the Edge services. He designed and built F.I.T., Netflix's failure injection service. Prior to that, he improved the performance and reliability of the Amazon Retail website. At both companies he served as a 'Call Leader', managing the resolution of company-wide incidents. Kolton is passionate about building resilient systems, as it lets him break things for fun and profit.
Grand Ballroom B
Making the Most of Your SRE Toolbox: Bootstrapping Your SRE Team through Reuse
Mark Duquette, IBM DevOps Services SRE, and Tom Schmidt, IBM
You've spent the last year building tools and infrastructure to enable Continuous Delivery practices. Your next challenge: apply what you've learned building out a Common Infrastructure environment to jumpstart your new SRE organization.
Some of the same concepts that went into developing a comprehensive delivery pipeline can be used for SRE activities.
In this session, we will discuss how to re-purpose existing tools, dashboards, and frameworks so that they can be used to enable SRE tasks.
We will explore these topics using real-world experiences as we worked to build out an effective SRE organization.
Mark Duquette, IBM DevOps Services SRE
Mark Duquette currently works as a Site Reliability Engineer supporting IBM DevOps Services, where he is responsible for the monitoring and metrics infrastructure. Mark's expertise in designing reusable automation has helped teams as they explore SRE and embrace DevOps practices.
Tom Schmidt, IBM
Tom Schmidt currently works as a Site Reliability Engineer in support of DevOps Services at the IBM Canada Lab in Markham, Ontario, Canada. With a diverse background developing common infrastructure and test frameworks, and a passion for automation, Tom has transformed the IBM Cloud development organization's perspective on security. Tom leverages recent real-world experience applying SRE concepts to develop security and compliance solutions within a Continuous Delivery offering.
Grand Ballroom A
Tune Your Way to Savings!
Sandy Strong and Brian Martin, Twitter
The Twitter Ad Server is our revenue engine. It's designed to perform resiliently under high load and unpredictable spikes in demand. It accomplishes this by using concepts from control theory, which allows it to adapt its performance in response to changes in demand, based upon available system resources.
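As a rough illustration of that control-theory framing (not Twitter's ad-server code), a simple proportional controller can trim or grow the amount of optional work done per request so that observed utilization tracks a target, trading a little work for headroom under load spikes.

```python
# Toy proportional controller: adjust the fraction of optional work per
# request so that observed CPU utilization converges toward a target.
def proportional_controller(target_util, gain=0.5, min_work=0.1, max_work=1.0):
    work_fraction = max_work
    def adjust(observed_util):
        nonlocal work_fraction
        error = target_util - observed_util          # positive => spare capacity
        work_fraction += gain * error * work_fraction
        work_fraction = max(min_work, min(max_work, work_fraction))
        return work_fraction
    return adjust

adjust = proportional_controller(target_util=0.70)
for util in [0.55, 0.65, 0.80, 0.95, 0.75]:          # simulated CPU readings
    print(f"util={util:.2f} -> do {adjust(util):.0%} of optional work")
```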
Services that drive revenue lend themselves to optimization projects that focus on how to make the most of our resources. The output of the system is revenue, and the cost of compute resource cuts into that.
We ask ourselves: Can we do the same amount of work with less resource, by appropriately tuning the software and the systems it runs on? If so, this reduces our operational costs.
In this talk we will start at the beginning, where we had a hunch that it was possible to reduce operational costs, and continue all the way through the experiments and unexpected outcomes that led us to settle on our final set of optimizations.
Sandy Strong, Twitter
Sandy is a Site Reliability Engineer at Twitter, and has been embedded with the Ads Serving Team for two years.
Brian Martin, Twitter
Brian Martin is a Site Reliability Engineer at Twitter, working on our Core Storage systems.
4:45 pm–5:15 pm
Break with Refreshments
Grand and Market Street Foyers
5:15 pm–6:10 pm
Plenary Session
Grand Ballroom
Every Day Is Monday in Operations
Benjamin Purgason, LinkedIn
When I first joined LinkedIn the SRE landscape looked like the wild West: no rules, few laws, and everyone had their own way of doing things. The SRE team was a scrappy band of firefighters working around the clock just to keep the operational fires in check.
As we began to mature we came to resemble something you might find in the “DevOps” movement. We began influencing the lifecycle of applications, upleveling the LinkedIn stack, and reducing the need for the firefighters of yesteryear. Over time we grew into the modern SRE team at LinkedIn.
I am here to share my experiences, challenges, and learnings collected over the course of my tenure as a Site Reliability leader. I will cover ten axioms, fundamental patterns that offer an effective strategy for dealing with the pair of incredible demands placed on modern SREs: the need for uptime and the need to participate in an accelerating software development cycle.
Join me as I share my war stories—how things went right, how things went very wrong, and the role of the axioms throughout it all.
Benjamin Purgason, LinkedIn
Ben is a Senior Manager of Site Reliability at LinkedIn. He leads the Tools SRE team, responsible for the operational integrity of internal tooling. He specializes in developing the next generation of leadership, continuing to improve the culture, and increasing the scope of SRE contributions across the company.
6:15 pm–8:15 pm
Reception
Atrium, Sponsored by Google
Tuesday, March 14, 2017
7:30 am–8:30 am
Continental Breakfast
Grand and Market Street Foyers
8:30 am–9:25 am
Plenary Session
Grand Ballroom
Traps and Cookies
Tanya Reilly, Google
Does your production environment expect perfect humans? Does technical debt turn your small changes into minefields? This talk highlights tools, code, configuration, and documentation that set us up for disaster. It discusses common traps that we can disarm and remove, instead of spending precious brain cycles avoiding them. And it offers practical advice for sending your future self (and future coworkers!) little gifts, instead of post-mortems that just say "human error :-(". Includes stories of preventable outages. Bring your schadenfreude.
Tanya Reilly, Google
Tanya Reilly has been a Systems Administrator and Site Reliability Engineer at Google since 2005, working on low level infrastructure like distributed locking, load balancing, bootstrapping and monitoring systems. Before Google, she was a Systems Administrator at eircom.net, Ireland's largest ISP, and before that she was the entire IT Department for a small software house.
Observability in the Cambrian Stack Era
Charity Majors, Honeycomb
Distributed systems, microservices, automation and orchestration, multiple persistence layers, containers and schedulers... today's infrastructure is a Cambrian explosion of novelty and complexity. How is a humble engineer to make sense of it all? That's where observability comes in: engineering your systems to be understandable, explorable, and self-explanatory. Let's talk about what modern tooling is for complex systems and how to bootstrap a culture of observability.
Charity Majors, Honeycomb
CEO/cofounder of honeycomb.io. Previously ran operations at Parse/Facebook, managing a massive fleet of MongoDB replica sets as well as Redis, Cassandra, and MySQL. Worked closely with the RocksDB team at Facebook to develop and roll out the world's first Mongo+Rocks deployment using the pluggable storage engine API. Likes single malt scotch; kind of a snob about it.
9:25 am–9:55 am
Break with Refreshments
Grand and Market Street Foyers
9:55 am–10:50 am
Grand Ballroom C
Reducing MTTR and False Escalations: Event Correlation at LinkedIn
Michael Kehoe, LinkedIn
LinkedIn's production stack is made up of over 900 applications and over 2200 internal APIs. With any given application having many interconnected pieces, it is difficult to escalate to the right person in a timely manner.
In order to combat this, LinkedIn built an Event Correlation Engine that monitors service health and maps dependencies between services to correctly escalate to the SREs who own the unhealthy service.
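A much-simplified sketch of the escalation idea (not LinkedIn's engine): given a service dependency graph and current health, walk downstream from the alerting service and page the owners of the deepest unhealthy dependency rather than everyone along the call path. The graph, health set, and owner mapping below are invented for the example.

```python
# Toy root-cause escalation over a service dependency graph.
DEPS = {"frontend": ["feed", "profile"],
        "feed": ["feed-storage"],
        "profile": [],
        "feed-storage": []}
UNHEALTHY = {"frontend", "feed", "feed-storage"}
OWNERS = {"feed-storage": "storage-sre", "feed": "feed-sre", "frontend": "edge-sre"}

def root_causes(service):
    bad_deps = [d for d in DEPS.get(service, []) if d in UNHEALTHY]
    if not bad_deps:
        return [service]          # unhealthy with no unhealthy dependencies
    roots = []
    for dep in bad_deps:
        roots.extend(root_causes(dep))
    return roots

for root in root_causes("frontend"):
    print(f"escalate to {OWNERS[root]} for {root}")   # storage-sre / feed-storage
```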
We’ll discuss the approach we used in building a correlation engine and how it has been used at LinkedIn to reduce incident impact and provide better quality of life to LinkedIn’s oncall engineers.
Michael Kehoe, LinkedIn
Michael Kehoe, Staff Site Reliability Engineer in the Production-SRE team, joined the LinkedIn operations team as a new college graduate in January 2014. Prior to that, Michael studied Engineering at the University of Queensland (Australia) where he majored in Electrical Engineering. During his time studying, he interned at NASA Ames Research Center working on the PhoneSat project.
Grand Ballroom B
The Road to Chaos
Nora Jones, Jet.com
Chaos Engineering is described as “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
In this presentation I'll go into why we got started with Chaos Engineering, how we built it with a functional programming language, and the road to cultural acceptance. Surprisingly enough, social acceptance proved to be more difficult than actual implementation. There are several different "levels" of chaos that I recommend introducing before unleashing a full-blown chaos solution. I will go over each of the levels, evaluate how the chaos tool has influenced the process, and leave you with a game plan of how you can culturally and technically introduce fault injection in your organization.
Nora Jones, Jet.com
I'm a Software Productivity Engineer at Jet.com. I'm passionate about delivering high quality software, improving processes, and promoting efficiency within architecture.
Grand Ballroom A
Panel: Training New SREs
Moderator: Ruth Grace Wong, Pinterest
Panelists: Katie Ballinger, CircleCI; Saravanan Loganathan, Yahoo; Rita Lu, Google; Craig Sebenik, Matterport; Andrew Widdowson, Google
Representatives from Google, Yahoo, Matterport, and CircleCI draw on their experience as junior SREs and as senior SREs training junior SREs to arrive at a list of best practices for training new SREs. This panel will provide advice for SRE trainees, SRE trainers, and SRE teammates.
10:55 am–11:50 am
Grand Ballroom C
Postmortem Action Items: Plan the Work and Work the Plan
Sue Lueder and Betsy Beyer, Google
In the 2016 O'Reilly book Site Reliability Engineering, Google described our culture of blameless postmortems and recommended that organizations institute a similar culture of postmortems after production incidents. This talk shares some best practices and challenges in designing an appropriate action item plan and subsequently executing that plan in a complex environment of competing priorities, resource limitations, and operational realities. We discuss best practices for developing high-quality action items (AIs) for a postmortem, plus methods of ensuring these AIs actually get implemented so that we don't suffer the exact same outage, or an even worse one, again. It's worth noting that Google teams are by no means perfect at formulating and executing postmortem action items. We still have a lot to learn in this difficult area, and are sharing our thoughts and strategies to give a starting point for discussion throughout the industry.
Sue Lueder, Google
Sue Lueder joined Google as a Site Reliability Program Manager in 2014 and is on the team responsible for disaster testing and readiness, incident management processes and tools, and incident analysis. Previous to Google, Sue was a technical program manager and a systems, software, and quality engineer in the wireless and smart energy industries (Ingenu Wireless, Texas Instruments, Qualcomm). She has an M.S. in Organization Development from Pepperdine University and a B.S. in Physics from UCSD.
Betsy Beyer, Google
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. She holds degrees from Stanford and Tulane.
Building Real Time Infrastructure at Facebook
Jeff Barber and Shie Erlich, Facebook
Real Time Infrastructure is a set of systems within Facebook that enables real-time experiences on Facebook products, such as sending payloads to and from devices, real-time presence, push notifications, real-time delivery of comments and likes on News Feed, comments and reactions on Live (video), and many more across Facebook, Messenger, and Instagram. With more than 1B daily users on Facebook, Real Time Infrastructure is a very large scale affair: on the surface, these features seem to "just work," while on the backend a complex system designed to work at scale keeps them running smoothly. This talk will focus on architecture, and how we design our systems at scale for low latency while reducing the risk of failure and making it easier to recover when failure does occur.
Jeff Barber, Facebook
Jeff was writing code before Al Gore invented the Internet. Starting in college with building game engines for no games, his passion led him directly into infrastructure. After starting a company, doing a tour of duty at Amazon working on S3, Jeff landed a sweet gig at Facebook working on Real-time Infrastructure.
Shie Erlich, Facebook
Shie is an Engineering Leader for Facebook’s Real Time Infrastructure, which powers many large-scale real time experiences across Facebook, Messenger and Instagram. Previously Shie led engineering teams in Amazon’s S3 and Microsoft’s Windows Azure and has shipped software across the stack from cloud services and large scale distributed systems, through web development to RT/embedded systems.
Grand Ballroom B
Anomaly Detection in Infrequently Occurred Patterns
Dong Wang, Baidu Inc.
Anomaly detection is one of the most important jobs of SREs. The usual way is to find frequently occurring traffic patterns and regard them as the normal value ranges. Any value beyond the range is regarded as an anomaly. However, on some special dates, especially holidays, traffic shows significantly different patterns, and the commonly used alerting strategies usually do not work well. In this talk, we will introduce the approaches we use to deal with such issues. Our scenarios are actually more complicated in that, in China, holidays do not fall on fixed calendar dates, and different holidays have completely different traffic patterns. Nevertheless, practical deployment of the methods described in this talk shows quite satisfying results in terms of alerting precision and recall.
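A toy sketch in the spirit of the talk (not Baidu's system): when a date maps to a named holiday, build the expected range from the same holiday in previous years rather than from recent ordinary days, since holiday traffic follows its own pattern. The dates and traffic numbers below are invented.

```python
# Holiday-aware thresholding: compare a day's traffic against peers that
# share its holiday label rather than against recent ordinary days.
import statistics

def expected_range(date, history, holiday_of, k=3.0):
    """history: {date: traffic}; holiday_of: date -> holiday name or None."""
    holiday = holiday_of(date)
    peers = [v for d, v in history.items()
             if d != date and holiday_of(d) == holiday]
    mean = statistics.mean(peers)
    stdev = statistics.pstdev(peers)
    return mean - k * stdev, mean + k * stdev

def is_anomalous(date, value, history, holiday_of):
    low, high = expected_range(date, history, holiday_of)
    return not (low <= value <= high)

SPRING_FESTIVAL = {"2015-02-19", "2016-02-08", "2016-02-09", "2017-01-28"}
def holiday_of(d):
    return "spring_festival" if d in SPRING_FESTIVAL else None

history = {"2015-02-19": 40, "2016-02-08": 42, "2016-02-09": 45,   # holiday days
           "2017-01-20": 100, "2017-01-23": 101}                   # ordinary weekdays
print(is_anomalous("2017-01-28", 44, history, holiday_of))  # False: normal for the holiday
print(is_anomalous("2017-01-28", 95, history, holiday_of))  # True: too high for a holiday
```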
Dong Wang, Baidu Inc.
Dong Wang is a principal architect at Baidu, the largest search engine in China, and has led Baidu’s SRE team to work on some challenging projects, such as automatic anomaly detection and issue fixing in large scale Internet sites. He is also interested in user experience improvement in the mobile Internet services. Prior to Baidu, he worked at Bell Labs and Google for more than 15 years in total.
Root Cause, You're Probably Doing It Wrong
TJ Gibson, PayPal
It's not very difficult to find a misbehaving application or a frozen computer; it's a lot of what SRE does either manually or using automated means. All too often, SRE is quick to call the bad software or hardware the root cause of a particular issue.
The problem is, this thinking doesn't really get to the heart of the problem. Since identifying the root cause is the starting point for identifying a solution, getting it wrong (even slightly) can lead to dire consequences and even bigger problems down the road.
I want to challenge SREs to think differently and more critically about how they identify the root cause of the problems they encounter.
TJ Gibson, PayPal
A technologist with more than 20 years of experience, ranging from writing code and doing network engineering and system administration across Europe with the USAF, to a stint with a boutique security services provider, and on to leadership positions with InfoSec and Professional Services at PayPal. I am currently settling into a new role within PayPal's SRE organization.
Grand Ballroom A
Panel: Training New SREs (continued)
Moderator: Ruth Grace Wong, Pinterest
Panelists: Katie Ballinger, CircleCI; Saravanan Loganathan, Yahoo; Rita Lu, Google; Craig Sebenik, Matterport; Andrew Widdowson, Google
Representatives from Google, Yahoo, Matterport, and CircleCI draw on their experience as junior SREs and as senior SREs training junior SREs to arrive at a list of best practices for training new SREs. This panel will provide advice for SRE trainees, SRE trainers, and SRE teammates.
Feedback Loops: How SREs Benefit and What Is Needed to Realize Their Potential
Pooja Tangi, LinkedIn
The problem:
Almost every SRE has felt the pain of using an ineffective tool or of a top-down initiative requiring mass adoption of a beta product; however, very few of these pain points are actually brought to the attention of upper management and resolved in a timely manner. What is missing? A feedback loop, perhaps, or a systematic bottom-up approach for surfacing the problem areas?
The solution:
As a Technical Program Manager and a former SRE, I have questioned the proposed value of a given platform. Over a few different initiatives, I have realized that introducing bottom-up communication techniques and cross-team collaboration transforms discontent and toxic communication into supportive engagement, which facilitates development of a better product that SREs appreciate and are excited to onboard.
I am not talking about a magical tool that will fix all of the issues above, but small process tweaks that a project lead or a program manager could easily adopt to help better the organization.
Pooja Tangi, LinkedIn
In my previous role I worked as an SRE for 5 years. All SRE-related issues are very close to my heart, and in my current role as a Technical Program Manager at LinkedIn I can easily relate to them and strive to resolve them to make the SRE organization a better place.
11:55 am–12:50 pm
Grand Ballroom C
DNSControl: A DSL for DNS as Code from StackOverflow.com
Craig Peterson and Tom Limoncelli, Stack Overflow
Introducing dnscontrol: the DNS DSL and compiler that lets StackOverflow.com treat DNS as code, with all the DevOps-y benefits of CI/CD, unit testing, and more. StackOverflow.com has a large and complex DNS configuration including many domains, complex CDN interactions, unavoidably repeated data, and more. The dnscontrol language permits us to specify our domains at a high level and leave the actual manipulations and updates to our automation. Massive changes, such as failovers between datacenters, are now a matter of changing a variable and recompiling. We've been able to address new problems with smart macros rather than manual updates. Dnscontrol is extendable and has plug-ins for BIND, CloudFlare, Route53/AWS, Azure, Google Cloud DNS, Name.Com, and more.
Tom Limoncelli, Stack Overflow
Tom is an SRE Manager at StackOverflow.com and a co-author of The Practice of System and Network Administration (http://the-sysadmin-book.com). He is an internationally recognized author, speaker, system administrator, and DevOps advocate. He's also known for other books such as The Practice of Cloud System Administration (http://the-cloud-book.com) and Time Management for System Administrators (O'Reilly). Previously he's worked at small and large companies including Google, Bell Labs/Lucent, and AT&T. His blog is http://EverythingSysadmin.com and he tweets @YesThatTom.
Craig Peterson, Stack Overflow
Craig is a developer on the SRE team at Stack Overflow. He is a lead developer of their open source monitoring system, Bosun, and of DNSControl. He has a passion for scale and performance and for improving SRE and developer workflows in any way possible.
Grand Ballroom B
Lyft's Envoy: Experiences Operating a Large Service Mesh
Matt Klein, Lyft
Over the past several years Lyft has migrated from a monolith to a sophisticated "service mesh" powered by Envoy, a new high performance open source proxy which aims to make the network transparent to applications. Envoy's out-of-process architecture allows it to be used alongside any language or runtime. At its core, Envoy is an L4 proxy with a pluggable filter chain model. It also includes a full HTTP stack with a parallel pluggable L7 filter chain. This programming model allows Envoy to be used for a variety of different scenarios including HTTP/2 gRPC proxying, MongoDB filtering and rate limiting, etc. Envoy provides advanced load balancing support including eventually consistent service discovery, circuit breakers, retries, zone-aware load balancing, etc. Envoy also has best-in-class observability using statistics, logging, and distributed tracing.
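Envoy itself is configured rather than programmed, but one of the resilience primitives it provides per upstream cluster, the circuit breaker, can be illustrated with a generic sketch: after enough consecutive failures the breaker opens and calls fail fast, then a cooldown period lets a trial request probe whether the upstream has recovered. This is a toy illustration of the pattern, not Envoy's implementation.

```python
# Toy circuit breaker: open after repeated failures, fail fast while open,
# and allow a trial call after a cooldown to test for recovery.
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_seconds=30):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```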
In this talk we will discuss why we developed Envoy, focusing primarily on the operational agility that the burgeoning “service mesh” SoA paradigm provides as well as discussing lessons learned along the way.
For more information on Envoy see: https://lyft.github.io/envoy/
Matt Klein, Lyft
Matt Klein is a software engineer at Lyft and the architect of Envoy. He has been working on operating systems, virtualization, distributed systems, networking, and in general making systems easy to operate for 15 years across a variety of companies. Some highlights include leading the development of Twitter's C++ L7 edge proxy and working on high performance computing and networking in Amazon's EC2.
Grand Ballroom A
Principles of Chaos Engineering
Casey Rosenthal, Netflix
Distributed systems create threats to resilience that are not addressed by classical approaches to development and testing. We've passed the point where individual humans can reasonably navigate these systems at scale. As we embrace a world that emphasizes automation and engineering over architecting, we have left gaps in our understanding of complex systems.
Chaos Engineering is a new discipline within Software Engineering, building confidence in the behavior of distributed systems at scale. SREs and dedicated practitioners adopt Chaos Engineering as a practical tool for improving resiliency. An explicit, empirical approach provides a formal framework for adopting, implementing, and measuring the success of a Chaos Engineering program. Additional best practices define an ideal implementation, establishing the gold standard for this nascent discipline.
Chaos Engineering isn't the process of creating chaos, but rather of surfacing the chaos that is inherent in the behavior of these systems at scale. By focusing on high-level business metrics, we sidestep understanding *how* a particular model works in order to identify *whether* it works under realistic, turbulent conditions in production. This fills a gap, arming SREs with a better, holistic understanding of the system's behavior.
Casey Rosenthal, Netflix
Engineering manager for the Traffic team and the Chaos team at Netflix. Previously an executive manager and senior architect, Casey has managed teams to tackle Big Data, architect solutions to difficult problems, and train others to do the same. He finds opportunities to leverage his experience with distributed systems, artificial intelligence, translating novel algorithms and academia into working models, and selling a vision of the possible to clients and colleagues alike. For fun, Casey models human behavior using personality profiles in Ruby, Erlang, Elixir, Prolog, and Scala.
12:50 pm–1:50 pm
Lunch
Atrium
1:50 pm–2:45 pm
Grand Ballroom C
It's the End of the World as We Know It (and I Feel Fine): Engineering for Crisis Response at Scale
Matthew Simons, Workiva
It's 4 AM and your phone is ringing. The system is in full meltdown and the company is losing money. Start the caffeine drip and get ready for a rough day.
We've all been there. When the world is on fire, Reliability Engineers are almost always on the front lines, extinguishers in hand. For the sake of our own sanity, what can we do to minimize the frequency and impact of production crises? How can we engineer the system to be more resilient, and what best practices can we employ to that end?
These are the questions we've been asking at Workiva as we've grown and scaled our operations from a small startup to a publicly-traded company in only a few years. We're still sane (mostly), so we'd like to share what's worked for us. Hopefully it will help you too.
Matthew Simons, Workiva
Matthew is a Reliability Engineer who works for a Top-10 tech company you've probably never heard of: Workiva (ranked #4 in 2016's Best Tech Companies to Work For). He grew up in the Bay Area and now resides in Ames, Iowa. He's an entrepreneur and a passionate driver of innovation, relentlessly pursuing higher levels of automation and process streamlining. He's also a woodworker, a chef, and a PC games enthusiast in his spare time.
Grand Ballroom B
Zero Trust Networks: Building Trusted Systems in Untrusted Networks
Doug Barth, Stripe, and Evan Gilman, PagerDuty
Let's face it - the perimeter-based architecture has failed us. Today's attack vectors can easily defeat expensive stateful firewalls and evade IDS systems. Perhaps even worse, perimeters trick people into believing that the network behind them is somehow "safe," despite the fact that chances are overwhelmingly high that at least one device on that network is already compromised.
It is time to consider an alternative approach. Zero Trust is a new security model, one which considers all parts of the network to be equally untrusted. Taking this stance dramatically changes the way we implement security systems. For instance, how useful is a perimeter firewall if the networks on either side are equally untrusted? What is your VPN protecting if the network you're dialing into is untrusted? The Zero Trust architecture is very different indeed.
In this talk, we'll go over the Zero Trust model itself, why it is so important, what a Zero Trust network looks like, and what components are required in order to actually meet the challenge.
Doug Barth, Stripe
Doug is a Site Reliability Engineer at Stripe. With a deep interest in software, hardware, and production systems, he has spent his career using computers to solve hard problems. He helped deploy PagerDuty's IPsec mesh network, and is now working on a book about Zero Trust Networks.
Evan Gilman, PagerDuty
Evan is currently a Site Reliability Engineer at PagerDuty. With roots in academia, he finds passion in both reliable, performant systems, and the networks they run on. When he's not building automated systems for PagerDuty, he can be found at the nearest pinball table or working on his upcoming book, Zero Trust Networks.
Grand Ballroom A
How to Work with Technical Writers
Diane Bates, Google
Whether you’re building an external product or a tool to make life easier for your internal engineers, your users deserve excellent documentation to ramp up quickly. Part of creating a great customer experience is providing documentation that’s easy to use and answers your customer’s questions before they even know what their questions are. The goal of the engineer/TW relationship is to polish a project before releasing it to customers.
This talk describes:
- What a TW does
- How to work with a TW to scope a project
- Expectations for resources and engineering hours
- General guidelines for how to work with a TW
Diane Bates, Google
I'm a tech writer on the Google SRE Tech Writing team. I've been on the team for two years and contributed to the SRE book. I've been a tech writer for 22 years, working for Microsoft, Cray Inc., and RealNetworks, and contracting at several other companies, including Motorola, Seattle Avionics, Boeing, Microscan, and GTE. Before that, I was on staff at the University of Washington Quantitative Science Department for 2 years. I was an electronics technician on F-16 radar in the Air Force Reserves for 9 years.
2:50 pm–3:45 pm
Grand Ballroom C
A Secure Docker Build Pipeline for Microservices
Talk cancelled due to illness
Mason Jones, Staff Platform Engineer at Credit Karma
When we began shifting to a microservices architecture, we had to answer the question of how to build Docker images with an emphasis on security: how do we know what's being built and deployed, and whether it's safe? We created a centralized build pipeline providing developer flexibility while allowing security and compliance to manage the process. In this presentation I'll describe the pipeline, our requirements, and how we solved the challenge.
Mason Jones, Staff Platform Engineer at Credit Karma
Mason is a Staff Engineer on the Platform Services team at Credit Karma in San Francisco, working on microservices and container infrastructure. Prior to Credit Karma, he was VP of Technology at Womply and Chief Architect at RPX Corporation.
Don't Call Me Remote! Building and Managing Distributed Teams
Tony Lyon, Production Engineering Manager, Facebook
This talk discusses the challenges and best practices of managing globally distributed teams: how to recruit and develop teams in remote locations, the investment needed for successful distance management, and how to extend company culture while building local culture, with a focus on communication, inclusion, and adaptation. Drawing on experience managing teams distributed across EMEA and across North and South America, we'll talk about communication strategies, keeping remote career growth on a level playing field with HQ, and when to respond with yes, repeat, no.
Tony Lyon, Production Engineering Manager, Facebook
A 25-year industry veteran, Tony Lyon is a Production Engineering manager at Facebook Seattle, where he currently supports multiple PE teams. Tony was the first Production Engineering manager in Seattle, where a big part of the role was building out the PE teams and providing management support to a broad swath of PE teams as the office grew. He is now transitioning to become the first PE manager in Facebook's NYC office, establishing a Production Engineering presence there in support of the broader engineering team's growth in New York.
Grand Ballroom B
Killing Our Darlings: How to Deprecate Systems
Daniele Sluijters, Spotify
Many of an SRE's duties revolve around helping create new (infrastructure) services and supporting them while they're in use. What we don't talk about very often is how to shut these services down once they've reached EOL.
This talk focuses on what SREs can do, from the very start right through to the end, to make it possible to deprecate and shut down services with minimal impact on the rest of the organisation.
Daniele Sluijters, Spotify
Daniele Sluijters is an SRE at Spotify who works on our configuration management, service discovery, and traffic services, as well as evangelising FOSS within the organisation and pushing for open sourcing our infrastructure components and tooling. He is a well-known member of the Puppet community, with many contributions to its ecosystem, and a regular speaker.
Engineering Reliable Mobile Applications
Kristine Chen, Google
Traditional SRE has accomplished much on the server side, making globally distributed systems and services incredibly reliable. As more users rely on mobile in their daily lives, it’s increasingly important that SREs think about reliability on the client side. This talk discusses the importance of shifting focus to mobile applications and what we are doing to move to a world where client code is considered a first-class citizen. Much of the talk will center on Android, though many principles can be applied to other platforms as well.
Kristine Chen, Google
Kristine Chen has been a Site Reliability Engineer at Google for 4 years. She worked on Google Search’s serving infrastructure before moving to a team focusing on reliability on mobile devices.
Grand Ballroom A
SRE Isn't for Everyone, but It Could Be
Ashe Dryden, Programming Diversity
SRE as a field has grown exponentially over the past few years, yet it remains burdened by barriers to entry. What can we be doing better as a community and an industry to attract and retain talent? What obstacles do people face when trying to participate? What can we do to increase participation? How can we foster an environment that supports, sustains, and enables people from a wide range of backgrounds, experiences, and identities?
Ashe Dryden, Programming Diversity
Ashe Dryden is a former White House fellow, programmer, diversity advocate and consultant, prolific writer, speaker, and creator of AlterConf and Fund Club. She’s one of the foremost experts on diversity in the tech industry. She’s currently writing two books: The Diverse Team and The Inclusive Event. Her work has been featured in the New York Times, Scientific American, Wired, NPR, and more. Ashe will be speaking on the topic of open, diverse, and inclusive communities.
3:50 pm–4:45 pm
Grand Ballroom C
Changing Old Habits: Meetup’s Path to SRE
Rich Hsieh, Core Engineering Manager at Meetup
Back in July 2016, I asked myself the question, “What would an SRE team look like at Meetup?” Given the opportunity to try SRE at Meetup, I came up with a six-month roadmap to change Meetup’s existing processes. Out of this came new approaches to incident escalation, postmortems, and playbooks.
Over the course of 6 months, I found unexpected allies in the company, improved broken processes and made strides toward a culture of learning, transparency and documentation -- changing old habits for the better.
If you’re looking to start SRE at your company or want to learn how small tweaks can lead to big wins, come hear how I created a pilot program of SRE at Meetup. I will go over a month-by-month breakdown of the SRE team’s tasks, including getting buy-in from leadership, rallying support from stakeholders, and identifying areas of impact, sharing anecdotes along the way.
Of course, this is just the beginning. SRE at Meetup is by no means near its full potential and there are still some big problems to address, but hopefully these steps will inspire you and your team to get on the right track.
Rich Hsieh, Core Engineering Manager at Meetup
Rich Hsieh is a Core Engineering Manager at Meetup. Having joined Meetup in July 2007, he's seen Meetup grow from a small startup to a company that's now 200 strong. During his time at Meetup, he's worn several hats, developing and architecting features on the site, working on-call shifts, introducing SRE, and now managing a team of engineers. In his spare time, you'll find him organizing and training for another marathon with his Meetup, Dashing Whippets Running Team, in NYC.
Ambry—LinkedIn’s Distributed Immutable Object Store
Arjun Shenoy AV, LinkedIn
The presentation will start with a brief introduction to Ambry and the importance of blob stores in cloud-based systems, followed by a short look at the situation at LinkedIn before Ambry. The presentation will cover the major features, the architecture in detail, and the main operational flows that take place in the background while a request is being processed. Towards the end, we will mention the latest features that have been added to Ambry.
Arjun Shenoy AV, LinkedIn
Arjun is a Site Reliability Engineer (Data Infra Storage) at LinkedIn.
Grand Ballroom B
Lightning Talks 2
- 4-Minute Docker Deploys—Kat Drobnjakovic, Shopify
- Measuring Reliability through VALET Metrics—Raja Selvaraj, Home Depot
- Lessons Learned from Transforming SEs into SRE at Microsoft Azure—Cezar Alevatto Guimaraes, Microsoft
- SR(securit)E—Tom Schmidt, IBM
- How Three Changes Led to Big Increases in Oncall Health—Dale Neufeld, Shopify
Grand Ballroom A
SRE Isn't for Everyone, but It Could Be (continued)
Ashe Dryden, Programming Diversity
SRE as a field has grown exponentially over the past few years, yet it remains burdened by barriers to entry. What can we be doing better as a community and an industry to attract and retain talent? What obstacles do people face when trying to participate? What can we do to increase participation? How can we foster an environment that supports, sustains, and enables people from a wide range of backgrounds, experiences, and identities?
Ashe Dryden, Programming Diversity
Ashe Dryden is a former White House fellow, programmer, diversity advocate and consultant, prolific writer, speaker, and creator of AlterConf and Fund Club. She’s one of the foremost experts on diversity in the tech industry. She’s currently writing two books: The Diverse Team and The Inclusive Event. Her work has been featured in the New York Times, Scientific American, Wired, NPR, and more. Ashe will be speaking on the topic of open, diverse, and inclusive communities.
4:45 pm–5:15 pm
Break with Refreshments
Grand and Market Street Foyers
5:15 pm–6:10 pm
Plenary Session
Grand Ballroom
Reliability When Everything Is a Platform: Why You Need to SRE Your Customers
Dave Rensin, Google
The general trend in software over the last several years is to give every system an API and turn every product into a platform. When these systems only served end users, their reliability depended solely on how well we did our jobs as SREs. Increasingly, however, our customers' perceptions of our reliability are being driven by the quality of the software they bring to our platforms. The normal boundaries between our platforms and our customers are being blurred and it's getting harder to deliver a consistent end user reliability experience.
In this talk we'll discuss a provocative idea—that as SREs we should take joint operational responsibility and go on-call for the systems our customers build on our platforms. We'll discuss the specific technical and operational challenges in this approach and the results of an experiment we're running at Google to address this need.
Finally, we'll try to take a glimpse into the future and see what these changes mean for the future of SRE as a discipline.
Dave Rensin, Google
Dave Rensin is a Google SRE Director leading Customer Reliability Engineering (CRE)—a team of SREs pointed outward at customer production systems. Previously, he led Global Support for Google Cloud. As a longtime startup veteran he has lived through an improbable number of "success disasters" and pathologically weird failure modes. Ask him how to secure a handheld computer by accidentally writing software to make it catch fire, why a potato chip can is a terrible companion on a North Sea oil derrick, or about the time he told Steve Jobs that the iPhone was "destined to fail."
6:15 pm–7:00 pm
Sponsor Showcase Happy Hour
Grand and Market Street Foyers
7:00 pm–8:00 pm
Reception
Atrium, Sponsored by Circonus