Conference Programme

All sessions will be held at the DoubleTree by Hilton Dublin unless otherwise noted.

Monday, 11 July 2016

08:00–09:00 Monday

Morning Coffee and Tea

Pre-Function Area

09:00–10:20 Monday

Plenary Session

Pembroke/Lansdowne Rooms

Splicing SRE DNA Sequences in the Biggest Software Company on the Planet

Greg Veith, Microsoft

The principles and constructs of DevOps are pervading the industry and have lit the path to executing with both speed and quality in balance while managing hockey-stick growth. SRE has established itself as the most effective incarnation of DevOps in our industry, yet most companies and organizations are nowhere near the “SRE goal state.” As this audience knows, building SRE and applying DevOps principles to existing companies and service code bases involves cultural engineering as well as deep technical investment. SRE is a step-function change in many dimensions and requires new “genes” to be inserted into an organization’s DNA. There is no larger enterprise cloud and consumer company in the world than Microsoft, and we are on that journey now, investing heavily to shift to SRE. The core of Microsoft was not born in the cloud; it was born in first-generation consumer computing (the PC) and then rose to dominate the enterprise. A handful of services at Microsoft were born in the cloud and have scaled massively, with Bing Search the “granddaddy,” carrying 13 years of experience from which we have learned much. Microsoft Azure is the Enterprise Cloud. We are making an enormous investment to run at galactic scale, to be the infrastructure for the world’s infrastructure, and to compete relentlessly in the market with top competitors. In this talk we will compare and contrast the journey within Bing with the current state of execution and operations, and show how we are taking the lessons from that experience, inspiration from the industry, and learnings to date as we build SRE within Azure.

Greg is the Director of the Site Reliability Engineering team in Azure, Microsoft’s cloud infrastructure—a multi-billion-dollar business that is the foundation of the company’s service offerings. Azure is deployed across all geographies and used worldwide by millions of customers. In addition to first-party workloads such as Office 365, Bing, and Dynamics, Azure is leveraged by multi-billion-dollar Fortune 500s, small and medium-sized businesses, and startups. The production footprint is enormous—rivaled by only a couple of companies on the planet. Azure is uniquely positioned in the enterprise cloud space, and its service offerings include IaaS, PaaS, and SaaS (infrastructure, platform, and software as a service).

Outside of work, you can find Greg spending time with his family, skiing, hiking, biking, playing on the water, and voraciously consuming words (reading).

Available Media

Doorman: Global Distributed Client Side Rate Limiting

Jos Visser, Google

Jos Visser has been working in the field of reliable and highly available systems since 1988. Starting as a systems programmer (MVS) at a bank, Jos's 25+ year career has seen him working with a variety of mission-critical systems technologies, including Stratus fault-tolerant systems, HP MC/ServiceGuard, Sun Enterprise Cluster, and Linux LifeKeeper. Jos joined Google in 2006 as an engineer in the Maps SRE team. Since then he has worked in a number of different areas, including Social (Orkut SRE), Google's cloud computing, backup, and monitoring teams, and YouTube. Since early 2016 he has been working in the Travel SRE team in Cambridge, MA, where he is the tech lead for the pipelines that ingest airline and travel industry data.

Doorman is a Google-developed system for global, distributed, client-side rate limiting, which is in the process of being open sourced. With Doorman, an arbitrary number of globally distributed clients can coordinate their usage of a shared resource so that global usage does not exceed global capacity.

This presentation:

  • Describes the fundamentals of the Doorman system
  • Explains the concepts of the RPC protocol between Doorman components
  • Shows code examples of Doorman configurations and clients
  • Shows graphs of how Doorman clients ask for and get capacity, and how this sums up globally
  • Explains how Doorman deals with spikes, clients going away, servers going away
  • Explains Doorman's system reliability features
  • Points to the Doorman open source repository
  • Explains the Doorman simulation (in Python) which can be used to quickly verify Doorman's behaviour in a specific scenario

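As a rough illustration of the client-side idea described above (a hypothetical sketch; the actual Doorman API, RPC protocol, and sharing algorithms differ), each client periodically asks a central capacity server for a lease and throttles itself to whatever it was granted:

    import threading

    class FairCapacityServer:
        """Toy capacity server: splits a global limit evenly among known clients.
        Illustrative only; Doorman's real algorithms and RPC protocol differ."""
        def __init__(self, global_capacity):
            self.global_capacity = float(global_capacity)
            self.clients = set()
            self.lock = threading.Lock()

        def request_capacity(self, client_id, wanted):
            with self.lock:
                self.clients.add(client_id)
                fair_share = self.global_capacity / len(self.clients)
                return min(wanted, fair_share)   # never grant more than a fair share

    class RateLimitedClient:
        """Client that refreshes its capacity lease and rate limits itself locally."""
        def __init__(self, client_id, server, wanted_qps):
            self.client_id, self.server, self.wanted_qps = client_id, server, wanted_qps
            self.granted_qps = 0.0

        def refresh_lease(self):
            self.granted_qps = self.server.request_capacity(self.client_id, self.wanted_qps)

    server = FairCapacityServer(global_capacity=100)    # 100 QPS shared globally
    clients = [RateLimitedClient("client-%d" % i, server, wanted_qps=60) for i in range(3)]
    for _ in range(2):                                  # leases converge once all clients are known
        for c in clients:
            c.refresh_lease()
    print([c.granted_qps for c in clients])             # ~33.3 each; global usage stays within 100

How the real system handles spikes, clients or servers going away, and its reliability features are covered by the talk itself, per the list above.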

Available Media

10:20–11:00 Monday

Break with Refreshments

Pre-Function Area

11:00–12:20 Monday

Tracks 1/2

Pembroke/Lansdowne Rooms

Building and Running SRE Teams

Kurt Andersen, LinkedIn

Kurt Andersen has been active in the anti-abuse community for over 15 years and is currently the senior IC for the Consumer Services SRE team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware and Mobile Anti-Abuse Working Group (M3AAWG.org). He has spoken at M3AAWG, Velocity, SREcon(US), and SANOG on various aspects of reliability, authentication and security.

General Stanley McChrystal led the Joint Special Operations Task Force in Iraq in the mid-to-late 2000s. While in command of the Task Force, he was responsible for transforming an organization dominated by Taylorist reductionism into an agile, responsive network that could dynamically adapt and win in the threat landscape around it. In his book Team of Teams: New Rules of Engagement for a Complex World, he outlines the key learnings that emerged from that process. The same issues and challenges face site reliability engineers and managers of SRE teams as we cope with the complexity of our own and partner ecosystems. In this talk, I will highlight the key points from Team of Teams and show how the solutions that helped make the Task Force successful can be applied to make SRE teams succeed too.

Available Media

Track 3

Ulster Suite

Data Center Networks: The Rip van Winkle Edition

Dinesh Dutt, Cumulus Networks

If Rip Van Winkle had gone to sleep around 2006 and woken up 10 years later, he'd find the world a strange brew of the new and the old. He'd be amazed that phones had grown a brain, dismayed that a most excellent rendition of the Dark Knight had wandered back to the wasteland as most Dark Knight capers do. People had warmed up to electric cars, but not to climate change. And, if Ol' Rip were a network operations guy at some of the large webscale companies, he might think he'd died and woken up in heaven. Networks were no longer slow as molasses to deploy, manage, and upgrade. He'd find some things had stayed the same (IPv4 still ruled the roost), and some others not so much. He would be puzzled by the terminology and the discussions as he wandered the hallways: SDN, open networking, OpenFlow, microservices, Ansible, Puppet, Kubernetes, and so on.

This tutorial is an attempt to bring folks up to speed on what's happened with networking in the past 10 years or so, especially in the data center, concluding with some thoughts on why exciting times lie ahead. The talk will be roughly divided into the following sections:

  1. Who Moved My Network? What's causing all this turmoil in networking?
  2. Solutions: Requirements, Terminology, Pros and Cons
  3. Changing Landscape: Network Topologies
  4. Changing Foundation: Network Protocols
  5. Changing Operations: Modern Operations
  6. Changing Residents: Modern applications and their implications on networks
  7. Reading Tea Leaves

The tutorial will include demos and hands-on work with some modern tools.

The audience is expected to be aware of basic networking (bridging, routing, broadcast, multicast, etc.).

The key takeaways from this talk will be:

  • An understanding of the forces behind the changes in data center networking
  • The morphology and physiology of modern DC networks
  • What these changes presage for the future

Some preliminary ideas for hands-on work:

  • Build a multi-host container network
  • Build and configure an n x m Clos topology with BGP
  • Design a Clos fabric for a given number of servers and box specifications (see the worked sketch below)

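To give a flavour of that last exercise (a back-of-the-envelope sketch with made-up numbers, not material taken from the tutorial), sizing a two-tier leaf-spine Clos fabric built from identical fixed-port switches is mostly arithmetic:

    def clos_capacity(ports_per_switch, oversubscription=1.0):
        """Rough sizing of a 2-tier leaf-spine fabric built from identical switches."""
        # Split each leaf's ports between server-facing ports and spine uplinks.
        uplinks = int(ports_per_switch / (1 + oversubscription))
        server_ports = ports_per_switch - uplinks
        num_spines = uplinks                   # one uplink per spine for a full mesh
        max_leaves = ports_per_switch          # each spine port connects one leaf
        return num_spines, max_leaves, max_leaves * server_ports

    print(clos_capacity(32))        # (16, 32, 512): 512 servers at 1:1
    print(clos_capacity(32, 3.0))   # (8, 32, 768): 768 servers at 3:1 oversubscription
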
Dinesh G. Dutt has been in the networking industry for the past 15 years, most of it at Cisco Systems. Before joining Cumulus, he was a Fellow at Cisco Systems. He has been involved in enterprise and data center networking technologies, including the design of many of the ASICs that powered Cisco's mega-switches such as Cat6K and the Nexus family of switches. He also has experience in storage networking from his days at Andiamo Systems and in the design of FCoE. He is a co-author of TRILL and VxLAN, and has filed for over 40 patents.

Available Media

Track 4

Munster Suite

Staring into the eBPF Abyss

Sasha Goldshtein, SELA Group

Sasha Goldshtein is the CTO of SELA Group, a Microsoft C# MVP, and a Pluralsight author. He leads the Performance and Debugging team at SELA Technology Center, and is the author of numerous training courses, open source projects, books, and online articles on diagnostic tools and performance optimization. Sasha consults on various topics, including production debugging, application and system troubleshooting, performance investigation, and distributed architecture.

eBPF (extended Berkeley Packet Filters) is a modern kernel technology that can be used to introduce dynamic tracing into a system that wasn't prepared or instrumented in any way. The tracing programs run in the kernel, are guaranteed to never crash or hang your system, and can probe every module and function -- from the kernel to user-space frameworks such as Node and Ruby.

In this workshop, you will experiment with Linux dynamic tracing first-hand. First, you will explore BCC, the BPF Compiler Collection, which is a set of tools and libraries for dynamic tracing. Many of your tracing needs will be answered by BCC, and you will experiment with memory leak analysis, generic function tracing, kernel tracepoints, static tracepoints in user-space programs, and the "baked" tools for file I/O, network, and CPU analysis. You'll be able to choose between working on a set of hands-on labs prepared by the instructors, or trying the tools out on your own test system.

Next, you will hack on some of the bleeding edge tools in the BCC toolkit, and build a couple of simple tools of your own. You'll be able to pick from a curated list of GitHub issues for the BCC project, a set of hands-on labs with known "school solutions", and an open-ended list of problems that need tools for effective analysis. At the end of this workshop, you will be equipped with a toolbox for diagnosing issues in the field, as well as a framework for building your own tools when the generic ones do not suffice.

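As a small taste of what BCC-based tracing looks like (a minimal sketch; the workshop's own labs and tools go much further), the Python bindings let you attach a tiny BPF program to a kernel function and read its results from a map:

    # Count clone() syscalls per process via a kprobe (requires bcc and root).
    from bcc import BPF
    import time

    program = r"""
    BPF_HASH(calls, u32, u64);                    // map: PID -> call count

    int trace_clone(struct pt_regs *ctx) {
        u32 pid = bpf_get_current_pid_tgid() >> 32;
        calls.increment(pid);
        return 0;
    }
    """

    b = BPF(text=program)
    b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="trace_clone")

    print("Tracing clone() for 10 seconds...")
    time.sleep(10)
    for pid, count in b["calls"].items():
        print("pid %d: %d clone() calls" % (pid.value, count.value))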

Available Media

12:20–13:40 Monday

Conference Luncheon

Sussex Restaurant

13:40–15:00 Monday

Track 1: Capacity Planning

Lansdowne Room

Flash Sale Engineering

Emil Stolarsky, Shopify

Emil is a production engineer at Shopify where he works on performance, the production pipeline, and DNS tooling. When he's not trying to make Shopify's global performance heat map green, he's shivering over a spiked cup of coffee in the great Canadian north.

From stores with ads in the Super Bowl to selling Kanye’s latest album, Shopify has built a name for itself handling some of the world’s largest flash sales. These high profile events generate write-heavy traffic that can be four times our platform’s baseline throughput and don’t lend themselves to off-the-shelf solutions.

This talk is the story of how we engineered our platform to survive large bursts of traffic. Since it’s not financially sound for Shopify to have the required capacity always running, we built queueing and page caching layers into our Nginx load balancers with Lua. To guarantee these solutions worked, we tested them with a purpose-built load testing service.

Although flash sales are unique to commerce platforms, the lessons we learn from them are applicable to any services that experience bursts of traffic.

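As a language-agnostic illustration of the queueing idea (a hypothetical sketch, not Shopify's actual Nginx/Lua implementation), an admission gate lets a bounded number of checkouts through to the backend and parks the overflow:

    import collections

    class CheckoutGate:
        """Admission control: pass a bounded number of requests, queue or shed the rest."""
        def __init__(self, max_active, max_queued):
            self.max_active = max_active
            self.max_queued = max_queued
            self.active = set()
            self.queue = collections.deque()

        def admit(self, request_id):
            """Return 'pass', 'queued', or 'shed' for an incoming checkout request."""
            if len(self.active) < self.max_active:
                self.active.add(request_id)
                return "pass"                    # goes straight to the application
            if len(self.queue) < self.max_queued:
                self.queue.append(request_id)
                return "queued"                  # served a lightweight waiting page
            return "shed"                        # over capacity: turn the request away

        def release(self, request_id):
            """Called when a checkout finishes; promote the next queued request."""
            self.active.discard(request_id)
            if self.queue and len(self.active) < self.max_active:
                self.active.add(self.queue.popleft())

    gate = CheckoutGate(max_active=2, max_queued=2)
    print([gate.admit(r) for r in "abcde"])   # ['pass', 'pass', 'queued', 'queued', 'shed']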

Available Media

Managing Up and Sideways as an SRE

Liz Fong-Jones, Google

Liz is a Senior Site Reliability Engineer at Google and manages a team of SREs responsible for Google's storage systems. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.

Ever have a bad manager? Or have a project go off the rails but feel powerless to stop the trainwreck? I'll talk about why knowing a little bit about management can help you as an individual contributor or tech lead, and talk about a few ways that you can help yourself and your SRE team without ever formally managing yourself.

Available Media

Track 2: Brownfield

Pembroke Room

The Production Engineering Lifecycle: How We Build, Run, and Disband Great Reliability-focused Teams

Andrew Ryan, Facebook

Andrew has been a Production Engineer at Facebook since 2009, as a tech lead for a number of teams, supporting systems such as Hadoop and Memcache. Before that, he spent eight years managing SaaS deployments at CollabNet, and various sysadmin positions before that.

Engineers focused on reliability and scalability under real-world conditions are a scarce resource in any organization. How do we know where to deploy them, and how do we use them in the best possible way? In Facebook's Production Engineering team, we have this problem all the time, and we've dealt with it a variety of ways throughout the years. Some of these ways have worked better than others, and we'd like to share what works and what hasn't.

In this talk, we will share our approaches to when to start a production engineering team, how to integrate that team into the existing development team, how to prioritize and divide work between engineers, and even when to disband or merge the team. We will also discuss practical matters such as how we divide on call responsibilities and roadmap items, and how we integrate engineers in multiple locations and time zones. 

Available Media

How to Improve Your Service by Roasting It

Jake Welch, Microsoft

Jake Welch is a Site Reliability Engineer on the Microsoft Azure team in NYC. He has worked on large scale services at Microsoft for eight years, primarily in Azure infrastructure and Storage in software engineering/operational/managerial roles and on the major disaster on-call team. In 2014, he started the first SRE pilot within Azure. Prior to Microsoft, Jake worked as a developer building websites and automating backend business workflows across OSX and Windows.

In many companies, including Microsoft, SRE is not yet an integrated part of the operational landscape. Instead it is being actively adapted into mature companies. Our team has been working to develop new and interesting ways to introduce SRE and its tenets to an organization with many different operational approaches ranging from IT Ops to DevOps.

The process of introducing SRE has proven to be quite complex and socially delicate: you can't go into a team and simply tell them they are doing things wrong. You need to find the right way to show developers all the warts on their baby and motivate them to work with you on addressing them. Furthermore, you have to deal with their earnest desire to treat you as "just another ops team" that is only there to take the pager from them.

One of the tools we've used to enable the right conversations is to hold what we call a Service Roast. Named after the famous Friars Club roasts, the goal is to establish a safe environment to dig into and expose those warts, wrinkles, design flaws, shortcomings, and problems everyone knows a service has but doesn't want to talk about. We can't help you if you won't tell us where it hurts.

In running Service Roasts, we've developed a process, ground rules, a new role of impartial moderator, and some useful structure for hosting this kind of meeting. Thus far we've gained great insight into some of our services and, more importantly, sparked some very interesting (and lively) conversations.

To be sure, this is a high-risk activity, and shouldn't be done without careful consideration of the teams participating, but we'll present what we've learned about holding these roasts, guidance teams need for successful participation, and (importantly) why we don't use this approach everywhere.

Available Media

Track 3

Ulster Suite

(Continued from previous session)

Data Center Networks: The Rip van Winkle Edition

Dinesh Dutt, Cumulus Networks

Available Media

Track 4

Munster Suite

(Continued from previous session)

Staring into the eBPF Abyss

Sasha Goldshtein, SELA Group

Available Media

15:00–15:40 Monday

Break with Refreshments

Pre-Function Area

15:40–17:00 Monday

Track 1: Capacity Planning

Lansdowne Room

Capacity Planning at Scale

Ramón Medrano Llamas, Google

Have you ever bought machines? What if you need to build entire datacenters? How can you predict how many you are going to need two years from now? How can you make efficient use of all the resources you suddenly have? What if you are missing some resources? Can we automate all of this and integrate it with our continuous delivery?

These are just a few of the questions anyone planning a large computer fleet asks. This talk will cover some of the approaches and tooling that can be used to effectively plan for the demand of services and how to meet that demand in the most efficient manner.

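The simplest version of this kind of planning is just compounding growth plus headroom (a hypothetical back-of-the-envelope sketch, not the tooling discussed in the talk):

    import math

    def machines_needed(current_peak_qps, monthly_growth, months_ahead,
                        qps_per_machine, utilization_target=0.6):
        """Project peak demand and convert it into a machine count with headroom."""
        projected_qps = current_peak_qps * (1 + monthly_growth) ** months_ahead
        # Stay below the utilization target to absorb failures and traffic spikes.
        usable_qps_per_machine = qps_per_machine * utilization_target
        return projected_qps, math.ceil(projected_qps / usable_qps_per_machine)

    qps, machines = machines_needed(current_peak_qps=50000, monthly_growth=0.08,
                                    months_ahead=24, qps_per_machine=400)
    print(int(qps), machines)   # ~317000 QPS two years out -> 1322 machines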

Ramón has been an SRE Technical Lead on Identity and Authentication services at Google since 2013.

Available Media

Load Shedding—Approaches, Principles, Experiences, and Impact in Service Management

Acacio Cruz, Google

Google SRE Director for 9 years; Director in Frameworks & Production Platforms.

Overview of principles and load-shedding mechanisms in large-scale services and how they impact service management.

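One of the simplest such mechanisms (a generic illustrative sketch, not necessarily how Google implements it) is to shed a growing fraction of low-priority requests as utilization climbs:

    import random

    def should_shed(utilization, priority, soft_limit=0.7, hard_limit=0.95):
        """Return True if this request should be rejected up front."""
        if priority == "critical" or utilization < soft_limit:
            return False                      # never shed critical traffic; serve everything below the soft limit
        if utilization >= hard_limit:
            return True                       # past the hard limit, shed all non-critical traffic
        # Between the limits, shed probabilistically so load falls off gradually.
        shed_fraction = (utilization - soft_limit) / (hard_limit - soft_limit)
        return random.random() < shed_fraction

    print(should_shed(0.5, "batch"))    # False: plenty of headroom
    print(should_shed(0.97, "batch"))   # True: over the hard limit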

Available Media

Track 2: Brownfield

Pembroke Room

Tier1 Metamorphoses

Nina Mushiana, LinkedIn

Available Media

Panel: Brownfield SRE

Moderator: Caskey Dickson, Microsoft

Track 3

Ulster Suite

(Continued from previous session)

Data Center Networks: The Rip van Winkle Edition

Dinesh Dutt, Cumulus Networks

Available Media

Track 4

Munster Suite

(Continued from previous session)

Staring into the eBPF Abyss

Sasha Goldshtein, SELA Group

Available Media

17:30–19:00 Monday

Reception, Sponsored by Google

Herbert Room

 

Tuesday, 12 July 2016

08:00–09:00 Tuesday

Morning Coffee and Tea

Pre-Function Area

09:00–10:20 Tuesday

Track 1: Incident Management

Lansdowne Room

Production Improvement Review: Taking a Bite Out of Repair Debt

Martin Check, Microsoft

Martin Check is a Site Reliability Engineer on the Microsoft Azure team. He has worked on large scale services at Microsoft for 12 years in a variety of roles ranging from service design and implementation, to crisis response, to leading teams through devops/SRE transitions. Currently he is working on Problem Management efforts for Azure to identify and resolve problems that stand in the way of service uptime through data analysis, surfacing insights, and engineering solutions. 

Azure SRE works with services of widely varying maturity, ranging from fully federated DevOps teams to fully tiered IT/ops teams and everything in between. The one thing all of these services have in common is that they have outages. While they all recover and respond in different ways, SRE has to collect and leverage data in a common manner across all services to prevent outages and drive reliability up consistently. In this talk we’ll discuss how SRE leverages diverse data sets to drive improvements across this heterogeneous set of services. SRE ensures that teams are rigorously completing post-incident reviews and addressing their live-site debt. We not only look at the actual repair debt, but we’ve introduced a new concept called “virtual debt,” which shows where a service incident response faltered but no appropriate repair was logged. Virtual debt is affectionately referred to as “Pac-Man debt” due to the appearance of the chart. The greater the virtual debt, the bigger the bite.

We’ll also discuss how we expose the data in near-real-time dashboards that allow team members, from the director all the way down to the IC, to see relevant views and take the appropriate action. ICs can find incomplete postmortems they need to work on, a service director can view accumulated debt to prioritize resources, or a dev manager can review virtual debt to ensure the team is conducting rigorous postmortems. By analyzing historical outages, we’ve found that missed detection leads to an exponential increase in mitigation times. We’ve collected a myriad of other insights by mining historical outage data and using charts and creative visualizations, including some surprising proxy metrics we’ve discovered that influence uptime.

Available Media

Track 2: Monitoring and Alerting

Pembroke Room

The Many Ways Your Monitoring Is Lying to You

Sebastian Kirsch, Google

Sebastian Kirsch is a Site Reliability Engineer for Google in Zürich, Switzerland. He manages the team that runs Google Calendar. Sebastian joined Google in 2006 in Dublin, Ireland, and has worked both on internal systems like Google's web crawler, as well as on external products like Google Maps. He specializes in the reliability aspects of new Google products and new features of existing products, ensuring that they meet the same high reliability bar as every other Google service.

Monitoring and dashboarding systems are crucial to understanding the behavior of large distributed systems. But monitoring systems can lead you on wild goose chases, or hide issues. In this talk, I will look at some examples of how a monitoring system can lie to you – in order to sensitize the audience to these failure modes and encourage them to look for similar examples in their own systems.

Available Media

Next-generation Alerting and Fault Detection

Dieter Plaetinck, raintank

There is a common belief that in order to solve more advanced alerting cases and get more complete coverage, we need complex, often math-heavy solutions based on machine learning or stream processing. This talk sets context, lays out the pros and cons of such approaches, and provides anecdotal examples from the industry, nuancing the applicability of these methods.

We then explore how we can get dramatically better alerting, and make our lives a lot easier, by optimizing workflow and machine-human interaction through an alerting IDE (exemplified by Bosun), basic logic, basic math, and metric metadata, even for complicated alerting problems such as detecting faults in seasonal time series data.

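To make the "basic logic and basic math" point concrete (a generic illustrative sketch, not Bosun's actual expression language), a seasonal check compares the current value with the same time of week in recent weeks and alerts when it falls outside a band derived from those baselines:

    # Illustrative seasonal fault detection: compare "now" against the same
    # time-of-week in previous weeks instead of a fixed threshold.
    from statistics import mean, stdev

    def seasonal_alert(current, weekly_baselines, n_sigma=3.0, min_band=5.0):
        """weekly_baselines: the metric's value at this same time in previous weeks."""
        mu = mean(weekly_baselines)
        band = max(n_sigma * stdev(weekly_baselines), min_band)  # avoid zero-width bands
        return abs(current - mu) > band

    # Requests per second seen at this hour over the last four Mondays:
    history = [1210, 1180, 1250, 1195]
    print(seasonal_alert(1230, history))   # False: within the normal Monday band
    print(seasonal_alert(400, history))    # True: far below any recent Monday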

Dieter is an industrial engineer who started out as a systems engineer for the European social networking site Netlog, did some information retrieval/machine learning research at Ghent University, then joined Vimeo for backend/systems engineering work but ended up doing mostly open source monitoring, and now works at raintank, the open source monitoring company behind Grafana.

Available Media

Track 3

Ulster Suite

Statistics for Engineers

Heinrich Hartmann, Circonus

Gathering telemetry data is key to operating reliable distributed systems at scale. Once you have set up your monitoring systems and recorded all relevant data, the challenge becomes to make sense of it and extract valuable information, like:

  • Is the system down?
  • Is user experience degraded for some percentage of our customers?
  • How did our query response times change with the last update?

Statistics is the art of extracting information from data. In this tutorial, we address the basic statistical knowledge that helps you at your daily work as an SRE. We will cover probabilistic models, summarizing distributions with mean values, quantiles, and histograms and their relations.

The tutorial focuses on practical aspects and will give you hands-on knowledge of how to handle, import, analyze, and visualize telemetry data with UNIX tools and the IPython toolkit.

This tutorial has been given on several occasions over the last year and has been refined and extended since; cf. Twitter #StatsForEngineers.

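For a small taste of the material (an illustrative sketch in plain Python rather than the UNIX tools and IPython toolkit used in the tutorial), here is how mean values, quantiles, and a crude histogram summarize a set of request latencies:

    # Illustrative summary statistics for request latencies (milliseconds).
    import statistics
    from collections import Counter

    latencies = [12, 15, 14, 13, 250, 16, 14, 15, 13, 12, 14, 300]

    def quantile(data, q):
        """Simple nearest-rank quantile of a sample."""
        s = sorted(data)
        return s[min(len(s) - 1, int(q * len(s)))]

    print("mean   :", round(statistics.mean(latencies), 1))   # ~57.3 -- dragged up by outliers
    print("median :", quantile(latencies, 0.50))              # 14 -- the typical request
    print("p95    :", quantile(latencies, 0.95))              # 300 -- tail latency
    # Crude histogram: bucket latencies into fixed 50 ms bins.
    buckets = Counter(50 * (v // 50) for v in latencies)
    for low in sorted(buckets):
        print("%4d-%4d ms | %s" % (low, low + 49, "#" * buckets[low]))
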
Track 4

Munster Suite

Accident Models in Post Mortems

Will Gallego, Nathan Hoffman, and Miriam Lautner, Etsy

Will is an engineer at Etsy who believes in a free and open web, blamelessness and empathy in the workplace, and pronouncing gif with a soft g.

Nathan Hoffman has worked at Etsy for four years as a Software Engineer. After graduating from the Etsy School on Post Mortem Facilitation, Nathan has led many successful and some not-so-successful post mortems, and is happy to talk about all of them. Before Etsy, he spent some time at the Recurse Center.

Miriam is an Engineer at Etsy, graduate of the 2014 Recurse Center class, and an avid climber in her spare time.

Many organizations want to learn from failures. Postmortem debriefings and documents are a part of that learning process. In this two-part session, we will cover the theory and fundamentals on complex systems failure and “human error”, as well as techniques for facilitating an adverse event debriefing. Attendees should walk away with a more evolved sense of accident/outage investigation and a model to explore in their own organizations.

Available Media

10:20–11:00 Tuesday

Break with Refreshments

Pre-Function Area

11:00–12:20 Tuesday

Track 1: Incident Management

Lansdowne Room

The Next Linux Superpower: eBPF Primer

Sasha Goldshtein, SELA Group

Sasha Goldshtein is the CTO of SELA Group, a Microsoft C# MVP, and a Pluralsight author. He leads the Performance and Debugging team at SELA Technology Center, and is the author of numerous training courses, open source projects, books, and online articles on diagnostic tools and performance optimization. Sasha consults on various topics, including production debugging, application and system troubleshooting, performance investigation, and distributed architecture.

Imagine you're tackling one of these evasive performance issues in the field, and your go-to monitoring checklist doesn't seem to cut it. There are plenty of suspects, but they are moving around rapidly and you need more logs, more data, more in-depth information to make a diagnosis. Maybe you've heard about DTrace, or even used it, and are yearning for a similar toolkit, which can plug dynamic tracing into a system that wasn't prepared or instrumented in any way.

Hopefully, you won't have to yearn for a lot longer. eBPF (extended Berkeley Packet Filters) is a kernel technology that enables a plethora of diagnostic scenarios by introducing dynamic, safe, low-overhead, efficient programs that run in the context of your live kernel. Sure, BPF programs can attach to sockets; but more interestingly, they can attach to kprobes and uprobes, static kernel tracepoints, and even user-mode static probes. And modern BPF programs have access to a wide set of instructions and data structures, which means you can collect valuable information and analyze it on-the-fly, without spilling it to huge files and reading them from user space.

In this talk, we will introduce BCC, the BPF Compiler Collection, which is an open set of tools and libraries for dynamic tracing on Linux. Some tools are easy and ready to use, such as execsnoop, fileslower, and memleak. Other tools such as trace and argdist require more sophistication and can be used as a Swiss Army knife for a variety of scenarios. We will spend most of the time demonstrating the power of modern dynamic tracing -- from memory leaks to static probes in Ruby, Node, and Java programs, from slow file I/O to monitoring network traffic. Finally, we will discuss building our own tools using the Python and Lua bindings to BCC, and its LLVM backend.

Available Media

The Virtuous Cycle: Getting Good Things out of Bad Failures

Joy Scharmen, Heroku

Joy grew up in the wilds of Detroit with Robocop as her only friend. She thought she wanted to be an artist in college but then discovered the computer lab and it was all downhill from there. Currently she is Director of Service Reliability Engineering at Heroku in San Francisco. She loves swearing, whiskey, and saying things like “process is programming for humans!”.

System failures happen. Hardware dies, software crashes, capacity gets exceeded, and any of these things can cause unexpected effects in the most carefully-architected systems.

At Heroku, we deal with complex systems failures. We’re running a platform as a service: our whole business model requires us to provide operations for our customers so they don’t have to do it themselves. We run over a million Postgres databases, tens of thousands of Redis instances, and hundreds of thousands of dynos on thousands of AWS instances.

What do we get out of these incidents? Pain and suffering? Yes, sometimes. We also get data about how our systems are actually working. We get ideas for making it work better. And sometimes we get ideas for whole new products.

In this talk, I’ll discuss how to take the bad of a system failure and turn it into good: better products, more reliable platforms, and less stressed engineers.

Available Media

Track 2: Monitoring and Alerting

Pembroke Room

Alerting for Distributed Systems—A Tale of Symptoms and Causes, Signals and Noise

Björn Rabenstein, SoundCloud

Björn is a Production Engineer at SoundCloud and one of the main Prometheus developers. Previously, he was a Site Reliability Engineer at Google and a number cruncher for science.

Noisy alerts are the deadly sin of monitoring. They obfuscate real issues and cause pager fatigue. Instead of reacting with the due sense of urgency, the person on-call will start to skim or even ignore alerts, not to mention the destruction of their sanity and work-life balance. Unfortunately, there are many monitoring pitfalls on the road to complex production systems, and most of them result in noisier alerts. In distributed systems, and in particular in a microservice architecture, there is usually a good understanding of local failure modes, while the behavior of the system as a whole is difficult to reason about. Thus, it is tempting to alert on the many possible causes – after all, finding the root cause of a problem is important. However, a distributed system is designed to tolerate local failures, and a human should only be paged on real or imminent problems of a service, ideally aggregated to one meaningful alert per problem. The definition of a problem should be clear and explicit rather than relying on some kind of automatic "anomaly detection." Taking historical trends into account is needed, though, to detect imminent problems. Those predictions should be simple rather than "magic." Alerting because "something seems weird" is almost never the right thing to do.

SoundCloud's long way from noisy pagers to much saner on-call rotations will serve as a case study, demonstrating how different monitoring technologies, among them most notably Prometheus, have affected alerting.

Available Media

Track 3

Ulster Suite

(Continued from previous session)

Statistics for Engineers

Heinrich Hartmann, Circonus

Gathering telemetry data is key to operating reliable distributed systems at scale. Once you have set up your monitoring systems and recorded all relevant data, the challenge becomes to make sense of it and extract valuable information, like:

  • Is the system down?
  • Is user experience degraded for some percentage of our customers?
  • How did our query response times change with the last update?

Statistics is the art of extracting information from data. In this tutorial, we address the basic statistical knowledge that helps you at your daily work as an SRE. We will cover probabilistic models, summarizing distributions with mean values, quantiles, and histograms and their relations.

The tutorial focuses on practical aspects, and will give you hands-on knowledge of how to handle, import, analyze, and visualize telemetry data with UNIX tools and the IPython toolkit.
This tutorial has been given on several occasions over the last year and has been refined and extended since; see Twitter #StatsForEngineers.
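
As a taste of the hands-on part, a small sketch (assuming NumPy and a file of per-request latencies in milliseconds, one value per line; the filename is made up) of summarising a latency distribution with a mean, quantiles, and a histogram:

    import numpy as np

    # One latency sample (in milliseconds) per line.
    latencies = np.loadtxt("request_latencies_ms.txt")

    print("mean      :", latencies.mean())
    print("median    :", np.percentile(latencies, 50))
    print("p95 / p99 :", np.percentile(latencies, [95, 99]))

    # A coarse histogram: how many samples fall into each latency bucket.
    counts, edges = np.histogram(latencies, bins=[0, 10, 25, 50, 100, 250, 500, 1000])
    for lo, hi, n in zip(edges[:-1], edges[1:], counts):
        print(f"{lo:>5.0f}-{hi:<5.0f} ms: {n}")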

Track 4

Munster Suite

Post Mortem Facilitation

Will Gallego, Nathan Hoffman, and Miriam Lautner, Etsy

Will is an engineer at Etsy who believes in a free and open web, blamelessness and empathy in the workplace, and pronouncing gif with a soft g.

Nathan Hoffman has worked at Etsy for four years as a Software Engineer. After graduating from the Etsy School on Post Mortem Facilitation, Nathan has led many successful and some not-so-successful post mortems, and is happy to talk about all of them. Before Etsy, he spent some time at the Recurse Center.

Miriam is an Engineer at Etsy, graduate of the 2014 Recurse Center class, and an avid climber in her spare time.

Many organizations want to learn from failures. Postmortem debriefings and documents are a part of that learning process. In this two-part session, we will cover the theory and fundamentals of complex systems failure and “human error”, as well as techniques for facilitating an adverse event debriefing. Attendees should walk away with a more evolved sense of accident/outage investigation and a model to explore in their own organizations.

Available Media
12:20–13:40 Tuesday

Conference Luncheon, Sponsored by Amazon Data Services Ireland

Sussex Restaurant

13:40–15:00 Tuesday

Track 1: Incident Management

Lansdowne Room

Challenges of Machine Learning at Scale

Graham Poulter, Google

Graham is an SRE at Google working on machine learning pipelines for ad click prediction.  

Motivated by the problem of predicting whether any given ad would be clicked in response to a query, in this introductory talk we outline the requirements and large-system design challenges that arise when designing a machine learning system that makes millions of predictions per second with low latency, learns quickly from the responses to those predictions, and maintains a consistent level of model quality over time. We present alternatives for meeting those challenges using diagrams of machine learning pipelines.

Concepts used in this talk: machine learning (classification), software pipelines, sharding and replication, map-reduce
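
As a purely hypothetical illustration of one of those concepts (sharding a prediction service and retrying across replicas; the shard counts and request key are invented):

    import hashlib

    SHARDS = 8     # the model is partitioned across 8 shards
    REPLICAS = 3   # each shard is served by 3 replicas for availability

    def shard_for(key: str) -> int:
        """Hash the request key so the same key always maps to the same shard."""
        return int(hashlib.sha1(key.encode()).hexdigest(), 16) % SHARDS

    def replica_for(shard: int, attempt: int) -> int:
        """Rotate over a shard's replicas, e.g. when retrying a slow replica."""
        return attempt % REPLICAS

    key = "query=bike+lights|ad=1234"
    print("shard", shard_for(key), "replica", replica_for(shard_for(key), attempt=0))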

Available Media

Panel: Oncall

Moderator: Laura Nolan, Google

Available Media

Track 2: Monitoring and Alerting

Pembroke Room

Lightning Talks

API Management—Why Speed Matters
Arianna Aondio, Varnish Software

Reverse Engineering the “Human API” for Automation and Profit
Nati Cohen, SimilarWeb

What a 17th Century Samurai Taught Me about Being an SRE
Caskey L. Dickson, Microsoft

Chatops/Automation: How to get there while everything's on fire
Fran Garcia, Hosted Graphite

Sysdig Love
Alejandro Brito Monedero, Alea Solutions

Automations with Saltstack
Effie Mouzeli, Logicea, LLC

Myths of Network Automation
David Rothera, Facebook

DNS @ Shopify
Emil Stolarsky, Shopify

Hashing Infrastructures
Jimmy Tang, Rapid7

Available Media

Track 3

Ulster Suite

DivOps, Continuous Diversity at Scale

Laine Campbell, Author, Database Reliability Engineering, O'Reilly Media

This tutorial/workshop is aimed at management and individual contributors alike. We will work together on how to encourage and nurture a diversity culture in day-to-day ops teams. First we will discuss the concepts of 2- and 3-dimensional diversity, and the statistics around diverse teams’ performance. Then we will map out how to design, build, deploy, and operate a diversity plan in our teams. This will include diversity goal setting and explicit cultural evolution, hiring processes, day-to-day communications, review process, and team collaboration. Where possible we will encourage groups to break out and evaluate their own cultures and processes.

Track 4

Munster Suite

Effective Design Review Participation

Kurt Andersen, LinkedIn

Kurt Andersen has been active in the anti-abuse community for over 15 years and is currently the senior IC for the Consumer Services SRE team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware and Mobile Anti-Abuse Working Group (M3AAWG.org). He has spoken at M3AAWG, Velocity, SREcon(US), and SANOG on various aspects of reliability, authentication and security.

This workshop is a part of the "full lifecycle" workshop track which includes Post-Mortems, Incident Response, and Effective Design Review Participation. Using several example cases, participants in this session will learn to apply a variety of different points of view to analyze a design for issues which could affect its reliability and operability.

15:00–15:40 Tuesday

Break with Refreshments

Pre-Function Area

15:40–17:00 Tuesday

Track 1: Incident Management

Lansdowne Room

Moving a Large Workload from a Public Cloud to an OpenStack Private Cloud: Is It Really Worth It?

Nicolas Brousse, TubeMogul

Nicolas Brousse is Senior Director of Operations Engineering at TubeMogul (NASDAQ: TUBE). The company's sixth employee and first operations hire, Nicolas has grown TubeMogul's infrastructure over the past seven years from several machines to over two thousand servers that handle billions of requests per day for clients like Allstate, Chrysler, Heineken and Hotels.com.

Adept at adapting quickly to ongoing business needs and constraints, Nicolas leads a global team of site reliability engineers and database architects that monitor TubeMogul's infrastructure 24/7 and adhere to "DevOps" methodology. Nicolas is a frequent speaker at top U.S. technology conferences and regularly gives advice to other operations engineers. Prior to relocating to the U.S. to join TubeMogul, Nicolas worked in technology for over 15 years, managing heavy traffic and large user databases for companies like MultiMania, Lycos and Kewego. Nicolas lives in Richmond, CA, and is an avid fisherman and aspiring cowboy.

Available Media

My Scariest Day: When Things Go All Wrong

Lightning Talks session

Moderators: John Looney, Google, and Gareth Eason, Facebook

Track 2: Monitoring and Alerting

Pembroke Room

My Service Runs at 99.999%...All Those Tweets about Outages Are Not Real: It's Our Competition Trying to Malign Us!

Kumar Srinivasamurthy, Microsoft

Kumar works at Microsoft and has been in the online services world for several years. He currently runs the Bing Live site/SRE team. For the last several years, he has focused on growing the culture around live site quality, incident response and management, service hardening, availability, performance, capacity, SLA metrics, DRI/SRE development and educating teams on how to build services that run at scale. 

Do you have services where the owners claim they run at five 9's but you often run into errors? It's very easy and convenient to build metrics at the service level. These often hide a wide array of issues that users might face. Having the right metrics is a key component of building a sustainable SRE culture. This talk goes into the design of these metrics, with real-world examples to illustrate good and bad designs.

Available Media

Availability Objectives of SoundCloud’s Microservices

Bora Tunca, SoundCloud

Bora is a software developer at SoundCloud. He started his journey there three years ago. As a generalist, he has worked on various parts of their architecture. Nowadays he is part of the Core Engineering, where he helps to build and integrate the core business services of SoundCloud. When he's not juggling various languages, he's playing basketball—as long as someone on the team covers his on-call shifts...

In a microservices architecture, different services usually have different availabilities. It is often hard to see how the availability of a single service affects the availability of the overall system. Without a clear idea about the availability requirements of individual services, even a seemingly subtle degradation of a service can cause a critical outage. Unfortunately these are discovered only after thorough post-mortems. At SoundCloud we kicked off a project called “Availability Objectives”. An availability objective is the minimum availability a service is allowed to have. These objectives are calculated based on the requirements of the clients of those services. We started by visiting all of our services and setting an availability objective for each of them. We built tools to expose the availability of these services and to flag the ones that drop below their objectives. As a result, we can now make informed decisions about the integration points we need to improve first. This talk will share the insights we gained via this project and how it affected our overall availability and engineering productivity.
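
As a hedged illustration of the kind of arithmetic such objectives involve (not SoundCloud's actual model; the dependency names and numbers are invented): when a request calls several services in series, their availabilities roughly multiply, which bounds how low any single dependency's objective can be.

    # Rough availability arithmetic for a request that calls dependencies in series.
    deps = {"auth": 0.9999, "metadata": 0.99995, "storage": 0.9999}

    chain = 1.0
    for name, availability in deps.items():
        chain *= availability
    print(f"availability through the chain is roughly {chain:.5f}")

    # If the front end's own objective is 99.9%, how available would one more
    # serial dependency have to be?
    frontend_objective = 0.999
    print(f"a new serial dependency would need at least {frontend_objective / chain:.5f}")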

Available Media

Downtime Budgets

Cory Lueninghoener, Los Alamos National Laboratory

Cory Lueninghoener leads the HPC Design Group at Los Alamos National Laboratory. He has helped design, build, and manage some of the largest scientific computing resources in the world, including systems ranging in size from 100,000 to 900,000 processors. He is especially interested in turning large-scale system research into practice, and has worked on configuration management and system management tools in the past. Cory was co-chair of LISA 2015 and is active in the large scale system engineering community.

The concept of the error budget is a great way to hack SLAs and make them into a positive tool for system engineers. But how can you take the same idea from a world that handles millions of transactions in a day to one that handles hundreds? High Performance Computing jobs run for hours, days, or weeks at a time, resulting in unique challenges related to system availability, maintenance, and experimentation. This talk will explore a way to modify the error budget concept to fit in an HPC environment by applying the same idea to cluster outages, both planned and unplanned, and to ultimately give customers the best computing environment possible.
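
As a sketch of the underlying arithmetic (the availability target and outage figures are invented): an availability target implies a yearly downtime budget, and each planned or unplanned cluster outage draws it down.

    HOURS_PER_YEAR = 365 * 24

    # Invented figures: a 99.5% availability target for an HPC cluster.
    availability_target = 0.995
    budget_hours = (1 - availability_target) * HOURS_PER_YEAR  # about 43.8 hours/year

    outages_hours = {
        "planned maintenance window": 12.0,
        "unplanned power event": 6.5,
        "scheduled filesystem upgrade": 8.0,
    }

    spent = sum(outages_hours.values())
    print(f"yearly budget : {budget_hours:.1f} h")
    print(f"spent so far  : {spent:.1f} h")
    print(f"remaining     : {budget_hours - spent:.1f} h")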

Available Media

Track 3

Ulster Suite

DivOps, Continuous Diversity at Scale

Laine Campbell, Author, Database Reliability Engineering, O'Reilly Media

This tutorial/workshop is aimed at management and individual contributors alike. We will work together on how to encourage and nurture a diversity culture in day-to-day ops teams. First we will discuss the concepts of 2- and 3-dimensional diversity, and the statistics around diverse teams’ performance. Then we will map out how to design, build, deploy, and operate a diversity plan in our teams. This will include diversity goal setting and explicit cultural evolution, hiring processes, day-to-day communications, review process, and team collaboration. Where possible we will encourage groups to break out and evaluate their own cultures and processes.

Track 4

Munster Suite

Practical Incident Response

Laura Nolan, Google

Laura has been a Site Reliability Engineer at Google since 2013, working in areas as diverse as data infrastructure and pipelines, alerting and more recently, networking. She is passionate about sharing knowledge and intrigued by the behaviour of complex systems, including the humans who run them. Prior to Google she worked as a performance engineer in e-commerce and as a software engineer in R&D for a large Irish software company.

This workshop is structured as a fast-moving but fun game (think fluxx crossed with a hectic oncall shift) but the subject matter is entirely serious: we will use it to explore best practices and pitfalls for managing incidents as a team. You will work as part of a team managing a production outage: we'll go through the entire process from detection of the incident, problem diagnosis, mitigation, and resolution, finishing with the first draft of the postmortem.

17:30–18:30 Tuesday

Happy Hour, Sponsored by Facebook

Herbert Room

 

Wednesday, 13 July 2016

08:00–09:00 Wednesday

Morning Coffee and Tea

Pre-Function Area

09:00–10:20 Wednesday

Track 1: Wildcard

Lansdowne Room

Relieving Technical Debt through Short Projects

Liz Fong-Jones, Google

Liz is a Senior Site Reliability Engineer at Google and manages a team of SREs responsible for Google's storage systems. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.

It's easy to plan out month-long or year-long projects, or to have an interrupts rotation for dealing with oncall/tickets, but how do you make sure you're also doing the short week-long projects that can relieve your technical debt? I'll cover a planning approach that my team found that makes room for all three sets of work, reducing in the long term the operational burden of the services we operate.

Available Media

Running Storage at Facebook

Federico Piccinini, Facebook

Available Media

Track 2 - Network

Pembroke Room

Past, Present, and Future of Network Operations

David Barroso, Fastly

David is a Network Systems Engineer at Fastly where he spends his time dealing with the network with code and thinking in ways of integrating it with the application.

Historically the network has lacked the skills, the tools, and even the means to fully embrace automation or build abstractions for the rest of the organization to consume. However, the tide is changing, and most modern equipment nowadays provides standard Linux tools or open APIs to interact with it.

In this talk, we will explore how to build network abstractions and build on the experience gathered by the devops community over the years to expose the network to the organization, increase agility, and provide situational awareness.

Available Media

Bridging Multicast to the Cloud

Daniel Emord, Pythian

Dan Emord has been designing and deploying multi-platform solutions for eight years and is currently a Lead Site Reliability Consultant at Pythian. Dan runs the gamut of client requests with Pythian’s open scope engagements, including architecting and implementing a wide variety of technologies to solve problems such as network design, multi-platform system automation, and customizing open source tools.

As more organizations move their workloads to cloud providers, they may discover small gotchas that prevent them from easily running their existing applications that are currently in traditional on-premises environments. One such example is multicast: none of the big players support multicast traffic between nodes in their cloud offerings. The solution? Overlay a mesh virtual network that supports multicast between nodes and can be extended to include on-premises systems. In this talk, we’ll go over how to bring multicast into your cloud environment by implementing n2n in a resilient fashion to link on-premises and cloud environments through a gateway.

Available Media

Full-Mesh IPsec Network: 10 Dos and 500 Don'ts

Fran Garcia, Hosted Graphite

Currently the SRE team lead at Hosted Graphite, Fran has previously been mostly responsible for causing (and occasionally preventing) outages in varied fields such as advertising, online gaming and sports betting. Do not ask him about chatops.

How do you secure your internal network when your servers are located in different continents/providers and you don't trust or even manage your network?

IPSec is a great way to secure a network but it's usually deployed as a way of connecting a small group of trusted networks, and both the tools and existing documentation reflect this. This is not really an option in some environments where you don't really control the network and want to interoperate across different providers, so you find yourself sailing through uncharted waters at times when trying to build a fully meshed network with IPSec, where each server can establish a secure connection to any other server in its cluster.

In this talk we'll explore our journey from idea to full deployment in production, while focusing in all the mistakes we made along the way and all the deficiencies that we've found in terms of tooling and documentation. After the talk you should have a better understanding of how IPSec can be useful to you, and a bunch of things you should avoid when considering implementing it (because trust me, they don't work).

Available Media

Track 3

Ulster Suite

Docker from Scratch

Avishai Ish-Shalom, Fewbytes, and Nati Cohen, SimilarWeb

Avishai Ish-Shalom is a veteran Ops and a survivor of many production skirmishes. Avishai helps companies deal with web era operations and scale as an independent consultant. In his spare time Avishai is spreading weird ideas and conspiracy theories such as DevOps.

Docker is very popular these days, but how many of us are really familiar with the basic building blocks of Linux containers and their implications? What's missing in the good ol’ chroot jails? What are the available Copy-on-Write options and what are their pros and cons? Which syscalls allow us to manipulate Linux namespaces and what are their limitations? How do resource limits actually work? What different behaviours do containers and VMs have?

In this hands-on workshop, we will build a small Docker-like tool from O/S level primitives in order to learn how Docker and containers actually work. Starting from a regular process, we will gradually isolate and constrain it until we have a (nearly) full container solution, pausing after each step to learn how our new constraints behave.
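
To give a flavour of the primitives involved, here is a minimal sketch of entering a new UTS (hostname) namespace from Python via the unshare(2) syscall; it assumes Linux, root privileges, and glibc, and the constant comes from <sched.h>:

    import ctypes
    import os
    import socket

    CLONE_NEWUTS = 0x04000000  # from <sched.h>: create a new UTS (hostname) namespace

    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    if libc.unshare(CLONE_NEWUTS) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

    # Inside the new namespace the hostname can change without affecting the host.
    socket.sethostname("container-demo")
    print("hostname inside the namespace:", socket.gethostname())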

Track 4

Munster Suite

Distributed Log-Processing Design Workshop

Andrea Spadaccini, Google

Andrea Spadaccini works in Dublin as a Site Reliability Manager for Google, which he joined in 2012 as an SRE working on the systems that distill, store and serve all the metrics about Google's Ads platforms. Prior to that, he worked on Linux-based PBX products, hacked on open source CPU simulators and co-founded a non-profit for students to get work experience while pursuing their studies. He earned a PhD in Computer Engineering from the University of Catania, where he focused mostly on biometric recognition. spadaccio@google.com.

Participants will have the opportunity to try their hand at designing a reliable, distributed, multi-datacenter near-real-time log processing system.

The session will start with a short presentation on lessons learned about designing reliable distributed systems, and then participants will break out in small groups, assisted by Google facilitators, and try their hand at solving a real-world design challenge, from high-level architecture down to an estimate of the computing resources required to run the service.

The session will likely appeal to experienced engineers who want to have fun tackling a real-world design problem faced by many teams in Google.

10:20–11:00 Wednesday

Break with Refreshments

Pre-Function Area

11:00–12:20 Wednesday

Track 1 - Wildcard

Lansdowne Room

Extreme OS Kernel Testing

Kirk Russell, Shopify

Kirk is currently a Production Engineer at Shopify, making sure that our Docker image build system can keep up with 22 launches a day.

Fuzz testing has been used to evaluate the robustness of operating system distributions for over twenty years. Eventually, a fuzz test suite will suffer from reduced effectiveness. The first obstacle is the pesticide paradox: as you fix the easy defects, it gets difficult to find the remaining obscure defects. Also, the test execution time and the debug/fix cycle tends to be manual work that can take hours or even days of effort. During the presentation, a structured framework for creating new fuzz tests will be introduced, along with a competitive analysis approach used to minimize defect reproduction complexity.
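
As a generic illustration of the basic fuzzing loop (not the framework presented in the talk; the target function is a stand-in): mutate a seed input, feed it to the target, and keep any input that crashes it as a reproducer.

    import random

    def target(data: bytes) -> None:
        """Stand-in for the code under test; a real fuzzer would drive a syscall or parser."""
        if b"\xde\xad" in data:
            raise RuntimeError("simulated crash")

    def mutate(seed: bytes, n_flips: int = 4) -> bytes:
        buf = bytearray(seed)
        for _ in range(n_flips):
            buf[random.randrange(len(buf))] = random.randrange(256)
        return bytes(buf)

    seed = b"A" * 64
    crashes = []
    for _ in range(10_000):
        case = mutate(seed)
        try:
            target(case)
        except Exception:
            crashes.append(case)  # keep the reproducer for later minimisation

    print(f"found {len(crashes)} crashing inputs")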

Available Media

DNS: Old Solution for Modern Problems

Thomas Jackson and Rauf Guliyev, LinkedIn

As infrastructure becomes more complex, dynamic, and diverse service discovery becomes very important.

There are many solutions to this problem (thrift, rest.li, custom-zk, etc.) all of which require application changes which precludes the use of off-the-shelf software.

We have applications at LinkedIn where it isn't practical to integrate with our internal service discovery systems. After some thought we decided that all of these applications do support a common service discovery system: our old friend DNS.

In this presentation, we'll talk about how we implemented a distributed, highly available, eventually consistent service discovery system using DNS written in Go. We'll talk about the design, implementation, and challenges encountered on the way to production.
We'll focus on:

  • Architecture
  • Extensibility
  • Availability
  • Operability

The Results:

  • Significantly reduced complexity
  • Dramatic decrease in convergence time
  • Ubiquitous service discovery
  • Leverage existing DNS infrastructure
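
The LinkedIn system itself is written in Go, but as a generic, client-side illustration of DNS-based service discovery (assuming the third-party dnspython package and a made-up SRV record name):

    import dns.resolver  # third-party "dnspython" package

    # Hypothetical SRV name; a real deployment would publish one record per healthy backend.
    SERVICE = "_myservice._tcp.example.com"

    answers = dns.resolver.resolve(SERVICE, "SRV")
    backends = sorted(
        (r.priority, r.weight, str(r.target).rstrip("."), r.port) for r in answers
    )
    for priority, weight, host, port in backends:
        print(f"{host}:{port} (priority={priority}, weight={weight})")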

Rauf Guliyev is a Traffic SRE at LinkedIn responsible for shuffling bits between devices around the world and LinkedIn's service infrastructure. He likes to solve all kinds of engineering problems and spends his free time building an exoskeleton race kit car.

Available Media

Track 2 - Network

Pembroke Room

Scaling Shopify's Multi-Tenant Architecture across Multiple Datacenters

Florian Weingarten, Shopify

Multi-tenant architectures are a very convenient and economical way to share resources like web servers, job workers, and datastores among several customers on your platform. Even the smallest Shopify store on a $9/month plan can easily survive getting hammered with a 1M RPM flash sale by leveraging the resources of the entire platform. However, architectures like this can also have several drawbacks. They are potentially harder to scale and things like resource starvation or back-end outages are harder to isolate.

In this talk, I’m going to walk you through the history of how Shopify grew from being a small standard single-database single-datacenter Rails application to the multi-database multi-datacenter setup that we run today. We will talk about the advantages in terms of resiliency, scalability, and disaster recovery that this architecture gives us, how we got there, and where we want to go in the future.

You will learn about things like how to use the Border Gateway Protocol and Equal-Cost Multi-Path routing for implementing intra-datacenter high availability, how we implement our own load balancing algorithms, what it takes to prepare a Ruby on Rails application for a move like this, and how we do completely scripted datacenter failovers in a matter of seconds with no considerable downtime.

Originally from Germany, Florian studied mathematics and computer science at RWTH-Aachen University. Did some research on cryptography and privacy in a previous life. Now working as an infrastructure engineer on the core architecture team at Shopify in Ottawa, Canada, poking holes into other people’s code.

Available Media

Leading a Team with Values

Rich Archbold, Intercom

Adding a small set of authentic, opinionated, collaboratively formed core values can be the magic ingredient to building a high performing, happy team.

In this talk you'll hear the story of how, through the introduction of four core values, the Intercom Infrastructure team achieved significant improvements in service reliability, cost effectiveness and dramatically reduced their interrupt driven workload. You will also hear how and why core values actually work and how to apply them in your own team environment.

Rich enjoys learning from interesting outages, loves hard infrastructure scaling challenges, and runs on Starbucks coffee. He is currently Director of Engineering @ Intercom and is an ex Facebooker and Amazonian. 

Available Media

Track 3

Ulster Suite

(Continued from previous session)

Docker from Scratch

Avishai Ish-Shalom, Fewbytes, and Nati Cohen, SimilarWeb

Avishai Ish-Shalom is a veteran Ops and a survivor of many production skirmishes. Avishai helps companies deal with web era operations and scale as an independent consultant. In his spare time Avishai is spreading weird ideas and conspiracy theories such as DevOps.

Docker is very popular these days, but how many of us are really familiar with the basic building blocks of Linux containers and their implications? What's missing in the good ol’ chroot jails? What are the available Copy-on-Write options and what are their pros and cons? Which syscalls allow us to manipulate Linux namespaces and what are their limitations? How do resource limits actually work? What different behaviours do containers and VMs have?

In this hands-on workshop, we will build a small Docker-like tool from O/S level primitives in order to learn how Docker and containers actually work. Starting from a regular process, we will gradually isolate and constrain it until we have a (nearly) full container solution, pausing after each step to learn how our new constraints behave.

Track 4

Munster Suite

(Continued from previous session)

Distributed Log-Processing Design Workshop

Andrea Spadaccini, Google

Andrea Spadaccini works in Dublin as a Site Reliability Manager for Google, which he joined in 2012 as an SRE working on the systems that distill, store and serve all the metrics about Google's Ads platforms. Prior to that, he worked on Linux-based PBX products, hacked on open source CPU simulators and co-founded a non-profit for students to get work experience while pursuing their studies. He earned a PhD in Computer Engineering from the University of Catania, where he focused mostly on biometric recognition. spadaccio@google.com.

Participants will have the opportunity to try their hand at designing a reliable, distributed, multi-datacenter near-real-time log processing system.

The session will start with a short presentation on lessons learned about designing reliable distributed systems, and then participants will break out in small groups, assisted by Google facilitators, and try their hand at solving a real-world design challenge, from high-level architecture down to an estimate of the computing resources required to run the service.

The session will likely appeal to experienced engineers who want to have fun tackling a real-world design problem faced by many teams in Google.

12:20–13:40 Wednesday

Conference Luncheon

Sussex Restaurant

13:40–14:40 Wednesday

Track 1 - Wildcard

Lansdowne Room

The Knowledge: Towards a Culture of Engineering Documentation

Riona MacNamara, Google

Riona MacNamara is a staff technical writer at Google. She leads Google's Documentation Infrastructure team, which aims to make internal engineers happier and more productive by fully integrating the creation, maintenance, and discovery of engineering information into our development workflow and culture. She previously worked at Amazon and Microsoft.

For several years, Google's internal surveys identified the lack of trustworthy, discoverable documentation as the #1 problem impacting internal developer productivity. We're not alone: Stack Overflow's 2016 survey ranked "Poor documentation" as the #2 problem facing engineers.

Solving this problem is tough. It's not enough to build tooling; the culture needs to change. Google internal engineering is attacking the challenge three ways: Building a documentation platform; integrating that platform into the engineering toolchain; and building a culture where documentation - like testing - is accepted as a natural, required part of the development process.

In this talk, we'll share our learnings and best practices around both tooling and culture, the evolution of documentation, and some thoughts about how we can transition from the creation of documents towards an ecosystem where context-appropriate, trustworthy documentation is reliably and effortlessly available to the engineers that need it.
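
As a concrete illustration of treating documentation like testing, here is a hypothetical presubmit-style check that fails a change when a directory it touches has no README. It is only a sketch of the idea, not Google's actual documentation tooling:

    # Hypothetical presubmit-style check: fail a change if any directory it
    # touches lacks a README.md. Illustrative only.
    import os
    import sys

    def changed_dirs(changed_files):
        """Directories touched by a change."""
        return {os.path.dirname(f) or "." for f in changed_files}

    def missing_docs(changed_files):
        """Touched directories that have no README.md."""
        return sorted(
            d for d in changed_dirs(changed_files)
            if not os.path.exists(os.path.join(d, "README.md"))
        )

    if __name__ == "__main__":
        undocumented = missing_docs(sys.argv[1:])
        if undocumented:
            print("Changed directories without a README.md:", ", ".join(undocumented))
            sys.exit(1)  # fail the check, just like a failing test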

Available Media

Bridging the Safety Gap from Scripts to Full Auto-Remediation

David Mah, Dropbox

David Mah is an SRE at Dropbox, where he built several of the verification and safety subsystems of Dropbox’s “Magic Pocket” storage system. More recently, he built Dropbox’s Naoru, an automation platform used to de-risk dangerous maintenance automation tasks.

On the flip-side of career interests, David cares a lot about how to keep folks growing and happy. Towards this, he runs Dropbox’s engineering internship program and is heavily involved in SRE recruiting, particularly university recruiting.

At Dropbox, to bridge the gap between “scripts” and “fully automatic automation”, we’ve introduced a concept of “Human Authorized Execution”. This means that a tool automatically finds problems and decides how to fix them, but a human operator is required to audit the tool’s decisions before the automation may run.

Why do we need this? Because it’s terrifying to have automation run fully automatically. With a human involved, their intuition can answer a really important question: why might I NOT want to run this script? If we took a simpler approach, for instance deploying a cron job that runs our scripts whenever alerts fire, we would lose that human sense of danger.

At Dropbox, we’ve built an alert auto-remediation platform which forces us to build our maintenance automation in a way that adheres to these principles. Through it, we’ve been able to overcome our discomfort with risky automation and transition our way into actually running scripts fully automatically.

In this talk we will discuss the thought process we bring towards building trustworthy automation, how we’ve driven our infrastructure organization towards a culture of embracing it, and simple steps that you could take to start gaining similar benefits in your organization.

This talk is targeted towards helping organisations who do not currently have extensive automation but wish to put together a road map on how to move towards fully automated operational infrastructure.
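
A minimal sketch of what human-authorized execution might look like in practice: automation maps an alert to a concrete, auditable action, but nothing runs until an operator approves it. The alert shape, remediation logic, and command below are hypothetical; this is not Dropbox’s Naoru.

    # Sketch of "Human Authorized Execution": automation proposes a fix,
    # a human must approve before it runs. All names are hypothetical.
    import subprocess

    def propose_remediation(alert):
        """Map an alert to a concrete, reviewable command (hypothetical logic)."""
        if alert["type"] == "disk_full":
            # Intentionally scary: exactly the kind of action you want audited.
            return ["rm", "-rf", f"/var/log/old/{alert['host']}"]
        return None

    def human_approves(command):
        """The operator audits the exact action before it may run."""
        print("Proposed remediation:", " ".join(command))
        return input("Run it? [y/N] ").strip().lower() == "y"

    def remediate(alert):
        command = propose_remediation(alert)
        if command is None:
            print("No known remediation; paging a human.")
            return
        if human_approves(command):
            subprocess.run(command, check=True)
        else:
            print("Operator vetoed the action; nothing was run.")

    remediate({"type": "disk_full", "host": "db-42"})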

Available Media

Track 2 - Network

Pembroke Room

Fixing the Internet for Real-Time Applications (Games)

Adam Comerford, Riot Games

Adam Comerford is currently a Senior Systems Engineer at Riot Games in Dublin and obsessed with improving the League of Legends experience for players in Europe (and beyond). He has a broad technical background spanning 15+ years and multiple disciplines including networking, distributed systems, NoSQL databases and more.

League of Legends (LoL) is not a game of seconds, but of milliseconds. In day-to-day life, two seconds fly by unnoticed but in-game a two-second stun can feel like an eternity. In any single match of LoL, thousands of decisions made in milliseconds dictate which team scores bragging rights and which settles for “honorable opponent” points. The Internet, however, was not constructed for applications that run like this, essentially in real time.

This talk will discuss the steps Riot Games has taken, and will continue to take, to fix this fundamental problem with commodity Internet, with a specific focus on the work done to improve the experience of our European players.

Available Media

Track 3

Ulster Suite

(Continued from previous session)

Docker from Scratch

Avishai Ish-Shalom, Fewbytes, and Nati Cohen, SimilarWeb

Avishai Ish-Shalom is a veteran ops engineer and a survivor of many production skirmishes. Avishai helps companies deal with web-era operations and scale as an independent consultant. In his spare time Avishai is spreading weird ideas and conspiracy theories such as DevOps.

Docker is very popular these days, but how many of us are really familiar with the basic building blocks of Linux containers and their implications? What's missing in the good ol’ chroot jails? What are the available Copy-on-Write options and what are their pros and cons? Which syscalls allow us to manipulate Linux namespaces and what are their limitations? How do resource limits actually work? What different behaviours do containers and VMs have?

In this hands-on workshop, we will build a small Docker-like tool from O/S level primitives in order to learn how Docker and containers actually work. Starting from a regular process, we will gradually isolate and constrain it until we have a (nearly) full container solution, pausing after each step to learn how our new constraints behave.
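
As a taste of the primitives the workshop builds on, here is a minimal sketch that carves out UTS, mount, and PID namespaces and chroots into a new root before exec'ing a shell. It must run as root, the rootfs path is a placeholder, and it covers only the namespace/chroot slice of what a real container runtime (or the full workshop) involves:

    # Minimal namespace + chroot sketch (run as root). Placeholder rootfs path;
    # no cgroups, copy-on-write, or networking.
    import ctypes
    import os

    CLONE_NEWNS = 0x00020000    # new mount namespace
    CLONE_NEWUTS = 0x04000000   # new hostname/UTS namespace
    CLONE_NEWPID = 0x20000000   # new PID namespace (applies to children)

    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    if libc.unshare(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID) != 0:
        raise OSError(ctypes.get_errno(), "unshare failed (are you root?)")

    pid = os.fork()             # the child becomes PID 1 in the new PID namespace
    if pid == 0:
        hostname = b"minicontainer"
        libc.sethostname(hostname, len(hostname))   # only visible inside the namespace
        os.chroot("/path/to/rootfs")                # placeholder root filesystem
        os.chdir("/")
        os.execv("/bin/sh", ["/bin/sh"])            # our "containerized" shell
    else:
        os.waitpid(pid, 0)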

Track 4

Munster Suite

Lightning Talks

To sign up for a lightning talk, write on the board outside of Munster Suite.

14:40–15:00 Wednesday

Break with Refreshments

Pre-Function Area

15:00–17:00 Wednesday

Closing Plenary Session

Pembroke/Lansdowne Rooms

Data Privacy Legislation and the Impact on SRE

John Looney, Google, and Simon McGarr, Digital Rights Ireland

Available Media

Techniques and Tools for a Coherent Discussion about Performance in Complex Architectures

Theo Schlossnagle, Circonus

Theo founded Circonus in 2010, where he now serves as Founder and CEO. After earning undergraduate and graduate degrees in computer science from Johns Hopkins University, where he spent four years of post-graduate work researching resource allocation techniques in distributed systems, Theo founded OmniTI in 1997, which has established itself as the go-to source for organizations facing today's most challenging scalability, performance, and security problems. He was also the principal architect of the Momentum MTA, which is now the flagship product of Sparkpost.

A widely respected industry thought leader, Theo is the author of Scalable Internet Architectures (Sams) and a frequent speaker at worldwide IT conferences. Theo is a computer scientist in every respect. Theo is a member of the IEEE and a senior member of the ACM. He serves on the editorial board of the ACM's Queue Magazine and sits on the ACM Practitioner Board.

Most applications today comprise separate networked services numbering in the tens to hundreds, especially with the growing popularity of microservices. Crossing the boundary between these services often means a change in team and even a change in programming language. In this session I will discuss the challenges this presents, why it is important to have a single engineering conversation about performance, and how we can accomplish this.
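
One common building block for such a conversation, though not necessarily the approach this talk proposes, is propagating a shared request identifier across every service boundary so that each team reports latency against the same request. A generic sketch:

    # Illustrative only: a shared request ID lets teams in different languages
    # attribute latency to the same request. Not a specific vendor's API.
    import time
    import uuid

    def handle_request(headers, downstream_call):
        # Reuse the caller's request ID, or mint one at the edge.
        request_id = headers.get("X-Request-Id", str(uuid.uuid4()))
        start = time.monotonic()
        response = downstream_call({"X-Request-Id": request_id})
        elapsed_ms = (time.monotonic() - start) * 1000
        # Every service logs the same fields, so per-request latency can be
        # joined across team and language boundaries.
        print(f"request_id={request_id} span=frontend elapsed_ms={elapsed_ms:.1f}")
        return response

    # Example: a stubbed downstream service standing in for a real RPC.
    handle_request({}, lambda hdrs: {"status": 200})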

Available Media

Government Needs SRE

Mikey Dickerson, U.S. Digital Service

In 2013, Mikey Dickerson joined what became known as the “ad hoc” team, tasked with rescuing HealthCare.gov after its failed launch on October 1. In August 2014, President Obama established the United States Digital Service and appointed Mikey to serve as the Administrator, to see if the strategy that succeeded at pulling HealthCare.gov out of the fire could be applied to other government problems. Now nearly two years old, with about 150 people spanning a network of federal agencies, the U.S. Digital Service has taken on immigration, education, veterans benefits, and health data interoperability. The U.S. Digital Service is helping agencies build effective government services and improve IT procurements by focusing on industry best practices and agile methodology, ultimately driving change in the largest institution in history. Prior to joining the U.S. Digital Service, Mikey worked as a Site Reliability Manager at Google.

Available Media