08:00–09:00 |
Tuesday |
Morning Coffee and Tea
Pre-Function Area
|
09:00–10:20 |
Tuesday |
Martin Check is a Site Reliability Engineer on the Microsoft Azure team. He has worked on large scale services at Microsoft for 12 years in a variety of roles ranging from service design and implementation, to crisis response, to leading teams through devops/SRE transitions. Currently he is working on Problem Management efforts for Azure to identify and resolve problems that stand in the way of service uptime through data analysis, surfacing insights, and engineering solutions.
Azure SRE works with services of widely variable maturity, ranging from fully federated DevOps teams to fully tiered IT/Ops teams and everything in between. The one thing all of these services have in common is that they have outages. While they all recover and respond in different ways, SRE has to collect and leverage data in a common manner across all services to prevent outages and drive reliability up consistently. In this talk we’ll discuss how SRE leverages diverse data sets to drive improvements across this heterogeneous set of services. SRE ensures that teams are rigorously completing post-incident reviews and addressing their live site debt. We not only look at the actual repair debt, but we’ve introduced a new concept called “virtual debt,” which shows where a service’s incident response faltered but no appropriate repair was logged. Virtual debt is affectionately referred to as “PacMan debt” due to the appearance of the chart: the greater the virtual debt, the bigger the bite.
We’ll also discuss how we expose the data in near-real-time dashboards that allow team members from the director all the way down to the IC to see relevant views and take the appropriate action. ICs can find incomplete postmortems they need to work on, a service director can view their accumulated debt to prioritize resources, or a dev manager can review virtual debt to ensure the team is conducting rigorous postmortems. By analyzing historical outages, we’ve found that missed detection leads to an exponential increase in mitigation times. We’ve collected a myriad of other insights by mining historical outage data and using charts and creative visualizations to surface them, including the surprising proxy metrics we’ve discovered that influence uptime.
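As a rough sketch of how such a measure can be computed (hypothetical record fields and a toy data set, not Azure's actual tooling), virtual debt is simply a count of incidents whose response faltered but which produced no repair item:

```python
# Hypothetical incident records; field names are illustrative only.
incidents = [
    {"id": 1, "response_faltered": True,  "repair_items": []},
    {"id": 2, "response_faltered": True,  "repair_items": ["fix-detection-gap"]},
    {"id": 3, "response_faltered": False, "repair_items": []},
]

# Actual repair debt: repair items that were logged and still need doing.
repair_debt = sum(len(i["repair_items"]) for i in incidents)

# Virtual debt (the "PacMan bite"): the response faltered, but no repair was logged.
virtual_debt = sum(1 for i in incidents
                   if i["response_faltered"] and not i["repair_items"])

print(f"repair debt: {repair_debt}, virtual debt: {virtual_debt}")
```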
|
Sebastian Kirsch is a Site Reliability Engineer for Google in Zürich, Switzerland. He manages the team that runs Google Calendar. Sebastian joined Google in 2006 in Dublin, Ireland, and has worked both on internal systems like Google's web crawler, as well as on external products like Google Maps. He specializes in the reliability aspects of new Google products and new features of existing products, ensuring that they meet the same high reliability bar as every other Google service.
Monitoring and dashboarding systems are crucial to understanding the behavior of large distributed systems. But monitoring systems can lead you on wild goose chases, or hide issues. In this talk, I will look at some examples of how a monitoring system can lie to you – in order to sensitize the audience to these failure modes and encourage them to look for similar examples in their own systems.
Dieter Plaetinck, raintank
There is a common belief that in order to solve more [advanced] alerting cases and get more complete coverage, we need complex, often math-heavy solutions based on machine learning or stream processing. This talk sets context and pros/cons for such approaches, and provides anecdotal examples from the industry, nuancing the applicability of these methods.
We then explore how we can get dramatically better alerting, as well as make our lives a lot easier, by optimizing workflow and machine-human interaction through an alerting IDE (exemplified by Bosun), basic logic, basic math, and metric metadata, even for solving complicated alerting problems such as detecting faults in seasonal time series data.
Dieter is an industrial engineer who started out as a systems engineer for the European social networking site Netlog, did information retrieval/machine learning research at the University of Ghent, then joined Vimeo for backend/systems engineering work, where he ended up doing mostly open source monitoring. He now works at raintank, the open source monitoring company behind Grafana.
|
Heinrich Hartmann, Circonus
Gathering telemetry data is key to operating reliable distributed systems at scale. Once you have set up your monitoring systems and recorded all relevant data, the challenge becomes to make sense of it and extract valuable information, like:
- Is the system down?
- Is user experience degraded for some percentage of our customers?
- How did our query response times change with the last update?
Statistics is the art of extracting information from data. In this tutorial, we address the basic statistical knowledge that helps you in your daily work as an SRE. We will cover probabilistic models, summarizing distributions with mean values, quantiles, and histograms, and the relations between them.
The tutorial focuses on practical aspects and will give you hands-on knowledge of how to handle, import, analyze, and visualize telemetry data with UNIX tools and the IPython toolkit. This tutorial has been given on several occasions over the last year and has been refined and extended since; cf. Twitter #StatsForEngineers.
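As a flavor of the hands-on material, here is a minimal NumPy sketch (with made-up latency data, not the tutorial's own datasets) of the kind of summary statistics covered:

```python
import numpy as np

# Made-up request latencies in milliseconds; in practice you would import
# these from your telemetry system (CSV, JSON, an API, ...).
latencies_ms = np.random.lognormal(mean=3.0, sigma=0.5, size=10_000)

print("mean:   %6.1f ms" % latencies_ms.mean())
print("median: %6.1f ms" % np.percentile(latencies_ms, 50))
print("p99:    %6.1f ms" % np.percentile(latencies_ms, 99))

# A coarse histogram: request counts per latency bucket.
counts, edges = np.histogram(latencies_ms, bins=[0, 10, 25, 50, 100, 250, 1000])
for lo, hi, count in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:4.0f}-{hi:<4.0f} ms: {count}")
```

For skewed latency data like this, the mean sits well above the median, which is one reason quantiles and histograms deserve as much attention as averages.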
|
Will Gallego, Nathan Hoffman, and Miriam Lautner, Etsy
Will is an engineer at Etsy who believes in a free and open web, blamelessness and empathy in the workplace, and pronouncing gif with a soft g.
Nathan Hoffman has worked at Etsy for four years as a Software Engineer. After graduating from the Etsy School on Post Mortem Facilitation, Nathan has led many successful and some not-so-successful postmortems, and is happy to talk about all of them. Before Etsy, he spent some time at the Recurse Center.
Miriam is an Engineer at Etsy, graduate of the 2014 Recurse Center class, and an avid climber in her spare time.
Many organizations want to learn from failures. Postmortem debriefings and documents are a part of that learning process. In this two-part session, we will cover the theory and fundamentals of complex systems failure and “human error”, as well as techniques for facilitating an adverse event debriefing. Attendees should walk away with a more evolved sense of accident/outage investigation and a model to explore in their own organizations.
|
10:20–11:00 |
Tuesday |
Break with Refreshments
Pre-Function Area
|
11:00–12:20 |
Tuesday |
Sasha Goldshtein, SELA Group
Sasha Goldshtein is the CTO of SELA Group, a Microsoft C# MVP, and a Pluralsight author. He leads the Performance and Debugging team at SELA Technology Center, and is the author of numerous training courses, open source projects, books, and online articles on diagnostic tools and performance optimization. Sasha consults on various topics, including production debugging, application and system troubleshooting, performance investigation, and distributed architecture.
Imagine you're tackling one of these evasive performance issues in the field, and your go-to monitoring checklist doesn't seem to cut it. There are plenty of suspects, but they are moving around rapidly and you need more logs, more data, more in-depth information to make a diagnosis. Maybe you've heard about DTrace, or even used it, and are yearning for a similar toolkit, which can plug dynamic tracing into a system that wasn't prepared or instrumented in any way.
Hopefully, you won't have to yearn for a lot longer. eBPF (extended Berkeley Packet Filters) is a kernel technology that enables a plethora of diagnostic scenarios by introducing dynamic, safe, low-overhead, efficient programs that run in the context of your live kernel. Sure, BPF programs can attach to sockets; but more interestingly, they can attach to kprobes and uprobes, static kernel tracepoints, and even user-mode static probes. And modern BPF programs have access to a wide set of instructions and data structures, which means you can collect valuable information and analyze it on-the-fly, without spilling it to huge files and reading them from user space.
In this talk, we will introduce BCC, the BPF Compiler Collection, which is an open set of tools and libraries for dynamic tracing on Linux. Some tools are easy and ready to use, such as execsnoop, fileslower, and memleak. Other tools such as trace and argdist require more sophistication and can be used as a Swiss Army knife for a variety of scenarios. We will spend most of the time demonstrating the power of modern dynamic tracing -- from memory leaks to static probes in Ruby, Node, and Java programs, from slow file I/O to monitoring network traffic. Finally, we will discuss building our own tools using the Python and Lua bindings to BCC, and its LLVM backend.
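For context, BCC's Python bindings let you compile and attach a small BPF program in a handful of lines. The sketch below mirrors the canonical BCC "Hello World" example; it assumes BCC is installed, requires root privileges, and kernel symbol names vary between versions:

```python
from bcc import BPF

# The BPF program is written in restricted C, compiled at runtime by BCC's
# LLVM backend, and attached as a kprobe on the clone syscall via the
# kprobe__ naming convention.
prog = r"""
int kprobe__sys_clone(void *ctx) {
    bpf_trace_printk("Hello, World!\n");
    return 0;
}
"""

b = BPF(text=prog)
b.trace_print()  # stream trace output until interrupted
```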
Joy grew up in the wilds of Detroit with Robocop as her only friend. She thought she wanted to be an artist in college but then discovered the computer lab and it was all downhill from there. Currently she is Director of Service Reliability Engineering at Heroku in San Francisco. She loves swearing, whiskey, and saying things like “process is programming for humans!”.
System failures happen. Hardware dies, software crashes, capacity gets exceeded, and any of these things can cause unexpected effects in the most carefully-architected systems.
At Heroku, we deal with complex systems failures. We’re running a platform as a service: our whole business model requires us to provide operations for our customers so they don’t have to do it themselves. We run over a million Postgres databases, tens of thousands of Redis instances, and hundreds of thousands of dynos on thousands of AWS instances.
What do we get out of these incidents? Pain and suffering? Yes, sometimes. We also get data about how our systems are actually working. We get ideas for making it work better. And sometimes we get ideas for whole new products.
In this talk, I’ll discuss how to take the bad of a system failure and turn it into good: better products, more reliable platforms, and less stressed engineers.
|
Björn Rabenstein, SoundCloud
Björn is a Production Engineer at SoundCloud and one of the main Prometheus developers. Previously, he was a Site Reliability Engineer at Google and a number cruncher for science.
Noisy alerts are the deadly sin of monitoring. They obfuscate real issues and cause pager fatigue. Instead of reacting with the due sense of urgency, the person on-call will start to skim or even ignore alerts, not to speak of the destruction of their sanity and work-life balance. Unfortunately, there are many monitoring pitfalls on the road to complex production systems, and most of them result in noisier alerts. In distributed systems, and in particular in a microservice architecture, there is usually a good understanding of local failure modes, while the behavior of the system as a whole is difficult to reason about. Thus, it is tempting to alert on the many possible causes – after all, finding the root cause of a problem is important. However, a distributed system is designed to tolerate local failures, and a human should only be paged on real or imminent problems of a service, ideally aggregated to one meaningful alert per problem. The definition of a problem should be clear and explicit rather than relying on some kind of automatic "anomaly detection." Taking historical trends into account is needed, though, to detect imminent problems. Those predictions should be simple rather than "magic." Alerting because "something seems weird" is almost never the right thing to do.
SoundCloud's long way from noisy pagers to much saner on-call rotations will serve as a case study, demonstrating how different monitoring technologies, among them most notably Prometheus, have affected alerting.
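As a toy illustration of the symptom-based approach (invented thresholds and numbers, not SoundCloud's actual setup), the paging decision is driven by what users experience rather than by individual component failures; in Prometheus, such a condition would typically be expressed as an alerting rule over request-counter rates:

```python
# Symptom-based paging: alert on the error ratio users see for the service
# as a whole, with an explicit threshold, instead of paging on every
# failing backend.
def should_page(total_requests: int, failed_requests: int,
                max_error_ratio: float = 0.01) -> bool:
    if total_requests == 0:
        return False
    return failed_requests / total_requests > max_error_ratio

# One bad replica among many stays below the service-level threshold,
# so nobody is woken up for a failure the system is designed to tolerate.
print(should_page(total_requests=100_000, failed_requests=300))    # False
print(should_page(total_requests=100_000, failed_requests=2_500))  # True
```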
|
(Continued from previous session)
Heinrich Hartmann, Circonus
Gathering telemetry data is key to operating reliable distributed systems at scale. Once you have set up your monitoring systems and recorded all relevant data, the challenge becomes to make sense of it and extract valuable information, like:
- Is the system down?
- Is user experience degraded for some percentage of our customers?
- How did our query response times change with the last update?
Statistics is the art of extracting information from data. In this tutorial, we address the basic statistical knowledge that helps you in your daily work as an SRE. We will cover probabilistic models, summarizing distributions with mean values, quantiles, and histograms, and the relations between them.
The tutorial focuses on practical aspects and will give you hands-on knowledge of how to handle, import, analyze, and visualize telemetry data with UNIX tools and the IPython toolkit. This tutorial has been given on several occasions over the last year and has been refined and extended since; cf. Twitter #StatsForEngineers.
|
Will Gallego, Nathan Hoffman, and Miriam Lautner, Etsy
Will is an engineer at Etsy who believes in a free and open web, blamelessness and empathy in the workplace, and pronouncing gif with a soft g.
Nathan Hoffman has worked at Etsy for four years as a Software Engineer. After graduating from the Etsy School on Post Mortem Facilitation, Nathan has led many successful and some not-so-successful postmortems, and is happy to talk about all of them. Before Etsy, he spent some time at the Recurse Center.
Miriam is an Engineer at Etsy, graduate of the 2014 Recurse Center class, and an avid climber in her spare time.
Many organizations want to learn from failures. Postmortem debriefings and documents are a part of that learning process. In this two-part session, we will cover the theory and fundamentals of complex systems failure and “human error”, as well as techniques for facilitating an adverse event debriefing. Attendees should walk away with a more evolved sense of accident/outage investigation and a model to explore in their own organizations.
|
12:20–13:40 |
Tuesday |
Conference Luncheon, Sponsored by Amazon Data Services Ireland
Sussex Restaurant
|
13:40–15:00 |
Tuesday |
Graham is an SRE at Google working on machine learning pipelines for ad click prediction.
Motivated by the problem of predicting whether any given ad would be clicked in response to a query, in this introductory talk we outline the requirements and large-system design challenges that arise when designing a machine learning system that makes millions of predictions per second with low latency, learns quickly from the responses to those predictions, and maintains a consistent level of model quality over time. We present alternatives for meeting those challenges using diagrams of machine learning pipelines.
Concepts used in this talk: machine learning (classification), software pipelines, sharding and replication, map-reduce
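As a toy example of one of these building blocks (generic hash-based sharding with replication, not the specific system described in the talk):

```python
import hashlib

# Each key maps to a primary shard plus N-1 replicas, so lookups can
# survive the loss of a single server.
NUM_SHARDS = 8
REPLICATION = 3

def shard_of(key: str) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_of(key: str) -> list:
    primary = shard_of(key)
    return [(primary + i) % NUM_SHARDS for i in range(REPLICATION)]

print(replicas_of("feature:query_term=shoes"))
print(replicas_of("feature:ad_id=1234"))
```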
Moderator: Laura Nolan, Google
|
API Management—Why Speed Matters
Arianna Aondio, Varnish Software
Reverse Engineering the “Human API” for Automation and Profit
Nati Cohen, SimilarWeb
What a 17th Century Samurai Taught Me about Being an SRE
Caskey L. Dickson, Microsoft
Chatops/Automation: How to get there while everything's on fire
Fran Garcia, Hosted Graphite
Sysdig Love
Alejandro Brito Monedero, Alea Solutions
Automations with Saltstack
Effie Mouzeli, Logicea, LLC
Myths of Network Automation
David Rothera, Facebook
DNS @ Shopify
Emil Stolarsky, Shopify
Hashing Infrastructures
Jimmy Tang, Rapid7
|
Laine Campbell, Author, Database Reliability Engineering, O'Reilly Media
This tutorial/workshop is aimed at management and individual contributors alike. We will work together on how to encourage and nurture a culture of diversity in day-to-day ops teams. First we will discuss the concepts of 2- and 3-dimensional diversity, and the statistics around diverse teams' performance. Then we will map out how to design, build, deploy, and operate a diversity plan in our teams. This will include diversity goal setting and explicit cultural evolution, hiring processes, day-to-day communications, review processes, and team collaboration. Where possible we will encourage groups to break out and evaluate their own cultures and processes.
|
Kurt Andersen has been active in the anti-abuse community for over 15 years and is currently the senior IC for the Consumer Services SRE team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware and Mobile Anti-Abuse Working Group (M3AAWG.org). He has spoken at M3AAWG, Velocity, SREcon(US), and SANOG on various aspects of reliability, authentication and security.
This workshop is a part of the "full lifecycle" workshop track which includes Post-Mortems, Incident Response, and Effective Design Review Participation. Using several example cases, participants in this session will learn to apply a variety of different points of view to analyze a design for issues which could affect its reliability and operability.
|
15:00–15:40 |
Tuesday |
Break with Refreshments
Pre-Function Area
|
15:40–17:00 |
Tuesday |
Nicolas Brousse, TubeMogul
Nicolas Brousse is Senior Director of Operations Engineering at TubeMogul (NASDAQ: TUBE). The company's sixth employee and first operations hire, Nicolas has grown TubeMogul's infrastructure over the past seven years from several machines to over two thousand servers that handle billions of requests per day for clients like Allstate, Chrysler, Heineken and Hotels.com.
Adept at adapting quickly to ongoing business needs and constraints, Nicolas leads a global team of site reliability engineers and database architects that monitor TubeMogul's infrastructure 24/7 and adhere to "DevOps" methodology. Nicolas is a frequent speaker at top U.S. technology conferences and regularly gives advice to other operations engineers. Prior to relocating to the U.S. to join TubeMogul, Nicolas worked in technology for over 15 years, managing heavy traffic and large user databases for companies like MultiMania, Lycos and Kewego. Nicolas lives in Richmond, CA, and is an avid fisherman and aspiring cowboy.
Lightning Talks session
Moderators: John Looney, Google, and Gareth Eason, Facebook
|
Kumar Srinivasamurthy, Microsoft
Kumar works at Microsoft and has been in the online services world for several years. He currently runs the Bing Live site/SRE team. For the last several years, he has focused on growing the culture around live site quality, incident response and management, service hardening, availability, performance, capacity, SLA metrics, DRI/SRE development and educating teams on how to build services that run at scale.
Do you have services whose owners claim they run at five 9s, yet you often run into errors? It's very easy and convenient to build metrics at the service level, but these often hide a wide array of issues that users might face. Having the right metrics is a key component of building a sustainable SRE culture. This talk goes into the design of these metrics, with real-world examples to illustrate good and bad designs.
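A small, made-up example of the gap such metrics can hide: requests that never reach the service do not show up in server-side availability, but users still experience them as failures.

```python
# Invented numbers: requests the service saw vs. requests that never
# arrived (DNS failures, connection timeouts, load-balancer drops, ...).
served_ok, served_err = 999_000, 1_000
never_reached = 4_000

service_view = served_ok / (served_ok + served_err)
user_view = served_ok / (served_ok + served_err + never_reached)

print(f"service-reported availability: {service_view:.4%}")  # 99.9000%
print(f"user-perceived availability:   {user_view:.4%}")     # 99.5020%
```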
Bora is a software developer at SoundCloud. He started his journey there three years ago. As a generalist, he has worked on various parts of their architecture. Nowadays he is part of the Core Engineering, where he helps to build and integrate the core business services of SoundCloud. When he's not juggling various languages, he's playing basketball—as long as someone on the team covers his on-call shifts...
In a microservices architecture, different services usually have different availabilities. It is often hard to see how the availability of a single service affects the availability of the overall system. Without a clear idea about the availability requirements of individual services, even a seemingly subtle degradation of a service can cause a critical outage. Unfortunately, these are discovered only after thorough post-mortems. At SoundCloud we kicked off a project called “Availability Objectives”. An availability objective is the minimum availability a service is allowed to have. These objectives are calculated based on the requirements of the clients of those services. We started by visiting all of our services and setting an availability objective for each of them. We built tools to expose the availability of these services and to flag the ones that drop below their objectives. As a result, we can now make informed decisions about the integration points we need to improve first. This talk will share the insights we gained through this project and how it affected our overall availability and engineering productivity.
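A back-of-the-envelope sketch of the underlying arithmetic (invented numbers, ignoring the fact that real failures are correlated, and not how SoundCloud actually computes its objectives): if a client needs every hard dependency to serve a request, the product of the dependencies' availabilities bounds what the client can promise.

```python
client_objective = 0.999               # the client promises three nines
hard_dependencies = {
    "auth":    0.9995,
    "storage": 0.9999,
    "search":  0.9990,
}

# If every hard dependency must respond for the client to serve a request,
# the product of their availabilities bounds the client's availability.
best_case = 1.0
for availability in hard_dependencies.values():
    best_case *= availability

print(f"upper bound from dependencies: {best_case:.4%}")   # ~99.84%
print("objective met" if best_case >= client_objective else "objective at risk")
```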
Cory Lueninghoener, Los Alamos National Laboratory
Cory Lueninghoener leads the HPC Design Group at Los Alamos National Laboratory. He has helped design, build, and manage some of the largest scientific computing resources in the world, including systems ranging in size from 100,000 to 900,000 processors. He is especially interested in turning large-scale system research into practice, and has worked on configuration management and system management tools in the past. Cory was co-chair of LISA 2015 and is active in the large scale system engineering community.
The concept of the error budget is a great way to hack SLAs and make them into a positive tool for system engineers. But how can you take the same idea from a world that handles millions of transactions in a day to one that handles hundreds? High Performance Computing jobs run for hours, days, or weeks at a time, resulting in unique challenges related to system availability, maintenance, and experimentation. This talk will explore a way to modify the error budget concept to fit in an HPC environment by applying the same idea to cluster outages, both planned and unplanned, and to ultimately give customers the best computing environment possible.
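As a rough illustration of the arithmetic (numbers are invented): a target availability over a fixed window yields a budget of outage hours, which planned maintenance and unplanned outages both draw down.

```python
# A 99.5% availability target over a 30-day window.
target_availability = 0.995
window_hours = 30 * 24

budget_hours = (1 - target_availability) * window_hours   # 3.6 h of allowed outage
planned_maintenance_hours = 1.5
unplanned_outage_hours = 0.75

remaining = budget_hours - planned_maintenance_hours - unplanned_outage_hours
print(f"error budget: {budget_hours:.2f} h, remaining this window: {remaining:.2f} h")
```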
|
Laine Campbell, Author, Database Reliability Engineering, O'Reilly Media
This tutorial/workshop is aimed at management and individual contributors alike. We will work together on how to encourage and nurture a culture of diversity in day-to-day ops teams. First we will discuss the concepts of 2- and 3-dimensional diversity, and the statistics around diverse teams' performance. Then we will map out how to design, build, deploy, and operate a diversity plan in our teams. This will include diversity goal setting and explicit cultural evolution, hiring processes, day-to-day communications, review processes, and team collaboration. Where possible we will encourage groups to break out and evaluate their own cultures and processes.
|
Laura has been a Site Reliability Engineer at Google since 2013, working in areas as diverse as data infrastructure and pipelines, alerting and more recently, networking. She is passionate about sharing knowledge and intrigued by the behaviour of complex systems, including the humans who run them. Prior to Google she worked as a performance engineer in e-commerce and as a software engineer in R&D for a large Irish software company.
This workshop is structured as a fast-moving but fun game (think Fluxx crossed with a hectic on-call shift), but the subject matter is entirely serious: we will use it to explore best practices and pitfalls for managing incidents as a team. You will work as part of a team managing a production outage: we'll go through the entire process, from detection of the incident through problem diagnosis, mitigation, and resolution, finishing with the first draft of the postmortem.
|
17:30–18:30 |
Tuesday |
Happy Hour, Sponsored by Facebook
Herbert Room
|