Note: Meeting Room 7 will be available as an On-Call Room for attendees.
SREcon17 Europe/Middle East/Africa Program Grid
Download the program in grid format (PDF, updated 31 Aug 2017).
30 August 2017
07:30–09:00
Morning Coffee and Tea
Prefunction
09:00–10:20
Pembroke and Lansdowne Rooms
Care and Feeding of SRE
Narayan Desai, Google
As SRE enters the ops zeitgeist, much of the focus has been placed on tactics: techniques that individual operations teams can adopt to improve their effectiveness. While there is value in singleton adoption, I'll make the case in this talk that pairing these tactics with matching organizational support and culture results in impact far greater than the sum of its parts. I'll focus on three SRE goals: maintaining SLOs, managing operational load, and maximizing leverage, and discuss failure modes without sufficient organizational support. These aren't tactics that can be fully implemented by an operations team. SRE is an organizational strategy that needs to be adopted by the business.
Narayan Desai, Google
Narayan is a jack of many trades, having worked as a sysadmin, software engineer, computational biologist, and computer science researcher, and most recently as an SRE manager at Google. When not working with computers, or people working with computers, he spends his time in high stakes toddler negotiations.
Diversity and Inclusion in SRE: A Postmortem
Niall Richard Murphy, Google
Whether a cause or a consequence of diversity & inclusion problems, members of minority groups in SRE experience harassment, bullying, and anti-social exclusion far too often. Although primarily an ethical and behavioural issue, it also has extremely costly negative effects on team effectiveness, arising from loss of psychological safety and even attrition. The data supporting these assertions are reasonably clear, but what is perhaps less clear is what to do about it.
We therefore analyse the situation in the form of a postmortem, suggesting some root causes, presenting a timeline, and analysing factors which contributed to (and offset) The Incident, and propose some actions to remediate.
Niall Richard Murphy, Google
Niall Richard Murphy is the head of Ads Reliability Engineering for Google Ireland, where his group is responsible for the infrastructure underlying ~90% of Google's annual revenue. He is the instigator, co-author and co-editor of platinum-selling "Site Reliability Engineering" (O'Reilly, 2016), a history of the Irish Internet, and is the holder of degrees in Computer Science and Mathematics, and Poetry Studies. He lives in Dublin with his wife and two children.
10:20–11:00
Break with Refreshments
Prefunction
11:00–12:30
Lansdowne Room
Globalizing SRE in a Walkup Culture
Bill Lincoln, Wayfair, and Matt Coleman
The SRE discipline necessitates deep understanding of the organisation's business and technical needs, as well as clear objectives across the entire team.
What happens when you begin to distribute your SRE team across a large geographical divide?
We'll talk about what went into creating Wayfair's global SRE presence, from initial hire to full contributing team—what worked, what didn't, and what we have planned.
Bill Lincoln, Wayfair
Bill, Associate Director of Platform Engineering, manages the global SRE - Platform Engineering team at Wayfair. SRE's scope includes all things production that have an impact on e-commerce at Wayfair. We have expanded to a global team with members in both the US and EU.
Make Haste Slowly: Balancing SRE Diligence in Urgency Driven Organizations
Jason Hiltz-Laforge, Shopify
Shopify is a commerce platform which has grown to power hundreds of thousands of businesses in a little over ten years. Along with the company, the production engineering organization has evolved from a founder's part time job to a team of over seventy people. Because of all that rapid and continued growth, the culture highly rewards speed and urgency. "Move fast, break things" is fine…unless you are responsible for site reliability and availability. And the databases. Especially the databases.
This talk is about the tension between an urgency driven organization and the diligent SRE teams that operate within it. We'll examine how to build, nurture, and support those teams. We'll look at how to celebrate and reward them for being prudent, cautious, and skeptical. And because it is the deliberate pace of these teams that allows the rest of the organization to move quickly, we'll dive into how to concretely measure the benefits and sell them as positives to the rest of the organization. Attendees will leave with tools and techniques to highlight the importance of their work as SREs, when to trade speed for diligence and how to move fast and stay sane—all without cutting corners.
Jason Hiltz-Laforge, Shopify
I'm a production engineering lead at Shopify, where I try not to break too many things at once. Apart from computers, I enjoy naming things and trying to convince my coworkers that all food is essentially salad. Outside of work, I like spending time with my wife and two daughters. If you come too close, I'll probably show you a picture.
Want to Solve Over-Monitoring and Alert Fatigue? Create the Right Incentives!
Kishore Jalleda, Yahoo, Inc.
Telemetry monitors and their (constant) beeping are a pretty common sight in hospitals. I saw these at the NICU where my twins were being cared for after being born prematurely. My wife and I used to freak out every time one of these went off. Unlike a missed alarm that said your site's down, failing to act on an alarm at a hospital can have much more critical consequences; in 2010 at a hospital in Massachusetts, a patient's death was directly linked to telemetry monitoring after alarms signaling a critical event went unnoticed by 10 nurses.
I attempted to solve this problem when I joined Zynga (in 2013) as the head of SRE. I will go over our failed attempts, including filtering the noise, adding heads, building more tools, etc. I will also cover how I came up with an initiative called "clean room" as a way to incentivize engineering teams to keep the noise levels low. Finally, I will go over some of the tactics that worked (and ones that didn't).
Most people I spoke to about "clean room" almost always walked away having learned something (some have said it's common sense). Share, learn, ask questions, participate - I'll try to make it fun!
Kishore Jalleda, Yahoo, Inc.
Kishore Jalleda is currently the head of production engineering in the Americas in Yahoo’s Publisher Products unit, which includes many popular destinations like Yahoo.com, Yahoo Finance, Yahoo Sports, Yahoo News and Flurry. Previously, Kishore was the head of SRE at Zynga and worked at IMVU, one of the pioneers of continuous delivery (with co-founder Eric Ries, the author of Lean Startup).
Pembroke Room
SRE Your gRPC—Building Reliable Distributed Systems (Illustrated with GRPC)
Grainne Sheerin and Gabe Krabbe, Google
Distributed systems have sharp edges, and we have a wealth of experience cutting ourselves on them. We want to share our experience with SREs elsewhere, so they can skip making the same mistakes and join us making exciting new ones instead!
We will share practical suggestions from 14 years of failing gracefully:
- In a distributed service, every component is a frontend to another one down the stack. How can it deal with backend failures so that the service as a whole does not go down?
- In a distributed service, every component is a backend for another one up the stack. How can it be scaled and managed, avoiding overload and under-use?
- In a distributed service, latency is often the biggest uncertainty. How can it be kept predictable?
- In a distributed service, contributions to availability, processing, and latency costs are hard to assign. When things (inevitably) go wrong, what components are to blame? When they work, where are the biggest opportunities for improvement?
We will cover best and worst practices, using specific gRPC examples for illustration.
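As a rough illustration of the first question above (how a component can cope with backend failures), here is a minimal sketch of retries that share one overall deadline and back off with jitter. It is not taken from the talk and does not use the gRPC API; `BackendError`, `call_with_deadline`, and all parameters are hypothetical names chosen for this example.

```python
import random
import time


class BackendError(Exception):
    """Hypothetical error type standing in for a failed backend call."""


def call_with_deadline(call, deadline_s, base_delay_s=0.05, max_attempts=4,
                       clock=time.monotonic, sleep=time.sleep):
    """Retry `call(remaining_s)` with jittered exponential backoff.

    All attempts share one deadline, so retrying never makes the caller
    wait longer than `deadline_s` in total.
    """
    start = clock()
    for attempt in range(max_attempts):
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            break
        try:
            # Pass the remaining budget down so the backend call can
            # give up early instead of working past the deadline.
            return call(remaining)
        except BackendError:
            if attempt == max_attempts - 1:
                break
            # Full jitter: sleep a random time up to the backoff cap,
            # but never past the shared deadline.
            cap = base_delay_s * (2 ** attempt)
            remaining = deadline_s - (clock() - start)
            sleep(max(0.0, min(random.uniform(0, cap), remaining)))
    raise BackendError("deadline exceeded or retries exhausted")
```

Bounding both the attempt count and the total time keeps a struggling backend from being hit with unbounded retry traffic.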
Grainne Sheerin, Google
Grainne is a Site Reliability Engineer for Google Ireland. She's a tech lead responsible for Ad Serving infrastructure and has 5 years of experience in production engineering. She is a physicist, earning a doctorate in Nanoscience from Dublin City University. Prior to Google, she masqueraded as a strategic relationship manager for Reuters and a network engineer for HEAnet.
Gabe Krabbe, Google
Gabe Krabbe has been a Site Reliability Engineer at Google for over 12 years. He has worked on, and sometimes against, multiple generations of the Ads management and serving infrastructure. Before joining Google, he worked for various companies as a system administrator. He frequently tells his servers and his children that he doesn't care who started it, because it takes two to fight.
Profiling Node Applications
Sasha Goldshtein, CTO, Sela Group
Node runs on a powerful JavaScript engine, but that same engine can complicate things when it comes to obtaining accurate information on your application's performance. There are plenty of tools for profiling C++ or Java applications, but understanding JavaScript interactions with native code can be extremely challenging. In this talk we will discuss profiling options for Node.js, including perf_events, dtrace, the V8 engine's built-in --prof switch, and tools based on the bleeding-edge kernel BPF technology. We will also talk about turning profiler results into flame graphs, an innovative visualization tool for understanding stack sample reports, and for figuring out the time split across the JavaScript and native parts of your application.
Sasha Goldshtein, CTO, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft MVP, Pluralsight author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor to projects focused on system diagnostics, performance monitoring, and tracing—across multiple operating systems and runtimes. Sasha authored and delivered training courses on Linux performance optimization, event tracing, production debugging, mobile application development, and modern C++. Between his consulting engagements, Sasha speaks at international conferences world-wide.
Meeting Room 1+2
Load-Shedding: Overview of Different Methodologies
Acacio Cruz, Google
This talk gives an inventory and overview of the different methods for dealing with load-shedding and overload in production stacks, including an overview of the methods developed at Google and the open-source solutions.
We'll review the pros and cons, scope and effort levels of each method, and compare with existing approaches, including circuit-breakers.
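To make the idea concrete, one of the simplest load-shedding methods is a cap on concurrent requests: once the cap is reached, new work is rejected quickly rather than queued. The sketch below is an illustration only, not one of the Google methods the talk covers; `LoadShedder`, `handle`, and the 503-style response are assumptions made for this example.

```python
import threading


class LoadShedder:
    """Reject new work once too many requests are already in flight."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Admit the request only if there is spare capacity.
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Run `work()` if admitted, otherwise fail fast with a 503-style reply."""
    if not shedder.try_acquire():
        return 503, "overloaded, try again later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

Failing fast keeps latency predictable for the requests that are admitted, at the cost of turning some callers away during a spike.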
Acacio Cruz, Google
Acacio has been an SRE manager since 2007, and manager of Google's Load-shedding & Traffic Management team since 2009. He is now a SWE Director in Frameworks and Software Infrastructure.
Managing SSH Access without Managing SSH Keys
Niall Sheridan, Intercom
Everyone uses SSH to manage their production infrastructure, but it's really difficult to do a good job of managing SSH keys. Many organisations don't know how many SSH keys have access to production systems or how protected those keys are. A trusted SSH private key can be years old, unprotected by passphrase, and shared among multiple people who may not even work for you.
With some tooling and configuration SSH keys can be replaced with limited-use ephemeral certificates, issued centrally and with better access controls and automatic key expiration, solving many of the shortcomings of using SSH keys.
This talk will cover:
- Managing SSH keys: The bad parts
- Replacing SSH keys with ephemeral certificates: how & why
- Discussion of an implementation of a CA for SSH certificates
- Call for participation, showing GitHub source
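For readers unfamiliar with SSH certificates, the sketch below builds the `ssh-keygen` command a certificate authority could run to sign a user's public key with a short validity window. It is not the implementation discussed in the talk; the file names, identity, principal, and the five-minute lifetime are all hypothetical. The flags used (`-s`, `-I`, `-n`, `-V`) are standard OpenSSH `ssh-keygen` options.

```python
import shlex


def signing_command(ca_key, user_pubkey, identity, principal, validity="+5m"):
    """Return the argv that signs `user_pubkey` with the CA key `ca_key`."""
    return [
        "ssh-keygen",
        "-s", ca_key,      # CA private key used for signing
        "-I", identity,    # certificate identity, recorded in server logs
        "-n", principal,   # principal (user name) the certificate is valid for
        "-V", validity,    # validity interval; "+5m" expires after five minutes
        user_pubkey,       # public key to be signed
    ]


# Hypothetical file names and identity, for illustration only.
cmd = signing_command("ca_key", "id_ed25519.pub", "alice@example.com", "deploy")
print(shlex.join(cmd))
```

Because the certificate expires on its own, there is no long-lived trusted key to track down and revoke later.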
Niall Sheridan, Intercom
Niall Sheridan is an SRE on Intercom's infrastructure team. His main interests are automation and monitoring, and he loves a good post-mortem.
Meeting Room 9
SRE 101
Laura Nolan, Google
The purpose of an SRE team is to keep its services up, reliable, performant and efficient. How do effective SRE teams do this?
We'll run through an overview of key SRE competencies: monitoring and alerting, incident response, disaster recovery, performance and efficiency, change management and capacity planning.
We'll also look at the habits of successful SRE teams and some common pitfalls.
Laura Nolan, Google
Laura Nolan has been a Site Reliability Engineer at Google for four years, working on large data infrastructure projects and most recently, networking. Her background is in software engineering and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly SRE book, and is co-chair of SREcon17 Europe/Middle East/Africa.
12:30–13:40
Conference Luncheon
Sponsored by Palantir
Sussex Restaurant and Herbert Room
13:40–15:00
Lansdowne Room
The Dangers of Being Overly-Paranoid
Ingrid Epure, Intercom
Shipping code is not enough. You also need logs, tests and static analyzers to have the necessary confidence in the change you just deployed. Especially when you’re shipping to production every 10 minutes, at peak time.
A common philosophy for feeling that things are in control is simply adding more data. And when something bad happens, the first reaction is to add even more instrumentation to cover that specific scenario. You need more information in the logs to troubleshoot, more inputs for your tests and more linting rules. And then you’ll never run into that problem again, right!?
Well…maybe, but you’ve just hit a bigger one:
When your app is small, you can easily get away with test duplication, log noisiness and introducing new tools.
As you grow, the noise becomes impactful. You’re being slowed down by the shortcuts and poor decisions you took earlier, until it becomes a non-trivial problem to solve.
In this talk we will explore the approach we took at Intercom for introducing information sanity, by focusing on:
- logging myths, and dangerous fallacies
- being deliberate about operational and engineering needs
- better troubleshooting using structured and canonical logs
- getting more benefits and performance out of notoriously slow static analyzers
- writing tests with performance in mind
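As a small illustration of the structured and canonical logs mentioned in the list above, the sketch below emits one machine-parseable JSON line per request instead of several free-text lines. It is an assumption about what such a log line could look like, not Intercom's format; `canonical_log_line` and the field names are hypothetical.

```python
import json


def canonical_log_line(**fields):
    """Serialize one request's key facts as a single compact JSON line."""
    # Sorted keys keep lines stable, which makes them easy to diff and grep.
    return json.dumps(fields, sort_keys=True, separators=(",", ":"))


line = canonical_log_line(method="GET", path="/api/users", status=200, duration_ms=42)
print(line)
# prints {"duration_ms":42,"method":"GET","path":"/api/users","status":200}
```

With one line per request, troubleshooting becomes a query over fields rather than a search through prose.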
Ingrid Epure, Intercom
Ingrid is an engineer currently working for Intercom in Dublin, Ireland. She is passionate about distributed systems, automation and simplifying things. She is a conference speaker, an active member in the Python community and loves mentoring and helping with community-driven events.
Show Me the RIGHT Numbers! Are Our Users Happy?
Perry Statham, IBM
Are your users happy? If you’re not really sure, then you’re focusing on the wrong metrics.
In this session we will show you outside-in approaches to choosing service level indicators and objectives that reflect user happiness.
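As a worked example of the arithmetic behind service level indicators and objectives, the sketch below computes an availability SLI from request counts and the share of the error budget that remains. The function names and the numbers are hypothetical and are not taken from the session.

```python
def availability_sli(good_events, total_events):
    """Fraction of events that were good; an empty window counts as healthy."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(sli, slo):
    """Fraction of the error budget left; a negative value means the SLO was missed."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed = 1.0 - slo  # failures the SLO tolerates
    spent = 1.0 - sli    # failures actually observed
    return 1.0 - spent / allowed


# Hypothetical month: 1,000,000 requests, 500 of them failed, SLO of 99.9%.
sli = availability_sli(999_500, 1_000_000)
remaining = error_budget_remaining(sli, slo=0.999)
print(f"SLI={sli:.4%}, error budget remaining={remaining:.0%}")
```

Here the SLO tolerates 1,000 failed requests and 500 occurred, so about half of the budget remains.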
Perry Statham, IBM
Perry Statham is a Site Reliability Engineer with IBM’s Bluemix DevOps and Continuous Delivery (https://console.ng.bluemix.net/devops) product teams. As a veteran of the development vs. operations wars, he’s been doing DevOps since long before it was a buzzword.
Pembroke Room
Standing On the Shoulders of Giants: Unleashing the Power of Scriptable Load Balancers
Emil Stolarsky, Production Engineer, Shopify
Every year, our organizations continue adding more services. It’s unsustainable to have a dedicated team of SREs for each one. That’s why, as an industry, we’ve moved to the product-team SRE model. We’re now accustomed to building custom services that applications reach out to, but not middleware services that operate on requests before they reach their destination.
Load balancers have the potential to provide application-aware middleware without making changes to the application itself. However, traditional load balancers can’t be easily and deeply customized or redeployed quickly without significant risk. Instead, we can embed a scripting language to fulfill these requirements.
At Shopify, we do this with Nginx and LuaJIT via OpenResty. Our Nginx scripts deploy in 10 seconds, run through a thorough suite of automated tests, and have allowed us to solve sharding across data centers, handle some of the world’s biggest flash sales, and respond quickly to layer 7 DDoS attacks. What once took a large team of engineers can now be accomplished by one of any size.
The lessons learned from building this middleware framework are applicable to any service. By solving hard problems in your load balancers, you can benefit every application or service you run.
Emil Stolarsky, Production Engineer, Shopify
Emil is a production engineer at Shopify where he works on performance, scriptable load balancers, and DNS tooling. When he's not trying to make Shopify's global performance heat map green, he's shivering over a spiked cup of coffee in the great Canadian north.
InStream: Large Scale Distribution using BitTorrent, Python, Salt, and Kafka
Harsh Sharma, LinkedIn
Deploying applications and services to all the servers across every datacenter can be painful for any company with a big infrastructure, including LinkedIn.
Our deployment model had some centralized pieces which became bottlenecks at scale. This talk will describe how we built a service in Python, based on Saltstack and Kafka, which can deploy any service to all servers asynchronously with a P2P distribution model, rate limiting and fast rollbacks.
Harsh Sharma, LinkedIn
I've been an SRE at LinkedIn for over a year, working with Platform and Horizontal teams, and as one of the primary owners of InStream, building internal tools and supporting different platform services. I enjoy being an SRE and wish to contribute as much as I can to the global SRE community.
Meeting Room 1+2
Networks for SREs: What Do I Need to Know for Troubleshooting Applications
Michael Kehoe, LinkedIn
All of us depend on the underlying network to be stable, whether in the datacenter or in the cloud. We all have a basic knowledge of how traditional networks run. However, in the past 10 years, we’ve moved to building redundant physical topologies in our networks, optimized the routing methodologies accordingly, moved into the cloud, and gotten greater visibility and tuneables in the Linux kernel network stack. A lot has changed!
However, the way we troubleshoot the network in relation to the applications we support hasn’t adapted. In this session, we’ll review the progress that network infrastructure has made, look at specific examples where traditional troubleshooting responses fail us, and demonstrate our need to rethink our approach to making applications and the network interact harmoniously.
Michael Kehoe, LinkedIn
Michael Kehoe, Staff Site Reliability Engineer in the Production-SRE team, joined the LinkedIn operations team as a New College Graduate in January 2014. Prior to that, Michael studied Engineering at the University of Queensland (Australia) where he majored in Electrical Engineering. During his time studying, he interned at NASA Ames Research Center working on the PhoneSat project.
Anycast Is Not Load Balancing
Murali Suriar, Google
We'll discuss IP anycast (what it is, how it works), what use cases it's more or less suited to, and some of the complexity it introduces (complete with war stories).
Murali Suriar, Google
Lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running software defined network control systems. Left Google to get on a boat. Got bored and came back.
Meeting Room 8
SRE Your gRPC—Building Reliable Distributed Systems (Workshop)
Grainne Sheerin, Gabe Krabbe, and Lisa Carey, Google
Distributed systems have sharp edges, and we have a wealth of experience in cutting ourselves on them.
In this workshop, participants will learn how to specify and use gRPC-based services (including an introduction to protocol buffers). Particular emphasis will be placed on engineering for reliability in the face of inevitable failures and errors. This will include identifying and implementing appropriate strategies for different requirements and circumstances, as well as enabling effective debugging through strong instrumentation.
All topics covered will include hands-on coding exercises.
Participants need to have a working knowledge of C++, Go, Java, or Python, and must bring a laptop running the Chrome browser (and a suitable charger).
Grainne Sheerin, Google
Grainne is a Site Reliability Engineer for Google Ireland. She's a tech lead responsible for Ad Serving infrastructure and has 5 years of experience in production engineering. She is a physicist, earning a doctorate in Nanoscience from Dublin City University. Prior to Google, she masqueraded as a strategic relationship manager for Reuters and a network engineer for HEAnet.
Gabe Krabbe, Google
Gabe Krabbe has been a Site Reliability Engineer at Google for over 12 years. He has worked on, and sometimes against, multiple generations of the Ads management and serving infrastructure. Before joining Google, he worked for various companies as a system administrator. He frequently tells his servers and his children that he doesn't care who started it, because it takes two to fight.
Lisa Carey, Google
Lisa Carey is a Technical Writer for Google Cloud Platform in Dublin. She has written documentation for many technologies including Protocol Buffers, gRPC, and Cloud APIs, and regularly runs writing workshops for Google engineers. She holds degrees from Trinity College Dublin.
Meeting Room 9
Mastering Linux Performance Tools
Sasha Goldshtein, CTO, Sela Group
All kinds of applications run on Linux, from web servers to distributed database engines and embedded applications. Troubleshooting performance in the field, especially when invasive profilers can't be used, is a delicate art that requires a solid understanding of the system and low-overhead tools. In this workshop, we will visit a spectrum of Linux performance monitoring tools.
We will start with a simple performance checklist based on the USE method, including tools like top, iostat, vmstat, mpstat, sar, and others. Then, once we identify the overloaded resource, we will dig in deeper using perf: tracepoints, hardware events, dynamic probes, and USDT. We will also collect stack traces of heavy events (CPU usage, disk accesses, network) and visualize them using flame graphs.
Finally, we will discuss the emerging superpower for Linux performance monitoring: BPF and BCC. This is a new kernel technology that enables low-overhead, super-efficient monitoring and tracing tools, which perform aggregation closer to the source where the events occur and provide useful information at a fraction of the cost. We will review a performance checklist based on BCC tools, and explore one-liners from the general-purpose trace and argdist tools.
Sasha Goldshtein, CTO, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft MVP, Pluralsight author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor to projects focused on system diagnostics, performance monitoring, and tracing—across multiple operating systems and runtimes. Sasha authored and delivered training courses on Linux performance optimization, event tracing, production debugging, mobile application development, and modern C++. Between his consulting engagements, Sasha speaks at international conferences world-wide.
15:00–15:40
Break with Refreshments
Prefunction
15:40–17:00
Lansdowne Room
Use Load Testing to Build a Proper Mental Model of Your Service
John Looney, Intercom
Large organisations often have teams dedicated to building and using load test frameworks for their production services. Intercom's engineering team was too small to have accurate load tests for all of its systems, but as we acquired larger customers, having accurate load-test numbers and being able to communicate them to the business became more critical. This talk will cover some things we learnt about load testing, and how it changed our mental models of some of our infrastructure.
John Looney, Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know the best use of their time and energy, but still hasn't worked out how to not burn himself out occasionally.
Traffic Steering using RUM DNS @ LinkedIn
Abhijeet Rastogi, LinkedIn
Do you serve customers across the world with varying network conditions? Do you struggle with automatically sending your users away from an unavailable POP/DC to the next best one? Wouldn’t it be great if you magically knew about the regional issues related to last mile connectivity from the users?
LinkedIn uses real user measurements backed by triple-vendor DNS. These measurements are collected from members’ browsers to gain insight into our performance from every last mile. We then leverage Big Data to send members to the closest edges in real time and deliver a fast member experience.
LinkedIn members from Mumbai visiting LinkedIn will be sent to our Mumbai POP using the member’s geolocation. If the POP is unreachable, member browsers elsewhere will report this to our RUM backend and our DNS will learn to stop resolving to the unreachable POP - all within a matter of a few seconds.
Attendees will learn about:
- Web Performance: CDNs, POPs, RUM steering
- Multi-vendor strategy for redundancy & performance
- Tools for cross-vendor consistency, vendor fail-out, global site monitoring & status boards
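The steering idea described above can be sketched as a small selection function: given recent browser measurements per POP, drop any POP that looks unreachable and pick the fastest of the rest. This is a simplified illustration under assumed inputs, not LinkedIn's system; `pick_pop`, the 90% success threshold, and every POP name other than Mumbai are hypothetical.

```python
from statistics import median


def pick_pop(samples, min_success_rate=0.9):
    """Pick the POP with the lowest median latency among healthy POPs.

    `samples` maps a POP name to recent measurements, where each entry is
    a latency in milliseconds, or None if the browser could not reach it.
    """
    best, best_latency = None, None
    for pop, results in samples.items():
        ok = [r for r in results if r is not None]
        # Skip POPs with no data or too many failed fetches.
        if not ok or len(ok) / len(results) < min_success_rate:
            continue
        latency = median(ok)
        if best_latency is None or latency < best_latency:
            best, best_latency = pop, latency
    return best


# Hypothetical measurements: Mumbai is mostly unreachable right now.
measurements = {
    "mumbai": [None, None, 30, None],
    "singapore": [80, 90, 85],
    "london": [150, 160, 155],
}
print(pick_pop(measurements))
```

In this example the Mumbai POP is skipped because most fetches failed, so traffic goes to the next fastest healthy POP.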
Abhijeet Rastogi, LinkedIn
Abhijeet has been working with LinkedIn for around 2.5 years and a total of 5 years as an SRE. He joined LinkedIn with experience of architecting VPS hosting using OpenStack and hosting email infrastructure at scale. He has worked on managing DNS, CDN and Traffic infrastructure at LinkedIn.
He has been a contributor to Logstash and his current favorite language is Golang. In his free time, he tries to find reasons for writing more Golang code to make his life easier.
Pembroke Room
Capturing and Analyzing Millions of Queries without Any Overhead
Karthik Appigatla, Sr. Database Engineer, LinkedIn, and Basavaiah Thambara, LinkedIn
This talk is about a new way of monitoring and analyzing millions of queries with no overhead.
Optimizing queries is the most important aspect of scaling database servers. Before we can optimize, we need to identify the problematic queries. MySQL has a slow query log: we set a threshold, and every query that crosses it is logged to a file for later analysis. The other way is to use the performance_schema database inside MySQL, which gives various per-query metrics.
The problem is that enabling the slow query log will incur a 25-35% overhead on the database, since we have to write to a file. Additionally, since only queries exceeding the threshold will be logged, we won't have any data about queries below that threshold. Meanwhile, enabling performance_schema incurs a 10-20% overhead, and is complex to understand.
To minimize overhead and effectively measure all queries, we have built a query analyzer which incurs less than 3% CPU overhead and no overhead on any other resources.
Karthik Appigatla, Sr. Database Engineer, LinkedIn
Karthik Appigatla is a database evangelist currently working for LinkedIn, Bangalore. Earlier he worked for companies like Yahoo, Pythian and Percona. Please check his LinkedIn profile to learn more about his work. LinkedIn Profile: https://www.linkedin.com/in/appigatla/
Basavaiah Thambara, Sr. Database Engineer, LinkedIn
Basavaiah Thambara (Basu) has a decade of experience designing, building and scaling MySQL databases. He is currently working as a staff database engineer at LinkedIn managing Espresso, an in-house distributed NoSQL datastore. He currently lives in Bangalore, India. https://in.linkedin.com/in/basavaiaht
OK Log: Distributed and Coördination-Free Logging
Peter Bourgon, Fastly
This talk explores the motivation, design, prototype, and optimization of OK Log, a distributed and coördination-free log system for big ol' (cloud-native) clusters.
We first motivate the need for such a system, setting it apart from existing products like Elasticsearch. Then, we carve out a solution in the distributed systems space, paying due homage to the old gremlins of consistency and coördination. Finally, we review the component and architecture model, and demonstrate how it copes with typical operations and failure modes.
This talk is about an open-source product, but it is not a product pitch. Instead, it's meant to be a case study of a learning exercise: approaching a deceptively subtle problem domain from first principles, and using methodological software engineering to derive a solution. I hope it inspires others to reach for something more self-actualizing than the plumbing together of databases and message busses.
Peter Bourgon, Fastly
Peter Bourgon is a Go aficionado and is quite keen on distributed systems. He's written Go kit, a toolkit for microservices in Go, among several other OSS projects. He is a professional typist, and has typed for Bloomberg, SoundCloud, and Weaveworks; he currently types for Fastly, as a member of their Data infrastructure team.
Meeting Room 1+2
Bots Are Fast, Humans Are Smarter—Eliminate Unwanted Traffic and Defend Against DDoS
Felix Glaser, Shopify
In a world with ever-growing DDoS attacks, L7 attacks give even the most experienced engineers the sweats. Imagine if instead of following easy to detect patterns, bots could mimic the behaviour of customers. Well, that’s exactly what Shopify sees every day during flash sales.
Come and learn how we block nearly all bot traffic on our load balancers without any human intervention. We will share our challenges of differentiating between web crawlers and bots, users behind NATs and bots rotating user agents, as well as fast humans and browser extensions. When the stakes are blocking a customer completing a checkout, misclassification isn’t an option.
This is not yet another machine learning talk, but an example of how simple statistics, heuristics and some sane limits can give great results with minimal complexity. The lessons learned in this talk are applicable to any real-world problem with inexact constraints.
Felix Glaser, Shopify
Felix is a Production Engineer at Shopify where he thinks about how to keep its platform (and merchants!) safe. When he isn’t writing code he likes to climb, cycle and camp in the Rockies in Canada.
Google SDN Peering: An Early Engagement Case Study
Murali Suriar, Google
How do you build a new SRE team around a completely novel product? This talk will deal with some of the challenges involved in launching Espresso, Google's software defined peering architecture.
- How do you build an SRE team for a product which isn't serving real users yet?
- How do you build a cohesive team and structure out of many disparate teams? (Networking, SRE, software development)
- How do you build oncall discipline in a team which largely hasn't been oncall before?
And as an aside, we'll also get into some of the technical details of Espresso, since it's necessary to understand what made it so challenging and different.
Murali Suriar, Google
Lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running software defined network control systems. Left Google to get on a boat. Got bored and came back.
Meeting Room 8
(Continued from previous session)
SRE Your gRPC—Building Reliable Distributed Systems (Workshop)
Grainne Sheerin, Gabe Krabbe, and Lisa Carey, Google
Distributed systems have sharp edges, and we have a wealth of experience in cutting ourselves on them.
In this workshop, participants will learn how to specify and use gRPC-based services (including an introduction to protocol buffers). Particular emphasis will be placed on engineering for reliability in the face of inevitable failures and errors. This will include identifying and implementing appropriate strategies for different requirements and circumstances, as well as enabling effective debugging through strong instrumentation.
All topics covered will include hands-on coding exercises.
Participants need to have a working knowledge of C++, Go, Java, or Python, and must bring a laptop running the Chrome browser (and a suitable charger).
Grainne Sheerin, Google
Grainne is a Site Reliability Engineer for Google Ireland. She's a tech lead responsible for Ad Serving infrastructure and has 5 years of experience in production engineering. She is a physicist, earning a doctorate in Nanoscience from Dublin City University. Prior to Google, she masqueraded as a strategic relationship manager for Reuters and a network engineer for HEAnet.
Gabe Krabbe, Google
Gabe Krabbe has been a Site Reliability Engineer at Google for over 12 years. He has worked on, and sometimes against, multiple generations of the Ads management and serving infrastructure. Before joining Google, he worked for various companies as a system administrator. He frequently tells his servers and his children that he doesn't care who started it, because it takes two to fight.
Lisa Carey, Google
Lisa Carey is a Technical Writer for Google Cloud Platform in Dublin. She has written documentation for many technologies including Protocol Buffers, gRPC, and Cloud APIs, and regularly runs writing workshops for Google engineers. She holds degrees from Trinity College Dublin.
Meeting Room 9
(Continued from previous session)
Mastering Linux Performance Tools
Sasha Goldshtein, CTO, Sela Group
All kinds of applications run on Linux, from web servers to distributed database engines and embedded applications. Troubleshooting performance in the field, especially when invasive profilers can't be used, is a delicate art that requires a solid understanding of the system and low-overhead tools. In this workshop, we will visit a spectrum of Linux performance monitoring tools.
We will start with a simple performance checklist based on the USE method, including tools like top, iostat, vmstat, mpstat, sar, and others. Then, once we identify the overloaded resource, we will dig in deeper using perf: tracepoints, hardware events, dynamic probes, and USDT. We will also collect stack traces of heavy events (CPU usage, disk accesses, network) and visualize them using flame graphs.
Finally, we will discuss the emerging superpower for Linux performance monitoring: BPF and BCC. This is a new kernel technology that enables low-overhead, super-efficient monitoring and tracing tools, which perform aggregation closer to the source where the events occur and provide useful information at a fraction of the cost. We will review a performance checklist based on BCC tools, and explore one-liners from the general-purpose trace and argdist tools.
Sasha Goldshtein, CTO, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft MVP, Pluralsight author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor to projects focused on system diagnostics, performance monitoring, and tracing—across multiple operating systems and runtimes. Sasha authored and delivered training courses on Linux performance optimization, event tracing, production debugging, mobile application development, and modern C++. Between his consulting engagements, Sasha speaks at international conferences world-wide.
17:00–18:00
Happy Hour
Sponsored by Facebook
Herbert Room
18:00–21:00
Birds-of-a-Feather Sessions (BoFs)
BoFs, or Birds-of-a-Feather sessions, are an opportunity for informal and ad hoc discussion of a topic of shared interest between a group of conference attendees. SREcon Europe will have BoF sessions on the Wednesday and Thursday evenings. While some topics will be pre-arranged, there are slots for additional BoF sessions. At the event, we encourage attendees to suggest additional BoF topics of interest, and we will provide facilities for groups to meet and discuss them.
Go to the BoFs page for information on scheduling BoFs.
31 August 2017
08:00–09:00
Morning Coffee and Tea
Prefunction
09:00–10:20
Lansdowne Room
How We Try to Make a Lion Bulletproof; Setting Up SRE in a Global Financial Organization
Janna Brummel and Robin van Zijll, ING
By now, most of us have read the O’Reilly book about Google SRE or have heard of other tech companies’ SRE teams. Our story is about doing SRE in a more traditional and more regulated environment: in the largest bank in the Netherlands.
This talk will address the history, present and future of SRE within a financial institution with a BizDevOps way of working. In doing so, we will talk about how our journey started at SREcon in Dublin last year, why we wanted and need to do SRE within ING, how our SREs started, the distributed way of working of SRE within ING, what technologies we actually work with and why, and our plans for the future.
Lastly, we will share our SRE dos and don'ts after a year of experience. We hope our talk inspires engineers or tech leads to start doing SRE within their (possibly more traditional) company and that those who have already implemented SRE can learn from our journey.
Janna Brummel, ING
Janna Brummel currently works as an IT chapter lead (a line manager who still does day-to-day work) to the SRE team of ING Domestic Bank the Netherlands in Amsterdam, the Netherlands. Previously, Janna worked as business manager to the CIO of ING Domestic Bank the Netherlands and as a dev engineer developing software for the debit and credit card back-end systems of ING.
Robin van Zijll, ING
Robin van Zijll is a site reliability engineer and product owner to the SRE team of ING Domestic Bank the Netherlands in Amsterdam, the Netherlands. He also has years of experience in being on call for all functionalities used by retail banking customers.
From Firefighting to Proactive Work: the Journey of a Small Infrastructure Team in a Hyper Growth Environment
Alex Gerlic, Intercom
Due to an incident on our main datastore, we reacted and spent an entire week trying to keep Intercom up, with the help of 20 engineers from other teams. During this tough week, we were obliged to drop all other projects and focus on building a firefighting organization.
After the urgency period, it became evident to us that we needed to focus on proactive work to prevent the incident from happening again. It was the launch-pad for the conception of a brand-new organization for our team, focusing on ownership and high impact work.
A few months later, results ruled in favour of our hard work: we’ve reduced system interruptions by more than 80%! But good news and radical changes also come with consequences: we need to deal with multiple implications and drastically change the way we work as a team.
During this talk we will cover:
- our journey from a firefighting to a proactive work organization
- good and bad organizational decisions we made
- impacts on the morale of the team
Pembroke Room
Incident Command at the Edge
Lisa Phillips, Fastly
As a content delivery network, Fastly operates an edge environment for many large-scale web properties and APIs. In order to deal with emerging threats to its network, Fastly needed to develop processes that allowed it to respond effectively to availability and security incidents at scale. The network engineering, SRE, and security teams at Fastly leverage a protocol called “Incident Command” to rapidly engage various teams across the company, and make sure customer properties are protected. Let the Fastly VP of SRE take you to the far side of the edge, and learn more about the challenges a large global network faces and the protocols that we found worked for us.
Lisa Phillips, Fastly
Lisa Phillips is a leader in reliability, with particular interest in social media and speeding up content delivery. She has worked for 20 years in tech and database operational roles for the large sites LiveJournal, Six Apart, and Twitter, where she helped kill the fail whale. Lisa is returning from a year of world travelling and is happy to have landed at Fastly as Vice President of Site Reliability Engineering.
Resiliency Testing with Toxiproxy
Jake Pittis, Shopify
Fibers get cut, databases crash, and you’ve adopted Chaos Engineering to challenge your production environment as much as possible. But what are you doing to craft the resiliency test suites that minimize the impact of failure on your application as much as possible? How do you debug resiliency problems locally and make sure single points of failure don't creep into the application in the first place? We’ve used the open-source Toxiproxy for the past two years to emulate timeouts, latency and outages in development environments. This talk will equip you with the tools to start writing resiliency test suites to harden your own applications, to supplement other chaos engineering practices.
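To give a flavour of what a resiliency test checks, here is a minimal sketch that does not use Toxiproxy itself. It injects latency with an in-process wrapper instead of a network proxy, and every name in it (`with_latency`, `fetch_recommendations`, the timeout values) is hypothetical.

```python
import concurrent.futures
import time


def with_latency(func, delay_s):
    """Wrap a dependency call so it responds only after an injected delay."""
    def slow(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return slow


def fetch_recommendations(backend, timeout_s=0.2):
    """Call a non-critical backend; degrade to an empty list if it is slow or down."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(backend)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Timeouts and backend errors both fall back instead of failing the page.
        return []
    finally:
        # Do not block the caller on a backend call that is still hanging.
        pool.shutdown(wait=False)


def healthy_backend():
    return ["item-1", "item-2"]


def broken_backend():
    raise ConnectionError("backend is down")


normal = fetch_recommendations(healthy_backend)
degraded = fetch_recommendations(with_latency(healthy_backend, 1.0))
failed = fetch_recommendations(broken_backend)
```

The point of such a test is that the slow and broken cases are asserted explicitly, so a change that removes the fallback fails in development rather than in production.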
Jake Pittis, Shopify
In between teaching his team about jazz, Jake can be found on the Production Engineering Team at Shopify. He's worked on preparing the platform for massive celebrity sales, on making Shopify run out of multiple data centres, and on the resiliency stack that protects the app against misbehaving resources, and itself. Canadian Geese are his favourite animals. While the hipster movement of his nation has recently taken to eating these poor birds, Jake has yet to taste one. And never plans to. We don't eat our friends.
Meeting Room 1+2
Deploying Changes to Production in the Age of the Microservice
Samantha Schaevitz, Google
You decoupled your APIs from their implementations and put them behind RPC interfaces. You build and deploy services independently. Your code health is impeccable. You put your user data in a persistent, replicated, and consistent store, where it belongs. Your developer velocity has skyrocketed.
Now we have new problems. We’ve got N independent services with M edges of interaction between them. That’s N services that need to be built, tested, and deployed on the infrastructure that expected you to have one service whose mess of entanglement was a secret you had with the compiler.
How do we deploy N binaries with N sources of static configuration and M sources of runtime configuration safely without losing our collective minds? In this talk, I’ll share some of how we grew that aforementioned N from 1 to many in Gmail. Specifically:
- Consistent naming schemas for services and environments
- Maintaining lightweight, easy-to-change production configuration abstraction layers
- Release early, often
- Canary everything by sharding into more A/B environments than you'd think you’d need
- Encourage backwards compatibility in all APIs
- Validate and test all configuration before changing global state
And, of course, some of the things we (Gmail) learned by breaking things along the way.
Samantha Schaevitz, Google
Samantha Schaevitz is the Technical Lead for the SRE team responsible for Gmail and Calendar at Google Zürich. Originally from California, she studied Computer Science (with a minor in French) at the University of California, Berkeley. Her maximum latitudinal position is 67.853°.
Application Automation with Habitat
Mandi Walls, Chef Software
Container Orchestration Systems make for a great operational experience for deploying and managing containers. But that’s only part of the story when running containers in production. How do you build containers that contain only what you need (like no build systems/tools)? How do you orchestrate configuration of your application after the containers have been launched? How do you make it easy to modify an application config while keeping the containers immutable? How can you give your developers a means to declare dependencies for their applications?
Habitat, our open-source project for application automation, simplifies container management by packaging applications in a compact, atomic, and easily auditable format that makes it easier to deploy your application on various container runtimes and manage them over their lifecycle.
Mandi Walls, Chef Software
Mandi Walls is Technical Community Manager, EMEA at Chef. For Chef, she travels the world helping technology organizations increase their effectiveness using configuration management and modern IT practices. She is a regular speaker at technical conferences, and is the author of the whitepaper “Building a DevOps Culture” published by O’Reilly. She is interested in the emergence of new tools and workflows to make the task of operating large complex computing systems more approachable.
Meeting Room 8
Distributed Systems Reasoning
John Looney, Intercom, and Theo Schlossnagle, Circonus
All distributed systems make tradeoffs and compromises. Different designs behave very differently with respect to cost, performance, and how they behave under failure conditions.
It's important to understand the tradeoffs that the building blocks in your systems make, and the implications this has for your system as a whole. In this workshop we'll look at several examples of different real-world distributed systems and discuss their strengths and shortcomings.
This workshop will include some practical elements. You will be given some system designs to read and to evaluate, and then we'll discuss the implications of each design together as a group.
John Looney, Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know the best use of their time and energy, but still hasn't worked out how to not burn himself out occasionally.
Theo Schlossnagle, Circonus
Theo Schlossnagle is the founder and CEO of Circonus. Previously, he founded OmniTI, the go-to source for organizations facing today’s most challenging scalability, performance, and security problems; was the founder of Message Systems, Inc. (now SparkPost); and researched distributed systems at JHU. Theo is the author of Scalable Internet Architectures (Sams) and a frequent speaker at IT conferences worldwide. He is a member of the IEEE and a senior member of the ACM and serves on the editorial board of the ACM’s Queue magazine. He holds undergraduate and graduate degrees in computer science from Johns Hopkins University.
Meeting Room 9
Tech Writing 101 for SREs
Lisa Carey and Betsy Beyer, Google
From post-mortems to operations manuals to code comments, writing things down for others is an unavoidable part of the life of an SRE.
In this workshop, you’ll learn writing principles to help you present technical information from two experienced Google technical writers - and each other! Through a series of pair-work exercises you’ll work through a variety of topics to improve the clarity, readability, and effectiveness of your writing, and possibly think about a toothbrush like you’ve never thought about one before. If you've never before taken any technical writing training, this workshop is perfect for you. If you've taken technical writing training, this class will serve as a great refresher.
There is a small amount of pre-reading (https://goo.gl/ssAALV) for participants in this workshop (~30 minutes of reading about basic technical writing concepts).
The workshop runs for two hours with a short break.
Lisa Carey, Google
Lisa Carey is a Technical Writer for Google Cloud Platform in Dublin. She has written documentation for many technologies including Protocol Buffers, gRPC, and Cloud APIs, and regularly runs writing workshops for Google engineers. She holds degrees from Trinity College Dublin.
Betsy (Adrienne) Beyer, Google
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. She holds degrees from Stanford and Tulane.
10:20–10:50
Break with Refreshments
Prefunction
10:50–12:30
Lansdowne Room
Building a Culture of Reliability
Arup Chakrabarti, PagerDuty
Getting customers to care about Reliability is hard. Getting stakeholders to care about Reliability is harder. Getting the entire company to care about Reliability is even harder.
In this talk, I will cover the steps that every leader in any organization can take to get more people to care about Reliability. Because Reliability is one of those things that people only notice when it goes in the wrong direction, it can be hard to show the value of it and why it is so important.
We will walk through cultural and management changes, metrics to watch and obsess over, and some tooling that can help along the way.
Arup Chakrabarti, PagerDuty
Arup has been working in the space of software operations since 2007. He started out as an Operations Engineer at Amazon, helping to reduce customer defects with multiple teams for the Amazon Marketplace. Since then, he has managed and built operations teams at Amazon and Netflix to help improve availability and reliability. He currently works at PagerDuty, where he is part of the Infrastructure Engineering group.
Tech Leadership in SRE
Sean Rees, Google
The job of a technical lead (TL) is probably not what you think it is.
The role of a TL in a team is as much about fostering social bonds as it is about technical excellence. For a tech lead, it's not enough to be a technical expert; they must also foster unity within a team so they can move forward, together, towards the same goal.
This talk will (quickly) cover the role of a tech lead, some myths about tech leads, interactions/overlaps with other team roles (e.g., manager, individual contributor), and how to best make use of your own TL.
Sean Rees, Google
Sean Rees is a long-time SRE tech lead in Google, now working in the networking space after stints in storage and ads.
Pembroke Room
Case Study: Lessons Learned from Our First Worldwide Outage
Yoav Cohen, Imperva Incapsula
Last year, on March 10, Incapsula experienced the first worldwide outage in its history… While relatively short in duration, it affected thousands of websites that rely on our security and acceleration every day.
Rooted in a 3-year-old dormant bug in our IncapRules code, this outage made us realize there were changes we needed to make in the way we write and qualify code. As VP of Engineering, the faulty code and our testing procedures are my responsibility, and it was up to me to lead the team to achieve an order of magnitude higher reliability.
One of the key things we were missing was a way to propagate customer configuration across our network quickly without compromising safety. The result was a new configuration sandbox system that achieves this.
In this talk I’ll present the process we took to analyze the true reliability of our system and the framework we use to reason about it, to prioritize tasks across teams and to design a more reliable service.
Yoav Cohen, Imperva Incapsula
Yoav is VP of Engineering for Imperva Incapsula, and has been with the company since they made their first sale. In between meetings you will find him working on build systems or nasty performance bugs. When not doing so, he tries to sneak a few minutes on his guitar or doing laps in the pool. Yoav holds an M.Sc. in Computer Science from Tel-Aviv University, where he studied multi-core programming.
When Trouble Comes to Town
Michael Gorven, Facebook
One's inclination when tackling an incident is usually to dive to the bottom of the stack where the problem is occurring and start debugging the root cause. However, it's important to first take a step back and approach the incident at a high level to ensure the fastest and most efficient resolution possible. This talk proposes seven steps to consider when tackling an incident: assessing the impact; communicating internally; looking for what changed; trying to mitigate; investigating the root cause; confirming resolution; and documenting and following up. It also touches on various tools which help with these steps.
Michael Gorven, Facebook
Michael Gorven is a Production Engineer at Facebook, where he works on the Web Foundation team and previously worked on Instagram. He fixes things when they break, improves the reliability of the system, helps engineer it to scale, and reverts diffs. Previously he was an early employee at South African startup Nimbula. Michael grew up in Durban and holds a BSc in Electrical and Computer Engineering from the University of Cape Town. He currently lives in London with his wife and two young children after spending five years in California.
Meeting Room 1+2
One Ring to Rule them...
John Tobin, Google
Rollout automation is something that every service and team needs, and many reinvent the wheel. I'll talk about:
- why the wheel gets reinvented
- a system design that discourages reinvention, including an architecture diagram
- the organisational challenges encountered when converting many services to use this new system design
- how well the conversion attempt worked in practice
This is based on my experience initiating and running a program to replace rollout automation across Storage SRE in Google.
John Tobin, Google
John Tobin manages Bigtable SRE and Cloud Bigtable SRE at Google Dublin, and has worked on several of Google's storage systems. He is currently involved in efforts to improve collaboration between teams across Storage SRE - standardising tools and processes, reducing duplicated effort, decreasing toil and snowflakes. He holds an MSc from Trinity College Dublin, where he worked before joining Google in 2010.
Dancing with Squads—Do you know what your Code Repos are Telling You?
Don Cronin and Rob Orr, IBM
Have you ever wondered why certain service teams are always at the center of issues? Code commits fail in certain areas; have you ever wondered why? In our quest to understand our services and data by looking from the outside in, we will take the audience through how our data scientists, collaborating with University Research, developed a methodology to understand the areas that most impacted site reliability, and how the SRE team could use this data to develop new policies.
We found ourselves asking: are you listening to what your code and issue history is telling you? Do you know what risks you are taking with your code? How do the squad organization and climate create patterns that impact availability? This session will take the audience through how we answered those questions and more about our own code.
Don Cronin, IBM
Don has more than 25 years of experience in developing software. He currently leads the DevOps Analytics mission. His focus is on improving the DevOps lifecycle using big data techniques so developers can deliver greater quality with faster velocity. Previously he led an adtech group, incubating key technologies for cloud like Bluemix. He has also led development for Management, Security, and Networking software. Don has bachelor's degrees in Computer Science and Mathematics from the University of Pittsburgh and a master's in computer science from Syracuse University. In his free time he likes to run, develop his own website and listen to music.
Rob Orr, IBM
Rob Orr is a Program Director at IBM Cloud and leads the SRE team responsible for Bluemix DevOps Services. These services include developer tools for Toolchains, Pipeline, third-party integrations, and Git repository services. His current projects include automation tooling, developing SLIs, and predictive monitoring. Mr. Orr joined IBM in 1984 and has held various positions throughout his career with IBM, including leadership roles in Audit, Security, Service Management, and Operations. Rob holds a bachelor’s degree in Computer Science from the University of Maryland, holds 5 patents, and has published papers in the IBM Systems Journal on IT Standards and Systems Management.
Meeting Room 8
(Continued from previous session)
Distributed Systems Reasoning
John Looney, Intercom, and Theo Schlossnagle, Circonus
All distributed systems make tradeoffs and compromises. Different designs behave very differently with respect to cost, performance, and how they behave under failure conditions.
It's important to understand the tradeoffs that the building blocks in your systems make, and the implications this has for your system as a whole. In this workshop we'll look at several examples of different real-world distributed systems and discuss their strengths and shortcomings.
This workshop will include some practical elements. You will be given some system designs to read and to evaluate, and then we'll discuss the implications of each design together as a group.
John Looney, Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know the best use of their time and energy, but still hasn't worked out how to not burn himself out occasionally.
Theo Schlossnagle, Circonus
Theo Schlossnagle is the founder and CEO of Circonus. Previously, he founded OmniTI, the go-to source for organizations facing today’s most challenging scalability, performance, and security problems; was the founder of Message Systems, Inc. (now SparkPost); and researched distributed systems at JHU. Theo is the author of Scalable Internet Architectures (Sams) and a frequent speaker at IT conferences worldwide. He is a member of the IEEE and a senior member of the ACM and serves on the editorial board of the ACM’s Queue magazine. He holds undergraduate and graduate degrees in computer science from Johns Hopkins University.
Meeting Room 9
(Continued from previous session)
Tech Writing 101 for SREs
Lisa Carey and Betsy Beyer, Google
From post-mortems to operations manuals to code comments, writing things down for others is an unavoidable part of the life of an SRE.
In this workshop, you’ll learn writing principles to help you present technical information from two experienced Google technical writers - and each other! Through a series of pair-work exercises you’ll work through a variety of topics to improve the clarity, readability, and effectiveness of your writing, and possibly think about a toothbrush like you’ve never thought about one before. If you've never before taken any technical writing training, this workshop is perfect for you. If you've taken technical writing training, this class will serve as a great refresher.
There is a small amount of pre-reading (https://goo.gl/ssAALV) for participants in this workshop (~30 minutes of reading about basic technical writing concepts).
The workshop runs for two hours with a short break.
Lisa Carey, Google
Lisa Carey is a Technical Writer for Google Cloud Platform in Dublin. She has written documentation for many technologies including Protocol Buffers, gRPC, and Cloud APIs, and regularly runs writing workshops for Google engineers. She holds degrees from Trinity College Dublin.
Betsy (Adrienne) Beyer, Google
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. She holds degrees from Stanford and Tulane.
12:30–13:30
Conference Luncheon
Sponsored by Google
Sussex Restaurant and Herbert Room
13:30–15:30
Lansdowne Room
Building an SRE Capability Inside a Large Organization
Sriram Gollapalli, Agilent Technologies, Inc.
Agilent Technologies, while traditionally known as a hardware company, has started to deliver several Software-as-a-Service offerings through acquisitions and in-house product development. This talk will discuss how to transform the thinking across a decentralized, large corporate organization to approach software delivery differently. Traditionally, we are used to delivering software by burning and shipping CDs/DVDs with annual release cycles. Fast-forward to today: SaaS products that use CI/CD approaches and release as frequently as twice a week require a completely different operating model and don’t fit the traditional organizational framework and processes.
Sriram Gollapalli, Agilent Technologies, Inc.
Sriram Gollapalli is the Director of Technology in the CrossLab Group at Agilent Technologies, Inc. He was the Co-Founder, CTO/COO at iLab Solutions from September 2006 until it was acquired by Agilent in August 2016. The iLab Operating Software is a SaaS product serving academic and cancer research institutions. Prior to that, he was a Consultant with Deloitte Consulting for several years. He focused on developing and executing IT cost reduction and process improvement strategies for Fortune 250 clients and post-merger IT management support. He received a BS in Computer Science and a Masters in IS from Carnegie Mellon University.
The Why, What, and How of Starting an SRE Engagement
Richard Clawson and Josh Gilliland, Microsoft Azure
One of the hardest things to do is trust an outside voice. What are the boundaries between live site features and service features? How much expertise is required to be on-call? Who decides what’s in the best interests of the service? How is this not another Ops team or a staff augment? Who’s "in charge" and who makes prioritization calls? How do you build mutual trust? These are just some of the challenges in building a successful partnership between a product group and SRE.
In this talk we will present what we learned about the technical, organizational, and political systems that were needed to provide SRE to the Azure Internet-of-Things product group and how this can be used as a template for your services. We will discuss how to start an engagement, build partnerships and trust across organizations, provide ROI, and keep a distinct identity, as well as the frameworks that were developed to maintain tight organizational alignment, including a new take on error budgets.
Let’s continue the conversation!
Richard Clawson, Microsoft Azure
Richard Clawson is a Site Reliability Engineer working on the Azure SRE Team. He is part of the team in Azure that is working to improve operations across the Azure stack. Currently he is focused on creating repeatable patterns and practices for SRE engagements. Before Azure he was a software engineering manager on the Cortana speech platform and on the MSN publishing platform.
Josh Gilliland, Microsoft Azure
Startup Systems Engineer's Instruction Manual
Effie Mouzeli, Logicea LLC
What happens when you take the leap of faith and leave the security of a systems engineering team to become the first systems person at a startup? What should you expect?
This talk is about the challenges of being the sole systems engineer at a young company. The amount of work is overwhelming but the experience is worth it. We will explain the key elements of a newborn infrastructure and the stages leading it to maturity.
The challenges of this role are not limited to solving technical problems. The habits, processes and standards you establish will pave the way to go from a single engineer to a team.
Effie Mouzeli, Logicea LLC
Effie is a Systems Engineer at Logicea, a young software house. Her main responsibilities are operations and automation (Deployment Pipelines, Configuration Management, etc.), assisting with product architecture, working closely with developers, and, occasionally, pulling rabbits out of hats and chasing them.
Pembroke Room
Cognitive Bias and On-Call
Niall Richard Murphy, Google
This talk will comprise:
- An analysis of a set of cognitive biases, with illustrated examples (e.g. anchoring/priming, substitution/availability, loss aversion, etc.)
- Introduction of Kahneman/Tversky's "System 1/System 2" hypothesis (i.e. that our mental architecture is divided quite sharply into two modes of thinking/being)
- Description of the on-call experience for SREs
- Relation of this to previous cognitive biases; assertion that on-call is actually about using humans' infinite "jump out of the system" ability, otherwise the software could just fix itself
- Description of techniques to move an engineer from System 1 to System 2 thinking (which is what you actually want)
- Call to action for self-healing software
The attendees will learn:
- What psychological tricks might affect their next on-call shift
- What to do about them
- Why on-call sucks (no, really) and why there may be a future without it
Niall Richard Murphy, Google
Niall Richard Murphy is the head of Ads Reliability Engineering for Google Ireland, where his group is responsible for the infrastructure underlying ~90% of Google's annual revenue. He is the instigator, co-author and co-editor of platinum-selling "Site Reliability Engineering" (O'Reilly, 2016), a history of the Irish Internet, and is the holder of degrees in Computer Science and Mathematics, and Poetry Studies. He lives in Dublin with his wife and two children.
Reducing MTTR and False Escalations: Event Correlation at LinkedIn
Michael Kehoe, LinkedIn
LinkedIn’s production stack is made up of over 900 applications, 2200 internal APIs and hundreds of databases. With any given application having many interconnected pieces, it is difficult to escalate to the right person in a timely manner.
To combat this, LinkedIn built an Event Correlation Engine that monitors service health and maps dependencies between services to correctly escalate to the SREs who own the unhealthy service.
We’ll discuss the approach we used in building a correlation engine and how it has been used at LinkedIn to reduce incident impact and provide better quality of life to LinkedIn’s on-call engineers.
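To make the idea of mapping dependencies to an escalation target concrete, here is a minimal sketch. It is not LinkedIn's engine: the service names, the owner table, and the rule "escalate to unhealthy services that have no unhealthy dependencies of their own" are all assumptions chosen for illustration.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: service -> services it calls.
DEPENDS_ON: Dict[str, List[str]] = {
    "frontend": ["profile-api"],
    "search": ["profile-api"],
    "profile-api": ["profile-db"],
    "profile-db": [],
}

# Hypothetical ownership table: service -> on-call team.
OWNERS: Dict[str, str] = {
    "frontend": "web-sre",
    "search": "search-sre",
    "profile-api": "profile-sre",
    "profile-db": "db-sre",
}


def transitive_dependencies(service: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service reachable from `service` by following dependencies."""
    seen: Set[str] = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        stack.extend(depends_on.get(dep, []))
    return seen


def likely_root_causes(depends_on: Dict[str, List[str]], unhealthy: Set[str]) -> Set[str]:
    """Unhealthy services with no unhealthy (transitive) dependency of their own."""
    roots: Set[str] = set()
    for service in unhealthy:
        deps = transitive_dependencies(service, depends_on)
        deps.discard(service)  # ignore a cycle that leads back to the service itself
        if not (deps & unhealthy):
            roots.add(service)
    return roots


def escalation_targets(
    depends_on: Dict[str, List[str]], owners: Dict[str, str], unhealthy: Set[str]
) -> Dict[str, str]:
    """Map each likely root-cause service to the team that owns it."""
    return {
        service: owners.get(service, "unknown-owner")
        for service in likely_root_causes(depends_on, unhealthy)
    }


UNHEALTHY = {"frontend", "profile-api", "profile-db"}
TARGETS = escalation_targets(DEPENDS_ON, OWNERS, UNHEALTHY)
print(TARGETS)
```

In this toy example three services are alerting, but only the owners of `profile-db` would be paged, because the other two depend on it. A real engine would also have to handle stale dependency data and services that are unhealthy for unrelated reasons.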
Michael Kehoe, LinkedIn
Michael Kehoe, Staff Site Reliability Engineer in the Production-SRE team, joined the LinkedIn operations team as a New College Graduate in January 2014. Prior to that, Michael studied Engineering at the University of Queensland (Australia) where he majored in Electrical Engineering. During his time studying, he interned at NASA Ames Research Center working on the PhoneSat project.
The Never-Ending Story of Site Reliability
Kurt Andersen, LinkedIn
I strongly believe that the commonly proposed bogey-man of "automating ourselves out of a job" betrays a simplistic and highly incomplete understanding of the SRE field. SRE teams can always grow and develop.
The Dreyfus model is a model of professional expertise that plots an individual's progression through a series of five levels: novice, advanced beginner, competent, proficient, and expert. The idea in this talk is to take aspects of SRE practice (such as monitoring, measurement against SLOs, incident management and postmortems, etc.) and provide indications of what these look like at the different Dreyfus levels, not so much at an individual level as at an organizational one.
Often, companies and teams will show uneven levels of proficiency - frequently due to pressures to develop some areas more than others. The intent of this talk is to provide a framework within which attendees can gauge their own company's progress and anticipate/plan for growing weak areas.
The concepts for this talk have emerged through exposure to companies at different skill/experience levels. It is not by any means definitive, but I hope to provide a useful rubric for discussion.
Kurt Andersen, LinkedIn
Kurt Andersen was one of the co-chairs for SREcon17Americas and has been active in the anti-abuse community for over 15 years. He is currently the senior IC for the Product SRE team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware and Mobile Anti-Abuse Working Group (M3AAWG.org). He has spoken at M3AAWG, Velocity, SREcon(US), and SANOG on various aspects of reliability, authentication and security.
Meeting Room 1+2
SRE 101, Revisited
Laura Nolan, Google
This presentation replaces the talk by Dinah McNutt, who is unable to attend. Laura Nolan will revisit her SRE 101 content from yesterday; if you missed the session due to the meeting room being at capacity, this is your opportunity to attend.
The purpose of an SRE team is to keep its services up, reliable, performant and efficient. How do effective SRE teams do this?
We'll run through an overview of key SRE competencies: monitoring and alerting, incident response, disaster recovery, performance and efficiency, change management and capacity planning.
We'll also look at the habits of successful SRE teams and some common pitfalls.
Laura Nolan, Google
Laura Nolan has been a Site Reliability Engineer at Google for four years, working on large data infrastructure projects and most recently, networking. Her background is in software engineering and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly SRE book, and is co-chair of SREcon17 Europe/Middle East/Africa.
Automated Debugging of Bad Deployments
Joe Gordon, Pinterest
Debugging a bad deployment can be tedious, from identifying new stack traces to figuring out who introduced them. At Pinterest we have automated most of these processes, using Elasticsearch to identify new stack traces and git-stacktrace to figure out who caused them. Git-stacktrace parses the stack trace and looks for related git changes. This has reduced the time needed to figure out who broke the build from minutes to just a few seconds.
Joe Gordon, Pinterest
Joe is an SRE at Pinterest, where he works on homefeed and performance. He has previously spoken at numerous conferences such as EuroPython, LinuxCon and LCA (Linux Conference Australia).
Debugging at Scale—Going from Single Box to Production
Kumar Srinivasamurthy, Microsoft Corporation
It's very easy to launch a debugger on your dev box, attach to the right process and step through code. However, things are different when you need to debug an issue in production that's getting tens of thousands of requests per second. What if the issue reproduces only in production? How do you debug without affecting production traffic? What techniques can you use in your development to make it easier to debug issues? Does your application use tracing? What debug logs are written out to aid in analysis?
This talk will cover:
- Challenges with debugging in production
- Various approaches that are used in the industry
- Examples from Bing & Cortana incidents and steady state problems to illustrate the techniques
- Service design ideas that make them easier to debug
Kumar Srinivasamurthy, Microsoft Corporation
Kumar works at Microsoft and has been in the online services world for several years. He currently runs the Bing Live site/SRE team. For the last several years, he has focused on growing the culture around live site quality, incident response and management, service hardening, availability, performance, capacity, SLA metrics, DRI/SRE development and educating teams on how to build services that run at scale.
Meeting Room 8
(Continued from previous session)
Distributed Systems Reasoning
John Looney, Intercom, and Theo Schlossnagle, Circonus
All distributed systems make tradeoffs and compromises. Different designs behave very differently with respect to cost, performance, and how they behave under failure conditions.
It's important to understand the tradeoffs that the building blocks in your systems make, and the implications this has for your system as a whole. In this workshop we'll look at several examples of different real-world distributed systems and discuss their strengths and shortcomings.
This workshop will include some practical elements. You will be given some system designs to read and to evaluate, and then we'll discuss the implications of each design together as a group.
John Looney, Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know the best use of their time and energy, but still hasn't worked out how to not burn himself out occasionally.
Theo Schlossnagle, Circonus
Theo Schlossnagle is the founder and CEO of Circonus. Previously, he founded OmniTI, the go-to source for organizations facing today’s most challenging scalability, performance, and security problems; was the founder of Message Systems, Inc. (now SparkPost); and researched distributed systems at JHU. Theo is the author of Scalable Internet Architectures (Sams) and a frequent speaker at IT conferences worldwide. He is a member of the IEEE and a senior member of the ACM and serves on the editorial board of the ACM’s Queue magazine. He holds undergraduate and graduate degrees in computer science from Johns Hopkins University.
Meeting Room 9
Linux System Metrics
Nati Cohen, Here Technologies, and Avishai Ish-Shalom, Wix.com
While you can learn a lot by emitting metrics from your application, some insights can only be gained by looking at OS metrics. Yet OS metrics, despite being commonly used, are also frequently misunderstood.
In this hands-on workshop, we will go over commonly used CPU, memory, disk and network metrics, and make sure we understand each of them. We will experiment with old & new tools to acquire and analyse metrics, evaluate black-box workloads by looking at metrics, and assess the effect of extreme metric values on simple applications. During our adventure, we will learn about Linux internals, its underlying optimizations and unexpected limitations.
To participate, just make sure to bring a laptop with a Chromium-based browser installed.
View the workshop's materials and exercises here.
Nati Cohen, Here Technologies
Production Engineer at Here Technologies, and a Teaching Assistant at the Interdisciplinary Center Herzliya. Previous experience includes: operations consulting, software development, *nix administration and security research in the Intelligence Corps as well as in multiple startup companies.
Avishai Ish-Shalom, Wix.com
Avishai is a veteran operations and software engineer with years of production experience. Currently masquerading as an engineering manager, Avishai is leading a team of software engineers in the Wix.com core services group. In his spare time, Avishai is spreading weird ideas and conspiracy theories like DevOps and Operations Engineering.
15:30–16:00
Break with Refreshments
Prefunction
16:00–17:00
Lansdowne Room
The History of How We Came to Be
Niall Richard Murphy, Google
The talk will feature research from international standardisation committees, the history of women, work, and computerization, and personal anecdotes of the mainframe era to construct a theory as to how and why engineering and operations split apart. We will also examine the issues with an eye to gender and workplace politics.
In the course of this talk, we will look at:
- The earliest jobs in computers (~40s). What were they, why did they exist, and who held them?
- What different and distinct things happened in the 50s, 60s, 70s, 80s, 90s, and noughties?
- What was the difference between the keypunch operator and the computer operator?
- What was unglamorous about programming in the 40s, such that women were allowed to do it?
- What factors led to women being crowded out of programming as a profession?
- Why did system administration come into being as a job family, and why was it minimally viable to separate that from programming?
Niall Richard Murphy, Google
Niall Richard Murphy is the head of Ads Reliability Engineering for Google Ireland, where his group is responsible for the infrastructure underlying ~90% of Google's annual revenue. He is the instigator, co-author and co-editor of platinum-selling "Site Reliability Engineering" (O'Reilly, 2016), a history of the Irish Internet, and is the holder of degrees in Computer Science and Mathematics, and Poetry Studies. He lives in Dublin with his wife and two children.
Hiring SREs May Be Literally Impossible
Chris Sinjakli, SRE at GoCardless
If we're gonna do this SRE thing, we need to find the right people to do it.
After a few recent discussions, it became clear just how much everyone—at large companies and small—is struggling to find those people.
You can barely get enough applicants in the door, and by the time you've run your interview process you're left making a handful of offers.
Hiring SREs from the outside world is a competitive, expensive game to play. So why focus so much on people outside your company? You've got potential SREs sat all around you!
In this talk, we'll set the scene with a little look at the realities of hiring SREs. We won't stay there for too long though, because that's not what's going to save us!
The bulk of the talk will be spent looking at ways to discover budding SREs in your organisation, how to nurture their interest, and how to coach them in a role that's new to them.
Chris Sinjakli, SRE at GoCardless
Chris enjoys all the weird bits of computing that fall between building software users love and running distributed systems reliably. All his programs are made from organic, hand-picked, artisanal keypresses.
Pembroke Room
Gamifying Reliability Excellence—The Service Score Card
Daniel Lawrence, LinkedIn
What makes a “good” service is a moving target. Technologies and requirements change over time, and it can be impossible to ensure that none of your services have been left behind. The Service ScoreCard approach is to have a small check for each service initiative we have. A check can be anything measurable: deployment frequency, whether everyone on the on-call team has a phone, or whether the service runs the latest version of the JVM. The Service ScoreCard gives each service a grade from 'F' to 'A+', based on passing or failing the list of checks. As soon as anyone sees a service's grades slipping, everyone rallies to improve them. We can then set up rules based on the grades: “Only B and above services can deploy 24/7”, “moratorium on services without an A+”, or “No SRE support for services below a C grade”.
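As a rough illustration of how pass/fail checks could roll up into a letter grade, here is a minimal sketch. The check names, the percentage thresholds, and the helper functions are assumptions made for this example; the abstract does not say how LinkedIn actually computes grades.

```python
from typing import Dict, List, Tuple

# Assumed thresholds: minimum percentage of passing checks for each grade.
GRADE_THRESHOLDS: List[Tuple[int, str]] = [
    (100, "A+"),
    (90, "A"),
    (80, "B"),
    (70, "C"),
    (60, "D"),
]

# Grades from worst to best, used to compare two grades.
GRADE_ORDER: List[str] = ["F", "D", "C", "B", "A", "A+"]


def grade(check_results: Dict[str, bool]) -> str:
    """Turn a mapping of check name -> passed? into a letter grade."""
    if not check_results:
        return "F"  # a service with no checks has proven nothing
    passed_pct = 100 * sum(check_results.values()) // len(check_results)
    for minimum_pct, letter in GRADE_THRESHOLDS:
        if passed_pct >= minimum_pct:
            return letter
    return "F"


def can_deploy_anytime(letter: str) -> bool:
    """Example rule from the abstract: only B and above may deploy 24/7."""
    return GRADE_ORDER.index(letter) >= GRADE_ORDER.index("B")


# Hypothetical results for one service: four of five checks pass.
EXAMPLE_RESULTS = {
    "deployed_in_last_30_days": True,
    "oncall_team_all_have_phones": True,
    "latest_jvm_version": False,
    "has_runbook": True,
    "has_dashboard": True,
}

EXAMPLE_GRADE = grade(EXAMPLE_RESULTS)
print(EXAMPLE_GRADE, can_deploy_anytime(EXAMPLE_GRADE))
```

Keeping each check a plain boolean makes it cheap to add a new initiative: a new check changes the denominator, and every service's grade shifts until it catches up.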
Daniel Lawrence, LinkedIn
Daniel will fix anything with Python, even if it's not broken. He is an Aussie on loan to LinkedIn in the USA as an SRE, focusing on looking after the jobs and recruiting services. When he is not working on tricky problems for LinkedIn, he plays _a lot_ of video games.
Incident Management and Chatops at Shopify
Daniella Niyonkuru, Shopify
SREs are expected to be incident management experts. Yet, incident handling is hard, often messy, and exhausting. We encounter new incidents, look everywhere for possible explanations, sometimes tunnel on symptoms, and, under pressure, forget some good practices.
At Shopify, we care not only about handling incidents quickly and efficiently, but also SRE well-being. We have a special IMOC (Incident Manager On Call) rotation and an incident chatbot to assist IMOCs. In this talk, I’ll first explain the IMOC role and how training SREs for this duty is essential to handling incidents well.
Our chatbot assists the IMOC by reducing manual effort and context switching. We integrated the bot with our conversation tool and several third-party tools (PagerDuty, StatusPage, GitHub) to send timely reminders. It also binds the incident to a discussion channel where all communications happen, allows status page updates directly from the chat room, keeps notes and records event times, and generates service disruption content. To avoid burnout for long-running incidents, the chatbot also reaches out to other IMOCs.
Our chatbot supports best practices and "streamlines" incident response. Attendees will leave with strategies for incorporating chatbots into their incident management and considerations for automating precisely and smartly.
Daniella Niyonkuru, Shopify
Daniella Niyonkuru is a Production Engineer at Shopify where she helps build a better, faster and more resilient platform. Previously, Daniella worked as an Aircraft System Software Specialist, and researched Formal Model Driven Development for Embedded Systems.
Meeting Room 1+2
Fast and Safe Production Monitoring of JVM Applications with BPF Magic
Sasha Goldshtein, CTO, Sela Group
All of us have seen these evasive performance issues or production bugs in the field, which standard monitoring tools don't see or catch. BPF is a Linux kernel technology that enables fast, safe, dynamic tracing of a running system without any preparation or instrumentation in advance. The JVM itself has a myriad of insertion points for tracing garbage collections, object allocations, JNI calls, and even method calls with extended probes. When the JVM tracepoints don't cut it, the Linux kernel and libraries allow tracing system calls, network packets, scheduler events, off-CPU time, time blocked on disk accesses, and even database queries. In this talk, we will see a holistic set of BPF-based tools for monitoring JVM applications on Linux, and revisit a systems performance checklist that includes classics like fileslower, opensnoop, and strace—all based on the non-invasive, fast, and safe BPF technology.
Sasha Goldshtein, CTO, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft MVP, Pluralsight author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor to projects focused on system diagnostics, performance monitoring, and tracing—across multiple operating systems and runtimes. Sasha authored and delivered training courses on Linux performance optimization, event tracing, production debugging, mobile application development, and modern C++. Between his consulting engagements, Sasha speaks at international conferences world-wide.
Meeting Room 8
(Continued from previous session)
Distributed Systems Reasoning
John Looney, Intercom, and Theo Schlossnagle, Circonus
Meeting Room 9
Linux System Metrics
Nati Cohen, Here Technologies, and Avishai Ish-Shalom, Wix.com
17:00–18:00
Lightning Talks
Lightning Talks Video
- 6 Ways a Culture of Communication Strengthens Your Team’s Resiliency
Jaime Woo, Shopify
- Dynamic Documentation in 5 minutes
Daniel Lawrence, LinkedIn
- Resource management and isolation, the non-shiny way
Luiz Viana, Demonware
- Collecting metrics with Snap - the open telemetry framework
Guy Fighel, SignifAI
- Decentralized Data
Jason Koppe, Indeed
- Failovers
Emil Stolarsky, Shopify
18:00–19:30
Conference Reception
Sponsored by Circonus
Herbert Room
19:30–21:30
Birds-of-a-Feather Sessions (BoFs)
BoFs, or Birds-of-a-Feather sessions, are an opportunity for informal and ad hoc discussion of a topic of shared interest among a group of conference attendees. SREcon Europe will have BoF sessions on the Wednesday and Thursday evenings. While some topics will be pre-arranged, there are slots for additional BoF sessions. At the event, we encourage attendees to suggest additional BoF topics of interest, and we will provide facilities for groups to meet and discuss them.
Go to the BoFs page for information on scheduling BoFs.
1 September 2017
08:00–09:00
Morning Coffee and Tea
Prefunction
09:00–10:30
Lansdowne Room
Why Work with a Tech Writer?
Betsy Beyer, Google
The sparsely attended SREcon17 Americas Tech Writing talk focused on HOW to work with Tech Writers. I'm instead focusing on WHY you should work with a TW: they make your life easier, and can make the work you're already doing have more impact.
Reasons to engage a TW:
- If you need to explain your product/service/etc. to users: Chances are, the most satisfying, enjoyable, and rewarding part of your job is engineering work—creating a tool, fixing a problem, redesigning infrastructure, etc. It *isn't* answering the same questions from users over and over. → Get solid documentation in place to free up engineer time.
- If your team's internal documentation is a mess: It might be hard to find the docs you need when you get paged, hard to identify current content, or some information might just be flat-out missing. → A TW can help you whip a documentation rat's nest into shape, and give you the tools to maintain your docs easily moving forward.
- If you want to make your work more visible (so other people can leverage it and learn from it): → A TW can help you get that information out there!
Betsy (Adrienne) Beyer, Google
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. She holds degrees from Stanford and Tulane.
Postmortem Action Items: Plan the Work and Work the Plan
John Lunney, Google SRE
We discuss best practices and challenges for developing high-quality action items (AIs) for a postmortem, plus methods of ensuring these AIs actually get implemented.
John Lunney, Google SRE
John Lunney is a Senior Site Reliability Engineer at Google Zürich. His team manages G Suite, productivity apps for Enterprise customers. He holds a degree in Computational Linguistics from Trinity College in Dublin, Ireland. Before Google, he worked on several lexicography projects for the Irish language.
Pembroke Room
Building an On-Premise Kubernetes Cluster For a Large Web Application
Daniel Turner, Shopify
Recently, Shopify began migrating from our custom container management system to Kubernetes. This switch will make us more efficient at running our large Rails monolith, as well as the current and future microservices that run alongside it. The first step in migrating was building a cluster using our own hardware. Running Kubernetes on-premises requires building services that cloud providers hide from their customers: etcd, high-availability master nodes, scalable networking, Ingress, and persistent storage. We believe that understanding the challenges and tradeoffs in providing these services is beneficial not only to those who run their own cluster, but also to those who use cloud providers.
Beyond building the cluster, we also had to modify our core application and tooling to fit Kubernetes’ container-centric framework. We expect that most applications currently on homegrown deployment systems will similarly have to overcome host-based assumptions. In our case, these were unbounded jobs, hard-coded assumptions about hosts, and services exposed to external monitoring tools via global DNS.
Attendees will leave this talk equipped to decide if running their own Kubernetes cluster is right for them and how to make the shift as successful as possible.
Daniel Turner, Shopify
Daniel Turner is a Production Engineer at Shopify. He is part of the team building our Kubernetes clusters as well as maintaining Shopify’s data centers.
Distributed Systems, Like It or Not
Theo Schlossnagle, Circonus
Over the last twenty years, complex distributed systems have been deployed to solve the leading challenges in the systems resiliency and robustness realm. At this point in systems architecture design, distributed systems are everywhere in everything; even the simplest architectures incorporate distributed software and carry with them the failure scenarios it brings.
SREs are put in an even more complicated situation, because of their wide net of responsibilities, to manage distributed systems of distributed systems. Things can and will go wrong and one of the fundamental skills for SREs going forward will be strong distributed systems reasoning skills.
In this talk we discuss the types of failure scenarios that distributed systems bring with them (with anecdotes) and develop various reasoning skills that can be used to tackle these challenges with increased confidence.
Theo Schlossnagle, Circonus
Theo Schlossnagle is the founder and CEO of Circonus. Previously, he founded OmniTI, the go-to source for organizations facing today’s most challenging scalability, performance, and security problems; was the founder of Message Systems, Inc. (now SparkPost); and researched distributed systems at JHU. Theo is the author of Scalable Internet Architectures (Sams) and a frequent speaker at IT conferences worldwide. He is a member of the IEEE and a senior member of the ACM and serves on the editorial board of the ACM’s Queue magazine. He holds undergraduate and graduate degrees in computer science from Johns Hopkins University.
Avoiding and Breaking Out of Capacity Prison
Jake Welch, Microsoft
Capacity management at any scale has many moving pieces and requires a range of activities from capacity forecasting to emergency response. Capacity issues can directly impact your service scalability, performance and availability. Lead time to acquire new capacity can make a capacity management plan as important as your service monitoring. Being prepared can help ensure a great customer experience even during difficult times.
In this talk, we will present a comprehensive set of activities necessary to execute a capacity management plan for a storage service of any size. We will discuss lessons from Microsoft Azure Storage, one of the largest and fastest growing storage systems on the planet, and how SREs used code to scale proactively and to remove complex manual effort and toil through automation. The work here has resulted in an improved customer experience, better work/life balance and reduced cost.
Jake Welch, Microsoft
Jake Welch is a Site Reliability Engineer on the Microsoft Azure team in NYC. He has worked on large scale services for a decade, in both dev and operational roles. At Microsoft, he primarily works on infrastructure services with focus on Storage and Security.
Meeting Room 1+2
The EU's New Data Protection Law - a Survival Guide
Simon McGarr, Data Compliance Europe, and John Looney, Intercom
What data do you hold?
Are you processing the data, or controlling it?
Do you have the consents to use that data like that?
Do you have a register of all that data and every way you use it, and what for?
Can you find every piece of data you hold that relates to an individual, copy it and send it to them—for free—within 30 days?
What happens when they say they want it erased?
The General Data Protection Regulation (GDPR) comes into force on 25 May 2018. New powers mean regulators can impose fines for breaches of up to 4% of annual turnover. This workshop is for anyone trying to make sure that their organisation isn't in breach by the implementation date.
GDPR isn't just a compliance project. It's a business culture change project. Let's struggle our way through together.
Simon McGarr, Data Compliance Europe
Simon McGarr is recognised as one of Ireland’s leading experts in Data Protection. A practising solicitor, he has lectured in the Law Society and regularly appears on national TV and radio and in the press discussing data issues. He has been involved in most of the landmark cases developing Data Protection law in the EU and focuses much of his work on helping organisations to understand their data protection law needs.
John Looney, Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know the best use of their time and energy, but still hasn't worked out how to not burn himself out occasionally.
Meeting Room 9
Statistics for Engineers
Heinrich Hartmann, Circonus
Statistics is the art of extracting information from data. In this workshop, we will visit the statistical methods that are relevant for operating modern IT infrastructures. Containerized cloud architectures are incredibly difficult monitoring targets. Creating probabilistic models of the behaviors of these systems that can be used for reliable predictions is a very difficult task. In fact, it's so difficult that I don't think anyone has done that yet. We will certainly not try to here.
Instead, we will take a different path in this workshop, and talk about statistical methods that are known to work and provide value in your daily job as an SRE. In this workshop you will learn:
- How to measure the quality of APIs you provide and consume.
- How to interpret the telemetry data that is emitted from the systems you are running.
- How to aggregate metrics from single nodes to service-level views.
Topics we will cover in depth include: data visualisation, averages, percentiles, histograms, regressions, robustness and mergeability. We will cover the material from a theoretical and a practical perspective. Bring pen and paper as well as your laptop!
Course material can be found here.
Heinrich Hartmann, Circonus
Heinrich Hartmann is the Analytics Lead at Circonus. He is driving the development of analytics methods that transform monitoring data into actionable information as part of the Circonus monitoring platform. In his prior life, Heinrich pursued an academic career as a mathematician (PhD in Bonn, Oxford). Later he transitioned into computer science and worked as a consultant for a number of different companies and research institutions.
10:30–11:00
Break with Refreshments
Prefunction
11:00–12:00
Lansdowne Room
Service with an Angry Smile: Passive-Aggressive Behavior in SRE
Lauri Apple, Zalando
Awareness and discussion of psychological safety as a key ingredient for productive and successful teams has grown recently, thanks to media coverage and pioneering research by companies and scholars. While flagrant forms of disrespect like angry shouting and insults obviously threaten psychological safety in teams, so too can passive-aggressive behaviors such as complaining, pouty silence, “forgetting” to complete tasks, and stubbornness.
In the SRE context, passive-aggressive behaviors can have disastrous consequences. These include outages and incidents that could have been avoided with better preparation or notification; narrowly focused quick-fixes instead of systemic, long-term maintenance efforts; blame instead of solutions-oriented post-mortems, and refusal to share knowledge. In many cases, few or no words are spoken; silent resistance is the hostile act.
This talk will bring attention to passive-aggressive behavior as a set of hostile acts that one can (and should) identify, manage and overcome in the tech/SRE environment. For context, it will draw upon psychological research, history, a bit of pop culture for fun, and anecdotes from SREs. And for guidance, you’ll hear some tried-and-true agile and communications methods for managing or eliminating passive-aggressive behavior in your teams and interactions.
Lauri Apple, Zalando
Based in Berlin, Lauri Apple develops and evangelizes Zalando’s open source efforts. She's also a producer/agile project manager for the company's core search engineering team and co-leads Zalando’s InnerSource initiative.
The Cult(Ure) of Strength
Emily Gorcenski, Simple
"Strength," "Courage," and "Bravery" are virtues often heaped upon individuals undergoing hardship. These compliments come from a deep-rooted cultural value that sacrifice should be praiseworthy and that performing in the face of difficulty is a sign of virtue. In tech, strength is valued to the point of caricature, creating a culture of depersonalization and overwork that disproportionately affects people who by their identities or job descriptions are asked too often to "take one for the team."
Through the lens of my 15+ year journey through the STEM pipeline, I'll talk about the culture of strength and how we can better set expectations to manage hardship and workload in the workplace or community.
Emily Gorcenski, Simple
A data scientist and technologist with a background in aeronautical engineering, plasma physics, and biotechnology, Emily likes exploring the intersection of society and technology and is driven towards building good technological citizenship.
Pembroke Room
Run Less Software; Use Less Bits
Rich Archbold, Intercom
At Intercom, we believe that to:
- improve availability and reduce risks,
- save time and money,
- improve operability,
- and move fast for the long term,
we should build and run Intercom using the smallest sensible set of core infrastructure components. This means:
- we are cautious about adding new technologies to the mix
- we’d often rather use one of our existing/established technology components and write (and maintain) more software ourselves than take on the overhead of learning and maintaining expertise in a new / more powerful technology.
- where tools, systems or workflows are required to support building/managing Intercom, especially in areas that could be deemed “undifferentiated heavy lifting”, we’d rather not write any software or operate any systems ourselves at all, and instead use world-class 3rd-party services.
We summarise these beliefs in an infrastructure design principle we call “run less software, use less bits”.
In this talk we use some real examples to go deep on how we use this principle to make hard decisions and good, informed, deliberate trade-offs.
Rich Archbold, Intercom
Richard Archbold is an Engineering Director at Intercom, a highly successful and fast-growing Irish technology startup that provides customer communication software to Internet businesses. Intercom's mission is to make web business personal. Prior to Intercom, Richard worked as a senior technology manager at Facebook Dublin, where he was part of the team responsible for keeping Facebook online. Rich also spent eight years at Amazon in Dublin, helping found their Dublin office. While at Amazon, Rich held a number of positions ranging from engineer to senior engineering manager in both Amazon.com and AWS.
Monitoring Cloudflare's Planet-Scale Edge Network
Matt Bostock, Cloudflare
Cloudflare operates a global anycast edge network serving content for 6 million web sites. This talk explains how we monitor our network, how we migrated from Nagios to Prometheus and the architecture we chose to provide maximum reliability for monitoring. We'll also discuss the impact of alert fatigue and how we reduced alert noise by analysing data, making alerts more actionable and alerting on symptoms rather than causes.
This talk will cover:
- The challenges of monitoring a high volume, anycast, edge network across 100+ locations
- The architecture we chose to maximise the reliability of our monitoring
- Why Prometheus excels as the new industry standard for modern monitoring
- Approaches to reducing alert noise and alert fatigue
- Triaging alerts into a ticket system
- Analysing past alert data for continuous improvement
- The pain points we endured
- Effecting change across engineering teams
Matt Bostock, Cloudflare
Matt is a Platform Operations engineer at Cloudflare, where he has spent the last year promoting a monitoring utopia. He was previously tech lead for the GOV.UK Infrastructure team and is a keen contributor to open source software. He also loves bacon, avocado, running, and the Oxford comma.
Meeting Room 1+2
(Continued from previous session)
The EU's New Data Protection Law - a Survival Guide
Simon McGarr, Data Compliance Europe, and John Looney, Intercom
What data do you hold?
Are you processing the data, or controlling it?
Do you have the consents to use that data like that?
Do you have a register of all that data and every way you use it, and what for?
Can you find every piece of data you hold that relates to an individual, copy it and send it to them—for free—within 30 days?
What happens when they say they want it erased?
The General Data Protection Regulation (GDPR) comes into force on 25 May 2018. New powers mean regulators can impose fines for breaches of up to 4% of annual turnover. This workshop is for anyone trying to make sure that their organisation isn't in breach by the implementation date.
GDPR isn't just a compliance project. It's a business culture change project. Let's struggle our way through together.
Simon McGarr, Data Compliance Europe
Simon McGarr is recognised as one of Ireland’s leading experts in Data Protection. A practising solicitor, he has lectured in the Law Society and regularly appears on national TV and radio and in the press discussing data issues. He has been involved in most of the landmark cases developing Data Protection law in the EU and focuses much of his work on helping organisations to understand their data protection law needs.
John Looney, Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know the best use of their time and energy, but still hasn't worked out how to not burn himself out occasionally.
Meeting Room 9
(Continued from previous session)
Statistics for Engineers
Heinrich Hartmann, Circonus
Statistics is the art of extracting information from data. In this workshop, we will visit the statistical methods that are relevant for operating modern IT infrastructures. Containerized cloud architectures are incredibly difficult monitoring targets. Creating probabilistic models of the behaviors of these systems that can be used for reliable predictions is a very difficult task. In fact, it's so difficult that I don't think anyone has done that yet. We will certainly not try to here.
Instead, we will take a different path in this workshop and talk about statistical methods that are known to work and provide value in your daily job as an SRE. In this workshop you will learn:
- How to measure the quality of APIs you provide and consume.
- How to interpret the telemetry data that is emitted from the systems you are running.
- How to aggregate metrics from single nodes to service-level views.
Topics we will cover in depth include: data visualisation, averages, percentiles, histograms, regressions, robustness and mergeability. We will cover the material from a theoretical and a practical perspective. Bring pen and paper as well as your laptop!
Course material can be found here.
Heinrich Hartmann, Circonus
Heinrich Hartmann is the Analytics Lead at Circonus. He is driving the development of analytics methods that transform monitoring data into actionable information as part of the Circonus monitoring platform. In his prior life, Heinrich pursued an academic career as a mathematician (PhD in Bonn, Oxford). Later he transitioned into computer science and worked as a consultant for a number of different companies and research institutions.
12:00–13:00
Conference Luncheon
Sussex Restaurant and Herbert Room
13:00–14:00
Lansdowne Room
Monitoring Design Principles
Theo Schlossnagle, Circonus
In this presentation we'll re-examine monitoring to understand how to formulate valuable goals and align monitoring design and implementation with those goals. With an emphasis on outcomes and the behavior that leads to them, we'll focus on performance data rather than security monitoring.
Attendees will learn to ask the right questions when approaching the monitoring of systems and businesses. They will understand why and how monitoring should fit into the overall systems architecture to reduce risk and increase value.
Theo Schlossnagle, Circonus
Theo Schlossnagle is the founder and CEO of Circonus. Previously, he founded OmniTI, the go-to source for organizations facing today’s most challenging scalability, performance, and security problems; was the Founder of Message Systems, Inc. now Sparkpost; and researched distributed systems at JHU. Theo is the author of Scalable Internet Architectures (Sams) and a frequent speaker at IT conferences worldwide. He is a member of the IEEE and a senior member of the ACM and serves on the editorial board of the ACM’s Queue magazine. He holds undergraduate and graduate degrees in computer science from Johns Hopkins University.
Pembroke Room
And the CFO Wept: AWS Cost Control
Corey Quinn, The Quinn Advisory Group
"I'll just spin up an instance to test something this afternoon" you say with the best of intentions. Unbeknownst to you, you'll retire before that instance does. In this hilarious talk, Corey delves into the details of how the AWS bill goes from a few cents an hour into something suspiciously reminiscent of a phone number.
From low hanging fruit to weird Amazon billing gotchas, this talk serves as a survey of all of the sharp edges around Amazon's least understood product—its ridiculous monthly bill.
Corey Quinn, The Quinn Advisory Group
Corey is a Cloud Economist at quinnadvisory.com, which helps companies large and small with their horrifying AWS bills. He also runs lastweekinaws.com, a snarky weekly newsletter on the happenings in Amazon's cloud ecosystem.
Meeting Room 1+2
CRE: Expanding SRE to inside Your Customer's Organisation
Stephen Thorne, Google
The Cloud is a scary place. You're trusting your entire business to a platform out of your control. This applies to any platform, any SaaS, PaaS, IaaS provider. If they have an outage it's out of your hands!
Introducing CRE: Expanding SRE to inside your customer's organisation. Making a reliable system for your own business is one thing, but doing it when you are providing a platform to another company is a new and exciting area.
Learn how Google is exploring Customer Reliability Engineering, sharing everything from design decisions to monitoring.
Stephen Thorne, Google
Meeting Room 8
Panel: AMA for New SREs
Moderator: Murali Suriar, Google
Panelists: Gráinne Sheerin, Google; John Looney, Intercom; Chris Sinjakli, GoCardless; Ola Klapcinska, Google
If you're new to SRE, or considering becoming an SRE, and you have questions, come to this session. You'll get the opportunity to ask a variety of experienced SREs for their opinion on topics related to SRE teams and culture, hiring, oncall, troubleshooting, performance, release management, and more.
Murali Suriar is a lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running software defined network control systems. Left Google to get on a boat. Got bored and came back.
Gráinne Sheerin is a Site Reliability Engineer for Google Ireland. She's a tech lead responsible for Ad Serving infrastructure and has 5 years of experience in production engineering. She is a physicist, earning a doctorate in Nanoscience from Dublin City University. Prior to Google, she masqueraded as a strategic relationship manager for Reuters and a network engineer for HEAnet.
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know the best use of their time and energy, but still hasn't worked out how to not burn himself out occasionally.
Chris Sinjakli enjoys all the weird bits of computing that fall between building software users love and running distributed systems reliably. All his programs are made from organic, hand-picked, artisanal keypresses.
Ola has been a Site Reliability Engineer at Google London for three years. She has been SREing at Ads and Cloud fronted teams, and most recently focusing on Monitoring. When not at work, she roams around Europe and occasionally other continents.
Meeting Room 9
Being an Effective Ally to LGBTQ+, Non-Binary, Women, and PoC in the Tech Industry
Chris Stankaitis, The Pythian Group
There are a lot of white, male heterosexuals in the Tech Industry; this demographic makes up the majority by a large margin. This is also the demographic which holds the most privilege.
Homogenization in our industry is bad. We need the creative ideas and mindset that diversity brings to allow us to innovate and build amazing things.
Those with privilege and power need to understand it, and learn how to use it to become an ally to those people who are in marginalized parts of our industry, to help create safe and welcoming spaces where people can feel that they can express themselves and be celebrated for their differences.
This talk will unpack privilege and discuss how you can be an effective ally to those who have no voice.
Chris Stankaitis, The Pythian Group
Chris Stankaitis has worn many hats: veteran SysAdmin, SRE, Speaker and People Manager. Chris builds, runs, and maintains complex systems at scale using technology and quite a bit of duct tape to keep many of the services you know and love up and running. Chris has been an active member and educator in their local LGBTQ+ community for several years, working with both adults and youth to help bring about a better understanding of the challenges facing LGBTQ+ people in today's world and society.
14:00–14:40
Break with Refreshments
Prefunction
14:40–17:00
Pembroke and Lansdowne Rooms
Have You Tried Turning It off and Turning It on Again?
Tanya Reilly, Google
Most of us have a backup strategy, many of us have a restore strategy and several of us have a fully tested restore strategy. But backups are not the whole story. I'll talk about the parts of disaster recovery we're less prepared for, and dependencies that you might not think about until one day when you really do turn an entire service, entire site or (perish the thought!) an entire company off and on again.
This talk will cover managing complexity, testing your fallback plan and avoiding dependency cycles that make it impossible to restart groups of systems. Like, where do you store the documentation on how to recover the documentation server?
Tanya Reilly, Google
Tanya Reilly has been a Site Reliability Engineer at Google since 2005, working on low level infrastructure like distributed locking, load balancing, and bootstrapping. Before Google, she worked as a Systems Administrator at eircom.net, Ireland's largest ISP, and before that she was the entire IT Department for a small software house.
100 Teams, 100 Ways to Fail
John Keiser, Microsoft Azure, and Ben Broderick Phillips, Microsoft
Every SRE organization hits the same problems at some point: How do we convince teams to let us help, and own the work and results together? As an SRE, you will encounter different kinds of resistance from the teams you work with.
Azure has 100+ teams, and Azure SRE has gained experience with every type of engagement on the map. If any of these scenarios sound familiar to you:
- Engineers that do not understand your utopian visions (nobody understands me)!
- Everyone is rational, no one is right
- “This too shall pass”, i.e. the team that knows you will eventually sort out the rough edges in your tooling or find another job, and can safely ignore you until then
Come join Azure SRE as we share stories about teams we’ve worked with, the resistance we’ve run into, and sometimes even how we fixed it.
John Keiser, Microsoft Azure
John Keiser is a Mad Scientist of the internet age, having developed, tested and led teams for the last 20 years at places like Netscape, Bing, and Chef. Microsoft Azure now lets him play with their service, having been convinced he wouldn’t rewrite anything too critical.
If his secret lab exploded, John would leave behind a wife, three young kids, a large collection of video games, and a unicycle team.
Ben Broderick Phillips, Microsoft
Persistent SRE Antipatterns: Pitfalls On the Road to Creating a Successful SRE Program Like Netflix and Google
Jonah Horowitz, Stripe, and Blake Bisset
What isn’t Site Reliability Engineering? Does your NOC escalate outages to your DevOops Engineer, who in turn calls your Packaging and Deployment Team? Did your Chef just sprinkle some Salt on your Ansible Red Hat and call it SRE? Lots of companies claim to have SRE teams, but some don’t quite understand the full value proposition, or what shiny technologies and organizational structures will negatively impact your operations, rather than empowering your team to accomplish your mission.
You’ll hear stories about anti-patterns in Monitoring, Incident Response, Configuration Management, and more that we’ve tripped over in our own teams, seen actually proposed as good practice in talks at other conferences, and heard as we speak to peers scattered around the industry. We'll also discuss how Google and Netflix each view the role of the SRE, and how it differs from the traditional Systems Administrator role. The talk also explains why freedom and responsibility are key, trust is required, and when chaos is your friend.
Jonah Horowitz, Stripe
Jonah Horowitz is a Site Reliability Engineer with Stripe. He works with all of the individual engineering teams at Stripe to drive reliability efforts. This includes monitoring, alerting, deployment pipelines and chaos resiliency. Before coming to Stripe he worked at several startups around the Bay Area, including Netflix; Quantcast, a leading ad-tech startup where he grew their network to process over 3 million events per second; and Looksmart, a contextual advertising company. He was also on the founding team of Wal-Mart.com (now Walmart Labs), where he built out their software deployment pipelines and their product image management systems.
Blake Bisset
17:10–17:20
Closing Remarks
Program Co-Chairs: Avishai Ish-Shalom, Wix, and Laura Nolan, Google