SREcon19 Asia/Pacific Program Grid
Wednesday, June 12
8:00 am–9:00 am
Morning Coffee and Tea
Level 3 Foyer 5
9:00 am–9:10 am
Opening Remarks
Summit Room 2
Program Co-Chairs: Frances Johnson, Google, and Avleen Vig, Facebook
9:10 am–10:35 am
Summit Room 2
A Tale of Two Postmortems: A Human Factors View
Tanner Lund, Microsoft
Many companies become frustrated with their postmortem and incident review process, feeling that it is a burden, or that it does not provide meaningful insights, or that the repairs and learnings generated do not help prevent repeats or other incidents. Fortunately, there is a better way to do things, backed by decades of scientific rigor and proven in industries where outages can mean a lot worse than lost revenue.
Join our fictional company, "Potato Systems," as they deal with the aftermath of a catastrophic incident. As they struggle to learn from it and move forward, they, and we, will come to understand the stark contrast in outcomes and effectiveness of Safety I vs Safety II thinking.
Tanner Lund, Microsoft
Tanner Lund has been a part of Azure's SRE organization from the beginning. He has worked in a variety of roles, including crisis management, developing SREBot, building data pipelines, and leading services through SRE/DevOps transitions. Throughout it all his focus has been on understanding complex systems and how we achieve our goals through them, seeking to unlock their secrets.
Availability—Thinking beyond 9s
Kumar Srinivasamurthy, Bing, Microsoft Corp
It's very easy and convenient to build metrics at the service level. These often hide a wide array of issues that users might face. Having the right metrics is a key component of building sustainable SRE culture.
In this talk, you will learn:
- How do you measure Availability for your product, not just a service?
- How to think beyond just 9s
- What are the common pitfalls for a beginner engineer?
- Mistakes in metric calculations
- Some examples of issues faced by our product and lessons learnt
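The abstract does not say which calculation mistakes the talk covers. As one illustrative possibility only, the sketch below contrasts two ways of computing availability over the same data: averaging per-window availability versus weighting by request count. The function names and the sample numbers are hypothetical.

```python
def request_weighted_availability(windows):
    """windows: list of (successful_requests, total_requests) per time window."""
    ok = sum(s for s, _ in windows)
    total = sum(t for _, t in windows)
    return ok / total


def mean_of_window_availability(windows):
    """Average the per-window ratios, ignoring how much traffic each window had."""
    return sum(s / t for s, t in windows) / len(windows)


# Hypothetical data: a quiet window with no failures, then a busy window
# in which 10% of requests failed.
windows = [(100, 100), (9000, 10000)]

print(mean_of_window_availability(windows))    # roughly 0.95
print(request_weighted_availability(windows))  # roughly 0.90
```

The unweighted mean treats the quiet window as equal to the busy one, so it reports a higher figure than most users actually experienced.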
Kumar Srinivasamurthy, Bing, Microsoft Corp
Kumar works at Microsoft and is currently a Group Engineering Manager for the Bing Team. For the last several years, he has focused on building reliable high scale systems, availability, performance, capacity engineering, online safety, data mining, metrics, and educating teams on how to build services that run at scale.
Use Interview Skills of Accident Investigators to Learn More about Incidents
Thai Wood, Resilience Roundup
Learning from others' experience is a critical skill, especially after an incident or outage. Typically we do this by asking questions. But it turns out that how we ask questions, and how we interact with the person we're asking, matters a lot.
Oftentimes, attempts at knowledge elicitation at this stage can be ineffective. There are techniques, backed by decades of research, currently in use by accident investigators in the US when investigating aviation, highway, or marine accidents.
I'll go over some core techniques that can be implemented right away. These techniques are grounded in research and are not adversarial, but can make a big impact on the quality of information you receive when trying to learn from others.
Thai Wood, Resilience Roundup
Thai helps teams build more resilient systems and improve their ability to effectively respond to incidents. A former EMT, he applies his experience managing emergency situations to the software industry. He writes about resilience engineering each week at ResilienceRoundup.com
10:35 am–11:00 am
Break with Refreshments
Level 3 Foyer 5
11:00 am–12:30 pm
Summit Room 2
Leading without Managing: Becoming an SRE Technical Leader
Todd Palino, LinkedIn
Increasingly, technical organizations are developing career paths to build and recognize leaders outside of the traditional management roles. But what should an SRE who wants to be a leader be focusing on? Through the eyes of an engineer who reinvented his career in one of the largest SRE organizations, we will examine what technical leadership looks like, and how an individual can help guide the strategic path of a team, department, or company without taking on the role of a people manager. You'll pick up tactical work that you can start immediately to set yourself up for success, and some pointers to be able to identify the opportunities when they show up.
Todd Palino, LinkedIn
Todd Palino is a Senior Staff Engineer in Site Reliability at LinkedIn on the Capacity Engineering team, where his team is creating a framework for application capacity measurement, analysis, and change intelligence. Prior to that, he was responsible for architecture, day-to-day operations, and tools development for one of the largest Apache Kafka deployments. In his spare time, Todd is the developer of the open source project Burrow, a Kafka consumer monitoring tool, and is the co-author of Kafka: The Definitive Guide, now available from O’Reilly Media.
Out of the office, you can find Todd at conferences like SREcon and LISA, sharing his experience from years in SRE technical leadership, and at Kafka Summit or ApacheCon talking about how to feed and water Kafka infrastructures. Or maybe out on the trails, training for the next marathon.
The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation
Stig Sorensen, Bloomberg LP
Changing culture is hard and cannot be done overnight. This is the story of our "not so straight" path to building SRE teams at Bloomberg. In this presentation, I will share some practical tips to help others avoid some of the initial challenges we had, to smooth your path to implementing SRE teams, and ultimately improving your organisation.
Stig Sorensen, Bloomberg LP
I’m a Bloomberg veteran who started my career as a financial software developer, building trading systems in the early 2000s. Looking back, the team I started on would be called an SRE team today, so I can say that it built a solid foundation. After 12 years of building software and software teams in the Trading Systems department, I moved to manage a new Production Visibility group responsible for telemetry and inventory management tools, as well as serving as the executive sponsor for the SRE movement across our Engineering department.
Room 331–332
Anomaly Detection on Golden Signals
Yu Chen, Baidu
Anomaly detection on golden signals, including latency, traffic, errors, and saturation, can detect system failures and provide important clues for failure diagnosis. In this talk, we will introduce our algorithm toolbox for anomaly detection on the golden signals.
The toolbox leverages historic data from the signals to build appropriate probability models. Alerts are then generated based on the probability calculated from the observation and the probability model. This probability directly relates to the false positive rate of classification and is able to represent SRE engineers' intuition. Furthermore, the probability values are comparable across different signals, so they become a good feature for failure diagnosis. In our production system, the alerting precision ranges from 70% to 90%, and the recall is around 90%.
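The idea of alerting on a probability rather than a raw threshold can be sketched as follows. This is a minimal illustration, not the Baidu toolbox: it assumes a normal distribution fitted to the history, and the function names, the sample latencies, and the 0.001 cut-off are all hypothetical.

```python
import math
from statistics import mean, stdev


def tail_probability(history, observation):
    """Two-sided probability of a value at least this far from the mean,
    under a normal model fitted to the history (an assumed model)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 1.0 if observation == mu else 0.0
    z = abs(observation - mu) / sigma
    # P(|Z| >= z) for a standard normal variable
    return math.erfc(z / math.sqrt(2))


def is_anomalous(history, observation, max_false_positive_rate=0.001):
    """Alert when the observation is less likely than the accepted false positive rate."""
    return tail_probability(history, observation) < max_false_positive_rate


# Hypothetical latency samples in milliseconds
latency_ms = [100, 102, 98, 101, 99, 100, 103, 97]

print(is_anomalous(latency_ms, 101))  # within normal variation
print(is_anomalous(latency_ms, 150))  # far outside the fitted model
```

Because the threshold is expressed as a probability, the same cut-off can be applied to latency, traffic, errors, and saturation even though their units differ.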
Yu Chen, Baidu
Yu Chen is a Data Architect at the IOP group of Baidu’s SRE department. His work focuses on developing algorithms for alerting and diagnosis, in order to improve the stability of production systems. Previously, he worked at Microsoft Research Asia. His research interests are distributed systems, consensus protocols, search ranking, and query recommendation.
Practical Instrumentation for Observability
Gabe Krabbe, Google
This 20-minute talk intends to fill in some of the gap between "you need good SLIs" and "the code increments a counter": what exactly should be gathered, and for which purpose? There will be concrete examples of good data to gather and export, so that Prometheus, Nagios, OpenCensus, and their friends and relatives provide useful information instead of distracting noise and misleading lies.
Gabe Krabbe, Google
Gabe Krabbe has been a Site Reliability Engineer at Google for over 14 years. He has worked on, and sometimes against, multiple generations of the Ads management and serving infrastructure. Before joining Google, he worked for various companies as a system administrator. Gabe frequently tells his servers and his children that he doesn't care who started it because it takes two to fight.
Room 334–336
Building Blocks of Distributed Systems: Parts 1 & 2
John Looney, Facebook
This is an interactive class where the important concepts around orchestration and load balancing of distributed systems are discussed, through the lens of designing a large data processing pipeline. The class ends with a collaborative design review of a theoretical pipeline system.
- Part 1, Orchestration and Load Balancing
- Part 2, Databases and Storage Services
John Looney, Facebook
John Looney has been an SRE since 2005, working with large distributed systems for Google and Facebook. He enjoys teaching SRE concepts with concrete examples. His day job is supporting teams that manage and deploy operating systems and firmware for Facebook.
12:30 pm–2:00 pm
Luncheon
Nicoll Room
Sponsored by Baidu
2:00 pm–3:30 pm
Summit Room 2
Retrospectives for Humans (a Crash Course)
Courtney Eckhardt, Heroku, a Salesforce Company
Seattle has two of the longest floating bridges in the world, and in 1990, one of them sank while it was being repurposed. This accident was a classic complex systems failure with a massive PR problem and great documentation. That combination is an excellent frame for talking about incident retrospectives—the good, the bad, the vaguely confusing and unsatisfying. Come for the interesting disaster story, stay to learn about the language of blame and how to ask warm, thoughtful engineering questions.
Courtney Eckhardt, Heroku, a Salesforce Company
Courtney Eckhardt first got into retrospectives when she signed up for comp.risks as an undergrad (and since then, not as much has changed as we’d like to think). Her perspectives on engineering process improvement are strongly informed by the work of Kathy Sierra and Don Norman (among others). You can find her knitting in the audience of conference talks, and she's always interested in cat pictures.
Using ML to Automate Dynamic Error Categorization
Antonio Davoli, Facebook
Log analysis and information extraction in highly dynamic production environments are complicated tasks. This talk will present how we designed a platform that, by leveraging unsupervised machine learning techniques, is able to dynamically categorize errors in logs generated by several microservices in our provisioning pipeline. It will focus not only on how important it is to select the most appropriate clustering algorithm, but equally on how fundamental it is to invest in production services with well-defined logging.
Antonio Davoli, Facebook
Antonio Davoli is a Production Engineer at Facebook, where he is leading the usage of machine learning and data analytics techniques in the provisioning space. He has worked on distributed caching systems at Amazon AWS and on data mining solutions for the maritime and telecommunications industries.
Room 331–332
What Is ML Ops: Solutions and Best Practices for DevOps of Production ML Services
Kaz Sato, Google
This session features the concept of "ML Ops" (DevOps for ML), with solutions and best practices for bringing ML into production service. We will learn how to use Google Cloud Kubeflow Pipelines to build a data pipeline for continuous training and validation, version control, scalable serving, and ongoing monitoring and alerting.
Kaz Sato, Google
Kaz Sato is a Staff Developer Advocate at Google Cloud for machine learning and data analytics products such as TensorFlow, Cloud ML, and BigQuery. Kaz has been invited as a speaker at major events including Google Cloud Next, Google I/O, Strata, NVIDIA GTC, etc. He has authored many GCP blog posts supporting developer communities for Google Cloud for over eight years. He is also interested in hardware and IoT and has been hosting FPGA meetups since 2013.
The AWS Billing Machine and Optimizing Cloud Costs
Ryan Lopopolo, Stripe
Stripe uses our AWS invoice to gain observability into our cloud infrastructure spend: it takes a Redshift cluster and enough SQL queries to power 1000 homes in Portland. We run our infrastructure on AWS because their services enable rapid prototyping and the cloud’s elasticity enables immense scale. Our efficiency engineering and finance teams are tasked with analyzing and optimizing the system that emerges from this flexibility. Optimizing cloud spend requires iterating on our code, infrastructure, observability, and organizational processes.
In this talk, I will explain how we added observability to our AWS infrastructure using the Cost & Usage report and custom reporting. I will show how our custom reports led to cost optimizations through internal scoreboarding, alerting using SignalFX, and enabling teams to independently assess the cost of deploying infrastructure. I will outline the impact that reserved instances, CI autoscaling, and cross-AZ network traffic minimization had on our costs. I will show that it is feasible to reduce AWS costs by up to 50% by operating a reserved instance strategy and discuss how we created incentives in our engineering organization to keep costs optimized.
Ryan Lopopolo, Stripe
Ryan founded and led Stripe's Efficiency Engineering team, which focuses on improving the efficiency and rigor of infrastructure decisions through data reporting and tooling. Ryan has worked on ETLs for cost attribution, infrastructure for capacity forecasting, and observability for costs, all to help find areas to optimize Stripe's cloud. These systems provide visibility to engineers, product, and leadership and are used to drive organizational initiatives.
Room 334–336
Let's Build a Distributed File System
Sanket Patel, LinkedIn
Let's explore something that we use and rely on every day: file systems, both typical and distributed.
We first will look at a typical file system, the architectural components and how they all work together when you perform a read or write. We then will take those components and evolve that into a distributed file system architecture. While the architectures we'll explore will not be of a specific file system, they will be generic enough to be relatable with many file system implementations that exist today.
We will then implement a tiny distributed file system in Python to see all those components working together in action. Please note that this will be a very simple, minimal example, not suitable for real usage. If you are a file system hacker, this session will be too basic for you.
Sanket Patel, LinkedIn
Sanket is currently a Site Reliability Engineer at LinkedIn, where he is working on the Capacity Engineering initiative. He previously worked at Directi, where he took care of Hadoop infrastructure, metrics, monitoring, and incident management pipelines. He is also into cycling and blogging.
Ensuring Site Reliability through Security Controls
Vijay Janakiraman and Anuradha Narayanan, PayPal
This talk will explain how security controls can ensure the high availability of a site. The session outlines how a security control such as a layer-7 application defense system can make the site reliable, and covers the various detection and mitigation capabilities against bad/malicious traffic.
Vijay Janakiraman, PayPal
I currently work as an Architect with the Core Security Products Team at PayPal, focusing on the architecture, design, and development of large-scale, distributed products for Application Security. I have 13 years of overall experience in designing and developing various platform products and solutions, and 4 years of experience in the Security domain. I received my bachelor's degree in Computer Science from the National Institute of Technology (NIT) Trichy, India.
Anuradha Narayanan, PayPal
Anuradha Narayanan works as a Senior Manager with the Core Security Products Team at PayPal Chennai. She has 17+ years of experience in the industry, including 3+ in the Security domain. Currently, she manages the team and product that protect PayPal web and mobile applications from attacks and malicious traffic. Prior to PayPal, she worked for eBay, building products and solutions for billing and pricing. She is also a Diversity & Inclusion Champion leading the Inclusion initiatives at PayPal Chennai.
3:30 pm–4:00 pm
Break with Refreshments
Level 3 Foyer 5
4:00 pm–5:30 pm
Summit Room 2
Shipping Software with an SRE Mindset
Theo Schlossnagle, Circonus
Most SRE techniques revolve around resiliency and reliability of service delivery. Most "product" is the type of product that is deployed, not shipped. At Circonus, we deal with a lot of on-premise software shipment due to hybrid customer requirements. It turns out that many SRE techniques can apply directly to the construction, packaging, and shipment of installed software as well. In this talk, we'll learn all about it.
At Circonus we build a large, complicated, quickly-changing product. It has a lot of moving parts and all of the typical challenges diagnosing and fixing issues that arise from complex and distributed systems. To add pain to this common challenge, we also ship the whole product stack to run on-premise "behind the air-gap." It is the same product. As you can imagine, the challenges here are many. How do you allow high-velocity, risk-managed development that results in many production releases daily and have the resulting product packaged for on-premise installation at the same time? I will articulate what I believe are the techniques from SRE that make this possible.
Releasing the World's Largest Python Site Every 7 Minutes
Perry Randall, Facebook
This talk details the practical steps that we took to build Instagram's continuous deployment pipeline, highlights the problems we faced, and explains the key ideas that scaled our server release process. These steps are generic and are relevant to any development team looking for a reason and/or a blueprint for moving towards a continuous deployment system. Development teams that already do some form of continuous deployment will identify with the problems and hear how we got it to work at Instagram. Others who do not yet do so will hopefully be inspired to start when they learn that they can always start with the simple thing and progress step by step.
Perry Randall, Facebook
Perry is a southern California native living in the Bay Area. He started his career at Blur Studio doing tech for the entertainment industry and now works at Facebook, where he has been for the last 4.5 years. He enjoys badminton, traveling, and release engineering!
Room 331–332
Why Does (My) Monitoring Suck?
Todd Palino, LinkedIn
What do you do when your infrastructure systems have evolved, but the means of watching them has been stagnant? The struggle between uptime and sleep is real, and we need to make sure that monitoring is effective without drowning in a sea of non-actionable alerts. The path to success is to instrument everything, but only monitor what truly matters.
NetRadar: Monitoring the Datacenter Network
Yun Chen, Baidu
The quality of a datacenter network directly affects the stability and performance of production systems. A network outage can happen on various devices and affect different scopes. The SRE needs to quickly identify the scope of the outage to determine the remediation actions.
At Baidu, we built NetRadar, a datacenter monitoring system, for this purpose. NetRadar applies a multi-dimensional analysis algorithm to various kinds of network quality data to identify the scope of network outages. In this talk, we will introduce our considerations in designing the monitoring system, as well as the analysis algorithm.
Yun Chen, Baidu
Yun Chen is a Senior Software Engineer at Baidu. Yun's work focuses on datacenter network monitoring and operation data analysis, including time-series anomaly detection and service diagnosis.
Room 334–336
Operating Elasticsearch with Ease at Scale
Aishwarya Sankaravadivel and Vikram Ramakrishnan, PayPal
Search is ubiquitous! From booking a cab on a ride-sharing platform to searching for a job on LinkedIn, Elasticsearch has emerged as a prominent solution for search and has been increasingly adopted for log analytics and application performance monitoring.
In this talk, we will focus on how SRE partners can leverage Elasticsearch to turn data into actionable insights by helping them find a needle in a haystack. While Elasticsearch's out-of-the-box defaults work for smaller implementations, the nuances of handling a massive Elasticsearch deployment that can scale to serve billions of ingests and millions of searches per day require a deeper knowledge of Elasticsearch internals.
At PayPal, we manage Elasticsearch at this scale, and we will share what we have learned from past experience, with specific details on how things can go wrong and effective ways to mitigate them, along with demos.
In short, the talk has the following sections:
- Elasticsearch overview
- You are not alone! (We will discuss the common issues that we could face with Elasticsearch)
- Making Elasticsearch truly elastic (We will cover the following topics with respect to ingests, searches, and monitoring)
  - Handling the dark queries
  - Policing the police
  - Do more with less
- Elasticsearch in Action! (Demo)
Aishwarya Sankaravadivel, PayPal
Aishwarya is a senior software engineer working on the Search Platform at PayPal, with deep expertise in managing Elasticsearch for global customers. She is a passionate technologist with extensive hands-on coding experience in Scala and Reactive Programming. She has received several awards, including CPI India Champion, Wonder Woman, and Star of the Month, for her exemplary work at PayPal. Besides work, she is an avid reader and reviewer of many books and novels.
Vikram Ramakrishnan, PayPal
Vikram is a senior engineering manager at PayPal with 15+ years of experience in the software industry. He is a dynamic, results-oriented leader with full-stack software engineering expertise, and a hands-on technologist and problem solver with a strong interest in large-scale distributed systems and Big Data (Hadoop/Spark/Druid/Elasticsearch).
Enhance Your Python Code beyond GIL
Nitin Bhojwani, Priya Pandian, and Arabinda Das, VMware
In this talk, we'll explain what the GIL is, why it is there in the first place, and how we can enhance our code for better performance on I/O- and compute-intensive tasks. We will also give a brief introduction to AsyncIO and its usage for I/O- and network-intensive tasks.
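As a rough illustration of the AsyncIO approach for I/O-bound work, the sketch below runs three simulated network calls concurrently on a single thread. It is not taken from the talk: `fetch`, `fetch_all`, and `run_demo` are hypothetical names, and `asyncio.sleep` stands in for a real network request.

```python
import asyncio
import time


async def fetch(name, delay):
    # asyncio.sleep stands in for a network call; while it waits,
    # the event loop is free to run the other coroutines.
    await asyncio.sleep(delay)
    return name


async def fetch_all(delays):
    tasks = [fetch(f"task-{i}", d) for i, d in enumerate(delays)]
    # gather runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*tasks)


def run_demo(delays):
    start = time.perf_counter()
    results = asyncio.run(fetch_all(delays))
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = run_demo([0.2, 0.2, 0.2])
print(results, f"{elapsed:.2f}s")
```

The three waits overlap, so the total time is close to the longest single delay rather than the sum. This helps only when tasks spend their time waiting; compute-bound work still contends for the GIL and is usually moved to separate processes instead.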
Nitin Bhojwani, VMware
Nitin Bhojwani has around 5 years of experience working in Python, during which he has worked on various scalable web applications. He has 2+ years of experience working as an SRE for VMware Cloud on AWS. He has been contributing to a script execution framework and a unified UI for all the SRE services. He likes Python and keeps exploring better ways of getting things done in Python.
Priya Pandian, VMware
I have 15+ years of industry experience, including many years at EMC streamlining and standardizing processes based on comprehensive knowledge of build and release tools engineering techniques. Along the way, I have provided continuous on-the-job training to associates to enhance their productivity and operational efficiency through knowledge enhancement and skill building. I have now moved into the roles of SRE and incident manager, and I also manage SRE teams for the new VMware service offering, VMC on AWS. My personal passion is to automate anything and everything that demands human touch.
Arabinda Das, VMware
I have 12+ years of experience in the software industry, mostly in cloud application development. I have worked with various products and SaaS services that function at scale in a distributed environment with a global presence.
For the last 2 years I have been working at VMware as an SRE for VMC on AWS. We build the framework for the remediation and troubleshooting service, which executes automated distributed remote actions on the software-defined data centers based on VMware's infrastructure virtualization.
I like to spend time exploring new technology and frameworks for SaaS development and contribute towards them.
Thursday, June 13
8:00 am–9:00 am
Morning Coffee and Tea
Level 3 Foyer 5
9:00 am–10:30 am
Summit Room 2
Reliable by Design: Adding Value in the Design Review Process
Laura Nolan, Slack
Reviewing designs written by other engineers becomes an increasingly large (and important) part of our work life as we become more senior in our careers. We review designs for entire new systems written by partner developer teams. We review designs for pieces of automation to be developed and run by our own teams. Eventually we may find ourselves using review as a way to keep many teams in sync technically.
Most of us, however, don't have a systematic way to approach reviews. We read the proposal or attend the meeting, and we look to our experience to try and predict problems. This is valuable, and experience can't be replaced, but I believe we can do better by applying both our expertise and a checklist of things to consider for each design.
Laura Nolan, Slack
Laura Nolan is an SRE who believes in the power of checklists to help us tame complexity and chaos. She is one of the contributors to the books Site Reliability Engineering and Seeking SRE, both published by O'Reilly.
How (Not) to Scale a Project: A Post-Mortem
Giacomo Bagnoli, Facebook
This talk is a multi-year retrospective about a real life project, from the initial wins as a proof of concept to the challenges and problems of scaling it up that almost jeopardized it.
It covers a network monitoring tool with promising initial results that scaled too fast, too soon: how it got a lot of traction as a proof of concept, how we failed to scale and productionize it, how expectations became misaligned with results, and how the customer perceived it afterwards.
This is ultimately a success story about how applying best practices that "we all know" helped rectify a potentially bad situation. We'll go through the history of this project, and how applying such practices to customer communications, software design, system design, and operational excellence got it back on track.
Giacomo Bagnoli, Facebook
Giacomo Bagnoli is a Production Engineer at Facebook in Dublin, where he works on network monitoring tools. Previously at Etsy, Amazon, and various small startups, he has been breaking and fixing systems for more than a decade.
Room 331–332
A Dashboard Is Worth a Thousand Words: Better Monitoring for Better Ops
Luca Magnoni, CERN
Not everyone is doing SRE. Consider a large-scale scientific organisation with decades of experience in distributed systems and IT service operations: it may have a solid, well-established ops culture and still benefit from adopting some of the new concepts and practices that SRE has defined in recent years. This is the story of how the creation of a new monitoring system, gathering together metrics and logs for infrastructure and services and based on a well-known technology stack (e.g. Kafka, Grafana, InfluxDB, Elasticsearch), led not only to better service operations but also to greater awareness of SRE practices and culture among service managers. The talk will discuss the design decisions, the operational challenges in building and scaling the system up to tens of thousands of hosts, and the strategy adopted to enhance monitoring practices, introducing concepts such as SLI/SLO, and the benefits derived.
Luca Magnoni, CERN
Luca is a Senior Software Engineer with more than ten years of experience in designing and operating distributed systems. He is currently a computing engineer and solution architect at the CERN IT Department working on monitoring infrastructures for the data centre and IT services.
Taming a Beast: Improving the Reliability of a Monolithic Web Service
Syed Humza Shah
Many startups find themselves wrestling with a monolith after achieving business success. CI builds take too long, deployments are a pain, and investigating root causes of incidents is a challenge. The team's trajectory from there is towards service oriented architecture for an improved state. But this is a slow transition during which the high-traffic business-critical monolith needs to stay afloat. Strategies are needed to make the monolith less painful to work on and more reliable to run. This talk will cover some of these strategies.
Syed Humza Shah
Syed Humza Shah is a software engineer with expertise in scalable architectures, distributed systems, systems reliability, and development workflows.
Room 334–336
Implementing Distributed Consensus
Dan Lüdtke, Google
May I introduce "Skinny", an education-focused, distributed lock service.
With the help of Skinny, we will:
- briefly look at the Paxos protocol
- see an example of a typical Paxos run
- design a simple distributed consensus protocol
- learn the tricky parts of implementing our simple distributed consensus protocol
- gradually move from theory-level to coding-level, solving small challenges (network, availability, fault-tolerance) along the way
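To make "a typical Paxos run" concrete, here is a minimal in-memory sketch of single-decree Paxos. It is not the Skinny implementation: the `Acceptor` class and `propose` function are hypothetical names, calls are plain synchronous method calls, and the network, timeouts, and failures are deliberately left out.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0           # highest proposal number promised so far
        self.accepted_n = 0         # number of the accepted proposal, 0 if none
        self.accepted_value = None  # value of the accepted proposal, if any

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """Run one proposal round; return the chosen value, or None without a majority."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [(an, av) for ok, an, av in replies if ok]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prev_n, prev_value = max(promises, key=lambda p: p[0])
    if prev_n > 0:
        value = prev_value
    accepted = sum(a.accept(n, value) for a in acceptors)
    return value if accepted >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "alice")   # no prior value, so "alice" is chosen
second = propose(acceptors, 2, "bob")    # must adopt the already chosen value
stale = propose(acceptors, 1, "carol")   # rejected: number is below the promises
print(first, second, stale)
```

The second proposer asks for "bob" but ends up confirming "alice", which is the safety property that makes the protocol usable for a lock service: once a value is chosen, later rounds cannot change it.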
This talk addresses engineers who have had little exposure to the inner workings of distributed consensus, who want to learn about distributed consensus as they start building distributed systems, or who have worked with ready-made distributed consensus solutions such as ZooKeeper and etcd but strive to understand the underlying theory as well.
Disclaimer: This work is not affiliated with any company (including Google) and purely educational!
Dan Lüdtke, Google
Dan is a Site Reliability Manager in Munich. He contributes to open source software projects, regularly helps to organize large hacker events, runs an autonomous system for fun, and dreams of space travel. Prior to Google, Dan served his country, worked as a security consultant, joined a start-up, and wrote a book about IPv6.
Dan earned a master's degree in Computer Aided Engineering from the Munich University of the German Federal Armed Forces.
Edge Computing: The Next Frontier for Distributed Systems
Martin Barry, Fastly
Over the years the prevalence of distributed systems has grown, driven by everything from database replicas through load-balanced resource pools to geographically redundant services.
Cloud computing has also changed the way we think about, and implement, distributed systems. Whether the components are at the level of infrastructure (VMs), services (datastores) or even single bits of software ("serverless") we need to consider the benefits and tradeoffs our solution embodies.
The next step in this evolution is "edge computing," where cloud services are offered as close to the client as is reasonable. This brings obvious benefits in reducing latency but also exacerbates most of the usual distributed systems problems to extreme levels.
Martin Barry, Fastly
Martin has many years of experience operating systems and networks, leading projects and teams, scaling fast-growing companies. He has worked in many corners of the industry, from hosting to fintech, enterprise SaaS to CDN.
10:30 am–11:00 am
Break with Refreshments
Level 3 Foyer 5
11:00 am–12:30 pm
Summit Room 2
Slack at the Edge
Brett Pemberton, Slack
Recently, a power outage in an AWS data centre meant that our London region went offline without any of our connected users experiencing any downtime, something that would have been near impossible for us a year ago! In this session, I will share how we invested in automation to make horizontal scaling easy (too easy, in fact), the perils of "too much" automation, how we collaborate with teams across time zones, and how SRE can influence product architecture decisions for the better.
Brett Pemberton, Slack
Brett has been professionally swearing at computers for the last 18 years. He's spent the last couple of those doing so at Slack, with love. He only sometimes breaks things.
Critical Path Analysis—Prioritizing What Matters
Althaf Hameez, Grab
As your business expands, not every service requires the same amount of attention and rigour devoted to it. Grab faced the same problem as it grew rapidly and expanded into different markets beyond ride-hailing. We then came up with a way to define a Critical Path for the business which allowed everyone to be aligned on where time and rigour needs to be invested.
Althaf Hameez, Grab
Althaf is an SRE Lead at Grab, where he has been for more than four years. He originally joined as a Ruby developer before switching to form the SRE team and has never looked back! When he's not working, he's probably busy playing World of Warcraft.
Traffic Forecasting and Stress Testing Infrastructure
Sumit Sulakhe, LinkedIn
As we add more products to our offering, the customer base increases, which leads to an increase in the amount of traffic served by the various microservices in our platform. Such a surge, if not handled in a timely manner, might lead to degraded user experience. To avoid such situations we have to plan our infrastructure capacity. The challenge here is not just getting the traffic prediction right but also how we will test our infrastructure to make sure we can handle it. In this session we will discuss the importance of traffic forecasting, what are the challenges faced while forecasting, and how we can verify if our infrastructure can handle the forecasted traffic.
Sumit Sulakhe, LinkedIn
Sumit Sulakhe is an SRE at LinkedIn Bangalore currently working with the production SRE team. He has been working as an SRE for the last 3 years.
Room 331–332
Ignite Karaoke
Capacity Planning with Stress Testing
Cheng Zhao, MOGU
MOGU is a destination of fashion, which allows people to discover and share fashion trends while fully enjoying a high-quality shopping experience.
Behind this business scenario, there are many complex distributed systems, which go through nationwide shopping carnivals every year, such as Singles' Day (double-11), double-12, Valentine's Day, and so on.
In this talk, I will mainly share the process of e-commerce big promotion capacity planning, the stress testing solutions from single-machine, single-chain to full-chain, the analyzing methods of core-chain, the evaluation methods of traffic model, and the effective management combined with SLO.
Cheng Zhao, MOGU
Cheng is the Technical Lead of the Cloud-Platform Architecture Team at MOGU, responsible for the Middleware, High Availability Architecture, Tools Platform, PE, DBA, and so on.
Collective Mindfulness for Better Decisions in SRE
Kurt Andersen, LinkedIn
Studies in the fields of safety, incident response, and organizational performance have identified a common characteristic shared by the best performing groups. This characteristic is called "collective mindfulness" or "mindful organizing" and it is especially pertinent to decision making under stressful conditions.
In this talk, you'll learn what collective mindfulness is and which training exercises promote its development in teams or wider organizations. Specific suggestions relevant to the SRE practice will be provided along with exercises drawn from parallel organizations. You'll be able to take these exercises back to your day job after the conference to help your team.
Kurt Andersen, LinkedIn
Kurt Andersen has been one of the co-chairs for SREcon Americas and has been active in the anti-abuse community for over 15 years. He is currently the senior IC for the Product SRE (site reliability engineering) team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware and Mobile Anti-Abuse Working Group (M3AAWG.org). He has spoken at M3AAWG, Velocity, SREcon, and SANOG on various aspects of reliability, authentication, and security. He also serves on the USENIX Board of Directors and as liaison to the SREcon conferences worldwide.
Room 334–336
TCP—Architecture, Enhancements, and Tuning
Dinesh Dhakal, LinkedIn
This talk will dive deep into the workings of one of the crucial internet protocols, the Transmission Control Protocol, and further highlight the challenges around it. We will discuss the architecture and design principles that led to the development of the protocol, followed by the detailed study of its workings covering the 3-way handshake, the sliding window mechanism and how TCP achieves reliability and congestion control. Moreover, there have been various enhancements and extensions that have been proposed over the years that have led to better efficiency of the protocol in multiple scenarios. We will then discuss the possible improvements in speed and efficiency that can be achieved easily by tuning a few parameters available to us today.
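One calculation that often sits behind TCP tuning is the bandwidth-delay product (BDP): the amount of data that must be in flight to keep a path full. The sketch below is not from the talk. The function names and example figures are illustrative assumptions, and it uses only the general relationship that throughput is at most window size divided by round-trip time.

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bytes that must be in flight to fill a path of the given bandwidth and RTT."""
    # bits/s * ms / 1000 gives bits in flight; divide by 8 for bytes.
    return bandwidth_bps * rtt_ms // 8000


def max_throughput_bps(window_bytes: int, rtt_ms: int) -> float:
    """Upper bound on throughput when at most one window is sent per round trip."""
    return window_bytes * 8 * 1000 / rtt_ms


# Illustrative numbers: a 1 Gbit/s path with a 100 ms round-trip time.
needed = bdp_bytes(1_000_000_000, 100)      # window needed to fill the path
capped = max_throughput_bps(65_535, 100)    # 65,535 bytes: largest window without window scaling
```

With these example numbers, the path needs roughly 12.5 MB in flight, while a 65,535-byte window caps throughput at about 5.2 Mbit/s, which is why window scaling and buffer sizes matter on long, fast links.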
Dinesh Dhakal, LinkedIn
Dinesh Dhakal is a Site Reliability Engineer at LinkedIn currently working with the Data SRE team in Bangalore. In his previous capacities, he was a Software Engineer with Juniper Networks, where he developed a feature for L2 protocols on Junos. He also worked briefly with the OSPF team at Cisco where he designed and developed the APIs to implement the IETF YANG model for OSPF. Outside of work, he is a foodie, a voracious reader, loves to travel, and can banter about anything under and beyond the sun.
BGP—The Backbone of the Internet
Michael Kehoe, LinkedIn
Without the Border Gateway Protocol (BGP), there is no modern-day internet. BGP has grown over the past 30 years to be a key protocol that connects the world. BGP has evolved significantly since its first implementation with numerous versions as well as extensions and features constantly being added. This session is going to look at the history of BGP, explain the basics of the protocol and how it is implemented, then look at some of the new features and uses of BGP outside of just connecting the internet.
Michael Kehoe, LinkedIn
Michael is a Staff SRE at LinkedIn working on Incident Response, Disaster Recovery, Visibility Engineering & Reliability Principles. He specializes in maintaining large system infrastructure as demonstrated by his work at LinkedIn (applications, automation & infrastructure) and at The University of Queensland (networks). Michael has also spent time building small satellites at NASA and writing thermal environments software at Rio Tinto.
12:30 pm–2:00 pm
Luncheon
Nicoll Room
Sponsored by Bloomberg
2:00 pm–3:30 pm
Summit Room 2
Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up
Mahak Lamba, LinkedIn
Behind our platform serving over 600 million users, there is an ever-scaling infrastructure comprising hundreds of servers running in different geographic locations and hosting a multitude of services including database services like Espresso, streaming services like Kafka, offline jobs and ML ranking services, written across various languages like Java, Python, and Go. To keep up with these ever-growing member and infrastructure needs, we need to scale our monitoring systems accordingly, in order to efficiently deliver a seamless user experience. But is this possible using existing tools and technologies?
This talk will focus on the scale that we operate at, the challenges we faced while scaling our monitoring system and a 360-degree view of how we monitor our microservice architecture.
Mahak Lamba, LinkedIn
I joined LinkedIn as an SRE Intern while I was completing my degree in Computer Science. After graduating in 2017, I joined LinkedIn as a Site Reliability Engineer in the Production-SRE team, mainly responsible for building applications and tools for efficient troubleshooting, issue detection & correlation.
Cross Continent Infrastructure Scaling at Instagram
Sherry Xiao, Facebook
Deploying a service across multiple continents is difficult, especially when you have a stateful service. As a service grows to serve a more global userbase, the speed of light becomes an issue. Come and learn how we scaled our service across the ocean at Instagram, and what problems we faced during the deployment.
Sherry Xiao, Facebook
I'm a Production Engineer working on scaling Instagram infrastructure. My team supports all engineering teams at Instagram, and gets involved with a large number of areas like rapidly scaling infrastructure, capacity planning, designing and practicing disaster recovery plans for Instagram.
Room 331–332
Detecting Service Degradation and Failures at Scale through Distributed Log Processing
Yegya Narayanan and Veeramani Gandan, PayPal
Detecting degradation and failures in distributed systems is a significantly complex problem, especially with 2000+ services running across multiple data centers. In this session, we will cover how the distributed log processing infrastructure in PayPal scales to process over 1PB of log volume per day and generate metrics to detect degradation in performance and failures in real time.
Yegya Narayanan, PayPal India
As an architect and lead engineer, Yegya has led diverse engineering teams within PayPal. Currently, he is an engineer and a member of the monitoring platform team at PayPal. In the current role, he is responsible for scaling the logging platform that provides near real-time metrics to monitor the applications.
Veeramani Gandan, PayPal India
Veeramani Gandan is a senior manager at PayPal focusing on the monitoring platform. He is responsible for building and scaling the logging platform from a gigabyte-scale to a petabyte-scale system. When not at work he enjoys playing table tennis and badminton. He is currently committed to building a strong SRE community in South India.
Ad Hoc SSH Access Using Signed Tokens
Daniel Bourque, Facebook
Static SSH access permissions can severely slow down troubleshooting and testing in large organizations. Do you enjoy waiting for DBAs to be granted access to servers you're trying to fix? Not only is this frustrating for you and your customers, it often results in teams being unnecessarily granted permanent admin access to large portions of your fleet.
This talk will demonstrate how we built a secure yet convenient mechanism for granting temporary SSH access that is both ad hoc and peer-based. After a brief recap of SSH certificate-based access, I will detail how to use x509 signed tokens to grant short-lived SSH certificates on demand.
Daniel Bourque, Facebook
Dan has been building distributed unix/linux systems at various scales for over 15 years. He loves automation, reliability and clean, simple designs. He currently focuses his work around security.
Room 334–336
Linux Memory Management at Scale: Under the Hood
Chris Down, Facebook
Memory management is an extraordinarily complex and widely misunderstood topic. It is also one of the most fundamental concepts to understand in order to produce coherent, stable, and performant systems, especially at scale.
In this talk, we will go over how to compose reliable memory-heavy systems. We will go over fundamental concepts of Linux memory management which are important for site reliability with an SRE who works on the Linux memory subsystem, busting commonly held misconceptions about things like swap, and giving advice on key and bleeding-edge kernel concepts like PSI, cgroup v2, memory protection, and other important topics along the way.
Chris Down, Facebook
Chris Down is a Production Engineer on Facebook's Web Foundation team, based in London. He is responsible for debugging and resolving major production issues and improving the reliability and efficiency of Facebook's systems. He also is a contributor to the Linux kernel, systemd, and many of Facebook's open source efforts.
The Definitive Guide to Make Software Fail on ARM64
Ignat Korchagin, Cloudflare
Cloudflare operates a large distributed network: we have more than 165 data centres across 75 countries. We recently decided to integrate a second CPU architecture into our infrastructure. The obvious choice was ARM64. Apart from doing the basic hardware bring-up, we also needed to port our whole software stack to ARM64, which includes a lot of in-house and third-party open-source components. It turns out that even if the software is written in a cross-platform, architecture-agnostic language, there are a lot of potential ways it can fail on a different architecture. This presentation describes the issues we encountered when porting our software to ARM64 and provides some advice for developers and SREs on how to avoid them when adopting new code.
Ignat Korchagin, Cloudflare
Ignat Korchagin is a security engineer at Cloudflare working mostly on platform and hardware security. Ignat’s interests are cryptography, hacking, and low-level programming. Before Cloudflare, Ignat worked as a senior security engineer for Samsung Electronics’ Mobile Communications Division.
3:30 pm–4:00 pm
Break with Refreshments
Level 3 Foyer 5
4:00 pm–6:00 pm
Summit Room 2
How to Champion SRE Investment to Different Levels of Leadership
Lyon Wong, Blameless, and Christina Tan, EloquentSpeaking
Reaching higher levels of the organization is essential to achieving broader adoption of SRE. Without the right buy-in, even if parts of SRE are rolled out, behaviors will regress. How then do you get this support? It starts with finding out the incentives needed by key levels of leadership and how to effectively speak to those. It also requires listening to the resistance to SRE adoption and then effectively addressing that resistance with reason and metrics. In this talk, I share how to persuade the right level of leadership so that an organization can progress the adoption of SRE through key stages. I provide case studies of how real companies have succeeded or failed with their SRE adoption. This talk aims to equip the audience with the tools to promote SRE adoption through a grassroots/bottom-up approach.
Lyon Wong, Blameless
Lyon Wong is Cofounder of Blameless (2018 backed by Accel and Lightspeed), a Silicon Valley company dedicated to SRE toolsets and platform. Lyon was previously a PM in leading User Experience on the Windows team at Microsoft. Lyon graduated from Waterloo and Stanford University.
Christina Tan, EloquentSpeaking
Christina Tan is Founder of EloquentSpeaking and helps leaders and executives communicate to have the highest impact. Christina has coached over 1,000 engineers, leaders, founders, and Ph.D.s from around the globe. She has also trained speakers for TEDx and conference keynotes. Christina graduated from Waterloo with CS and Business.
Challenges of Distributed Teams Such as Languages and Timezones, Especially Those Unique to Asia/Australia
Edwin Christopher, Symantec
- Cultural differences
- Culture and hierarchy
- Culture and technology
- Timezone difference
- Skillset equality
- Building trust
Edwin Christopher, Symantec
Edwin Christopher is working as a Sr Princ Site Reliability Engineer at Symantec and has around 15 years of experience in Information Technology and Information Security. He is proficient in determining and resolving technical issues quickly and skilled at providing effective leadership in fast-paced, deadline-driven environments.
Distributed Sys Teams
Srivatsa Ray, Fastly
Single Points of Failure is a term we all dread in the technology world. We go through the pain of making sure services are resilient and distributed and yet, more often than not, we fail to give the same treatment to the most critical part of any system—the Humans.
This talk will focus on the importance of hiring remote and hiring around the world. Distributed teams not only add value to the core systems but also help bring us closer to one another. Distributed teams promote diversity, inclusion, and cultural understanding of one another, break down political borders, AND can absolutely be more productive while giving you 24-hour coverage on teams!
Srivatsa Ray, Fastly
Sri Ray works at the intersection of DevOps, Security, and doing the right thing. He searches for solutions that respect and complement the human element of systems. While not architecting systems or dreaming of the next improvement to make, he spends most of his time on planes traveling around and getting to know the only place we have ever called home: Earth. He uses this opportunity to understand cultures and, more importantly, relish local food.
Room 331–332
Open Source Firmware at Facebook: Design, Deployment, and Demos
Andrea Barberio and David Hendricks, Facebook
In 2018 we set out to improve our boot and provisioning flow by turning our Linux engineers into firmware engineers and leveraging open source in our firmware stack. In this talk, we'll share some of the progress we've made to develop, deploy, and test open source firmware, as well as some of the difficulties and successes. In particular, we will cover provisioning, advances in measured boot, platform bring-up, and a new open-source system testing framework.
Andrea Barberio, Facebook
Andrea is a Production Engineer at Facebook, where he worked on the DNS infrastructure, cluster lifecycle automation, OS provisioning and in the last year on open source system firmware. In his past lives, Andrea worked on the design and development of large scale network monitoring systems for AWS, embedded firmware development, pentesting, offensive security trainings, reverse engineering, and computer forensics.
David Hendricks, Facebook
David got his first taste of BIOS tweaking as a kid overclocking gaming PCs and was drawn into the coreboot world where he learned real firmware hacking. Since then he's developed, deployed, and advocated for open source firmware as part of Google's ChromeOS team and more recently as part of Facebook's infrastructure team.
Automating OS/Platform Upgrades for Service Owners
Rodrigo Menezes and Adam McKenna, Pinterest Inc
Ever burned out or even quit your job after an arduous migration? Keeping up to date with operating system versions, runtimes like Docker or JDK, and new instance types is a significant source of toil. The complexity increases when we consider several hundred microservices across tens of thousands of machines. At Pinterest, we've had multiple migrations that have gone on YEARS beyond their original schedule, hurting developer productivity and morale. In this talk, we analyze the reasons behind these failures, share our learnings, and demonstrate our pipeline for automating upgrade validation and deployment.
Rodrigo Menezes, Pinterest
Rodrigo Menezes is a member of the Core Site Reliability Engineering team at Pinterest. While at Pinterest, his main focus has been on Docker and creating applications to support running Pinterest's stateless prod infrastructure in a container. Outside of work, Rodrigo loves rock climbing, surfing, and anything outdoors, as well as working on various DIY projects.
Adam McKenna, Pinterest
Adam works on the Core SRE team at Pinterest, primarily collecting tech debt and developing automation tools with a focus on developer experience. He has over 25 years experience with Linux and other Unix-like operating systems, including several years as a member of the Debian project. As a hybrid systems engineer/administrator and coder, he was naturally attracted to DevOps work. Outside of work, he is a gardener, gamer and father of 3 boys.
Unified Reporting of Service Reliability
Helen Zhang, Google
We built a unified reporting system to bring together data from different sources that lived in unconnected silos (such as SLO reporting metrics, postmortems, incident response tools, customer support tickets, etc.). The system ingests and correlates data from these different sources and stores the processed data in a new database. People from a variety of teams would use the data to create customized dashboards that suit their particular reporting needs.
Helen Zhang, Google
Helen Zhang is a staff software engineer at Google SRE. During her nine years with Google, she has worked with hundreds of developers across the company to launch mission-critical production services. She recently led a team to build a unified service reporting system for service reliability.
Room 334–336
Software Networking and Interfaces on Linux
Matt Turner, Native Wave
These are the days of VMs, containers, and service meshes. The network, for a long time the sysadmin's mysterious domain, is now at the forefront: providing overlays, security features, and headaches. It's vital to be able to understand what's going on under the hood of a cloud-native platform if you ever hope to debug it, but do you know a TAP from a TUN, let alone an ipvlanL3? This talk will take you through all the network interface types on modern Linux, from good old eth0 to the vEths used by Docker and the tunnels used by Calico.
Matt Turner, Native Wave
Matt is CTO at Native Wave, a consultancy that designs, builds, and manages cloud-native platforms using the best open source software. Native Wave works with the whole business to re-architect and refactor applications to get the most from modern cloud technologies, whilst instilling DevOps/NoOps best practices.
Tuning Java's G1 Garbage Collector for Realtime Services
Andi Chalfant, Facebook, Inc
Java's G1 Garbage Collector is here, and will be replacing Concurrent Mark and Sweep as the new default for JDK10 and later. The G1 collector is a pure stop-the-world collector, and for services with hard and fast latency constraints the premise of a pure stop the world collector can seem daunting. Understanding how the collector works and being able to measure your service workload are crucial to effectively tuning the G1 collector. With the right settings, the G1 collector can work effectively for realtime applications, even on very large heaps. This talk discusses how the G1 collector works, measuring G1's performance, and tuning ergonomics with a focus on latency sensitive services.
Andi Chalfant, Facebook, Inc
Andi is a Production Engineer based in Seattle, working on archival storage systems. Andi first started at Facebook in 2012, and in their time at the company they have worked on scaling the data warehouse, wrangling build systems, and refining the production of Messenger's MQTT-based network transport edge service. Andi holds an A.B. in Political Science and East Asian Languages and Civilizations from the University of Chicago.
HBase Internals and Operations
Biju Nair, Bloomberg LP
HBase is a key value datastore built on top of Hadoop HDFS. This talk will walk through HBase's architecture, various components it uses and the key metrics that can be observed to operate it efficiently for high availability.
Biju Nair, Bloomberg LP
Biju Nair is interested in system software design, development, deployment, and tuning. Currently at Bloomberg, he is focused on helping product teams adopt and efficiently use distributed systems like Hadoop, HBase, Cassandra, Kafka etc to meet their strict SLAs. In the data storage and information retrieval domain, he is experienced implementing solutions using VSAM, network, hierarchical, relational and MPP database management systems.
6:00 pm–8:00 pm
Conference Reception
Nicoll Room
Friday, June 14
8:00 am–9:00 am
Morning Coffee and Tea
Level 3 Foyer 5
9:00 am–10:30 am
Summit Room 2
How We Used Kafka to Scale Database Infrastructure
Basavaiah Thambara, LinkedIn
This talk will introduce our home-grown NoSQL document store, its initial design based on MySQL replication, and the challenges we faced. It will touch on some of the major changes made to the product to replace MySQL replication with Kafka-based replication and cover the Kafka-based replication implementation in some depth. It will close with an overview of Kafka configuration options that support reliable delivery, along with details of the application logic that ensures "exactly once delivery" and "rejection of out-of-band messages."
Audience takeaways:
- How our home grown NoSQL datastore works
- Challenges we faced with MySQL based replication
- How Kafka helped scale the database infrastructure
- Overall, how to use Kafka for database replication at scale
Basavaiah Thambara, LinkedIn
Basavaiah Thambara (Basu) has more than a decade of experience designing, building, and scaling MySQL databases. He is currently working as a staff database engineer at LinkedIn managing Espresso, an in-house distributed NoSQL datastore built on top of MySQL. Prior to LinkedIn, he worked at Yahoo! after his Masters in Computer Science. He currently lives in Bangalore, India.
How to Start On-Boarding of SRE
Takeshi Kondo, Quipper Ltd.
SRE's responsibility extends to all systems and is very extensive, so On-Boarding that helps new joiners grow effectively is very important. In this talk, I explain how we designed the On-Boarding we actually ran.
This talk contains the following:
- Tell values from Mission and Responsibility
- Break down the On-Boarding goal into a concrete learning plan
- To apply beyond SRE
From this talk you can learn how to build On-Boarding that can be applied to any organization, not just SRE.
Takeshi Kondo, Quipper Ltd.
SRE at Quipper. Interested in improving development experience. Recently, I have been building a system to make microservices production ready, including readiness checks, clarification of technology/business responsibility, a mechanism for notifying stakeholders, and documentation.
Room 331–332
Getting More out of Postmortems and Making Them Less Painful to Do
Ashar Rizqi, Blameless, Inc.
Teams that don't do postmortems well are missing out on an effective tool for driving positive organizational change. Crucial insights can be missed, which can result in repeat outages that hurt customer trust in the company. This talk briefly discusses the elements of an effective postmortem and what makes it challenging to identify those elements. It introduces concrete methodologies that alleviate the cognitive overhead and emotional burden of doing postmortems. The talk also outlines concrete case studies of how companies have meaningfully benefited from their postmortem learnings.
Ashar Rizqi, Blameless, Inc.
Ashar Rizqi is Co-Founder and CEO of Blameless Inc., a Silicon Valley company building the next generation SRE platform. Prior to Blameless, Ashar led SRE and Platform Engineering teams at Box and Mulesoft after a stint as an Enterprise Systems Engineer at Fidelity Investments.
The MTTR Chronicles: Evolution of SRE Self Service Operations Platform
Jason Wik, Jayan Kuttagupthan, and Shubham Patil, VMware
Running a Cloud Platform reliably comes with its own set of challenges. Irrespective of whether the solution used is In-House or Off-The-Shelf, with Incident Management spread across integrated systems, it is easy to lose sight of latent issues. Add Impact Assessment, Communication & Coordination to the mix.
Being an SRE is a tough job since the SRE is expected to know almost all aspects of software delivery. The life of an SRE becomes easier & empowered when equipped with the right set of tools.
Join us at our talk and we'll walk you through our experience of building an SRE Operations Platform for VMware Managed Cloud on AWS. We'll talk about how we have combined Automation, Monitoring, and Incident Management under a single umbrella to drive down MTTR, increase productivity, effectively communicate fleet-wide health to incident managers and customer success engineers, and, above all, reduce the toil of an SRE.
Jason Wik, VMware
I have been focused on service reliability and operating services at scale for 20+ years. My experience has been shaped by the challenges of engineering and supporting many large global services. I am the Director of the VMC SRE teams for the VMware Managed Cloud on AWS service. Defining and measuring service health is one of my areas of passion.
Jayan Kuttagupthan, VMware
I have been in software development for 9 years, contributing to both backend and front-end development. I am an SRE for VMware Managed Cloud on AWS (VMC on AWS) contributing to the development of Automation & Reporting platforms. My current interest is exploring how ML, Deep Learning, and AI can contribute to the SRE arena.
Shubham Patil, VMware
I currently work on problems ranging from Service Health Measurement to developing Scalable Automation Platforms for VMware Managed Cloud. In the past, I have worked on VMware's ESXi Kernel to optimize schedulers and memory management in the hypervisor. In my spare time, I like playing around with Distributed Systems Design and Artificial Intelligence problems.
Room 334–336
Latency SLOs Done Right
Theo Schlossnagle, Circonus Inc
Median, average, 90th, 99th percentile. We've all seen these metrics on our monitoring systems, both open source and from commercial vendors, but often they are used incorrectly when constructing Service Level Objectives. This session will show three different approaches to correctly calculating latency SLOs, and how histograms can be used to calculate mathematically correct quantiles and set SLOs based on those.
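To make the histogram idea concrete, here is a small sketch of why histograms suit latency SLOs: bucket counts from many hosts can be added together exactly, whereas per-host percentiles cannot be averaged. This is not the speaker's method. The bucket bounds, host data, and function names are hypothetical, and the quantile returned is only an upper bound limited by bucket resolution.

```python
from collections import Counter
from typing import Dict


def merge(*histograms: Dict[int, int]) -> Counter:
    """Add per-host histograms; keys are bucket upper bounds in ms, values are counts."""
    total = Counter()
    for hist in histograms:
        total.update(hist)   # Counter.update adds counts for matching keys
    return total


def fraction_within(hist: Dict[int, int], threshold_ms: int) -> float:
    """Share of requests whose bucket upper bound is at or below the threshold."""
    total = sum(hist.values())
    good = sum(count for bound, count in hist.items() if bound <= threshold_ms)
    return good / total


def quantile_upper_bound(hist: Dict[int, int], q: float) -> int:
    """Smallest bucket upper bound that covers at least a fraction q of requests."""
    total = sum(hist.values())
    seen = 0
    for bound in sorted(hist):
        seen += hist[bound]
        if seen >= q * total:
            return bound
    raise ValueError("empty histogram")


# Hypothetical data from two hosts with very different traffic and latency.
host_a = {10: 900, 50: 90, 100: 10}
host_b = {50: 50, 100: 30, 500: 20}
fleet = merge(host_a, host_b)
```

An SLO such as "99% of requests complete within 100 ms" can then be checked directly with `fraction_within(fleet, 100)`, provided the threshold coincides with a bucket boundary.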
How to Trade off Server Utilization and Tail Latency
Julius Plenz, Google
When running large-scale systems, we strive to deliver both low tail latency and high utilization of servers. However, these two dimensions are at odds: increasing the average utilization of a system will have a detrimental impact on the tail latency.
This talk provides a light-weight walkthrough of the important basics of queueing theory (avoiding unnecessary formalism), illustrates graphically several typical outcomes of this analysis, and closes with a few basic rules on how to think about utilization and tail latency.
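The tension between utilization and tail latency can be illustrated with the simplest textbook queueing model. The sketch below assumes an M/M/1 queue (a single server with Poisson arrivals and exponential service times), which is an assumption made for illustration and not necessarily the model used in the talk. In that model the response time is exponentially distributed with rate (1 - utilization) / service_time, so any quantile has a closed form. The function name is hypothetical.

```python
import math


def mm1_latency_quantile(service_time_ms: float, utilization: float, q: float = 0.99) -> float:
    """q-quantile of response time (queueing plus service) in an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    if not 0 < q < 1:
        raise ValueError("q must be strictly between 0 and 1")
    # Response time is exponential with mean service_time / (1 - utilization);
    # the q-quantile of an exponential is -ln(1 - q) times its mean.
    return -math.log(1 - q) * service_time_ms / (1 - utilization)


# Tail latency for a 1 ms mean service time at increasing utilization.
p99_by_utilization = {u: mm1_latency_quantile(1.0, u) for u in (0.5, 0.8, 0.9, 0.95)}
```

Under this model the 99th percentile scales with 1 / (1 - utilization), so moving from 50% to 90% utilization multiplies tail latency by five, and each further step toward 100% costs disproportionately more.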
Julius Plenz, Google
Julius studied Math in Berlin and has been with Google in Sydney for four years, where he’s worked mostly on low-latency distributed storage systems.
10:30 am–11:00 am
Break with Refreshments
Level 3 Foyer 5
11:00 am–12:30 pm
Summit Room 2
Familiar Smells I've Detected in Your Systems Engineering Organization and How to Fix Them
Dave Mangot
Over the course of my career, I've had the opportunity to work with a number of organizations on their operational maturity. After doing "systems archeology" a number of times, starting at new organizations, I began recognizing certain signature "smells" that indicated that there was something that could be improved, and often had a pretty good idea how those situations came to be.
Things like the volume of pager alerts can be indicators of poor signal to noise ratios, or overworked infrastructure, or broken architectures. Things like elaborate change control can be signs of inadequate testing, or lack of automation (as if a review by people unfamiliar with the changes makes it safer). Recovery mechanisms that are never tested are never going to actually work in the case that they are needed except in the most trivial of cases.
There are many such examples with single points of failure, competing change mechanisms, scaling challenges, outsourcing of manual automation (not a typo), badly scoped runbooks, immature monitoring, multi-generational monitoring systems, and more, that are signs that we can do better.
In this talk, we'll talk about some fun that was had over the years, maturing different infrastructures, learning from failure and success, and how we can take lessons from "mistakes were made" scenarios to increase our performance, lower our MTTR, and help those in the systems engineering organization love their job.
Dave Mangot
Dave Mangot is the author of Mastering DevOps from Packt Publishing. He was previously the head of Site Reliability Engineering (SRE) for the SolarWinds Cloud companies and is an accomplished systems engineer with over 20 years' experience. He has held positions in various organizations, from small startups to multinational corporations such as Cable & Wireless and Salesforce, from systems administrator to architect. He has led transformations at multiple companies in operational maturity and in a deeper adherence to DevOps thinking. He enjoys time spent as a mentor, speaker, and student to so many talented members of the community.
SRE Then vs SRE Now: A Balancing Act between Reflexes and Intuitive Instincts at PayPal
Shyam Sunder VR and Deepa Elumalai, PayPal India Pvt Ltd
Conceived and embarked on in 2009, PayPal's SRE has been the A-team that has been a significant factor in PayPal's ability to serve millions of users, day in and day out. It has aided PayPal in becoming and remaining one of the most trusted payment solutions the industry has ever seen. This talk will attempt to build a narrative around the transformation that PayPal SRE has gone through with milestones that are truly remarkable and ahead of the curve.
Shyam Sunder VR, PayPal India Pvt Ltd
Deepa Elumalai, PayPal India Pvt Ltd
Deepa Elumalai is a technologist with over 14 years of experience in software development. She is a passionate problem solver and machine learning enthusiast with a heightened desire to explore technology. She's a strong advocate of the "automate everything possible" ideology. It all started with her journey at HCL, where she built simulators and automation test frameworks to automate protocol testing for VoIP devices. Her next venture saw her gain strong PayPal domain expertise as a Site Reliability Engineer, and with her problem-solving savviness she spearheaded the team and built Snap, an "Auto Triage Platform" to troubleshoot incidents and alerts. She explored technologies and applied machine learning techniques to draw insights from triage data. This transformed the platform, enabling it to produce accurate triage results with very quick response times. She loves cooking and enjoys playing cricket and football with her two adorable sons!
Room 331–332
Building Centralized Caching Infrastructure at Scale
James Won, LinkedIn
Caching is integral to any large-scale web operation. LinkedIn formed a dedicated caching team in 2017 and since then we have built out automation and infrastructure to support over 7 million queries/second across more than one-hundred clusters.
In this talk, I will walk through:
- Why this team needed to exist
- What we wanted to improve (e.g. tighter integration with existing deployment infrastructure)
- How we integrated a third-party product into our deployment system
- Things we wish we did differently after implementing our initial automation/tooling
- Implementing seamless upgrades (compare it to how things were in the past)
- Transitioning from running in root to non-root
- Tooling we created to provision stores quickly
- Where we want to take caching at LinkedIn
- Things to consider if your team provides a datastore as a service
James Won, LinkedIn
James Won is a Staff Site Reliability Engineer at LinkedIn, responsible for keeping its caching infrastructure scalable and running smoothly. He not only spends time in the day-to-day operations of maintaining caching infrastructure but is also a huge fan of Python and thus automates as much as he possibly can to reduce human error and make tasks as self-service as possible.
Hybrid XFS—Using SSDs to Supercharge HDDs at Facebook
Skanda Shamasunder, Facebook
Hard drive capacities are going up but the IOPS they supply are staying flat. This has led the exabyte-scale storage community to get inventive in how we use our disks to get the most out of them. This talk is the story of how we at Facebook coupled SSDs with HDDs to unlock their full potential by productionizing a little-used feature in XFS that was designed with a whole other use case in mind. What started off as an experiment has now been rolled out across the Facebook fleet and has allowed us to scale our storage systems to meet the ever-growing IO demand. It is a lesson in how combining existing hardware and software in novel ways by taking informed risks can have huge payoffs and help us scale our systems.
Skanda Shamasunder, Facebook
Skanda Shamasunder is a Production Engineer at Facebook and has been working on Facebook's exabyte-scale storage system for the past three years. He works at the intersection of hardware and software, providing Facebook's storage systems with a reliable and performant foundation to run on. In his spare time, he writes about himself in the third person.
Room 334–336
Lightning Talks
- Developing Effective Project Plans For SRE Internships
  Andrew Ryan, Facebook
- Zero Downtime Cross-Cluster Migration of Microservices in Kubernetes
  Swati Singhvi and Vrinda Malhotra, VMware
- How to Ruin an SRE-Dev Relationship in 3 Simple Steps
  Raushaniya Maksudova, Google
- An Effective Agile SRE Workflow
  Jay Chin, Grab
- 5 Actions for Training Your SRE Team
  Dorian Basuyau, Ubisoft Singapore
- Managing Terraform State at Stack Overflow
  Mark Henderson, Stack Overflow
- Transparency—How Much Is Too Much?
  Amiya Adwitiya, Squadcast Inc
- Dashboards for Thousands of Services
  Andreas F. Bobak, Google
- What SREs Can Learn from Moms about the Preventative Paradigm
  Rayappa Mayakunthala, Salesforce
- Our Practices of Delegating Ownership in Microservices World
  Daisuke Fujita, Mercari, Inc.
- Error Budgets in Banks—Challenges & Way Forward
  Chaitanya Gorrepati and Alex Titlynanov
Note: A single PDF file containing all the slides submitted by presenters is available for download below.
12:30 pm–2:00 pm
Luncheon
Nicoll Room
2:00 pm–3:30 pm
Summit Room 2
Extending a Scheduler to Better Support Sharded Services
Laurie Clark-Michalek, Facebook
Sharded services are incredibly common in modern computing, but schedulers don't usually consider them when making scheduling decisions, preferring to take a task-based view of service health. This disparity between models can lead to ugly workarounds, which can in some cases jeopardize the reliability of a service. This talk describes a project that extends a task-based scheduler to take shards into account when scheduling. This project has sped up updates and reduced downtime for many services across Facebook, and has a surprisingly diverse set of uses.
High Availability Solution for Large Scale Database Systems
Guowei Zeng, Baidu Inc
At Baidu, over 95% of the OLTP requests go to the MySQL database system, which processes over 100 billion queries each day and supports important applications such as search, advertising, and finance. Its availability is therefore of paramount importance. Given MySQL's master/slave architecture, the database HA solution hinges on two key points: accurate failure detection, and remediation that maintains data consistency. Failure detection should be as accurate as possible, because a false positive triggers a lossy slave promotion, while a false negative can lead to disaster if the failure persists. The remediation procedure should fill the data gap to ensure data consistency.
This talk will introduce a complete solution for database HA. Specifically, we will share our successful experience at Baidu in meeting different application needs, such as large-scale clusters or high consistency for financial systems.
Guowei Zeng, Baidu Inc
Guowei Zeng has been a Database Architect and Senior R&D Engineer in the DBA group of Baidu for seven years. He is now responsible for the technical infrastructure of the database management platform, with a continuing focus on database high availability and resource management.
Room 331–332
Understanding Business Metrics Can Make You a Better SRE
Mohit Suley, Microsoft, and Kurt Andersen, LinkedIn
An SRE who understands the business her company is in will be able to communicate effectively with various levels of management and understand how her day-to-day work impacts the success of the company. With this perspective, you can seek global optimization in ways that transcend simple availability. Being fluent with these concepts will also support your career growth.
Why should you care if your company makes any money? Business metrics do not need to be a mysterious world that is maintained by people with MBAs. Technical people will find both corporate and personal benefit in their careers by expanding their focus to include business fundamentals which are affected by services they run.
Business metrics are all oriented around two main axes: who your customers are, and how you make money. All of the mysterious acronyms and abbreviations are just different ways of mixing and matching those two components along with time.
Attendees will walk away with a clear understanding of standard business metrics and how they relate to Site Reliability. We will use simplified examples of different business models to show the application of these metrics across multiple domains.
Mohit Suley, Microsoft
Mohit is an Engineer on Bing's Live Site Engineering team. Designing systems to proactively improve availability and route around problems is a core mission of the team. In his spare time, he loves long walks, tinkers with hardware, and chases his goal of reading more books than Bill Gates.
Kurt Andersen, LinkedIn
Kurt Andersen has been one of the co-chairs for SREcon Americas and has been active in the anti-abuse community for over 15 years. He is currently the senior IC for the Product SRE (site reliability engineering) team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware and Mobile Anti-Abuse Working Group (M3AAWG.org). He has spoken at M3AAWG, Velocity, SREcon, and SANOG on various aspects of reliability, authentication, and security. He also serves on the USENIX Board of Directors and as liaison to the SREcon conferences worldwide.
Yes, No, Maybe? Error Handling with gRPC Examples
Gráinne Sheerin, Google Ireland
Hello World! When it all works, it's easy. What can you do when your client gets a response which isn't success? Let's focus on how the client and server can have different views of the outcome, and some simple code snippets for handling these.
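One way the client and server can disagree: the server applies a request, but the reply is lost or the client's deadline expires first, so the client sees a failure for work that actually succeeded. The sketch below simulates that case in plain Python without the gRPC library. The status names mirror gRPC's canonical codes, while `FakeServer`, `call_with_retry`, and the request-ID deduplication scheme are hypothetical names and design choices introduced only for illustration; they are not the snippets from the talk.

```python
from enum import Enum


class Status(Enum):
    # Names and numbers mirror gRPC canonical status codes;
    # this enum is a stand-in, not the grpc library.
    OK = 0
    DEADLINE_EXCEEDED = 4
    UNAVAILABLE = 14


class FakeServer:
    """Hypothetical server that deduplicates writes by request ID."""

    def __init__(self, drop_first_reply: bool = True):
        self.balance = 0
        self.seen_request_ids = set()
        self._drop_next_reply = drop_first_reply

    def deposit(self, request_id: str, amount: int) -> Status:
        # Apply the write at most once per request ID.
        if request_id not in self.seen_request_ids:
            self.seen_request_ids.add(request_id)
            self.balance += amount
        if self._drop_next_reply:
            # The work is done, but the client never hears about it.
            self._drop_next_reply = False
            return Status.DEADLINE_EXCEEDED
        return Status.OK


def call_with_retry(server: FakeServer, request_id: str, amount: int,
                    max_attempts: int = 3) -> Status:
    """Retry failed calls, reusing the same request ID each time."""
    status = Status.UNAVAILABLE
    for _ in range(max_attempts):
        status = server.deposit(request_id, amount)
        if status is Status.OK:
            break
    return status


if __name__ == "__main__":
    server = FakeServer()
    print(call_with_retry(server, "req-1", 100), server.balance)
```

In this simulation the first attempt changes the balance even though the client observes DEADLINE_EXCEEDED. Because the retry reuses the same request ID, the deposit is applied once rather than twice. Whether a retry is safe in a real service depends on the operation being idempotent in some comparable way.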
Gráinne Sheerin, Google Ireland
Gráinne is a Site Reliability Engineer for Google Ireland. She's a tech lead responsible for Ad Serving infrastructure and has seven years of experience in production engineering. She's a physicist, earning a doctorate in Nanoscience from Dublin City University. Prior to Google, she masqueraded as a strategic relationship manager for Reuters and a network engineer for HEAnet.
Room 334–336
Aperture: A Non-Cooperative, Client-Side Load Balancing Algorithm
Ruben Oanta and Bryce Anderson, Twitter
Twitter's RPC framework, Finagle, employs non-cooperative, client-side load balancing. That is, clients make load balancing decisions independently. Although this architecture continues to serve Twitter well, it also comes with some unique trade-offs and challenges. In particular, it scales poorly as service clusters grow to thousands of instances. In this talk, we will dive deeper into the problem space and how we addressed it via an algorithm we call "Aperture."
Ruben Oanta, Twitter
Ruben has been working on Twitter’s RPC stack for the past five years. In that time, he has made substantial contributions to both the design and implementation of Finagle which have markedly improved the resiliency and operability of Twitter services.
Bryce Anderson, Twitter
Since 2016 Bryce has been with Twitter's Core System Libraries team working predominantly on the Finagle RPC library. Bryce enjoys long walks through RFCs and analyzing the potential for graph-wide meltdowns in service-mesh load balancers.
Building a Scalable Network Event Executor Using GO
Marek Denis, Facebook
With ever-growing networks, automation plays a critical role. We want engineers to have time to work on exciting projects versus staring at graphs 24/7 and reacting to events.
In this talk, we provide an introduction on how our monitoring and remediation systems work; we will also describe the initial steps on how to build a simple remediation system—scalable machinery that can watch for symptoms of errors/failures and react to these in a pre-defined way.
We will cover not only architecture, but also give recommendations on how to keep the system scalable. The talk will present a walkthrough of an open source system (URL Hidden) written in Go, a statically typed language with concurrency built into its philosophy. To finish the presentation, we will do a live demo of the system.
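The idea of watching for symptoms and reacting in a pre-defined way can be sketched as a table that maps symptom names to handlers. This is a minimal Python illustration, not the open source system presented in the talk; the event fields, symptom names, and handler functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    device: str   # hypothetical device name
    symptom: str  # hypothetical symptom identifier


def drain_link(event: Event) -> str:
    # Placeholder action: a real handler would call network tooling.
    return f"drained link on {event.device}"


def restart_agent(event: Event) -> str:
    # Placeholder action: a real handler would restart a process.
    return f"restarted agent on {event.device}"


# Pre-defined reactions: symptom -> remediation handler.
REMEDIATIONS: Dict[str, Callable[[Event], str]] = {
    "link_flapping": drain_link,
    "agent_unresponsive": restart_agent,
}


def handle(events: List[Event]) -> List[str]:
    """Run the registered remediation for each event; escalate unknown symptoms."""
    results = []
    for event in events:
        handler = REMEDIATIONS.get(event.symptom)
        if handler is None:
            results.append(f"escalated {event.symptom} on {event.device} to a human")
        else:
            results.append(handler(event))
    return results


if __name__ == "__main__":
    for line in handle([Event("sw1", "link_flapping"), Event("sw2", "fan_failure")]):
        print(line)
```

Keeping the reactions in a lookup table means a new symptom can be supported by registering one more handler, and anything unrecognized falls through to a human instead of being silently dropped.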
Marek Denis, Facebook
Marek Denis is a member of the Production Engineering team for the Network org at Facebook. His responsibilities include maintaining, monitoring, and improving the global production network infrastructure with automation and synthetic traffic probing systems.
3:30 pm–4:00 pm
Break with Refreshments
Level 3 Foyer 5
4:00 pm–5:00 pm
Summit Room 2
Ironies of Automation: A Comedy in Three Parts
Tanner Lund, Microsoft
As much as we often wish we could eliminate those "squishy humans" from the loop in order to maximize our system reliability, automation usually has unintended consequences. "The Ironies of Automation," a seminal paper on the problems that automation introduces, spelled these out quite clearly and still stands the test of time over 30 years later.
Join our fictional SRE as they attempt to automate a set of tasks. As the side effects and newly introduced problems continue to mount, they will learn an important lesson about human error, and how to strike the all-important balance for SREs between human and machine.
Tanner Lund, Microsoft
Tanner Lund has been a part of Azure's SRE organization from the beginning. He has worked in a variety of roles, including crisis management, developing SREBot, building data pipelines, and leading services through SRE/DevOps transitions. Throughout it all his focus has been on understanding complex systems and how we achieve our goals through them, seeking to unlock their secrets.
Ethics in SRE
Laura Nolan, Slack
Laura Nolan is an SRE who believes in the power of checklists to help us tame complexity and chaos. She is one of the contributors to the books Site Reliability Engineering and Seeking SRE, both published by O'Reilly.
5:00 pm–5:10 pm
Closing Remarks
Summit Room 2
Program Co-Chairs: Frances Johnson, Google, and Avleen Vig, Facebook