9:00 am–9:10 am
Opening Remarks
Nicoll Room 2–3
Program Co-Chairs: Jamie Wilkinson, Google, and Brendan Gregg
9:10 am–10:40 am
Nicoll Room 2–3
How the Sony PlayStation Network Does SRE
Suefuji Yu and Miyahara Yuya, Sony Interactive Entertainment
Sony Interactive Entertainment (SIE) is responsible for the PlayStation brand and family of products and services, including the PlayStation Network, which boasts over 100 million active users around the world. Despite the unprecedented and unplanned traffic increase during the COVID-19 lockdown, the team successfully launched PlayStation 5 (PS5), delivering a transformative, stable experience that delighted our fanbase. SIE’s Tokyo-based team will share the evolution of their SRE practice along with the distinct approaches taken by four SRE teams across regions (Tokyo, San Diego, Los Angeles, and San Francisco), including their pros and cons. The hope is that this presentation encourages teams to keep selecting and developing an appropriate SRE model, so that this success carries forward for generations to come.
Suefuji Yu, Sony Interactive Entertainment
Yu Suefuji is a Director of SRE and Platform Engineer at Sony Interactive Entertainment (SIE), the company responsible for the PlayStation brand and family of products and services. During his time at SIE, Yu has been responsible for developing the Tokyo-based team's Performance Test platform and led Performance Test activities for the PlayStation 5 launch in Tokyo with global team members in 2020. He loves to play video games, but recently he spends his time with his two daughters (0 years old and 2 years old) and wife.
Miyahara Yuya, Sony Interactive Entertainment
Yuya Miyahara is a Director of SRE at Sony Interactive Entertainment (SIE), the company responsible for the PlayStation brand and family of products and services. He joined SIE’s PlayStation Network Service Operations team in 2007 and has over 15 years of system design, deployment, monitoring and troubleshooting experience. Outside of work, he enjoys playing with his 6-year-old daughter outdoors.
Unleashing Generative AI: Improving Developer Productivity in SRE
Sandeep Hooda, DBS Bank
We will talk about how Site Reliability Engineering (SRE) teams can use Generative Artificial Intelligence (AI) to improve productivity and streamline operations. SRE teams are responsible for ensuring that systems are reliable and available while delivering new features and updates. However, the pressure to deliver updates can lead to burnout and decreased productivity among developers. Generative AI can help alleviate some of these issues by automating repetitive tasks, allowing developers to focus on more complex problem-solving. After defining what generative AI is, we will explore how SRE teams can leverage it to automate repetitive tasks like code reviews, testing, and deployment. Attendees will gain insights into how to delegate tasks to AI so that they can spend more time on higher-level tasks, leading to increased productivity and job satisfaction.
Sandeep Hooda, DBS Bank
Sandeep is an Engineering Manager at DBS with over 19 years of experience. In this leadership role, he is responsible for engineering innovative and strategic solutions. He has deep technical expertise in platform engineering, SRE, DevOps, risk management, solution architecture, and systems engineering. He has been instrumental in driving digital transformation and promoting SRE culture. He has also had the privilege of speaking at several tech conferences and enjoys writing on SRE and DevOps topics. He enjoys his free time out in the ocean, practicing to sail around the world.
10:40 am–11:10 am
Break with Refreshments
Level 3, Foyer 4
11:10 am–12:35 pm
Nicoll Room 2–3
Patterns, Not Categories: Learning Across Incidents
Tanner Lund, Indeed
Outage pattern analysis is hard! There have been many attempts to learn across multiple incidents. Folks look for categories, tags, causes, etc. to identify what's brittle or risky in their system, sometimes even using statistical models to help make sense of the data. However, their results often prove unsatisfying, non-actionable, or don't tell you anything you didn't already know from other sources.
An alternate approach is to find patterns via Christopher Alexander's "Pattern-Centered Inquiry". Complex systems fail according to certain patterns or fundamental laws. We can identify and learn from these patterns and then see how their individual, diverse manifestations in our systems develop and manifest. An understanding of patterns and how to spot them then underpins better informed reliability decision making.
Tanner Lund, Indeed
Tanner Lund has been studying incidents and what they can tell us about systems for the better part of a decade. During his time supporting cloud platforms, building data pipelines, managing crises, and improving site reliability, he's found there is a lot more to understand about how software and people work (and don't work) together. Throughout it all his focus has been on understanding complex systems and how we achieve our goals through them, seeking to unlock their secrets. That may take a while...
Observability in the MLOps Lifecycle with Prometheus
Shivay Lamba
MLOps is widely talked about and used to make the practice of deploying, managing, and monitoring ML models in production easier. Monitoring ML training or evaluation jobs is very important; however, it is even more important to monitor an ML model once it is deployed.
This talk starts with a gentle introduction to how ML deployments should be monitored, briefly covering edge cases in production, data drift, concept drift, and model metrics, as well as the standard system and resource metrics. We give the audience an overview of observability and monitoring in the context of MLOps. This monitoring can also indicate whether a model should be retrained, whether more data should be collected, whether different kinds of data should be collected, and more.
We show how one can handle the important task of monitoring, and the follow-up tasks described above, in the context of MLOps with Prometheus. We also show how one could take existing deployments and add easy and useful monitoring with Prometheus. Finally, we show demos of how one could pair Prometheus with Flyte, Seldon Core, or FastAPI ML deployments.
Shivay Lamba
Shivay Lamba is a software developer specializing in DevOps, Machine Learning and Full Stack Development.
He is an open source enthusiast and has been part of various programs like Google Code-in and Google Summer of Code as a mentor, and has also been an MLH Fellow. He is actively involved in community work as well. He is a TensorflowJS SIG member and a mentor in OpenMined, the CNCF Service Mesh Community, and the SODA Foundation, and has given talks at various conferences like GitHub Satellite, Voice Global, Fossasia Tech Summit, and TensorflowJS Show & Tell.
Nicoll Room 1
Towards Zero Carbon: Implementing Sustainable Battery Lifecycle Management in Data Centers
David Cesarano and Fanjing Meng, IBM
Batteries play a critical role in providing uninterrupted power during outages, but managing their lifecycle can be challenging and has environmental impact. In this presentation, we will introduce a sustainable solution for managing the lifecycle of batteries in data center infrastructure. Our solution employs IoT sensors, mechanism models, and AI models to monitor battery performance in real-time, analyze health, detect anomalies, and predict end-of-life. Real-time monitoring dashboards visualize battery performance and behaviors, while health analysis capabilities detect anomalies and predict failures. The solution recommends proactive maintenance to prevent costly downtime and triggers sustainable waste management processes. It reduces costs and environmental impact towards zero-carbon waste diversion goals. We will showcase real-world examples of our solution and how to improve waste diversion. The solution is deployed and running in production, supporting daily operations.
David Cesarano, IBM
David Cesarano is a Solutions Architect at IBM and is located in Phoenix, Arizona, USA. He has over 20 years of experience in IT and a Bachelor of Science degree in Computer Information Systems from Northern Arizona University. He has several data and cloud certifications and a couple of pending patents at IBM. His current area of focus is industry and data center management.
Fanjing Meng, IBM
Dr. Fanjing Meng is the CTO of IBM China System Development Lab, with over 20 years of experience in cutting-edge technology research, development and management. She specializes in sustainable computing, AIOps, ITOA, cloud computing, software and solution engineering. Her current focus is on developing a sustainable computing optimization and management platform to accelerate the digital transformation of enterprises. Dr. Meng has published over 30 academic papers and holds more than 40 international patents in innovative fields. She has received over 30 awards for her contributions to technological innovation from IBM and IEEE. Additionally, she actively participates in technical and academic communities, serving as a General Chair and committee member for international conferences, and as a project leader for IEEE WIE Beijing Affiliate and a speaker for IEEE Women in Services Computing (WISC).
Leveraging Analytics for Technical Efficiency and Enhanced User Experience
Muskan Prajapati and Renisha Fernandes, VMware
In today’s technology-driven world, efficiency is crucial in all aspects of site reliability engineering. Analytical methods play a vital role in achieving efficiency by identifying areas for improvement and optimizing various systems. Join us in our talk, where we will discuss three services developed by our team to improve site reliability engineering efficiency using analytics: Outage Management Service (OMS), a Slackbot, and Service Analytics. OMS automatically detects and resolves outages by analyzing past incidents, while the Slackbot predicts solutions based on past conversations. Service Analytics uses event data collection to generate reports for improving user engagement. These services significantly reduce Mean Time to Repair and alleviate on-call engineers’ burden, resulting in improved efficiency and productivity.
Muskan Prajapati, VMware
Muskan Prajapati has 3+ years of experience as a full stack developer and a year of experience as an SRE, and she is passionate about ensuring code quality and scalability. Currently exploring the field of SRE, she is enthusiastic about learning scaling techniques and delivering exceptional user experiences.
Renisha Fernandes, VMware
Renisha Fernandes has worked in software development for the past 10 years, contributing to both backend and frontend development. For the past 5 years, she has been contributing to the development and scaling of the automation platform that is actively used by VMware VMC SRE. She likes exploring distributed systems design and effective scaling.
12:35 pm–1:55 pm
Lunch
Summit Room 1
Sponsored by Citadel/Citadel Securities
1:55 pm–3:20 pm
Nicoll Room 2–3
LiveMLP: ML Platform for Assisting Contact Center Agents in Real-Time
Aashraya Sachdeva, Staff Engineer-ML, Observe.AI
Contact centers are essential for customer support, but managing high call volumes can lead to agent stress and high attrition rates. Traditional methods of improving performance include supervisor oversight and post-call systems that use ML to analyze recordings for behavioral and communication issues. However, these approaches have limitations in knowledge retention and product/service knowledge. A real-time system is needed to guide agents during calls, but this presents engineering challenges in throughput vs latency and fault tolerance compared to post-call systems. Real-time ML systems also differ in terms of batch vs non-batched inference and context. The talk presents a real-time ML platform that scales horizontally and ensures low latency, discusses approaches to make the system robust and stable, and demonstrates its efficacy through implementation and load testing with up to 10,000 concurrent calls. Real-world evidence shows that such a system positively impacts business metrics.
Aashraya Sachdeva, Observe.AI
Aashraya Sachdeva is a technology enthusiast who is passionate about creating accessible AI products. A Machine Learning expert with a focus on platform engineering, he has years of experience handling ML projects across data assimilation, modeling, deployment, and scaling. A graduate of IISc, Bengaluru, he is currently working as Staff Engineer, Machine Learning at Observe.AI. With his extensive experience in machine learning and platform engineering, he believes in converting research into practical products that are easy to use.
How Safe Is Your Domain?
Michael Kehoe, Confluent
All of us at this conference are responsible for the services that run on our company's domain, but who is responsible for the domains and subdomains that our infrastructure uses? And how secure and available is our domain? This talk is going to be a deep-dive on domain and DNS safety.
The safety of our domains is often overlooked and taken for granted. How many of us have really thought deeply about how to perform a threat and availability assessment of our domain and DNS infrastructure? This session will run through common availability and security threat vectors for domains and DNS and will demonstrate how to detect and mitigate them.
Michael Kehoe, Confluent
Michael Kehoe is an author, speaker and Sr Staff Cloud and Reliability Architect at Confluent, leading a whole organization initiative to redesign the company’s cloud platform. Previously, he was a Sr Staff Site Reliability Engineer (SRE) at LinkedIn, architecting LinkedIn’s move to Microsoft Azure. Before graduating with a Bachelor of Electrical Engineering from the University of Queensland (Australia), Michael interned at NASA Ames Research Center building small-satellites known as Phonesats.
While working at LinkedIn, Michael led the company's work on Incident Response, Disaster Recovery, Visibility Engineering & Reliability Principles. He has also been embedded with the profile, traffic, and espresso (KV Store) teams. After leading LinkedIn’s last physical data-center build, he was the architect for how LinkedIn builds its infrastructure in Azure.
Michael has spoken at numerous events all over the world and is the co-author of the books “Cloud Native Infrastructure with Azure” and “Reducing MTTD for High Severity Incidents”.
Nicoll Room 1
From Push to Pull: Managing Mutable Infrastructure at a Global Scale
Holly Mooneyham, Cisco Meraki
In a brownfield environment, the transition to immutable infrastructure can take years of concentrated effort, leaving teams with a complex set of mutable infrastructure and technical debt that requires a robust set of tools and processes to support as the architecture evolves. Managing these tools and processes across a massive hybrid cloud spanning more than 5,000 compute instances across 12+ data centers and public cloud is a daunting prospect for any SRE team.
I’ll share my story of evolving deployment infrastructure in this world from a developer-centric and push-oriented model to a distributed, eventually consistent, and pull-oriented model despite the challenges associated with traditional infrastructure and technical debt. I’ll also share what I’ll be keeping an eye out for in the future and how teams with similar problems but different environments can apply my design concepts to solve the problems and technical debt they face in their own work.
Holly Mooneyham, Cisco Meraki
Holly has worked in SRE for 7 years, looking after everything from kernel bugs to massive globe-spanning distributed systems. She loves taking things apart to understand how they work, and the joy of solving a complicated mystery that someone has dropped by her desk with. She takes the idea of curiosity as a skill to heart, and one of her favorite things is teaching others to "let's go see". Outside of work you'll find her hiking around the San Francisco Bay Area with her dog, leading raid groups in Final Fantasy XIV, or making lots of Japanese food.
Autonomous Automation: How Cloudflare Handles Server Diagnostics and Recovery at Scale
Jet Mariscal, Cloudflare
This talk describes the difference between automation and autonomy and shares the thought process behind transforming automation into an autonomous system. It includes a synopsis of a system that autonomously handles server diagnostics and recovery at scale at Cloudflare, which operates fleets of servers in data centers all over the globe, and explains how it was designed, highlighting how a few specific principles, including some essential SRE principles, played a crucial role in its success.
This presentation, which is applicable to organizations of any size and industry, will help attendees who want to implement or improve automation, or to transform existing automation into autonomous automation that drives value and leads to increased efficiency, productivity, and competitiveness in the long run.
Jet Mariscal, Cloudflare
Jet was an SRE and is currently working as the Infrastructure Engineering Tech Lead at Cloudflare. Previously, he was an SRE at Teralytics working on Big Data systems across several data centers around the world. Jet specializes in architecting and implementing large-scale fault-tolerant and high-availability distributed systems. Over his career, he’s built various systems and authored internal tools for automation in multiple programming languages.
3:20 pm–3:50 pm
Break with Refreshments
Level 3, Foyer 4
3:50 pm–4:45 pm
Nicoll Room 2–3
Real World Debugging with eBPF
Zhichuan Liang, Isovalent
In this talk, we'll explore the use of eBPF for debugging real-world production issues in a Golang environment. We'll cover the limitations of traditional debugging tools like gdb and delve, and dive into challenges and potential solutions for using eBPF in this environment. Through real-world use cases, we'll demonstrate how eBPF tools can help you debug production issues immediately without special debug modes. This talk will provide attendees with practical knowledge and deep insights into eBPF technology, helping them think more deeply about debugging in other environments and inspiring them to go further in debugging any environment.
Zhichuan Liang, Isovalent
Zhichuan Liang is a software engineer working on the Cilium datapath at Isovalent who loves troubleshooting and debugging by creating eBPF tools.
Nicoll Room 1
An SRE Guide to Linux Kernel Upgrades
Ignat Korchagin, Cloudflare
The Linux Kernel lies at the heart of many high profile services and applications. Since kernel code executes at the highest privilege level, it is very important to keep up with kernel updates so that production systems are patched in a timely manner against the numerous security vulnerabilities discovered almost every day.
Yet, because kernel code executes at the highest privilege level and a kernel bug usually crashes the whole system, many SREs, production engineers, and system administrators try to avoid upgrading the kernel too often for the sake of stability. In many companies we have seen a tendency to create more obstacles to Linux kernel releases (requiring more approvals, stricter update justifications, more time in canary testing, etc.). But introducing all these obstacles and not treating kernel updates like any other software update usually significantly increases the risk of the company and its services being exploited.
One of the reasons SREs and production engineers are too afraid of ANY kernel upgrade is that they don’t actually know the details of the Linux kernel release process and policy. This talk tries to demystify Linux Kernel releases and provides a guide on how to distinguish a kernel bugfix release from a feature release. We also explore why commonly established perceptions and patterns around production kernel releases are wrong and how you actually risk the stability of your systems by not releasing the kernel regularly. In the end we describe how kernel releases are implemented in our company and propose possible approaches to deploy kernel upgrades regularly with minimal risk.
Ignat Korchagin, Cloudflare
Ignat is a systems engineer at Cloudflare working mostly on Linux. Ignat’s interests are cryptography, hacking, and low-level programming. Before Cloudflare, Ignat worked as a senior security engineer for Samsung Electronics’ Mobile Communications Division. His solutions may be found in many older Samsung smart phones and tablets. Ignat started his career as a security researcher in the Ukrainian government’s communications services.
9:00 am–10:25 am
Nicoll Room 2–3
Taming Spiky Log Volumes: Maintaining Real-Time Log Accessibility with Kaldb
Suman Karumuri
Logs, much like currency, are subject to decreasing value over time. Observability teams face the challenge of ensuring high availability of recent logs, especially during incidents or deployments. Traditional log search systems struggle to autoscale cost-effectively and respond fast enough during an incident. In this session, we will discuss how Slack tackles spiky log volumes in Elasticsearch, from detecting log spikes at various stack layers to handling them using rate limiting, quotas, sampling, and backfills.
While these techniques help, they may result in data loss. Therefore, we will delve into the automation of handling log spikes using Kaldb, an open-source log search engine. We will explore trade-offs to minimize data loss, such as prioritizing the ingestion of fresh data over older data while auto-scaling. Powered by Lucene and OpenSearch, Kaldb allows Slack to prioritize fresh log data and rapidly scale capacity within a Kubernetes-based architecture.
Suman Karumuri
Suman Karumuri is a Principal Software Engineer and the tech lead for Observability at Airbnb. As an expert in distributed tracing, Suman has been a tech lead of Zipkin and a co-author of the OpenTracing standard, a Linux Foundation project under the CNCF. With extensive experience, Suman has spent years building and operating petabyte-scale log search, distributed tracing, and metrics systems at notable companies like Slack, Pinterest, Twitter, and Amazon. In his leisure time, Suman enjoys engaging in board games, exploring the outdoors through hiking, and spending quality time with his children.
Better Observability with No Code Changes
Tyler Benson, Lightstep
Many observability tools only instrument popular open source frameworks. A common path to deeper visibility requires code modification with manual instrumentation. However, SREs don't always have the ability to directly modify applications they are responsible for supporting. This restriction is often in place due to separation of job responsibility, or because the application is provided by an external vendor.
With configuration-based runtime instrumentation, SREs can get visibility into applications without the need to recompile.
In my talk, I'll share some heuristics I use to identify useful functions/methods that can be instrumented in unfamiliar applications. I will also showcase a tool I wrote to apply these heuristics in a Java codebase. Once these functions/methods are identified, attendees will learn to instrument them without any code modification using OpenTelemetry's Javaagent.
Tyler Benson, Lightstep
Tyler has been working in the Observability space and writing instrumentation for over 10 years (ex-New Relic, ex-Datadog, currently at Lightstep). He was a founding maintainer for the OpenTelemetry Java Instrumentation project. Tyler loves Asian food and can speak casual Mandarin.
Nicoll Room 1
The Secret Weapon for a Successful SRE Career - And It's Not What You Think!
Luke Mundy, Virtual Gaming Worlds
There's one skill that has been a large part of the success I've had as an SRE and it has nothing to do with technical knowledge or understanding the inner workings of Kubernetes. It’s my soft skills and how I work with others. But what even are "soft skills"? How do you develop those skills if you don’t have them? How do you get better at "working with people"?
This talk will use personal and professional anecdotes to explain the concepts behind "soft skills" and give you real usable advice and drills to develop these skills - regardless of whether you are an introvert or an extrovert. I’ll break it down and deliver it in context for technical professionals and walk through why I think these skills and concepts are such an integral part of the SRE discipline.
Luke Mundy, Virtual Gaming Worlds
Luke is a Senior SRE from Perth, Australia with a passion for working at the intersection of people and software engineering. His career started on an IT Helpdesk for a small MSP, but with the ambition of working in the games industry as a Software Engineer. Luke was instrumental in bridging the gap between development and operations teams when DevOps first swept over the industry and today is a key part of a team that builds and runs a highly successful social games platform providing entertainment to hundreds of thousands of players every day across North America.
SRE Engagement Model Transition in Building and Expanding SRE Team
Shimpei Sasano, JCB Co., Ltd., and Ryotaro Takeda, NTT Data Corporation
In this talk, we will share our transition of the SRE team's engagement model from a start-up to a team capable of supporting a 400-person organization. We responded to challenges as the business and organization grew by changing the team's engagement model and mission through three main phases: Launch, Specialization, Expansion. We hope our transition story can provide some insights for those planning to build or expand an SRE team.
Shimpei Sasano, JCB Co., Ltd.
Shimpei Sasano is the product owner of the SRE team at JCB Co., Ltd. In his previous job, he worked as an in-house SE for a retail system, participating in the introduction of public clouds and the development of applications for consumers. He joined JCB in 2020 and has been participating in a project to accelerate business construction by utilizing cloud native technology.
Ryotaro Takeda, NTT Data Corporation
Ryotaro Takeda is a Site Reliability Engineer at NTT Data Corporation. He joined NTT Data in 2014 and has been helping customers in the retail and financial sectors adopt Agile and DevOps practices. Currently, he leads the SRE practice for a financial service customer and is also dedicated to promoting SRE culture in NTT Data.
10:25 am–10:55 am
Break with Refreshments
Level 3, Foyer 4
10:55 am–12:20 pm
Nicoll Room 2–3
Untangling the Tangled Cloud
Joshua Fox, DoiT International
How do you arrange virtual machines, databases, and other services into logical groups?
Whether with Google Cloud projects, AWS accounts, or Azure resource groups, my consulting customers find that either lumping all the resources together or parceling them out into tiny groups makes management, security, and cost analysis too difficult: It’s tough predicting the impact of a change.
In this talk, related to my article in USENIX ;login:, I’ll explain how I advise architects to make their infrastructure follow the logical boundaries of microservices and the organization.
We'll see simple metrics that support the principles of high cohesion, low coupling; and high correlation between the stability of units and the fraction of inbound dependencies. To help in this, we'll review Ferent, a new open-source analysis tool that I developed in Clojure for measuring inter-project dependencies in Google Cloud.
Joshua Fox, DoiT International
Joshua Fox advises tech startups and growth companies about the cloud. Along with that, he writes open source, publishes technical articles, and speaks to cloud engineers as a Google Developer Expert.
Before that, he was a software architect in innovative technology companies in Israel for 20 years.
He has a PhD from Harvard University and a BA in math from Brandeis.
Read more at joshuafox.com
Functional Resonance Analysis: Diagramming Your System
Tanner Lund, Indeed
Nobody's system works exactly the way they think it does. On top of that, systems of people and software are constantly changing, resulting in a regular need to update our limited understanding of how things actually work - where the sources of our success are, where our risks are, and how things behave.
The Functional Resonance Analysis Method (FRAM) is one way to study complex systems. It models them in terms of their functions, dependencies, and interactions - identifying variance in function outputs (which can be good too!) instead of a "success/failure" paradigm. This approach allows for a better understanding of how systems work and - importantly - how they interact.
At the end of this session you should be able to understand such a model and evaluate whether it can help you better understand your own systems.
Tanner Lund, Indeed
Tanner Lund has been studying incidents and what they can tell us about systems for the better part of a decade. During his time supporting cloud platforms, building data pipelines, managing crises, and improving site reliability, he's found there is a lot more to understand about how software and people work (and don't work) together. Throughout it all his focus has been on understanding complex systems and how we achieve our goals through them, seeking to unlock their secrets. That may take a while...
Nicoll Room 1
Start Small, Scale Big: Building and Scaling Platforms and SRE Culture at Startups
Yash Shanker Srivastava
Platform and Site Reliability Engineering is just as crucial for startups as it is for bigger organizations, if not more so, as it creates enabling technologies and culture for the various Engineering, Product Development, and Business use cases of organizations, and instills a culture of continuous feedback. SRE, done "suitably" right, can lead companies to efficiently deliver high-quality, secure, compliant, robust, and reliable products to their customers. This talk shares the lessons learned from setting up teams and building Platforms and SRE culture for Startups and Scaleups. After the talk, the audience will take away a Framework of principles for setting up a Culture of Site Reliability Engineering, and practical learnings and insights to build and scale SRE teams and platforms that can handle the demands of a growing user base, in a dynamic and fast-changing startup environment.
Yash Shanker Srivastava
Yash Shanker is an Engineer based in Bangalore, India. Yash started his journey in Software Engineering by developing payment products at PayU Payments in India. Over the last 6 years, he has helped organizations across Germany, India, Singapore, and Thailand adopt and mature into the DevOps culture. He is currently working as Engineering Manager for the DevOps team at Toplyne, building Platforms and DevEx for the Engineering and Data teams.
Cultivating Accountability and Resilience
Sandeep Hooda, DBS Bank
In the dynamic world of technology, where challenges abound, a strong and resilient infrastructure is pivotal. DBS’ framework is built on a set of techniques that acts as a catalyst embodying a collaborative, resilient and forward-thinking culture across the organisation. By shifting the focus from individuals to the system as a whole, our proactive approach termed as the ABCD and E’s of cultural transformation includes how everyone in the organisation should behave (by understanding the concept of AAI: Awareness, Acceptance, and Intention), how we can create a safe and secure environment, how we include data driven analysis to drive conversation, and how we encourage collaboration.
Sandeep Hooda, DBS Bank
Sandeep is an Engineering Manager at DBS with over 19 years of experience. In this leadership role, he is responsible for engineering innovative and strategic solutions. He has deep technical expertise in Platform engineering, SRE, DevOps, Risk management, solution architecture and systems engineering. He has been instrumental in driving digital transformation and promoting SRE culture. He has also had the privilege of speaking at several tech conferences and enjoys writing on SRE and DevOps topics. He enjoys his free time out in the ocean, practicing to sail around the world.
12:20 pm–2:10 pm
Lunch
Summit Room 1
2:10 pm–3:35 pm
Nicoll Room 2–3
Finding the Needle in the Haystack: Predicting Storage Device Failures in Data Centers
Fanjing Meng and David Cesarano, IBM
Data is a valuable asset for organizations and its growth is exponential. However, storage device failures can result in data loss, service unavailability, and economic loss. Site Reliability Engineers face significant challenges managing and monitoring the millions or billions of storage devices deployed. Existing approaches to failure prediction have limitations in accuracy, performance, and cost-effectiveness. In this talk, we will present a practical, multi-phase proactive sampling-based approach and system that addresses these challenges. We will also provide a live demonstration of the system and practices in our data center, which has a multi-tiered cloud storage pool based on various storage devices. This talk aims to encourage practical storage failure prediction research to solve real-world challenges.
Fanjing Meng, IBM
Dr. Fanjing Meng is the CTO of IBM China System Development Lab, with over 20 years of experience in cutting-edge technology research, development and management. She specializes in sustainable computing, AIOps, ITOA, cloud computing, software and solution engineering. Her current focus is on developing a sustainable computing optimization and management platform to accelerate the digital transformation of enterprises. Dr. Meng has published over 30 academic papers and holds more than 40 international patents in innovative fields. She has received over 30 awards for her contributions to technological innovation from IBM and IEEE. Additionally, she actively participates in technical and academic communities, serving as a General Chair and committee member for international conferences, and as a project leader for IEEE WIE Beijing Affiliate and a speaker for IEEE Women in Services Computing (WISC).
David Cesarano, IBM
David Cesarano is a Solutions Architect at IBM and is located in Phoenix, Arizona, USA. He has over 20 years of experience in IT and a Bachelor of Science degree in Computer Information Systems from Northern Arizona University. He has several data and cloud certifications and a couple of pending patents at IBM. His current area of focus is industry and data center management.
Lessons Learned Running GKE Clusters on Spot Instances
Olga Mirensky, Australia and New Zealand Banking Group, ANZx
Reducing cloud costs is one of the major concerns for tech companies today. One of the most cost-effective ways to save on compute is to utilise the Spot provisioning model. All major cloud vendors offer Spot Instances with discounts of up to 91% compared to on-demand prices, and the model is tightly integrated into each vendor’s ecosystem, in particular in managed Kubernetes services like GKE, EKS and AKS. From our experience running a fleet of GKE clusters on Spot Instances, there’s much more to it than meets the eye. Losing capacity at a moment’s notice is only one part of the story, and in this talk we will delve into the under-the-hood mechanisms of the GKE Spot implementation, edge cases, and why team collaboration and solid SRE principles are absolutely crucial in this environment.
Olga Mirensky, ANZx
Olga is a Platform Engineer at Australia and New Zealand Banking Group, focusing on building cloud infrastructure for the new digital bank. Her recent roles span years of experience working with Kubernetes of various shapes and flavours running on AWS, GCP and Azure. She also developed managed OpenShift on Azure (ARO) while serving as a Red Hat SRE. She loves exploring modern cloud native technologies and is currently experimenting with Cluster API, Cilium, eBPF and system performance.
Nicoll Room 1
Are We All on the Same Page? Let's Fix That
Luis Mineiro, Delivery Hero
The industry has defined it as good practice to have as few alerts as possible, by alerting on symptoms that are associated with end-user pain rather than trying to catch every possible way that pain could be caused.
Organizations with complex distributed systems that span dozens of teams can have a hard time following such practice without burning out the teams owning the client-facing services. A typical solution is to have alerts on all the layers of their distributed systems. This approach almost always leads to an excessive number of alerts and results in alert fatigue.
Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing/OpenTelemetry's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics can be applied to identify the most probable cause, paging the respective team instead of the alert owner.
The approach enables an effective symptom-based alerting strategy with thresholds derived from the respective operation's service level objective.
Luis Mineiro, Delivery Hero
Luis's broad background in software engineering includes experience in DevOps, networks, mobile development, and more. He is passionate about reliability engineering and obsessed with on-call health and getting rid of false positives. Luis has been with Delivery Hero since 2022, creating a developer platform focused on self-service and automation.
Humane On-call
Martin Barry, Fastly
On-call is part of the working life for many folks in SRE, technical operations and software development roles.
It's a key enabler of the 24x7 operations that most businesses require and yet it is often an afterthought in the hiring, on-boarding and managing phases of team building.
This talk will aim to underline the importance of on-call and why it deserves more thoughtful consideration.
Attendees will take away ideas for adopting on-call at their company or changes they could make to their current on-call situation.
Martin Barry, Fastly
Martin is the Interconnection Manager at Fastly, where he focuses on the delivery of new edge capacity to add to their 252 Terabit network (as of December 2022). He has over two decades of experience in technical operations, starting with system and network administration, through to the design, deployment and operation of large scale internet infrastructure.
3:35 pm–4:05 pm
Break with Refreshments
Level 3, Foyer 4
4:05 pm–5:00 pm
Nicoll Room 2–3
Giving Away Your Secrets: Opening Metrics Up to Users
Alexander Ananiadis, Bloomberg
Does your platform host applications written by your customers who need to monitor their performance? Is your managed service mission-critical to performance-sensitive clients? Or perhaps your “customers” are actually other teams using your internal platform. Metrics and monitoring systems usually give us insight into the performance of our own services, but giving data directly to users – both internal and external – can help them detect and solve problems on their own, increasing their satisfaction and reducing the support load on our teams.
We will explore different methods for providing users and customers with this direct level of visibility and discuss which approaches make sense in different real-world scenarios, along with some common challenges and pitfalls.
Alexander Ananiadis, Bloomberg
Alexander is a team lead at Bloomberg L.P. where his team builds monitoring and observability solutions in the Real-time Enterprise Technology group. His professional experience is mainly in Python, working on back-end services and distributed systems. Alexander has a bachelor's degree in materials science & engineering from Johns Hopkins University and in a past life performed research on metal alloys for the US military.
Nicoll Room 1
From "Keeping the Lights On" to "Designing the LEDs": A Detailed Review of Our Journey Transforming 500+ Engineers
Ian David Hamilton and Sriram Subramanian, Standard Chartered Bank
In 2020, inspired after attending the Oct 2019 SREcon in Dublin, we embarked on our own SRE (and data engineering) transformation. Now, 3 years later, we would like to share our story of transforming 500+ production-facing engineers from a process-driven way of working to a dynamic, award-winning software engineering team who place SRE at their heart. The session will include a detailed review of the "how did we do it", lessons learned, key enablers that accelerated our transformation, and ongoing challenges – hopefully with some crowdsourced solutions! Key topics will be People and Culture, Observability, and measuring client experience with a 360 degree lens using quantitative and qualitative indicators.
Ian David Hamilton, Standard Chartered Bank
Ian is a seasoned technology professional with 20 years of experience working in technology financial services teams across investment and wholesale banking. His expertise in this field has led to a deep understanding of the intricacies of financial operations and the need for efficient and reliable systems. He is a passionate advocate for Site Reliability Engineering (SRE) and its power to transform technology operations. When not working in Technology or speaking about SRE, Ian enjoys a range of sports including cricket, football, and cycling. When not exercising, Ian is exploring blockchain, daydreaming of being financially free, or spending time with his family. Ian would like to thank ChatGPT for contributions towards curating his bio.
Sriram Subramanian, Standard Chartered Bank
Sri has been in the banking financial technology space for more than 25 years and has worked across different geographies globally across various financial institutions. He is based out of Singapore and has wide experience in building, maintaining and transforming systems across both investment and wholesale banking, working across front, middle and back office. In his current role, he is responsible for driving strategy, adoption, and engineering for SRE. He is an evangelist for the SRE Community of Practice within the bank and likes solving for reliability and resilience of services, helping elevate customer experience.
5:00 pm–7:00 pm
Conference Reception
Summit Room 1
Sponsored by DBS
9:00 am–10:25 am
Nicoll Room 2–3
Fighting Financial Crimes as an SRE
Anisha Manoharan, IMTF - Excellence in RegTech Solutions
The key takeaways from this talk are the technologies an SRE uses in the fight against financial crime.
Anisha Manoharan, IMTF
Anisha is an SRE with expertise in implementing automation, quality, security and monitoring technologies and solutions in the area of AML/CFT regulations and other aspects of financial crime. She has worked with various government organisations and currently works at a leading software provider of anti-financial crime applications to financial institutions, IMTF. She's delighted to present at SREcon23 and share her experience using a range of technologies to deliver solutions and products in a sustainable way.
Beyond Observability - Aligning Technology Performance to Business Outcomes
Stephen Townshend, SquaredUp
In the Digital Age we don’t know what we don’t know until we do something and get feedback.
So, how do we build that feedback loop? Many organisations have Data divisions focused on business and customer reporting, but they lack a connection back to engineering. What if we extended the scope of observability to include that missing feedback loop? What if we treated our business objectives, customer outcomes, and engineering maturity as things that we monitor continually and in real-time (just like our technology)?
In this talk I explore the bringing together of BI and observability into something new, which I’m calling “bigger picture observability” for lack of a better term. Something that provides a compass for organisations to navigate the ocean of chaos we call the Digital Age.
Stephen Townshend, SquaredUp
Stephen is currently working as a Reliability Advocate for unified dashboarding company SquaredUp. Previously he worked as a performance engineer for many years before switching to SRE. Stephen is passionate about making complex ideas easy to understand and implement, and promoting empathy and psychological safety in technology. He shares his SRE learning journey in his podcast Slight Reliability.
Nicoll Room 1
What Is Linux Kernel Keystore and Why You Should Use It in Your Next Application
Ignat Korchagin, Cloudflare
Did you know that Linux has a keystore ready to be used by any application or service? Applications can securely store and share credentials and keys, sign and encrypt data, negotiate a common secret - all without ever touching a single byte of the underlying cryptographic material.
This is especially useful in cloud-native environments, where services authenticate and securely talk to each other. But if a network-facing service also has some secret in its process address space, it sets itself up for failure, as any potential out-of-bounds memory access vulnerability may allow the secret to be leaked. Imagine a world where you don’t have to run an SSH agent just to protect your SSH keys.
On top of keeping your secrets secret, the Linux keystore integrates with security hardware, like TPMs and HSMs, and may provide a single entry point for applications to obtain their secrets.
Ignat Korchagin, Cloudflare
Ignat is a systems engineer at Cloudflare working mostly on Linux. Ignat’s interests are cryptography, hacking, and low-level programming. Before Cloudflare, Ignat worked as a senior security engineer for Samsung Electronics’ Mobile Communications Division. His solutions may be found in many older Samsung smart phones and tablets. Ignat started his career as a security researcher in the Ukrainian government’s communications services.
Multicloud and the Chamber of Secrets
Michael Kehoe, Confluent
Making secrets both available and secure in a hybrid or multi-cloud environment is a challenging endeavor. How should you balance the security of the system that stores your secrets? What system(s) should you even choose?
Confluent is a multicloud SaaS provider that needs to secure hundreds of credentials across multiple clouds. This talk will detail how we created a strategy to store and serve these secrets securely and how we keep the auditors happy. The session outlines the process we took to understand the needs of our engineers, what options we made available to our team, what controls we put in place, and how we kept the auditors at bay.
Michael Kehoe, Confluent
Michael Kehoe is an author, speaker and Sr Staff Cloud and Reliability Architect at Confluent, leading an organization-wide initiative to redesign the company’s cloud platform. Previously, he was a Sr Staff Site Reliability Engineer (SRE) at LinkedIn, architecting LinkedIn’s move to Microsoft Azure. Before graduating with a Bachelor of Electrical Engineering from the University of Queensland (Australia), Michael interned at NASA Ames Research Center building small satellites known as PhoneSats.
While working at LinkedIn, Michael led the company's work on Incident Response, Disaster Recovery, Visibility Engineering & Reliability Principles. He has also been embedded with the profile, traffic, and Espresso (KV Store) teams. After leading LinkedIn’s last physical data-center build, he was the architect for how LinkedIn builds its infrastructure in Azure.
Michael has spoken at numerous events all over the world and is the co-author of the books “Cloud Native Infrastructure with Azure” and “Reducing MTTD for High Severity Incidents”.
10:25 am–10:55 am
Break with Refreshments
Level 3, Foyer 4
10:55 am–12:20 pm
Nicoll Room 2–3
Hold My Beer - Load Testing. In Production. On Autopilot.
Slava Antonenko, Outbrain
You're driving a car that, like any other, went through crash testing before mass manufacturing and shipment. Now imagine the tests were done for each component separately and not for the car as a whole. Would you still drive it? Not so sure! The same is true of Load Testing in production. Running load tests on real-time data is possible and helps you improve your predictions and business performance. In this talk we will explain how we’ve touched the holy grail - Load Testing in production while minimizing risk and human intervention. We will go over the gains (uptime and predictability) and tradeoffs (risks and costs). We will go over how automating load tests drove a deeper cultural shift by increasing developer confidence in their services, with almost no additional overhead to the developers. And lastly, we will share more info about the service that made it all happen.
Slava Antonenko, Outbrain
Slava is a Site Reliability Engineer at Outbrain, bringing 7 years of engineering experience to his work. He specializes in building reliable and scalable ecosystems, has a passion for innovation and a talent for bringing ideas to life from scratch. In addition to his technical pursuits, Slava is also a huge Formula 1 fan.
Performance Testing in Keptn Using K6
Jainam Shah, JioSaavn
Keptn is an open-source project that provides scalable automation for delivery and operations, helps evaluate service level objectives (SLOs), and includes a dashboard, alerts, and auto-remediation. Keptn also allows users to add load testing to their delivery pipeline using tools like K6, JMeter, and Locust.
K6, a modern load testing tool, can simulate thousands of virtual users with just one load generator and has the ability to export test metrics to external data sources like Prometheus and Datadog.
After load testing, Keptn's Quality Gates can evaluate and monitor test results and define SLOs.
The Keptn Job Executor Service allows for the integration of K6 and can be used to integrate other tools by running tasks as short-lived Kubernetes Jobs.
Jainam Shah, Software Engineer @ JioSaavn | GSoC @ Keptn
Jainam is a polyglot developer from India who graduated from IIT Ropar with a degree in Computer Science. He currently works as a Software Engineer in the ML/AI team of JioSaavn and recently contributed to the Keptn organization as part of Google Summer of Code.
In his free time, he enjoys playing football and participating in hackathons.
Nicoll Room 1
Mastering Chaos: Achieving Fault Tolerance with Observability-Driven Prioritized Load Shedding
Harjot Gill and Hardik Shingala, FluxNinja
Microservices-based applications are complex, with metastable failures like cascading failures and retry storms posing significant challenges. In this talk, we will explore these types of failures, the shortcomings of current state-of-the-art approaches, and introduce Aperture, a unique open-source tool for observability-driven prioritized load shedding.
Aperture enables graceful degradation of non-critical services, ensuring system stability. We'll delve into Aperture's innovative architecture, covering its control and data planes, and discuss how it employs token buckets, weighted fair queuing, and concurrency limiting to prioritize workloads effectively.
We will also share real-world results from implementing Aperture in cloud products, demonstrating its ability to protect multi-tenant databases from overloads through prioritized load shedding of gRPC and GraphQL traffic.
Join us on this journey as we unveil a powerful solution that addresses the limitations of current approaches, ensuring the reliability and resilience of your microservices-based applications.
Harjot Gill, FluxNinja, Inc.
Harjot Gill is Co-founder & CEO of FluxNinja, an early stage startup enabling reliability automation. He is a co-creator of the Aperture open source project and an active contributor in the open source community. Previously, he was Co-founder & CEO of observability startup Netsil, which was acquired by Nutanix. He holds advanced degrees in Computer Science & Networking Systems and has published several highly cited papers on declarative programming, mesh networks and scalable packet processing.
Hardik Shingala, FluxNinja, Inc.
Hardik Shingala is an IT professional with over 5 years of experience in the industry. He has worked on a variety of projects related to cloud computing, security, and finance, among other areas. He is skilled in multiple programming languages including Golang, Java, and Python, and has experience working with technologies such as Kubernetes and Docker. At FluxNinja, he specializes in backend development and DevOps.
Distributed Tracing: Adaptive and Telemetry-Based Approach for Effective Monitoring of Any Modern Application Stack
Susobhit Panigrahi, VMware
This talk discusses the current state of distributed tracing and how to use it intelligently for better observability and monitoring of a modern application stack, by capturing the useful traces from a heap of massive trace dumps. The solution eases this process and substantially reduces pain points for SRE, Dev and Ops teams. Come join us and share your thoughts! :)
Susobhit Panigrahi, VMware
Susobhit is a curious and inquisitive individual working as an SRE/Backend Developer at VMware with a knack for solving interesting problems with scalable and simple solutions. Always curious!
12:20 pm–1:40 pm
Lunch
Summit Room 1
1:40 pm–2:35 pm
Nicoll Room 2–3
Challenges of Managing Real-Time Financial Market Data Storage
Kiran Kasichayanula and Nishith Nedungadi, Bloomberg LP
Are you looking to manage the storage of real-time market data? Or maybe you are interested in knowing how Bloomberg manages the storage of 300 billion unique events per day. Allow us to take you on our journey from local disk to cloud storage. We will talk about our motivations for the transition and explain how and why we decided to invest in a custom data scraping and chunk streaming solution to publish data into the cloud. We will talk about lessons learnt in migrating a complex production environment to leverage cloud storage.
Kiran Kasichayanula, Bloomberg LP
Kiran is a System Reliability Engineer (SRE) in Bloomberg’s Feeds Engineering group. Kiran has worked on the stability of real-time market data pipelines through automation, process standardization, and continuous process improvement. Kiran loves to play table tennis.
Nishith Nedungadi, Bloomberg LP
Nishith is an SRE team leader in Bloomberg’s Feeds Engineering group. He has 20 years of experience building and managing real-time data streaming systems in the finance industry. Nishith is interested in finding solutions to scalability and reliability problems related to the market data space. He loves robotics and spends most of his free time mentoring FIRST teams.
Nicoll Room 1
Transformation Journey of E2E Customer Flow Testing to Proactive Synthetic Monitoring System
Ananth Jayaraman and Rex Pravin L, PayPal India Private Ltd.
An important cog in PayPal’s releases is the validation of End-to-End Customer Transaction Flows through automated test runs on the newly upgraded version before it is released live. The E2E customer flow tests are also run in each Availability Zone, to certify changes and maintenance on it before enabling external customer traffic. Hence, automated customer flow tests act as a Change Reliability Lever. We have evolved this release vetting system into a Synthetic monitoring capability, where PayPal customer flow tests act as synthetics that are externalised to run at regular intervals to ensure the flows are functioning, available, and responding within specified performance thresholds. Synthetic monitoring helps with incident prevention and reduces MTTD (Mean Time to Detect), thereby helping restore site health within minutes. It also facilitates proactive alerting and proactive communication of the impact to customers.
Ananth Jayaraman, PayPal
Ananth Jayaraman is a Member of Technical Staff within SRE Platforms at PayPal. He is a technologist who is passionate about automation and building platforms that are reliable and scalable. Ananth has built Auto-Triage and Auto-recovery, along with Synthetic Monitoring capability for PayPal.
Rex Pravin L, PayPal
Rex is a technology leader and SRE enthusiast. Rex has played various roles at PayPal, from Quality Engineering Automation and Merchant Technical Services to leading SRE Critical Issue Engineering; the common factor is customer focus and a passion for reliability and automation. Rex currently heads the SRE Platform Engineering team in APAC, building platforms to drive availability and reliability, especially on Change Reliability, Operability and Accountability tracking.
2:35 pm–3:05 pm
Break with Refreshments
Level 3, Foyer 4
3:05 pm–3:50 pm
Nicoll Room 2–3
The Only Constant Is Change: Lessons from a 25 Year SRE Career
Andrew Ryan, Meta
Twenty-five years ago, when I started as a sysadmin and attended my first USENIX conference (LISA 1998), "SRE" didn't even exist as a field, and the compute infrastructures we supported were far smaller and more localized. Now, SRE is an industry-standard job across the world, and SREs manage massive and extraordinarily capable cloud environments distributed around the globe.
But the pace of technological change is not slowing down: with rapid development in new technologies such as Machine Learning and Large Language Models (e.g. ChatGPT), we can ask ourselves: will SRE still exist 25 years from now, and if so, what will those jobs be like, and what skills will they require? We cannot know the future, but this talk will concentrate on the things that we can know: job and career skills that have served the author well over the years, and cannot be easily -- if ever -- replicated by large volumes of compute power or any foreseeable artificial intelligence.
Using examples from the author's career through the industry, starting from maintaining a handful of systems in small organizations, all the way to the highest engineering ranks of one of the world's largest tech companies, we will discuss what has worked well, what career options are available, and how to keep your job skills relevant in a world that is constantly evolving.
Andrew Ryan, Meta
Over the course of his career, Andrew has been a system administrator, software developer, manager, and an SRE. For the last 14 years he has been at Meta as a Production Engineer, working on large scale infrastructures in areas such as Big Data, CDNs, and Storage. He has also been heavily involved with building successful large-scale programs for hiring and developing engineers at Meta, including intern and pre-internship programs. Outside of work, you can often find him creating tie dye shirts and other “wearable art.”