Are We Really Engineers?

Hillel Wayne; Stephane Dudzinski; Abbas Soltanian; Joan O'Callaghan; Avleen Vig

All the times listed below are in Irish Standard Time.

New in 2024! The Discussion Track is a place for attendees and experienced session hosts to discuss challenges and problems they have experienced and the solutions that have worked for them. The format of each session is decided by the session co-hosts, who may run it as an AMA, an unconference, or simply as a group discussion. See complete details about the Discussion Track along with the rest of the conference program below.

Attendee Files

SREcon24 Europe/Middle East/Africa Attendee List (PDF)

Monday, 28 October

17:00–19:00

Badge Pickup

Ground Floor Foyer

18:00–19:00

Welcome Get-Together

Liffey Hall 1

Whether this is your first time at SREcon or your tenth, enjoy this opportunity to meet your fellow attendees over snacks and beverages before the conference program begins.

Tuesday, 29 October

07:30–17:00

Badge Pickup

Ground Floor Foyer

07:30–08:45

Morning Coffee and Tea

The Forum

08:45–09:00

Opening Remarks

The Liffey

Program Co-Chairs: Effie Mouzeli, Wikimedia Foundation, and Murali Suriar, Snowflake

09:00–10:30

Opening Plenary Session

The Liffey

Dude, You Forgot the Feedback: How Your Open Loop Control Planes Are Causing Outages

Tuesday, 09:00–09:45 GMT

Laura de Vesine, Datadog, Inc.

Available Media

It's a strong principle of good UX design that users should get feedback about the results of their actions, to help prevent errors. Experienced SREs know to build in additional observability to systems to watch our systems change as we mutate them, but these are typically out-of-band and require a conscious, deliberate action to observe -- so getting good feedback into our actions requires constant vigilance and training of new users. What if we instead built control planes that tell us exactly what we've done, and what effect that is having? This talk explores various patterns of "fire and forget" control planes in production systems, how each one contributes to outages, and some simple solutions to build better tools for operations.

Laura de Vesine is a 20+ year software industry veteran. She has spent the last 8 years in SRE working in incident analysis and prevention, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also has a PhD in computer science, but mostly her cats nap on her diploma.

You Depend on Time, This Is How It Works and You Won’t Believe It

Tuesday, 09:45–10:30 GMT

Philip Rowlands, Jane Street

Available Media

This is a talk about calendars, clocks, and computers. We’ll look at the metrology of the second, from candles to atoms, and consider how your phone always seems to know the right time.

If you’ve ever wondered “why is today Thursday?” or “how was the Gregorian calendar adopted?”, then come and learn the mistakes to avoid the next time you are the Pope.

If you’ve ever wondered “why do these two clocks disagree?”, then come and learn about the challenges of finding the elusive perfect tick, and why it’s not at the top of Mount Everest.

And if you’ve ever wondered how calendars and clocks work together in modern computer systems, then come and learn about protocols and APIs for keeping clocks reliable and accurate.

Philip Rowlands has been an SRE since before he really understood what it meant. He has worked over the years on automated telephony, Google Production SRE, Mainframe Linux, and more recently for various financial firms, all of which had timekeeping challenges.

10:30–11:00

Coffee and Tea Break

The Forum

11:00–12:30

Track 1

The Liffey A

SRE Saga: The Song of Heroes and Villains

Tuesday, 11:00–11:40 GMT

Daria Barteneva, Microsoft Azure

Available Media

SRE teams require a balance of technical and soft skills, creativity, and teamwork to be successful. Drawing parallels between the roles, challenges, and dynamics of a Dungeons and Dragons party and an SRE team will help us explore the SRE journey from team inception to developing the ideal makeup in terms of tenure/seniority and skillset, and aligning it with the context the SRE team could be part of.

We will share practical examples that help SRE teams build resiliency and collaborate effectively while dealing with challenges. We will also explore different mechanisms that can channel "superhero" energy to make the team stronger and nurture talent, helping the team keep the balance of distributed knowledge and accountability.

In this talk we will discuss:

Examples of functional SRE team setups
Common challenges an SRE team may encounter
Developing early-in-career SREs
Dealing with change and building resilience
Identifying red flags and avoiding long-term problems

Daria is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing organisational culture, processes, and platforms to improve service reliability and on-call experience. She has spoken at conferences on various aspects of reliability and human factors that play a key role in engineering practices, and has written for O'Reilly. Daria is originally from Moscow, Russia, has spent 20 years in Portugal and 10 years in Ireland, and now lives in the Pacific Northwest.

The Frontiers of Reliability Engineering

Tuesday, 11:50–12:30 GMT

Heinrich Hartmann, Zalando SE

Available Media

We take the 10th anniversary of SREcon as an occasion to reflect on the past decade of advancements in Reliability Engineering and provide an overview of the frontiers we are facing today. Within Zalando we followed major industry trends: outsourcing hardware provisioning to AWS, packaging applications into Docker images, fully automating deployments (CI/CD), and implementing Distributed Tracing for Microservice Observability. Despite these advances, many challenges remain in building reliable, observable software systems, and new areas have arisen that require new methods and tools. In the talk we provide a number of conceptual views that help to map out the larger Reliability Engineering landscape and zoom in on 3 specific frontiers that we are actively investing in at Zalando: (1) Data Operations and Monitoring Event-Based Systems, (2) Mobile Observability, and (3) Effective Management Practices for Reliability.

Heinrich Hartmann is a seasoned expert with a decade of experience in Reliability Engineering. Currently, he serves as the Senior Principal SRE at Zalando, a leading European e-commerce company, where he oversees company-wide reliability practices. Before joining Zalando, Heinrich was the Chief Data Scientist at the Monitoring Platform Circonus, where he managed the analytical product offerings and pioneered histogram methods for latency monitoring.

Heinrich is a frequent speaker at industry conferences and is best known at SREcon for his regular "Statistics for Engineers" masterclass.

Track 2

The Liffey B

I Can OIDC You Clearly Now: How We Made Static Credentials a Thing of the Past

Tuesday, 11:00–11:40 GMT

Iain Lane and Dimitris Sotirakis, Grafana Labs

Available Media

At Grafana Labs, we tackled a thorny problem: managing secrets in an open-source CI/CD pipeline. Our journey from static secrets to OIDC-based access wasn't just about better security—it was about empowering our engineers. We'll walk you through how we leveraged OIDC and GitHub Actions to create a "secretless" system for accessing cloud resources, complete with shared jobs and abstractions that make secure access simple. But it wasn't all smooth sailing. We'll share war stories, including a security hiccup that taught us valuable lessons. If you're drowning in a sea of secrets or just want to sleep better at night, come and learn how we boosted security while cutting operational headaches. You'll walk away with practical strategies for implementing OIDC-based access that'll make your engineers happy and your security team even happier.

Iain is a senior software engineer at Grafana Labs. A member of the Platform team, his focus is on maintaining the infrastructure - Kubernetes clusters - which runs Grafana Cloud, and helping build tools and processes for engineers to deploy their software into this environment with maximum confidence.

Dimitris is a Senior Software Engineer with a background in Backend, DevOps, Release and Platform Engineering. Specializing in CI/CD architecture, he has spent most of his career tackling the challenges of delivering software, tools and frameworks with quality. Currently he’s a member of the Platform Productivity team at Grafana Labs.

OMG WTF SSO: A Beginner’s Guide to Single Sign-On (Mis)configuration

Tuesday, 11:50–12:30 GMT

Adina Bogert-O'Brien

Available Media

SSO protocols are just ways for an identity provider to share information about an authenticated identity with another service. Me having a way to tell my vendor “yeah, that’s Bob” doesn’t tell me what the vendor does with this information, or if the vendor always asks me who’s coming in the door. A bad SSO implementation can make you think you’re safer, while hiding all the new and fun things that have gone wrong. To get the most out of implementing SSO, I need to know what I’m trying to accomplish and what steps I need to follow to get there. To illustrate why SSO needs to be set up carefully, for each of the things you need to do right, I’ll give you some fun examples of creative ways you and your vendor can do this wrong. We all learn from failure, right???

I am incessantly curious, work in renewable energy, and sometimes find vulnerabilities when I’m bored. I co-founded a hackerspace over a decade ago but have only just accepted that security is more than a hobby. At work, I’m a business architect with security leanings working in knowledge management for a major renewable energy company.

Connect:

Mastodon

Track 3

Liffey Hall 2

Workshop: Loadshedding and Isolation Using Envoy Proxy

Tuesday, 11:00–15:30 GMT

Laura Nolan; Niall Murphy, Stanza

Effective load management is a core aspect of the SRE role. In this workshop, participants will be introduced to a number of Envoy Proxy features that are used for loadshedding and isolation, such as circuit breaking, adaptive concurrency, and ratelimiting. As part of the practical element of the workshop, participants will interact with Envoy configurations and status/control pages and endpoints, as well as Envoy’s telemetry.

Prerequisites: please arrive with a laptop with docker-compose installed. Information on setting up docker-compose is available at https://docs.docker.com/compose/install/

Laura Nolan has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know, and is currently completing her MSc in Human Factors and Systems Safety at Lund University. Laura is a member of the USENIX board of directors and a long-time SREcon volunteer. She lives in rural Ireland in a small village full of medieval ruins.

Connect:

X

Niall is the CEO of Stanza Systems, has occupied various engineering and leadership roles in Microsoft, Google, and Amazon, and is the instigator of the best-selling & prize-winning Site Reliability Engineering, which he hopes at some stage to live down. His most recent book is Reliable Machine Learning, with Todd Underwood and many others.

Connect:

X

Discussion Track

Liffey Hall 1

Managing Cost

Tuesday, 11:00–12:30 GMT

John Looney, Reddit, and James Beal

This session is an opportunity for people to come together and discuss managing cost, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in managing cost.

John is a platform engineer who helps senior engineers tune their applications to cost less, and makes Kubernetes cost less to run. Both projects required making a promise to product teams: that the compute platform will be reliable enough that they don’t need to pad out resources to ensure a quality experience for hundreds of millions of Redditors. He’s a long-time speaker at SREcon, on topics ranging from distributed systems to how data privacy law impacts SREs.

James started playing with computers with the ZX81, learned C for his A Levels, and has degrees in computer science and parallel and distributed systems. He started using Linux with MCC Interim Linux and later moved to other distributions. He started volunteering at the OTW 14 years ago when the organization only had two servers (it now has three racks).

12:30–14:00

Luncheon

The Forum

Sponsored by Cortex

14:00–15:30

Track 1

The Liffey A

Sailing the Database Seas: Applying SRE Principles at Scale

Tuesday, 14:00–14:40 GMT

Ioannis Androulidakis and Martin Alderete, Booking.com

Available Media

In this talk we will demonstrate how we apply core SRE principles in the field of Database Engineering. More specifically, we will talk about the challenges of operating large-scale database systems in multiple cloud environments and how adopting best SRE practices dramatically improved our daily workflows and operations.

We will share insights and concrete use cases around the following topics: Monitoring Distributed Systems, Eliminating Toil and Postmortem Culture.

This talk will equip attendees with ideas and guidelines to better understand and efficiently operate their database systems such as choosing the right SLIs and SLOs, automating capacity planning and embracing a postmortem culture after outages.

Ioannis Androulidakis is a Site Reliability Engineer with a strong background and multiple years of experience in Operating Systems, Observability Tools and Cloud Platforms. He is passionate about OSS technologies and has contributed to multiple open-source projects over the years.

Ioannis holds a diploma in Electrical and Computer Engineering from the National Technical University of Athens, Greece. In 2017 he was accepted for a full-time internship at the IT department of CERN in Geneva, Switzerland. Then, he worked for different companies as a Software Engineer and expanded his knowledge in virtualization, distributed systems and cloud-native storage.

Recently he joined Booking.com's Database Engineering team as a Site Reliability Engineer in Amsterdam, Netherlands, where he primarily focuses on the reliability and performance of large-scale MySQL clusters.

Connect:

X

Martin Alderete is a Principal Site Reliability Engineer with a long track record in Engineering, Distributed Systems and System Level Programming in both academia, where he worked as a teaching assistant after getting his degree, and industry, where he led different teams building complex systems at scale.

He is passionate about Open Source and new technologies, an active contributor to open-source projects and part of different technical groups.

Before joining Booking.com he worked in multiple industries including space where he worked as a Satellite Reliability Engineer building systems (and bugs!) to operate fleets of satellites.

He is based in Amsterdam but originally from beautiful Patagonia, Argentina.

Connect:

X

Survivor: MySQL Island – Outwit, Outplay, Outlast Metadata Locking Challenges

Tuesday, 14:45–15:05 GMT

Julia Jablonska, Capsule CRM

Available Media

Think you understand MySQL metadata locks? Join this interactive session to test your knowledge and take a deep dive into the intricacies of MySQL's locking mechanisms.

We'll explore real-world scenarios, such as creating tables with foreign key constraints and adding indexes, to see how metadata locks can impact performance and stability. Through live voting you'll gain insights into what's happening behind the scenes and learn practical tips for managing database migrations.

As an Infrastructure Engineer at Capsule CRM, Julia is responsible for keeping Capsule secure, fast and reliable for thousands of our business customers around the globe.

Fixing Your Noisy Pager in 500 Easy Steps

Tuesday, 15:10–15:30 GMT

Chris Sinjakli, PlanetScale

Available Media

You're not sure when it happened, but your pager suddenly seems noisy. You've started dreading your on-call shifts before they begin. You breathe a sigh of relief every time you sleep without interruption. Sound familiar?

Noisy on-call rotas sneak up on us one page at a time - an edge case in a new feature, an alert with too many false positives, processes that get stuck and need restarting. Each of these is easy to tolerate alone, but they quickly add up, leaving you swamped in alert noise and tired from missed sleep.

In this talk we'll explore techniques for digging ourselves out of the hole. We'll look at how to demonstrate the scale of the issue to our colleagues, what to do when the list of problems seems insurmountable, and how to get started with automated remediation in a low-risk way - I promise it's less scary than it sounds.

Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems.

All his programs are made from organic, hand-picked, artisanal keypresses.

Connect:

Mastodon

Track 2

The Liffey B

Achieving Excellence: SLO Thresholds That Transform Service Quality

Tuesday, 14:00–14:40 GMT

Thiara Ortiz, Netflix

Available Media

At Netflix, ensuring exceptional quality for our streaming platform is crucial. Every time a Netflix member sits down, reclines in their chair, and turns on their TV, it's a moment of truth. It's our opportunity to deliver a spectacular service with amazing quality of experience. Misses, errors, or high latency—whether due to ISP configuration changes, code deployment, or catastrophic fallback—impact how our service is perceived.

In this talk, I'll share methods for defining thresholds for SLOs, ranging from intuition and industry best practices to advanced techniques like A/B experimentation. At Netflix, properly defining SLOs allows us to ensure industry-leading quality of experience for our members.

Thiara is a Staff CDN Reliability Engineer at Netflix. Over the last four years, Thiara has been working on Open Connect, improving the resilience of the Netflix service for members around the world. Most recently, Thiara has been heavily involved with the introduction of Cloud Gaming on the Netflix platform. This talk was inspired by the need to define SLOs for an emerging service.

Selective Reliability Engineering: There Is No Single Source of Truth

Tuesday, 14:45–15:05 GMT

Elise Burke, Datadog

Available Media

As engineers we design distributed architectures, define project scopes, and ensure that we have a single "source of truth". But what, exactly, do we mean by the phrase? Do we really have only one source of truth - and for that matter, how do we decide what it is?

We'll look at some well-known ambiguities in system design and data modeling and then consider more philosophical questions about truth, the sources of truth we accept, and why this ambiguity matters.

Elise's sixteen-year career as a software and site reliability engineer includes supporting Google's internal distributed storage systems and Datadog's organization-wide production practices. Her interests include exploring the interconnectedness of both technology and the people behind it, building overly complex clusters of all kinds at home, and occasionally advocating for spaces or tabs (depending on context). She's excited by sharing her knowledge and experience and encouraging others to learn and grow their abilities in their professional and personal lives.

Connect:

sre.fyi

Why You’re (Probably) Doing Service Catalogs Wrong

Tuesday, 15:10–15:30 GMT

Lisa Karlin Curtis, incident.io

Available Media

Service catalogs promise a lot of things: powerful automations, insights into your technology estate.

But over the last few years, many of us have learned that setting up and maintaining a service catalog is really hard.

Building out a catalog from a standing start can take months, or even years. Too many people get stuck in a chicken-and-egg situation, where you can’t deliver value because you don’t have the data in your catalog, and you can’t convince anyone to spend time helping you because the catalog doesn’t do anything yet.

But there is another way...

Lisa started out as a consultant working with HMRC and then smart meters, before accidentally becoming a developer. She was a founding engineer at incident.io, building tooling to help your whole organization manage incidents better. She loves building stuff, but is also really interested in how people interact with each other in a work environment - particularly in software engineering. Having seen the 'old way' at Accenture (large-scale waterfall projects), she's now looking at taking the lessons from that environment to the start-up scene.

Connect:

X

Track 3

Liffey Hall 2

Workshop: Loadshedding and Isolation Using Envoy Proxy

Tuesday, 11:00–15:30 GMT

Laura Nolan; Niall Murphy, Stanza

Effective load management is a core aspect of the SRE role. In this workshop, participants will be introduced to a number of Envoy Proxy features that are used for loadshedding and isolation, such as circuit breaking, adaptive concurrency, and ratelimiting. As part of the practical element of the workshop, participants will interact with Envoy configurations and status/control pages and endpoints, as well as Envoy’s telemetry.

Prerequisites: please arrive with a laptop with docker-compose installed. Information on setting up docker-compose is available at https://docs.docker.com/compose/install/

Laura Nolan has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know, and is currently completing her MSc in Human Factors and Systems Safety at Lund University. Laura is a member of the USENIX board of directors and a long-time SREcon volunteer. She lives in rural Ireland in a small village full of medieval ruins.

Connect:

X

Niall is the CEO of Stanza Systems, has occupied various engineering and leadership roles in Microsoft, Google, and Amazon, and is the instigator of the best-selling & prize-winning Site Reliability Engineering, which he hopes at some stage to live down. His most recent book is Reliable Machine Learning, with Todd Underwood and many others.

Connect:

X

Discussion Track

Liffey Hall 1

eBPF

Tuesday, 14:00–15:30 GMT

Cameron Howes, Goldman Sachs, and Daniel Hodges

This session is an opportunity for people to come together and discuss eBPF, facilitated by our knowledgeable hosts. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in eBPF.

Cameron Howes is an Analyst in the Market Data SRE team at Goldman Sachs, specialising in low-level development and performance instrumentation. When he's not ferociously avoiding a memory allocation, or reading about the latest CVEs, Cameron can be found writing black-box probers and Prometheus exporters for the ticker plant.

Daniel Hodges is a software engineer who works at Meta on profiling and scheduling. He has worked as a site reliability engineer and production engineer, and has experience with observability, profiling and production deployments.

15:30–16:00

Coffee and Tea Break

The Forum

16:00–17:30

Track 1

The Liffey A

Exploring the Unintended Consequences of Automation in Software

Tuesday, 16:00–16:40 GMT

Courtney Nash, The VOID

Available Media

Automation is ubiquitous—it is entwined in our daily lives in ways that we aren’t always aware of. It has been woven into all aspects of modern software by being presented as a utopian vision: a way of making human lives easier, doing repetitive tasks faster and with fewer errors, freeing us fallible humans up to do other ostensibly more important work. But anyone who has worked directly with automated systems knows that we are still very far from such a dreamy reality.

This talk delves into detailed research about how automation is involved in software incidents. My focus on this area stems from the growing portrayal of automation as a panacea for various software incident issues, despite its limitations in effectively addressing these challenges, such as reliable detection and resolution of software issues or analyzing and disseminating learnings from these incidents back into the organization and its products and services.

Drawn directly from public incident reports (collected in the VOID), this research revealed multiple, often competing, roles that automation can play over the course of an incident, and most importantly underscored how important humans are in understanding, troubleshooting, and recovering from automated software issues. If you're struggling to convey the reality behind the hype of automation and AI to others on your team or at your organization, this is the talk for you.

Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Prowler, Verica, Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.

Rock around the Clock (Synchronization): Improve Performance with High Precision Time!

Tuesday, 16:45–17:05 GMT

Lerna Ekmekcioglu, Clockwork Systems

Available Media

Is the app slow or the network lagging? When it comes to latency in distributed systems, it can be hard to identify where exactly the issue is. As businesses increasingly adopt diverse deployment environments (on-premises, cloud, or hybrid), the complexity grows, obscuring visibility into system health. Join me to hear why clock synchronization is key for identifying the true culprit when latency is due to contention in the network. I’ll demo how network contention impacts tail latencies, followed by an overview of clock synchronization protocols to date, their pros and cons, and best practices in disciplining clocks, as well as recent algorithms from Stanford Research. With high-precision clock synchronization at scale, we gain back visibility into useful one-way delay metrics, which act as an early signal for network congestion that helps us prevent impact to response times for our end users!

Lerna is a Senior Solutions Engineer at Clockwork Systems where she helps customers meet their performance goals with software solutions built on Clockwork.io’s foundational research. Prior to this, she was a Senior Solutions Architect serving Global Financial Services customers at AWS for 3 years. Before that, Lerna spent 17 years as an infrastructure engineer in large financial services companies working on authentication systems, distributed caching, and multi-region deployments using IaC and CI/CD, to name a few. In her spare time, she enjoys hiking, sightseeing and backyard astronomy.

Mnemonic Rules for Eponymous Laws or: There’s a Law for That!

Tuesday, 17:10–17:30 GMT

Peter Burkholder, U.S. Government

Available Media

As SREs, referencing named laws like Brooks’s Law, Gall’s Law, or the Jevons Paradox can help strengthen our arguments. But remembering which law applies when is challenging.

In this talk, I'll highlight the most useful tech and behavioral science laws for SRE work, offer mnemonic tips for recalling them, and share real-world examples. We'll finish with a quick quiz to ensure you're ready to apply these concepts in your role.

Geophysicist turned SRE. Jobs include: US Gov (18F/cloud.gov), GovReady, Chef, AARP, NCBI, NCAR, Univ. of Washington. In my own time, I make pizza, sing, and play guitar (not simultaneously).

Connect:

Mastodon

Track 2

The Liffey B

SRE Stakeholders: A Spotter’s Guide

Tuesday, 16:00–16:40 GMT

Dave O'Connor

Available Media

For every SRE or SRE-adjacent team in any organisation, there are many kinds of stakeholders: people who care (or don't care!) about how your team operates and the outcomes of its work. They differ massively in how they view your team, and in how they, in turn, should be viewed and managed.

In a timeline that doesn't contain a canonical book setting out what SRE is here for and how it achieves that, the sad and annoying answer is that "it depends". Because of this, we need to get good (or remain good) at stakeholder management and communications about why we're here, and what we do.

While primarily useful to SRE leadership, the kinds of stakeholders you run into can be useful to know for any SRE. Learn to spot the different stakeholders in your life, what they (generally) care about, and how you can help reduce misunderstandings and tension, no matter where you're sitting.

Dave is an SRE Leadership practitioner, Advisor and Coach based in Dublin. He's been working in SRE and SRE-adjacent organisations for over 20 years, primarily as an SRE Lead at Google from 2004 to 2021. Since then, he has spent time leading SRE, Security and Infrastructure teams at Elastic and Twilio.

He's currently a Consultant/Advisor for busy teams at Co-Servant Systems, and a coach specialising in tech leadership at all levels.

Connect:

X

Mastodon

Panel Discussion: Is Reliability a Luxury Good?

Tuesday, 16:50–17:30 GMT

Moderator: Emil Stolarsky, Increase
Panelists: Andrew Ellam; Niall Murphy, Stanza; Joan O'Callaghan, Udemy; Avleen Vig

Available Media

Emil is an engineer at Increase where he works on building modern banking infrastructure. Before that, he was at companies such as Wave Mobile Money, DigitalOcean, and Shopify, working on everything from building data centres in Sub-Saharan Africa to caching & performance optimizations in the cloud. In addition to speaking at & organising a number of conferences, he was a contributor to Seeking SRE and co-authored 97 Things Every SRE Should Know.

Andrew is a software engineer turned Technical Program Manager turned manager. He ran his own businesses for a while, worked for startups and scaleups and for big tech companies.

Niall is the CEO of Stanza Systems, has occupied various engineering and leadership roles in Microsoft, Google, and Amazon, and is the instigator of the best-selling & prize-winning Site Reliability Engineering, which he hopes at some stage to live down. His most recent book is Reliable Machine Learning, with Todd Underwood and many others.

Connect:

X

Joan O'Callaghan is a Monitoring and Observability Director at Udemy. She has worked in SRE and Incident Management and M&O (in one form or another), for many, many years. She likes to host and write blameless incident reviews and take long walks on the beach where she has imaginary arguments with people that don't like resilience as much as she does. She is always very happy when she meets people more paranoid than her.

Avleen is one of Twilio’s Architects for SRE. Over his luminous 20+ year career he has shone a light on the importance of making reliability a core part of the work done by all software engineering teams. When he isn’t working on improving systems designs and reviewing code, you can often find him outside with a telescope and a hot cup of tea.

Track 3

Liffey Hall 2

Enhancing Elasticsearch Performance: Innovative Reindexing Strategies Using Dedicated Nodes and KEDA Autoscalers

Tuesday, 16:00–16:40 GMT

Leila Vayghan, Shopify

Available Media

This talk is about enhancing the search infrastructure of Shopify, a large-scale ecommerce platform that supports over 3 million merchants and handles more than two petabytes of data.

This talk explains how we leverage Kubernetes on Google Cloud Platform to ensure high availability and performance, crucial for maintaining our platform's robust search functionality. It will also elaborate on our innovative approach using dedicated reindexing nodes within existing clusters, which significantly improves indexing and reindexing performance while cutting infrastructure costs. We will explore the application of Kubernetes Event-Driven Autoscaling (KEDA) to dynamically manage resource allocation, enhancing operational efficiency and reducing on-call fatigue. This strategy not only supports seamless user experiences but also boosts Gross Merchandise Value (GMV) and revenue through improved system responsiveness.

This presentation is ideal for those involved in managing large-scale data systems or interested in advanced Elasticsearch optimizations.

Leila is an engineer at Shopify, where she spends her days enabling millions of merchants to grow by making sure buyers are able to search and find their products. She does this by running a large-scale search infrastructure on Kubernetes in many regions of the world. Leila has completed her master’s degree on the availability of stateful applications running on Kubernetes and has presented her work at many conferences.

Discussion Track

Liffey Hall 1

Service Level Objectives

Tuesday, 16:00–17:30 GMT

Alex Hidalgo, Nobl9, and Heinrich Hartmann, Zalando SE

This session is an opportunity for people to come together and discuss SLOs, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in SLOs.

Alex Hidalgo is the Field CTO at Nobl9 and author of "Implementing Service Level Objectives." During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching Premier League football. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.

Heinrich Hartmann is a seasoned expert with a decade of experience in Reliability Engineering. Currently, he serves as the Senior Principal SRE at Zalando, a leading European e-commerce company, where he oversees company-wide reliability practices. Before joining Zalando, Heinrich was the Chief Data Scientist at the Monitoring Platform Circonus, where he managed the analytical product offerings and pioneered histogram methods for latency monitoring.

Heinrich is a frequent speaker at industry conferences and is best known at SREcon for his regular "Statistics for Engineers" masterclass.

17:30–19:30

Conference Reception at the Sponsor Showcase

The Forum

Enjoy hors d'oeuvres and beverages while networking with other attendees and visiting the exhibits as we close out the first day of sessions!

Wednesday, 30 October

08:00–17:00

Badge Pickup

Ground Floor Foyer

08:00–09:00

Morning Coffee and Tea

The Forum

09:00–10:30

Opening Plenary Session

The Liffey

Lessons from Unix History

Wednesday, 09:00–09:45 GMT

Diomidis Spinellis, AUEB & TU Delft

Available Media

Explore the timeless lessons of Unix’s evolution in a talk that examines its significant influence on modern computing. For over fifty years, Unix has been a cornerstone in shaping software technologies and development practices. This session will guide you through a historical narrative, illustrating key innovations from Unix's First Research Edition to modern FreeBSD releases, such as prototyping, portability, modular coding, and the importance of developer efficiency over machine time.

Discover the architectural philosophies embedded in Unix, such as aggressive partitioning, composition, layering, and convention-based extensibility, as well as the strategic use of pipelines and filters for program composition. Based on extensive research and case studies, this talk is not just a technical retrospective but also a reminder of the enduring principles that continue to inform effective system and software development today. Perfect for developers, architects, and tech enthusiasts eager to enhance their programming ethos with proven, age-old wisdom.

Diomidis Spinellis is a Professor of Software Engineering at AUEB and a Professor of Software Analytics in the Department of Software Technology at TU Delft. In previous lives he has served the Greek Government as Secretary General for Information Systems and has worked (briefly) as a senior SRE for Google. He is the developer of the ai-cli-lib AI command-line copilot, git-issue, the Unix history repository, the CScout refactoring browser for C, dgsh, and other open-source software packages, libraries, and tools. His most recent book is “Effective Debugging: 66 Specific Ways to Debug Software and Systems”.

Connect:

X

Mastodon

Treat Your Code as a Crime Scene

Wednesday, 09:45–10:30 GMT

Adam Tornhill, CodeScene

Available Media

We'll never be able to understand a software system from a single snapshot of the code. Instead we need to understand how the code evolved and how the people who work on it are organized. We also need strategies for finding bottlenecks and technical debt impairing our productivity, as well as uncovering hidden dependencies between code and people. Where do you find such strategies if not within the field of criminal psychology?

This session starts with a crash course in offender profiling before we quickly move on to adapt those principles to software development. You'll learn how easily obtained version-control data lets you uncover the behavior and patterns of the development organization. This language-neutral approach lets you prioritize the parts of your system that benefit the most from improvements so that you can balance short- and long-term goals guided by data. The presentation will change how you view code. Promise.

Adam Tornhill is a programmer who combines degrees in engineering and psychology. He’s the founder of CodeScene, where he designs tools for software analysis. He’s also the author of the best-selling Your Code as a Crime Scene, and three more technical books. Adam’s other interests include music, retro gaming, and martial arts.

10:30–11:00

Coffee and Tea Break

The Forum

11:00–12:30

Track 1

The Liffey A

Finding the Capacity to Grieve Once More

Wednesday, 11:00–11:40 GMT

Alexandros Kosiaris, Wikimedia Foundation

Available Media

At Wikipedia, we handle unpredictable traffic spikes, especially during notable deaths, which can cause severe outages. Despite believing we had mitigated this issue years ago, a major outage occurred in 2020 due to a notable death and a DDoS attack, leading to the realization that our platform needed further improvements. Over the years, we conducted investigations and implemented numerous fixes, educating new SREs about our platform's unique constraints. Two years ago, following the death of Elizabeth II, our system successfully handled unprecedented traffic without outages, demonstrating our platform's resilience. This story highlights the infrastructure improvements that allowed us to manage traffic surges and the emotional journey of regaining the capacity to properly grieve significant losses.

We heavily rely on open source, and our code is public, making our solutions accessible to everyone.

A Linux sysadmin, turned FreeBSD sysadmin, turned Linux sysadmin, turned systems engineer (somewhere along that path there’s a DevOps hat as well), turned SRE, Alexandros has been in the space since 1999, starting as a hobbyist, then a professional. Currently working with the Wikimedia Foundation, he has pushed forward for more virtualization, better orchestrated microservices and platform developments for their execution.

Connect:

Mastodon

Incident Groundhog Day

Wednesday, 11:50–12:30 GMT

Hamed Silatani, Uptime Labs

Available Media

Learning how to respond effectively to incidents is hard. One of the reasons is that we never see the same incident twice. While we can learn vital lessons during and after an incident, we can’t hop into a time machine, and apply these lessons to the same incident to discover their impact. What if we could experience the same incident over and over again? What might we learn? This talk describes a ‘staged world’ experiment in which 20 incident managers separately experienced the same simulated incident affecting a fictitious e-commerce company. We discuss what we noticed that differentiated some incident responders from others, and some surprising things that we expected to see, but didn’t.

Hamed is co-founder and CEO of Uptime Labs, an incident learning & simulation platform. He has 20 years of experience in engineering leadership, reliability engineering, and IT operations. Having spent the majority of his career at the sharp end of incident response in financial services, he's looking to help all companies master the unexpected.

Track 2

The Liffey B

Anomaly Detection in Time Series from Scratch Using Statistical Analysis

Wednesday, 11:00–11:40 GMT

Ivan Shubin

Available Media

Implementing anomaly detection for time series can be challenging, with many techniques and tools available. But can you achieve effective results without AI or Machine Learning? In this talk, we will demonstrate how basic statistical methods can effectively detect anomalies in time series data. We'll show you how to use Grafana to visualize these anomalies on graphs and ensure past incidents do not impact future predictions. Additionally, we will explore building Grafana dashboards as code as part of the anomaly detection solution and adjusting the detection for various events.

Hi, my name is Ivan. I am a Senior Site Reliability Engineer at Booking.com. Before that I worked at TomTom and eBay. Throughout my career, I have explored various roles including Quality Assurance, Software Engineering, System Administration, and SRE. I have always been fascinated by the complexity of high-load and distributed systems and have a passion for understanding how everything works. In my spare time, I enjoy working on my open-source project, Schemio, which I use to build various interactive visualisations on SRE topics.

Generative AI: Beyond (Just) Hype

Wednesday, 11:50–12:30 GMT

Todd Underwood

Available Media

Generative AI is one of the most hyped technologies in most of our careers. While it is driving a complete transformation of priorities at some tech organizations, many engineers remain deeply skeptical about any practical uses of Generative AI.

The skepticism is warranted and the hype is (for now) exaggerated, but not completely without merit. These technologies are not entirely useless for the kind of work we do. In this talk I will highlight a few emerging use cases that sidestep some of the weaknesses of GenAI (hallucination, errors), and still manage to provide value, specifically for production engineering.

Todd Underwood recently led reliability for the Research Platform at OpenAI. Previously he was a Senior Engineering Director at Google leading ML capacity engineering in the office of the CFO at Alphabet. Before that, he founded and led ML Site Reliability Engineering and was the Site Lead for Google’s Pittsburgh office. He co-wrote "Reliable Machine Learning: Applying SRE Principles to ML in Production" (O’Reilly Press, 2022).

Track 3

Liffey Hall 2

From PIDs to Pods: The Life Cycle of an eBPF-Autoinstrumented Application

Wednesday, 11:00–11:40 GMT

Marc Tudurí, Grafana Labs

Available Media

eBPF lets you attach programs to the Linux kernel and inspect the memory of the kernel and of user programs at runtime. Join us in this session to discover how Grafana Beyla, our eBPF-based instrumentation tool, works and how it treats Kubernetes as a first-class citizen. We will describe how we match the low-level abstractions from eBPF with the Kubernetes metadata, allowing Kubernetes users to have out-of-the-box observability for their running applications.

Marc Tudurí is a Prometheus contributor, OpenTelemetry member and Software Engineer at Grafana.

Connect:

Mastodon

Scheduling at Scale: eBPF Schedulers with Sched_ext

Wednesday, 11:50–12:30 GMT

Daniel Hodges, Meta

Available Media

This talk will discuss how eBPF-based schedulers can be used to enhance application performance at scale. The presentation will begin by explaining the fundamental eBPF capabilities necessary for constructing schedulers, providing a foundation for understanding their design. Following this introduction, the talk will discuss schedulers and their design. Finally, it will share some practical lessons for deploying schedulers in production environments.

Daniel Hodges is a software engineer who works at Meta on profiling and scheduling. He has worked as a site reliability engineer and production engineer, and has experience with observability, profiling and production deployments.

Discussion Track

Liffey Hall 1

SRE in Small Orgs

Wednesday, 11:00–12:30 GMT

Emil Stolarsky, Increase, and Joan O’Callaghan, Udemy

This session is an opportunity for people to come together and discuss running SRE teams in small organisations, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are part of SRE teams in small organisations.

Emil is an engineer at Increase where he works on building modern banking infrastructure. Before that, he was at companies such as Wave Mobile Money, DigitalOcean, and Shopify, working on everything from building data centres in Sub-Saharan Africa to caching & performance optimizations in the cloud. In addition to speaking at & organising a number of conferences, he was a contributor to Seeking SRE and co-authored 97 Things Every SRE Should Know.

Joan O'Callaghan is a Monitoring and Observability Director at Udemy. She has worked in SRE and Incident Management and M&O (in one form or another), for many, many years. She likes to host and write blameless incident reviews and take long walks on the beach where she has imaginary arguments with people that don't like resilience as much as she does. She is always very happy when she meets people more paranoid than her.

12:30–14:00

Luncheon

The Forum

14:00–15:30

Track 1

The Liffey A

When Your SaaS Provider Goes out of Business – Lessons from an Averted Crisis

Wednesday, 14:00–14:40 GMT

Raphael Seebacher and Christof Gerber, Open Systems AG

Available Media

What do you do when your SaaS provider unexpectedly goes out of business?

It's the early days of 2023 when the provider of a critical component in our Web Proxy service announces that it has just gone out of business. With services used by 100 customers with 300K daily users across 3,500 locations worldwide at stake, we know that it is time for swift action.

Join us in this talk as Raphi, the crisis lead, and Christof, engineer on the Web Security team, recount their experience handling this crisis. We will take you from the dizziness of the initial shock to our first hour, first day, first week, and first month actions, detailing leadership, communication, and technical responses and the trade-off decisions we faced.

You'll leave this talk with another tale from production and practical ideas to make your own organisation better prepared for a similar unexpected crisis.

Raphael is a systems engineer who spent the last decade exploring the Engineer/Manager pendulum at Open Systems. He holds an MSc in electrical engineering and information technology and an MAS in Management, Technology and Economics, and is a captain in the Swiss Armed Forces. His interests include amateur radio, the outdoors, and tinkering with all kinds of hardware and software systems.

Christof is an engineer who develops, maintains, and operates Software as a Service to secure corporate web traffic worldwide. Working at the intersection of computer networks, IT security, and software engineering, he is passionate about building reliable systems for Linux servers and Kubernetes.

Configuration Languages Are the Bane of Our Existence

Wednesday, 14:45–15:05 GMT

Paul Komkoff

Available Media

It is probably a good idea to make it possible to change some constants in your program without recompiling it. So why does it then get incredibly hard to control these configurations? At which point does configuration become a program with no tests, written in an untyped language, which requires a lot of compute to evaluate and can't be checked in advance? Is it at all possible (and enough) to get rid of these languages and go back to ini files?

If you are like me, you want to know the answers to these questions, and this is what I'm going to talk about. Plus:

sendmail.cf was an early sign everybody ignored
if you use regular expressions for matching and selecting in your configuration, start writing a premortem
when your configuration is more complicated than your program, who is your program now?

Out of 33 years of working with computers and networks, Paul has spent 17 in SRE organizations. He believes that complexity needs to be actively managed and that, to find better ways to fix things, we need to explore the depths of failure.

Just Buy the Printer: Resilience in Action

Wednesday, 15:10–15:30 GMT

Cail Young, Octopus Deploy

Available Media

A retelling of a recent near-miss at Octopus Deploy involving code signing certificates, multiple teams responding to an incident, and everybody's favourite piece of security hardware: the humble printer. After the story, we'll reflect on what the story says about the resilience factors already in the organisation, and what the telling of the story itself might be able to do for resilience across organisations.

Cail has spent the last couple of decades working at the intersection of people and technology: in the performing arts, in the motion picture industry, and now in the field of software operations. He is fascinated by learning from incidents - large and small - and will gladly trade stories about them.

Track 2

The Liffey B

Noisy Neighbors, through Networking

Wednesday, 14:00–14:40 GMT

René Treffer and Ben Kochie, Reddit

Available Media

When operating multi-tenant environments, like in Kubernetes, you can have "noisy neighbors". Resources like CPU and network can have contention which can lead to service degradation. But the causes of contention are not always what you would think. In this talk we will look at some surprising instances of "noisy neighbors", how they unfolded, how we discovered them, and how we mitigated the effects.

René Treffer is an infrastructure software engineer at Reddit.

Ben Kochie is a principal software engineer at Reddit.

Taming Noisy Benchmark Results Using Change Point Detection

Wednesday, 14:45–15:05 GMT

Matt Fleming, Cloudflare

Available Media

Modern systems are inherently nondeterministic and that leads to noisy benchmark results. Change Point Detection has emerged as a helpful technique for detecting significant changes in performance results even when those results are noisy and unstable. This talk will explain how Change Point Detection works and the open source projects available for developers to use CPD with noisy benchmark results.

Matt is co-founder of Nyrkiö and a Systems Engineer at Cloudflare. He has spent over 15 years working on low-level, high-performance systems and was previously the maintainer for the Linux kernel EFI subsystem. He has co-authored papers on performance change detection and distributed systems testing and served on the ACM/SPEC ICPE program committee. Matt can often be found on Twitter, discussing topics such as software performance, benchmarking and statistics.

Connect:

X

Enabling Product Scalability through Load Testing

Wednesday, 15:10–15:30 GMT

Monica Baluna and Ehab Tawfik, Bloomberg

Available Media

One of Bloomberg's flagship products, Instant Bloomberg (IB), is used by financial professionals around the globe for instant messaging. This system is powered by a multitude of microservices, databases and UIs that interact through synchronous or asynchronous API calls and queueing mechanisms.

We recently released Forums in IB. This new form of group chat introduced exciting features. With our clients needing increasingly larger group chats, we took the opportunity to ask how to make sure the new system and the existing one can scale up with the extra load without affecting the existing user workflows.

This talk explores the different load testing strategies we adopted while enabling support for chats ten times larger than before, while also migrating existing group chats to become Forums. We will focus on two elements: (i) creating a realistic representation of production traffic in a test environment, and (ii) how to efficiently gather insightful metrics.

Monica Baluna is a software engineer at Bloomberg in London, where she has worked for the past six years. Her main interests include distributed systems, as well as building reliable software and robust APIs. She has had an opportunity to explore these interests, as her team manages a content sharing solution where performance and scalability are key features. Monica earned a bachelor's degree in computer science and engineering from Politehnica University of Bucharest.

Ehab Tawfik is a software engineer who loves problem solving, technology, and business. He works in Core Products Engineering at Bloomberg in London. He is passionate about back-end systems and distributed computing. Ehab earned a bachelor's degree in computer science and engineering from Nile University in Egypt.

Track 3

Liffey Hall 2

NVMe/TCP Makes iSCSI Look like Fortran

Wednesday, 14:00–14:40 GMT

Chris Engelbert, simplyblock GmbH

Available Media

For more than two decades, iSCSI was the go-to protocol standard for remote block storage over commodity network hardware, utilizing normal Ethernet networks, hence avoiding the need for specialist hardware, saving cost, and providing a much lower entry barrier than Fibre Channel or InfiniBand.

However, the underlying storage technologies made leaps during that time, and today iSCSI is often a bottleneck for high-performance storage deployments backed by SSDs or NVMe. Therefore, the NVM Express group defined the NVMe over Fabrics protocol family, with NVMe over TCP being at the forefront to replace iSCSI, while offering lower latency, higher throughput, and less protocol overhead.

Let’s dive into NVMe, NVMe over TCP, and how it’s superior to iSCSI, as well as the support landscape.

Christoph Engelbert is a developer at heart, with strong bonds to the open source world. As a seasoned speaker at international conferences, he loves to share his experience and ideas, especially in the areas of scalable system architectures and back-end technologies, as well as all things programming languages.

Connect:

X

Mastodon

Bluesky

The Silent Performance Killers: BIOS and Firmware Updates

Wednesday, 14:45–15:05 GMT

Darin E. Langone

Available Media

In the ever-changing landscape of CVEs, bug fixes, enhancements, etc., vendors are taking a more rigid stance when it comes to applying patches and security fixes that they have provided. If you are not careful and do as they say without implementing any pre- and post-patch testing and analysis, you open your hardware and systems up to potentially significant performance impact.

Darin Langone is a software engineer at Bloomberg. As a member of the Compute Platform engineering team, his focus is on performance testing and benchmarking servers before and after BIOS and firmware updates have been applied. Since joining Bloomberg 25 years ago, he has worked on a lot of different things. Darin holds a master's degree in forensic psychology from the John Jay College of Criminal Justice and a bachelor's degree in psychology & computer science from Queens College. In his spare time, Darin likes to hit a small ball around large tracts of land trying to get it into a hole in the ground.

How a Single API Endpoint Saved Us 3000 CPU

Wednesday, 15:10–15:30 GMT

Lasse Hels, Maersk

Available Media

How do you run a time series database exclusively on spot nodes? With great difficulty!

Grafana Mimir is the centrepiece of our observability platform at Maersk. For a long time, rollouts of Mimir's most crucial component would consistently trigger significant performance degradations in the platform. Getting to the root cause of the issue proved laborious and took us deep into the internals of Mimir.

Join us as we go through the issue postmortem and reflect on how to create consistency in a chaotic environment. The talk touches on topics such as CPU throttling, hash rings, compute utilisation analysis and metric series cardinality.

Lasse is a software engineer at Maersk. As a member of the telemetry team, he took part in building the Maersk Observability Platform, and now spends much of his time keeping it running. Outside of work, his interests include speedrunning, powerlifting, etymology, and camels.

Discussion Track

Liffey Hall 1

Monitoring and Alerting

Wednesday, 14:00–15:30 GMT

Daria Barteneva, Microsoft Azure, and Niall Murphy, Stanza

This session is an opportunity for people to come together and discuss monitoring and alerting, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in monitoring and alerting.

Daria is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing organisational culture, processes, and platforms to improve service reliability and on-call experience. She has spoken at conferences on various aspects of reliability and human factors that play a key role in engineering practices, and has written for O'Reilly. Originally from Moscow, Russia, Daria spent 20 years in Portugal and 10 years in Ireland, and now lives in the Pacific Northwest.

Niall is the CEO of Stanza Systems, has occupied various engineering and leadership roles in Microsoft, Google, and Amazon, and is the instigator of the best-selling & prize-winning Site Reliability Engineering, which he hopes at some stage to live down. His most recent book is Reliable Machine Learning, with Todd Underwood and many others.

Connect:

X

15:30–16:00

Coffee and Tea Break

The Forum

16:00–17:30

Track 1

The Liffey A

Managing the Risk of Software Supply Chain Attacks

Wednesday, 16:00–16:40 GMT

Mark Hahn, Qualys

Available Media

Open-source software (OSS) is flourishing and is used by at least 90% of companies. Modern applications are built on webs of open-source code, APIs, and third-party integrations.

Because of this, hackers are now compromising weak links in existing software supply chains. Software supply chain (SSC) threats include tampering with updates (tainted updates), compromised third-party libraries, vulnerabilities in open-source packages, malicious code or malware in packages, etc. Software supply chain attacks have increased by an average of 742% per year.

This talk covers ways to prevent software supply chain attacks and how to respond when the ecosystem has been tainted.

Mark Hahn is the Solutions Architect for Cloud and DevOps Security at Qualys. He uses DevSecOps and Site Reliability Engineering practices to ensure that software and applications are deployed with high velocity and with the utmost security. He shows clients how to build security into software using agile methods in a world of cloud-native distributed systems built for DevOps and rapid change.

When SRE and Security Teams Meet to Face a Crisis

Wednesday, 16:50–17:30 GMT

JR Aquino

Available Media

For SREs, Security is at the same time a priority and not a priority; prioritization highly depends on the environment, the size, and the goals of each org.

This talk aims to give SREs, through real-life examples, insights into:

What to expect (and how to be good neighbors) when they are called in to work with Security teams to manage a security incident
Inter-organizational unexpected challenges that might occur
What to keep an eye on in the future

Redfin | Rent - Head of Information Security
Former Microsoft and Citrix Security Leader
Created centralized SUDO for Fedora’s FreeIPA
FreeBSD port maintainer for Metasploit and UnrealIRCD
OpenBSD port maintainer for Nmap

Connect:

Mastodon

Track 2

The Liffey B

How to Host a (Very) Popular Website for 30 Altairian Dollars a Day

Wednesday, 16:00–16:40 GMT

James Beal

Available Media

For 15 years, the Archive of Our Own (AO3) has provided a safe haven for fanworks while refusing to implement paid accounts, sell user data, or restrict fans' creativity. We're completely donor-funded and volunteer-run and currently serve about 34 billion pages a year—using servers that we own in order to reduce the likelihood of deplatforming due to our commitment to creative freedom.

We know a thing or two about getting the most out of an Altairian dollar without compromising user privacy or free expression. Even if your project has different constraints, our approach might just help you stretch your project's budget.

James started playing with computers with the ZX81, learned C for his A Levels, and has degrees in computer science and parallel and distributed systems. He has been using Linux since the days of MCC Interim Linux, later moving to other distributions. He started volunteering at the OTW 14 years ago when the organization only had two servers (it now has three racks).

How Snowflake Migrated All Alerts and Dashboards to a Prometheus-Based Metrics System in 3 Months

Wednesday, 16:50–17:30 GMT

Carlos Mendizabal, Snowflake

Available Media

This talk goes over how Snowflake migrated its alerts and dashboards in 3 months, a migration that included rewriting all alerts and dashboards used for system monitoring. We'll go over the tooling that enabled us to complete this migration successfully, which included configuration-as-code through Jsonnet and a unit testing framework, and share some important takeaways from this effort.

Carlos Mendizabal is a software engineer at Snowflake. He is part of the Observability team and loves to build things (and to ensure they're well monitored!). Previously at Meta, he is passionate about meeting folks across the industry and keeping up with the latest and greatest in tech. Carlos lives in Seattle, Washington and is also a pilot in his free time.

Track 3

Liffey Hall 2

What If We Ask Linux to Do Cryptography for Us?

Wednesday, 16:00–16:40 GMT

Oxana Kharitonova, Cloudflare

Available Media

It's difficult to imagine the modern world without cryptography. We use cryptography to encrypt data before transmitting it over the Internet or storing it on a disk. But we don't think much about how it works; we just pick the most popular cryptographic user space library for our next application and let it do the work for us. What if it's not as secure as we hope? There is another way to do it: with the Linux kernel itself. It can encrypt and decrypt data in the same way as user space libraries do, but in a much more secure way. In this talk we will explore how to integrate this feature into user space applications written in Go and Rust. You don’t need to be a Linux kernel ninja to start using it.

Synthetic Monitoring and E2E Testing: 2 Sides of the Same Coin

Wednesday, 16:50–17:30 GMT

Carly Richmond, Elastic

Available Media

Despite the emergence of DevOps to unite development, support and SRE factions using common processes, we still face cultural and tooling challenges that create the Dev and SRE silos. Specifically, we often use different tools to achieve similar testing. Case in point: validating the user experience in production using Synthetic Monitoring and in development using E2E testing.

By joining forces around common tooling, we can use the same tool for both production monitoring and testing within CI. In this talk, I will discuss how Synthetic Monitoring and E2E Testing are two sides of the same coin. Furthermore, I shall show how production monitoring and development testing can be achieved using Playwright, GitHub Actions and Elastic Synthetics.

Discussion Track

Liffey Hall 1

Scaling Databases

Wednesday, 16:00–17:30 GMT

Chris Sinjakli, PlanetScale, and Martin Alderete, Booking.com

This session is an opportunity for people to come together and discuss the challenges inherent in scaling databases, facilitated by our knowledgeable host. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in scaling databases.

Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems.

All his programs are made from organic, hand-picked, artisanal keypresses.

Connect:

Mastodon

Martin Alderete is a Principal Site Reliability Engineer with a long track record in Engineering, Distributed Systems and System Level Programming in both academia, where he worked as a teaching assistant after getting his degree, and industry, where he led different teams building complex systems at scale.

He is passionate about Open Source and new technologies, an active contributor to open-source projects and part of different technical groups.

Before joining Booking.com he worked in multiple industries, including space, where he was a Satellite Reliability Engineer building systems (and bugs!) to operate fleets of satellites.

He is based in Amsterdam but originally from beautiful Patagonia, Argentina.

Connect:

X

17:45–18:45

Lightning Talks

The Liffey A

Lightning Talks are four-minute talks by different speakers addressing a variety of SRE-relevant topics.

The lightning talks session will conclude with Slide Karaoke, a chance for any attendee to show off their improv skills by presenting a slide deck that they have never seen before.

What About the Engineer's MTTR?
Ian Duffy, Cloudsmith
Rollout Monitoring at Scale: Reflections on Adopting Canarying in GCE
Roberto Frenna, Google
Breaking Out of Our Hybrid Cloud Datastore EOL Chains
Konstantinos Fardelas, Skroutz SA
SRE for LLMs: What We Learned While Launching
John Lunney, Google
Re-Building Envoy in Rust
Dawid Nowak, Huawei Ireland Research Lab
How SRE Can Help With Cost & Efficiency
John Looney, Crusoe Energy
How to SRE Anything to Work Smarter and Live Better
Jennifer Petoff, Google
9 SLIs; OH MY!
Sal Furino, Bloomberg CRE
The Voyager Spacecraft—These Are the Only Engineers on Earth Who Want To Maximize Latency
Robert Barron, IBM

Available Media

Thursday, 31 October

08:00–12:00

Badge Pickup

Ground Floor Foyer

08:00–09:00

Morning Coffee and Tea

The Forum

09:00–10:30

Track 1

The Liffey A

Monitoring Systems as a Service – Walking the Line between Giving Your Devs Good M&O and Setting All Your Money on Fire

Thursday, 09:00–09:40 GMT

Joan O'Callaghan, Udemy

Available Media

Monitoring-as-a-Service products like Datadog and Honeycomb are amazing for implementing monitoring & observability with minimal effort, but like Anything-as-a-Service, they come at a cost.

We are a very normal company, with all the tech debt and orphaned code that any company over a certain age has. Like everyone else, we had staff that heard, "measure everything!" but they didn't know what the monitoring bill looked like and that "everything" included a lot of junk.

In the talk I'll discuss how we managed to reduce cost wastage, enable extra vendor features, improve M&O knowledge within the engineering organisation and keep the bill the same or lower, despite a 60% growth in infrastructure at our company.

A note on the vendor: I won't say who the vendor is, but I think our experience was universal enough that our fixes and techniques will be helpful to other companies.

Joan O'Callaghan is a Monitoring and Observability Director at Udemy. She has worked in SRE, Incident Management, and M&O (in one form or another) for many, many years. She likes to host and write blameless incident reviews and take long walks on the beach where she has imaginary arguments with people that don't like resilience as much as she does. She is always very happy when she meets people more paranoid than her.

An Exploration in Storing Telemetry in Cloud Object Storage

Thursday, 09:50–10:30 GMT

Mike Heffner and Ray Jenkins, Streamfold

Available Media

Modern web application architectures require extensive telemetry data to function efficiently at scale. Traditional methods for collecting, storing, and processing this data have become increasingly expensive and challenging to maintain. Conversely, the prevalence of cloud object storage has given rise to the data lake. This has led some organizations to explore telemetry data lakes, which enable cost-efficient storage of large volumes of telemetry data.

We will explore various data storage formats used in constructing telemetry data lakes and discuss the tradeoffs associated with each approach. We will delve into common formats such as JSON, Parquet, ORC, and Apache Iceberg, examining how they can be utilized to store telemetry data like logs, metrics, and traces at scale. These formats will be empirically evaluated using real-world datasets. Additionally, we will review recent literature that highlights areas for design improvements in storage formats to better align them with modern computing hardware.

Mike Heffner is co-founder of Streamfold, where they are creating the first telemetry pipeline built for developers. Prior to Streamfold, Mike was a backend engineer at Netlify helping scale their delivery network, and at Librato building one of the first monitoring SaaS products. In his free time he takes advantage of all that the Blue Ridge Mountains have to offer.

Connect:

X

Ray Jenkins is co-founder of Streamfold, where they are creating the first telemetry pipeline built for developers. Prior to founding Streamfold, he led software engineering efforts at Snowflake on the observability and performance of FoundationDB, and at Segment on the development of their stream processing pipeline, identity resolution system, and message delivery platforms.

Connect:

X

Track 2

The Liffey B

Opening the Box: Diagnosing Operating-System Task-Scheduler Behavior on Highly Multicore Machines

Thursday, 09:00–09:40 GMT

Julia Lawall, Inria-Paris

Available Media

Getting unexpectedly poor performance from your multicore application? Maybe the operating system task scheduler is at fault. The task scheduler is responsible for placing tasks on cores and for selecting which task is allowed to run, at what time, and for how long. As such, the scheduler is a critical component of any operating system and has a major impact on application performance. Still, scheduling decisions are buried deep within the operating system code, making it challenging to diagnose performance problems (or even performance improvements) and to determine whether the scheduler is responsible and, if so, in what way. These challenges are compounded for highly multithreaded applications running on large multicore machines, due to the huge amount of information available.

In this talk, we present some tools that we have developed for visualizing the behavior of the Linux kernel task scheduler, and illustrate how these tools can be used to help diagnose performance problems. The tools presented are freely available at https://gitlab.inria.fr/schedgraph/schedgraph

Julia Lawall is a senior researcher at Inria Paris. Prior to joining Inria, she completed a PhD at Indiana University and was on the faculty at the University of Copenhagen. Her work focuses on issues around the correctness and performance of operating systems. She develops and maintains the Coccinelle program transformation system that has been extensively used on Linux kernel code, and has recently begun investigating the performance impact of the Linux kernel scheduler, as well as exploring formal verification of scheduler properties.

Granular CPU Capacity Management at Scale with eBPF

Thursday, 09:50–10:30 GMT

George Brighton and Cameron Howes, Goldman Sachs

Available Media

Real-time market data is exceptionally bursty, with update rates in the busiest seconds of the day regularly exceeding 10x the average. User experience is predicated on maintaining sufficient CPU headroom to prevent full buffers and the resulting client disconnects. Sampling cumulative CPU time at a typical scrape interval hides microbursts, and sub-second polling from user space induces unacceptable overhead, so a different approach is needed.

This talk will cover how Market Data SRE at Goldman Sachs uplifted CPU monitoring of our market data distribution infrastructure in an unintrusive way, achieving 10x the granularity with 5% of the original monitoring overhead. We will cover the journey from deciding to use eBPF, through trials using bpftrace and making the leap to BPF C, to collecting and aggregating the metrics effectively. It will be most relevant to those interested in capacity management across a heterogeneous estate, and those looking to implement eBPF for the first time in their organisations.

George Brighton is a Vice President at Goldman Sachs, where he leads the Market Data SRE team. A Prometheus and OTel committer, he is responsible for uplifting observability and operational practices. George presented "Market Data: Applying SRE Techniques to Legacy Designs" at SREcon22 EMEA.

Connect:

X

Cameron Howes is an Analyst in the Market Data SRE team at Goldman Sachs, specialising in low-level development and performance instrumentation. When he's not ferociously avoiding a memory allocation, or reading about the latest CVEs, Cameron can be found writing black-box probers and Prometheus exporters for the ticker plant.

Track 3

Liffey Hall 2

Workshop: Guided Journey into the Heart of Systemd

Thursday, 09:00–12:30 GMT

Alvaro Leiva Geisse and Anita Zhang, Meta

IMPORTANT: If you are attending the workshop, please bring a laptop that is capable of SSH-ing into a remote machine.

systemd (with lowercase S and D) remains, to this day, both one of the most critical pieces of a system and one of the least understood. This workshop is designed to touch upon the beginner features of systemd and explain how you can use systemd to solve common problems, including some that you didn't even know you had. What problems, you ask? You’ll have to come and see.

I love Python. I grew up in a small town in Chile, and one weekend, over 16 years ago, I had the flu and could not go out. I decided to learn how to code in Python, and that was the beginning of the road that would move us all to Northern California so that I could join the Production Engineering team at Meta. I also like eating and cooking (in that order).

Connect:

X

Anita Zhang is the software engineering manager of Meta's Linux Umbrella family of teams. Her teams connect Meta's low-level infrastructure with the open source community. She is known for being a part of the systemd community and continues to support systemd at Meta as part of their Linux Userspace team.

Discussion Track

Liffey Hall 1

Wrangling your Management Chain

Thursday, 09:00–10:30 GMT

Dave O’Connor and Todd Underwood

This session is an opportunity for people to come together and discuss managing your management chain, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in wrangling your management.

Dave is an SRE Leadership practitioner, Advisor and Coach based in Dublin. He has worked in SRE and SRE-adjacent organisations for over 20 years, primarily as an SRE Lead at Google from 2004 to 2021. Since then, he has spent time leading SRE, Security and Infrastructure teams at Elastic and Twilio.

He's currently a Consultant/Advisor for Busy teams at Co-Servant Systems, and a coach specialising in tech leadership at all levels.

Connect:

X

Mastodon

Todd Underwood recently led reliability for the Research Platform at OpenAI. Previously he was a Senior Engineering Director at Google leading ML capacity engineering in the office of the CFO at Alphabet. Before that, he founded and led ML Site Reliability Engineering and was the Site Lead for Google’s Pittsburgh office. He co-wrote "Reliable Machine Learning: Applying SRE Principles to ML in Production" (O’Reilly Press, 2022).

10:30–11:00

Coffee and Tea Break

The Forum

11:00–12:30

Track 1

The Liffey A

Embrace Fleet Reboots and Make Them Boring

Thursday, 11:00–11:40 GMT

Everton Didone Foscarini, Cloudflare

Available Media

Server reboots bring up mixed sentiments. Some want to say, “My kernel is stable; it does not crash, even with a thousand days of uptime.” Others understand that such a system is running with a thousand days of accumulated vulnerabilities.

At Cloudflare we believe that high uptimes are bad. While our reboot automation was being developed, we were hit by a kernel+BIOS bug that caused a high rate of node crashes. This encouraged the quick adoption of reboot automation and prompted us to implement better tooling to deploy fleet changes over reboots, create multiple reboot queues for different workloads, add load-based maintenance windows, and more.

We achieved monthly reboots for our edge fleet while keeping the clusters online and serving customer-facing traffic, unlocking our ability to iterate fast on Linux Kernel versions and OS releases, ensuring we are not running outdated library versions in hosts not rebooted for a thousand days.

Everton has been working on Internet-based services using Linux since 2003, joined Cloudflare in 2017, and helped to scale Edge location operations from 102 to 320 cities, creating tooling to manage service lifecycles and server reboots.

Connect:

Mastodon

A Brief History of Release Engineering

Thursday, 11:45–12:05 GMT

Dinah McNutt, MongoDB

Available Media

TL;DR: This talk is a humorous (hopefully) retrospective on release engineering. How did we get from building binaries using a command line to all the fancy CI/CD systems we have today?

Things we used to do seem ridiculous today. Can looking back help us move forward? What’s the evolution and career path of a release engineer? Has the role become diluted through overuse and misuse?

Please join in the fun and include your anecdotes and experiences in the Slack channel.

Dinah McNutt is a TPM for MongoDB and based in Dublin, Ireland. She has over 35 years of experience in systems administration, release engineering and software development. She has written for various publications over the years including the Daemons and Dragons column for UNIX Review magazine and Byte Magazine. She was the program chair for several USENIX Release Engineering Summits and LISA VIII. She’s given talks and taught tutorials at numerous conferences including LISA, FlowCon, and RELENG 2014.

Red Tide Revert

Thursday, 12:10–12:30 GMT

David Newman, Automattic

Available Media

This talk explores the challenges of managing unexpected production errors in high-frequency deployment environments and introduces an AI-driven solution for rapid error detection and resolution. The speaker will discuss how their team developed and refined an automated system that analyzes error logs, identifies problematic code commits, and streamlines the incident response process. This approach aims to reduce on-call stress, minimize user impact, and pave the way for fully automated error mitigation in complex, fast-paced development ecosystems.

With a diverse background in platform engineering, distributed systems, and artificial intelligence, our speaker brings a trove of experience driving innovation from startup to enterprise environments. As a technical founder in companies ranging from retail intelligence to digital signage, they have consistently demonstrated the ability to transform complex challenges into scalable solutions.

Their journey includes contributions to industry giants like Automattic and Red Hat, where they helped shape the future of WordPress.com and large-scale Kubernetes fleet management. An advocate for education and social impact, they led technology initiatives at Library For All, developing innovative solutions for remote learning in developing countries.

They currently lead the engineering initiative on an internal AI platform, ModelOps, and various AI-driven initiatives at Automattic.

Track 2

The Liffey B

Riot Games: Evolution of Observability at the Gaming Company

Thursday, 11:00–11:40 GMT

Erick Moreira and Kirill Mikhailov, Riot Games

Available Media

The video game industry is growing year by year, and it is projected that the market size for video games will double in the coming 10 years. The number of people playing video games will also grow substantially. All of this creates challenges for tech teams, who must make sure that games are not only fun to play but also offer stable, accessible gameplay. This is even more important for online competitive games, as they demand increased stability and performance.

Our presentation is focused on a review of the Riot Games journey through observability and specifically on the latest iteration of global-scale changes we made to introduce SRE and the new observability pipeline in the company.

I am Erick Moreira, a 32-year-old Brazilian from Rio, working and living for 5 years in Dublin. I grew up modding and creating simple things for games. Now, I am focused on the backend, cross-cutting concerns, and the developer experience. I still find space in my heart to build front-end stuff and games on the side. I am an enthusiast for tooling, automation, standards, and high-performance services.

I started my journey as an engineer while in school, building servers for online games. I then switched to traditional software engineering, working for large tech companies. But at the end of the day, I still landed in the gaming industry, where I have worked for the LiveOps organisation at Riot Games for the last 2 years.

A Powerful Logs Management Solution We All Have and Use but We Underestimate: systemd-journal

Thursday, 11:45–12:05 GMT

Costa Tsaousis, Netdata

Available Media

This talk aims to unearth the potent features of systemd-journal that have remained mostly underutilized and largely underappreciated within the SRE community. The focus will be on its ability to handle dynamically structured log entries, its inherent support for centralized logging, and its robust security features including log sealing.

Systemd-journal offers dynamic field management, allowing flexible log annotation and querying without predefined schemas, along with decentralized log management that enables seamless analysis across systems. Its sealing feature ensures log integrity, critical for incident response and forensics. There’s a tooling gap for converting plain logs into structured entries; however, we will show examples of how this can be achieved.

Costa Tsaousis is the Founder and CEO of Netdata. Since 1995, Costa has been actively working on internet related startups. He has been a co-founder and C-level executive of many successful projects, including Internet Service Providers, Cloud Hosting Providers and Fintech startups. With a passion for innovation and open-source, he now leads Netdata, a monitoring solution aiming to simplify and modernize infrastructure observability for all of us.

Blast Radius Reduction for Large-Scale Distributed Systems

Thursday, 12:10–12:30 GMT

Linhua Tang, Huawei Ireland Research Centre

Available Media

The construction of large-scale distributed systems poses significant challenges due to inherent complexities and the inevitability of failures across various levels, from hardware malfunctions to software bugs. Embracing the 'design for failure' philosophy, this paper delves into advanced isolation techniques aimed at reducing the blast radius—both spatially and temporally—thereby enhancing system resilience. Spatial containment strategies, such as cell-based architecture, compartmentalize failures to localized areas, preventing cascading effects. Temporal mitigation focuses on rapid recovery and self-healing mechanisms, which aim to restore system health promptly after a failure occurs. Furthermore, the paper explores the application of formal methods in verifying the robustness of these designs, providing a rigorous approach to ensure the reliability and effectiveness of implemented solutions. This research underscores the importance of proactive architectural planning and continuous verification in maintaining the stability of complex distributed systems.

Linhua Tang (also known as James) is a software engineer and tech lead for global server load balancing and formal methods at Huawei Ireland Research Centre. Before that, he worked at Microsoft and Amazon on various distributed systems.

Track 3

Liffey Hall 2

Workshop: Guided Journey into the Heart of Systemd

Thursday, 09:00–12:30 GMT

Alvaro Leiva Geisse and Anita Zhang, Meta

Discussion Track

Liffey Hall 1

Building New SRE Teams

Thursday, 11:00–12:30 GMT

Avleen Vig and Stephane Dudzinski

This session is an opportunity for people to come together and discuss building new SRE teams, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in building SRE teams from scratch.

Avleen is one of Twilio’s Architects for SRE. Over his luminous 20+ year career he has shone a light on the importance of making reliability a core part of the work done by all software engineering teams. When he isn’t working on improving systems designs and reviewing code, you can often find him outside with a telescope and a hot cup of tea.

Stephane Dudzinski is a seasoned veteran with over 20 years of experience in the tech industry, specializing in observability, SRE, and systems. With a decade of leadership experience, he has managed and mentored high-performing teams, improving system reliability. Stephane currently works as an SRE Manager at Reddit. When not looking into metrics, you can find him brewing beer in his back garden.

12:30–14:00

Luncheon

The Forum

14:00–15:30

Track 1

The Liffey A

AppStack: An Open Source Cloud Native Platform for Running Digital Public Services

Thursday, 14:00–14:40 GMT

Dimitris Mitropoulos, National Infrastructures for Research and Technology – GRNET and University of Athens; Alex Kiousis, National Infrastructures for Research and Technology – GRNET

Available Media

GRNET, Greece's National Infrastructures for Research and Technology, is a National Research and Education Network (NREN) organisation that acts as a network and services provider for research and education communities. Since 2019, GRNET has been responsible for the development, operation and maintenance of several governmental services, thus playing an important role in Greece's digital transformation. To address the different challenges related to this role, GRNET teams developed AppStack, a cloud-native platform, based on production-ready open source software, for running government-related services such as the gov.gr portal, the electronic issuance of documents signed by the Greek state, and gov wallet, among others.

AppStack provides an environment for integrating open-source and in-house software components, where DevOps can incorporate suitable tools to tackle scalability and security issues.

Currently, AppStack hosts workloads that serve more than 8 million Greek citizens, are able to handle more than 20K requests per second, and can generate hundreds of digital documents signed by the Greek state per second.

In this talk we will present AppStack, its numerous components, and how open source made it possible. Finally, we will describe some key experiences from production.

Dimitris Mitropoulos is an Assistant Professor at the National and Kapodistrian University of Athens and the Head of Reliability Engineering at the Greek National Infrastructures for Research and Technology (GRNET). Previously, he was a postdoctoral researcher at the Computer Science Department of Columbia University. Dimitris holds a PhD in Software Security from the Athens University of Economics and Business and has been involved in several EU and US funded R&D projects. His research interests include software engineering and computer security. He is a member of ACM, IEEE and USENIX.

Connect:

X

Alex Kiousis is a Site Reliability Engineer at GRNET in Greece. His team handles GRNET's on-premises infrastructure and services, delivering GRNET's custom Cloud service to Greece's Research and Academic communities and several user-facing Government-related Digital Transformation Web services.

Science Reliability Engineering for High Performance Computing

Thursday, 14:50–15:30 GMT

Nicholas Jones, LANL

Available Media

High Performance Computing (HPC) as an industry has long stood on very human-facing operational workflows. These workflows exist because HPC systems are generally purpose-built machines for small sets of code bases with very specific performance metrics. This purpose-built nature has left HPC with very bespoke, one-off systems, and with processes and infrastructure that serve a small set of code bases well but aren't resilient to generational churn. To combat the difficulty of generational churn, we've adopted an SRE mindset for our new administrative stack, OpenCHAMI. This lets us keep our figures of merit (exact reproducibility, parallel bandwidth, and compute time to solution) aligned with what benefits our customer base the most.

Nick is a scientist at Los Alamos National Lab, where he works on system security architecture, CI/CD infrastructure, and shared computing environments and strategies across the National Nuclear Security Administration Laboratories.

Track 2

The Liffey B

Get Your Non-SREs Oncall Ready!

Thursday, 14:00–14:40 GMT

JC van Winkel and Brad Lipinski, Google

Available Media

Hands-on learning is best for adults, and we've used this principle in Google SRE since 2017. However, many oncall engineers aren't SREs and haven't gone through a full week-long SRE onboarding program. How can they learn the same skills and go oncall with confidence, but without the week-long curriculum?

We cherry-picked from our SRE onboarding program to create a succinct, scalable program for this audience that includes the best of orientation: the breakage exercises. This program is called "Oncall Ready!" and is completely self-service, requiring no operational work from the SRE EDU team. In this talk we will discuss the development, the behind-the-scenes work, and the outcomes of this project. The best comment we got from a participant: "Oh wow, this is like going through a [production] escape room without having to pay for it".

JC has been teaching UNIX and programming languages since 1992, working for AT Computing, a small courseware spin-off of the University of Nijmegen, the Netherlands. JC joined Google's Site Reliability Engineering team in 2010 and is both a founding member and lead educator of the SRE education team, SRE EDU.

Brad joined Google SRE in 2013 and worked on datacenter software. He's taught for SRE EDU from the beginning and contributed to many of the team's automation efforts. In 2019, he joined SRE EDU full time and is now the team's tech lead.

Transforming Production Readiness

Thursday, 14:50–15:30 GMT

Panagiotis Moustafellos, Elastic

Available Media

In this talk, we’ll share lessons learned from integrating development teams into on-call rotations at Elastic, along with insights into the design and operation of an SLO observability product that monitors hundreds of thousands of SLIs globally. We’ll cover best practices for production readiness, phased product launches, and approaching significant software and infrastructure re-architecture.

Moreover, we will go through actionable strategies for navigating the delicate process of getting all engineers on-call, improving incident management, promoting safer software releases, and the use of observability tools, empowering teams to fully own their services.

Attendees will gain practical insights to enhance production readiness and service reliability within their organization.

Panagiotis Moustafellos is a systems engineer with over 15 years of experience in diverse tech environments. His areas of expertise include systems architecture, observability, and security, as well as scaling software systems and infrastructure. Currently he is a Distinguished Engineer at Elastic, building observability products and driving the production readiness transformation for Elastic Cloud.

Connect:

X

Discussion Track

Liffey Hall 2

Learning from Incidents

Thursday, 14:00–15:30 GMT

Laura de Vesine, Datadog, Inc., and Cail Young, Octopus Deploy

This session is an opportunity for people to come together and discuss getting the most out of your incident review process, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in learning from incidents.

Laura de Vesine is a 20+ year software industry veteran. She has spent the last 8 years in SRE working in incident analysis and prevention, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also has a PhD in computer science, but mostly her cats nap on her diploma.

Cail has spent the last couple of decades working at the intersection of people and technology: in the performing arts, in the motion picture industry, and now in the field of software operations. He is fascinated by learning from incidents - large and small - and will gladly trade stories about them.

Discussion Track

Liffey Hall 1

System Performance and Scaling

Thursday, 14:00–15:30 GMT

Leila Vayghan, Shopify, and Abbas Soltanian, OpsGuru

Join us for an interactive Q&A session on System Performance and Scaling, where our expert panel, featuring a senior infrastructure engineer and a senior cloud solutions architect, will address your most pressing questions. This session is designed to provide practical insights and real-world solutions to help you optimize your systems for performance and scalability. Whether you're dealing with cloud architecture challenges, Kubernetes orchestration, or scaling complex infrastructures, bring your questions and engage with industry experts to enhance your understanding and capabilities.

Leila is an engineer at Shopify, where she spends her days enabling millions of merchants to grow by making sure buyers are able to search and find their products. She does this by running a large-scale search infrastructure on Kubernetes in many regions of the world. Leila has completed her master’s degree on the availability of stateful applications running on Kubernetes and has presented her work at many conferences.

Dr. Abbas Soltanian, a Senior Cloud Solutions Architect at OpsGuru (Canada), holds a Ph.D. in Cloud Computing and has presented his work at numerous conferences. With over thirteen years of experience in both academia and industry, he helps companies migrate to the cloud and modernize their applications using cloud-native and open-source solutions. As a trusted advisor, Abbas leads multiple teams of cloud engineers, assisting companies from various domains in designing and developing secure, scalable, and highly available systems.

15:30–16:00

Coffee and Tea Break

The Forum

16:00–17:30

Closing Plenary Session

The Liffey

Energy Consumption of Datacenters

Thursday, 16:00–16:45 GMT

Thomas Fricke

Available Media

Let us take a look at the resource consumption of data centers and collect the current state of knowledge. There will be more questions than answers, but predictions can be made because all resources have their limits.

The increase in consumption has been exponential for years. With the AI hype, the demand for energy, cooling, water, and other resources has increased dramatically.

The existing GPU-based computing paradigm cuts hard into the standard design of data centers and demands other ways of cooling.

Thomas's main focus is cloud and Kubernetes security. He plans private clouds and delivers applications in highly critical infrastructure. His customers deliver services for transmission grids, healthcare, traffic, and the German administration.

He is a cofounder of two companies and lives in Berlin, Germany.

In his former life at the university, he studied statistical physics and even gave lessons on quantum theory and thermodynamics.

Connect:

Mastodon

Are We Really Engineers?

Thursday, 16:45–17:30 GMT

Hillel Wayne

Available Media

What makes software engineering different from “traditional” engineering? To find out, I interviewed 17 “crossovers”: people who have worked professionally as both a software and a traditional engineer. In aggregate, we learn three things: we are in fact engineers, we’re not actually that different as a field, and there’s a lot we can both teach and learn.

Hillel is a formal methods consultant and the author of Logic for Programmers and Practical TLA+. His other work includes Computer Things, a weekly newsletter on the history and theory of software engineering, and Let's Prove Leftpad. In his free time, he juggles and makes chocolate. He did, in fact, bring enough for everyone.

17:30–17:40

Closing Remarks

The Liffey

Program Co-Chairs: Effie Mouzeli, Wikimedia Foundation, and Murali Suriar, Snowflake

SREcon24 Europe/Middle East/Africa Conference Program

Monday, 28 October

17:00–19:00

Badge Pickup

18:00–19:00

Welcome Get-Together

Tuesday, 29 October

07:30–17:00

Badge Pickup

07:30–08:45

Morning Coffee and Tea

08:45–09:00

Opening Remarks

09:00–10:30

Opening Plenary Session

10:30–11:00

Coffee and Tea Break

11:00–12:30

12:30–14:00

Luncheon

14:00–15:30

15:30–16:00

Coffee and Tea Break

16:00–17:30

17:30–19:30

Conference Reception at the Sponsor Showcase

Wednesday, 30 October

08:00–17:00

Badge Pickup

08:00–09:00

Morning Coffee and Tea

09:00–10:30

Opening Plenary Session

10:30–11:00

Coffee and Tea Break

11:00–12:30

12:30–14:00

Luncheon

14:00–15:30

15:30–16:00

Coffee and Tea Break

16:00–17:30

17:45–18:45

Lightning Talks

Thursday, 31 October

08:00–12:00

Badge Pickup

08:00–09:00

Morning Coffee and Tea

09:00–10:30

10:30–11:00

Coffee and Tea Break

11:00–12:30

12:30–14:00