The Frontiers of Reliability Engineering

Tuesday, 29 October, 2024 - 11:50–12:30 GMT

Heinrich Hartmann, Zalando SE

Abstract: 

We take the 10th anniversary of SRECon as an occasion to reflect on the past decade of advancements in Reliability Engineering and to provide an overview of the frontiers we face today. At Zalando, we followed major industry trends: outsourcing hardware provisioning to AWS, packaging applications into Docker images, fully automating deployments (CI/CD), and implementing Distributed Tracing for microservice observability. Despite these advances, many challenges remain in building reliable, observable software systems, and new areas have arisen that require new methods and tools. In this talk, we present a number of conceptual views that help map out the larger Reliability Engineering landscape, and we zoom in on three specific frontiers that we are actively investing in at Zalando: (1) Data Operations and Monitoring Event-Based Systems, (2) Mobile Observability, and (3) Effective Management Practices for Reliability.

Heinrich Hartmann is a seasoned expert with a decade of experience in Reliability Engineering. He currently serves as Senior Principal SRE at Zalando, a leading European e-commerce company, where he oversees company-wide reliability practices. Before joining Zalando, Heinrich was Chief Data Scientist at the monitoring platform Circonus, where he managed the analytical product offerings and pioneered histogram methods for latency monitoring.

Heinrich is a frequent speaker at industry conferences and is best known at SRECon for his regular "Statistics for Engineers" masterclass.

BibTeX
@conference {302245,
author = {Heinrich Hartmann},
title = {The Frontiers of Reliability Engineering},
year = {2024},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}