Eric Schow and Praveen Yedidi, CrowdStrike
We ingest over a trillion events per day into our cloud platform, so it is critical that the platform be available, operational, reliable, and maintainable.
In creating a comprehensive monitoring strategy for our data processing platform, we found it useful to model the platform's efficiency and resilience along two axes, complexity of implementation and engineer experience, which together define four quadrants: observability, operability, availability, and quality.
In this talk, we present how we've employed this four-quadrant model to establish key indicators and enforceable quality SLAs in order to improve the resilience of our cloud platform while reducing operational complexity.
Eric Schow, CrowdStrike
Computational Biophysicist turned Mobile Engineer turned Cloud Engineer turned Site Reliability aficionado. Currently on a mission to stop breaches at CrowdStrike, where I lead the Site Reliability team.
Praveen Yedidi, CrowdStrike
Distributed systems developer with a decade of experience in large-scale cloud-native application and tooling development, as well as in mentoring, facilitating, and leading teams. Strong analytical skills combined with solid knowledge of Go, JavaScript, Kubernetes, AWS, Terraform, Vault, Consul, service meshes, observability, and monitoring tools. Active open-source contributor to projects such as Kubernetes, gVisor, Grafana, Terraform, and firecracker-containerd. I enjoy speaking and have spoken at conferences such as Kafka Summit, JS Conf, ContainerCamp AU, DDD Sydney, and Go Days. Organizer of Serverless Days Melbourne.
SREcon21 Open Access Sponsored by Indeed
author = {Eric Schow and Praveen Yedidi},
title = {A Principled Approach to Monitoring Streaming Data Infrastructure at Scale},
year = {2021},
publisher = {USENIX Association},
month = oct
}