Adam Mckaig and Tahia Khan, Datadog
Datadog is a popular cloud monitoring service that operates at scale in all three major cloud providers, ingesting tens of GB/s of points across many billions of timeseries into PiBs of hot and cold storage. Naturally, reliability is paramount.
In this talk, we'll show how our very large distributed system works today, and how it grew from a very small non-distributed system. We'll share the most interesting scaling and reliability challenges we faced along the way, how we solved them (for now), and some important lessons and strategies that emerged. We'll also share a couple of bonus problems which are still very much unsolved today, and what we're planning next.
Adam Mckaig, Datadog
Adam Mckaig is a Staff Engineer at Datadog in New York, where he runs Metrics Reliability. Previously, he built things at Google, the New York Times, Bloomberg, and UNICEF. His favorite sound is a pager not going off.
Tahia Khan, Datadog
Tahia Khan is a Toronto-based SRE at Datadog. Before settling on SRE, she worked on everything but frontend at a bunch of startups, Mozilla, and Amazon. Outside of work, Tahia draws bad art.
SREcon22 Americas Open Access Sponsored by Blameless
author = {Adam Mckaig and Tahia Khan},
title = {How the Metrics Backend Works at Datadog},
year = {2022},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}