Liz Fong-Jones, honeycomb.io
It is not feasible to run an observability infrastructure that is the same size as your production infrastructure. Past a certain scale, the cost to collect, process, and save every log entry, every event, and every trace that your systems generate dramatically outweighs the benefits. If your SLO is 99.95%, then naively collecting everything means gathering roughly 2,000 times as much data about requests that satisfied your SLI as about those that burnt error budget. How do you scale back the flood of data without losing the crucial information your engineering team needs to troubleshoot and understand your system's production behaviors?
Statistics can come to our rescue, enabling us to gather accurate, specific, and error-bounded data on our services' top-level performance and inner workings. We can keep the context of the anomalous data flows and cases in our supported services while not allowing the volume of ordinary data to drown it out.
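As a rough illustration of the idea, and not the method presented in the talk, the sketch below first reproduces the ratio implied by a 99.95% SLO (about 1,999 to 1, which the abstract rounds to 2,000), then shows one simple way to thin ordinary traffic while keeping every anomalous event. The function names, the `error` field, and the 1-in-100 rate are all hypothetical choices made for this example.

```python
import random


def success_to_failure_ratio(slo: float) -> float:
    """Ratio of SLI-satisfying requests to budget-burning ones when exactly at the SLO."""
    return slo / (1.0 - slo)


def sample_events(events, success_rate=100, rng=None):
    """Keep every failed event and roughly 1 in `success_rate` successful ones.

    Each kept event records the rate it was sampled at, so that totals can
    later be estimated by weighting each event by that rate.
    The `error` field and `sample_rate` key are assumptions for this sketch.
    """
    rng = rng or random.Random()
    kept = []
    for event in events:
        rate = 1 if event["error"] else success_rate
        # rng.random() is in [0, 1), so rate == 1 always keeps the event.
        if rng.random() < 1.0 / rate:
            kept.append({**event, "sample_rate": rate})
    return kept


def estimated_total(kept):
    """Estimate the original event count from the sampled events."""
    return sum(event["sample_rate"] for event in kept)


if __name__ == "__main__":
    print(round(success_to_failure_ratio(0.9995)))
    traffic = [{"error": False}] * 100_000 + [{"error": True}] * 50
    kept = sample_events(traffic, success_rate=100, rng=random.Random(0))
    print(len(kept), estimated_total(kept))
```

Because each retained event carries its own sample rate, the estimate of total traffic stays unbiased even though errors and successes are kept at very different rates; how close it lands in any single run depends on the volume of events and the rate chosen.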
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 15+ years of experience. She is an advocate at Honeycomb.io for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
author = {Liz Fong-Jones},
title = {Refining Systems Data without Losing Fidelity},
year = {2019},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}