Andrew Cowie
Implementing observability was a game-changer. We dramatically reduced our time to identify problems, isolate causes, and see effects of changes.
But it's not quite as easy to retrofit as we might like to think. Brooks taught us to be wary of doing things over, but we couldn't safely make even basic changes to the existing codebase. Being able to do observability at all was a major motivation for a massive re-engineering. We'll share lessons learned as we rebuilt a large distributed system.
As we iterate the code we iterate our telemetry, too. Once you've learned something and changed the system, it's a new system; telemetry is not a continuous function! This has a drawback: you can't use observability as a substitute for business metrics. Which raises an interesting question: can you actually measure your SLOs using SLIs in a distributed system?
Andrew Cowie
Andrew Cowie has an extensive background in software development, systems operations, production infrastructure, and engineering leadership but, somewhat unusually, started his career as an infantry officer in the Canadian Army, having graduated from Royal Military College with a degree in engineering physics. He later ran operations for a new media company in Manhattan and was part of the effort to recover the firm after the Sept 11 attacks. Since then he has consulted on crisis resolution, change management, robust architectures, and (more interestingly) leveraging Open Source to achieve these ends. Andrew has been working in and around systems engineering and functional programming for many years; his most recent work has been to re-engineer observability into analytics pipelines written in Haskell.
author = {Andrew Cowie},
title = {Observability Is Not Analytics!},
year = {2022},
address = {Sydney},
publisher = {USENIX Association},
month = dec
}