In ‘Seeing Like an SRE: Site Reliability Engineering as High Modernism’ [1], Laura Nolan introduced James C. Scott’s Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed [2] and compared SRE to High Modernism. It is a deeply interesting comparison, in particular because I think they share many of the same shortcomings.
When Scott talks of "legibility", he means the ability to make a complex, ever-changing, localized world fit into a set of predefined characteristics. The goal is that by defining a limited set of well-known, standardized characteristics that sufficiently describe all the different local realities, the state can impose the same rules and the same control over all the disparate parts.
In a High Modernist view, the idea is that a centralized entity with deep knowledge of the science and the high-level models of the field will be able to make better decisions for the group, and to lead and direct the whole group's behavior to success. History has judged the many attempts to implement High Modernism over the years, and has generally judged them harshly: High Modernism has failed at its stated goals nearly every time it has been tried.
In reaction to Nolan's article, I advised my Twitter readers to read Jane Jacobs’ The Death and Life of Great American Cities [3] as a different take on the problem of urbanism and the failures of High Modernism. From Jacobs’ position, cities are not inhabited by a static population with static needs. Over time, the needs of the city’s population evolve, which means that for a city to survive, it needs to be able to adapt and change its behavior. Cities need to be polymorphs, always able to reuse and repurpose their parts into something new and different — something that will serve the current needs of the inhabitants. Another interesting corollary is that too high a degree of specialization in urbanism (or any other field) always has an expiration date, usually reached faster than the designer thinks. This is a fundamental problem that limits High Modernism [4].
On the other hand, after a certain size, some kind of control and direction for evolution is needed. You need the specialised knowledge of metrics, monitoring and generalised management techniques that High Modernism considers the solution to the problem of complexity — without it, it becomes impossible to fulfil all the constraints and requirements on the system. While the diversity and the fast-changing nature of demands on a system force you to have a lot of metis — local knowledge learned by being there when it was needed — you also need a lot of hard-to-master techne, the kind of knowledge built by abstracting reality into models, which is not easily accessible and can only be learned by a directed and driven specialist.
This is also the situation that SREs find themselves in. We need the ethos of the DevOps movement, letting the teams that own their services manage much of the day-to-day work. We need to let teams adapt their methods and tools to their needs and context. However, we also need a higher-level vision of the future of the organisation as a whole. Much of the knowledge needed to make higher-level strategic planning work is not available at the team level, because each team is, after all, mainly oriented to the needs of its own services and users.
This brings us to metrics. Usually, the way we resolve this dilemma is by ignoring the local needs. We impose generic metrics, rules and constraints across the whole organization. After all, if we want to control the future of the organization, what we need are organizational-level measures, decisions and objectives. It is a style that tends to mix well with Command-and-Control management: set up SLOs, OKRs, KPIs, or whatever other three-letter acronyms you learned from conferences or blogs, and then simply enforce them — a prime example of techne.
In turn, this leads to another way High Modernism fails. However valuable generalised metrics are, imposing metrics based solely on techne does not work. You will build your toolkit on an abstracted model that does not fit reality well.
A very common example of a generalised metric is MTTR (Mean Time To Restore). It seems simple, but there is legibility work involved: we have to define what an incident is, when it starts and ends, when it is repaired, and when it is mitigated. We will want to track MTTR automatically somehow, probably from our organization’s ticketing system, and most of that data will have been entered into our systems by humans, our own operators.
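To make the legibility work concrete, here is a minimal sketch of what such an automated calculation might look like. The record layout, the field names (started_at, resolved_at), and the sample data are assumptions for illustration, not the schema of any particular ticketing system.

```python
from datetime import datetime, timedelta

# Hypothetical incident records, as a ticketing-system export might look.
# The field names and values are illustrative assumptions only.
incidents = [
    {"id": "INC-101", "started_at": "2023-04-01T09:12:00", "resolved_at": "2023-04-01T10:02:00"},
    {"id": "INC-102", "started_at": "2023-04-03T22:40:00", "resolved_at": "2023-04-04T01:15:00"},
    {"id": "INC-103", "started_at": "2023-04-10T14:05:00", "resolved_at": "2023-04-10T14:35:00"},
]

def mttr(records) -> timedelta:
    """Mean time to restore: the average of (resolved_at - started_at).

    Every number this returns depends on how humans filled in those two
    timestamps -- the legibility work happens before this function runs.
    """
    durations = [
        datetime.fromisoformat(r["resolved_at"]) - datetime.fromisoformat(r["started_at"])
        for r in records
    ]
    return sum(durations, timedelta()) / len(durations)

print(mttr(incidents))  # 1:18:20 for the sample data above
```

The arithmetic is trivial; all the contested decisions (what counts as an incident, which timestamp marks "restored") live in the data it is fed.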
Will it be a good metric to measure progress and stability of our systems? Well, that is a complex question [5]. More importantly, what will happen once the humans in our organization figure out that we are going to make decisions based on this information — information that they already shape, by being the ones handling the incidents? From experience, they are likely to adjust their definition of what an incident is, of when it starts and when it ends. Each of these definitions is political and negotiated, after all. Their definition is local; their meaning depends on a team’s own metis. We have simply added one more parameter to the negotiation of that meaning. The tool previously used to mediate this negotiation — the ticketing system — now mediates a separate negotiation about organisational MTTR at the same time.
Teams will work around, hack, and game the generalised metrics, exploiting the gap between how your model imagined the world and how it really is. We can often see that in action with incident metrics, where the definition of what counts as an incident becomes more central to the metric than the real state of the system [6].
The way out of the pit is to remember what techne and metis are. Techne is high-level abstract knowledge, by definition specialised and free of context. Metis is purely contextual knowledge: hard to generalize and abstract, but highly efficient in getting things done. Metis is the knowledge of how things are and how to constantly adapt to them. It is the glue between the real world and our human conceptual models of it. The tools people use to try to match their reality with the impossible demands imposed from above are metis. If you impose metrics without using this localized knowledge to tie them to the reality your team deals with, you will ultimately end up with metrics that are completely disconnected from any real meaning.
But do not lose hope. Legibility and specialized knowledge are deeply useful for building better systems. All of us, as SREs, still have something to contribute. Organizational-level work is not an exercise empty of all meaning — we just need to understand how to make our contributions work well with that localized, ever-changing knowledge that is metis. So how can we make it work? We must actively seek metis as a translation layer between the real world and our models, and use metis to inform our use of techne.
Instead of having to define everything precisely upfront, in the abstract, let's use the fact that our teams' knowledge is already adapted to their needs and context. Instead of tracking MTTR through the ticketing system, we could ask our engineers, as a monthly question to the team, things like: "How long do you feel it takes you to fix an incident?" or "How fast do you think we are at fixing problems?" We would then be defining MTTR differently, in a less abstract way.
If you are afraid that the answers would be too hard to collect and analyze, there are solutions. We could make it a multiple-choice question, which would make the answers more legible in the abstract and easier to quantify. The results would probably be just as easy for the organization to act on. The difference is that now we would always get a directly localized answer, through a dedicated medium, instead of overloading the meanings of the ticketing system. We would now have a meaningful MTTR — meaningful in the sense that it is charged with the localized meaning of the situations the teams encounter.
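As a sketch of how lightweight this could be, the snippet below tallies one month of answers to such a multiple-choice question. The question wording, the answer buckets, and the sample answers are all made up for illustration; a real team would negotiate its own wording and scale.

```python
from collections import Counter

# A hypothetical monthly survey question with fixed answer buckets.
QUESTION = "How long do you feel it typically takes us to fix an incident?"
CHOICES = ["under 15 minutes", "15-60 minutes", "1-4 hours", "more than 4 hours"]

def summarize(answers: list[str]) -> None:
    """Print the distribution of answers for one month's survey."""
    counts = Counter(answers)
    total = len(answers)
    print(QUESTION)
    for choice in CHOICES:
        share = counts.get(choice, 0) / total
        print(f"  {choice:>20}: {share:5.0%}")

# One month's (made-up) answers from a team of six engineers.
summarize([
    "15-60 minutes", "15-60 minutes", "1-4 hours",
    "15-60 minutes", "under 15 minutes", "1-4 hours",
])
```

The output is still a number the organization can trend over time, but the number starts from the team's own perception rather than from a contested incident definition buried in ticket fields.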
If SRE, as a profession, is to specialise in centralising and standardising operations, then we should decide whether we want to prioritize being efficient in a complex adaptive world or whether we want to make everything legible and standardized. The role of SRE can be to bridge the abstract High Modernist view and the messy reality of building and maintaining complex adaptive systems. We will not escape metrics, but if we want to do our job properly, we cannot escape the fact that meaning is produced with local, in-context knowledge. Our abstract models have no meaning if they are not translated and adapted, through metis, to the reality of the world.
Instead of imposing easy-to-measure but fundamentally meaningless metrics on our teams, let's offer metrics that are infused with the complex, adaptive knowledge of the experts in the very systems we want to improve.