Oncall: An Equal-Opportunity Waste of Time

November 3, 2022

Opinion

Authors:

Dave O'Connor

Article shepherded by:

Laura Nolan

The Reward for Good Work is More Work

One of the basic premises of Site Reliability as an engineering specialisation (although certainly not a unique one) is that we always need to be finding a better class of first-world problem. This has been expressed as ‘automating yourself out of a job every 18 months’, or other mnemonics for using enduring engineering to make progress away from repeated manual work that doesn’t make the system meaningfully better (“toil”), repetitive outages, or other undesirable outcomes.

It’s in this way that we advance the state of the art when it comes to our profession. Our time is spent less and less on manual effort, and our tools become sharper and smarter. We self-actualise as an engineering specialisation that thinks of headier things – of products, of customer satisfaction, and of our place in a world without a tolerance for toil.

However, the reward for good work is still more work. We should be proud of, and prepared to do more of, the kind of work that got us where we are. If you demonstrate success in delivering something, you’ll be asked to do it again – to do it more, to do it harder, to do it smarter.

Why, then, is SRE still associated with an operations style that begins and ends with in-person oncall interventions, that in many people’s minds defines our value?

A Waste of Time

Lots of obligatory things that result in good outcomes are a waste of time. In the same way that preparing your tax return results (usually) in not being audited or fined, staffing an oncall rotation is a necessary evil when you own a piece of software people are using. As with the tax return, you want to spend as little time, energy and effort on it as possible. We want it to be less onerous overall. It should be simple and pain-free, and you should only really need to look at the outlier cases, as opposed to repeat incidents or manual cranking of processes to keep production running.

The distinction here is between “a waste of time” as opposed to “a bad idea”. There’s a certain amount of incident response that’s inevitable and useful – the amount you need to do, and no more. After that it’s pageantry and busy work.

Aside: Most people in Ireland, where I live, never have to file a tax return themselves. All but special cases are managed by the tax authorities. This is normal to people living here and is often a source of amazement to my American colleagues. The US annual filing requirement is intentional, and mostly unnecessary. This is probably why I used this for my example!

The Fetishisation of Incident Response

Oncall being complicated is – in part – a self-fulfilling prophecy. Many of the teams and organisations I’ve worked with had practices such as a minimum amount of time spent on the team before being considered for oncall, a set of quizzes to ‘qualify’ you for oncall, or a set of key ‘black belt’ folks on a team, where a new oncall engineer had to impress them with their knowledge of the innards of the systems at hand before being permitted to be oncall.

Similarly, the incident management principles are to an extent based on processes (or even just taxonomy) from FEMA and other sources that lionise responders and assign impressive-sounding roles. There’s certainly usefulness to all this, but there’s also an element of pageantry.

It’s a very subtle slippery slope between making prudent preparation for incident response based on assessment of risk and business needs, and making incident response complicated on purpose. At best, making oncall the exclusive responsibility of an elite SRE class increases our tolerance for complexity.

Oncall is a form of toil – it needs to be done but it doesn’t leave our systems in a better state. Often, the expectation of heightened operational awareness based on being in an oncall rotation comes hand in hand with an attitude that there’s value in having folks continue in a cycle of expertise qualifying them for what, in essence, is a waste of time. This also reinforces the ‘only SREs are equipped to be oncall’ trope that is tempting to defer to, especially if you’re a busy development partner.

If oncall is not a good use of engineer time but a necessary evil, then the fairest outcome is we spread it evenly across all qualified engineers; the insistence on SRE taking on more can often be tied back to anachronistic attitudes around ‘ops’, or a fetishisation of the pager that mostly doesn’t stand up to real scrutiny.

Why We Invest in SRE

Another of the basic premises of investment in SRE is common to many specialist functions: the idea that if we invest in hiring an SRE, it adds incrementally more value than if we hired a product SWE or other specialist.

This raises the question: why? Why is it incrementally more valuable to hire more SREs, after the initial investment that gets you an SRE presence and engagement? Unfortunately, the conversation around incident response is often brought up when discussing headcount and further investment, no matter where you are. Product Engineering leads don’t want their engineers to be woken up; they don’t want large-scale incidents to derail sprints or roadmaps or deadlines. It’s not because they’re bad people, it’s because they’re busy people. The value proposition is the one that’s presented to them. Similarly, if an SRE lead is being offered an augment to their team’s capability, they’re unlikely to turn it down. They’re busy people, too.

What would the value proposition of investing in SRE be if you took oncall and incident response out of the equation?

Losing the Training Wheels

The extent to which SREs are treated as special pixies that carry the secret to oncall can certainly be intentional, as it was at Google. SRE at its foundation was a novel and engineering-focused approach to the problem that things often explode in ways we don’t want them to. The baggage that came with that often took the form of existing commitments around oncall, interrupts, and manual work. To many folks (often those paying least attention), this defined our worth.

Times have changed. The value proposition of SRE has shifted. The usefulness of SRE as a buffer for operational toil is a set of training wheels we as SREs should hasten to remove.

A Post-Magical Era

If we remove the ‘magic’ from oncall work, a couple of thought experiments come to mind. These may, of course, be simply thought experiments, depending on your ability to effect change in your organisation – however, they can also form the basis of a discussion with your stakeholders.

The SRE Book’s chapter on being oncall [1] suggests eight engineers (or six each in two sites) as a minimum rotation size. In many cases, seven per site was taken as the minimum, and in the majority of cases, all oncall was done by SREs. One of the primary elephants in the room was of course whether the remit of the SRE team was sufficiently large to justify its size, if you removed oncall from the equation. This often resulted in an arrangement where a single SRE team might cover several services, and sometimes these services were different enough to impose a heavy cognitive load on whoever was oncall.
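As a rough illustration of where a figure like eight can come from: if a rotation always has two people oncall at once (say, a primary and a secondary) and no engineer should spend more than a quarter of their time oncall, the smallest viable team is two divided by a quarter. Both inputs are assumptions made for this sketch, as is the function name `min_rotation_size`; the chapter itself is the place to check the actual reasoning.

```python
def min_rotation_size(people_oncall_at_once: int, max_oncall_percent: int) -> int:
    """Smallest team that can staff a round-the-clock rotation.

    Hypothetical helper for illustration only. It assumes every engineer
    takes an equal share of shifts and ignores leave, attrition and ramp-up.
    """
    if people_oncall_at_once < 1:
        raise ValueError("need at least one person oncall at a time")
    if not 0 < max_oncall_percent <= 100:
        raise ValueError("max_oncall_percent must be between 1 and 100")
    # Integer ceiling division: round up, because a fraction of an
    # engineer cannot take a shift.
    return -(-(people_oncall_at_once * 100) // max_oncall_percent)


# Assumed inputs: primary plus secondary, each engineer oncall at most 25% of the time.
print(min_rotation_size(2, 25))  # 8
```

Under these assumptions the number is driven entirely by the shape of the rotation, not by how much engineering work the team has, which is the tension the paragraph above describes.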

Outside Google, many SRE or SRE-like groups end up staffed to ‘follow the sun’, in other words, to ensure that there are engineers awake to deal with urgent issues 24/7 without having to page people awake. This pattern, naturally enough, happens when engineering organisations are concentrated in a single region. It can be difficult for SRE teams outside of that region to have impact. The ‘follow the sun’ pattern can lead to small and often undervalued SRE teams.

Thought Experiment 1: Imagine you were no longer responsible for oncall, interrupts or tickets for your products or infrastructure. Do you think you can still justify the entire size of your organisation, person by person? What would the delta be?

There is often a ‘tax’ on organisations that collectively have a toil and incident management burden: we budget more headcount than the project work demands in order to allow for interrupt work. However, this can create a perverse incentive. If you place most of this headcount windfall into one group, that group will almost always be both too large, and likely less effective.

The effectiveness hit comes from two main sources:

A watered-down or unclear charter. Stakeholders see high-profile incident response/oncall happening, and don’t demand clarity on what other work the group is undertaking.
The practical inability to plan, since the responsibility for dealing with unexpected events like outages is situated mostly in that group. Ironically, this is often one of the reasons why SRE teams are staffed in the first place: to sweep the issue of a regularly derailed team under the carpet of ‘not my team’.
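Thought Experiment 1 can be given a crude first pass with a little arithmetic. The sketch below is only a starting point for the person-by-person conversation, not a substitute for it. The `Team` type, the helper names and the numbers are all hypothetical, and the share of time spent on interrupt work is an input you would have to estimate for your own team.

```python
from dataclasses import dataclass


@dataclass
class Team:
    headcount: int             # engineers currently on the team
    interrupt_fraction: float  # estimated share of team time on oncall, interrupts, tickets


def project_only_headcount(team: Team) -> float:
    """Engineer-equivalents of work left once interrupt work is removed."""
    if not 0.0 <= team.interrupt_fraction <= 1.0:
        raise ValueError("interrupt_fraction must be between 0 and 1")
    return team.headcount * (1.0 - team.interrupt_fraction)


def headcount_delta(team: Team) -> float:
    """How much of the current headcount is justified only by interrupt work."""
    return team.headcount - project_only_headcount(team)


# Hypothetical team: 12 engineers, a quarter of whose time goes on interrupts.
team = Team(headcount=12, interrupt_fraction=0.25)
print(project_only_headcount(team))  # 9.0
print(headcount_delta(team))         # 3.0
```

The delta is the part of the budget that exists because of the interrupt ‘tax’; the harder question is whether the remaining figure stands up on project work alone.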

Thought Experiment 2: What aspects of the essential parts of incident response do we want everyone to be good at, and how realistic is that prospect?

In the model where mostly expert SREs are oncall (or even expert non-SREs), there’s a certain set of assumptions we make around competence at it. This involves assuming people are good at things like incident triage, the practice of incident command, and communications to stakeholders and even customers. Anecdotally, I’m going to go out on a limb and say that a fair percentage of folks who are oncall aren’t good at these things, and that’s okay. The question being asked is whether we think this skillset is essential, how much we’re willing to invest in training, and how much risk we’re willing to accept that someone mishandles a major incident.

To go further – incident command and management is a specific set of skills that you can definitely be good at, and where the business really, really needs a consistent and competent response, every time. At Twilio, we have a specific team that manages all incidents, follow-up actions, and operational insights around incidents company-wide. We’ve found that making sure that the data and insights around incidents and their follow-up flow back into the business is a full-time job. Relying on a rotation of variably interested volunteers to ensure this happens will get you mixed results.

The Value Proposition

Often, teams will have a ‘charter document’, which describes the team’s place in the larger organisation; in some ways this can be used to justify the team’s existence, to tell why the team’s work is essential to the functioning of the business.

Much of this article has attempted to call out some key assumptions around oncall, incident response, operations, and how an SRE team’s charter relates to that.

SRE lies between two extremely compelling forces, when it comes to charter definition:

The supreme temptation of relying on incident response as a justification for investment in the team.
The common lack of scrutiny when it comes to justifying the team size based on project/engineering work alone.

Both of these have commonly observed outcomes, including:

SRE teams can find themselves unable to retain key staff if the engineering work is being overtaken by operational work – this is almost never treated with the urgency that it is for other engineering teams.
SREs complain that there is a lack of impactful engineering work to do. They are often correct in this assessment – the issue being that the truly impactful, necessary engineering work is already staffed. At Google, this was papered over by a strong internal mobility culture; the stated remedy was to go find another team.
It takes a particular kind of fortitude for a manager to admit their team is too large. Often, it is easier to broaden the definition of ‘essential engineering work’ to suit the team size as staffed, rather than the other way round.

Whatever a given team’s charter, it is likely worthwhile for any team that has some operational responsibility to assess their charter without regard for incident response (as per Thought Experiment 1, above). Incident response isn’t going away – however, it can increasingly be a shared burden, with teams across all engineering functions having a healthy and ideally co-equal degree of investment, ownership and responsibility.

Appendix

References:

[1] Andrea Spadaccini, 'Being Oncall', in Site Reliability Engineering: How Google Runs Production Systems (O'Reilly, 2016). https://sre.google/sre-book/being-on-call/

Article Categories:

SRE

Sysadmin

Last updated February 8, 2023

Authors:

Dave O'Connor is an SRE Leadership practitioner based in Dublin, Ireland. He's currently VP of Engineering at Twilio, working on reliability and leading the SRE group there, and previously worked at elastic.co and at Google, where he was the global lead for Storage and Databases SRE as well as the head of Engineering at Google Ireland for a time.

He holds various opinions about technology, building and keeping teams and organisations, the role of SRE/DevOps/whatever the kool kids are calling it, and various other things. He has run orgs from individual teams up to orgs of 300+ engineers, managers and other functions.

Dave was responsible for the first and second Reddit AMAs participated in by SRE, and is the author of chapter 29 of the SRE Book.

His opinions as expressed here are his own.

doc@gerrup.eu