Lessons Learned in 10 Years of SRE: Part 1 - Starting SRE

February 2, 2022

Opinion

Authors: Andrea Spadaccini

Article shepherded by:

Laura Nolan

The technical side of Site Reliability Engineering is reasonably well-understood and documented. There are now a significant number of books, articles, and conferences that deal with how to reason about, operate and improve the reliability of large-scale distributed systems. However, the dynamics of SRE teams themselves are less discussed [1]. Effectively operating an SRE team or organization is very rewarding, but can also be very challenging.

In this article, I will share five lessons I learned about starting SRE teams (or engagements, or organizations). There will be a follow-up piece containing five lessons I learned about sustaining healthy SRE engagements at steady state. I presented a talk version of these articles at SREcon21 [2].

Know what you want from SRE

The very first question I usually ask anyone who is planning to start an SRE team or organization is: why do you want to "do SRE"?

I find that this question is very powerful. It is not easy to articulate a universal definition of SRE, and there are many possible ways of implementing SRE, including different team topologies and focus areas. Asking why probes into the motivations of the person you are talking to. It will also help surface possible differences in interpretation of what exactly SREs do and what an SRE team or organization can provide. It will help you understand the goals, assess feasibility and suggest an appropriate SRE setup strategy.

SRE can be pretty expensive when implemented as a separate team (or organization) that shares work and ownership with a separate product development team (or organization), focusing on aspects related to reliability. This implementation, which is often used in large organizations, provides the following benefits (among others):

dedicated planning for reliability work
a growing set of best practices and tools
an organization (SRE) which will grow to be more impactful than the sum of its parts, influencing more areas of the company than those it directly engages with.

In order for this approach to be effective, the organizational context has to be mature enough to support the separate SRE organization, and the investment needs to be tailored to the needs, the budget and the timeline.

Finally, I find it beneficial to write down, in the very early days, the goals of the SRE team and how success will be measured. Having clear goals in mind, ideally tied to measurable results, will sharpen the focus of your SRE journey and allow you to demonstrate return on investment.

Align with business goals and customer needs

SRE teams operate in a broader business context, and therefore must serve the company's business needs. They must do so in cooperation with their product development counterparts.

SRE teams are usually less subject to business pressure to deliver features, and enjoy greater freedom by design. But this freedom has a specific purpose: it is there to allow the team to identify reliability gaps and act on them. This freedom demands discipline in identifying and executing impactful projects that are meaningful for the business and, ultimately, for customers. It is fundamental to set up feedback loops between the SRE team and the product development team to make sure SRE keeps serving business goals.

Ideally, there should be two feedback loops: one from the product development team to SRE, as mentioned, and one from SRE to product development teams, to influence the product roadmap and make sure reliability and customer experience are top concerns in work planning sessions.

The following table summarizes the most common planning documents I've seen SRE teams share and agree with their development counterparts [3]. Depending on the degree of structure needed and the stage of the SRE engagement, you may want to use one or more of these documents.

Type of document	Time horizon	Frequency of production/review
Team Charter / Strategy	years	yearly or less frequently
Team Roadmap	months (small multiple of the planning cycle - e.g., a few quarters)	once or twice a year
Team planning / backlog	weeks to months	once or twice per planning cycle (e.g., quarter or semester)

A team charter is necessary to establish the basics of the SRE engagement and a high-level team strategy. The other documents may appear as the engagement matures and needs more direction and structure. However, writing documents is not sufficient. You need to make sure that the contents are reviewed with the product development team and that both teams agree on the direction and the specifics. Bidirectional communication and influence are vital.

Anti-pattern: SRE roadmap drifts from product development team roadmap. Over time, SRE teams will grow confident in their operation and may - in good faith! - pursue objectives that are increasingly disconnected from the business goals.

For example, after successfully delivering projects to reduce manual operational work (toil), it can be tempting to bring things to the next level and embark on a longer project that will further reduce toil. This kind of project may have a much lower return on investment.

Expertise matters

This is something I learned very recently. And it goes very much against who I am: I am a generalist. I am not an expert in any of the areas I have worked on. I have shipped production code in multiple languages (C++, C#, etc.) but I can't call myself an expert in those languages. My approach is to have solid fundamentals and approach any problem as something I can learn how to solve.

While this can work in large organizations, or when there are very loose time constraints, it is not desirable in smaller contexts when it's important to deliver results reasonably quickly.

What works best is to seed teams with a few “T-shaped” experts: people with broad general knowledge as well as deep vertical knowledge about the technology stack that the SRE team will work with. They will speak the same language as the product development team and help establish credibility, in addition to multiplying the impact of the rest of the team by helping them learn the stack much more quickly than they would alone. If the SRE team doesn't have temporal or geographic proximity with the development team, then engineers with relevant deep expertise will be even more crucial.

This is less important if there is a homogeneous production interface or toolset in the company - meaning that lots of knowledge is transferable - but this is not always the case.

You cannot declare "SRE"

How do you tell whether you are "doing SRE"? Just saying that you have successfully implemented SRE doesn't mean it is true. Introducing SRE is a cultural change, and technology is only one part of the equation. As mentioned before, dedicated SRE organizations are complex and expensive. It takes time to implement a transition to SRE, both for a team and - on a larger scale, of course - for a division or a company.

How can we, then, judge the success of an SRE transition?

Following from the first lesson, success can be measured as progress towards the stated goals. Did you improve (or set up!) the indicators you wanted to influence? Can you tell whether customers are receiving a more satisfactory level of service? Is the production posture of the services better?

Of course, not everything can be measured. There are some meta-indicators that I use to understand the health of the SRE transition and that can be considered proxies for SRE success:

How is the product dev team interacting with the SRE team? Look for shared work, code reviews and design reviews across teams, meetings organized without managerial pressure.
What type of projects is the SRE team executing? Does the SRE team mostly do operational work, or are they undertaking increasingly impactful projects, moving up as appropriate in the Dickerson Service Reliability Hierarchy [4]?
In the case of larger SRE orgs, is there an SRE community, or just a set of SRE teams? The value of an SRE organization is greater than the sum of its parts: look for (and encourage) cross-team collaboration in any shape or form.

Build Trust

SRE relies heavily on shared ownership. Trust between the SRE team and the product development team is vital.

At the executive level, trust is necessary to gain support, sponsorship and funding. If you are starting an SRE initiative, most likely you were trusted with funds to do so. At the executive level, trust can best be maintained by setting and delivering on business goals. As a corollary, trust can be lost by diverging from business goals and delivering projects whose importance is not understood by product development execs. This is why it is crucial to make the SRE impact obvious and to communicate it in a way that can be broadly understood. If that is difficult, it may be time to revise the SRE strategy or roadmap.

At the level of senior individual contributors and management, SREs need to establish relationships with their peers in the product dev organization. This lets SREs gauge the product developers' level of understanding of SRE and establish common conceptual ground. Managing expectations and delivering impactful, concrete and meaningful projects are key to establishing trust with these stakeholders.

At the individual contributor level, SREs gain trust by taking the time to understand the product and its stack. Be curious as to why things are the way they are and suspend judgement when you ask questions. Improvements that may seem obvious to you are clearly not obvious to others, and there will be reasons why things are the way they are. Start small and increase the scope of your ideas over time.

In general, I see two patterns for establishing trust:

build alignment and exercise SRE advocacy, through continuous bidirectional feedback loops.
deliver complete, incrementally more impactful projects over time.

Conclusion

There are lots of factors to consider when starting an SRE team or organization, and this article is by no means exhaustive.

The basics are having clear motivations and goals, aligning with business needs, having the right kind of expertise to successfully bootstrap the team, paying attention to the cultural aspects of SRE and, last but not least, building trust at all levels in the organization.

In the second article in this series, we will cover 5 more items that I think every SRE team should take into account in order to sustainably deliver high-impact work.

References

[1] Matt Brown and Gustavo Franco, 'How SRE teams are organized, and how to get started', Google Cloud Blog. A useful introduction to SRE team types and topologies. See https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started

[2] Andrea Spadaccini, '10 Lessons Learned in 10 Years of SRE', USENIX SREcon21. See https://www.usenix.org/conference/srecon21/presentation/spadaccini

[3] Shylaja Nukala and Vivek Rau, 'Why SRE Documents Matter', ACM Queue 16:4 (October 2018). See https://queue.acm.org/detail.cfm?id=3283589

[4] Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Murphy, 'Part III: Practices' in Site Reliability Engineering: How Google Runs Production Systems (O'Reilly, 2016). See https://sre.google/sre-book/part-III-practices/#fig_part-practices_reliability-hierarchy

Last updated February 8, 2023

Authors:

Andrea is a Principal Software Engineer in SRE at Microsoft Azure, where he currently acts as a tech lead for all the Azure SRE teams. He works on cross-team projects (currently, SLOs for all Azure products), while also focusing on SRE for Azure SQL products. He joined Microsoft in 2018. Before that, he worked as a Site Reliability Engineer for Google from 2011, in various technical and management roles across SRE teams in CorpEng, Ads, and Google Cloud Platform. He's been lucky enough to contribute to the first and second SRE books, mostly to the chapters about on-call. He received his Ph.D. in Computer Engineering from the University of Catania in 2012, with a thesis on novel traits for biometric recognition. Andrea is the maintainer of the free CPU simulator EduMIPS64. Twitter: @lupino3. GitHub: @lupino3

andrea.spadaccini@microsoft.com