As organizations progress in their reliability journey, they may build a dedicated team of site reliability engineers. This team can be structured in two major ways: a distributed model, where SREs are embedded in each project team, providing guidance and support for that team; and a centralized model, where one team provides infrastructure and processes for the entire organization. Most structures will be some combination of these ideas, with some SREs focusing on specific projects and other SRE projects completed as an SRE team.
When looking at centralized models of SRE teams, there are further distinctions to make based on the role of each SRE. One perspective says every SRE should be a generalist, capable of performing every duty of the role. This has the advantage of being very robust - if each SRE can do any given job, any person’s absence won’t cause an issue. On the other hand, you could run into a “jack of all trades, master of none” issue, where your potential is limited. This is where the specialization perspective can help.
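In this blog post, we’ll look at why you might want SREs to specialize, how to set up a team of specialists, and some of the specialist roles you can have.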
The SRE role is extremely diverse. An SRE may be tasked with contributing to the code base of the service, writing policies and procedures for development practices, spreading cultural values, and everything in between. Even if tasks aren’t in the official SRE job description, SREs are often the ones who pick up glue work. This is work that isn’t technically anyone’s job, but is necessary for work to proceed - gluing everyone else’s efforts together.
Finding one person who’s an expert in and enthusiastic about all these different responsibilities is difficult. The roles an SRE can take on often appear to be polar opposites. Sometimes an SRE is the reliability guardian, reining in teams to make sure they don’t breach SLOs. At other times, an SRE is the champion of failure, encouraging teams to take risks as long as they’re ready to learn from them.
Fortunately, SREs are well positioned to be able to specialize without losing the big picture perspective. They ultimately need to contextualize everything happening in development or operations in a way that has significance across the organization. That means that even if they focus on one aspect of the SRE job, they’ll be performing those duties aligned with the entire team.
Specializing in SRE allows people to spend more of their time and energy on their strongest areas. If you have someone who’s excellent at writing code, but not so much of a public speaker, you can have them work away at infrastructure and in-house tools, and let them skip giving a values update at all-hands. Conversely, SREs can come from not entirely technical backgrounds. If you have someone who is great at developing policy, but can’t grapple with the depths of your codebase, you can let them be a full-time educator and policy writer.
This specialization mindset can even help with hiring. When building your team, you can have roles you’re hoping to fill. You can then look for people who specialize in those roles as far back as the job posting. Just remember that people will always grow and change. As Blameless SRE Jake Englund points out, SRE is a discipline that is “constantly inventing tools”, redefining its capabilities as a whole. Having role experts will naturally educate and encourage the rest of your team purely through what Jake describes as “osmosis”.
Having specialized SREs should be balanced with giving people the opportunity to expand their role and take on more functions. This creates a blend of people playing to their strengths while still having the bases covered if someone’s missing. Jake also emphasizes how much this can help the entire team grow. By having experts leading learning, you end up with stronger engineers who are more satisfied with their jobs.
Now that we’ve looked at how to set up a team of specialists and why you might want to, let’s look at some of the specialist roles you can have. Keep in mind that one person can serve parts of multiple roles, so don’t look for an exact 1:1 fit. Instead, look for people who can grow into these archetypes to get these benefits.
Who they are: SRE is all about building policies, processes, cultural values, and infrastructure that the whole organization can benefit from. Of course, these benefits only happen if the teams actually use them! The educator is someone who teaches and encourages teams to use SRE practices.
What they do: Educators can lead infosessions on new SRE practices to get people up to speed. They can also track how much practices are being used, and gather information on why things might go underutilized. If required, they can provide hands-on coaching to help people advance their abilities.
Skills they need: Educators need to be able to convince people to make the investment of adopting new practices. They need to be experts on the tangible benefits of adoption, able to cite specific figures where relevant. At the same time, they need to be personable and empathetic. They need to understand the pains that can come with having to switch to new practices, and convey that understanding through personal connection.
Who they are: One of the key tools in SRE is the service level objective, or SLO. SLOs set a point where the unreliability of a service is such that it starts having a negative impact on customers. Teams set up policies, like slowing development or emergency code freezes, to prevent SLO breaches. SLOs should be understood and monitored across the organization, but this role specializes in being an absolute defense against breaches.
What they do: The SLO guard makes sure that the SLO isn’t breached by building and implementing preventative policies. This isn’t the full story, though. They also need to ensure that the SLO is measuring what it needs to. This involves setting up SLO review meetings, incorporating additional monitoring tools to get more sophisticated data, and researching user expectations.
Skills they need: While discussing different SRE roles, Blameless SRE Jake Englund mentioned the value of “someone who will say no”. When everyone is enthusiastic about some new feature push, no one wants to be the dissenting voice. Telling someone that development needs to be delayed to preserve the SLO is a skill in itself, one that requires an unwavering commitment to reliability, the expertise to back up their decision, and buy-in across teams to support the plan.
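A preventative policy of the kind the SLO guard builds can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SloWindow` shape, the function names, and the 75% "slow down" threshold are all hypothetical, and a real policy would use whatever measurements and thresholds your teams agree on. The idea is to compare the failures seen so far against the failures the SLO allows, and map that ratio to one of the actions mentioned above.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO measurement window (hypothetical shape)."""
    target: float          # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int


def allowance_consumed(window: SloWindow) -> float:
    """Fraction of the failures allowed by the SLO that have already occurred."""
    allowed_failures = (1.0 - window.target) * window.total_requests
    if allowed_failures <= 0:
        # Nothing is allowed to fail (or there is no traffic yet):
        # treat any failure as having used up the whole allowance.
        return 1.0 if window.failed_requests > 0 else 0.0
    return window.failed_requests / allowed_failures


def policy_action(window: SloWindow, slow_at: float = 0.75) -> str:
    """Map the consumed allowance to an action; slow_at is an assumed threshold."""
    consumed = allowance_consumed(window)
    if consumed >= 1.0:
        return "code freeze"
    if consumed >= slow_at:
        return "slow development"
    return "proceed"
```

With a 99.9% target and 1,000,000 requests, roughly 1,000 failures are allowed, so `policy_action(SloWindow(target=0.999, total_requests=1_000_000, failed_requests=800))` returns `"slow development"`. A number like this gives the person who has to say no something concrete to point to.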
Who they are: This role is focused on building SRE infrastructure that the entire organization can use. This covers a ton of different types of project, each with its own sub-specialization: internal tools for monitoring or resolving incidents, documentation and runbooks for procedures, processes for completing projects, or even cultural values to guide people’s decisions. You might want
What they do: Infrastructure architects are in constant communication with other teams to see what’s needed most. Educators can serve as a conduit for these relationships, compiling what they hear into a big picture report. Once the priorities are clear and aligned among teams, the architect works away at building. Of course, these infrastructure meta-projects are developed along the same workflow and processes as any other project. Therefore, the architect is a sort of SRE-developer and needs to work closely with development teams.
Skills they need: The skills needed depend greatly on the type of infrastructure being developed. In some cases, this is one of the most development-focused SRE roles, and so deep knowledge of the organization’s codebase is a must. If focused more on policy and procedure, the architect may not need coding skills, but will still need to understand how their processes will work on the level of development. Either way, this is primarily a technical role, focused on engineering solutions to specific needs. In our discussion, Jake emphasized the idea of SREs existing on a range of socialness - if educators are on the social end, architects can be on the other extreme.
Who they are: Having processes in place to respond effectively and thoroughly to incidents is a major part of SRE. The incident response leader takes responsibility for making your organization as incident-ready as possible.
What they do: The incident response leader plays a role before, during, and after incidents. Before incidents, they lead in setting up runbooks, on-call schedules, and other tools to help responders. Of course, all of this is done in collaboration with the teams that will be responding. During incidents, they serve as a procedure expert who ensures teams are working effectively. If there’s disagreement over roles and responsibilities, or when to escalate, the incident response leader can serve as a point of authority to keep things moving. After the incident, the incident response leader can drive the creation of a retrospective. This document gathers the lessons of the incident and serves as a hub for follow-up tasks. The leader makes sure this document is created, reviewed, and acted on.
Skills they need: Incident response leaders need both strong people skills - to understand how people behave while panicking and to empathize with what they’re able to do in the moment - and infrastructural skills - to know how the tools they build will interact with the system. They also need a strong ability to prioritize based on the bigger picture. Their world is one where everything is on fire, and they need to distinguish quickly between a big fire and a little fire. This means having a perspective that’s zoomed out to the entire organization while still being able to see the little issues that each incident can bring.
Having a team of specialists can be a challenge, but it leads to opportunities. Of course, you may have SREs who can embody several of these specializations; they aren’t mutually exclusive. It’s often just a tradeoff in where each person invests their time. People also have their personal interests, something we can appreciate and lean into. By allowing people to flourish in their skills without losing the robustness of shared knowledge, you’ll build the strongest possible team.