
SRE Team Roles & Responsibilities Explained

Emily Arnott | 5.25.2021

Are you considering adopting SRE? We will explain the roles and responsibilities of an SRE team within your organization, and how to start building one.

So what does an SRE team do? An SRE team is responsible for building software that improves the resiliency of systems, implementing fixes, responding to incidents, and automating processes whenever possible.

What do site reliability engineers do?

Site reliability engineering is a holistic practice that incorporates various types of work. So, site reliability engineers play many roles within an organization. Here are some of the major roles, and the typical duties of each. An SRE will likely have responsibilities that fit into several of these categories.

SREs as developers focusing on reliability

SREs can help develop your codebase with a perspective that focuses on reliability. SRE advocates making reliability feature No. 1. When designing the specifications for a product or feature, SREs will include reliability standards. Site reliability engineers will then help keep development focused on meeting that goal. SREs can help prioritize development efforts by establishing SLOs, or service level objectives, an important reliability metric.
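To make the idea of an SLO concrete, here is a minimal sketch of how an availability target translates into permitted downtime. The function name and the 30-day window are illustrative assumptions, not part of any particular tool.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the minutes of downtime an availability SLO permits in a window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


if __name__ == "__main__":
    # Compare a few example targets over an assumed 30-day window.
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} over 30 days -> {minutes:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which gives the team a concrete number to prioritize against.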

Daily responsibilities: SREs work on the codebase like other engineers. They’ll likely focus on system architecture, as this is often the code with the highest potential to impact reliability. They’ll need to understand how changes on one layer of the system affect the reliability of the rest. As such, they’ll have to take on a big-picture perspective.

Required skills: SREs will have to be familiar with the code for the entire stack of your system. They should be comfortable working in (or learning) any language your system uses. They should be able to follow how changes propagate through the entire system.

Potential backgrounds: SREs focusing on this role will often have a computer science background. A system engineering background in particular can be helpful. Yet, as with all the SRE roles, many people are excellent fits so long as they have a growth mindset.

SREs as operators focusing on optimization

SREs can also work within the operations team. Here, their responsibilities would focus on optimizing the operations process. Here are some ways they achieve this:

  • Automating processes such as updates or tests: By making common processes automatic, SREs save time. This helps teams remove toil and allows them to focus on higher-value work.
  • Creating runbooks and other guides: Runbooks are documents that walk through a specific task. They’re commonly used to respond to incidents, minimizing downtime. They are also helpful with guiding engineers through processes in a codified manner.
  • Aligning operations goals with business value: Operations must choose how to allocate optimization efforts. SREs can craft user journeys to make sure their efforts reflect what customers care about the most.
  • Refining incident response: SREs implement runbooks, on-call policies, alerting tools, and other facets of incident response. They also ensure learning passes on to future incidents.

Daily responsibilities: SREs in operations will focus on contextualizing individual incidents in larger patterns, using them to make further optimizations. They’ll work on coding automation into processes. Sometimes this will involve using external automation tools that interact with the system.
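One way to picture "coding automation into processes" is to express a runbook as an ordered list of executable steps. The sketch below is a hypothetical illustration: the `Step` and `Runbook` classes, and the example steps, are assumptions made for this example rather than the API of any real runbook tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # returns True when the step succeeds


@dataclass
class Runbook:
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self) -> List[str]:
        """Execute steps in order, stop at the first failure, and return a log."""
        log: List[str] = []
        for number, step in enumerate(self.steps, start=1):
            succeeded = step.action()
            status = "ok" if succeeded else "FAILED"
            log.append(f"{number}. {step.description}: {status}")
            if not succeeded:
                log.append("Stopping: escalate to the on-call engineer.")
                break
        return log


if __name__ == "__main__":
    # Hypothetical steps; real actions would call your own tooling.
    runbook = Runbook(
        name="Restart stuck worker",
        steps=[
            Step("Confirm the alert is still firing", lambda: True),
            Step("Restart the worker process", lambda: True),
            Step("Verify the queue is draining", lambda: True),
        ],
    )
    for line in runbook.run():
        print(line)
```

Keeping each step small and returning a log makes the same runbook usable both by an engineer following along and by an automated responder.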

Required skills: This role requires familiarity with your organization’s code stack. They’ll need to understand the contributing factors of incidents. Experience with tools for alerting, automation, and runbooks is also an asset.

Potential backgrounds: Engineers in other operations roles can transition into this SRE role. Security, network operation, QA, and other engineers are also well-suited for this role.

SREs as caretakers of reliability data

The SRE methodology involves making data-driven decisions. This role focuses on gathering data and transforming it into something actionable. Monitoring tools can help gather data, and SREs can develop it into deeper metrics. SREs also ensure that data is available throughout the organization.

Daily responsibilities: SREs work to set up and adjust monitoring as needed. They gather and present monitoring data when decisions are being made. This ensures that reliability isn’t overlooked.
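As a rough illustration of turning raw monitoring data into a deeper metric, the sketch below computes a simple service level indicator (SLI) from request records. The record shape (status code, latency in milliseconds) and the 300 ms threshold are assumptions chosen for the example; what counts as a "good" request should reflect what your customers care about.

```python
from typing import Iterable, Tuple


def request_sli(
    requests: Iterable[Tuple[int, float]],
    latency_threshold_ms: float = 300.0,
) -> float:
    """Return the fraction of 'good' requests.

    A request counts as good here if it did not return a 5xx status and
    completed within the latency threshold (both criteria are assumptions).
    """
    total = 0
    good = 0
    for status_code, latency_ms in requests:
        total += 1
        if status_code < 500 and latency_ms <= latency_threshold_ms:
            good += 1
    if total == 0:
        raise ValueError("no requests to measure")
    return good / total


if __name__ == "__main__":
    # Hypothetical sample of (status code, latency in ms) pairs.
    sample = [(200, 120.0), (200, 450.0), (503, 80.0), (200, 90.0)]
    print(f"SLI: {request_sli(sample):.2%}")
```

A single ratio like this is easier to present when decisions are being made than a dashboard of raw counters.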

Required skills: Knowing what to monitor is as important as knowing what not to. SREs will need to weigh the investment in establishing monitoring for services. A discerning eye into what matters most to the customer is key. SREs will also need to convey the importance of this data to others.

Potential backgrounds: SREs in this role may have started in data science. They might have some experience with statistics and systems analysis.

SREs as developers of infrastructure and tooling

There are many tools that SREs find handy in their tool belts. Implementing these tools isn’t always easy. They may need to be customized for your organization’s particular needs. Your organization may even build new tools in-house. An SRE in this role would focus on building and implementing these tools. They would also work on documentation, including runbooks, procedures, policies, and templates. These projects would be collaborative, drawing expertise from many teams. The SRE would combine these sources and share the information.

Daily responsibilities: These SREs would focus on developing tools and infrastructure. This would involve collaborating with other teams. They would also keep the infrastructure up-to-date. They would organize review meetings to learn what works and doesn’t, then make updates.

Required skills: Being able to code and implement tools is essential for this role. This will likely require familiarity with your whole stack. Strong interpersonal skills are also a must to bring together people’s requirements.

Potential backgrounds: SREs focusing on tooling will likely have some software background. Writing policies and procedures may not require specific coding knowledge, but it helps to be able to connect steps in a runbook to the codebase.

SREs as leaders aligning development goals with business goals

Through creating SLIs and SLOs, SREs convert customer happiness into actionable metrics. They can focus on steering development decisions to align with these metrics.

Daily responsibilities: SREs confer with management about the direction of the engineering organization. The SRE would gather data from a variety of sources and contextualize it with SLOs.
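One common way to contextualize data with SLOs is an error budget: the gap between perfect reliability and the SLO target. The sketch below shows how a measured SLI might be turned into a recommendation. The function names, the 25% caution threshold, and the wording of the recommendations are illustrative assumptions; each organization would set its own policy.

```python
def error_budget_remaining(sli: float, slo: float) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means nothing has been spent; 0.0 or below means the SLO is breached.
    """
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - (1.0 - sli) / budget


def recommend(sli: float, slo: float, caution_threshold: float = 0.25) -> str:
    """Map the remaining budget to a suggested priority (assumed policy)."""
    remaining = error_budget_remaining(sli, slo)
    if remaining <= 0.0:
        return "prioritize reliability work"
    if remaining < caution_threshold:
        return "slow down risky releases"
    return "ship features"


if __name__ == "__main__":
    # Hypothetical measurements against a 99.9% SLO.
    for measured in (0.9999, 0.9992, 0.998):
        print(f"SLI {measured:.2%}: {recommend(measured, slo=0.999)}")
```

Framing the conversation this way lets the SRE present one number to management while still representing both development and operations perspectives.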

Required skills: Familiarity with ingesting and interpreting data will help make suggestions meaningful. Making persuasive arguments while representing many perspectives is a core element.

Potential backgrounds: This role involves more soft skills than specific programming knowledge. SREs in this role can come from a variety of backgrounds.

SREs as ambassadors of reliability culture

A major part of SRE is the cultural learning it provides. Some of the major cultural lessons of SRE include:

  • Failure is inevitable, and an opportunity to grow and learn.
  • People should feel psychologically safe to raise issues. Performance is not dependent on any shallow metric.
  • When addressing an issue, work blamelessly. Assume the best intentions from everyone, and look for systemic issues.

SREs are responsible for sharing these values across the organization.

Daily responsibilities: An SRE in this role might guide meetings. They might be in charge of leading incident retrospectives to learn from failure. They can also codify these ideas into policy.

Required skills: This SRE role may not require coding knowledge. Instead, the SRE needs very strong interpersonal skills and emotional intelligence. The SRE must empathize with and accommodate the needs of their peers.

Potential backgrounds: This role may not require an engineering background either. SREs focusing on this role can come from a wide variety of backgrounds.

How do you build a site reliability engineering team?

Now that you understand the roles an SRE can take on, let’s look at how these roles exist within teams. SRE teams are similar to DevOps teams. Both align the goals of development and operations with business needs. The major difference between SRE and DevOps teams is that DevOps teams focus on achieving goals, whereas SRE teams focus on the processes behind achieving goals.

Starting your SRE team

The SRE team can be structured in many different ways. If you have a smaller team, you might not have people dedicated full-time to the SRE role. Instead, you might start assigning SRE practices to other engineers. Everyone becomes an SRE, in a way.

Other times, you'll bring on an entire team of SREs. This is more common for larger engineering organizations.

Hiring vs promoting SREs

Promoting people into the SRE role or hiring SREs are both viable options. There is no one correct choice. Promoted SREs will have more familiarity with your specific systems. Yet, SREs with experience at other organizations can bring new practices with them.

When hiring SREs, look beyond their technical qualifications. New teammates can learn the specifics of your organization’s tech stack. What’s more important is their perspective on the principles of SRE.

The SRE salary is generally close to what you’re paying your development engineers. Payscale lists the average base salary of an SRE in the United States as $117,717 per year.

Managing the SRE team

There is no one right way to manage your SRE team. Any conventions that your organization uses for other teams, such as daily standups, can work with the SRE team. Make sure to continually review the effectiveness of these practices. SREs have a wide variety of duties, so some management practices may be more or less effective depending on the composition of your team.

The SRE mentality is to create policy and procedure through holistic collaboration. It is not the duty of management to prescribe these for the SRE team. Instead, managing the team involves long-term strategic prioritization. This could involve deciding between investing in a new tool or spending that time on revising runbooks. These decisions should be collaborative and based on customer impact.

SRE team structures

How you divide responsibilities depends on how the SRE team fits within the organization. Here are two common team structures:

Centralized SRE team: This structure has the SREs as a central unit. It provides resources for the entire organization. SREs fill the roles of reliability caretakers, developers of infrastructure, alignment leaders, and cultural ambassadors. Every team in the organization would adopt the resources provided by the SRE team.

Embedded SRE team: This structure has the SREs deployed to a team or product. They would focus on establishing reliability standards, monitoring, SLOs, and more. They would also train the team on SRE best practices. Once the team is confident in their abilities, the SRE may embed within a new team.

Whichever model you use, know that SREs are important culture developers. Wherever they go, they'll be sharing SRE culture and SRE principles. This will help up-level the entire organization's reliability.

Building SRE teams with Blameless

Blameless can help you build your SRE team by providing easy-to-use tools and processes. By building a strong foundation of SRE practices, you’re better equipped to scale reliably. To find out how it works, check out a demo. Or, if you’re interested in more content like this, sign up for our newsletter below.
