Wondering what SRE is all about? We will explain what it is, how it works, why it was developed, and how it can help your organization.
So what is SRE (Site Reliability Engineering)? SRE is a methodology that fuses software and operations teams, with the goal of producing reliable, resilient, and scalable systems.
Site Reliability Engineering (SRE) was developed by Google engineer Ben Treynor Sloss in 2003. Google’s goal was to increase the reliability of its sites and services. SRE accomplishes this by integrating development and engineering best practices into the infrastructure and operation of services.
SRE is often considered alongside the methodology of DevOps. When comparing SRE vs. DevOps, you’ll find that they have the same end goals. Both try to align development and operations with customer satisfaction.
However, their methods for accomplishing these goals differ. Where DevOps is focused on uniting development and operations to drive business value, SRE is focused on the process by which these goals are reached. Neither SRE nor DevOps is better than the other; instead, both methodologies work together.
The core principles of SRE include:
SRE can be a method of implementing DevOps goals. Here are some common pillars of DevOps success according to Google, and how SRE helps achieve them:
The typical SRE process is more of a set of best practices. These best practices include:
Ideally, these practices enforce the principles of SRE and help the humans operating your systems learn more, grow, and feel supported.
There are many benefits to adopting SRE within your organization. We’ve highlighted some of them here.
The primary goal of SRE is to improve the reliability of your service. Yet reliability is a subjective term based on how your customers perceive your service.
To understand the level of reliability your customers expect, use SLIs and SLOs. SLIs, or service level indicators, measure the performance of key metrics such as latency and availability at critical points on a user journey.
SLOs, or service level objectives, set a goal for reliability. This goal should sit at the point where a service becomes unreliable enough that customers are pained. SLOs should be stricter than SLAs, or service level agreements, which are legal guarantees that your service will meet some level of reliability. Set this way, SLOs act as a safeguard against breaching SLAs.
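To make the relationship between these three terms concrete, here is a minimal Python sketch of an availability SLI checked against an SLO and an SLA. The function names, the request counts, and the sample targets (a 99.9% SLO and a 99.5% SLA) are assumptions chosen for illustration, not values from any particular service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    """Hypothetical reliability targets, expressed as fractions of good requests."""
    slo: float  # internal objective, e.g. 0.999
    sla: float  # contractual guarantee, e.g. 0.995

    def __post_init__(self) -> None:
        # The SLO is the tighter target so that it trips before the SLA does.
        if not (0.0 < self.sla <= self.slo <= 1.0):
            raise ValueError("expected 0 < sla <= slo <= 1")


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if total_requests < 0 or not 0 <= good_requests <= total_requests:
        raise ValueError("need 0 <= good_requests <= total_requests")
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def status(sli: float, targets: Targets) -> str:
    """Classify a measured SLI against the two targets."""
    if sli >= targets.slo:
        return "meeting SLO"
    if sli >= targets.sla:
        return "SLO missed, SLA still intact"
    return "SLA breached"


if __name__ == "__main__":
    targets = Targets(slo=0.999, sla=0.995)
    sli = availability_sli(good_requests=99_850, total_requests=100_000)
    print(f"SLI = {sli:.4f}: {status(sli, targets)}")
```

In this example the measured SLI falls between the two targets, which is the window where the team gets a warning before any contractual guarantee is at risk.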
With SLOs, you can also understand when it’s time to focus on feature No. 1: reliability. Error budgets are an advanced practice for operationalizing SLOs. Error budgets show how much unreliability you can experience before the SLO is breached.
If you’re running out of error budget, or exceeding your allotment, it’s time to prioritize reliability improvements. By using error budgets and SLOs, teams can make better decisions about the most important projects to spend resources on with a strong focus on customer happiness.
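The error budget arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the budget is counted in failed events over a fixed window, and the 25% threshold for shifting focus to reliability is an arbitrary example rather than a recommended value.

```python
def error_budget(slo: float, total_events: int) -> float:
    """Number of bad events the window can absorb before the SLO is breached."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return (1.0 - slo) * total_events


def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent; negative means it is exceeded."""
    if bad_events < 0:
        raise ValueError("bad_events must be non-negative")
    return 1.0 - bad_events / error_budget(slo, total_events)


def should_prioritize_reliability(remaining: float, threshold: float = 0.25) -> bool:
    """True when the remaining budget is at or below the (assumed) threshold."""
    return remaining <= threshold


if __name__ == "__main__":
    slo, total = 0.999, 1_000_000  # assumed: 99.9% SLO over one million requests
    for bad in (250, 900, 1_200):
        remaining = budget_remaining(slo, total, bad)
        focus = "reliability work" if should_prioritize_reliability(remaining) else "budget has room"
        print(f"{bad} bad requests: {remaining:.0%} of budget left, {focus}")
```

The same arithmetic works for time-based budgets: a 99.9% SLO over a 30-day window leaves roughly 43.2 minutes of allowable downtime, since 0.1% of 43,200 minutes is 43.2.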
SRE also helps you improve development velocity. While conversations about error budgets sometimes shift resources towards reliability improvements, they can also shift resources to development efforts. When the error budget has room, engineers are encouraged to take risks and plunge into new development.
The expectation is that, if the reliability is sufficient, teams can work to delight customers by providing new features.
SRE also helps teams work faster by tightening feedback loops. A key SRE principle is conducting an incident retrospective after each incident. These documents summarize what the team learned from the incident and what actions need to happen going forward. By encapsulating this learning and reviewing it, the lessons learned can stimulate future development.
Additionally, action items from the retrospective can be added into future sprints. Teams can work with product to ensure that these improvements are prioritized according to their importance.
Incidents are inevitable. No service can ever achieve 100% uptime. But SRE can help you respond quickly and recover from incidents, minimizing downtime. The SRE process includes many tools and procedures to respond to incidents:
Another principle of SRE is to automate wherever possible. Automating helps the humans who build, operate, and maintain the systems focus on the tasks that are critical rather than repetitive work. By reducing toil, time and energy can be spent on other projects that bring business value.
Another key component of SRE is standardization. SRE advocates creating documentation, processes, and runbooks for common tasks. This also increases consistency and reduces toil. Standardization exists in the incident response process with codified roles and responsibilities. It also exists within the retrospectives process with custom questions, sections, and analysis. It’s an important part of establishing error budget policies that go into effect when an error budget hits a certain level.
As you can see, throughout all SRE best practices, standardization plays an important role in making sure teams are able to learn, grow, and make the best decisions possible.
SRE can also help you onboard new employees. The principles of SRE make information accessible across your organization. Incident retrospective libraries and runbooks will get newcomers up to speed.
Additionally, being on call can be daunting when you’re responsible for a service you don’t have a lot of experience with. By using SRE best practices for monitoring and alerting, runbooks, and incident response, you can relieve much of the pressure of being on call.
New employees will also feel empowered to explore and work on projects. The error budget helps them analyze the risk of changes they make. It acts as a safeguard against negatively impacting customer satisfaction.
Perhaps the most important aspect of SRE is the cultural lessons it imparts. SRE best practices promote a psychologically safe environment. Failure is celebrated, and teammates are encouraged to raise issues. Incidents are addressed blamelessly. Instead of looking for who is responsible, everyone works together to find the systemic cause.
SRE views incidents as unplanned investments in reliability. With this perspective, you’ll be encouraged to learn as much as possible each time. Knowledge becomes less siloed, flowing through the entirety of the development process. Everyone learns something from everyone.
Now that you’ve seen the benefits of SRE, let’s discuss how you can get started.
SRE can be integrated into many existing operational models. As noted above, SRE aligns with DevOps. It also aligns with ITIL, which likewise focuses on aligning the goals of IT with your organization’s business goals. As with DevOps, SRE can be implemented as a way to put ITIL principles into practice. Those principles include automating and optimizing, collaborating without silos, and learning continuously.
The process of adopting SRE will vary based on your organizational needs. While some practices like SLOs and error budgets are natural next steps for more advanced teams, there is some low-hanging fruit other teams can use to get started today.
Investing in SRE will require buy-in from all stakeholders. In a panel hosted by Blameless, Tony Hansmann had some advice for teams looking to get others on board. He advocates finding a small team of people who are passionate about the idea and who believe positive change is possible. By working with them on things like eliminating toil, which have clear immediate benefits, you can convince them to commit further.
After building a committed core, you need to educate others. “What we did is we taught SLOs, SLIs, and SLAs, how they're actually related, and how to have a half-hour conversation about them without getting lost in the definition of the acronym,” Tony recommends.
Keep in mind that you don’t need an SRE to have SRE. While site reliability engineers come with a vast amount of experience, teams looking to incorporate SRE into their organization can focus on the best practices. Rather than having an SRE or SRE team, everyone becomes an SRE. Björn Rabenstein calls this “SRE in the Third Age.”
That being said, if you’re looking to build out an SRE team in your organization or make the career move to SRE, here are some of the most common roles you will fill.
SRE is a multifaceted role. SREs often focus on building tools and infrastructure rather than developing new features. SREs will split their time between development and operations as the need arises. According to Google, operations work should take up no more than 50% of an SRE’s time, with the rest devoted to reliability projects.
Sometimes, the SRE role may not code at all. SREs come from many backgrounds, occasionally transitioning into the role from a non-engineering career. Their time could be entirely spent on building and maintaining SRE processes. This could involve collaborating with teams to build runbooks, hosting incident review meetings, or conducting internal audits for reliability.
SRE teams can be arranged in a variety of ways. Two of the most common structures include: