What is Site Reliability Engineering [Simple Intro to SRE]

Wondering what SRE is all about? We will explain what it is, how it works, why it was developed, and how it can help your organization.

So what is SRE (Site Reliability Engineering)? SRE is a methodology that fuses software and operations teams, with the goal of producing reliable, resilient, and scalable systems.

The Basics of Site Reliability Engineering

Site Reliability Engineering (SRE) was developed by Google engineer Ben Treynor Sloss in 2003. Google’s goal was to increase the reliability of its sites and services. SRE accomplishes this by integrating software development and engineering best practices into the infrastructure and operation of services.

SRE is often considered alongside the methodology of DevOps. When comparing SRE vs. DevOps, you’ll find that they have the same end goals. Both try to align development and operations with customer satisfaction.

However, their methods for accomplishing these goals differ. Where DevOps is focused on uniting development and operations to drive business value, SRE is focused on the process by which these goals are reached. Neither SRE nor DevOps is better than the other; instead, both methodologies work together.

Principles of SRE

The core principles of SRE include:

  • Failure is inevitable, but you can always learn from it
  • Effort spent improving reliability past the point where customers would notice is effort that could be better spent elsewhere
  • Where possible, automate to remove toil
  • Deal with incidents blamelessly: work together to find issues in the system that lead to mistakes

SRE as an implementation of DevOps principles

SRE can be a method of implementing DevOps goals. Here are some common pillars of DevOps success according to Google, and how SRE helps achieve them:

  • Reduce silos of information: SRE achieves this through creating accessible documentation. Learning is fed back into the development cycle.
  • Accept failure as normal: SRE prioritizes developing and operating based on meeting customers’ reliability needs, rather than striving for 100% reliability.
  • Implement gradual change: Transformations cannot happen overnight. Instead, you’ll need to make small changes and iterate to reach your goals.
  • Leverage tooling and automation: SRE achieves this with a commitment to automate where possible and a suite of helpful tools.
  • Measure everything: SRE encourages pushing past shallow metrics and looking at deep metrics that reflect meaningful things about your system.

The typical SRE process

The typical SRE process is less a fixed procedure than a set of best practices. These best practices include:

  • Monitoring data to detect incidents: Use monitoring tools to gather information from the system on how it’s performing. Then, use automated alerts when an abnormality is detected.
  • Responding to incidents using tools like runbooks: When something goes wrong, have plans in place to deal with it effectively and make the information available to all teammates.
  • Creating incident retrospectives to learn from incidents: Every time you experience a failure, document the process and information around it. This allows you to learn as much as possible.
  • Translating incidents into customer impact with SLIs and SLOs: By looking at how customers use your service, you can measure how impactful incidents really are. This allows you to triage your responses in the best way possible.
  • Developing based on the error budget available: Moderate how quickly you move forward with development based on analyzing the risk of breaching the SLO. Use error budgets to make the best decision for your customers.

Ideally, these practices reinforce the principles of SRE and help the humans operating your systems learn more, grow, and feel supported.
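To make the first practice more concrete, here is a minimal Python sketch of an automated alert that fires when the error rate over the most recent requests crosses a threshold. The class name, window size, and threshold are hypothetical choices for illustration; real monitoring tools offer far richer alerting rules.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int, threshold: float) -> None:
        # Hypothetical design: keep only the most recent outcomes (True = failed).
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, failed: bool) -> bool:
        """Record one request outcome; return True if an alert should fire."""
        self.results.append(failed)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data yet to judge
        error_rate = sum(self.results) / len(self.results)
        return error_rate > self.threshold


# Assumed values: look at the last 5 requests, alert above a 20% error rate.
alert = ErrorRateAlert(window=5, threshold=0.2)
outcomes = [False, False, True, False, True, True]
fired = [alert.record(failed) for failed in outcomes]
print(fired)
```

The alert stays quiet until the window is full, then fires once two or more of the last five requests have failed.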

Why should I adopt SRE?

There are many benefits to adopting SRE within your organization. We’ve highlighted some of them here.

Better clarity into customer needs

The primary goal of SRE is to improve the reliability of your service. Yet reliability is a subjective term based on how your customers perceive your service. 

To understand the level of reliability your customers expect, use SLIs and SLOs. SLIs, or service level indicators, measure the performance of key metrics such as latency, availability, and more for critical points on a user journey.
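As a rough illustration, an SLI is often expressed as the fraction of "good" events out of all events. The Python sketch below computes an availability SLI and a latency SLI from a handful of request records. The `Request` type, the sample data, and the 300 ms threshold are assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    succeeded: bool    # True if the request returned a non-error response
    latency_ms: float  # time taken to serve the request, in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that succeeded."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for r in requests if r.succeeded)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests served at or below the latency threshold."""
    if not requests:
        return 1.0  # same assumption as above
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


sample = [
    Request(True, 120.0),
    Request(True, 480.0),
    Request(False, 950.0),
    Request(True, 90.0),
]
print(availability_sli(sample))    # 3 of 4 requests succeeded
print(latency_sli(sample, 300.0))  # 2 of 4 requests took 300 ms or less
```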

SLOs, or service level objectives, set a goal for reliability. This goal should mark the point where a service becomes unreliable enough that customers are pained. SLOs should be stricter than SLAs, or service level agreements, which are legal guarantees that your service will have some level of reliability. Because a stricter SLO is breached first, SLOs can act as a safeguard against breaching SLAs.

With SLOs, you can also understand when it’s time to focus on feature No. 1: reliability. Error budgets are an advanced practice for operationalizing SLOs. An error budget shows how much unreliability you can experience before the SLO is breached.

If you’re running out of error budget, or exceeding your allotment, it’s time to prioritize reliability improvements. By using error budgets and SLOs, teams can make better decisions about the most important projects to spend resources on with a strong focus on customer happiness.
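The arithmetic behind an error budget is simple enough to sketch. Assuming a time-based availability SLO, the budget is the share of the window in which the service is allowed to be unreliable. The function names, the 99.9% target, and the 30-day window below are illustrative assumptions.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total minutes of unreliability the SLO allows within the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, window_days: int, bad_minutes: float) -> float:
    """Fraction of the error budget left; a negative value means the SLO is breached."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget


# Assumed example: a 99.9% availability SLO measured over 30 days.
budget = error_budget_minutes(0.999, 30)
print(round(budget, 1))                               # about 43.2 minutes
print(round(budget_remaining(0.999, 30, 10.8), 2))    # about 0.75 of the budget left
print(round(budget_remaining(0.999, 30, 60.0), 2))    # negative: budget exceeded
```

A 99.9% target over 30 days leaves roughly 43 minutes of allowed unreliability, which is why even short incidents can consume a large share of the budget.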

Improved development velocity

SRE also helps you improve development velocity. While conversations about error budgets sometimes shift resources towards reliability improvements, they can also shift resources to development efforts. When the error budget has room, engineers are encouraged to take risks and plunge into new development.

The expectation is that, if the reliability is sufficient, teams can work to delight customers by providing new features.

SRE also helps teams work faster by tightening feedback loops. A key SRE principle is conducting an incident retrospective after each incident. These documents summarize what the team learned from the incident and what actions need to happen going forward. By encapsulating this learning and reviewing it, the lessons learned can stimulate future development.

Additionally, action items from the retrospective can be added into future sprints. Teams can work with product to ensure that these improvements are prioritized according to their importance. 

Improved incident response

Incidents are inevitable. No service can ever achieve 100% uptime. But SRE can help you respond quickly and recover from incidents, minimizing downtime. The SRE process includes many tools and procedures to respond to incidents:

  • Monitoring tools detect incidents quickly: The sooner an incident is detected within the system, the sooner an alert can be sent out to begin the response. Make sure your monitoring tools are able to capture any areas of abnormality.
  • Alerting tools get the right people involved: When monitoring (or customer support) detects that an incident is occurring, on-call engineers will be alerted to begin responding. Balancing your on-call schedules is essential to avoid burnout.
  • Runbooks guide responders through incidents: These documents lay out things to check and how to respond to categories of incidents step by step. Continuously reviewing and improving your runbooks based on how they performed helps them stay useful.
  • Role-based collaboration tools get everyone on the same page: Designate roles and responsibilities for each responder. Use shared checklists to keep everyone looped in on the team’s progress within a centralized communication channel.
  • Incident retrospectives show areas of improvement: After each incident, prepare a retrospective document with the steps taken, the resources used, and other contextual information. Review these on a regular basis to allow you to apply their learning to future incidents.

Automation and standardization

Another principle of SRE is to automate wherever possible. Automating helps the humans who build, operate, and maintain the systems focus on the tasks that are critical rather than repetitive work. By reducing toil, time and energy can be spent on other projects that bring business value.

Another key component of SRE is standardization. SRE advocates creating documentation, processes, and runbooks for common tasks. This also increases consistency and reduces toil. Standardization exists in the incident response process with codified roles and responsibilities. It also exists within the retrospectives process with custom questions, sections, and analysis. Finally, standardization is an important part of establishing error budget policies that go into effect when an error budget hits a certain level.

As you can see, throughout all SRE best practices, standardization plays an important role in making sure teams are able to learn, grow, and make the best decisions possible.
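One way to picture an error budget policy is as an agreed mapping from the budget remaining to a team response. The Python sketch below is a hypothetical policy: the 25% threshold and the actions are invented for illustration, and each team would agree on its own with product and engineering stakeholders.

```python
def policy_action(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining to an agreed team response.

    The thresholds and actions here are hypothetical examples only.
    """
    if budget_remaining <= 0.0:
        return "freeze feature releases; work only on reliability"
    if budget_remaining < 0.25:
        return "prioritize reliability work in the next sprint"
    return "ship features as planned"


for remaining in (0.80, 0.10, -0.05):
    print(f"{remaining:+.2f} -> {policy_action(remaining)}")
```

Writing the policy down ahead of time means the decision is already made when the budget runs low, rather than negotiated in the middle of an incident.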

Onboarding

SRE can also help you onboard new employees. The principles of SRE make information accessible across your organization. Incident retrospective libraries and runbooks will get newcomers up to speed.

Additionally, being on call can be daunting when you’re responsible for a service you don’t have a lot of experience with. By using SRE best practices for monitoring and alerting, runbooks, and incident response, you can relieve much of the pressure of being on call.

New employees will also feel empowered to explore and work on projects. The error budget helps them analyze the risk of changes they make. It acts as a safeguard against negatively impacting customer satisfaction. 

A culture of learning and growth

Perhaps the most important aspect of SRE is the cultural lessons it imparts. SRE best practices promote a psychologically safe environment. Failure is celebrated, and teammates are encouraged to raise issues. Incidents are addressed blamelessly. Instead of looking for who is responsible, everyone works together to find the systemic causes.

SRE views incidents as unplanned investments in reliability. With this perspective, you’ll be encouraged to learn as much as possible each time. Knowledge becomes less siloed, flowing through the entirety of the development process. Everyone learns something from everyone.

Now that you’ve seen the benefits of SRE, let’s discuss how you can get started.

How do I integrate SRE into my organization?

SRE can be integrated into many existing operational models. As noted above, SRE aligns with DevOps. It also aligns with ITIL principles. ITIL likewise focuses on aligning the goals of IT with your organization’s business goals. As with DevOps, SRE can be implemented as a way to put ITIL principles into practice. ITIL principles also focus on tasks such as automating and optimizing, collaboration without silos, and continuous learning.

The process of adopting SRE will vary based on your organizational needs. While some practices like SLOs and error budgets are natural next steps for more advanced teams, there is some low-hanging fruit that other teams can use to get started today:

  1. Establishing incident response best practices: This involves creating standard roles and responsibilities for responders. Make sure to collaborate broadly to keep responsibilities fair and balanced. By codifying best practices, you’ll improve consistency, reduce toil, and increase morale.
  2. Creating incident retrospectives for all incidents: Building a library of incident retrospectives will help you tackle each new incident. You can look back at what worked best and implement it. Having this history of incidents also helps people feel psychologically safe. They’ll see how past incidents were dealt with blamelessly and feel comfortable failing without fear of retribution.
  3. Writing documentation and runbooks: Create documentation for the processes you use when dealing with incidents. These resources increase incident response speed and reduce toil. Make sure to keep reviewing and revising them as needed.

Investing in SRE will require buy-in from all stakeholders. In a panel hosted by Blameless, Tony Hansmann had some advice for teams looking to get others on board. He advocates finding a small team of people who are passionate about the idea and who believe positive change is possible. By working with them on things like eliminating toil, which have clear immediate benefits, you can convince them to commit further.

After building a committed core, you need to educate others. “What we did is we taught SLOs, SLIs, and SLAs, how they're actually related, and how to have a half-hour conversation about them without getting lost in the definition of the acronym,” Tony recommends.

Keep in mind that you don’t need an SRE to have SRE. While site reliability engineers come with a vast amount of experience, teams looking to incorporate SRE into their organization can focus on the best practices. Rather than having an SRE or SRE team, everyone becomes an SRE. Björn Rabenstein calls this “SRE in the Third Age.”

That being said, if you’re looking to build out an SRE team in your organization or make the career move to SRE, here are some of the most common roles you will fill.

What are some roles of an SRE?

SRE is a multifaceted role. SREs often focus on building tools and infrastructure rather than developing new features. SREs will split their time between development and operations as the need arises. According to Google, operations work should take up no more than 50% of an SRE’s time, and the rest should be devoted to reliability projects.
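A team that tracks how its time is spent can check itself against that 50% guideline. The sketch below assumes a simple weekly log of hours by category. The category names and hours are invented for illustration, and deciding what counts as operations work is a judgment each team makes for itself.

```python
def operations_share(hours_by_category: dict[str, float], ops_categories: set[str]) -> float:
    """Fraction of logged hours spent on operations work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0  # nothing logged, so there is nothing to flag
    ops = sum(h for category, h in hours_by_category.items() if category in ops_categories)
    return ops / total


# Hypothetical week for one SRE, in hours.
week = {"on_call": 10, "tickets": 8, "manual_deploys": 6, "automation_project": 16}
ops = {"on_call", "tickets", "manual_deploys"}  # assumed classification of operations work

share = operations_share(week, ops)
print(f"{share:.0%} operations")
if share > 0.5:
    print("Above the 50% guideline: consider automating or redistributing operations work")
```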

Sometimes, the SRE role may not code at all. SREs come from many backgrounds, occasionally transitioning into the role from a non-engineering career. Their time could be entirely spent on building and maintaining SRE processes. This could involve collaborating with teams to build runbooks, hosting incident review meetings, or conducting internal audits for reliability.

What is an SRE team and how does it operate?

SRE teams can be arranged in a variety of ways. Two of the most common structures include:

  • Dedicated engineers focused on tooling and infrastructure shared by the organization: This model has SREs building things like SLOs, runbooks, and templates. These would be used by many teams, adapted to their unique needs.
  • Embedded engineers for each team or product, maintaining reliability for that service area: This model has a small team of SREs (or a single SRE) working with each team. They would help with the specific needs of that team.

Starting your SRE journey with Blameless

Blameless can help you begin your SRE journey by making it easy to adopt and operationalize SRE practices. To find out how, check out a demo. Or, if you’re interested in more content like this, sign up for our newsletter.

About the Author
Emily Arnott
