Why SRE is Critical for Teams & Customers

Wondering why you should choose SRE for your organization? We will explain what it is and all the benefits it can bring to your organization.

What are the benefits of SRE?

SRE fuses software and operations teams, with the goal of producing reliable, resilient, and scalable systems. Some of the benefits of this methodology include:

  • Better team communication
  • Ongoing cultural improvement
  • Reduced toil
  • Happier customers

What is SRE?

Site Reliability Engineering (SRE) was developed at Google starting in 2003 and popularized by the company's 2016 book of the same name. It’s a collection of practices, tools, and cultural philosophies that aim to improve the reliability of your services. Here, “reliability” is a subjective measure: it reflects not just whether services are available, but how much those services matter to users. As such, SRE focuses on aligning development and operations teams around customer happiness.

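The major practices of SRE include:

  • Service level indicators and objectives (SLIs and SLOs)
  • Error budgets
  • Incident response supported by runbooks
  • Incident retrospectives

And the major cultural values include: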

  • Accepting failure as normal and adopting a blameless approach
  • Creating strong teams and relationships
  • Hiring team players and educating your hires
  • Creating a shared ownership of the product among teams
  • Balancing a resiliency-first approach with risk acceptance

As users depend more heavily on software services, reliability becomes all the more essential. Companies of all sizes are embracing SRE as a way to address this need. We’ll break down why SRE is the best route to improving customer satisfaction and team cohesion.

Goals of SRE

Getting ahead of incidents

You can never completely prevent new incidents from happening, but you can mitigate the worst effects of incidents by preparing for them. By giving you the tools to track patterns in incidents, SRE allows you to predict the most impactful or common types of incidents. Once identified, you can build resources like playbooks and run training for these types of incidents.

SRE also helps you understand the true impact of incidents. Tools like SLIs and SLOs can factor in all the aspects of your customers' experience, showing how incidents impact their typical usage of the service. This allows you to align and prioritize based on customer happiness.

Analyze and improve your DevOps process

By tracking the progress of your incident response process for every incident, you can start to identify roadblocks and bottlenecks. Are some types of incidents taking a long time to report? Are some diagnostic tools consistently failing to deliver useful results? Are fixes delayed on their way to production? SRE can highlight questions like these and start giving you answers.
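One way to surface such bottlenecks is to record a timestamp at each stage of the response and compare how long each stage takes across incidents. The following is a minimal Python sketch under stated assumptions: the stage names and the timeline format are hypothetical, chosen only for illustration, and do not reflect any particular tool's schema.

```python
from datetime import datetime
from statistics import mean

# Hypothetical stage order for an incident timeline (an assumption).
STAGES = ["started", "reported", "diagnosed", "fix_deployed"]


def stage_durations(timeline: dict[str, datetime]) -> dict[str, float]:
    """Return the minutes spent between each pair of consecutive stages."""
    durations = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        delta = timeline[later] - timeline[earlier]
        durations[f"{earlier}->{later}"] = delta.total_seconds() / 60
    return durations


def slowest_stage(timelines: list[dict[str, datetime]]) -> tuple[str, float]:
    """Return the transition with the highest average duration, in minutes."""
    per_incident = [stage_durations(t) for t in timelines]
    averages = {
        key: mean(d[key] for d in per_incident) for key in per_incident[0]
    }
    worst = max(averages, key=averages.get)
    return worst, averages[worst]
```

If the slowest transition is consistently the one between an incident starting and being reported, for instance, that points at detection and alerting rather than at diagnosis or deployment.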

Learn from every incident with incident retrospectives

On top of statistics and patterns that you can gather across incidents, SRE also allows you to dive into the unique factors of each and every incident. Incident retrospectives are documents you build for each incident that tell the story of how the incident was detected, diagnosed, and solved. These documents can serve as a resource for solving future incidents. By searching for retrospectives for similar incidents via incident tags, you can get a head start on diagnosis.
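As a rough illustration of tag-based lookup, the sketch below ranks past retrospectives by how many tags they share with a new incident. The `Retrospective` type and `find_similar` function are hypothetical names invented for this example; a real incident tool would expose its own search.

```python
from dataclasses import dataclass, field


@dataclass
class Retrospective:
    # Hypothetical record: a title plus the tags applied to the incident.
    title: str
    tags: set[str] = field(default_factory=set)


def find_similar(
    retros: list[Retrospective], incident_tags: set[str], limit: int = 3
) -> list[Retrospective]:
    """Rank retrospectives by the number of tags shared with the new incident."""
    scored = [(len(r.tags & incident_tags), r) for r in retros]
    # Drop retrospectives that share no tags at all.
    matches = [(score, r) for score, r in scored if score > 0]
    # The sort is stable, so ties keep their original order.
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in matches[:limit]]
```

Counting shared tags is the simplest possible similarity measure; it works only as well as the tagging discipline behind it.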

Keep customers happy!

The ultimate goal of SRE, and of your org as a whole, is happy customers. Understanding how to prioritize based on customer happiness can be difficult. How do you know when to step on the gas and deliver desired features ASAP, and when to slow down on development and make sure your service is reliably delivering what customers expect? Answering this question is at the core of SRE. Error budgets are a tool that can guide you to the perfect balance of velocity and reliability. SRE's focus on good incident management keeps the impact of inevitable incidents on customer happiness as low as possible.
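To make the error budget idea concrete: if an SLO allows a small fraction of events to fail, the budget is that allowance, and each failure spends part of it. The sketch below is a simplified illustration, assuming a request-based SLO; the function names and the "ship or stabilize" policy are hypothetical, not a prescribed rule.

```python
def error_budget_remaining(
    slo_target: float, total_events: int, bad_events: int
) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is the required fraction of good events, e.g. 0.999.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("SLO target leaves no error budget for this window")
    return 1.0 - bad_events / allowed_bad


def next_priority(remaining: float) -> str:
    """Hypothetical policy: ship while budget remains, otherwise stabilize."""
    return "ship features" if remaining > 0 else "improve reliability"
```

With a 99.9% SLO over one million requests, 1,000 failures are allowed; after 250 failures, roughly three quarters of the budget is left, so under this policy the team would keep shipping.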

Benefits of SRE

Aligning teams on user happiness by understanding user experiences

SRE advocates the use of service level indicators and objectives (SLIs and SLOs) to measure the health of services. These aren’t just simple availability metrics; an SLI can reflect an entire user journey. In effect, SLIs and SLOs turn how customers use your services, and what makes them happy, into a metric.
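For example, an SLI for a user journey might count a request as "good" only if it both succeeded and returned quickly enough to feel responsive. The sketch below assumes a hypothetical `Request` record and an illustrative 500 ms latency threshold; the real definition of "good" should come from what your users actually notice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    # Hypothetical record of one step in a user journey.
    succeeded: bool
    latency_ms: float


def journey_sli(
    requests: list[Request], latency_threshold_ms: float = 500.0
) -> Optional[float]:
    """Return the fraction of requests that succeeded within the threshold."""
    if not requests:
        return None  # No traffic means there is nothing to measure.
    good = sum(
        1
        for r in requests
        if r.succeeded and r.latency_ms <= latency_threshold_ms
    )
    return good / len(requests)


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is simply a target that the SLI must meet or exceed."""
    return sli >= slo_target
```

Note that a slow success counts against the SLI here, which is the point: an availability-only metric would have reported that request as healthy.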

Once you can make user happiness into a metric, you can use it to understand the real impact of decisions and incidents. The more you can understand your users’ perspectives, the more you’re able to prioritize their happiness in everything you do. SRE advocates for dynamic and iterative releases. Rather than infrequent big releases, SRE teams frequently push small updates in response to user needs.

This alignment on user happiness helps your teams work together too. It can be difficult to understand when to prioritize increasing development velocity vs when to improve your service’s reliability. SRE helps teams get on the same page, reduce silos and friction, and share learning by putting user happiness at the center of everything.

Minimizing user and on-call pain through better incident response

One of the key lessons of SRE is that failure is inevitable. You can mitigate the effects of incidents and reduce their frequency, but you can’t ever expect to eliminate them entirely. Because of that inevitability, improving incident response is a major component of SRE.

SRE takes advantage of tooling and automation to remove toil from incident response. With incident classification, you can triage based on impact to user happiness. Then, you can link automated runbooks to work through common solutions without any manual intervention. This frees up engineers to focus on more creative problem solving. Once the incident is over, incident retrospectives ensure you’ve learned everything you can.
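A minimal sketch of that flow, classification first and then runbook lookup, might look like the following. The severity thresholds, component names, and runbook names are all hypothetical and exist only to illustrate the idea.

```python
from typing import Optional

# Hypothetical mapping from affected component to an automated runbook.
RUNBOOKS = {
    "database": "restart-db-replica",
    "cdn": "purge-cdn-cache",
}


def classify(affected_user_fraction: float) -> str:
    """Assign a severity from the share of users affected (assumed thresholds)."""
    if affected_user_fraction >= 0.25:
        return "SEV1"
    if affected_user_fraction >= 0.05:
        return "SEV2"
    return "SEV3"


def triage(component: str, affected_user_fraction: float) -> dict[str, Optional[str]]:
    """Return the severity and the linked runbook, if one exists."""
    return {
        "severity": classify(affected_user_fraction),
        "runbook": RUNBOOKS.get(component),  # None means a human must decide.
    }
```

When no runbook is linked, the incident falls through to an engineer, which is where the creative problem solving mentioned above comes in.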

Improving incident response benefits your customers by reducing the downtime of services they rely on. When something critical to them fails, you’ll be able to give it the attention it deserves. These improvements also benefit your teams: reducing the manual toil of incident response lowers stress and burnout for on-call engineers.

Empowering teams through cultural and practical changes

While implementing SLOs and incident response tools provides great benefits for your teams and users, the most profound benefits of SRE emerge through cultural change. Everything SRE advocates is based on its cultural foundations. Therefore, if you can instill these cultural values, SRE best practices will develop naturally.

At the heart of the SRE cultural shift is the idea of blamelessness. When something goes wrong, rather than trying to find an individual to blame, use it as a chance to make changes that improve the system. For example, if someone accidentally pushes code to production before it’s reviewed, causing an error, don’t blame the person. Instead ask questions like:

  • What manual checks could be in place to prevent this?
  • Can the deployment process require an indicator that the code has been reviewed?
  • What communication or education was lacking that led to the engineer believing that the code could be pushed?
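The second question, for instance, can translate into an automated gate in the deployment process. The sketch below assumes a hypothetical `Change` record that carries its author and approvers; it illustrates the kind of systemic fix a blameless review can produce and is not a description of any specific deployment tool.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    # Hypothetical record of a change awaiting deployment.
    author: str
    approvers: set[str] = field(default_factory=set)


def can_deploy(change: Change) -> bool:
    """Allow deployment only if someone other than the author approved it."""
    independent_approvers = change.approvers - {change.author}
    return bool(independent_approvers)
```

A gate like this removes the opportunity for the mistake instead of relying on each engineer to remember the rule.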

By following these thought processes, you’ll find changes that improve the reliability of your system. Your teams will appreciate the opportunity to do meaningful work, rather than pointing fingers. Blamelessness gives engineers the psychologically safe space and agency to experiment, leading to better work. Your users will also benefit from this cultural evolution. Wasting time on blaming and punishment does nothing for them, while making systemic changes means more reliable services for them.

Implementing SRE

Now that we’ve discussed some of the benefits of SRE, let’s break down how best to integrate the practice into your organization. SRE can fit into any organization’s model. It doesn’t require major investments right away: you don’t have to jump straight to hiring a dedicated SRE team.

Instead, you can build up your SRE practice piece by piece depending on your needs. If you struggle with fast incident response, start building runbooks. If your teams are disagreeing on priorities, align them with SLOs. Cultural changes will always benefit organizations without any major investments. Adopting the SRE perspective in what you do will gradually prove itself beneficial, leading to further adoption.

As your SRE practice matures, you can invest more into hiring and tooling to take your practices to the next level.

SRE vs DevOps

SRE and DevOps share many of the same goals. They differ mostly in how they recommend meeting those goals. However, this doesn’t make them incompatible. SRE can be thought of as a method to implement the principles of DevOps. If you’ve implemented DevOps, you’re in a great position to bridge the gap to SRE. Each SRE practice you add will be bolstered by the DevOps structures you’ve already built.

Starting with SRE can be intimidating, but the benefits are more than worth it. Give yourself the best start by working with Blameless. Our SLOs, incident response tools, and retrospectives make it easy to realize SRE benefits. See how by checking out a demo!
