What is Site Reliability Engineering [Simple Intro to SRE]

Emily Arnott | 2.26.2024 | SRE Fundamentals

Site Reliability Engineering, or SRE, was first introduced by Google to help fill the gaps between its development and IT operations teams. Today it serves as a proactive form of quality assurance (QA) that helps improve the reliability of systems across many industries. If you’re wondering what SRE is all about, this article takes a deep dive into site reliability engineering: what it is, how it works, why it was developed, and how it can help your organization.

So what is SRE (Site Reliability Engineering)? SRE is a methodology that fuses software and operations teams, with the goal of producing reliable, resilient, and scalable systems.

What is SRE?

SRE is a methodology that unites software and operations teams, sharing a common goal to produce reliable, resilient, and scalable systems. In 2003, Google was the first massive web company to recognize the need for site reliability engineering when faced with an unmanageable amount of information. This need to improve user experience occurred before DevOps even existed, proving necessity truly is the mother of invention. Google created a new type of code-driven reliability management to resolve its growing infrastructure issues.

SRE spread to web scalers like Facebook by 2010 and continued to gain momentum with companies like Netflix, Uber, and LinkedIn by 2016/17. Adoption of SRE by non-tech companies followed suit. Some companies simply adopted the new buzzword as a title for their existing IT operations, while others acknowledged growing demands for reliability to meet consumer expectations for better user experiences.

The Basics of Site Reliability Engineering

Site Reliability Engineering (SRE) was developed by Google engineer Ben Treynor Sloss in 2003. Google’s goal was to increase the reliability of its sites and services. SRE accomplishes this by applying software engineering best practices to the infrastructure and operation of services.

SRE is often considered alongside the methodology of DevOps. When comparing SRE vs. DevOps, you’ll find that they have the same end goals. Both try to align development and operations with customer satisfaction.

However, their methods for accomplishing these goals differ. Where DevOps is focused on uniting development and operations to drive business value, SRE is focused on the process by which these goals are reached. Neither SRE nor DevOps is better than the other; instead, the two methodologies work together.

Principles of SRE

The core principles of SRE include:

  • Failure is inevitable, but you can always learn from it
  • Effort spent improving reliability past the point where customers would notice is effort that could be better spent elsewhere
  • Where possible, automate to remove toil
  • Deal with incidents blamelessly: work together to find issues in the system that lead to mistakes

Key Metrics for Site Reliability Engineers

Key metrics for site reliability engineers include:

Error Budget/Error Rate

Error budgeting determines the maximum amount of time a technical system can fail without leading to contractual consequences. Treating a certain amount of failure as normal lets teams balance innovation against reliability without putting their SLAs at risk.
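
As a rough illustration, an error budget can be derived directly from an availability target. The sketch below assumes a hypothetical 99.9% target measured over a 30-day window; the function name and the numbers are illustrative and not taken from any particular SLA.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Return the allowed downtime, in minutes, for an availability target.

    slo_target is a fraction such as 0.999 (an assumed example value).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    # Whatever the target does not cover is the budget for failure.
    return (1.0 - slo_target) * total_minutes


budget = error_budget_minutes(0.999, window_days=30)
print(f"{budget:.1f} minutes of downtime allowed")  # 43.2 minutes of downtime allowed
```

Under these assumptions, a stricter target shrinks the budget quickly: moving from 99.9% to 99.99% leaves roughly a tenth of the allowed downtime.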

Availability

Availability identifies a system’s ability to fulfill intended functions at a certain point in time. Using historical availability measurements also helps determine whether the system will continue to perform as expected over time.
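
One common way to express availability over a period is as the share of time the system was up. The sketch below assumes that definition and uses made-up numbers (30 minutes of downtime in one week); other definitions, such as the share of successful requests, are equally valid.

```python
def availability(uptime_seconds, downtime_seconds):
    """Fraction of the observation window during which the system was up."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("observation window must be positive")
    return uptime_seconds / total


week = 7 * 24 * 3600          # one week in seconds
downtime = 30 * 60            # assumed: 30 minutes of downtime
ratio = availability(week - downtime, downtime)
print(f"{ratio:.4%}")
```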

Mean Time to Recover (MTTR)

This identifies the average recovery time after a product or system failure.
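
A minimal sketch of the calculation, assuming each incident is recorded as a hypothetical (started, resolved) pair of timestamps:

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average of (resolved - started) across a list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)


# Illustrative incident records, not real data.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),     # 30 minutes
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 30)),  # 90 minutes
]
mttr = mean_time_to_recover(incidents)
print(mttr)  # 1:00:00
```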

Service Level Indicators (SLIs)

An SLI provides a quantitative measure of a specific aspect of your service, such as how long it takes to return a response to a request.
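
For example, a latency SLI is often expressed as the proportion of requests answered within a threshold. The sketch below assumes a 300 ms threshold and a small list of sample latencies, both chosen only for illustration.

```python
def latency_sli(latencies_ms, threshold_ms):
    """Fraction of requests answered within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


# Illustrative response times in milliseconds.
samples = [120, 85, 310, 95, 150, 480, 70, 110, 90, 130]
sli = latency_sli(samples, threshold_ms=300)
print(f"{sli:.0%} of requests were served within 300 ms")  # 80% of requests were served within 300 ms
```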

Service Level Objectives (SLOs)

An SLO is a target level for the reliability of your service, usually set in relation to your SLAs.

Service Level Agreements (SLAs)

SLAs outline the agreements made between a service provider and a client to set standards based on the metrics the provider must adhere to.

SRE as an implementation of DevOps principles

SRE can be a method of implementing DevOps goals. Here are some common pillars of DevOps success according to Google, and how SRE helps achieve them:

  • Reduce silos of information: SRE achieves this through creating accessible documentation. Learning is fed back into the development cycle.
  • Accept failure as normal: SRE prioritizes developing and operating based on meeting customers’ reliability needs, rather than striving for 100% reliability.
  • Implement gradual change: Transformations cannot happen overnight. Instead, you’ll need to make small changes and iterate to reach your goals.
  • Leverage tooling and automation: SRE achieves this with a commitment to automate where possible and a suite of helpful tools.
  • Measure everything: SRE encourages pushing past shallow metrics and looking at deep metrics that reflect meaningful things about your system.

Responsibilities of Site Reliability Engineers (SREs)

Site Reliability Engineers cover a broad range of responsibilities that can vary from enterprise to enterprise and platform to platform. However, common core responsibilities include:

System Stability and Reliability

SREs ensure systems are reliable, available, and performant to help mitigate the risk of system failures. In this role, the SRE not only proactively identifies potential functionality and performance issues but also takes steps to predict system degradation based on factors such as increasing load. They can address reliability and common software maintenance issues through stability and reliability testing.

Incident Response and Management

Handling incidents and contributing to an effective incident response process ensures companies have a systematic approach using a proven set of practices to mitigate issues when an incident occurs. SREs also prioritize on-call actions to reduce the disruption of system availability, reliability, and performance.

Capacity Planning

SREs are forward-thinking and proactive, anticipating future growth to facilitate scalability. By analyzing historical data, they can forecast future demand and inform the development team about details such as how much additional traffic the system will have to handle.
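
As a deliberately simple sketch of forecasting from historical data, the example below fits a straight line to past traffic peaks with ordinary least squares and extrapolates it. The monthly figures are invented, and real traffic is rarely this linear, so treat it as a starting point rather than a planning method.

```python
def linear_forecast(history, periods_ahead):
    """Fit y = intercept + slope * x to history and extrapolate forward."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two data points")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    # The last observed point sits at x = n - 1.
    return intercept + slope * (n - 1 + periods_ahead)


# Illustrative monthly peak traffic in requests per second.
monthly_peak_rps = [1000, 1100, 1200, 1300, 1400, 1500]
forecast = linear_forecast(monthly_peak_rps, periods_ahead=3)
print(f"Projected peak in three months: {forecast:.0f} requests per second")
```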

Performance Optimization

Ongoing review of processes allows SREs to identify bottlenecks and opportunities to improve the efficiency of systems. By analyzing system metrics and conducting load testing, they gain insight into which configurations best handle increasing load and when peak times occur. As a result, they continuously improve the end-user experience.

Automation of Repetitive Tasks

SREs identify tasks for automation by reviewing workflows for repetitive, time-consuming tasks that are also prone to errors. Automation can then be applied so that these tasks consume fewer limited resources and no longer distract from business-critical functions. Automation might include testing, configuration management, or automated alerts and monitoring. SREs then choose the most effective tools to automate and streamline these processes.

The typical SRE process

The typical SRE process is more of a set of best practices. These best practices include:

  • Monitoring data to detect incidents: Use monitoring tools to gather information from the system on how it’s performing. Then, use automated alerts when an abnormality is detected.
  • Responding to incidents using tools like runbooks: When something goes wrong, have plans in place to deal with it effectively and make the information available to all teammates.
  • Creating incident retrospectives to learn from incidents: Every time you experience a failure, document the process and information around it. This allows you to learn as much as possible.
  • Translating incidents into customer impact with SLIs and SLOs: By looking at how customers use your service, you can measure how impactful incidents really are. This allows you to triage your responses in the best way possible.
  • Developing based on the error budget available: Moderate how quickly you move forward with development based on analyzing the risk of breaching the SLO. Use error budgets to make the best decision for your customers.

Ideally, these practices enforce the principles of SRE and help the humans operating your systems learn more, grow, and feel supported.
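
The first practice in the list above, alerting when monitoring data looks abnormal, can be sketched with a simple statistical check. The example assumes a z-score test against a recent baseline with a threshold of three standard deviations; the metric, the values, and the threshold are all illustrative, and production monitoring tools use more robust methods.

```python
from statistics import mean, stdev


def is_anomalous(baseline, value, z_threshold=3.0):
    """Return True when value is more than z_threshold deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # requires at least two baseline points
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Illustrative error rates from recent, healthy measurement intervals.
baseline_error_rates = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.008]

print(is_anomalous(baseline_error_rates, 0.011))  # False: within normal variation
print(is_anomalous(baseline_error_rates, 0.050))  # True: would trigger an alert
```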

Why should I adopt SRE?

There are many benefits to adopting SRE within your organization. We’ve highlighted some of them here.

Better clarity into customer needs

The primary goal of SRE is to improve the reliability of your service. Yet reliability is a subjective term based on how your customers perceive your service. 

To understand the level of reliability your customers expect, use SLIs and SLOs. SLIs, or service level indicators, measure the performance of key metrics such as latency, availability, and more for critical points on a user journey.

SLOs, or service level objectives, set a goal for reliability. This goal should mark the point at which a service becomes unreliable enough that customers are pained. SLOs should be stricter than SLAs, or service level agreements, which are legal guarantees that your service will have some level of reliability. Because they are triggered first, SLOs can act as a safeguard against breaching SLAs.
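
The relationship between the two thresholds can be sketched as a simple classification. The targets below (a 99.9% SLO sitting above a 99.5% SLA) and the status labels are assumptions made for illustration.

```python
def classify(measured, slo, sla):
    """Report which threshold, if any, the measured reliability has crossed."""
    if slo < sla:
        raise ValueError("the SLO should be at least as strict as the SLA")
    if measured < sla:
        return "sla_breached"
    if measured < slo:
        # The internal target was missed, but the contract still holds.
        return "slo_breached"
    return "healthy"


SLO, SLA = 0.999, 0.995  # assumed example targets

print(classify(0.9995, SLO, SLA))  # healthy
print(classify(0.9970, SLO, SLA))  # slo_breached
print(classify(0.9900, SLO, SLA))  # sla_breached
```

The middle case is the useful one: the SLO trips while there is still room before the SLA, which gives the team time to react.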

With SLOs, you can also understand when it’s time to focus on feature No. 1: reliability. Error budgets are an advanced practice for operationalizing SLOs. Error budgets show how much unreliability you can experience before the SLO is breached. 

If you’re running out of error budget, or exceeding your allotment, it’s time to prioritize reliability improvements. By using error budgets and SLOs, teams can make better decisions about the most important projects to spend resources on with a strong focus on customer happiness.
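
A minimal sketch of that decision, assuming a request-based SLO and a hypothetical policy that shifts focus to reliability once less than 25% of the budget remains. The threshold, the request counts, and the wording of the two outcomes are illustrative choices, not a prescribed policy.

```python
def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("no error budget: check slo_target and total_requests")
    return 1.0 - failed_requests / allowed_failures


def next_priority(remaining, threshold=0.25):
    """Hypothetical policy: protect reliability when the budget runs low."""
    return "prioritize reliability" if remaining < threshold else "ship features"


# Illustrative numbers: 99.9% target, one million requests, 400 failures.
remaining = budget_remaining(0.999, total_requests=1_000_000, failed_requests=400)
print(f"{remaining:.0%} of the budget left: {next_priority(remaining)}")
```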

Improved development velocity

SRE also helps you improve development velocity.  While conversations about error budgets sometimes shift resources towards reliability improvements, they can also shift resources to development efforts. When the error budget has room, engineers are encouraged to take risks and plunge into new development.

The expectation is that, if the reliability is sufficient, teams can work to delight customers by providing new features.

SRE also helps teams work faster by tightening feedback loops. A key SRE principle is conducting an incident retrospective after each incident. These documents summarize what the team learned from the incident and what actions need to happen going forward. By encapsulating this learning and reviewing it, the lessons learned can stimulate future development.

Additionally, action items from the retrospective can be added into future sprints. Teams can work with product to ensure that these improvements are prioritized according to their importance. 

Improved incident response

Incidents are inevitable. No service can ever achieve 100% uptime. But SRE can help you respond quickly and recover from incidents, minimizing downtime. The SRE process includes many tools and procedures to respond to incidents:

  • Monitoring tools detect incidents quickly: The sooner an incident is detected within the system, the sooner an alert can be sent out to begin the response. Make sure your monitoring tools are able to capture any areas of abnormality.
  • Alerting tools get the right people involved: When monitoring (or customer support) detects an incident is occurring, on-call people will be alerted to begin responding. Balancing your on-call schedules is essential to avoid burnout. 
  • Runbooks guide respondents through incidents: These documents lay out things to check and how to respond to categories of incidents step by step. Continuously reviewing and improving based on how your runbooks performed helps them stay useful.
  • Role-based collaboration tools get everyone on the same page: Designate roles and responsibilities for each responder. Use shared checklists to keep everyone looped in on the team’s progress within a centralized communication channel.
  • Incident retrospectives show areas of improvement: After each incident, prepare a retrospective document with the steps taken, the resources used, and other contextual information. Review these on a regular basis to allow you to apply their learning to future incidents.

Automation and standardization

Another principle of SRE is to automate wherever possible. Automating helps the humans who build, operate, and maintain the systems focus on the tasks that are critical rather than repetitive work. By reducing toil, time and energy can be spent on other projects that bring business value.

Another key component of SRE is standardization. SRE advocates creating documentation, processes, and runbooks for common tasks. This also increases consistency and reduces toil. Standardization exists in the incident response process with codified roles and responsibilities. It also exists within the retrospectives process with custom questions, sections, and analysis. It’s an important part of establishing error budget policies that go into effect when an error budget hits a certain level. 

As you can see, throughout all SRE best practices, standardization plays an important role in making sure teams are able to learn, grow, and make the best decisions possible.

Onboarding

SRE can also help you onboard new employees. The principles of SRE make information accessible across your organization. Incident retrospective libraries and runbooks will get newcomers up to speed.

Additionally, on call can be daunting when you’re responsible for a service you don’t have a lot of experience with. By using SRE best practices for monitoring and alerting, runbooks, and incident response, you can eliminate a lot of the pressure for on-call.

New employees will also feel empowered to explore and work on projects. The error budget helps them analyze the risk of changes they make. It acts as a safeguard against negatively impacting customer satisfaction. 

A culture of learning and growth

Perhaps the most important aspect of SRE is the cultural lessons it imparts. SRE best practices promote a psychologically safe environment. Failure is celebrated, and teammates are encouraged to raise issues. Incidents are addressed blamelessly. Instead of looking for who is responsible, everyone works together to find systemic causes.

SRE views incidents as unplanned investments in reliability. With this perspective, you’ll be encouraged to learn as much as possible each time. Knowledge becomes less siloed, flowing through the entirety of the development process. Everyone learns something from everyone.

Now that you’ve seen the benefits of SRE, let’s discuss how you can get started.

How do I integrate SRE into my organization?

SRE can be integrated into many existing operational models. As noted above, SRE aligns with DevOps. It also aligns with ITIL principles. ITIL focuses on aligning the goals of IT with your organization’s business goals. As with DevOps, SRE can be implemented as a way to put ITIL principles into practice. ITIL principles also emphasize automating and optimizing, collaborating without silos, and continuous learning.

The process of adopting SRE will vary based on your organizational needs. While some practices like SLOs and error budgets are natural next steps for more advanced teams, there is some low-hanging fruit other teams can use to get started today.

  1. Establishing incident response best practices: This involves creating standard roles and responsibilities for responders. Make sure to collaborate broadly to keep responsibilities fair and balanced. By codifying best practices, you’ll improve consistency, reduce toil, and increase morale.
  2. Creating incident retrospectives for all incidents: Building a library of incident retrospectives will help you tackle each new incident. You can look back at what worked best and implement it. Having this history of incidents also helps people feel psychologically safe. They’ll see how past incidents were dealt with blamelessly and feel comfortable failing without fear of retribution.
  3. Writing documentation and runbooks: Create documentation for the processes you use when dealing with incidents. These resources increase incident response speed and reduce toil. Make sure to keep reviewing and revising them as needed.

Investing in SRE will require buy-in from all stakeholders. In a panel hosted by Blameless, Tony Hansmann had some advice for teams looking to get others on board. He advocates for a method of finding a small team of people who are passionate about the idea, who believe positive change is possible. By working with them on things like eliminating toil, which have clear immediate benefits, you can convince them to commit further.

After building a committed core, you need to educate others. “What we did is we taught SLOs, SLIs, and SLAs, how they're actually related, and how to have a half-hour conversation about them without getting lost in the definition of the acronym,” Tony recommends.

Keep in mind that you don’t need an SRE to have SRE. While site reliability engineers come with a vast amount of experience, teams looking to incorporate SRE into their organization can focus on the best practices. Rather than having an SRE or SRE team, everyone becomes an SRE. Björn Rabenstein calls this “SRE in the Third Age.”

That being said, if you’re looking to build out an SRE team in your organization or make the career move to SRE, here are some of the most common roles you will fill.

What are some roles of an SRE?

An SRE is a multifaceted role. SREs often focus on building tools and infrastructure rather than developing new features. SREs will split their time between development and operations as the need arises. According to Google, this split should be no more than 50% operations work and the rest of the time should be devoted to reliability projects.

Sometimes, the SRE role may not code at all. SREs come from many backgrounds, occasionally transitioning into the role from a non-engineering career. Their time could be entirely spent on building and maintaining SRE processes. This could involve collaborating with teams to build runbooks, hosting incident review meetings, or conducting internal audits for reliability.

What is an SRE team and how does it operate?

SRE teams can be arranged in a variety of ways. Two of the most common structures include:

  • Dedicated engineers focused on tooling and infrastructure shared by the organization: This model has SREs building things like SLOs, runbooks, and templates. These would be used by many teams, adapted to their unique needs.
  • Embedded engineers for each team or product, maintaining reliability for that service area: This model has a small team of SREs (or a single SRE) working with each team. They would help with the specific needs of that team.

Common SRE Tools

SRE and DevOps automation tools are available in several categories, including:

  • Incident Management Tools: These tools lower the risk of software incidents and catch and resolve problems through:
    o   Support
    o   Incident tracking
    o   Communication among team members
    o   Real-time incident updates
    o   Speedy resolution times
    o   Consistency, and more

Tools include Blameless, PagerDuty, ServiceNow, OpsGenie and xMatters.

  • Measurement Tools: These tools collect and analyze performance-based information that is then processed and organized for future optimization. Examples include New Relic, Datadog, Pingdom, Dynatrace and Splunk.
  • Continuous Testing Tools: Continuous testing tools run regularly scheduled automated tests and rerun them with each code change. Examples include Selenium, JUnit, TestNG and Cucumber.
  • Continuous Delivery Tools: CD tools facilitate building, testing, and upgrading software and ensure consistency in code and user environments. Examples include Jenkins, CircleCI, Travis CI, and GitLab CI/CD.
  • Monitoring Tools: Monitoring tools automatically watch system health so that potential problems can be caught early, before they affect performance. Some popular tools include Prometheus, Nagios, Zabbix, and SolarWinds.
  • Configuration Management Tools: These tools are essential for creating consistency and efficiency in software, automating the processes needed to understand and track software updates and changes. Some examples include Ansible, Chef, Puppet and SaltStack.

Can SREs Code?

Yes. Most SREs need to understand code and be able to write it from scratch, typically in a programming language such as Go, Python, or Ruby. With an intimate understanding of specific programming languages, an SRE can also make plans to deploy and upgrade their software.

Salaries for SREs

Compensation for SREs depends on the engineer’s experience and areas of expertise as well as the size of the company and the level of the role. Less experienced SREs and those at smaller companies start at around $135,000, while larger organizations and higher levels of responsibility can push salaries upwards of $225,000. The median salary is $175,500.

Pros and Cons of Being an SRE

Life as an SRE has its pros and cons, including:

Pros:

  • Plenty of opportunities for advancement and different areas to explore, such as cloud computing, cybersecurity, automation, and Infrastructure as Code (IaC)
  • Opportunities for development, with new discoveries every day that build technical skills in coding, programming languages, automation tools, and emerging technologies
  • An above-average median salary of about $175,000, along with growth opportunities, flexibility such as remote work, and benefits such as healthcare, retirement plans, and stock options or equity

Cons:

  • On-call duties are often par for the course, especially in junior positions, which means being ready to work evenings, weekends, and holidays
  • Continuous learning challenges as new tools, coding languages, and system designs are constantly being introduced, putting pressure on SREs to remain on the leading edge of innovations
  • Managing complex tasks can be stressful and time-consuming, which can mean more overtime and pressure to ensure nothing breaks

Starting your SRE journey with Blameless

Blameless can help you begin your SRE journey by making it easy to adopt and operationalize SRE practices. To find out how, check out a demo. Or, if you’re interested in more content like this, sign up for our newsletter below.
