Ebook

Bridging the Gap: DevOps to SRE

Your guide to implementing the principles of SRE within your organization: incident response, service level objectives (SLOs), and team culture.

Summary

Before Google published its Site Reliability Engineering book in 2016, most organizations focused on DevOps. Today, we know the two practices aren’t mutually exclusive. Organizations now have the opportunity to marry the two in order to strengthen their systems, procedures, and service reliability.

This eBook will guide you through implementing the principles of SRE within your organization. We’ll break down the obstacles you might encounter, especially as you begin to make sense of how SRE works together with DevOps. You’ll also learn how the tools you already have in place give you a head start. By the end, we’ll have established three solid foundations of SRE: incident response, service level objectives (SLOs), and team culture.

Key Takeaways

  1. SRE is primarily about the customer, while DevOps is focused on internal operations:
    While the goal of DevOps is to create alignment between developers and operators, the goal of SRE is to improve the end-user experience.
  2. Incident management with SRE saves teams from chaos:
    With SRE, incident management draws on four practices: a library of runbooks that gives the engineer a head start whenever something goes wrong, role-based checklists that assign key roles and ensure the most important work gets done, retrospectives with follow-up items that carry the lessons of each incident forward, and balanced schedules that keep on-call teams at their best.
  3. Your incident response toolbox consists of runbooks, classifications, and retrospectives.
    a. Runbooks: they guide engineers through incident response. Each one is a series of steps and checks curated for a particular type of incident.
    b. Classifications: they categorize incidents according to the significant features that set them apart. Assign specific roles and responsibilities based on the type of incident.
    c. Incident retrospectives: they summarize the incident and record what can be learned. Suggest improvements in process, tools, and practices in order to manage incidents better in the future.
  4. The road to SLOs:
    To get started with SLOs, set up the data you want to monitor, build service level indicators (SLIs) based on fundamental metrics, set policies, and initiate review cycles.
  5. A blameless culture is essential:
    To succeed, your culture must be blameless and holistic, put reliability first, and embrace risk.
  6. Perfection is not the aim:
    Systems are never perfect, and neither are the humans that build them. There’s room for patience and grace, but there’s also room for improvement, always.
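
To make the SLO takeaway above a little more concrete, here is a minimal sketch of how an SLI and its error budget might be computed from raw event counts. It assumes a simple availability-style SLI (good events divided by total events); the names `SloReport` and `evaluate_slo` and the example numbers are hypothetical and do not come from this eBook or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float                     # observed fraction of good events
    error_budget_remaining: float  # 1.0 = untouched, 0.0 = spent, < 0 = overspent
    slo_met: bool


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Compare an availability-style SLI against an SLO target (hypothetical helper)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events
    # The error budget is the number of bad events the SLO tolerates.
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    remaining = 1 - actual_bad / allowed_bad
    return SloReport(sli=sli, error_budget_remaining=remaining, slo_met=sli >= slo_target)


if __name__ == "__main__":
    # Example: 10,000 requests, 5 failures, 99.9% target.
    report = evaluate_slo(good_events=9_995, total_events=10_000, slo_target=0.999)
    print(report)
```

A policy could then key off `error_budget_remaining`, for example by pausing risky releases once the budget is close to spent. The thresholds for such a policy are a team decision and are not prescribed here.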

Table of Contents

1. Life with SRE

2. Incident Management

How to elevate your incident management with SRE

Your incident response toolbox

3. SLOs

What SLOs can do for you

What are SLOs

This sounds pretty tough

The road to mastering SLOs

4. Culture

What culture can do for you

Be blameless

Be holistic

Put reliability first

Embrace risk

5. Plot your maturity

6. Summary

"I have less anxiety being on-call now. It’s great knowing comms, tasks, etc. are pre-configured in Blameless. Just the fact that I know there’s an automated process, roles are clear, I just need to follow the instructions and I’m covered. That’s very helpful."
Jean Clermont, Sr. Program Manager, Flatiron
"I love the Blameless product name. When you have an incident, "Blameless" serves as a great reminder to not blame anything or anyone (not even yourself) and just focus on the incident resolving itself."
Lili Cosic, Sr. Software Engineer, HashiCorp