Ebook

Bridging the Gap: DevOps to SRE

Your guide to implementing the principles of SRE within your organization: incident response, service level objectives (SLOs), and team culture.

Key Takeaways

  1. SRE is primarily about the customer while DevOps is based in internal operations:
    While the goal of DevOps is to create alignment between developers and operators, the goal of SRE is to improve the end-user experience.
  2. Incident management with SRE saves teams from chaos:
    With SRE, incident management draws on a library of runbooks that give the engineer a head start whenever something goes wrong, role-based checklists that assign key roles and ensure the most important work gets done, retrospectives with follow-up items that carry the lessons of each incident forward, and balanced schedules that keep on-call teams at their best.
  3. Your incident response toolbox consists of runbooks, classifications, and retrospectives.
    a.     Runbooks: these guide engineers through incident response. Each is a series of steps and checks curated for a particular type of incident.
    b.     Classifications: these categorize incidents by the significant features that set them apart, so that specific roles and responsibilities can be assigned based on the type of incident.
    c.     Incident retrospectives: these summarize the incident, record what can be learned, and suggest improvements in process, tools, and practices so that future incidents are managed better.
  4. The road to SLOs:
    To get started with SLOs, set up the data you want to monitor, build SLIs based on fundamental metrics, set policies, and initiate review cycles.
  5. A blameless culture is essential:
    To succeed, your culture must be blameless and holistic, put reliability first, and embrace risk.
  6. Perfection is not the aim:
    Systems are never perfect, and neither are the humans that build them. There’s room for patience and grace, but there’s also room for improvement, always.
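
As a rough illustration of takeaway 4, the sketch below builds an availability SLI from two basic request counts and compares it against an SLO target to see how much error budget is left. The function names, the 99.9% target, and the sample counts are assumptions made for this example, not figures from the ebook.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests served successfully (the SLI)."""
    if total_requests == 0:
        # Assumption for this sketch: no traffic counts as fully available.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Return the unspent share of the error budget.

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and a negative value means the SLO has been breached.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # failure rate the SLO tolerates
    actual_failure = 1.0 - sli          # failure rate actually observed
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 requests, 500 of them failed.
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    remaining = error_budget_remaining(sli, slo_target=0.999)
    print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```

With these hypothetical numbers, about half of the error budget is still available. A policy could then state, for instance, what the team does once the remaining budget drops below an agreed threshold, and a review cycle could revisit whether the target itself is still appropriate.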

Table of Contents

1. Life with SRE

2. Incident Management

How to elevate your incident management with SRE

Your incident response toolbox

3. SLOs

What SLOs can do for you

What are SLOs

This sounds pretty tough

The road to mastering SLOs

4. Culture

What culture can do for you

Be blameless

Be holistic

Put reliability first

Embrace risk

5. Plot your maturity

6. Summary
