Your guide to implementing the principles of SRE within your organization: incident response, service level objectives (SLOs), and team culture.
Bridging the Gap: DevOps to SRE
Before Google’s SRE Handbook was released in 2016, most organizations focused on DevOps. Today, we know the two practices aren’t mutually exclusive. Organizations can now combine them to strengthen their systems, procedures, and service reliability.
This eBook will guide you through implementing the principles of SRE within your organization. We’ll break down the obstacles you might encounter, especially as you begin to make sense of how SRE works together with DevOps. You’ll also learn how the tools you already have in place give you a head start. By the end, we’ll have established three solid foundations of SRE: incident response, service level objectives (SLOs), and team culture.
- SRE is primarily about the customer, while DevOps is rooted in internal operations:
While the goal of DevOps is to create alignment between developers and operators, the goal of SRE is to improve the end-user experience.
- Incident management with SRE saves teams from chaos:
With SRE, incident management rests on four practices: a library of runbooks that gives the engineer a head start whenever something goes wrong, role-based checklists that assign key roles and ensure the most important work gets done, retrospectives with follow-up items that carry the lessons of each incident forward, and balanced schedules that keep on-call teams at their best.
- Your incident response toolbox consists of runbooks, classifications, and retrospectives.
a. Runbooks: they guide engineers through incident response. They’re a series of steps and checks curated for different types of incidents.
b. Classifications: they categorize incidents according to significant features that set them apart. Assign specific roles and responsibilities based on the type of incident.
c. Incident retrospectives: they summarize the incident and record what can be learned. Suggest improvements in process, tools, and practices in order to manage incidents better in the future.
- The road to SLOs:
To get started with SLOs, set up the data you want to monitor, build SLIs based on fundamental metrics, set policies, and initiate review cycles.
- A blameless culture is essential:
To succeed, your culture must be blameless and holistic, put reliability first, and embrace risk.
- Perfection is not the aim:
Systems are never perfect, and neither are the humans that build them. There’s room for patience and grace, but there’s also room for improvement, always.
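To make the SLO takeaway above more concrete, here is a minimal sketch of how an SLI, an SLO target, and an error budget policy can fit together. It is an illustration under stated assumptions, not a prescribed implementation: the names `SLO`, `availability_sli`, `error_budget_remaining`, and `policy_action`, the 99% target, and the "freeze releases" policy are all hypothetical and do not come from this eBook.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical SLO definition: a name plus a target ratio such as 0.99."""
    name: str
    target: float


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events; treated as 1.0 when there is no traffic."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # No budget at all: any bad event exhausts it.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


def policy_action(budget_remaining: float) -> str:
    """Assumed example policy: pause feature releases once the budget is spent."""
    if budget_remaining <= 0:
        return "freeze releases"
    return "ship normally"


if __name__ == "__main__":
    slo = SLO(name="checkout availability", target=0.99)
    good, total = 9_950, 10_000  # made-up request counts for one review window
    remaining = error_budget_remaining(slo, good, total)
    print(f"SLI: {availability_sli(good, total):.4f}")
    print(f"Error budget remaining: {remaining:.0%}")
    print(f"Policy: {policy_action(remaining)}")
```

In a review cycle, a team would rerun a calculation like this for each window and revisit the target or the policy if the budget is routinely overspent or never touched.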
Table of Contents
1. Life with SRE
2. Incident Management
How to elevate your incident management with SRE
Your incident response toolbox
What SLOs can do for you
What are SLOs
This sounds pretty tough
The road to mastering SLOs
What culture can do for you
Put reliability first