Glossary

Definitions for common SRE terminology

Acknowledge

The act of confirming that an event is an actual incident that needs to be triaged, debugged, or otherwise worked on. Acknowledging an incident leads naturally to the next step: taking ownership of the incident response.

Availability/Uptime

The percentage of time a system is accessible and functioning as intended, usually measured over a month-long time frame.
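
As a rough illustration of the percentage described above, the sketch below computes availability from a measurement window and the downtime observed within it. The function name `availability_percent` and the choice of minutes as the unit are assumptions made for this example, not part of any standard tool.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Return the share of the window during which the system was up, as a percentage."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


# A 30-day month has 30 * 24 * 60 = 43,200 minutes.
month_minutes = 30 * 24 * 60
# 43.2 minutes of downtime in that month is roughly "three nines".
print(round(availability_percent(month_minutes, 43.2), 3))  # 99.9
```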

Detect

The process of analyzing and introspecting a system with a variety of tools (APM, monitoring, observability) to determine what caused an incident.

Latency

The response time of a system, that is, the total time the system takes to respond to a request.
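
One way to picture this definition is to time a single request from start to finish. The sketch below is a minimal example under stated assumptions: `measure_latency` and the stand-in handler are hypothetical names, and a real service would typically record latency in its monitoring tooling rather than by hand.

```python
import time


def measure_latency(handler, request):
    """Call handler(request) and return (response, elapsed seconds)."""
    start = time.perf_counter()
    response = handler(request)
    elapsed = time.perf_counter() - start
    return response, elapsed


# Hypothetical handler standing in for a real system.
response, latency_s = measure_latency(lambda req: req.upper(), "ping")
print(response, f"{latency_s * 1000:.3f} ms")
```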

MTBF (Mean Time Between Failures)

The average time between failures (or incidents).
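
A minimal sketch of this average, assuming the only data available is the start time of each failure: it averages the gaps between consecutive failures. The function name and the sample dates are made up for illustration, and some teams instead measure only the uptime between the end of one failure and the start of the next.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_times: list[datetime]) -> timedelta:
    """Average the gaps between consecutive failure start times."""
    if len(failure_times) < 2:
        raise ValueError("need at least two failures to measure a gap")
    ordered = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Illustrative data: gaps of 10 days and 20 days.
failures = [datetime(2024, 1, 1), datetime(2024, 1, 11), datetime(2024, 1, 31)]
print(mean_time_between_failures(failures))  # 15 days, 0:00:00
```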

MTTD (Mean Time to Detect)

The average time interval between when an issue occurs and when an alert for the issue is triggered.
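
The interval above can be sketched as the mean of "alert time minus occurrence time" across issues. The pair layout `(occurred_at, alerted_at)` and the sample timestamps are assumptions for this example only.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(issues: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the delay between each issue occurring and its alert firing.

    Each item is a hypothetical (occurred_at, alerted_at) pair.
    """
    if not issues:
        raise ValueError("need at least one issue")
    delays = [alerted_at - occurred_at for occurred_at, alerted_at in issues]
    return sum(delays, timedelta()) / len(delays)


# Illustrative data: alerts fired 4 and 10 minutes after the issues began.
issues = [
    (datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 2, 10, 4)),
    (datetime(2024, 5, 9, 22, 30), datetime(2024, 5, 9, 22, 40)),
]
print(mean_time_to_detect(issues))  # 0:07:00
```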

MTTR (Mean Time to Repair)

The average time elapsed between acknowledging an incident and resolving it.
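
Following the definition above, a small sketch can average the acknowledge-to-resolve duration over a set of incidents. The `Incident` record and its field names are hypothetical, chosen only to mirror the two events named in the definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record holding the two timestamps the definition uses.
    acknowledged_at: datetime
    resolved_at: datetime


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Average the time from acknowledgement to resolution."""
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [i.resolved_at - i.acknowledged_at for i in incidents]
    return sum(durations, timedelta()) / len(durations)


# Illustrative data: one incident took 30 minutes, the other 90 minutes.
incidents = [
    Incident(datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 30)),
    Incident(datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 15, 30)),
]
print(mean_time_to_repair(incidents))  # 1:00:00
```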

Monitoring

The practice of “watching” a system by continuously tracking important, predetermined metrics in charts and dashboards that tell you how the system is behaving overall.

Network

The connection between computers, servers, and other devices that enables data sharing, allowing users to send requests to a server and to run services using the data and code stored on servers.

Observability

Observability is the ability to fully understand a system. In control theory, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. In practice, it is the ability to ask any question of your system to better understand how it’s behaving, without having to re-instrument it or build new code.

Postmortem/Retrospective

A post-incident record documenting the impact, process steps, and resolution of an incident, which helps teams improve and manage incidents better in the future.

Reliability

A measure of how likely it is that a system will perform and function properly as it is intended to, which includes assessment of availability, latency, and stability among other performance metrics.

Resolve

The process of fixing the contributing factors that led to an incident in order to restore the service.

Respond

An organized approach to addressing and managing an incident, including logging steps, recording actions or tasks by ‘owner’, and communicating with relevant stakeholders.

Runbook

A document compiling the procedures and operations to follow when an incident happens. In other words, the ‘recipe’ for managing an incident end-to-end.

SRE (Site Reliability Engineering)

A set of practices and principles aimed at improving a service’s reliability.

System

A grouping of interconnected components, including code, infrastructure, and networking, that together make up a greater whole, i.e. “the system”.