Definitions for common SRE terminology
The action of identifying an event as an actual incident that needs to be worked on, triaged, or debugged. This leads to the next natural step: taking ownership of the incident response.
Availability
The percentage of time a system is accessible and functioning as intended, usually measured over a month-long window.
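As a rough illustration of the percentage described above, the sketch below computes availability from the length of a measurement window and the downtime observed in it. The function name `availability_percent` and the 30-day window are assumptions made for this example, not part of any standard tool.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Return the share of the window the system was up, as a percentage."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


# Assumed example: a 30-day window has 30 * 24 * 60 = 43,200 minutes.
WINDOW_MINUTES = 30 * 24 * 60

# 43.2 minutes of downtime in that window works out to roughly 99.9%.
print(round(availability_percent(WINDOW_MINUTES, 43.2), 3))
```

Teams often count partial degradation differently from a full outage, so the downtime figure fed into a calculation like this depends on how "functioning as intended" is defined locally.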
The process of analyzing and inspecting a system, using a variety of tools (APM, monitoring, observability), to determine what caused an incident.
Latency
The response time of a system, that is, the total time a system takes to respond to a request.
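One simple way to observe the response time described above is to time a single call from the caller's side. This is a minimal sketch: `measure_latency_ms` and `echo_handler` are hypothetical names, and the handler stands in for whatever real service call would be measured.

```python
import time


def measure_latency_ms(handler, request):
    """Call handler(request) and return (response, elapsed milliseconds)."""
    start = time.perf_counter()
    response = handler(request)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return response, elapsed_ms


def echo_handler(request: str) -> str:
    # Hypothetical stand-in for a real service call.
    return request.upper()


response, latency_ms = measure_latency_ms(echo_handler, "ping")
print(f"response={response!r} latency={latency_ms:.3f} ms")
```

A single measurement says little on its own; in practice latency is usually summarized over many requests, for example as percentiles.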
MTBF (Mean Time Between Failures)
The average time between failures (or incidents).
MTTD (Mean Time to Detect)
The average time interval between when an issue occurs and when an alert for the issue is triggered.
MTTR (Mean Time to Repair)
The average time elapsed between acknowledging an incident and resolving it.
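The three averages above can be sketched from a list of incident timestamps. Everything named here (`Incident`, its fields, and the helper functions) is hypothetical and chosen for illustration. MTBF is computed under one stated assumption: the gap is measured from the resolution of one incident to the start of the next.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime       # when the issue actually began
    detected: datetime      # when an alert for it was triggered
    acknowledged: datetime  # when someone took ownership
    resolved: datetime      # when service was restored


def _mean(deltas):
    deltas = list(deltas)
    if not deltas:
        raise ValueError("not enough incidents to compute an average")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time from issue start to alert."""
    return _mean(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time from acknowledgement to resolution."""
    return _mean(i.resolved - i.acknowledged for i in incidents)


def mtbf(incidents):
    """Mean gap between one incident's resolution and the next one's start (assumed)."""
    ordered = sorted(incidents, key=lambda i: i.started)
    return _mean(b.started - a.resolved for a, b in zip(ordered, ordered[1:]))


# Made-up sample data for illustration only.
t0 = datetime(2024, 1, 1)
m = lambda n: timedelta(minutes=n)
incidents = [
    Incident(t0, t0 + m(5), t0 + m(10), t0 + m(40)),
    Incident(t0 + timedelta(days=10), t0 + timedelta(days=10) + m(15),
             t0 + timedelta(days=10) + m(20), t0 + timedelta(days=10) + m(70)),
]
print(mttd(incidents), mttr(incidents), mtbf(incidents))
```

Other definitions of these metrics exist (for example, measuring repair time from detection rather than acknowledgement), so the start and end points should match whatever the team has agreed on.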
Monitoring
The practice of “watching,” or continuously monitoring, important predetermined metrics in charts and dashboards that tell you how the system is behaving overall.
Network
The connection between computers, servers, and other devices that enables data sharing, allowing users to send requests to a server and to run services using data and code hosted on servers.
Observability is the ability to fully understand our systems. In control theory, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. In practice, it is the ability to ask any question of your system to better understand how it’s behaving, without having to re-instrument it or build new code.
Postmortem / Retrospective
A post-incident record documenting the impact, process steps, and resolution of an incident, which helps teams improve and manage incidents better in the future.
A RACI chart is a project management tool used for tracking roles and responsibilities. RACI is an acronym that stands for responsible, accountable, consulted, informed.
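A RACI chart can be represented as a simple mapping from tasks to role assignments. The sketch below is illustrative only: the tasks, the people, and the `validate_raci` helper are hypothetical, and the check assumes the common convention that each task has exactly one Accountable owner and at least one Responsible owner.

```python
ROLES = ("R", "A", "C", "I")  # Responsible, Accountable, Consulted, Informed


def validate_raci(chart):
    """Return a list of problems found in the chart (empty if none)."""
    problems = []
    for task, assignments in chart.items():
        unknown = set(assignments) - set(ROLES)
        if unknown:
            problems.append(f"{task}: unknown role codes {sorted(unknown)}")
        if len(assignments.get("A", [])) != 1:
            problems.append(f"{task}: expected exactly one Accountable owner")
        if not assignments.get("R"):
            problems.append(f"{task}: expected at least one Responsible owner")
    return problems


# Hypothetical chart for two incident-related tasks.
chart = {
    "Declare incident": {
        "R": ["on-call engineer"],
        "A": ["incident commander"],
        "C": ["service owner"],
        "I": ["support team"],
    },
    "Write postmortem": {
        "R": ["on-call engineer", "service owner"],
        "A": ["engineering manager"],
        "I": ["stakeholders"],
    },
}
print(validate_raci(chart))
```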
Reliability
A measure of how likely a system is to perform and function as intended, which includes assessment of availability, latency, and stability, among other performance metrics.
The process of fixing the contributing factors that led to an incident in order to restore the service.
An organized approach to addressing and managing an incident, including logging steps, recording actions or tasks by ‘owner’, and communicating with relevant stakeholders.
Runbook
A document compiling the necessary procedures and operations to follow when an incident happens. In other words, the ‘recipe’ for how to manage an incident end-to-end.
SRE (Site Reliability Engineering)
A set of practices and principles aimed at improving a service’s reliability.
System
A grouping of interconnected components, including code, infrastructure, and networking, that together make a greater whole, i.e., “the system.”