Glossary

Definitions for common SRE terminology

Acknowledge

The act of recognizing an event as a real incident that needs to be triaged, debugged, or otherwise worked on. Acknowledging leads to the next natural step: taking ownership of the incident response.

Availability/Uptime

Defines the percentage of time a system is accessible and functioning as intended, usually measured over a month-long time frame.
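As a rough illustration of the definition above, availability can be computed from the length of the measurement window and the downtime within it. This is a minimal sketch, not a standard API: the function name `availability_percent` and the 30-day window are assumptions made for the example.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Return the share of the window during which the system was up, as a percentage."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


# Assumed example window: a 30-day month expressed in minutes.
MINUTES_IN_30_DAYS = 30 * 24 * 60

# 43.2 minutes of downtime in a 30-day window is roughly "three nines".
monthly_availability = availability_percent(MINUTES_IN_30_DAYS, 43.2)
```

Under these assumptions, 43.2 minutes of downtime in a 30-day month works out to about 99.9% availability.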

Detect

The work of analyzing and introspecting a system with a variety of tools (APM, monitoring, observability) to determine what caused an incident.

Latency

The response time of a system or the total time a system takes to respond to a request.
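One simple way to picture this definition is to time a single request from start to finish. The sketch below assumes a hypothetical `handler` callable standing in for the system; real systems usually collect many such samples and report percentiles rather than one value.

```python
import time
from typing import Any, Callable, Tuple


def measure_latency_ms(handler: Callable[[Any], Any], request: Any) -> Tuple[Any, float]:
    """Call handler(request) and return (response, elapsed time in milliseconds)."""
    start = time.perf_counter()          # monotonic, high-resolution clock
    response = handler(request)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return response, elapsed_ms


def slow_echo(request: str) -> str:
    """Hypothetical handler: simulates work by sleeping, then echoes the request."""
    time.sleep(0.01)
    return request


response, latency_ms = measure_latency_ms(slow_echo, "ping")
```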

MTBF (Mean Time Between Failures)

The average time between failures (or incidents).
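As a small worked example, MTBF can be estimated by averaging the gaps between consecutive failure timestamps. This is a sketch under a simplifying assumption: each failure is represented only by its start time, and the function name and sample dates are invented for illustration.

```python
from datetime import datetime, timedelta
from typing import List


def mean_time_between_failures(failure_times: List[datetime]) -> timedelta:
    """Average gap between consecutive failures; needs at least two failures."""
    if len(failure_times) < 2:
        raise ValueError("at least two failure timestamps are required")
    ordered = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical failure log: gaps of 10 days and 20 days.
failures = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 11),
    datetime(2024, 1, 31),
]
mtbf = mean_time_between_failures(failures)
```

With the sample timestamps above, the two gaps are 10 and 20 days, so the estimate is 15 days.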

MTTD (Mean Time to Detect)

The average time interval between when an issue occurs and when an alert for the issue is triggered.

MTTR (Mean Time to Repair)

The average time lapsed between acknowledging an incident and resolving the incident.
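To make the MTTD and MTTR definitions above concrete, the sketch below averages the relevant intervals over a list of incidents. The `Incident` record and its field names are assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record with the four timestamps the definitions use."""
    occurred: datetime      # when the issue started
    alerted: datetime       # when an alert was triggered
    acknowledged: datetime  # when a responder acknowledged the incident
    resolved: datetime      # when the incident was resolved


def _mean(durations: List[timedelta]) -> timedelta:
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time from the issue occurring to the alert firing."""
    return _mean([i.alerted - i.occurred for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from acknowledgement to resolution."""
    return _mean([i.resolved - i.acknowledged for i in incidents])


# Two invented incidents: detection took 4 and 10 minutes, repair 30 and 50 minutes.
incidents = [
    Incident(datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 4),
             datetime(2024, 3, 1, 10, 10), datetime(2024, 3, 1, 10, 40)),
    Incident(datetime(2024, 3, 2, 14, 0), datetime(2024, 3, 2, 14, 10),
             datetime(2024, 3, 2, 14, 12), datetime(2024, 3, 2, 15, 2)),
]
```

For the sample data, `mttd(incidents)` is 7 minutes and `mttr(incidents)` is 40 minutes.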

Monitoring

The practice of continuously “watching” important, predetermined metrics in charts and dashboards that tell you how the system is behaving overall.

Network

The connection between computers, servers, and other devices that enables data sharing, allowing users to request data from a server and to run services using the data and code stored on servers.

Observability

Observability is the ability to fully understand your systems. In control theory, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. In practice, it is the ability to ask any question of your system to better understand how it’s behaving, without having to re-instrument it or build new code.

Postmortem / Retrospective

A post-incident record documenting the impact, process steps, and resolution of an incident, which helps teams improve and manage incidents better in the future.

RACI Chart

A RACI chart is a project management tool used for tracking roles and responsibilities. RACI is an acronym that stands for responsible, accountable, consulted, and informed.

Reliability

A measure of how likely a system is to perform and function as intended. It takes into account availability, latency, and stability, among other performance metrics.

Resolve

The process of fixing the contributing factors that led to an incident in order to restore the service.

Respond

An organized approach to addressing and managing an incident, including logging steps, recording actions or tasks by ‘owner’, and communicating with relevant stakeholders.

Runbook

A document compiling the necessary procedures and operations to follow when an incident happens. In other words, the ‘recipe’ for how to manage an incident end-to-end.

SRE (Site Reliability Engineering)

A set of practices and principles aimed at improving a service’s reliability.

System

A grouping of interconnected components, including code, infrastructure, and networking, that together make a greater whole, i.e., “the system.”
