Glossary

Definitions for common SRE terminology

Acknowledge

The action of identifying an event as an actual incident that needs to be worked on, triaged, or debugged. Acknowledging an incident leads to the next natural step: taking ownership of the incident response.

Availability/Uptime

Defines the percentage of time a system is accessible and functioning as intended, usually measured over a month-long time frame.
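
As a rough illustration of the definition above, availability can be computed from the length of the measurement window and the downtime observed within it. This is a minimal sketch, not a standard API: the function name `availability_percent` and the choice of minutes as the unit are assumptions made for the example.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Return the percentage of the window during which the system was up.

    Hypothetical helper for illustration; both arguments use the same unit.
    """
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


# A 30-day window, matching the month-long time frame mentioned above.
MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes

# 43.2 minutes of downtime in 30 days corresponds to roughly 99.9% availability.
example_availability = availability_percent(MINUTES_IN_30_DAYS, 43.2)
```

Teams often measure downtime per request rather than per minute; the same ratio applies with successful requests over total requests.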

Detect

The process of analyzing and introspecting a system with a variety of tools (APM, monitoring, observability) to determine what caused an incident.

Latency

The response time of a system or the total time a system takes to respond to a request.
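
One simple way to observe latency as defined above is to time a single request from start to finish. The sketch below assumes the request is represented by a zero-argument callable; `timed_call` and `handler` are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_call(handler: Callable[[], T]) -> Tuple[T, float]:
    """Run handler once and return its result plus the elapsed time in seconds."""
    start = time.perf_counter()  # monotonic clock suited to measuring short durations
    result = handler()
    elapsed_seconds = time.perf_counter() - start
    return result, elapsed_seconds


# Usage: time a stand-in "request" that just computes a value.
response, latency_seconds = timed_call(lambda: sum(range(1000)))
```

In practice, latency is usually reported across many requests (for example as a median or a high percentile) rather than from a single measurement.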

MTBF (Mean Time Between Failures)

The average time between failures (or incidents).

MTTD (Mean Time to Detect)

The average time interval between when an issue occurs and when an alert for the issue is triggered.

MTTR (Mean Time to Repair)

The average time elapsed between acknowledging an incident and resolving the incident.
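
The three averages defined above (MTBF, MTTD, and MTTR) can be sketched from incident timestamps. This is an illustrative example only: the `Incident` record and its field names are assumptions, and MTBF is measured here from the start of one incident to the start of the next, which is one of several conventions in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions for this sketch."""
    started: datetime       # when the issue actually began
    alerted: datetime       # when an alert for the issue was triggered
    acknowledged: datetime  # when a responder acknowledged the incident
    resolved: datetime      # when the incident was resolved


def _mean_delta(deltas: List[timedelta]) -> timedelta:
    # statistics.mean raises StatisticsError if the list is empty.
    return timedelta(seconds=mean([d.total_seconds() for d in deltas]))


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time from issue start to alert."""
    return _mean_delta([i.alerted - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from acknowledgement to resolution."""
    return _mean_delta([i.resolved - i.acknowledged for i in incidents])


def mtbf(incidents: List[Incident]) -> timedelta:
    """Mean gap between consecutive incident start times (needs two or more incidents)."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [later.started - earlier.started for earlier, later in zip(ordered, ordered[1:])]
    return _mean_delta(gaps)


# Made-up sample data for illustration only.
t0 = datetime(2024, 1, 1)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=10), t0 + timedelta(minutes=40)),
    Incident(
        t0 + timedelta(days=10),
        t0 + timedelta(days=10, minutes=6),
        t0 + timedelta(days=10, minutes=8),
        t0 + timedelta(days=10, minutes=28),
    ),
]
```

With this sample data, `mttd(incidents)` is 5 minutes, `mttr(incidents)` is 25 minutes, and `mtbf(incidents)` is 10 days.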

Monitoring

The practice of “watching” important predetermined metrics continuously in charts and dashboards that tell you how the system is behaving overall.

Network

The connection between computers, servers, and other devices that enables data sharing, allowing users to send requests to a server and to run services using data and code contained on servers.

Observability

Observability is being able to fully understand our systems. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. This is the ability to ask any question of your system to better understand how it’s behaving without having to re-instrument or build new code.

Postmortem / Retrospective

A post-incident record documenting the impact, process steps, and resolution of an incident, which helps teams improve and manage incidents better in the future.

RACI Chart

A RACI chart is a project management tool used for tracking roles and responsibilities. RACI is an acronym that stands for responsible, accountable, consulted, informed.

Reliability

A measure of how likely it is that a system will perform and function properly as it is intended to, which includes assessment of availability, latency, and stability among other performance metrics.

Resolve

The process of fixing the contributing factors that led to an incident in order to restore the service.

Respond

An organized approach to addressing and managing an incident including logging steps, recording actions or tasks by ‘owner’, and communicating across relevant stakeholders.

Runbook

A document compiling the necessary procedures and operations to follow when an incident happens. In other words, the ‘recipe’ for how to manage an incident end-to-end.

SRE (Site Reliability Engineering)

A set of practices and principles aimed at improving a service’s reliability.

System

A grouping of interconnected components, including code, infrastructure, and networking, that together make a greater whole, i.e. “the system”.

Incident Impact Calculator

Find out how much you could save

Incidents can do real damage to companies that aren't sufficiently prepared for them. Use our calculator to estimate the full cost of incidents for your team.