Wondering what the difference is between observability and monitoring? In this post, we explain how they are related, why they are important, and some suggested tools that can help.
The difference between observability and monitoring is that observability is the ability to understand a system's state from its outputs, often described as uncovering the "unknown unknowns." Observability lets you ask any question of the system in order to understand more deeply how your code is behaving. Monitoring, by contrast, is the ability to determine the overall state of the system, and is usually concerned with the system's infrastructure.
Observability is a tooling and technical practice that enables SRE, engineering, and ops teams to debug their systems methodically. It surfaces new patterns and properties that were not defined or identified in advance. Because code can behave differently in production than in staging, it's important to proactively observe what's occurring in production as it impacts users. To achieve true observability, you need to instrument your code to generate telemetry that lets you ask new questions.
As Gartner puts it in its Hype Cycle report:

"Many vendors, including those in the network and security domains, are using the term 'observability' to differentiate their products. However, little consensus exists on the definition of observability and the benefits it provides, causing confusion among enterprises purchasing tools."
By contrast, monitoring is a practice that enables SRE and ops teams to watch and comprehend the different states of their system, often through predefined metrics and dashboards updated in real time. The data feeding those dashboards comes from a predefined set of metrics or logs that are important to you. More on that in a moment.
In the same Gartner Hype Cycle report, one cited monitoring obstacle is:
“Due to the conservative nature of IT operations -- In many large enterprises, the role of IT operations has been to keep the lights on, despite constant change. This, combined with the longevity of existing monitoring tools, means that new technology is slow to be adopted.”
How are Observability and Monitoring Related to each other?
Observability and monitoring have a symbiotic relationship, but they serve different purposes. Observability is about making the data accessible, whereas monitoring is the task of collecting and displaying that data, which teams then rely on for ongoing review, or "watching."
What is SRE Observability?
SRE observability, often abbreviated o11y, applies a concept borrowed from control theory. It addresses the issue by encouraging engineers to write their services so that they emit metrics and logs. These metrics and logs are then used to observe what's actually occurring with the code in production.
Observability can be further broken down into three key areas, discussed below, that are essential to enabling SRE observability: logs, metrics, and traces.
Metrics are the foundation of monitoring: aggregated data about the performance of a service, usually a single number tracked over time. Traditionally, teams tracked system-level metrics such as CPU, memory, and disk performance.
The challenge is that while these metrics describe the system, they don't tell you about the user experience or how to improve your code's performance. To address this, some modern monitoring services also offer APM (Application Performance Monitoring) features to track application-level metrics such as requests per minute and error rates. Each metric tracks only one variable, which makes it relatively cheap to store and send.
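To make the idea concrete, here is a minimal sketch of in-process application-level counters. The class and metric names (`AppMetrics`, `requests`, `errors`) are illustrative, not from any particular APM product; real services would export these to a metrics backend rather than hold them in memory.

```python
from collections import Counter

class AppMetrics:
    """In-process counters: each metric tracks a single number over time."""
    def __init__(self):
        self.counters = Counter()

    def incr(self, name: str, value: int = 1) -> None:
        self.counters[name] += value

    def error_rate(self) -> float:
        total = self.counters["requests"]
        return self.counters["errors"] / total if total else 0.0

# Simulate four requests, one of which fails with a 5xx status.
metrics = AppMetrics()
for status in (200, 200, 500, 200):
    metrics.incr("requests")
    if status >= 500:
        metrics.incr("errors")

print(metrics.counters["requests"])  # 4
print(metrics.error_rate())          # 0.25
```

Because each counter is a single number, it stays cheap to store and ship, which is exactly why metrics scale well as the first layer of monitoring.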
The DevOps, Ops, or SRE team usually determines the best set of metrics to watch for, which can vary depending on the service itself and its overall maturity. Often teams watch metrics dashboards when code changes occur or when a new fix or release is shipped.
Logs are the output from your code: immutable, time-stamped records, sometimes referred to as events, that can be used to identify patterns in a system. All processes in a system emit logs, which usually include records of individual user queries and debugging information associated with the service.
Logs can be any arbitrary string, but programming languages and frameworks typically provide libraries that generate logs from running code with relevant data at various levels of specificity (e.g. INFO vs. DEBUG mode). Among programming communities, there's no standard for what should be included at each log level.
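The level-based filtering described above can be seen in Python's standard `logging` module. The logger name and messages here are made up for illustration:

```python
import logging

# Configure an application logger; at INFO level, DEBUG records are filtered out.
logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed order_id=1234 total=49.99")  # emitted
logger.debug("cart contents before checkout: [...]")   # suppressed at INFO level
```

Because there is no cross-community standard for levels, teams usually agree on their own conventions, such as reserving DEBUG for high-volume diagnostic detail that is disabled in production.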
In a distributed system, a trace displays the flow of an operation from parent event to child event, with both events timestamped. The individual events that form a trace are called spans. Each span stores a start time, a duration, and a parent-id; spans without a parent-id are rendered as root spans.
Traces allow individual execution flows to be followed through the system, which helps teams figure out which component or piece of code is causing a potential system error. Teams can use dedicated tracing tools to inspect the details of a given request. By looking at trace spans, and at waterfall views that show multiple spans across your system, you can run queries to examine timing (latency), errors, and dependency details.
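A span can be modeled with just the fields described above. This is a simplified sketch, not any particular tracing library's API; real tracers (e.g. OpenTelemetry SDKs) add trace IDs, attributes, and context propagation:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One timed operation in a trace: start time, duration, and parent-id."""
    name: str
    parent_id: Optional[str] = None  # no parent-id => rendered as a root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start: float = field(default_factory=time.monotonic)
    duration: float = 0.0

    def finish(self) -> None:
        self.duration = time.monotonic() - self.start

# A request handler (root span) that calls a database (child span).
root = Span("handle_request")
child = Span("db_query", parent_id=root.span_id)
child.finish()
root.finish()
```

A waterfall view is essentially these spans sorted by start time and indented by parent-child relationship, which makes it easy to spot where the latency of a request is actually spent.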
Many observability tools provide tracing capabilities as part of their offerings.
As SRE teams monitor, data about the system and its services is collected, processed, aggregated, and displayed in charts. Monitoring aims to surface performance issues and to alert teams when fixes or resolutions are needed, reducing the impact on end users.
According to Google’s SRE Book:
“Your monitoring system should address two questions: what’s broken, and why? The "what’s broken" indicates the symptom; the "why" indicates a (possibly intermediate) cause.”
Here are some examples of symptoms and possible causes:
Monitoring is usually classified into two broad types: black-box monitoring and white-box monitoring. Depending on the situation, one or both types of monitoring will be used. Usually, heavy use of white-box monitoring is combined with a modest (but critical) use of black-box monitoring.
White-box monitoring relies on metrics exposed by the system's internals, such as logs, HTTP handlers, and interfaces. This gives team members insight into various parts of the tech stack. Examples of white-box metrics include CPU utilization or dependency health. Alerts from white-box monitoring can identify:
Black-box monitoring examines external, user-visible behavior. Its metrics include error responses, latency from the user's perspective, and so on. In black-box monitoring, observers typically have no visibility into the system's internals or knowledge of how it works. In terms of symptoms and causes, black-box monitoring is symptom-oriented: the team member watching these metrics doesn't predict problems but simply observes the system.
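A black-box check can be as simple as probing an endpoint from outside and recording only what a user could observe: the status and the latency. This is a minimal sketch; the URL and return shape are illustrative, and production probes would run from multiple locations on a schedule.

```python
import time
import urllib.request

def probe(url: str, timeout: float = 5.0) -> dict:
    """Black-box check: observe only externally visible behavior."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except Exception:
        status = None  # DNS failure, timeout, connection refused, HTTP error, ...
    return {"url": url, "status": status, "latency_s": time.monotonic() - start}

# Hypothetical health endpoint; an unreachable host yields status None.
result = probe("https://example.invalid/health")
```

Note that the probe knows nothing about why a request failed, only that it did, which is exactly the symptom-oriented stance described above.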
In a multi-layered system, however, one person's symptom can be another's cause. For example, if your database is performing slowly, slow DB reads are a symptom for the database SRE. For an SRE team member monitoring a front end that is serving slow web pages, those same slow DB reads are a cause. White-box monitoring can therefore be either symptom-oriented or cause-oriented.
White-box monitoring is also important for collecting telemetry data for further debugging. Telemetry data is data generated by the system that documents its state and statistics; it is used to assess and improve the health and performance of the overall system.
Google has defined four golden signals for monitoring: latency, traffic, errors, and saturation.
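As a rough sketch of how the four golden signals might be derived from a window of request data (the sample values, window size, and use of CPU utilization as the saturation reading are all illustrative):

```python
import statistics

# Sample per-request observations for one window: (latency in ms, HTTP status).
requests = [(120, 200), (95, 200), (310, 500), (88, 200), (140, 200)]
cpu_utilization = 0.72   # stand-in saturation reading for this window
window_seconds = 60

latency_ms = statistics.median(lat for lat, _ in requests)            # latency
traffic_rps = len(requests) / window_seconds                          # traffic
error_rate = sum(1 for _, s in requests if s >= 500) / len(requests)  # errors
saturation = cpu_utilization                                          # saturation

print(latency_ms, traffic_rps, error_rate, saturation)
```

In practice teams often track a high percentile (p95 or p99) rather than the median for latency, since tail latency is what users notice most.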
Implementing monitoring and observability in your organization is an evolving process that improves over time. To deploy a full observability tool that all engineering teams can learn from, you need to instrument your code so that it emits the right set of logs and metrics to monitor, observe, and query.
Essentially, it's this detailed data about your running code that you query to understand exactly how the code behaves in production. The following are some metrics to track via postmortems or monthly surveys.
By implementing both monitoring and observability tools, you can collectively determine what is alert-worthy. Some issues can be handled by individual teams and don't require full incident resolution. Having SLOs in place is a good way to automate alerts based on acceptable thresholds for system uptime, performance, and errors. The team should first establish monitoring and observability practices and playbooks, and then move to implement SLOs team-wide.
Tracking some or all of these metrics helps you understand whether your monitoring and observability systems are running and working efficiently for your organization. You can further break down the measurements by product, operational team, etc, to gain insight into your system, process, and people!
For any service, engineering toil is risky: it leads to repetitive manual tasks consuming most of an SRE's time. In the SRE book, Google states:
“Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect.”
The goal is to keep toil under 50% because, left unchecked, it tends to grow and can quickly consume more than half of a team member's working week. The engineering in site reliability engineering (SRE) is the practice of reducing toil and scaling up services; that's what enables SRE teams to operate more efficiently than a pure development or pure ops team.
Here are a few tools that can help SREs with observability and system monitoring:
Monitoring tools (some open source)
Observability tools and frameworks
Ultimately, tools help, but on their own they are not enough to achieve your objectives. Observability and monitoring are the shared responsibility of SRE/ops and development teams.
The responsibility for monitoring and observing a system should not fall solely on an individual or a dedicated team. Spreading it out not only helps you avoid a single point of failure, but also improves your ability to understand and improve the system as an organization. Ensuring that all developers are proficient in monitoring promotes a culture of data-driven decision-making and reduces outages.
In most organizations, only the operations team, NOC, or a similar group can make changes to the monitoring system. That's not ideal, and it should be replaced by a system that follows CD (continuous delivery) patterns, ensuring that all changes are delivered in a safe, fast, and sustainable manner.
The ultimate goal of observability and monitoring is to improve the system, and that is a continuous process. Research from DevOps Research and Assessment (DORA) identifies monitoring and observability, alongside other technical practices, as capabilities that contribute to continuous delivery.
If you're aiming to start the journey toward observability and monitoring, Blameless can help by integrating with your chosen tools to collect the right data and analysis for faster incident resolution and ongoing team learning. The Reliability Insights platform by Blameless helps you explore, analyze, and share your reliability data with various stakeholders. To learn more, request a demo or sign up for our newsletter below.