Wondering about SLAs and SLOs? We explain service level agreements and service level objectives, their differences, and the importance of each.
What are the major differences between service level agreements (SLAs) and service level objectives?
An SLA is a legal agreement between the business and the customer that includes a reliability target and the consequences of failing to meet it. An SLO is an internal reliability target, set at the point where unreliability would start to cause customers pain.
Site Reliability Engineering (SRE) is a discipline that aims to increase the reliability of services using a variety of tools, processes, and cultural values. A major goal of SRE is to ensure that your services are never so unreliable that they cause customers pain. This requires understanding how customers perceive the reliability of your service, which depends on how critical each part of the service is to their experience. It also requires you to have responses ready to stop your service from crossing that threshold of unreliability.
The tools SRE uses to achieve these goals are SLOs, SLIs, and SLAs. SLIs are Service Level Indicators, metrics that are compared to the SLO and SLA. They can vary from simple metrics showing a service’s uptime to sophisticated metrics that capture an entire user journey. This allows you to measure how users perceive reliability. Then you use SLAs and SLOs to set standards for how reliable your service should be.
SLAs and SLOs are both reliability targets based on SLIs. Sometimes you will have both an SLO and an SLA for the same SLI. Both are monitored continuously, usually over a rolling window of a month or a week. That is, each states that some metric must stay above some threshold when looking at the last 30 or 7 days. They also each have policies that kick in at certain thresholds to prevent breaches. Despite these similarities, the purpose of each is very different.
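The rolling check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `rolling_sli` and `meets_target`, the per-day `(good, total)` data layout, and the sample numbers are all hypothetical, and a real setup would read these counts from a monitoring system.

```python
from datetime import date, timedelta


def rolling_sli(daily_counts, end_day, window_days=30):
    """Fraction of good events over the window ending on end_day (inclusive).

    daily_counts maps a date to a (good_events, total_events) pair.
    """
    start_day = end_day - timedelta(days=window_days - 1)
    good = total = 0
    for day, (day_good, day_total) in daily_counts.items():
        if start_day <= day <= end_day:
            good += day_good
            total += day_total
    # With no traffic in the window there is nothing to violate.
    return good / total if total else 1.0


def meets_target(sli, target):
    """True when the measured SLI is at or above the target."""
    return sli >= target


# Hypothetical data: 30 days of 10,000 requests with 5 failures per day.
today = date(2024, 1, 30)
history = {today - timedelta(days=i): (9_995, 10_000) for i in range(30)}

sli = rolling_sli(history, today)
print(f"30-day SLI: {sli:.4%}")
print("meets a 99.9% target:", meets_target(sli, 0.999))
print("meets a 99.99% target:", meets_target(sli, 0.9999))
```

The same function works for a weekly window by passing `window_days=7`. The same measured SLI can satisfy one target and miss another, which is why an SLO and an SLA on the same indicator can be in different states at the same time.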
A service level agreement (SLA) is a legally binding agreement between an organization and their customers or end users. It guarantees that their service will meet certain agreed-upon reliability standards. These are usually built on simple, objective, and strictly-defined metrics. For example, you can have an SLA that says your service will be online 99.99% of the time. There’s no nuance or debate to this metric: the service is either online or offline.
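To make a figure like 99.99% concrete, it helps to convert it into permitted downtime. The sketch below assumes a 30-day window; the function name `allowed_downtime_minutes` is illustrative and not part of any tool mentioned here.

```python
def allowed_downtime_minutes(availability_target, window_days=30):
    """Minutes of downtime an availability target permits over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - availability_target) * window_minutes


for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.3%} availability allows {minutes:.2f} minutes per 30 days")
```

Under this assumption, 99.99% leaves roughly 4.3 minutes of downtime in a 30-day window, and each additional nine divides that allowance by ten.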
Generally, SLAs are built on key performance metrics that can be monitored by outside observers, without needing internal monitoring data. This allows stakeholders to verify for themselves that the SLA is being met. In a situation where a company sets up an SLA, stakeholders include C-level executives, product, sales, and customer success teams, in addition to the customer. And of course, the SLA is also monitored by in-house SRE, DevOps, or other engineering teams to ensure there’s no risk of breaches.
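SLAs are created through discussion between the organization and its stakeholders. As mentioned above, these stakeholders can include C-level executives, product, sales, and customer success teams, in-house engineering teams, and the customers themselves.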
These groups each might have their own needs for the SLA. For example, investors and customers may want a very strict SLA, to ensure the service is always available. On the other hand, engineering teams may want a more lenient SLA to allow for more errors. Factors such as unavoidable downtime for maintenance must also be considered. Negotiation and empathy are key to aligning everyone.
A good SLA will be unambiguous, as it needs to be legally binding, while still considering all stakeholders’ needs and factors. Also because of its legal strictness, changing it takes a lot of time and effort. It should therefore change infrequently, and be fairly conservative - better to have something strict than have customers dissatisfied while still meeting the SLA.
SLAs will also usually include consequences for failing to meet them. Organizations may have to pay a penalty fee or refund users if their service isn’t reliable enough. With these external costs on top of the normal costs of incidents, meeting your SLAs is critical.
Service level objectives (SLOs) are internal goals for your SLIs. They generally aren’t shared with external stakeholders and have no legal bindings or consequences. SLOs can help ensure that your SLAs aren’t breached, keeping you safe from legal trouble. Because of this, SLOs should always be made more strict than corresponding SLAs.
SLOs have other uses besides safeguarding the SLA. For example, they can measure more sophisticated SLIs than SLAs, as they don’t require legal precision or external monitoring. You can use complicated combinations of metrics, including ones only internally accessible, to capture how users experience your service. You can add weight to aspects of the service that are more critical or frequently used and remove weight from niche service areas.
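One simple way to express this weighting is a weighted average of per-area SLIs. The sketch below is an assumption about how such a composite could be built, not a description of any particular product; the service areas, their SLI values, and their weights are invented for illustration.

```python
def weighted_sli(components):
    """Combine per-area SLIs into one number using relative weights.

    components is a list of (sli, weight) pairs; weights need not sum to 1.
    """
    total_weight = sum(weight for _, weight in components)
    if total_weight <= 0:
        raise ValueError("at least one component needs a positive weight")
    return sum(sli * weight for sli, weight in components) / total_weight


# Hypothetical service areas: heavier weights for critical, widely used areas.
components = [
    (0.999, 5.0),  # login: critical, used by everyone
    (0.995, 3.0),  # photo upload: frequently used
    (0.950, 0.5),  # legacy export: niche
]

composite = weighted_sli(components)
print(f"composite SLI: {composite:.4f}")
```

Because the niche area carries little weight, its poor score pulls the composite down only slightly, while a similar drop in a critical area would move it far more.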
This advanced SLI can represent users’ happiness in using your service. By knowing what’s most important to users, you’ll be able to properly understand the impact of incidents. If a service area that only 1% of customers use experiences an outage for 10 minutes, does that impact customers more negatively than a service that’s used by 90% of customers going down for 1 minute? SLIs and SLOs can give you the answer.
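As a rough worked version of that question, the sketch below compares the two outages in "affected user-minutes". It assumes that pain scales linearly with both the share of customers affected and the outage duration, which is a simplification; the customer count of 10,000 and the function name are hypothetical.

```python
def impact_user_minutes(share_of_customers, outage_minutes, total_customers=10_000):
    """Affected user-minutes, assuming pain grows linearly with users and time."""
    return share_of_customers * total_customers * outage_minutes


niche_outage = impact_user_minutes(0.01, 10)  # 1% of customers, 10 minutes
core_outage = impact_user_minutes(0.90, 1)    # 90% of customers, 1 minute

print(f"niche area outage: {niche_outage:.0f} user-minutes")
print(f"core area outage:  {core_outage:.0f} user-minutes")
print(f"core outage is {core_outage / niche_outage:.1f}x larger")
```

Under this linear model the one-minute outage of the widely used service is the more damaging incident, even though it is ten times shorter. A different model, for example one that penalizes long outages more heavily, could change the answer, which is why the choice of SLI matters.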
An SLO can also act as the gas pedal and brakes for development. You set the SLO to the point where unreliability will start causing customer pain. Like an SLA, this is usually expressed as a percentage to be reached over some rolling period. For example: 99% of “photo upload” user journeys completed at a satisfactory level in the last month. When incidents happen or the service is otherwise impacted, you check to make sure you’re still on pace to meet that objective.
If you aren’t going to reach the target, the SLO tells you to slam on the brakes. Policies like code freezes or bug bashes pause further development, reducing the risk of new issues and refocusing efforts on making the current code more reliable. This keeps customers happy and reduces the chance of an SLA breach.
On the other hand, if you’re comfortably reaching your SLO, it’s time to hit the gas. Improving reliability beyond customers’ expectations has greatly diminishing returns. Customers likely will never notice the difference between 99.99% uptime and 99.999% uptime, so the time and effort spent on getting the extra nine is essentially wasted.
Instead, aim to just meet your SLO. Any reliability in excess of the SLO is unspent “error budget”, which you can set up policies to help spend. These policies increase development velocity by allowing strategic and safe risks.
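The budget and its policies can be sketched as follows. This assumes the common formulation in which the budget is the number of bad events the SLO still permits over the window; the policy names, the 75% caution threshold, and the sample counts are hypothetical choices for illustration.

```python
def error_budget(slo_target, total_events, bad_events):
    """Return allowed, remaining, and consumed error budget for a window."""
    allowed_bad = (1 - slo_target) * total_events
    remaining = allowed_bad - bad_events
    if allowed_bad > 0:
        consumed = bad_events / allowed_bad
    else:
        # A 100% target leaves no budget: any failure overspends it.
        consumed = 0.0 if bad_events == 0 else float("inf")
    return {"allowed": allowed_bad, "remaining": remaining, "consumed": consumed}


def choose_policy(consumed_fraction, caution_at=0.75):
    """Map budget consumption to a (hypothetical) development policy."""
    if consumed_fraction >= 1.0:
        return "freeze"      # budget spent: code freeze, bug bash
    if consumed_fraction >= caution_at:
        return "slow down"   # budget nearly spent: limit risky releases
    return "ship"            # budget available: take strategic risks


# Hypothetical month: 99% SLO on 200,000 photo upload journeys, 1,200 unsatisfactory.
budget = error_budget(0.99, 200_000, 1_200)
print(f"allowed bad journeys: {budget['allowed']:.0f}")
print(f"remaining budget:     {budget['remaining']:.0f}")
print(f"policy: {choose_policy(budget['consumed'])}")
```

In this example about 60% of the budget is spent, so the sketch recommends continuing to ship. Once consumption passes 100%, it switches to the brake-style policies described earlier.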
When you’re comfortably meeting your SLOs, another option is to change them to make them tighter — if you think your customers will notice and appreciate an increase in reliability. Conversely, if you think your customers will be just as satisfied with a more lenient SLO, you can loosen it to increase error budget and development velocity.
The important thing in both cases is to frequently review and adjust your SLOs. Unlike SLAs, which need to be formalized in legal agreements, SLOs can be changed entirely in-house. They should stay up-to-date with customer needs and development progress.
Like SLAs, your SLOs are fed by monitoring data to reflect the current health of your systems. As SLOs might reflect more complex things like user journeys, you may need tools to weigh and combine other metrics. SREs or engineers in SRE roles take on the responsibility of implementing and monitoring SLOs with feedback from the rest of the organization.
A good SLO is one that reflects something meaningful about your service. Knowing how often a service goes down isn’t enough - you need to know how important that service is to users. By combining severity and importance, you’ll understand how incidents actually impact customer happiness.
Once you’ve determined that indicator, you should set the objective at the pain point of the customer. Meeting your SLO should mean that customers aren’t being pained by unreliability, but you aren’t overspending by making a service unnoticeably more reliable.
An SLO isn’t a “set it and forget it” tool. You won’t get it right the first time, or even the second time. Getting the most out of SLOs means iterating continuously.
As SLOs aren’t usually externally shared and aren’t legally binding, there aren’t any predefined repercussions for not meeting them. However, if your customers are being pained by unreliability, they’re much more likely to leave your service. Therefore you should expect that not meeting SLOs will negatively affect your business. Also, breaching your SLO puts you at risk of breaching your SLA, as the safeguard will be broken.
Here is a chart summarizing the differences and similarities between SLAs and SLOs:
Starting off with SLOs and SLAs can be tricky. The Blameless SLO manager helps teams set new SLOs and track how they are meeting their goals. It’s a great way to get started with SLOs, SLAs, and SLIs. The product guides you through everything from beginning to end. Smart wizards walk you through setting objectives and monitoring metrics, all the way to advanced techniques like capturing user journeys. See how in a demo!