Wondering about SLAs, SLOs, and SLIs? We explain service level agreements, service level objectives, and service level indicators, their differences, and the importance of each.
SRE and SLAs, SLOs, SLIs
Site Reliability Engineering (SRE) is a discipline that aims to increase the reliability of services using a variety of tools, processes, and cultural values. A major goal of SRE is to ensure that your services are never so unreliable that they cause customers pain. This requires understanding how customers perceive the reliability of your service, which is based on how critical parts of the service are to their experience. It also requires you to have responses ready to stop your service from crossing that threshold of unreliability.
The tools SRE uses to achieve these goals are SLOs, SLIs, and SLAs. SLIs are Service Level Indicators, metrics that are compared to the SLO and SLA. They can vary from simple metrics showing a service’s uptime to sophisticated metrics that capture an entire user journey. This allows you to measure how users perceive reliability. Then you use SLAs and SLOs to set standards for how reliable your service should be.
What is an SLI?
A "Service Level Indicator" is a metric that tracks how your users perceive your service, based on their usage.
SLIs are quantitative measures, typically provided through your APM platform. Traditionally, these refer to either latency or availability. Latency is defined as response time, including queue/wait time, in milliseconds. Availability is typically the proportion of time, or of requests, for which the service responds successfully. A composite SLI is a group of SLIs attributed to a larger SLO. These indicators are points on a digital user journey that contribute to customer experience and satisfaction.
When developers set up SLIs to measure their service, they do so in two stages:
SLIs that will directly impact the customer.
SLIs that directly influence the health and the availability or the latency and performance of certain services.
Once you have SLIs set up, you move into your SLOs, which are targets against your SLI.
What is an SLO?
A “Service Level Objective” (SLO) is an internal target for an SLI that reflects how customers use the service.
Service level objectives become the common language that lets teams set guardrails and incentives to drive high levels of service reliability.
Today many companies operate in a constantly reactive mode. They're reacting to NPS scores, churn, or incidents. This is an expensive, unsustainable use of time and resources, to say nothing of the potentially irrecoverable damage to customer satisfaction and the business. SLOs give you the objective language and measure of how to prioritize reliability work for proactive service health.
What is an SLA?
A “Service Level Agreement” (SLA) is a legal agreement between the business and the customer that includes a reliability target and the consequences of failing to meet it.
Service level agreements are set by the business rather than engineers, SREs, or ops. After your SLO is breached, you're at risk of breaching your SLAs. SLAs also spell out the actions taken when your agreed-upon standard of service isn't met, which often involve financial or contractual consequences.
SLAs vs SLOs vs SLIs
SLAs and SLOs are both reliability targets based on SLIs. Sometimes you will have both an SLO and an SLA for the same SLI. Both are monitored continuously, usually over a rolling monthly or weekly window. That is, each states that some metric must be above some threshold when looking at the last 30 or 7 days. They also each have policies that kick in at certain thresholds to prevent breaches. Despite these similarities, the purpose of each is very different.
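To make the rolling window concrete, here is a minimal sketch in Python. It assumes you already have one availability figure per day and that a plain average over the window is good enough; the class name, the 7-day window, and the 99.5% threshold are illustrative assumptions, not part of any particular monitoring tool.

```python
from collections import deque


class RollingWindowCheck:
    """Hypothetical helper: keeps the last `window_days` daily values."""

    def __init__(self, window_days: int, threshold: float):
        self.threshold = threshold
        # A deque with maxlen drops the oldest day automatically.
        self.days = deque(maxlen=window_days)

    def record_day(self, availability: float) -> None:
        self.days.append(availability)

    def current_value(self) -> float:
        # Unweighted mean of the daily values (an assumption; real
        # setups often weight by traffic instead).
        return sum(self.days) / len(self.days)

    def is_met(self) -> bool:
        return self.current_value() >= self.threshold


check = RollingWindowCheck(window_days=7, threshold=0.995)
for daily in [1.0, 1.0, 0.99, 1.0, 1.0, 1.0, 1.0]:
    check.record_day(daily)
met_before = check.is_met()

# One bad day pushes the oldest day out of the window and drags the average down.
check.record_day(0.95)
met_after = check.is_met()
print(met_before, met_after)
```

Because the window rolls, a single bad day keeps counting against the target until it ages out of the last 7 (or 30) days.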
Service Level Agreements
A service level agreement (SLA) is a legally binding agreement between an organization and their customers or end users. It guarantees that their service will meet certain agreed-upon reliability standards. These are usually built on simple, objective, and strictly-defined metrics. For example, you can have an SLA that says your service will be online 99.99% of the time. There’s no nuance or debate to this metric: the service is either online or offline.
Generally, SLAs are a type of key performance metric that can be monitored by outside observers, without needing to have internal monitoring data. This allows stakeholders to verify for themselves that the SLA is being met. In a situation where a company sets up an SLA, stakeholders include c-level executives, product, sales, and customer success teams, in addition to the customer. And of course, the SLA is also monitored by in-house SRE, DevOps, or other engineering teams to ensure there’s no risk of breaches.
SLAs are created through discussion between the organization and its stakeholders. As mentioned above, these stakeholders can include:
Investors in the company
These groups each might have their own needs for the SLA. For example, investors and customers may want a very strict SLA, to ensure the service is always available. On the other hand, engineering teams may want a more lenient SLA to allow for more errors. Factors such as unavoidable downtime for maintenance must also be considered. Negotiation and empathy are key to aligning everyone.
What makes a good SLA
A good SLA will be unambiguous, as it needs to be legally binding, while still considering all stakeholders’ needs and factors. Also because of its legal strictness, changing it takes a lot of time and effort. It should therefore change infrequently, and be fairly conservative - better to have something strict than have customers dissatisfied while still meeting the SLA.
Consequences for SLAs
SLAs will also usually include consequences for failing to meet them. Organizations may have to pay a penalty fee or refund users if their service isn’t reliable enough. With these external costs on top of the normal costs of incidents, it is very critical to meet your SLAs.
Service Level Objectives
Service level objectives (SLOs) are internal goals for your SLIs. They generally aren’t shared with external stakeholders and have no legal bindings or consequences. SLOs can help ensure that your SLAs aren’t breached, keeping you safe from legal trouble. Because of this, SLOs should always be made more strict than corresponding SLAs.
SLOs and user journeys
SLOs have other uses besides safeguarding the SLA. For example, they can measure more sophisticated SLIs than SLAs, as they don’t require legal precision or external monitoring. You can use complicated combinations of metrics, including ones only internally accessible, to capture how users experience your service. You can add weight to aspects of the service that are more critical or frequently used and remove weight from niche service areas.
This advanced SLI can represent users’ happiness in using your service. By knowing what’s most important to users, you’ll be able to properly understand the impact of incidents. If a service area that only 1% of customers use experiences an outage for 10 minutes, does that impact customers more negatively than a service that’s used by 90% of customers going down for 1 minute? SLIs and SLOs can give you the answer.
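One rough way to compare the two outages above is to multiply the share of users affected by the outage duration. This sketch assumes impact scales linearly with both, which is a simplification; the function name is hypothetical and your own SLI may weigh things differently.

```python
def impact_user_minutes(share_of_users: float, outage_minutes: float) -> float:
    """Share of the user base affected multiplied by outage duration."""
    return share_of_users * outage_minutes


# A niche area used by 1% of customers, down for 10 minutes.
niche = impact_user_minutes(0.01, 10)
# A core service used by 90% of customers, down for 1 minute.
core = impact_user_minutes(0.90, 1)

print("core outage hurts more" if core > niche else "niche outage hurts more")
```

Under this linear assumption the one-minute outage of the widely used service is the more damaging of the two, even though it is ten times shorter.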
SLO policies and error budgets
An SLO can also act as the gas pedal and brakes for development. You set the SLO to the point where unreliability will start causing customer pain. Like an SLA, this is usually expressed as a percentage to be reached over some rolling period. For example: 99% of the “photo upload user journeys” must occur at a satisfactory level over the last month. When incidents happen or the service is otherwise impacted, you check to make sure you’re still on pace to make that objective.
If you aren’t going to reach the target, SLOs slam on the brakes. Policies like code freezes or bug bashes stop further development, reducing the risk of further issues, and refocusing efforts toward making the current code more reliable. This keeps customers happy and prevents the chance of an SLA breach.
On the other hand, if you’re comfortably reaching your SLO, it’s time to hit the gas. Improving reliability beyond customers’ expectations has greatly diminishing returns. Customers will likely never notice the difference between 99.99% uptime and 99.999% uptime, so the time and effort spent on getting the extra nine is essentially wasted.
Instead, aim to just meet your SLO. Any excess you should think of as an “error budget”, which you can set up policies to help spend. These policies would increase development velocity by taking strategic and safe risks.
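As an illustration of how the excess can be quantified, the sketch below computes how much of an error budget is left for an event-based SLO. The function name and the sample figures are assumptions chosen for the example.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    # The budget is the number of bad events the SLO tolerates.
    allowed_bad = (1 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("no error budget available (100% target or no events)")
    return 1 - bad_events / allowed_bad


# A 99% SLO over 10,000 photo upload journeys tolerates about 100 bad ones.
remaining = error_budget_remaining(0.99, 10_000, 40)
print(f"{remaining:.0%} of the error budget is left")
```

A value well above zero suggests there is room to take risks; a value near or below zero is the signal to slow down.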
Building and changing SLOs
When you’re comfortably meeting your SLOs, another option is to change them to make them tighter — if you think your customers will notice and appreciate an increase in reliability. Conversely, if you think your customers will be just as satisfied with a more lenient SLO, you can loosen it to increase error budget and development velocity.
The important thing in both cases is to frequently review and adjust your SLOs. Unlike SLAs, which need to be formalized in legal agreements, SLOs can be changed entirely in-house. They should stay up-to-date with customer needs and development progress.
Like SLAs, your SLOs are fed by monitoring data to reflect the current health of your systems. As SLOs might reflect more complex things like user journeys, you may need tools to weigh and combine other metrics. SREs or engineers in SRE roles take on the responsibility of implementing and monitoring SLOs with feedback from the rest of the organization.
What makes a good SLO
A good SLO is one that reflects something meaningful about your service. Knowing how often a service goes down isn’t enough - you need to know how important that service is to users. By combining severity and importance, you’ll understand how incidents actually impact customer happiness.
Once you’ve determined that indicator, you should set the objective at the pain point of the customer. Meeting your SLO should mean that customers aren’t being pained by unreliability, but you aren’t overspending by making a service unnoticeably more reliable.
An SLO isn’t a “set it and forget it” tool. You won’t get it right the first time, or even the second time. Getting the most out of SLOs means iterating continuously.
Consequences of not meeting SLOs
As SLOs aren’t usually externally shared and aren’t legally binding, there aren’t any predefined repercussions for not meeting them. However, if your customers are being pained by unreliability, they’re much more likely to leave your service. Therefore you should expect that not meeting SLOs will negatively affect your business. Also, breaching your SLO puts you at risk of breaching your SLA, as the safeguard will be broken.
Service Level Indicators
Service level indicators are the metrics on which your SLOs are based. They can range from simple metrics, like availability and latency, to a complex combination of weighted averages reflecting how customers use your service.
Who needs SLIs?
Having meaningful metrics that reflect user satisfaction requires building SLIs. Anyone with a service that has varying use cases should use SLIs to understand what users expect from them in a quantifiable, trackable way.
How SLIs work
SLIs can range from very simple availability metrics to complex weighted metrics representing a user journey.
For simple SLIs, you can use monitoring tools directly to understand your service's uptimes, latency, error rate, and other simple metrics.
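In practice your monitoring tool reports these numbers for you, but a small sketch shows what is being computed. It assumes each request is recorded with a latency and a success flag; the `Request` type and the nearest-rank percentile method are choices made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # True when the request succeeded


def availability(requests: list[Request]) -> float:
    """Share of requests that succeeded (expects a non-empty list)."""
    return sum(r.ok for r in requests) / len(requests)


def error_rate(requests: list[Request]) -> float:
    return 1 - availability(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of latency in milliseconds."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


sample = [
    Request(120, True),
    Request(95, True),
    Request(480, True),
    Request(2000, False),
]
print(availability(sample), error_rate(sample), latency_percentile(sample, 50))
```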
For more complex metrics, first start by developing a user journey that reflects the steps a user would take for a common usage of your service. For example, all the steps required to add an item to your cart on an ecommerce website. Then, think about how important each step is to the user's experience. If the product search feature is a bit slow or misses some items, that isn't as impactful as not being able to log in or the service forgetting what's in the user's cart.
Take the relevant metrics for each step, and weigh them based on how important they are to the journey. Then combine all the weighted metrics to complete your sophisticated SLI.
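The weighting step might look something like the following sketch. The step names, success ratios, and weights are made up for illustration; choosing real weights is the user research work described in this article.

```python
def composite_sli(steps: dict[str, tuple[float, float]]) -> float:
    """Weighted average of per-step success ratios.

    Each value is a (success_ratio, weight) pair.
    """
    total_weight = sum(weight for _, weight in steps.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(ratio * weight for ratio, weight in steps.values())
    return weighted_sum / total_weight


# Hypothetical "add an item to your cart" journey: logging in and keeping
# the cart intact count three times as much as product search.
add_to_cart_journey = {
    "log_in": (0.999, 3.0),
    "search_product": (0.98, 1.0),
    "add_to_cart": (0.995, 3.0),
}
journey_sli = composite_sli(add_to_cart_journey)
print(f"journey SLI: {journey_sli:.4f}")
```

Because search carries less weight, its lower success ratio pulls the journey SLI down far less than an equally bad login step would.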
How to choose good SLIs
The important part of a good SLI is making sure it accurately reflects your users' expectations and priorities for your service. Do both qualitative and quantitative user research: track statistics about what users do, and talk to individual users to ask about their values.
SLIs aren't "set it and forget it". Regularly review this data to ensure your SLIs are still reflecting what matters most to users.
Good SLIs could include user journeys for your most popular features, user journeys for the users who spend the most on your service, and user journeys for the most common steps of using your service, such as logging in.
Challenges of SLIs
SLIs require a lot of monitoring tools to get the specific data required to build them. For example, you don't just need the overall availability of your product, but the availability for each service and subservice.
The other major challenge of SLIs is making sure they accurately reflect user expectations. You can't expect to get all the weighting and prioritizing right off the bat, so the key is continual revision and reviews.
SLA vs SLO vs SLI chart
Here is a chart summarizing the differences and similarities between SLAs, SLOs, and SLIs:
How these terms help with reliability: an example case study
Imagine an organization is looking to increase reliability. The company has recently begun investigating expensive SLA breaches and wants to know why its reliability is suffering. This organization breaches its SLA for availability almost every month. As it onboards more customers with SLAs, these expenses can grow if it doesn’t meet its performance guarantees.
This fictitious organization is also dealing with low NPS scores. The team is aware of the problem, but NPS scores are a lagging indicator with respect to customers that have already begun to churn. The team met to discuss what needs to be done. The first step to this is breaking down the company’s SLIs.
Identifying SLIs that matter to the user
The team knows it needs to examine availability and set SLOs for it, so it begins looking at the user journey. The QA team has already done some documentation, so the team refers to the user journeys outlined there and augments this documentation with their own journeys.
The team identifies critical points that receive the brunt of complaints. Team members also look into black box monitoring, a tactic that helps identify issues from a user’s perspective. With black box monitoring, the team acts as an external user of the service with no access to the internal monitoring tools. This allows team members to concentrate on a few metrics that directly correlate with user happiness.
After looking at the user journey, the team determines that the tabs on the expenses feature don’t load slowly individually, but when someone needs to skim through two or more pages, it becomes tedious. So the team also decides to create an SLO for response time at the load balancers.
Establishing corresponding SLOs
After the team determines its SLIs, it’s time to set up the SLOs. The team is looking at availability of the site (a common complaint), as well as the latency issue on the expense page. While the team plans to add more SLOs later, these two will serve as the guinea pigs.
For the latency issue, the team sets an SLO for all pages to load in under 1 second. This faster load time means that users won’t be irritated scrolling through multiple pages. The team then moves on to the availability SLO.
Based on traffic levels, customer usage, and NPS scores, the team has determined that its customers are likely to be happy with 99.5% availability. In addition, data from previous months suggests customer satisfaction and usage don’t seem to increase when uptime is greater than 99.9%. This means that 99.5% is a reasonable initial target, and there’s no reason at this point to optimize beyond 99.9%.
With SLOs in place, the team will need to work on what to do if these targets are missed by creating an error budget policy. This policy will detail:
The acceptable level of failure in the system over a given period of time (the error budget)
Alerting and on-call procedures for the service
Escalation policies in the event of error budget depletion
An agreement to halt feature development and focus on reliability after a certain amount of time where the error budget is exceeded.
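A policy like the one outlined above could be captured in a small structure so that everyone agrees on the thresholds in advance. This is only a sketch: the field names, the 50% escalation threshold, and the action strings are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudgetPolicy:
    slo_target: float       # e.g. 0.995 for 99.5% availability
    alert_at_spent: float   # fraction of budget spent that triggers escalation
    freeze_at_spent: float  # fraction spent that halts feature work

    @property
    def error_budget(self) -> float:
        """Acceptable level of failure over the SLO period."""
        return 1 - self.slo_target

    def action(self, budget_spent: float) -> str:
        """Map the fraction of budget spent so far to the agreed response."""
        if budget_spent >= self.freeze_at_spent:
            return "halt feature development"
        if budget_spent >= self.alert_at_spent:
            return "escalate to on-call"
        return "continue feature work"


policy = ErrorBudgetPolicy(slo_target=0.995, alert_at_spent=0.5, freeze_at_spent=1.0)
for spent in (0.2, 0.6, 1.2):
    print(spent, "->", policy.action(spent))
```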
Once everyone agrees, the SLOs are launched. The team watches carefully and iterates at the monthly error budget meeting. After a few months, the team feels confident enough to add more SLOs.
Agreeing on SLAs
SLAs are an external commitment, so they aren’t set the same way as SLOs. SLAs are a business agreement with users that dictates a certain level of service. The engineering team is aware of SLAs, but doesn’t set them. Instead, the team sets SLOs more stringently than the SLAs, giving themselves a buffer.
For example, the team’s 99.5% availability SLO means the service can only be down 3.65 hours per month. However, the SLA that the organization signs with users specifies that it must maintain a 99% availability. This means the service can be down 7.31 hours per month. The team has a buffer of 3.66 hours per month. Now, the team can work on new features with guardrails for reliability. The organization will benefit from happier users and the team has the confidence to innovate while remaining reliable.
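The downtime figures above follow from simple arithmetic, sketched here. The month length is an assumption (an average month of about 730.5 hours); small differences in the last digit come from rounding and from the month length you assume.

```python
# Average month length in hours (an assumption: 365.25 days / 12 months).
HOURS_PER_MONTH = 365.25 * 24 / 12


def allowed_downtime_hours(availability_target: float) -> float:
    """Hours per month the service may be down while still meeting the target."""
    return (1 - availability_target) * HOURS_PER_MONTH


slo_hours = allowed_downtime_hours(0.995)  # internal 99.5% SLO
sla_hours = allowed_downtime_hours(0.99)   # contractual 99% SLA
buffer_hours = sla_hours - slo_hours

print(f"SLO allows {slo_hours:.2f} h, SLA allows {sla_hours:.2f} h, "
      f"buffer is {buffer_hours:.2f} h")
```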
Starting off with SLIs, SLOs and SLAs can be tricky, but a culture of revision, iteration, and blamelessness will help you achieve your reliability goals. The Blameless SLO Manager helps teams set new SLOs and track how they are meeting their goals. It’s a great way to get started with SLOs, SLAs, and SLIs. The product guides you through everything from beginning to end. Smart wizards walk you through setting objectives and monitoring metrics, all the way to advanced techniques like capturing user journeys. See how in a demo!