Service Level Objectives, or SLOs, are internal goals for the essential metrics of a service, such as uptime or response speed. Most of us are familiar with this definition, but what is the value of setting these goals? We’ll look at SLOs as both a powerful safety net and a tool for allocating engineering resources, while also considering the cultural lessons of SLO adoption.
As SLOs are always set to be more stringent than any external-facing agreements you have with your clients (SLAs), they provide a safety net to ensure that issues are addressed before the user experience becomes unacceptable. For example, you may have an agreement with your client that the service will be available 99% of the time each month. You could then set an internal SLO where alerts activate when availability dips below 99.9%. This provides you a significant time buffer to resolve the issue before violating the agreement:
Service Level Agreement (external, with clients): 99% availability – 7.31 hours of acceptable downtime per month
Service Level Objective (internal): 99.9% availability – 43.83 minutes of acceptable downtime per month
Safety buffer: 6.58 hours
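The arithmetic behind this buffer is straightforward. As a minimal sketch (the function name is illustrative, and an average month of 730.5 hours is assumed):

```python
# Convert an availability target into allowable downtime per month.
# Assumes an average month of 730.5 hours (365.25 days / 12).
HOURS_PER_MONTH = 365.25 * 24 / 12  # 730.5

def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per month permitted by an availability target."""
    return HOURS_PER_MONTH * (1 - availability_pct / 100)

sla_downtime = allowed_downtime_hours(99.0)   # external agreement
slo_downtime = allowed_downtime_hours(99.9)   # internal objective
buffer = sla_downtime - slo_downtime

print(f"SLA allows {sla_downtime:.2f} h, "
      f"SLO allows {slo_downtime * 60:.2f} min, "
      f"buffer {buffer:.2f} h")
```

Each additional "nine" of availability shrinks the allowance by a factor of ten, which is why the internal objective leaves hours of slack against the external agreement.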
Knowing that you’ll have over six and a half hours between your internal objective and an agreement breach can provide some peace of mind as you deploy. However, it can be difficult to determine a buffer that provides sufficient time to respond when disruptions occur. Garrett Plasky, who previously led Evernote’s SRE team, describes this challenge:
“Setting an appropriate SLO is an art in and of itself, but ultimately you should endeavor to set a target that is above the point at which your users feel pain and also one that you can realistically meet (i.e. SLOs should not be aspirational).”
It may be tempting from a management perspective to set an SLO of 100%, but it just isn’t realistic. Development would be paralyzed by fear that the smallest change could trigger an SLO breach. Moreover, such a high target isn’t helpful. As Garrett points out, the SLO should be set above the point where users of the service feel pain; any refinement beyond that yields rapidly diminishing returns in user satisfaction. This points to another key element of choosing good service indicators and SLOs: they should be relevant to the users.
In a world of increasing system complexity, it’s tempting but ultimately impossible to measure everything. To set meaningful goals, you’ll need to understand which metrics really matter. As Charity Majors, CTO of Honeycomb, put it in an interview with InfoQ:
“In the chaotic future we're all hurtling toward, you actually have to have the discipline to have radically fewer paging alerts ... not more. People... lean heavily on over-paging themselves with clusters of tens or hundreds of alerts, which they pattern-match for clues about what the root cause might be."
SLOs are the perfect way to implement Charity’s advice. Each SLO requires choosing specific indicators and writing policy around them. By carefully considering deliberate objectives, hundreds of alerts and signals can be naturally pared down to the few that matter. Of course, this raises the question of how these indicators are chosen. Although good load balancing might affect the latency a user experiences, it’s several steps removed from that experience. Many other metrics contribute just as much to latency, and trying to peg your SLO to each and every one of them will generate far too much noise. Instead, the typical experience of a service can be consolidated into a few key metrics most relevant to user happiness, for example:
99.99% of requests successful
SLOs with the user experience in mind become an even more powerful safety net. You can feel confident that the metrics you’re most vigilantly monitoring are the best ones to ensure the happiness of your users. Moreover, you can establish different alerting and response workflows for incidents that don’t directly impact SLOs, allowing for better prioritization and reduced alert fatigue.
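A user-centric SLI like request success rate is simple to evaluate. As a minimal sketch, assuming `good` and `total` counts come from your metrics system (the function names here are illustrative, not any particular tool’s API):

```python
# Evaluate a request-success SLI against a 99.99% SLO.
def sli_success_ratio(good: int, total: int) -> float:
    """Fraction of successful requests over the measurement window."""
    return good / total if total else 1.0

def slo_met(good: int, total: int, objective: float = 0.9999) -> bool:
    """True when the measured SLI meets or exceeds the objective."""
    return sli_success_ratio(good, total) >= objective

# 10 million requests with 800 failures: 99.992% success, objective met.
print(slo_met(10_000_000 - 800, 10_000_000))
```

Because the SLI is a single user-facing ratio, one alert on this number can replace dozens of alerts on the internal components that feed into it.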
When creating SLOs, don’t “set it and forget it,” as things can get stale very quickly. A continuous cycle of review and revision is critical to success. As LinkedIn SRE leader Kurt Andersen explains:
“How would you come up with the best and most reasonable SLO? You can’t, not at the beginning. It’s hard to get the team’s buy-in for an arbitrary goal unless there’s a clear mechanism for revising the goal.”
This lifelong commitment isn’t a downside to SLOs, but rather a significant benefit that reflects changing realities. Reviewing and revising SLOs is an opportunity to confirm the priorities of the system. Having a practice in place to regularly work through why SLIs have been chosen and where SLOs have been set provides a jumping off point for meaningful discussion of larger questions. For example, evaluating if an SLI of latency on login is more appropriate than an SLI of latency for each page load can lead to reevaluating basic assumptions about how users access services.
Kurt also mentions the importance of buy-in for teams implementing SLOs. The value of having a safety net for downtime might be obvious to SREs and others in operations, but developers may hesitate when they consider the commitments. However, error budgeting can align incentives for both SREs and developers. Going back to our table, consider the 43 minutes of acceptable downtime as time you can spend pushing new features, even if they may cause some failed requests. For example, a proposed change could decrease average load time by several milliseconds but might cause brief outages during implementation. How do you know it’s safe to make this change?
With the mentality that any outage is unacceptable, it may be difficult to justify this decision. Error budgeting provides not just a justifying metric, but actually can encourage taking such risks. As Garrett explains, “SRE practices encourage you to strategically burn the budget to zero every month, whether it’s for feature launches or architectural changes. This way you know you are running as fast as you can without compromising availability.”
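Tracking the remaining budget makes this decision concrete. As a minimal sketch, mirroring the earlier table’s 99.9% monthly SLO (the constant and function names are illustrative, not a specific tool’s API):

```python
# Track how much of a monthly error budget has been burned.
# 99.9% availability over a 730.5-hour month leaves 43.83 minutes of budget.
BUDGET_MINUTES = 730.5 * 60 * 0.001  # 43.83

def budget_remaining(downtime_minutes: float) -> float:
    """Minutes of error budget left this month (negative means SLO breached)."""
    return BUDGET_MINUTES - downtime_minutes

# After a 30-minute risky deploy window, roughly 13.8 minutes remain,
# so another small change may still fit within the budget.
print(f"{budget_remaining(30.0):.2f} minutes remaining")
```

If the remaining budget stays well above zero month after month, that surplus is itself a signal: the team can afford to ship faster.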
With this mindset, developers shouldn’t view SLOs as a hindrance, but rather an opportunity to better understand how to maximize both innovation speed and quality. Even the overhead of meetings to refine SLOs can be reframed as important opportunities to accelerate development further. If SLOs are regularly being met, reviews can be the perfect time to relax your SLOs in favor of higher error budgets, giving the development team more freedom.
We’ve seen how SLOs can provide peace of mind for SREs and empower risk-taking in developers, but their true strength lies in the mentality behind them. Blameless engineer Kevin Greenan explains:
“Failure is the norm. We just have to accept it. Your internet carrier cannot deliver 100% uptime, so it’s not worthwhile for you to do so. When you set an SLO (e.g. 99% availability), you will have automatically embraced failure with an error budget (e.g. 1% downtime). Google, Facebook, and some of the best tech companies have embraced failure as the norm (even outside of software, as seen with hard drives with RAID), leading to phenomenal results like 99.999% availability.”
99.999% availability is certainly phenomenal, but it can never be achieved by hoping for 100%. This can be a difficult mentality to adopt. When asked how many minutes of outage a company could tolerate, Kevin predicts that most executives will answer “zero.” When the bar is unrealistic, leadership and team buy-in is difficult.
However, the data generated from SLO usage can be tremendously helpful in guiding leadership teams, creating a common language for understanding how to invest engineering resources and improve reliability over time. Garrett describes what happened when his team used graphs of SLOs to drive product road-mapping decisions: “Our SVP of Engineering [at the time] came to the service review meeting where I was presenting our performance against our SLOs. We had a not-so-great month. He jumped right in and asked ‘So what are we doing about it?’ He was asking all the relevant questions without needing any context. He just got it!”
Even skeptics will see the benefits of incorporating SLOs once they’ve experienced them first-hand. By creating a buffer around acceptability and empowering development through strategic burning of error budgets, anyone can become an SLO believer.