
What Are Service-Level Objectives? Lessons Learned

Emily Arnott
|
1.21.2020

Service-level objectives, or SLOs, are internal goals for the essential metrics of a service, such as uptime or response speed. We’re familiar with this definition, but what is the value of setting these goals? We’ll take a look at SLOs as both a powerful safety net and a tool to inform the allocation of engineering resources. We'll also consider the cultural lessons of SLO adoption.

SLOs act as a safety net

SLOs are typically set to be more stringent than any external-facing agreements you have with your clients (SLAs). They provide a safety net to ensure that teams address issues before user experience becomes unacceptable. For example, you may have an agreement with your client that the service will be available 99% of the time each month. You could then set an internal SLO of 99.9%, with alerts that activate when availability dips below that level. This gives you a significant time buffer to resolve the issue before violating the agreement:

Service Level Agreement with Clients: 99% availability – 7.31 hours acceptable downtime per month

Service Level Objective: 99.9% availability – 43.83 minutes acceptable downtime per month

Safety Buffer: 6.58 hours
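The arithmetic behind these figures can be sketched in a few lines of Python. This is an illustrative sketch, not part of any tool: the function name `allowed_downtime_hours` is hypothetical, and the average month length of 30.44 days is an assumption chosen because it appears to reproduce the numbers above.

```python
# Assumption: an "average month" of 30.44 days (roughly 365.25 / 12).
HOURS_PER_MONTH = 30.44 * 24  # 730.56 hours


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_MONTH) -> float:
    """Return the downtime, in hours, permitted by an availability target.

    `availability` is a fraction between 0 and 1, e.g. 0.999 for 99.9%.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


sla_hours = allowed_downtime_hours(0.99)   # external agreement with clients
slo_hours = allowed_downtime_hours(0.999)  # stricter internal objective
buffer_hours = sla_hours - slo_hours       # time between SLO and SLA breach

print(f"SLA downtime allowance: {sla_hours:.2f} hours")
print(f"SLO downtime allowance: {slo_hours * 60:.2f} minutes")
print(f"Safety buffer:          {buffer_hours:.2f} hours")
```

Changing the assumed month length (for example, to a 30-day month) shifts each figure slightly, so it is worth agreeing on the measurement window before comparing numbers across teams.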

Knowing that you’ll have over six and a half hours between your internal objective and an agreement breach can provide some peace of mind as you deploy. But it can be difficult to determine a buffer that provides enough time to respond when disruptions occur. Garrett Plasky, who led Evernote’s SRE team, describes this challenge:

“Setting an appropriate SLO is an art in and of itself, but ultimately you should endeavor to set a target that is above the point at which your users feel pain and also one that you can realistically meet (i.e. SLOs should not be aspirational).”

It may be tempting from a management perspective to set an SLO of 100%, but it isn’t realistic. Fear that the smallest change could trigger an SLO breach would paralyze development. Moreover, such a high target isn’t helpful. As Garrett points out, the SLO should still be set above the point where the users of the service notice issues. Any refinement beyond that gives diminishing returns for user satisfaction. This points to another key element of choosing good service indicators and SLOs: they should be relevant to the users.

SLOs reduce pager fatigue

In a world of increasing system complexity, it’s tempting but impossible to measure everything. To set meaningful goals, you’ll need to understand what metrics matter. As Charity Majors, CTO of Honeycomb put it in an interview with InfoQ:

“In the chaotic future we're all hurtling toward, you actually have to have the discipline to have radically fewer paging alerts ... not more. People... lean heavily on over-paging themselves with clusters of tens or hundreds of alerts, which they pattern-match for clues about what the root cause might be.”

SLOs are the perfect way to follow Charity’s advice. Each SLO requires choosing specific indicators and writing policy around them. By setting deliberate objectives, you can pare down hundreds of alerts and signals to the few that matter.

Of course, this raises the question of how to choose these indicators. Although good load balancing might have an impact on the latency that a user experiences, it’s several steps removed from a user’s experience. There are many metrics that contribute to latency, and trying to peg your SLO to each one of them will generate far too much noise. Instead, distill the typical experiences of a service into a few key metrics that are most relevant to your users’ happiness:

99.99% of requests successful

SLOs with the user experience in mind become an even more powerful safety net. You can feel confident that the metrics you’re monitoring are the best ones to ensure the happiness of your users. Moreover, you can establish different alerting and response workflows for incidents that don’t impact SLOs, allowing for better prioritization and reduced alert fatigue.
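As a rough illustration of a user-facing indicator, the sketch below computes a request success rate and compares it with a 99.99% target. The function names, the request counts, and the choice to treat a window with no traffic as meeting the objective are all assumptions made for the example.

```python
def success_rate(successful: int, total: int) -> float:
    """Fraction of requests that succeeded in the measurement window."""
    if total < 0 or successful < 0 or successful > total:
        raise ValueError("counts must satisfy 0 <= successful <= total")
    if total == 0:
        # Assumption: with no traffic, nothing failed, so report 100%.
        return 1.0
    return successful / total


def meets_slo(successful: int, total: int, target: float = 0.9999) -> bool:
    """True when the observed success rate is at or above the SLO target."""
    return success_rate(successful, total) >= target


# Hypothetical request counts for two measurement windows.
good_window = meets_slo(successful=999_950, total=1_000_000)
bad_window = meets_slo(successful=999_800, total=1_000_000)

print("Window 1 meets SLO:", good_window)
print("Window 2 meets SLO:", bad_window)
```

Only the second window would warrant a page under this policy; anything that does not move the indicator can go through a lower-urgency workflow.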

Revisions provide product insight

When creating SLOs, don’t “set it and forget it,” as things can get stale. A continuous cycle of review and revision is critical to success. As LinkedIn SRE leader Kurt Andersen explains:

“How would you come up with the best and most reasonable SLO? You can’t, not at the beginning. It’s hard to get the team’s buy-in for an arbitrary goal unless there’s a clear mechanism for revising the goal.”

This lifelong commitment isn’t a downside to SLOs, but rather a significant benefit that reflects changing realities. Reviewing and revising SLOs is an opportunity to confirm the priorities of the system. Having a practice in place to work through SLIs and SLOs provides a jumping-off point for meaningful discussion of larger questions. For example, evaluating whether an SLI of latency on login is more appropriate than an SLI of latency for each page load can lead to reevaluating basic assumptions about how users access services.

Empowering Development Agility with SLOs

Kurt also mentions the importance of buy-in for teams implementing SLOs. The value of having a safety net for downtime might be obvious for SREs and other people in operations, but developers may hesitate when they consider the commitments. However, error budgeting, which treats the unreliability an SLO permits as a budget to spend, can align incentives for both SREs and developers.

Going back to our table, consider the 43 minutes of acceptable downtime as the amount of time you can spend on pushing new features, even if they may cause some failed requests. For example, a proposed change could decrease average load time by several milliseconds, but might cause brief outages during implementation. How do you know it’s safe to make this change?

With the mentality that any outage is unacceptable, it may be difficult to justify this decision. Error budgeting not only provides a justifying metric but can actually encourage taking such risks. As Garrett explains:

“SRE practices encourage you to strategically burn the budget to zero every month, whether it’s for feature launches or architectural changes. This way you know you are running as fast as you can without compromising availability.”

With this mindset, developers shouldn’t view SLOs as a hindrance, but rather as an opportunity to better understand how to maximize both innovation speed and quality. Even the meetings needed to refine SLOs are important opportunities to accelerate development further. If SLOs are being met, reviews can be the perfect time to relax your SLOs in favor of higher error budgets, giving the development team more freedom.
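The question of whether a risky change is safe to make can be framed as a budget check. The following is a minimal sketch under stated assumptions: the `ErrorBudget` class and its methods are hypothetical names, the month is assumed to average 30.44 days, and the downtime figures are invented for illustration.

```python
from dataclasses import dataclass

# Assumption: an "average month" of 30.44 days, expressed in minutes.
MINUTES_PER_MONTH = 30.44 * 24 * 60


@dataclass
class ErrorBudget:
    """Tracks downtime spent against the allowance implied by an SLO."""
    slo: float                                 # e.g. 0.999 for 99.9%
    period_minutes: float = MINUTES_PER_MONTH
    downtime_minutes: float = 0.0              # budget already consumed

    @property
    def total_minutes(self) -> float:
        """Total downtime the SLO permits over the period."""
        return (1.0 - self.slo) * self.period_minutes

    @property
    def remaining_minutes(self) -> float:
        """Budget left after the downtime recorded so far."""
        return self.total_minutes - self.downtime_minutes

    def record(self, minutes: float) -> None:
        """Record downtime from an incident or a risky rollout."""
        self.downtime_minutes += minutes

    def can_afford(self, expected_downtime_minutes: float) -> bool:
        """True if a change's expected downtime fits in the remaining budget."""
        return expected_downtime_minutes <= self.remaining_minutes


budget = ErrorBudget(slo=0.999)  # roughly 43.83 minutes for the month
budget.record(20)                # hypothetical incident earlier in the month

print(f"Remaining budget: {budget.remaining_minutes:.2f} minutes")
print("Ship a change expected to cost 10 minutes?", budget.can_afford(10))
print("Ship a change expected to cost 30 minutes?", budget.can_afford(30))
```

A time-based budget is only one option. The same check works with failed requests as the unit when the SLO is defined on request success rather than availability.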

The Cultural Lesson of SLOs

We’ve seen how SLOs can provide peace of mind for SREs and empower risk-taking in developers, but their true strength lies in the mentality behind them. Blameless engineer Kevin Greenan explains:

“Failure is the norm. We just have to accept it. Your internet carrier cannot deliver 100% uptime, so it’s not worthwhile for you to do so. When you set an SLO (e.g. 99% availability), you will have automatically embraced failure with an error budget (e.g. 1% downtime). Google, Facebook, and some of the best tech companies have embraced failure as the norm (even outside of software, as seen with hard drives with RAID), leading to phenomenal results like 99.999% availability.”

99.999% availability is phenomenal, but you can never achieve it by hoping for 100%. This can be a difficult mentality to adopt. When asked how many minutes of outage a company could tolerate, Kevin predicts that most executives will answer “zero.” When the bar is unrealistic, leadership and team buy-in is difficult.

But the data generated from SLO usage can be helpful in guiding leadership teams, creating a common language to understand how to invest engineering resources and improve reliability over time. Garrett describes what happened when his team used graphs of SLOs to drive product road-mapping decisions: “Our SVP of Engineering [at the time] came to the service review meeting where I was presenting our performance against our SLOs. We had a not-so-great month. He jumped right in and asked ‘So what are we doing about it?’ He was asking all the relevant questions without needing any context. He just got it!”

Even skeptics will see the benefits of incorporating SLOs once they’ve experienced them. By creating a buffer around acceptability and empowering development through strategic burning of error budgets, anyone can become an SLO believer.

If you’re interested in learning more about SLOs and other SRE practices, sign up for a demo.
