At Blameless, we value every opportunity to learn. Whether it’s taking time on Focus Fridays to attend a cool webinar, or conducting retrospectives for incidents, lost deals, events, and more, learning is core to our mission.
To learn even more about our craft, we decided to start a book club at Blameless. People from every team (engineering, sales, SRE, marketing, product, people, and more) attended. One of the books we’ve been reading together is none other than Alex Hidalgo’s Implementing Service Level Objectives.
Below is a summary of key topics from Alex’s book, along with thoughts our team had while reading. In this blog post, we’ll cover part one of Implementing Service Level Objectives, “SLO Development.”
This introductory chapter covers what Alex calls the reliability stack. The stack consists of three elements that build on top of each other: SLIs, SLOs, and error budgets. He details why these elements are important. As Alex writes, “It doesn’t matter if you can point to zero errors in your logs, or perfect availability metrics, or incredible uptime; if your users don’t think you’re being reliable, you’re not.”
The bottom line is that reliability is in the eyes of the user. SLIs, SLOs, and error budgets are only tools that help you provide the level of service your users expect.
Alex also emphasizes that the goal of reliability is NOT 100% uptime. As he states, “Not only is it impossible to be perfect, but the costs in both financial and human resources as you creep ever closer to perfection scale at something much steeper than linear.”
Instead, we should strive for good enough. This level of service keeps users happy, and gives engineers the room to make mistakes and learn.
In this chapter, Alex explains what reliability means to users as well as what reasonable expectations are for a service. While the past can’t always predict the future, it is important to know what the history of a service looks like.
As Alex said, “It doesn’t really matter if your users have been happy or upset with your service in the past: the important thing is to understand where you’ve been and where you are today.”
No service is doomed based on past performance, and no service is guaranteed to operate at the same level of reliability forever. However, knowing the typical level of reliability for a service can inform customer expectations. For example, if a service has had very little downtime, customers could become unhappy with the service if outages become frequent.
Alex also notes how important it is to share your goals with both internal and external stakeholders. As he says, “your goals are only partially as useful as they could be if they’re not discoverable by other people. Transparency with your users is a powerful tool.”
One member of our team explained how this chapter of the book helped transform the way he thinks about SLOs. “Everyone is talking about SLOs as if they’re kind of the golden ticket, but Alex changed my thinking. SLIs are the hardest part and the foundation.”
In this chapter, Alex homes in on how to create SLIs that correlate to customer happiness. He begins by laying out how SLIs can make for happier users, happier engineers, and a happier business as a whole. Here are the main arguments for each:
Alex then reminds us that if we’re feeling overwhelmed by the number of SLIs we could set, we can bundle them. He gives an example of this. “‘Are the payloads of the responses the data actually being requested?’ It turns out that if you can figure out a way to measure this, you’re also measuring ‘Are the responses in the correct data format?’ From a user’s perspective, you can’t possibly be receiving the correct data if the data isn’t formatted in the way you expect it to be.”
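To make the bundling idea concrete, here is a minimal sketch of our own (not code from the book). It assumes responses are JSON strings and that we know what data was requested; the names `is_good_response` and `bundled_sli` are hypothetical. Because a malformed body can never equal the expected data, the single "correct data" check also covers the "correct format" check.

```python
import json


def is_good_response(raw_body: str, expected: dict) -> bool:
    """Return True only if the body parses and carries the requested data.

    Hypothetical helper: a body in the wrong format fails here too,
    so no separate "is the format correct?" indicator is needed.
    """
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return False  # wrong format can never be the correct data
    return payload == expected


def bundled_sli(raw_bodies: list[str], expected: dict) -> float:
    """Fraction of responses that were good (good events / total events)."""
    if not raw_bodies:
        raise ValueError("need at least one response to compute an SLI")
    good = sum(is_good_response(body, expected) for body in raw_bodies)
    return good / len(raw_bodies)


# Illustrative data: two correct bodies, one wrong value, one malformed body.
responses = ['{"id": 1}', '{"id": 2}', "not json", '{"id": 1}']
print(bundled_sli(responses, {"id": 1}))
```

In a real system the "expected" data would come from whatever the user journey actually requested, which is where the up-front design work comes in.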
While Alex conveys a very difficult topic in a way that’s understandable, setting SLIs can be a challenge in practice.
One of Blameless’ Staff Engineers described why setting SLIs can be so complex. “The complexity of a service level indicator is often in its implementation, but more often than not in its design. Every service requires up-front thinking to analyze and design what an indicator represents.”
Another team member also noted why SLIs can be so tricky to get right. “It is a culture shift. You have to dedicate time with members of different departments to determine what user journeys are.”
The cross-functional alignment for SLIs is key, and plays a big part in setting up SLOs and error budgets as well. Alex addresses this in later chapters.
This chapter focuses on what makes for a good SLO. Alex writes about how to set targets for reliability, as well as how to deal with reliability for services you do not own.
When setting reliability targets, it’s not only about providing the best reliability. After all, the best is expensive, time-consuming, and difficult. Plus, your customers might not actually care. Or, if they get used to the heightened level of reliability, you could end up making outages more customer-impacting. It’s a game of tradeoffs.
As Alex writes, “Even if your SLO is published and discoverable, people are going to end up expecting that things will continue to be 99.99% reliable, because humans generally expect the future to look like the past. Even if it was true that in the past everyone was actually happy with 99.9%, their expectations have now grown.”
Additionally, it’s important to keep a pulse on customer happiness. One of Blameless’ SREs said, “People have to be mindful that your SLOs are there for customer satisfaction. The SLO needs to be a leading indicator. A more complex SLO based upon this or that SLI doesn't mean that it's a good SLO. If your error budget is depleted yet customers are saying nothing, then it’s likely a useless SLO.”
When setting up SLOs, it’s crucial to remember that SLOs can and will be wrong. You will need to revise them with time. The biggest sign that your SLO isn’t meeting your needs is qualitative feedback from customers. Another team member shared, “SLOs are a process, not a project. If you're only getting data rather than qualitative information, then you'll still have difficulty getting in touch with your customers' pain points.”
In this chapter, Alex also reveals how to deal with dependencies that affect your service’s reliability. “Imagine you have 40 total components, each of which promises a 99.9% reliability target and has equal weight in terms of how it can impact the reliability of the collective service. In such situations, the service as a whole can only promise much less than 99.9% reliability… So, 40 service components running at 99.9% reliability can only ensure that the service made up of these components can ever be 96% reliable.”
You’ll need to take dependencies like this into account when setting your reliability targets. Alex also notes that this is a great decision-making tool for evaluating all the investments supporting your operations. Consider what your customers need. If users need 97.5% reliability, then 96% won’t satisfy them. When your dependencies don’t allow you to reach the reliability you need, you can use this as a guideline to shop for new vendors and reevaluate contracts.
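The arithmetic behind that 96% figure is easy to check. The sketch below is ours, and it assumes what the quote assumes: components fail independently, carry equal weight, and all must work for the service to work. The function names are hypothetical.

```python
def composite_reliability(component_target: float, n_components: int) -> float:
    """Best reliability a service can promise when it needs every one of
    n independent components, each meeting component_target."""
    return component_target ** n_components


def required_component_target(service_target: float, n_components: int) -> float:
    """Per-component target needed to reach service_target (the inverse
    of composite_reliability, under the same independence assumption)."""
    return service_target ** (1 / n_components)


overall = composite_reliability(0.999, 40)
print(f"40 components at 99.9% -> {overall:.2%}")  # 96.08%

needed = required_component_target(0.975, 40)
print(f"Per-component target for a 97.5% service: {needed:.4%}")
```

Running the inverse makes the vendor conversation concrete: to offer 97.5% across 40 such components, each one has to do better than the 99.9% it promises today.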
SLOs are great for spurring data-driven decision making for more than just vendor discussions. Alex goes into using SLOs to have reliability conversations in the next chapter.
Error budgets are great, but did you know you can use them for more than just balancing feature work with reliability work? In this chapter, Alex describes the many applications of error budgets as well as what they look like in practice. While straddling innovation and reliability is a key use case, it’s certainly not the only one.
Alex also talks about how error budgets can be purposely burned down to accommodate valuable forms of experimentation such as chaos engineering, load testing, and more. They can even be employed to help encourage people to take enough vacation, or to help ensure tickets are completed within a certain timeframe. These applications all help keep teams on the same page and aware of how things are trending over time.
The most important use case of an error budget, though, is as a framework for communication. As Alex writes, “Using error budgets to make decisions is a sign that the maturity of the SLO culture in your organization is reaching high levels.”
However, getting to this level of maturity is hard, in part because capturing the time series data you need is easier said than done. SLOs and error budgets are iterative; to be successful, they need fine-tuning over time. Not only is it nearly impossible to get them right on the first try, but your service will also change over time. Even when teams reach a high maturity level, incidents or unexpected events will happen. The error budget will be consumed. But, as Alex reminds us, this is normal.
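For readers who want to see what "consuming" a budget looks like, here is a minimal sketch of our own. It assumes an event-based SLO (good requests out of total requests over some window) and uses the common definition of an error budget as the share of events allowed to fail, 1 minus the SLO target. The `ErrorBudget` class and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float   # e.g. 0.999 for a 99.9% target
    total_events: int   # events observed in the SLO window
    bad_events: int     # events that failed the SLI in that window

    @property
    def allowed_bad_events(self) -> float:
        """How many events may fail before the SLO is missed."""
        return (1 - self.slo_target) * self.total_events

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget left; negative once it is overspent."""
        allowed = self.allowed_bad_events
        if allowed == 0:
            # A 100% target (or no traffic) leaves no budget to spend.
            return 0.0
        return 1 - self.bad_events / allowed

    @property
    def is_exhausted(self) -> bool:
        return self.remaining_fraction <= 0


# Illustrative numbers only: one million requests against a 99.9% target.
budget = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=250)
print(f"Budget remaining: {budget.remaining_fraction:.0%}")  # 75%
print(f"Exhausted: {budget.is_exhausted}")
```

A number like "75% remaining" is what turns the budget into a communication tool: it gives teams a shared figure to point to when deciding whether to ship, experiment, or focus on reliability work.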
After reading part one of Implementing Service Level Objectives, our team is eager to read more. Alex has taught us a lot, inspired great discussions, and brought home the most important concept: the customer is always right.