When learning SRE, you might find its principles a bit unintuitive. For example, it might be difficult to see why aiming for 100% reliability is wasteful, how reliability isn’t the same as availability, or why failure ought to be celebrated. Believe it or not, there is a method to these ideas. My goal in this article is to shed light on the principles and to leave you a believer, so that you’ll take steps towards starting SRE practices.
Often with complex concepts, it helps to use analogies that apply the concept in a different context. I’m personally a big fan of fighting video games. These are usually one-on-one games against another person where I get to decide every punch, kick, and fireball that leads me to victory. In those types of games, victory is about not being the first to run out of health. That person is the loser. Turns out, there are quite a few similarities between how you win at fighting games and executing successful site reliability engineering.
Let’s walk through some analogies to help you understand the philosophies behind SRE. Along the way, maybe you’ll improve your Street Fighter skills too! We’ll talk about:
- Getting the win with SLOs and error budgets
- Building a game plan and creating an SLI
- Saving mental energy with runbooks and set play
- Analyzing matches and incidents with retrospectives
- How to achieve a winning reliability mentality
Getting the win with SLOs and error budgets
One thing you learn early on with fighting video games is that it doesn’t matter whether you win with 1% of your health remaining or with 100% — a win is a win. Sure, a flawless victory is good for bragging rights, but that’s not really what’s important here.
Once you realize this, it revolutionizes how you strategize. Consider two strategies: one is risky and doesn’t avoid most attacks, but it deals a lot of damage to your opponent; the other will avoid more attacks, but it serves weaker attacks to your opponent. Which do you choose? As you progress through a game, an important factor to bear in mind is your current health. If your health is high, you can afford to fail a few times before you actually lose the round.
This is the mentality behind SLOs and error budgets. A service level objective (SLO) sets your reliability target, and the error budget is the amount of failure you can absorb before missing it: your health bar. In fact, 100% reliability is practically impossible; some amount of failure is inevitable. Plus, at a certain point, reliability improvements are imperceptible to users. Like trying to win a game with 100% of your health, not taking risks, especially when you can afford to, will only limit you. When you’ve got the error budget to burn, take a big swing! By taking risks and accelerating development, you can get the edge over your competitors.
On the other hand, when your health starts dwindling, it’s time to switch tactics. The number one priority is to not lose the match, so playing defensively and minimizing risks become the path to success. Breaching the SLO is like getting knocked out (KO’d) in a fighting game. When the error budget runs low, set up policies like code freezes to get back on track.
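To make the health bar idea concrete, here is a minimal Python sketch of an error budget and the stance it suggests. The class name, the function name, and the 25% low-water mark are assumptions made for illustration; your own policy thresholds will differ.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served so far in the SLO window
    failed_requests: int  # requests that failed in the same window

    def __post_init__(self) -> None:
        # A 100% target leaves no budget at all, so it is rejected here.
        if not 0 < self.slo_target < 1:
            raise ValueError("slo_target must be between 0 and 1 (exclusive)")

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO lets you fail.
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 is a full health bar; 0.0 or below means the SLO is breached.
        if self.allowed_failures == 0:
            return 1.0  # no traffic yet, so nothing has been spent
        return 1 - self.failed_requests / self.allowed_failures


def choose_stance(budget: ErrorBudget, low_water_mark: float = 0.25) -> str:
    """Pick a release stance from the remaining budget (threshold is hypothetical)."""
    if budget.remaining_fraction <= 0:
        return "knocked out: SLO breached"
    if budget.remaining_fraction < low_water_mark:
        return "play defensively: code freeze"
    return "take a big swing: ship features"
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed. At 200 failures most of the budget is intact and the sketch suggests shipping; at 900 it suggests a freeze.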
Building a game plan and creating an SLI
Let’s take a step deeper into the connection between SRE and fighting games by looking at what it means to strategize. In fighting games, your game plan could look something like this:
- Throw your projectile attack at the opponent to pressure them into making a move.
- When you’re able to take a risk, throw out your big attack.
- Watch out for their big attack, and be ready to counterattack after they try it.
Think about the importance of each step in leading you to success. The projectile attack is maybe half as important as being able to try your big attack, which is less important than avoiding the opponent’s big attack.
This mentality of breaking down intent, weighted by the importance of each goal, is very similar to user journeys and SLIs! These capture how users engage with your services and what’s most important to them when they do. Let’s use the example of an application that allows users to upload a picture to an editing service:
- The uploading tool should accurately reflect your current upload limits, say 99% of the time.
- The time it takes to upload a picture should be less than 10s per megabyte.
- The rate of successful uploads should be over 99%.
These reflect the steps that a user takes to upload an image. The importance of each isn’t equal. If the upload process itself fails and the user gives up on your service, that’s losing the game. But if the service only occasionally takes longer to upload, or the displayed upload limit is a bit out of date, your perceived reliability takes minimal impact.
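As a rough sketch, the three SLIs above could be computed from a log of upload events like this. The `UploadEvent` record and its field names are hypothetical; a real service would pull these values from its own telemetry.

```python
from dataclasses import dataclass


@dataclass
class UploadEvent:
    size_mb: float               # size of the picture
    duration_s: float            # how long the upload took
    succeeded: bool              # whether the upload completed
    limit_shown_correctly: bool  # whether the displayed upload limit was accurate


def compute_slis(events: list[UploadEvent]) -> dict[str, float]:
    """Return each SLI as the fraction of good events (0.0 to 1.0)."""
    total = len(events)
    if total == 0:
        raise ValueError("need at least one event")
    successes = [e for e in events if e.succeeded]
    # Speed is only measured on uploads that completed: under 10 s per megabyte.
    fast = [e for e in successes if e.duration_s < 10 * e.size_mb]
    return {
        "limit_accuracy": sum(e.limit_shown_correctly for e in events) / total,
        "upload_speed": len(fast) / len(successes) if successes else 0.0,
        "upload_success": len(successes) / total,
    }
```

Each value can then be compared with its target, such as 0.99 for the success rate.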
This sophisticated game plan elevates your SLOs to the next level. Rather than setting risk tolerance based solely on your service’s overall uptime (like a game player’s health bar), weigh it dynamically based on the level of impact each factor has on reliability.
If your projectile assault isn’t landing, or your database sometimes updates slowly, you might not need to change plans. But if you’re getting hit by big attacks and your health is draining fast, or users can’t even upload pictures, it’s time to put up the defenses as soon as possible.
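One way to picture this weighting is a single pressure score that counts a failing upload far more than a slow one. The sketch below is only an illustration: the weights, the 0.5 threshold, and the function names are all assumptions, not recommended values.

```python
# Hypothetical weights: how much each SLI matters to the user journey.
WEIGHTS = {"upload_success": 0.6, "upload_speed": 0.25, "limit_accuracy": 0.15}


def weighted_pressure(burn: dict[str, float]) -> float:
    """Combine per-SLI budget burn (0.0 = untouched, 1.0 = fully spent) into one score."""
    # Burn is capped at 1.0 so one badly breached SLI cannot exceed its weight.
    return sum(w * min(burn.get(name, 0.0), 1.0) for name, w in WEIGHTS.items())


def should_defend(burn: dict[str, float], threshold: float = 0.5) -> bool:
    """Switch to defensive play once the weighted score crosses the threshold."""
    return weighted_pressure(burn) >= threshold
```

Here slow uploads that have burned 80% of their budget do not trigger a change of plan, while upload failures at 90% burn do.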
The goal is to build metrics that reflect what you really care about — winning the match, or in the real world, keeping users happy. SLIs and game plans show you the difference between these metrics and your health bar or uptime.
Saving mental energy with runbooks and set play
Now that we’ve understood our game plan, let’s talk about what to do when things go wrong. Resolving incidents, especially the more critical ones, is a lot like facing a super strong opponent in a fighting game. The best trick is to create a guide ahead of time that walks you through how to handle the fight. This is especially useful for easing the cognitive burden when you’re knee-deep in an incident or in the middle of a fight. Put more simply, think more now and less later.
In fighting games, your character comes with a set of various attacks. To make a good offense against a strong opponent, you need to combine several of your attacks, aka doing a “combo”. It works for defense too. When you start to become familiar with your opponent’s sequence of attacks, you can prepare yourself with the right set of defensive tactics. These small situations are referred to as “set play”. You become a better player once you’re able to break down the game into these mastered situations.
For reference, here’s a diagram of set play showing what to do in Super Smash Bros. Melee in a particular situation, from this Reddit post.
Set play is similar to mastering runbooks for incident management. Runbooks are guides that navigate you to resolution. Runbooks are built and improved over time as teams continue to learn from incidents. As you experience incidents, like a new opponent, you learn what signs to look out for, what solutions seem to work well, and how to respond to the situation.
Fighting games don’t end after one opponent, and incident management doesn’t set out to prevent every future incident. Incidents will always occur unexpectedly, just like there will be new opponents that play in new ways. The goal of runbooks is to make the key steps as efficient as possible.
Levelling up runbooks is a lot like levelling up your “set play” too. As you practice combos, they’ll become so familiar that you could do them in your sleep. These are like mental shortcuts that let you do combos “automatically”, freeing up your focus for strategic challenges. Runbooks can get shortcuts too, through software tools that run the routine steps for you. Automating runbooks helps you navigate incidents smoothly and frees up even more mental energy.
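Here is a minimal sketch of what an automated runbook might look like in Python. It assumes each step is a plain function that reports success or failure; the `Step` alias, the `run_runbook` name, and the escalation message are hypothetical.

```python
from typing import Callable

# A runbook step is a named action that returns True when it succeeds.
Step = tuple[str, Callable[[], bool]]


def run_runbook(steps: list[Step]) -> list[str]:
    """Run steps in order; stop at the first failure and hand off to a human."""
    log = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            # Automation covers the practiced combo; people handle the rest.
            log.append("escalate to on-call engineer")
            break
    return log
```

The routine steps run without thought, like a practiced combo, and the responder is only pulled in when something unexpected happens.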
Analyzing matches and incidents with retrospectives
Fighting games at a competitive level are usually recorded and shared online, allowing gamers from around the world to study high profile matches. For the SRE, there’s no recording you can watch later and study. Instead, you should build incident retrospectives that capture important information about an incident.
Game play videos and incident retrospectives are not the same, but the best practices around studying them have many similarities. Let’s look at some here:
- Look into systemic causes. When you lose a match, treat it as a lesson rather than conceding “they’re better than me”. Don’t blame it on the video game or even a real person in the room “distracting” you. The same applies to incident management. Don’t assume there isn’t a solution and don’t deflect blame. Ask yourself why events played out the way they did, and make improvements for the next time.
- Break down things into chunks. When reviewing matches, you can home in on a single interaction and ask yourself, Why did I get hit? Why didn’t that combo work? And then work backwards to find answers. You can analyze incidents the same way, starting at one particular point of failure and exposing bugs or other areas for improvement.
- Work as a team. Group analysis with friends is common when reviewing fighting games. First off, it’s fun to have friends around. Second, it benefits you to hear another person’s ideas that you couldn’t figure out yourself. It’s the same with incidents. Swapping notes within and across teams makes everyone stronger.
- Be holistic. It goes without saying that nobody is perfect. You can’t expect to always play a perfect game. When studying past matches, you consider everything you experienced in the moment. You were tired, hungry, or stressed out. Give the same consideration to incidents. Identify factors, related to the job or even in personal lives, that can impact how the team responds to incidents. And practice grace.
- Build an action plan. Learnings should turn into positive change. As a gamer, when you keep messing up a particular combo, you take time to practice it. When it comes to services, if a particular area of the service keeps experiencing incidents, plan a bug bashing session for it. Stay positive, and remember that it’s well worth the investment.
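For the action plan, one simple way to find where to practice is to count which areas of the service keep showing up in retrospectives. This is a sketch that assumes each retrospective is tagged with one affected area; the function name and the threshold of three incidents are hypothetical.

```python
from collections import Counter


def bug_bash_candidates(incident_areas: list[str], min_incidents: int = 3) -> list[str]:
    """Return areas with at least min_incidents incidents, most affected first."""
    counts = Counter(incident_areas)
    return [area for area, n in counts.most_common() if n >= min_incidents]
```

Areas that cross the threshold are the combos you keep dropping, and they are good candidates for a bug bashing session.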
The goal is to use every incident, like every match, to make you stronger for the next one. Win or lose, minor or major incident, you can learn something new each time.
How to achieve a winning reliability mentality
Sure, the connection between fighting games and site reliability engineering may be a bit of a stretch. Thanks for humoring me. Still, online fighting gamers and SREs do share similar mentalities. They use different names to describe similar tactics: health vs. reliability, game plan vs. SLI, set play vs. runbooks, and game play videos vs. retrospectives. If you’re unfamiliar with SRE concepts, it’s not too hard to pick up on them.
A fundamental pillar of SRE is shared ownership of system reliability and the underlying culture behind that. As a gamer, having the determination to keep practicing, improving, and staying positive boosts you into becoming a better player. Your attitude influences your actions and vice versa. You’ll notice there’s a feedback loop that exists between process and culture that hopefully pushes your org toward growth and improvement.
On a final note, here’s one last parallel between fighting games and SRE. Fighting games usually offer a “practice mode” where you can spend time developing your skills and building set plays while exploring new environments. If you want to test out the SRE practices mentioned above, Blameless helps teams streamline everything from SLOs to runbooks to retrospectives. Try it out in “practice mode” for free and discover how you can become a reliability champion!