When learning SRE, you might find its principles a bit unintuitive. For example, it might be difficult to understand why aiming for 100% reliability is wasteful, how reliability isn’t the same as availability, or why failure ought to be celebrated. Believe it or not, there is a method to these ideas. My goal in this article is to shed light on the principles and to leave you a believer, ready to take your first steps toward adopting SRE practices.
Often with complex concepts, it helps to use analogies that apply the concept in a different context. I’m personally a big fan of fighting video games. These are usually one-on-one games against another person where I get to decide every punch, kick, and fireball that leads me to victory. In those types of games, victory is about not being the first to run out of health. That person is the loser. Turns out, there are quite a few similarities between how you win at fighting games and executing successful site reliability engineering.
Let’s walk through some analogies to help you understand the philosophies behind SRE. Along the way, maybe you’ll improve your Street Fighter skills too! We’ll talk about:
One thing you learn early on with fighting video games is that it doesn’t matter whether you win with 1% of your health remaining or with 100% — a win is a win. Sure, a flawless victory is good for bragging rights, but that’s not really what’s important here.
Once you realize this, it revolutionizes how you strategize. Consider two strategies: one is risky and doesn’t avoid most attacks, but it deals a lot of damage to your opponent; the other will avoid more attacks, but it serves weaker attacks to your opponent. Which do you choose? As you progress through a game, an important factor to bear in mind is your current health. If your health is high, you can afford to fail a few times before you actually lose the round.
This is the mentality behind SLOs and error budgets. In fact, 100% reliability is practically impossible; some amount of failure is inevitable. Plus, past a certain point, reliability improvements are imperceptible to users. Like trying to win a game with 100% of your health, refusing to take risks, especially when you can afford to, will only limit you. When you’ve got the error budget to burn, take a big swing! By taking risks and accelerating development, you can get the edge over your competitors.
On the other hand, when your health starts dwindling, it’s time to switch tactics. The number one priority is to not lose the match, so playing defensively and minimizing risks become the path to success. Breaching the SLO is like getting knocked out (KO’d) in a fighting game. When the error budget runs low, set up policies like code freezes to get back on track.
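To make the health bar idea concrete, here is a minimal sketch of an error budget check in Python. The class name, the policy labels, and the 25% freeze threshold are assumptions made up for illustration; they are not taken from any particular SRE tool, and a real policy would be agreed on by your team.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one SLO window (illustrative only)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the window so far
    failed_requests: int   # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO lets you fail.
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 is a full health bar; 0.0 or less means the budget is gone.
        if self.allowed_failures <= 0:
            return 0.0  # a 100% target leaves no budget at all
        return 1 - self.failed_requests / self.allowed_failures


def release_policy(budget: ErrorBudget, freeze_threshold: float = 0.25) -> str:
    """Pick a play style from the remaining budget (threshold is an assumption)."""
    remaining = budget.remaining_fraction
    if remaining <= 0:
        return "budget_exhausted"   # KO: the SLO is breached or has no slack
    if remaining < freeze_threshold:
        return "code_freeze"        # low health: play defensively
    return "take_risks"             # plenty of health: take the big swing


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=200)
    print(release_policy(budget))
```

With a 99.9% target over a million requests, roughly 1,000 failures are allowed, so 200 failures leaves most of the health bar intact and the sketch recommends taking risks.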
Let’s take a step deeper into the connection between SRE and fighting games by looking at what it means to strategize. In fighting games, your game plan could look something like this:
Think about the importance of each step in leading you to success. The projectile attack is maybe half as important as being able to try your big attack, which is less important than avoiding the opponent’s big attack.
This mentality of breaking down intent, weighed by the importance of each goal, is very similar to user journeys and SLIs! These capture how users engage with your services and what’s most important to them when they do. Let’s use the example of an application that allows users to upload a picture to an editing service:
These reflect the steps that a user takes to upload an image. The importance of each isn’t equal. If the upload process itself fails and the user gives up on your service, that’s losing the game. But if the service only occasionally takes longer to upload, or the displayed storage is a bit out of date, your perceived reliability takes minimal impact.
This sophisticated game plan elevates your SLOs to the next level. Rather than setting risk tolerance based solely on your service’s overall uptime (like a game player’s health bar), weigh it dynamically based on the level of impact each factor has on reliability.
If your projectile assault isn’t landing, or your database sometimes updates slowly, you might not need to change plans. But if you’re getting hit by big attacks and your health is draining fast, or users can’t even upload pictures, it’s time to put up the defenses as soon as possible.
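As a rough sketch of what weighing each factor could look like, the Python below combines per-step SLIs into one weighted score and flags the steps that have fallen below their own targets. The step names, weights, and targets are hypothetical numbers chosen for the picture upload example, not recommended values.

```python
def weighted_sli_score(slis: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-step SLIs; heavier steps move the score more."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(slis[step] * weight for step, weight in weights.items()) / total_weight


def steps_needing_defense(slis: dict[str, float], targets: dict[str, float]) -> list[str]:
    """Return the journey steps whose SLI is currently below its own target."""
    return [step for step, target in targets.items() if slis[step] < target]


if __name__ == "__main__":
    # Hypothetical weights: a failed upload hurts far more than stale storage info.
    weights = {"upload_succeeds": 6, "upload_is_fast": 3, "storage_display_is_fresh": 1}
    targets = {"upload_succeeds": 0.995, "upload_is_fast": 0.90, "storage_display_is_fresh": 0.80}
    slis = {"upload_succeeds": 0.999, "upload_is_fast": 0.95, "storage_display_is_fresh": 0.90}

    print(weighted_sli_score(slis, weights))
    print(steps_needing_defense(slis, targets))
```

Keeping a separate target per step is a design choice: a single blended score can hide a failing upload behind healthy but less important steps, so the sketch reports both.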
The goal is to build metrics that reflect what you really care about — winning the match, or in the real world, keeping users happy. SLIs and game plans show you the difference between these metrics and your health bar or uptime.
Now that we’ve understood our game plan, let’s talk about what to do when things go wrong. Resolving incidents, especially the more critical ones, is a lot like facing a super strong opponent in a fighting game. The best trick is to create a guide ahead of time that walks you through how to handle the fight. This is especially useful for reducing cognitive load when you’re knee-deep in the incident or the fight. Put more simply, think more now and less later.
In fighting games, your character comes with a set of various attacks. To make a good offense against a strong opponent, you need to combine several of your attacks, aka doing a “combo”. It works for defense too. When you start to become familiar with your opponent’s sequence of attacks, you can prepare yourself with the right set of defensive tactics. These small situations are referred to as “set play”. You become a better player once you’re able to break down the game into these mastered situations.
Set play is similar to mastering runbooks for incident management. Runbooks are guides that navigate you to resolution. Runbooks are built and improved over time as teams continue to learn from incidents. As you experience incidents, like a new opponent, you learn what signs to look out for, what solutions seem to work well, and how to respond to the situation.
Fighting games don’t end after just one opponent, and incident management doesn’t set out to prevent every future incident. Incidents will always occur unexpectedly, just like there will be new opponents that play in new ways. The goal of runbooks is to make the key steps of the response as efficient as possible.
Levelling up runbooks is a lot like levelling up your “set play” too. As you practice combos, they’ll become so familiar that you could do them in your sleep. These are like mental shortcuts that let you do combos “automatically”, freeing up your focus for strategic challenges. Runbooks can be turned into shortcuts too, by automating their steps with software tools. Automated runbooks help you navigate incidents smoothly, freeing even more mental energy.
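Here is one way an automated runbook might be sketched in Python: an ordered list of steps that run one after another, stop at the first step that fails, and leave a log behind for later review. The `Step` type, the `run_runbook` function, and the example step names are all hypothetical; a real runbook tool would call your own checks and remediation scripts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One move in the combo: a name plus an action that reports success."""
    name: str
    action: Callable[[], bool]  # returns True when the step succeeded


def run_runbook(steps: list[Step]) -> list[tuple[str, str]]:
    """Run steps in order, stopping at the first failure, and return a log."""
    log: list[tuple[str, str]] = []
    for step in steps:
        try:
            succeeded = step.action()
        except Exception as exc:
            # An unexpected error ends the automated part; a human takes over.
            log.append((step.name, f"error: {exc}"))
            break
        log.append((step.name, "ok" if succeeded else "failed"))
        if not succeeded:
            break
    return log


if __name__ == "__main__":
    # Placeholder actions; real ones would query monitoring or restart services.
    runbook = [
        Step("check_upload_service_health", lambda: True),
        Step("restart_upload_workers", lambda: True),
        Step("verify_uploads_recovering", lambda: False),
        Step("close_incident", lambda: True),
    ]
    for name, outcome in run_runbook(runbook):
        print(f"{name}: {outcome}")
```

Stopping at the first failed step keeps the automation from charging ahead on bad assumptions, and the returned log doubles as raw material for the retrospective.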
Fighting games at a competitive level are usually recorded and shared online, allowing gamers from around the world to study high profile matches. For the SRE, there’s no recording you can watch later and study. Instead, you should build incident retrospectives that capture important information about an incident.
Game play videos and incident retrospectives are not the same, but the best practices around studying them have many similarities. Let’s look at some here:
The goal is to use every incident, like every match, to make you stronger for the next one. Win or lose, minor or major incident, you can learn something new each time.
Sure, the connection between fighting games and site reliability engineering may be a bit of a stretch. Thanks for humoring me. Still, online fighting gamers and SREs do share similar mentalities. They use different names to describe similar tactics: health vs. reliability, game plan vs. SLI, set play vs. runbooks, and game play videos vs. retrospectives. If you’re unfamiliar with SRE concepts, it’s not too hard to pick up on them.
A fundamental pillar of SRE is shared ownership of system reliability and the underlying culture behind that. As a gamer, having the determination to keep practicing, improving, and staying positive boosts you into becoming a better player. Your attitude influences your actions and vice versa. You’ll notice there’s a feedback loop that exists between process and culture that hopefully pushes your org toward growth and improvement.
On a final note, here’s one last parallel between fighting games and SRE. Fighting games usually offer a “practice mode” where you can spend time developing your skills and building set plays while exploring new environments. If you want to test out the SRE practices mentioned above, Blameless helps teams streamline everything from SLOs to runbooks, retrospectives, and more. Try it out in “practice mode” for free and discover how you can become a reliability champion!