In SRE, we believe that some failure is inevitable. Complex systems receiving updates will eventually experience incidents that you can’t anticipate. What you can do is be ready to mitigate the damage of these incidents as much as possible.
One facet of disaster readiness is incident response - setting up procedures to solve the incident and restore service as quickly as possible. Another is lowering the chance of failure with tactics like eliminating single points of failure. Today, we’ll talk about a third type of readiness: having backup systems and redundancies to quickly restore function when things go very wrong.
Having backup systems can give organizations peace of mind: no matter what goes wrong, you just switch to the backup system for a bit, and then everything is fine... right? But will it really go so smoothly when disaster hits? In this blog post, we’ll help you ensure that your backup systems will perform as expected when you need them most by looking at:
- The value of running restoration drills
- Thinking holistically and laterally about incidents
- Increasing resilience through imagining and preparing for “black swan” events
The value of running restoration drills
Many organizations have backups for their data and infrastructure that they can switch to when the main system fails. But what exactly is this “switch”? Consider this story shared with me by an engineer about a nightmare outage caused by a total wipe of their databases. The team had backup databases in place, but they needed to be decompressed before they could be used. How long would that take? They had no idea.
This engineer’s story is all too common. Organizations feel secure that everything is backed up, but they can’t actually rely on those backups to be immediately available in a disaster. Also key in the engineer’s story is a lack of resources: as they didn’t have an in-house infrastructure team, they had to rely on a single inexperienced person following an out-of-date runbook.
The solution to this nightmare is running regular restoration drills. Simulate a situation where everything needs to get switched from the production systems to the backups. How long does it take you? Are there any obstacles you encounter that you can remove now? Also look at what resources you’re relying on. Are individual people being consulted for advice? Are you using infrastructure to access runbooks? What if those people were missing, or that infrastructure was also down? You’ll want to prepare for these possibilities too.
Once you’ve finished these drills, collectively review where improvements could be made. And then - this is the most important part - schedule the next drill. As the codebase changes and databases grow, keep making sure that backup restoration runs smoothly.
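To make "how long does it take you?" concrete, a drill can be wrapped in a small timing harness. The sketch below is only an illustration, not a prescribed tool: the step names, the placeholder step functions, and the one-hour target are all assumptions, and in a real drill each step would call your own restore tooling.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class StepResult:
    name: str
    seconds: float
    ok: bool
    error: str = ""


def run_drill(steps: List[Tuple[str, Callable[[], None]]]) -> List[StepResult]:
    """Run each restore step in order, timing it and recording any failure."""
    results: List[StepResult] = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            action()
            ok, error = True, ""
        except Exception as exc:  # a failed step is a drill finding, not a crash
            ok, error = False, str(exc)
        results.append(StepResult(name, time.perf_counter() - start, ok, error))
        if not ok:
            break  # assume later steps depend on earlier ones, so stop here
    return results


def summarize(results: List[StepResult], target_seconds: float) -> Dict[str, object]:
    """Compare the drill against a recovery-time target (hypothetical value)."""
    total = sum(r.seconds for r in results)
    failed = [r.name for r in results if not r.ok]
    return {
        "total_seconds": total,
        "failed_steps": failed,
        "met_target": not failed and total <= target_seconds,
    }


if __name__ == "__main__":
    # Placeholder steps; replace each lambda with a call to real restore tooling.
    drill = [
        ("decompress backup", lambda: None),
        ("load into standby database", lambda: None),
        ("verify row counts", lambda: None),
    ]
    report = summarize(run_drill(drill), target_seconds=3600)
    print(report)
```

Keeping the report from each drill gives the review meeting something to compare against, and shows whether restore time is creeping up as the databases grow.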
Don’t become complacent with untested backups. In a recent discussion, Blameless SRE Jake Englund put it this way: when it comes to having a backup policy, if you aren’t testing your restore process then you can’t be certain your backups are useful, and if you aren’t sure that your backups are useful then they probably aren’t.
Thinking holistically and laterally about incidents
When something goes wrong, it can be tempting to think in the singular: something goes wrong. The server goes down, a typo in code causes an error, high traffic causes latency, etc. But really, most incidents create a domino effect of other failures. When preparing for failure, it’s important to consider everything that could go wrong.
Here are some types of things to consider:
- Will your typical tools for communicating also be down?
- Will resources like runbooks be available if your tools go down?
- Will the services you use to restore backups also go down?
- Will people you expect to be able to respond be dealing with bigger priorities in the event of a major outage? Will they be available at all?
- Will engineering teams be stressed, burned out, and incapable of performing to their normal standards?
Each organization will have its own issues that could arise during incidents. Past incidents are your best teacher for finding them. Write incident retrospectives to investigate the causes and effects of incidents. Techniques like contributing factor analysis help you uncover these related issues.
Once you’ve identified these issues, make sure your backup plan compensates for them. Don’t leave anything out: consider every factor, from the technical to the personal. If there’s an in-house tool you use to spin up new servers, don’t assume you’ll have it. If engineers will be panicking when something goes wrong, make sure the solution is obviously marked and easily accessed.
Really think outside the box, and dig deep into your proposed solutions to uncover the ways they could fail. An example Jake shared is relying on backup generators arriving via truck as a solution to a power outage: what if the truck gets stuck in traffic or breaks down? Don’t be content with just one solution; have a solution for if your solution breaks, and have a backup for your backup.
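The "backup for your backup" idea can be written down as an ordered fallback chain: try the first option, and if it fails, move to the next while keeping a record of what broke. This is a minimal sketch under stated assumptions; the option names and the functions standing in for each power source are made up for illustration.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def first_working(options: List[Tuple[str, Callable[[], T]]]) -> Tuple[str, T, List[str]]:
    """Try each option in order.

    Returns the name and result of the first option that succeeds, plus a
    log of the options that failed before it. Raises if every option fails.
    """
    failures: List[str] = []
    for name, attempt in options:
        try:
            return name, attempt(), failures
        except Exception as exc:
            failures.append(f"{name}: {exc}")
    raise RuntimeError("every option failed: " + "; ".join(failures))


if __name__ == "__main__":
    # Hypothetical stand-ins for real recovery actions.
    def generator_by_truck() -> str:
        raise RuntimeError("truck stuck in traffic")

    def onsite_battery() -> str:
        return "running on battery"

    used, result, failed = first_working([
        ("generator by truck", generator_by_truck),
        ("on-site battery", onsite_battery),
    ])
    print(used, result, failed)
```

The failure log matters as much as the result: it is the list of solutions that did not hold up, which is exactly what a retrospective needs.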
Imagining and preparing for “black swan” events
A “black swan” event is one that is nearly impossible to predict or even imagine, but causes catastrophic damage. In retrospect, it may seem obvious that the black swan event was a possibility; however, before it happens, it’s unthinkable.
An example of a black swan event in tech is the October 2021 Facebook outage. Facebook didn’t prepare for their DNS servers becoming unreachable across the internet, nor could they imagine the many problems that came downstream from that - like being unable to physically enter their offices. If a normal incident creates a domino effect, a black swan event can be like knocking over a house of cards.
So how do you prepare for an unthinkable incident? One strategy involves getting creative. Jake shared an example from his time at Google: simulate that the entire Mountain View Google HQ has been hit by a meteor. During the practice response, stop yourself every time you try to contact someone there, access a server hosted there, or even rely on the bandwidth managed there. You can’t: it’s been hit by a meteor.
Now, are your headquarters actually going to be wiped off the map by a meteor? Almost certainly not. And if they were, would the branch departments really be scrambling to restore service? No, they’d likely have bigger concerns. But by attacking this worst-case scenario, you prepare yourself for other events that you couldn’t otherwise imagine.
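One way to prepare for a meteor-style exercise is to write down what depends on what, then ask what disappears when a single site is removed. The sketch below assumes a hand-maintained dependency map; every resource name in it is hypothetical. It is deliberately pessimistic: anything with even one dependency on a lost resource is treated as lost, with no credit given for redundancy.

```python
from typing import Dict, Set


def lost_resources(depends_on: Dict[str, Set[str]], destroyed: str) -> Set[str]:
    """Return everything that directly or indirectly depends on `destroyed`."""
    lost = {destroyed}
    changed = True
    while changed:  # repeat until no new losses are discovered
        changed = False
        for resource, deps in depends_on.items():
            if resource not in lost and deps & lost:
                lost.add(resource)
                changed = True
    lost.discard(destroyed)  # report only the knock-on losses
    return lost


if __name__ == "__main__":
    # Hypothetical map: resource -> the things it needs in order to work.
    dependencies = {
        "internal wiki": {"hq"},
        "runbooks": {"internal wiki"},
        "on-call lead": {"hq"},
        "pager service": {"third-party vendor"},
    }
    print(sorted(lost_resources(dependencies, "hq")))
```

Anything that shows up in the result is something the practice response should not be allowed to use.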
Jake emphasizes the importance of testing “not just for what you want to test”. The point of disaster preparedness isn’t to get the results you want, but to uncover vulnerabilities and drive systemic change. Jake describes this idea as differentiating between robustness - testing for everything you know could go wrong - and resilience - “testing what you hope you won’t have to know.” Generally, Jake finds that orgs are very good at the former and very bad at the latter.
Building resilience by testing for the unknown is a practice that requires iteration and reflection. There’s no one right way to do it, and no one right frequency to explore these scenarios. The important thing is to thoroughly document your process and results. Then analyze which types of experiments are yielding insights, and build future tests around them. Keep up the practice by always ensuring that the next experiment is scheduled when the last one finishes.
But what solution could possibly exist for these apocalyptic scenarios? Jake discussed one viewpoint, which may seem counterintuitive at first. Generally a path to maturity and growth for an organization involves first relying on third-party tooling, then building more and more tools and infrastructure in-house. Major enterprise organizations may build their own communication, alerting, and tracking tools.
However, black swan events suggest that there could be an even further stage of maturity: incorporating third-party tools as backups. If you can’t use your tools to solve issues with your tools, you should have some other tools at the ready. Of course, like any backup system, you’d need to run drills to ensure functionality will actually be restored by the switch.
Learning from past incidents is the best way to know if you’re ready for new ones. With Blameless SLOs and retrospectives, you can understand the impact and causes of every incident, from minor outages to meteor strikes. Find out how by checking out a demo!