With remote work and distributed teams as the norm, incident response is trickier. Years ago, everyone would gather in a war room and sort through the issue together, boots on the ground. Now, things have shifted. Teams need to adapt to resolve incidents, even if team members are a thousand miles away. But how can we make great incident response a reality?
There are three components to being exceptional during an incident. These components are crucial whether you’re in the office or at home working from your couch.
Ability to recognize how bad the situation is, and prioritize it
So, there’s a new incident. It’s only natural to be a little nervous, but keeping a level head during this tough time is key. To be exceptional at handling incidents, it’s important to know what you’re dealing with and react accordingly.
Is your incident a Sev 0, or a Sev 3? Acknowledging the difference can change the entire tone of the incident. Do you need to call the entire team on the weekend, or can one on-call member handle it until Monday morning? If too many people spend weekends responding to incidents, they'll burn out. This is especially true if a large percentage of the incidents could have waited until the work week.
According to Amy Tobey, one way to tell what sort of incident you’re having is to consider the customer impact. If there is no customer impact, the incident will take a lower priority. But, if there are high rates of customer impact, it’s time to call/Slack for backup.
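To make the idea concrete, here is a minimal sketch of impact-based triage. It assumes you can estimate the fraction of customers affected. The function name, the `Triage` type, and every threshold are hypothetical placeholders rather than values from Amy or any particular tool, so tune them to your own severity definitions.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    severity: int          # 0 is the most severe, 3 the least
    page_whole_team: bool  # True: call for backup now; False: on-call handles it


def triage(affected_customer_fraction: float) -> Triage:
    """Map estimated customer impact to a severity and a paging decision.

    All thresholds below are illustrative assumptions, not recommendations.
    """
    if not 0.0 <= affected_customer_fraction <= 1.0:
        raise ValueError("affected_customer_fraction must be between 0 and 1")
    if affected_customer_fraction == 0.0:
        # No customer impact: lowest priority, can wait for the work week.
        return Triage(severity=3, page_whole_team=False)
    if affected_customer_fraction < 0.05:
        return Triage(severity=2, page_whole_team=False)
    if affected_customer_fraction < 0.25:
        return Triage(severity=1, page_whole_team=True)
    # High rates of customer impact: time to call for backup.
    return Triage(severity=0, page_whole_team=True)
```

Writing the rule down, even this roughly, means nobody has to argue about whether to wake the team at 2 a.m. on a Saturday.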
No matter how severe the incident is, it’s important to keep calm and carry on. Emergency room doctor Dan Dworkis, MD PhD, wrote a piece on how to respond productively when things go wrong. He states, “The first step is to acknowledge that what happened was, in fact, bad.”
Of course, we want fewer incidents, as we want to minimize customer impact. But we need to know how to go about resolving them without losing our minds. Dan suggests addressing this by using the phrase “Well, this is suboptimal.”
Dan gives an example of a car accident for how this mantra can come in handy. Imagine you’re involved in a car accident that damages your tire. You can’t continue to drive on it as normal, as the car won’t function. Something bad happened, and you need to address it. But, this doesn’t warrant you stepping out of your car in the middle of a busy street screaming and crying. It’s an issue with a tire, nobody died. The middle ground is to say “Well, this is suboptimal” and begin to resolve the issue.
Having this level-headed mindset during an incident can be a massive boon to your team, especially when you’re working together to decide what level of response is necessary. Situational awareness is key.
Effective communication skills
This one should come as no surprise, especially in the context of remote work. Communicating during an incident is a necessity. With distributed teams, it can be especially challenging to know who is doing what. Great incident response means communicating with teammates, superiors, and customers as needed. This ensures that everyone is on the same page.
Great incident response is built on procedures. A very important part of communicating is telling your team what step of the procedure you’re working on. To begin, let your team know that you’re listening, active, and responding to the incident by checking in. Checking in, either on Slack or in your incident management platform, lets your team know that you’re on board. That simple gesture can create a lot of solidarity.
Another important part of incident response is streamlining communication with affected parties and internal stakeholders. Managers will want to be looped in on developments during the resolution process. If the incident is large enough, executives and customers will want to know the status of the service as well. There are two components to making sure this happens:
Designating someone to take the reins for communicating to stakeholders. Managers don’t want repeat messages, or worse, mixed messages from different people. And customers won’t want tons of emails about your outage if nothing has improved. Selecting a single person to communicate developments minimizes the chances of wasteful overlaps.
Communicating developments with your team and remembering to tag or @ the communications lead to ensure they see what’s going on. It's important to update your communication lead on your progress. This gives your stakeholders accurate visibility into the incident.
Incidents are tricky, and bad communication will only make them harder. Instead, focus on working together as a team and talking through the whole process. With remote work, this is especially important, as you may be in different cities or even countries.
Compassionate responses to mistakes and a learning mindset
Every engineer makes mistakes; it’s how we learn. When an incident happens, it’s easy to place blame on the last person who pushed code. But, people are never the root cause of an incident; processes are. To be great at incident response, you will need to be compassionate in the face of these mistakes and learn from them.
Issues won’t only cause incidents; they’ll pop up during incidents. Sometimes a fix can cause more damage to a service than it repairs. You’ll need to learn to have compassion during these moments, too. Instead of getting angry with a team member, remember that they are trying to help. Everyone is making the decisions they feel are best at that moment. Support one another. The occasional emoji or GIF here and there can help create a sense of camaraderie. It also helps communicate that you know all mistakes were made with good intentions.
And once the incident is all said and done, it’s important that you take a closer look at it to learn. Great incident management comes from treating each incident as a learning opportunity. This will help you be more successful at resolving future incidents, and can even prevent some from happening.
Process is important here, too. Just because you and your team learned something doesn’t mean everyone else has. In fact, often only the people involved in the incident learn from it. The rest of the information is buried in files or forgotten. This problem is only exacerbated for distributed teams.
To make sure you capture what you learned, write a comprehensive incident retrospective. Aggregate the key artifacts, such as graphs and timelines, into a narrative of what happened. With more data at hand, a clearer story begins to form and teams gain context without placing blame.
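As a rough illustration of the aggregation step, the sketch below merges notes from several places into one time-ordered list you could paste into a retrospective. The `Event` type, its fields, and `build_timeline` are hypothetical names made up for this example. Events are labeled by source rather than by person, which keeps the narrative about what happened instead of who did it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    at: datetime  # when it happened
    source: str   # where the note came from, e.g. "alert" or "chat"
    note: str     # what happened


def build_timeline(events):
    """Return one formatted line per event, ordered by timestamp."""
    ordered = sorted(events, key=lambda e: e.at)
    return [f"{e.at:%H:%M} {e.source}: {e.note}" for e in ordered]


if __name__ == "__main__":
    sample = [
        Event(datetime(2024, 1, 1, 10, 5), "chat", "Rollback started"),
        Event(datetime(2024, 1, 1, 10, 0), "alert", "Error rate spiked"),
    ]
    for line in build_timeline(sample):
        print(line)
```

A real retrospective needs more than a sorted list, but getting events from different channels into one ordered view is usually where the story starts to take shape.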
Great incident response is within your grasp
An important thing to note from all three of these components is that they are teachable. With experience, you can become a great incident commander as well. You can learn about incidents through participation and reading retrospectives. You can practice networking and inspiring people while keeping them focused. You can be intentional about language, look at things from different perspectives, and focus on improving processes without blaming people.