With remote work becoming more common, and distributed teams the norm, incident response has become even trickier. Years ago, everyone would gather in a war room and sort through the issue together, boots on the ground. Now, things have shifted. Remote work is only projected to increase, and teams need to be able to adapt in order to resolve incidents quickly and efficiently, even if team members are a thousand miles away. But how can we make great incident response a reality?
There are three components to being exceptional during an incident, and these components are crucial whether you’re in the office or at home working from your couch.
So, there’s a new incident. It’s only natural to be a little nervous when you hear that your service is having an issue, but keeping a level head during this tough time is key. To truly be exceptional at handling incidents, it’s important to know what you’re dealing with and react accordingly.
Is your incident a Sev 0, or a Sev 3? Acknowledging the difference can change the entire tone of the incident. For instance, do you need to call the entire team on the weekend to help take care of this, or can one on-call member handle it until people log in Monday morning? If too many people spend weekend after weekend fighting fires, eventually you’ll face widespread burnout, especially if half of those fires were slow burns rather than emergencies.
According to Amy Tobey, one way to tell what sort of incident you’re having is to consider the customer impact. If no customers are impacted, the incident will likely take a lower priority. However, if half of your customers are impacted, it’s time to call/Slack for backup and get the team in a virtual war room immediately.
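To make the customer-impact rule of thumb concrete, here is a minimal sketch of how a team might encode it. Only the two endpoints come from the advice above (no customers impacted means lower priority, half of customers impacted means call for backup). The function name `triage`, the 10% middle threshold, and the mapping onto Sev 0 through Sev 3 are assumptions for illustration, not a standard.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    severity: str    # hypothetical label, "Sev 0" (worst) to "Sev 3"
    page_team: bool  # True means pull the team into the virtual war room now


def triage(impacted_customers: int, total_customers: int) -> Triage:
    """Map customer impact to a severity level (thresholds are assumptions)."""
    if total_customers <= 0:
        raise ValueError("total_customers must be positive")
    share = impacted_customers / total_customers
    if share >= 0.5:
        # Half or more of customers impacted: call for backup immediately.
        return Triage("Sev 0", page_team=True)
    if share >= 0.1:
        # Assumed middle tier: still worth paging the team.
        return Triage("Sev 1", page_team=True)
    if share > 0:
        # Some impact, but one on-call engineer can likely handle it.
        return Triage("Sev 2", page_team=False)
    # No customers impacted: lower priority.
    return Triage("Sev 3", page_team=False)
```

Your own thresholds will depend on your service and your customers. The point is to agree on them before the incident, so nobody has to debate severity while the fire is burning.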
However, no matter how severe the incident is, it’s important to keep calm and carry on. Emergency room doctor Dan Dworkis, MD PhD, wrote a piece on how to respond productively when things go wrong. He states, “The first step is to acknowledge that what happened was, in fact, bad.” Of course, we want fewer incidents, as we want to minimize customer impact. But we need to know how to go about resolving them without losing our minds. Dan suggests addressing this by using the phrase “Well, this is suboptimal.”
Dan gives an example of a car accident for how this mantra can come in handy. Imagine you’re involved in a car accident and your tire is damaged. You can’t just continue to drive on it as normal, as the car won’t function properly. Something bad happened, and it needs to be addressed. However, this doesn’t warrant you stepping out of your car in the middle of a busy street screaming and crying; it’s just an issue with a tire, nobody died. The middle ground is to say “Well, this is suboptimal” and begin to resolve the issue. Having this level-headed mindset during an incident can be a massive boon to your team, especially when you’re working together to decide what level of response is necessary for a particular incident. Situational awareness is key.
This one should come as no surprise, especially in the context of remote work. Communicating during an incident is a necessity, and when teams are distributed, it can be especially challenging to know what’s being done and by whom. Great incident response means communicating with teammates and superiors/customers as needed to ensure that everyone is on the same page.
Great incident response is built on procedures, and a very important part of communicating with your team is letting them know what step of the procedure you’re working on. To begin, let your team know that you’re listening, active, and responding to the incident by checking in. Checking in, either on Slack or in your incident management platform, lets your team know that you’re on board. Just that simple gesture can create a lot of solidarity.
Another important part of incident response is streamlining communication with affected parties and internal stakeholders. Managers will want to be looped in on developments during the resolution process. If the incident is large enough, executives and customers will want to know the status of the service as well. There are two components to making sure this happens:
Incidents are tricky, and bad communication will only make them harder. Instead, focus on working together as a team and talking through the whole process, even though you may be in totally different cities or countries, to improve your response.
Every engineer makes mistakes; it’s how lessons are learned. When an incident happens, it’s easy to place blame on the last person who pushed code. However, people are never the root cause of an incident; processes are. To be great at incident response, you will need to be compassionate in the face of these mistakes and seek to learn from them.
Issues won’t just cause incidents; they’ll pop up during incidents. Sometimes a fix can cause more damage to a service than it repairs, and you’ll need to learn to have compassion during these moments too. Instead of getting angry with a team member, remember that they are just trying to help. Everyone is making the decisions they feel are best at that moment in time with the information they have. Support one another. The occasional emoji or gif here and there can help create a sense of camaraderie and communicate that you understand that all mistakes were made with good intentions.
And once the incident is all said and done, it’s important that you take a closer look at it to learn. Great incident management comes from treating each incident as a learning opportunity and gleaning all possible insights from the event. This will help you be more successful at resolving future incidents, and possibly prevent them from happening in the first place.
Process is important here, too. Just because you and your team learned something doesn’t mean everyone else has. Often only the people involved in the incident learn from it, while the rest of the information is buried in files or simply forgotten. This problem is only exacerbated for distributed teams. To make sure that all your progress is captured, you need to write a comprehensive, narrative postmortem. That means aggregating the key components, such as graphs and timelines, to form a complete story of what happened. With more data at hand, a clearer narrative begins to form, so that teams have the context to discuss what happened without placing blame.
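As a rough illustration of what aggregating a timeline can look like, the sketch below merges events from several sources into one chronological list. The `Event` type, the `build_timeline` function, and the source names in the usage example are all hypothetical. Real incident tooling will have its own formats.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    at: datetime   # when it happened
    source: str    # where the record came from, e.g. "alerts" or "chat"
    note: str      # what happened, described without assigning blame


def build_timeline(*sources: Iterable[Event]) -> List[str]:
    """Merge events from every source into one chronological narrative."""
    merged = sorted(
        (event for source in sources for event in source),
        key=lambda event: event.at,
    )
    return [f"{e.at:%H:%M} [{e.source}] {e.note}" for e in merged]
```

For example, with made-up data:

```python
alerts = [Event(datetime(2024, 1, 1, 9, 2), "alerts", "error rate alarm fired")]
chat = [
    Event(datetime(2024, 1, 1, 9, 5), "chat", "on-call checked in"),
    Event(datetime(2024, 1, 1, 9, 40), "chat", "rollback completed"),
]
for line in build_timeline(chat, alerts):
    print(line)
```

Keeping the notes factual ("rollback completed" rather than who broke what) helps the timeline support a blameless discussion.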
An important thing to note from all three of these components is that they can all be learned. With experience, you can become a great incident commander as well. You can learn how impactful an incident is through participating in incidents and reading postmortems. You can practice networking, writing swiftly and effectively, and inspiring people while keeping them focused on the task at hand. You can be intentional about language, look at things from different perspectives, and focus on improving processes without blaming people.
If you need a little help getting started, click here to register for our live incident management training with some experts who have been doing this for decades. Blameless’ Staff SRE Amy Tobey and Head of CRE Geoff White will be hosting this virtual workshop on March 26. Space is limited, so be sure to register for your seat today.