Even the best software fails sometimes. How quickly those failures get addressed, and how your teammates and customers feel about you afterward, comes down to how well you communicate with them. Users, customer success managers, Ops team members, IT, security, engineering leadership, and even the executive team each have a vested interest in resolving engineering incidents quickly, and all need to be updated with the right information at the right time.
When a system goes down, a poor framework for communication can create chaos across an organization and distract the teams working to resolve the problem.
Throughout this guide, we’ll delve deeper into the importance of incident communication, how it works, steps to take, best practices, and more, to help your team master incident communication.
The Importance of Incident Communication
When working to efficiently resolve incidents, communication must be purposeful, reliable, and consistent. An incident response communication plan ensures everyone involved in incident response:
Is aware of the issue
Knows the current status of the issue
Understands the steps to take for a resolution
Stays updated on actions taken and planned
A quick response limits damage and the potential costs of an incident, including lost data, income, and productivity.
Preparing for Incident Communication
Before an incident begins, there are things your team must prepare if they hope to respond effectively. Trying to establish an effective process for incident communication, while responding to a high severity incident, is like trying to repair a plane while in flight. Implementing incident communication best practices requires a little research into your team’s needs before something breaks.
Define What Constitutes an Incident for Your Team
Incident management looks different depending on the system or product in use. An incident is any situation disrupting your system or software. You can define ‘what is an incident’ based on the risk to your team, users, investors, and system.
Determine How You Categorize Incidents
The impact of an incident is measured by its severity level. A low number means a high impact. It’s useful to make a chart of each level and the type of incident it covers. For example:
Determine Your Team’s Roles and Responsibilities for Incident Response
Each department or team member should have a distinct role in incident response, whether it’s monitoring, reporting, or resolving. The best way to ensure your team is prepared and organized is to:
Preparation is key to successfully navigating any incident type.
Incident Communication Steps
Once an incident is categorized and the appropriate team or team members understand their roles, communication steps are implemented. These should be outlined in your incident response communication plan. Some steps to include in your strategy are:
1. Categorize the Incident
There are two main ways to categorize the incident:
Categorize by incident type
Categorize by incident severity
Refer to your incident severity level chart (or add to it if needed) to categorize the current issue.
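As a rough illustration of categorizing by both type and severity, the sketch below encodes a severity chart in Python. The three levels, the field names, and the thresholds are assumptions made up for this example, not a standard or a Blameless feature. Replace them with the levels your own chart defines.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale: a lower number means a higher impact."""
    SEV1 = 1  # critical, e.g. a full outage or data loss
    SEV2 = 2  # major, e.g. a core feature degraded for many users
    SEV3 = 3  # minor, e.g. limited impact with a workaround available


@dataclass
class Incident:
    title: str
    incident_type: str          # e.g. "outage", "security", "performance"
    users_affected_pct: float   # share of users affected, 0 to 100
    data_loss: bool = False


def categorize(incident: Incident) -> Severity:
    """Map an incident to a severity level using illustrative thresholds."""
    if incident.data_loss or incident.users_affected_pct >= 50:
        return Severity.SEV1
    if incident.users_affected_pct >= 10:
        return Severity.SEV2
    return Severity.SEV3


incident = Incident("Checkout API errors", "outage", users_affected_pct=35.0)
print(incident.incident_type, categorize(incident).name)
```

Keeping the thresholds in one function means the chart and the code can be reviewed together whenever your definition of an incident changes.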
2. Select Appropriate Incident Communication Channels
Knowing the type and severity of the incident helps you identify the proper channel for communication. You’ll know whether it goes to the DevOps team, IT, operations, upper management, or even customers. Make the most of this process by:
Selecting appropriate communication tools (Slack, etc.) for different scenarios
Utilizing multiple channels for redundancy and wider reach
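One way to make channel selection repeatable is a small routing table keyed by severity. The sketch below is only an assumption about how such a table might look. The audiences, channel names, and the three-level scale are hypothetical, and no real messaging API is called.

```python
# Hypothetical routing table: severity level -> who to notify and how.
# Level 1 is the highest impact (a low number means a high impact).
ROUTING = {
    1: {"audiences": ["response team", "upper management", "customers"],
        "channels": ["slack", "sms", "email"]},
    2: {"audiences": ["response team", "upper management"],
        "channels": ["slack", "email"]},
    3: {"audiences": ["response team"],
        "channels": ["slack"]},
}


def plan_notifications(severity: int) -> list[tuple[str, str]]:
    """Return every (audience, channel) pair to notify for a severity level."""
    route = ROUTING.get(severity)
    if route is None:
        raise ValueError(f"unknown severity level: {severity}")
    return [(audience, channel)
            for audience in route["audiences"]
            for channel in route["channels"]]


for audience, channel in plan_notifications(2):
    print(f"notify {audience} via {channel}")
```

Listing more than one channel for the higher severity levels reflects the redundancy point above: if one channel is down or ignored, the message still has another route.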
3. Identify Predefined Communication Templates
An incident response communication template saves time that is better spent resolving the incident. These templates are customized by:
Blameless offers several types of templates for DevOps or SRE teams, including incident response and communication.
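To show what a predefined template can look like in practice, here is a minimal sketch using Python's standard string.Template. The field names and the wording of the message are assumptions for illustration only. They are not taken from any Blameless template.

```python
from string import Template

# Hypothetical status-update template; every $field must be supplied.
STATUS_UPDATE = Template(
    "[$severity] $title\n"
    "Status: $status\n"
    "Impact: $impact\n"
    "Next update: $next_update"
)


def render_update(**fields: str) -> str:
    """Fill in the template; raises KeyError if a field is missing."""
    return STATUS_UPDATE.substitute(**fields)


print(render_update(
    severity="SEV2",
    title="Checkout API errors",
    status="Investigating",
    impact="Some customers cannot complete purchases",
    next_update="in 30 minutes",
))
```

Using substitute rather than safe_substitute is a deliberate choice here: a missing field fails loudly before the message is sent, instead of reaching the audience with a raw placeholder in it.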
4. Alert the Incident Response Team
An incident response team is a predetermined group that reacts to incidents at certain levels of severity. Your organization may have response teams that manage all incidents, or one team for minor incidents and another for moderate to severe ones.
Alerting the team includes:
Implementing pre-planned alerting mechanisms
Ensuring timely response to incident notifications
Communication is a key factor in the organization and implementation of the response teams.
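A pre-planned alerting mechanism usually includes an escalation policy, so an unacknowledged alert reaches someone else after a set delay. The sketch below illustrates that idea with a hypothetical policy. The role names and the 10- and 20-minute delays are assumptions, and the actual paging call is left out.

```python
# Hypothetical escalation policy: (minutes after the alert, who gets paged).
ESCALATION_POLICY = [
    (0, "primary on-call"),
    (10, "secondary on-call"),
    (20, "engineering manager"),
]


def who_to_page(minutes_since_alert: int, acknowledged: bool) -> list[str]:
    """Return everyone who should have been paged by now.

    Once someone acknowledges the alert, escalation stops.
    """
    if acknowledged:
        return []
    return [role for delay, role in ESCALATION_POLICY
            if minutes_since_alert >= delay]


print(who_to_page(12, acknowledged=False))
```

Stopping escalation on acknowledgement keeps the timely-response goal measurable: the time between the first page and the acknowledgement is the team's response time.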
Incident Communication Best Practices
Incident communication involves many steps and protocols. To best support your organization and response teams, it helps to have a list of incident communications best practices. Some common best practices to include are:
Clearly defining communication channels and escalation paths
Using clear and concise language
Being empathetic during communication
Tailoring language to suit the audience (no jargon or department-specific acronyms)
Maintaining transparency so all team members know what is expected of them
Incident communication is not only internal. It often involves alerting users to potential downtime or disruptions. Clear communication ensures trust and credibility are maintained, despite the issue.
What to Look for in Incident Communication Platforms
Incident communication platforms are a type of software or website designed to guide and automate incident communication. When seeking the right platform for your organization or system, there are some key factors to consider.
Here are some must-have features and functionalities for efficient communication:
The faster your response team is alerted to an incident, the faster the incident is resolved. Quick turnaround times build trust in your brand and enhance authority and credibility in your industry.
Centralized communication hub
A communication system with a centralized hub helps consolidate all incident-related communication. Response teams, management, and anyone else involved in the response process can see updates, add comments, and communicate in real-time.
Templates go a long way to improve consistency and speed up the communication and response process. Templates include basic information necessary to inform on a variety of incidents and are further customizable for specificity.
Communication channel capability
Choose communication channels based on speed, privacy level, and system compatibility. Some examples of communication channels include email, SMS, voice, and Slack.
Collaboration tools facilitate swift decision-making and resolution. There are many collaboration tools to choose from, including New Relic, Salute Safety, Swimlane, and Freshservice.
Integration, Analytics, and Reporting
Communication platforms with strong analytics and reporting integrate with your applications and workflows. This brings data sources together for comprehensive metric reports.
Blameless Platform’s Incident Communication Capabilities
Throughout this guide, we explored incident communication. Incident communication is important for sharing awareness of the issue, its current status, the steps toward resolution, and the actions taken and planned. It keeps all relevant departments and individuals informed of the incident.
Some of the topics covered in this guide include:
Categorizing incidents by severity and type
Incident communication templates for streamlined communication
What to look for in an incident communication platform
Incident communication best practices
At Blameless, we offer reliability management solutions for engineering, DevOps, and IT Operations teams, such as incident management, incident response, CommsFlow, incident retrospectives, reliability insights, and SLO manager. One of the areas Blameless specializes in is the control of incident communication.
With Blameless you can inform stakeholders, create visibility, create and locate incident communication templates, and leverage swimlanes for faster, more efficient incident communication.
"I have less anxiety being on-call now. It’s great knowing comms, tasks, etc. are pre-configured in Blameless. Just the fact that I know there’s an automated process, roles are clear, I just need to follow the instructions and I’m covered. That’s very helpful."
"I love the Blameless product name. When you have an incident, "Blameless" serves as a great reminder to not blame anything or anyone (not even yourself) and just focus on the incident resolving itself."