When things go wrong, we try to learn for the next time. Every incident should be a learning opportunity to make your system more reliable for the future. Luckily, with Blameless Reliability Insights, you can see patterns in incidents at a glance, right out of the box. The ability to tag incidents makes reliability data even more helpful by letting you collect granular details about reliability, especially as they pertain to your unique business needs.
The final piece of the puzzle is combining your incident retrospectives with your reliability insights. Your retrospective templates can include custom questions, like “where did conversations about the incident happen?” or “what positions did incident responders hold?” Your reliability insights dashboards can gather the answers to these questions, providing information on trends and statistics.
Combining the per-incident qualitative data of retrospectives with the multi-incident quantitative data of reliability insights gives you a complete picture of your system’s health. We’ll get you started with some types of sample questions. Try these out, then review whether they’re capturing the info you need and adjust.
Questions about incident responders
Understanding who works on incidents and what they do helps you plan on-call schedules, develop roles and checklists, and assess which types of incidents create the most work. Here are some example questions that uncover people’s contributions:
What position/tenure/seniority did incident responders hold?
Which responders were called in after the initial response?
Why was each additional responder called in (for example, holding authority or specialized knowledge)?
How many responders played the following roles:
Executing diagnostic runbooks
Making changes in the codebase
Communicating incident information to stakeholders
Observing the incident progress but not directly contributing
Gathering this data will highlight incidents that require an unusually high number of contributions, especially from people who are senior or otherwise in demand. These incidents are more costly because they delay the high-impact planned work of those engineers. Finding patterns in these incidents will help you better prepare for them.
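To make this concrete, here’s a minimal sketch of how answers to the responder questions could be tallied once they’re exported from your retrospectives. The record layout, field names, and role labels below are assumptions made for illustration, not a Blameless export format, so adapt them to however your data actually looks.

```python
from collections import Counter

# Hypothetical retrospective answers: one record per incident, listing each
# responder's seniority and the role they played. All field names and role
# labels here are illustrative assumptions.
incidents = [
    {"id": "INC-1", "responders": [
        {"seniority": "senior", "role": "codebase_change"},
        {"seniority": "junior", "role": "runbook"},
    ]},
    {"id": "INC-2", "responders": [
        {"seniority": "senior", "role": "codebase_change"},
        {"seniority": "senior", "role": "stakeholder_comms"},
        {"seniority": "senior", "role": "observer"},
        {"seniority": "junior", "role": "runbook"},
    ]},
]


def summarize(incident):
    """Return the responder count, senior count, and per-role tally."""
    responders = incident["responders"]
    roles = Counter(r["role"] for r in responders)
    senior = sum(1 for r in responders if r["seniority"] == "senior")
    return {
        "id": incident["id"],
        "total": len(responders),
        "senior": senior,
        "roles": roles,
    }


def rank_by_senior_load(incidents):
    """Sort incidents so those pulling in the most senior people come first."""
    summaries = (summarize(i) for i in incidents)
    return sorted(summaries, key=lambda s: (s["senior"], s["total"]), reverse=True)


ranked = rank_by_senior_load(incidents)
for s in ranked:
    print(s["id"], "responders:", s["total"], "senior:", s["senior"], dict(s["roles"]))
```

The incidents at the top of this ranking are the ones worth a closer look, since they pulled the most in-demand people away from planned work.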
Questions about how the incident was resolved
Each incident will require a unique solution. However, patterns will emerge as you study how incidents were solved. Your incident retrospective captures the full narrative of the solve, and custom questions can highlight common aspects across multiple incidents. You can codify steps that are consistently useful into processes, or even automate them.
How many communication channels (private and public) were used while resolving the issue?
What areas of the codebase were changed during the resolution? During follow-up actions?
What other incidents were occurring at the same time as this incident?
What incident response policies were relevant to the solution?
It can be difficult to compare apples to apples when assessing how incidents were solved. Looking at how long incidents with different policies or procedures take to solve may seem to indicate how efficient those processes are, but don’t assume too much. Instead, use these trends to highlight and investigate outlier incidents, then see where improvements could be made.
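One way to surface outliers without over-reading the averages is to compare each incident only against others that followed the same policy. The sketch below does this with the median and the median absolute deviation (MAD), which are less distorted by a single extreme incident than the mean. The policy names, durations, and the threshold of 3 are all made-up assumptions, so treat this as a starting point rather than a definitive method.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (incident id, response policy, minutes to resolve).
records = [
    ("INC-1", "rollback-first", 30),
    ("INC-2", "rollback-first", 35),
    ("INC-3", "rollback-first", 32),
    ("INC-4", "rollback-first", 180),
    ("INC-5", "escalate-early", 60),
    ("INC-6", "escalate-early", 55),
    ("INC-7", "escalate-early", 65),
]


def find_outliers(records, threshold=3.0):
    """Flag incidents whose duration is far from their own policy group's median.

    Distance is measured in units of the group's median absolute deviation.
    The threshold is an arbitrary assumption; tune it for your data.
    """
    groups = defaultdict(list)
    for incident_id, policy, minutes in records:
        groups[policy].append((incident_id, minutes))

    outliers = []
    for policy, items in groups.items():
        durations = [minutes for _, minutes in items]
        med = median(durations)
        mad = median(abs(minutes - med) for minutes in durations)
        if mad == 0:
            # Durations are too uniform to measure spread; skip this group.
            continue
        for incident_id, minutes in items:
            if abs(minutes - med) / mad > threshold:
                outliers.append((incident_id, policy, minutes))
    return outliers


outliers = find_outliers(records)
print(outliers)
```

With this sample data, only INC-4 is flagged. That doesn’t mean the policy failed; it means INC-4’s retrospective is the one to reread first.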
Questions about the incident aftermath
Ultimately, you want to understand incidents in the context of customer happiness – that will reveal their true business cost. Some customer impact can be assessed with metrics like downtime or data lost, but there’s also a qualitative side. Looking for the types of incidents that cause poor customer responses can help you regain trust, get ahead of churn, and prioritize fixes.
What customer complaints did the incident generate?
What PR statements were made?
Was the retrospective published publicly?
Were any SLAs breached?
What prospective clients were considering your product at the time of the incident? Were prospective sales compromised?
Understanding how the incident affected current and potential future customers will allow you to find the incidents with the biggest business impact. It isn’t always the incident with the longest downtime or the most services impacted. If a service critical to certain customers has a brief reduction in speed at a critical time, it could be more upsetting to them than a major outage of an unpopular service. This customer context shows you the best way to triage and prioritize.
It isn’t always easy to turn qualitative data from retrospective questions into patterned quantitative data. You need to understand where to bucket similar answers and where to delineate. Also, remember that there’s a tradeoff between having a lot of questions and having a retrospective template that’s quick to complete. Don’t expect to get everything right off the bat. Experiment with how your answers are parsed until useful patterns emerge.
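As a starting point for that experimentation, here’s a rough sketch of bucketing free-text answers to a question like “where did conversations about the incident happen?” and counting how often each bucket comes up. The bucket names and keywords are assumptions chosen for the example, and simple substring matching is deliberately crude; expect to revise both as you see which answers land in the wrong place.

```python
from collections import Counter

# Hypothetical mapping from bucket name to keywords that place a free-text
# answer in that bucket. Revise these as real answers come in.
BUCKETS = {
    "chat": ("slack", "teams", "chat"),
    "video_call": ("zoom", "meet", "call"),
    "ticket": ("jira", "ticket"),
}


def bucket_answer(answer):
    """Return every bucket whose keywords appear in the answer.

    An answer can land in several buckets. Answers that match nothing are
    labeled "unbucketed" so they can be reviewed by hand.
    """
    text = answer.lower()
    matched = {
        bucket
        for bucket, keywords in BUCKETS.items()
        if any(keyword in text for keyword in keywords)
    }
    return matched or {"unbucketed"}


def tally(answers):
    """Count how many answers fall into each bucket."""
    counts = Counter()
    for answer in answers:
        counts.update(bucket_answer(answer))
    return counts


answers = [
    "Mostly Slack, then a Zoom bridge",
    "Jira ticket comments",
    "Slack #incident channel",
    "Hallway conversation",
]
counts = tally(answers)
print(dict(counts))
```

Keep an eye on the “unbucketed” count. If it grows, that’s a sign you either need a new bucket or need to reword the question so answers are easier to group.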
At each stage of your reliability journey, Blameless can guide you with our best practices and assist you with our suite of SRE tools. Learn how to incorporate these questions into more meaningful retrospectives or use these insights to build better automatic communication. Your reliability needs are unique and will change as your organization grows. To see how our experts can use Blameless to take your reliability to the next step, sign up for a demo here!