Wouldn’t it be nice to know which parts of your service see the most incidents, or why one service experiences more Sev1 incidents than the others? It’s not always easy to see the full disruptive impact of an engineering incident, and it’s even harder to spot trends across incidents over time. Developing incident insights that you can use to guide and shape the way your team designs and operates your product takes time, careful consideration, team engagement, and the right tooling. If you take the time to step back and look at your entire DevOps playbook, your tooling, and the data on what exactly is going on, you’ll soon reach whole new levels of operational excellence.
Shifting out of the reactive, defensive posture your team is crouched in takes time and resources. To turn incidents into a learning mechanism, you need to start small. A dedicated incident management solution that adapts to your distinct needs is a great start. And if you want to surface hidden trends across all your incidents, you need to start tagging. It’s the only way to consistently collect the pertinent data you need to really understand what’s happening.
Focus on resolution, automate data collection
In the six years since the original SRE book was published, the tools available for incident response have taken leaps forward in supporting incident learning. Metadata tagging, automatically compiled incident timelines, and incident performance analytics all take the onus off incident responders to keep documentation top of mind while they’re heads-down finding and fixing whatever broke.
Automatic timeline compilation lets you gather actionable data about each incident without slowing the response process. Having the “what happened” in a detailed post-incident report is extremely valuable for everyone who handled the incident, and it’s invaluable learning to share beyond engineering with other business functions. Additionally, knowing exactly which follow-up actions were uncovered during incident response helps everyone stay on track and truly close out the issue, making a recurrence less likely.
“As a team, we do not want to repeat the same efforts…We hire fantastic engineers who are diligent and want to help the system improve. As a leader, I also want to encourage this type of collaboration.” David Levinger, Sr. VP of Operations, Machinify
The Blameless Incident Learning Workflow
So how do we make this happen at Blameless? The Blameless incident learning workflow starts with tagging and ends with retrospectives and reliability insights. It feels cliché to say “everyone’s business is a little different,” but clichés exist because they identify something true: every engineering team and every product is a little different. Tagging has been at the center of how we do incident data collection and analysis from the beginning. If you’re interested in the long history of how we’ve leveraged this feature, you can check out our original tagging blog here. If you’re interested in the updates we’re making to the tagging system, reach out to your customer success manager for more information.
The Tag Management Interface
We’ve taken the approach of creating an easily configurable system of tagging inside Blameless. Customizing tags to cover the specific metrics that matter to your business enables automatic collection and simple downstream reporting of specific data points. This especially benefits team leaders who want to look across multiple incidents to identify trends that wouldn’t be immediately obvious to the team tasked with a single retrospective. For example, if you tag each incident with the feature that was impacted, tracking incidents over time can reveal the features that need additional attention. That’s valuable to know, as it’s directly related to customer impact.
Tags are easily configured at the outset by your infra or DevOps teams when you onboard with Blameless, and any admin team member with access to Blameless can create them. Tags can identify things like contributing factors, impacted services, customer outcomes, stakeholders, you name it. This lets incident responders get very granular with their data collection, capturing items like the issue type (say, a network interruption creating a login issue) or the specific region affected by an outage in a highly distributed global cloud environment (say, customers in the Southwestern United States). The same interface makes it easy to update those tags as the engineering team grows and evolves or as new microservices are added to the platform. Alternatively, if your lists of predefined tags change often, you can update them programmatically using our tagging APIs. These tags become a simple, natural component of the items that appear on the retrospective timeline and in Reliability Insights, the Blameless incident analytics tool.
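To make the programmatic-update idea concrete, here is a minimal sketch of how a team might reconcile a changing list of predefined tags before pushing it to a tagging API. The tag names, categories, and payload shape are hypothetical illustrations, not the Blameless API contract; consult your customer success manager or the API docs for the real endpoints and fields.

```python
# Hypothetical sketch: merge updated tag definitions into an existing
# predefined-tag list, keyed by tag name, before syncing to a tagging API.
# New microservice tags are added without duplicating or clobbering
# existing entries.
def merge_tag_definitions(existing: list, updates: list) -> list:
    """Merge update entries into the existing tag list, keyed by name."""
    by_name = {t["name"]: t for t in existing}
    for tag in updates:
        # Later fields win, so an update can also amend an existing tag.
        by_name[tag["name"]] = {**by_name.get(tag["name"], {}), **tag}
    return sorted(by_name.values(), key=lambda t: t["name"])

existing = [
    {"name": "service:billing", "category": "impacted-service"},
    {"name": "region:us-west", "category": "region"},
]
updates = [
    {"name": "service:payments", "category": "impacted-service"},  # new microservice
    {"name": "region:us-west", "category": "region", "deprecated": True},
]
merged = merge_tag_definitions(existing, updates)
# `merged` would then be sent to the (hypothetical) tagging API endpoint.
```

The merge is deliberately idempotent, so running the sync on every deploy of a new microservice is safe.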
Creating a healthy standard around data collection
Before you can generate an actionable insight, it’s important to create standards around data collection. Garbage in, garbage out, as they say. One major benefit of the new tagging system is the standardization it creates around the data collected while responders are addressing an incident. It can be confusing for responders, or even incident commanders, to select which tag to use from a menu of similar choices, and capturing non-standard or incorrect information can be really problematic for downstream data analysis. Fortunately, Blameless significantly simplifies this: the Blameless admin creates and maintains a simple list of predefined, agreed-upon standard tags associated with each incident type. So when responders attach a tag to an incident, it’s crystal clear which tags to choose, and everybody speaks the same language. The captured information is consistent, producing high-quality reports for your leadership.
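The value of an admin-maintained tag list can be sketched in a few lines: responders pick only from the predefined set, and near-miss entries get a suggestion instead of silently becoming a new variant that pollutes downstream reports. The tag names below are hypothetical examples, not Blameless defaults.

```python
# Sketch of tag standardization: validate a responder's tag against the
# admin-maintained predefined list, suggesting the closest standard tag
# for near misses so data stays consistent.
import difflib

PREDEFINED_TAGS = {
    "network-interruption",   # hypothetical example tags
    "login-failure",
    "region:us-southwest",
    "service:checkout",
}

def validate_tag(tag):
    """Return (is_valid, suggestion); suggestion is None for exact matches
    or when no predefined tag is close enough."""
    if tag in PREDEFINED_TAGS:
        return True, None
    matches = difflib.get_close_matches(tag, PREDEFINED_TAGS, n=1, cutoff=0.6)
    return False, matches[0] if matches else None

print(validate_tag("login-failure"))  # exact match: (True, None)
print(validate_tag("login-failur"))   # near miss: suggests "login-failure"
```

Rejecting free-form variants at entry time is what keeps a query like “all login-failure incidents this quarter” trustworthy later.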
How tagging helps transform your operations
The reporting benefit of these tags is best understood through the mashup reporting possible in Blameless Reliability Insights (watch the video to learn more). Run queries to slice and dice your goldmine of data across incidents, teams, learnings, and much more. Get a big-picture overview of your site reliability, or dive deeper into specific services. To enhance your contextual understanding, you can pull reliability data from your app and infrastructure monitoring tools, including Prometheus, New Relic, and Datadog, and combine it with the tagged metadata in Blameless. In this way you can start to identify, side by side, the services that are having the biggest impact on your MTTx metrics and your team’s resilience. This incident data, combined with automatic responder tracking, lets you manage the load on your on-call teams.
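As an illustration of the kind of slice-and-dice that consistent tags enable (not the Reliability Insights implementation itself), here is a small sketch that groups incidents by a hypothetical “service:” tag and computes a mean time to resolve per service, surfacing which services drag your MTTx metrics the most.

```python
# Illustrative aggregation over tagged incident records: mean resolution
# time per impacted-service tag. Incident records and tag names are
# hypothetical examples.
from collections import defaultdict

incidents = [
    {"tags": {"service:checkout", "sev1"}, "minutes_to_resolve": 90},
    {"tags": {"service:checkout"}, "minutes_to_resolve": 30},
    {"tags": {"service:billing"}, "minutes_to_resolve": 45},
]

def mttr_by_service(incidents):
    """Average minutes-to-resolve for each 'service:' tag."""
    durations = defaultdict(list)
    for incident in incidents:
        for tag in incident["tags"]:
            if tag.startswith("service:"):
                durations[tag].append(incident["minutes_to_resolve"])
    return {svc: sum(mins) / len(mins) for svc, mins in durations.items()}

print(mttr_by_service(incidents))
# checkout averages 60.0 minutes, billing 45.0
```

The same grouping pattern extends to any tag dimension, such as region, contributing factor, or customer outcome, which is why consistent tagging pays off across every report.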
You keep the lights on and we’ll do the rest
Your team’s first responsibility will always be to resolve the incident in front of them and restore service quickly, and Blameless streamlines that entire flow. Just as importantly, Blameless helps you take the next step of leveraging incident data to bring calm to the chaos of incidents.
Discover how Blameless can help you leverage data and insights to improve your incident response. Sign up for a free trial today.