7 Ways Tagging Incidents Can Teach You About System Health

Emily Arnott | 7.12.2022

One of the most powerful ways to prepare for future incidents is to study and learn from patterns in past incidents. Blameless Reliability Insights highlights these patterns for you, with out-of-the-box dashboards that automatically collect and present statistics about your incidents.

To supplement the data automatically collected from the incident, you can also tag incidents freely with any information that is useful now or might be later – for example, tagging which products an incident impacted. For any tag, your dashboard can present reports on:

  • How frequently that tag occurs
  • How severe incidents with that tag are
  • How often that tag occurs alongside another tag
  • And much more!
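
To make these reports concrete, here is a minimal sketch of the kind of tallies a dashboard computes, assuming incidents can be exported as simple records with a severity and a set of tags. The field names and tag values below are hypothetical placeholders, not the actual Blameless data model.

```python
from collections import Counter
from itertools import combinations

# Hypothetical export: each incident carries a severity and a set of tags.
# Field names and tag values are illustrative, not a real export format.
incidents = [
    {"severity": "SEV1", "tags": {"area-checkout", "factor-database"}},
    {"severity": "SEV2", "tags": {"area-checkout", "source-userreport"}},
    {"severity": "SEV3", "tags": {"area-login", "factor-configurationerror"}},
]

# How frequently each tag occurs.
tag_counts = Counter(tag for inc in incidents for tag in inc["tags"])

# How often each pair of tags occurs together on the same incident.
pair_counts = Counter(
    pair
    for inc in incidents
    for pair in combinations(sorted(inc["tags"]), 2)
)

print(tag_counts.most_common(5))
print(pair_counts.most_common(5))
```

From counts like these, a dashboard tile can surface the most frequent tags and the pairs that keep showing up together.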

List of Tags Shown in the Retrospective Details

But what sort of tags should you use? Of course, the answer depends on your organization – every team will have a set of tags that best suit them. As your system grows and changes, the parts of incidents that are most relevant will change too. We’ll get you started with seven categories of tags that are often helpful.


1. Service area affected

What it is: when an incident occurs, tag the parts of your service that were affected: any area that suffered an outage, a slowdown, data loss, or another disruption.

Example tags: area-checkout, area-login, area-comments, etc.

Why it’s useful: if a particular service area keeps experiencing incidents, it suggests that faulty code within that area could be the cause. This information helps you prioritize bug fixing. It also helps you understand your customers’ experience: if one feature is consistently up and another is frequently down, you can craft SLIs that better reflect what your users expect from your service.

2. Contributing factors

What it is: while investigating the incident and building a retrospective, an important step is completing a contributing factor analysis. This will expose the systemic factors that allowed the incident to happen, and guide you to making changes to prevent it from happening again. You can tag your incidents with these contributing factors to see patterns in what frequently causes issues.

Example tags: factor-database, factor-serveroutage, factor-configurationerror, etc.

Why it’s useful: seeing which factors frequently cause incidents can help you prioritize resilience work. If many high-severity incidents are caused by server outages, it’s a clear sign that you need to investigate upgrading to more reliable servers.
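
As a rough illustration of that kind of analysis, this sketch filters a hypothetical incident export down to high-severity incidents and tallies their contributing-factor tags. The severity labels and record shape are assumptions for the example, not a prescribed schema.

```python
from collections import Counter

# Hypothetical export; severity labels and tag names are illustrative only.
incidents = [
    {"severity": "SEV1", "tags": {"factor-serveroutage", "area-checkout"}},
    {"severity": "SEV1", "tags": {"factor-serveroutage", "area-login"}},
    {"severity": "SEV3", "tags": {"factor-configurationerror", "area-comments"}},
]

# Count contributing-factor tags on high-severity incidents only.
high_severity_factors = Counter(
    tag
    for inc in incidents
    if inc["severity"] in {"SEV0", "SEV1"}
    for tag in inc["tags"]
    if tag.startswith("factor-")
)

# A factor that dominates this list is a strong candidate for reliability work.
print(high_severity_factors.most_common())
```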

3. Customer types affected

What it is: your service is used differently by different types of customers. You might have a tiered subscription model, where only some customers have access to every feature. Even if all customers have access to everything, your usage data could show that only some types of customers will use specific features.

This difference in usage could be based on the size of the customers’ org (for B2B services), their geographic location, the tools they integrate your service with, and more. All of this is useful to capture in tags. When something is disrupted, think about all the groups of customers that were impacted.

Example tags: customer-fullaccess, customer-european, customer-adminaccounts, customer-slackusers, customer-enterprise, etc.

Why it’s useful: understanding customers’ pain points is the most important part of staying ahead of churn. If one type of customer has been hit by a lot of high-severity incidents recently, you can be sure they’re considering other options.

Customer success teams can use these tags to reach out to specific impacted customers, address their concerns, and rebuild trust. Product and development teams can use the same information to focus on improving the experience of those customers and offset the damage of the incidents.

4. Alert source

What it is: you can track where an incident was first detected with this set of tags. This information is usually known by the on-call engineers who first responded to the incident, but they might not be the same people who ultimately tag the incident in the system. Use incident retrospectives to make sure this information isn’t lost and is recorded consistently.

Example tags: source-userreport, source-monitoring, source-teamreport, etc.

Why it’s useful: seeing patterns in the alert source can help you understand what you aren’t seeing. Ideally, incidents should be discovered and responded to before users have a chance to report them – the best case scenario is that the problem is fixed before it impacts users. If users consistently report some types of incidents before your monitoring tools catch them, you can invest more in monitoring the data patterns that these incidents cause and catch things faster.
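
One way to spot that pattern is to cross-tabulate detection-source tags against service-area tags. The sketch below reuses the same hypothetical record shape as the earlier examples and counts, per area, how many incidents were first reported by users rather than caught by monitoring.

```python
from collections import defaultdict

# Hypothetical incident records; tag names follow the examples above.
incidents = [
    {"tags": {"area-checkout", "source-userreport"}},
    {"tags": {"area-checkout", "source-monitoring"}},
    {"tags": {"area-login", "source-userreport"}},
]

# For each service area, count how its incidents were first detected.
detection_by_area = defaultdict(lambda: defaultdict(int))
for inc in incidents:
    areas = [t for t in inc["tags"] if t.startswith("area-")]
    sources = [t for t in inc["tags"] if t.startswith("source-")]
    for area in areas:
        for source in sources:
            detection_by_area[area][source] += 1

# Areas where users beat your monitoring are candidates for better alerting.
for area, sources in detection_by_area.items():
    user = sources.get("source-userreport", 0)
    total = sum(sources.values())
    print(f"{area}: {user}/{total} incidents first reported by users")
```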

5. Resources used in response

What it is: during incident response, your response teams will likely use a variety of diagnostic tools, runbooks, retrospectives from other incidents, and other resources. This category of tags can track how frequently and for which types of incidents these resources are used. Incident retrospectives can help you keep track of everything that was used.

Example tags: resource-runbook14, resource-logindiagnostic, resource-retrospective152, etc.

Why it’s useful: knowing which resources are used most often helps you prioritize what to create and improve. If there’s a runbook that’s used all the time, take the time to make sure it’s as clear and efficient as possible, and consider investing in automating it.

On the other hand, these tags can help you recognize where resources are lacking or need to be updated. If you notice that a runbook is being used for every service area except for one, maybe you need to build a runbook to cover that area, or update an existing one so it’s useful.

6. Teams involved in response

What it is: after the incident is resolved, you can tag each team that had someone contribute to the solution. These could include development, operations, infrastructure, and DevOps or SRE teams. You can also tag incidents that require the involvement of legal teams, PR teams, customer success teams, executive teams, or any other auxiliary teams that handle some of the ramifications of the incident.

Example tags: teams-dev1, teams-ops3, teams-PR, teams-legal, teams-exec, etc.

Why it’s useful: by seeing which teams take on the most incidents, you can get ahead of burnout and ease their burden with balanced on-call scheduling. This can be tracked alongside the stats for individual contributions, which Blameless tracks out of the box.

Incidents that require legal teams or PR teams to get involved usually require special attention for their follow-up tasks, such as releasing subsequent statements or updating agreements with stakeholders. Being able to easily find these incidents with tags and check their status ensures that the proper processes are followed. Seeing patterns in these incidents can also help you create better processes for handling them.

7. Canary groups affected

What it is: if your organization does canary releases, where new features are rolled out incrementally to different groups of users instead of all at once, you can use tags to track which canary groups are affected by each incident.

Example tags: canary-group1, canary-group2, etc.

Why it’s useful: one of the biggest benefits of canarying is that you can roll back unstable releases without affecting all of your users. A high frequency of incidents for a particular canary group could be a sign that a rollback is needed. On the other hand, if a canary group is experiencing only a few minor incidents, that’s a signal that those features can be expanded to a larger group.
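
A rough sketch of that rollback signal, assuming the same hypothetical record shape: count recent incidents per canary group and flag any group that crosses a threshold. The 7-day window and the threshold of 2 are arbitrary placeholders to tune for your own release process.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical records: each incident notes when it started and which
# canary groups were affected. Window and threshold are arbitrary examples.
incidents = [
    {"started_at": datetime(2022, 7, 1, tzinfo=timezone.utc), "tags": {"canary-group1"}},
    {"started_at": datetime(2022, 7, 3, tzinfo=timezone.utc), "tags": {"canary-group1"}},
    {"started_at": datetime(2022, 7, 5, tzinfo=timezone.utc), "tags": {"canary-group2"}},
]

WINDOW = timedelta(days=7)
THRESHOLD = 2
now = datetime(2022, 7, 6, tzinfo=timezone.utc)

# Count canary tags on incidents that started within the window.
recent = Counter(
    tag
    for inc in incidents
    if now - inc["started_at"] <= WINDOW
    for tag in inc["tags"]
    if tag.startswith("canary-")
)

for group, count in recent.items():
    flag = "consider rolling back" if count >= THRESHOLD else "within tolerance"
    print(f"{group}: {count} incidents in the last 7 days ({flag})")
```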

As you start to tag your incidents, you’ll be able to create tiles and dashboards that will instantly highlight patterns. Continue to think of how you can make tags that capture the unique challenges of your system. If you’d like to consult one of our experts on what our reliability insights can teach you, request a meeting here.
