Blameless’s comprehensive incident management platform is built to ease the burden of keeping your services up and running. Whether you are in the middle of an incident or trying to better track your response performance, you need access to your incident data on demand. Blameless’s Reliability Insights unifies your Incident, Resource, Task, and IAM data in a single customizable and queryable analytics tool.
Blameless customers rely on Reliability Insights to deliver the answers that will help them resolve incidents and prevent future ones. Data needs vary by company, team, situation, and goal. That said, our industry also follows some widely agreed-upon key performance indicators, such as incident volume by severity and max time to resolution.
While Blameless offers industry-leading data robustness and query flexibility, in this post we’ll focus on Blameless’s default dashboards, which give you access to both of these metrics alongside almost 20 others out of the box.
When we navigate to the Reliability Insights tab in our instance of Blameless, we find a shortcut panel on the left-hand side that gives access to each of our dashboards. The default Blameless boards break many of the common KPIs we might look for into categories, so we aren’t forced to create them ourselves. Let’s consider a scenario in which a DevOps lead is communicating with their executive stakeholder, the company’s VP of Infrastructure.
First we’ll look at our Incidents dashboard to get a sense of how many incidents our teams are addressing and how severe they are. This tells us how often our measured systems are facing challenges, and whether those challenges are small fixes or Sev 0s that might cause us to reconsider parts of our architecture.
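For readers who want to see what “incident volume by severity” amounts to, here is a minimal sketch of the calculation. It is not how Reliability Insights computes the tile, and the record shape, field names, and severity labels below are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical incident records; the field names and values are
# illustrative assumptions, not Blameless's actual data model.
incidents = [
    {"id": 1, "severity": "SEV0"},
    {"id": 2, "severity": "SEV2"},
    {"id": 3, "severity": "SEV3"},
    {"id": 4, "severity": "SEV1"},
    {"id": 5, "severity": "SEV2"},
    {"id": 6, "severity": "SEV3"},
]


def count_by_severity(records):
    """Return a mapping of severity label to number of incidents."""
    return Counter(record["severity"] for record in records)


volume = count_by_severity(incidents)

# Print the counts from most to least severe (labels sort alphabetically).
for severity in sorted(volume):
    print(f"{severity}: {volume[severity]}")
```

Grouping the same records by week or month as well as by severity would give the kind of over-time view that makes a spike stand out.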
In this view we’re sharing incident behavior over the past 3 months with our VP of Infrastructure, because that is the window since we launched our new SRE program. What quickly draws their eye is a major spike of incidents in mid-December. This enhanced detail view helps us explain the difficulties we faced during a major migration of services: while we dealt with some Sev 0s and Sev 1s, our preparation meant that 2 out of 3 incidents were relatively benign. They acknowledge the value that our new SRE program must have had in mitigating risk and preventing further damage during the migration.
They are still curious, however, about how long it takes our team to resolve incidents. We discuss our averages, but we also need them to be aware of extreme outliers. We pull up the max time to resolve metric tile in our Incident Timings dashboard to show them an incident from around the same period that took far longer to resolve than any other. We explain that it was not customer-impacting, but that it forced unnecessary complexity into our deployments while it was ongoing. We then discuss the root cause and other details documented in our Blameless retrospective for that incident, which has kept similar issues since then from lasting nearly as long.
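To illustrate why the average alone can hide an outlier, here is a minimal sketch that computes both mean and max time to resolve. The timestamps, field names, and the `resolution_stats` helper are hypothetical and are not taken from the Blameless product or its API.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents with made-up timestamps; field names are assumptions.
incidents = [
    {"id": "A", "created_at": datetime(2021, 1, 4, 9, 0), "resolved_at": datetime(2021, 1, 4, 10, 0)},
    {"id": "B", "created_at": datetime(2021, 1, 5, 9, 0), "resolved_at": datetime(2021, 1, 5, 11, 0)},
    {"id": "C", "created_at": datetime(2021, 1, 6, 9, 0), "resolved_at": datetime(2021, 1, 6, 12, 0)},
    {"id": "D", "created_at": datetime(2021, 1, 7, 9, 0), "resolved_at": datetime(2021, 1, 8, 15, 0)},
]


def hours_to_resolve(incident):
    """Elapsed time between creation and resolution, in hours."""
    elapsed = incident["resolved_at"] - incident["created_at"]
    return elapsed.total_seconds() / 3600


def resolution_stats(records):
    """Return the mean and max time to resolve, plus the slowest incident's id."""
    slowest = max(records, key=hours_to_resolve)
    return {
        "mean_hours": mean(hours_to_resolve(r) for r in records),
        "max_hours": hours_to_resolve(slowest),
        "slowest_id": slowest["id"],
    }


stats = resolution_stats(incidents)
print(stats)
```

In this made-up data, three incidents resolve within a few hours and one takes 30, so the mean of 9 hours describes none of them well. That gap is why a max tile is worth showing alongside the averages.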
As this example of executive stakeholder communication shows, default dashboards make it easier to get started with more data-driven decision making in the SRE transformation process. If you have any more questions about using Reliability Insights, please see our documentation or simply reach out to a member of the Blameless team. Want to see it in action? Start your free trial today!