
5 Reliability Insights That Immediately Transform Your SRE

Emily Arnott | 6.1.2022

As an infrastructure engineer, you can learn a great deal from studying past incidents. Luckily, Blameless Reliability Insights helps you find patterns that better equip you to deal with incidents to come. If you’ve never used it before and you’re curious what it looks like, you can watch a video demo here! These statistical insights give you the power to learn everything you can when something goes wrong.

Reliability Insights comes with out-of-the-box reports that deliver the most crucial indicators of system health at a glance. These analytics can help you maximize efficiency and even build a positive work culture. (Keep reading to find out why!) As your organization grows, you might also want to build customized data reports. But in this post, we’re focusing on the 5 out-of-the-box insights that can help transform your SRE today. Let’s get started.

1. Reveal which product areas are causing the most problems with incidents by type

When incidents occur, the cause is often accumulated bugs, tech debt, or other problems in the codebase. The best way to address these issues is proactively: rather than just scrambling to stay on top of each incident that emerges, devote time to refactoring problematic areas of code, which can prevent many more. But how do you know which areas of the codebase to tackle first?

The Blameless tag categories dashboard can give you the answer at a glance. You can look at which product areas are producing the most incidents, or drill down and look at where the most severe incidents are originating. 

A graph showing incidents by type over time
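
If you want to sanity-check this kind of tally outside the dashboard, here is a minimal Python sketch. It is not a Blameless API or export format: the record fields (`product_area`, `severity`) and the convention that a lower severity number means a more severe incident are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical incident records; the field names are assumptions for
# illustration, not a Blameless export format.
incidents = [
    {"id": 1, "product_area": "checkout", "severity": 1},
    {"id": 2, "product_area": "search", "severity": 3},
    {"id": 3, "product_area": "checkout", "severity": 2},
    {"id": 4, "product_area": "auth", "severity": 1},
    {"id": 5, "product_area": "checkout", "severity": 3},
]


def incidents_by_area(records, max_severity=None):
    """Count incidents per product area, most frequent first.

    If max_severity is given, only incidents at that level or more severe
    are counted (assumption: lower number = more severe).
    """
    counts = Counter()
    for rec in records:
        if max_severity is None or rec["severity"] <= max_severity:
            counts[rec["product_area"]] += 1
    return counts.most_common()


# Which areas produce the most incidents overall?
all_counts = incidents_by_area(incidents)
# Which areas produce the most severe incidents?
severe_counts = incidents_by_area(incidents, max_severity=1)

print(all_counts)
print(severe_counts)
```

The two views can disagree: an area with many minor incidents may deserve a different kind of attention than one with a few severe ones.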

2. See if your incident response practices are paying off with MTTX data

Developing a strong incident response process is key to minimizing downtime and learning from each incident. As you experiment with different policies, implement different tools, and improve your system itself, your goal is to lower the time it takes to detect, respond to, and solve incidents. But just changing things isn’t enough – you have to know if your changes are paying off.

Right out of the box, Blameless Reliability Insights will track various MTTX metrics (the “mean time to X” family, such as mean time to detect or mean time to resolve) for each incident. You can look at incidents before and after policy changes to see what effect they had. Just remember that you can’t compare apples to oranges: a run of more complex or severe incidents can mask the benefits your new policies bring. Drill down further into the dashboard to compare similar incidents across time.

A graph showing the maximum time an incident took to resolve over time.
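
To make the apples-to-oranges point concrete, here is a rough Python sketch of a before-and-after comparison that only compares incidents of the same severity. The data, the field names, and the `POLICY_CHANGE` date are all invented for illustration, and "time to resolve" is simply `resolved - started` here; your own definition of each MTTX metric may differ.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents; fields and values are made up for illustration.
incidents = [
    {"severity": 2, "started": datetime(2022, 3, 1, 10, 0),
     "resolved": datetime(2022, 3, 1, 12, 0)},
    {"severity": 2, "started": datetime(2022, 3, 9, 8, 0),
     "resolved": datetime(2022, 3, 9, 9, 30)},
    {"severity": 2, "started": datetime(2022, 4, 5, 14, 0),
     "resolved": datetime(2022, 4, 5, 15, 0)},
    {"severity": 2, "started": datetime(2022, 4, 20, 11, 0),
     "resolved": datetime(2022, 4, 20, 11, 30)},
    {"severity": 1, "started": datetime(2022, 4, 12, 2, 0),
     "resolved": datetime(2022, 4, 12, 7, 0)},
]

# Assumed date on which a new incident response policy took effect.
POLICY_CHANGE = datetime(2022, 4, 1)


def mttr_minutes_by_severity(records, change_date):
    """Mean time to resolve, in minutes, keyed by (severity, period)."""
    buckets = {}
    for rec in records:
        period = "before" if rec["started"] < change_date else "after"
        minutes = (rec["resolved"] - rec["started"]).total_seconds() / 60
        buckets.setdefault((rec["severity"], period), []).append(minutes)
    return {key: mean(values) for key, values in buckets.items()}


mttr = mttr_minutes_by_severity(incidents, POLICY_CHANGE)
for (severity, period), minutes in sorted(mttr.items()):
    print(f"severity {severity}, {period}: {minutes:.0f} min")
```

In this made-up data, the overall average after the change looks worse (130 minutes versus 105) only because one severe incident landed in the later period. Comparing severity 2 incidents alone shows the average dropping from 105 minutes to 45.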

3. Leverage on-call stats to prevent burnout by finding the busiest responders

Incident response is tough work. The tension of being on-call, the stress of diagnosing a problem quickly, the challenge of digging through past knowledge to find a fix… it all adds up fast! Making fair and supportive on-call schedules requires you to know what each responder is going through. Tracking people’s workload isn’t as easy as just tracking their hours on-call, though. You need to understand the stressors of each incident and on-call shift.

The pre-built Hours in Incidents dashboard can help. These tiles track the time spent not just on-call, but in actual incidents. They also show you which incidents were most time-consuming, giving better insight into which types of incident response work are most troublesome. You won’t be able to get a single number that reflects “how stressed is this person”, but you’ll be able to find warning signs of overwork. These insights can be the foundation for conversations that keep everyone safe from burnout.

Statistics showing the amount of time spent in incidents for different individuals.
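
As a rough illustration of the idea, the Python sketch below totals the hours each responder spent in incidents and flags anyone above a threshold. The participation records, responder names, incident IDs, and the 6-hour threshold are all hypothetical, and the threshold is a conversation starter rather than a measure of stress.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical participation log: one row per responder per incident.
participation = [
    {"responder": "alice", "incident": "INC-1",
     "joined": datetime(2022, 5, 2, 1, 0), "left": datetime(2022, 5, 2, 4, 0)},
    {"responder": "alice", "incident": "INC-3",
     "joined": datetime(2022, 5, 6, 22, 0), "left": datetime(2022, 5, 7, 3, 0)},
    {"responder": "bob", "incident": "INC-1",
     "joined": datetime(2022, 5, 2, 1, 30), "left": datetime(2022, 5, 2, 2, 30)},
    {"responder": "carol", "incident": "INC-2",
     "joined": datetime(2022, 5, 4, 14, 0), "left": datetime(2022, 5, 4, 16, 30)},
]


def hours_in_incidents(records):
    """Total hours each responder spent actively in incidents."""
    totals = defaultdict(float)
    for rec in records:
        duration = rec["left"] - rec["joined"]
        totals[rec["responder"]] += duration.total_seconds() / 3600
    return dict(totals)


def flag_overloaded(totals, threshold_hours):
    """Names of responders whose incident hours exceed the threshold."""
    return sorted(name for name, hours in totals.items()
                  if hours > threshold_hours)


totals = hours_in_incidents(participation)
overloaded = flag_overloaded(totals, threshold_hours=6)
print(totals)
print(overloaded)
```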

4. Get ahead of churn by identifying pained customers

Not all of your customers will use your services in the same way. You might only roll out new features to some customers, or offer tiered subscriptions. Even if everyone has the same access, each customer will rely on different aspects of your service to different extents. Because everyone has different usage, everyone is affected by incidents in a different way.

Understanding how incidents affect different user cohorts is key to customer retention. If a particular customer has been hit especially hard by incidents lately, they’ll be more likely to look to a competitor. Get ahead of this problem by looking at which customers have been impacted most by incidents, and target the services most critical to them when proactively improving reliability. Blameless Reliability Insights dashboards can help highlight these at-risk customers right out of the box.

A graph showing how many incidents of different severities impact each customer.
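
Here is one way you might sketch that ranking in Python. Everything in it is an assumption for illustration: the customer names, the record layout, and especially the severity weights, which you would want to choose to match how your own organization thinks about impact.

```python
from collections import Counter, defaultdict

# Hypothetical incidents, each tagged with the customers it affected.
incidents = [
    {"id": "INC-1", "severity": 1, "customers": ["acme", "globex"]},
    {"id": "INC-2", "severity": 3, "customers": ["acme"]},
    {"id": "INC-3", "severity": 2, "customers": ["initech"]},
    {"id": "INC-4", "severity": 1, "customers": ["acme"]},
]

# Assumed weights: a severity 1 incident counts five times a severity 3.
SEVERITY_WEIGHTS = {1: 5, 2: 3, 3: 1}


def impact_by_customer(records):
    """Map each customer to a Counter of incident counts per severity."""
    impact = defaultdict(Counter)
    for rec in records:
        for customer in rec["customers"]:
            impact[customer][rec["severity"]] += 1
    return impact


def rank_customers(impact, weights):
    """Sort customers by weighted incident impact, hardest hit first."""
    scores = {
        customer: sum(weights.get(sev, 0) * n for sev, n in counts.items())
        for customer, counts in impact.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


impact = impact_by_customer(incidents)
ranking = rank_customers(impact, SEVERITY_WEIGHTS)
print(ranking)
```

A weighted score like this is only a starting point for deciding who to check in with; the per-severity counts in `impact` are usually worth reading alongside it.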

5. Schedule on-call better by studying timing patterns

Most services will see fluctuations in how they’re used throughout the day, the week, or the month. Perhaps most of your user base logs on during PST work hours, or usage of most of your services drops sharply over the weekend. Maybe a lot of users make use of one particular tool at the end of every month. Knowing ahead of time that certain types of incidents are more likely at certain times can help you make better on-call schedules or prepare additional resources.

Looking at incident frequency over time can reveal these patterns at a glance. Once you see when spikes often occur, you can get ahead of them by having more resources at the ready. If certain service areas are particularly vulnerable under certain conditions, you can plan to make the relevant subject matter experts available at those times.

A graph showing the frequency of incidents with different severities.
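
If you have raw incident start times on hand, a quick Python sketch like the one below can surface the same kind of pattern by bucketing incidents by weekday and hour. The timestamps are invented, and the sketch assumes they are all already in a single time zone.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical incident start times, assumed to share one time zone.
start_times = [
    datetime(2022, 5, 2, 9, 15),    # Monday
    datetime(2022, 5, 4, 14, 0),    # Wednesday
    datetime(2022, 5, 9, 9, 40),    # Monday
    datetime(2022, 5, 16, 9, 5),    # Monday
    datetime(2022, 5, 31, 23, 30),  # Tuesday, end of month
]


def frequency_by_weekday_hour(times):
    """Count incidents per (weekday name, hour of day) bucket."""
    return Counter((WEEKDAYS[t.weekday()], t.hour) for t in times)


frequency = frequency_by_weekday_hour(start_times)
busiest = frequency.most_common(1)[0]
print(busiest)
```

With enough history, the busiest buckets point to the shifts where extra responders or subject matter experts are most likely to be needed.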

Blameless Reliability Insights surfaces underlying patterns to give you a clear picture of your organization’s reliability. Unlike service monitoring metrics that reflect product performance (which is also important), these data reports reflect the health of your team and processes. If you’re a manager looking for actionable data on how your team can improve, you’ll want to check out Reliability Insights.

Want to learn more about Reliability Insights in Blameless? Watch the video below for a full walkthrough of all the Reliability Insights reports you get out of the box. Our technical expert, Matthew Dodge, shows you exactly where to find these reports in the Blameless UI.

Maybe you’re already collecting this data, but you’re doing it manually. You’ll save so much time and energy with Blameless. If you have a unique use case and you’d like to draw on our expertise, feel free to reach out. You can request a meeting with one of our experts here.
