
The Catchpoint 2024 SRE Report - Five Key Takeaways

Emily Arnott | 1.16.2024 | SRE Fundamentals

SRE only emerged into the mainstream in the 2010s, making it a relatively new discipline in tech. It has been rapidly adopted by a widening variety of organizations, each implementing constantly evolving practices. For the last six years, Catchpoint has run a survey to take the temperature of the latest developments and trends. Check out the full report here, and read on for our analysis of five key takeaways.

1. There remains plenty of room for improvement in learning from incidents

A major focus of this year’s report is learning from incidents. Although it is a core part of the SRE philosophy, 47% of respondents said that learning from incidents had the most room for improvement of any category of incident management activities. Moreover, this percentage holds across organizations of every size, so even bigger companies have not yet embraced learning from incidents sufficiently.

Diving further into the results, we find that 28% of respondents said that taking the time to learn from incidents was the hardest part of incident management, higher than the percentage that said fixing the problem itself was hardest (27%)! Tellingly, when the answers were split between “major incidents” and “non-major incidents”, respondents more commonly felt that they gave major incidents the appropriate learning time but neglected non-major ones. We agree with the report’s authors that this represents a missed opportunity. Learning from minor incidents allows you to be better prepared for the major ones.

The crux of these results is in distinguishing between “learning” and “fixing”. Many organizations are equipped to address the issues of each individual incident; they can fix the immediate causes of those incidents to prevent recurrence. But they lack the opportunity to truly learn from an incident, improving processes on a meta-level and making more fundamental changes. We encourage organizations to set up the time and resources to make this deeper learning happen.

2. Coordination between parties is a major incident challenge

When asked “which parts of recent incidents were the most difficult?”, the second most commonly cited part was “Escalating to, or coordinating between, responsible parties”. This may be surprising at first, as it ranked even higher than actually fixing or detecting the problem, but it aligns with many things we’ve seen in the industry.

It’s easy to have some sort of escalation and coordination game plan for incident management, but it’s hard to have an effective one. An ineffective plan leads to many issues: redundant work, solution attempts that interfere with each other, people being brought in without being brought up to speed and left unable to contribute, and more. Poor coordination can be one of the most crippling problems for incident management.

We recommend that organizations don’t oversimplify this step. Invest the time in a robust and automated system to get people communicating, coordinated, and clear on what the next escalation steps are. Have roles and associated checklists for each responder to keep people working on unique and helpful tasks.
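As a minimal sketch of what role-based checklists could look like, the following Python pairs each responder with a distinct role and that role's task list. The role names, checklist items, and the `assign_roles` helper are hypothetical illustrations, not taken from the report or from any particular tool.

```python
from dataclasses import dataclass, field

# Hypothetical roles and checklist items, for illustration only.
ROLE_CHECKLISTS = {
    "incident_commander": ["Declare severity", "Assign roles", "Decide next escalation step"],
    "communications_lead": ["Post status update", "Brief newly joined responders"],
    "technical_lead": ["Form a hypothesis", "Coordinate fix attempts"],
}


@dataclass
class Assignment:
    responder: str
    role: str
    checklist: list[str] = field(default_factory=list)


def assign_roles(responders: list[str]) -> list[Assignment]:
    """Pair responders with roles in order, so no role is held twice.

    Responders beyond the number of defined roles are left unassigned.
    """
    assignments = []
    for responder, (role, tasks) in zip(responders, ROLE_CHECKLISTS.items()):
        assignments.append(Assignment(responder, role, list(tasks)))
    return assignments


if __name__ == "__main__":
    for a in assign_roles(["alice", "bob", "carol"]):
        print(f"{a.responder}: {a.role} -> {a.checklist}")
```

Giving each role its own list is one way to keep responders on unique tasks; a real system would also need to handle handoffs and responders who join late.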

3. Explore AI as a helper, not as a replacement

With AI being 2023’s hottest topic in tech, the report would be remiss not to take the temperature on it. The authors found respondents were optimistic about AI’s role in their future: only 4% believed AI would replace them, whereas 53% believed AI would make their work easier. Specifically regarding incident management, 27% expected it would be “moderately useful” and 38% expected it would be very or extremely useful.

This aligns with our hopes and recommendations for incorporating AI into your incident management. AI can automate a lot of the more ad-hoc, toilsome aspects of responding to an incident, such as summarizing the state of the incident for new responders, writing quick testing scripts, and parsing long log files. This efficiency keeps engineers focused on the more nuanced diagnosis and analysis, which AI is still a long way away from mastering.

4. Budgeting concerns are stifling to reliability improvement

When respondents were asked why they aren’t successfully implementing reliability practices, the most commonly cited reason (at 44%) was cost or budget. This is perhaps unsurprising, given the economic downturn in the industry last year. However, we’d encourage organizations to think more holistically about these costs and opportunities.

Organizations may feel held back because they cannot hire a full incident management team, but there are many cheaper ways to succeed with SRE. An easy way to multiply the impact of your efforts is to invest in an incident management tool. This will allow you to automate and easily adopt many helpful practices, reducing the time spent on incidents and the frequency of repeat incidents. Although this is still a cost, it’s relatively minor compared to having engineers spend time implementing these practices themselves.

The costs of investing in reliability are also extremely minor compared to the amount you’ll gain. Better reliability means less customer churn, less engineer burnout, less downtime (and therefore less lost revenue), more time spent on exciting new features, and more. Rather than thinking that they can’t afford to implement reliability practices, we encourage organizations to consider that they can’t afford not to invest. Tellingly, the second most cited reason was “alignment or prioritization”. We encourage organizations to put investment in reliability high on the list when weighing those priorities.

5. Toil is trending down!

One of the most encouraging aspects of this year’s report is that respondents are reporting that less of their work is toilsome: the median response fell from 20% of work being toil to 14%, using Google’s definition of toil. The report notes that this reduction is likely not due to generative AI, as the survey was run only 8 months after the release of ChatGPT. Even if AI did play a part, it would be only one of several factors influencing this reduction of toil.
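To put that drop in perspective, here is a small worked calculation in Python. The 20% and 14% medians come from the report; the 40-hour work week and the `toil_change` helper are assumptions added for illustration.

```python
def toil_change(before_pct: float, after_pct: float, hours_per_week: float = 40.0):
    """Return (percentage-point drop, relative drop, weekly hours freed).

    hours_per_week defaults to 40, an assumed figure that is not from the report.
    """
    point_drop = before_pct - after_pct
    relative_drop = point_drop / before_pct
    hours_freed = hours_per_week * point_drop / 100
    return point_drop, relative_drop, hours_freed


if __name__ == "__main__":
    points, relative, hours = toil_change(20, 14)
    print(f"Drop: {points} points ({relative:.0%} relative), "
          f"about {hours:.1f} hours per week")
```

A 6-point drop from a 20% baseline is a 30% relative reduction in toil, which under the assumed 40-hour week works out to roughly 2.4 hours per engineer per week.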

This result is validating for the SRE project as a whole. Reducing toil is a fundamental goal of implementing SRE, right up there with, well, making your site more reliable. Embarking on the journey of SRE is an investment, and it has a cost. It may require hiring and/or retraining engineers. It involves new processes that require building new muscle memory. You adopt new tools to automate new things. It seems like a lot, but results like this should motivate engineers to take the plunge. Any steps you take on this reliability journey will produce positive results.

Stay ahead of the curve with Blameless

As SRE continues to expand and evolve, invest in a tool that evolves alongside it. Blameless offers sophisticated, cutting-edge reliability features, such as AI-enhanced incident communication, service level objective tracking, and a suite of bidirectional integrations. Blameless gets you up to speed quickly, making robust practices straightforward to adopt with customizable and automated workflows.

Check out a Blameless demo today!
