SRE is a constantly evolving field, responding to the challenges of increasing reliance on tech and the opportunities of its evolving abilities. Reliability has to remain a step ahead of the cutting edge, whether it’s navigating remote work, implementing AI assistance, or optimizing internal processes. But how do we know that SRE is keeping up?
We’re proud and excited to announce the results of the SRE Survey we ran in partnership with Catchpoint. We reached out to SRE practitioners, engineers, and others involved in SRE duties to understand how the practice evolved in the last year, and how it will continue to change in 2023.
Read on to learn five key takeaways and check out the full report here.
1. Competitive hiring and retention remain major challenges in SRE
Talent, including hiring, retention, and assimilation, was the issue most commonly described as “the number one challenge most hindering successful reliability implementation”. Notably, the second most commonly cited challenge was “complexity of architecture”, which can also be thought of as emerging from talent issues – if expertise were properly integrated, then the system’s complexity could be properly managed.
This trend was also present in previous years, due to the high demand for and low supply of the SRE skill set, but has been exacerbated by the Great Resignation. Around 35-50% of responders reported that areas such as hiring, knowledge retention, productivity, and morale were moderately or severely impacted by the Great Resignation.
These widespread labor events are difficult to definitively assess, but people’s subjective experiences with the fallout are still relevant to how the industry will develop. That is to say, what matters most is that managers believe that they will face issues in these talent management areas, and will hire and train accordingly. As a result, we can expect to continue to see competitive hiring and upleveling of engineers into the SRE skill set.
2. AIOps adoption rates are still slow, but show potential
The 2023 SRE report continues from the 2021 report in investigating the role of AIOps solutions in responders’ organizations. Between the two surveys, the numbers haven’t changed much: in 2021, 27% of responders rated the value they received from AIOps as low (1-3) and 41% rated it as moderate (4-6); in 2022, 12.9% reported “none” and 32.6% reported “low or extremely low” value from AIOps.
Despite these numbers, there’s still reason to believe that more widespread AIOps adoption lies on the horizon. The pieces are in place for a problem that AIOps is positioned to solve: responders report dealing with more sources of data, which makes observability more complex. By connecting multiple sources of data, AI-driven observability can draw out and highlight more nuanced conclusions, which could make those deeper observations easier to act on and automate.
At the same time, most responders reported that tool sprawl – problems arising from having too many tools – was a minor or nonexistent problem. Therefore, the hesitation to pick up an AIOps tool probably isn’t because it would be one tool too many. Instead, people may be skeptical of entrusting more nuanced tasks to AI. AIOps practitioners could overcome this by promoting the specific problems that AI excels at, to demystify and normalize AI usage.
3. Toil continues to drop from 2021, but at a slowing rate
Like the 2021 and 2020 surveys, the 2022 survey measured what percentage of time responders reported spending on toil, where toil is “manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly as a service grows”. That is to say: bad stuff. Although toil continues to decrease, this year was the smallest decrease yet. This deceleration of toil reduction may seem bad, but it may actually be a good sign.
When dealing with toil through automation and process optimization, you want to prioritize what will have the biggest impact: the tasks that are most common, most time-consuming, and most automatable. Seeing diminishing returns means that people are prioritizing correctly, first taking out this “low-hanging fruit” of toil.
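One way to make that prioritization concrete is to score each task by how much time automating it could give back. The sketch below is only an illustration: the `ToilTask` fields, the example tasks, and the scoring formula (frequency × duration × an estimated automatable fraction) are all assumptions, not something taken from the survey.

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    # All fields are hypothetical estimates a team would supply itself.
    name: str
    runs_per_week: float    # how common the task is
    minutes_per_run: float  # how time-consuming each run is
    automatability: float   # rough 0.0-1.0 guess at how much could be automated

    def recoverable_minutes_per_week(self) -> float:
        # Assumed scoring rule: time spent per week, scaled by automatability.
        return self.runs_per_week * self.minutes_per_run * self.automatability


def prioritize(tasks: list[ToilTask]) -> list[ToilTask]:
    # Highest expected time savings first.
    return sorted(
        tasks,
        key=lambda t: t.recoverable_minutes_per_week(),
        reverse=True,
    )


tasks = [
    ToilTask("rotate certificates by hand", 1, 30, 0.9),
    ToilTask("restart stuck batch job", 10, 5, 0.8),
    ToilTask("manually approve deploys", 20, 3, 0.3),
]
ranked = prioritize(tasks)
for task in ranked:
    print(task.name, round(task.recoverable_minutes_per_week(), 1))
```

As the top items get automated, the remaining tasks score lower and lower, which is the diminishing-returns pattern described above.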
Another interesting phenomenon in toil rates results from improved tooling. As advanced tools allow for more automation and quicker processes, tasks that weren’t originally considered toil because they seemed unavoidably manual can now be done automatically. We shouldn’t have a rigid definition of toil, but should instead push to automate things we previously couldn’t imagine automating.
4. Interruptions to an SRE’s workflow reduce productivity and take up tons of time
A new question on this survey looked at the rate of distraction and interruption in people’s workflow. The results are powerful, but maybe unsurprising: even when not on call, the majority of responders spent 20% or more of their time responding to interruptions.
Some interruptions are necessary; no one can hope to spend 100% of their time locked into a single task. However, 20% adds up quickly: assuming a 40-hour week, that is 8 hours, a full working day and long enough to tackle entirely different tasks. People are also often less productive after an interruption, taking a while to refocus on their task. Therefore, interruptions that can wait probably should wait until the task at hand is completed.
Teams should track interruptions continuously, noting which ones were time sensitive, where they came from, and whether the issue could have been resolved without interrupting someone’s work. Policy and process changes can help people resolve issues without needing to escalate. Interruptions could be considered and addressed as “the new toil”.
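As a rough illustration of what that tracking could look like, here is a minimal sketch of an interruption log and a weekly summary. The `Interruption` record, the source labels, and the sample numbers are all hypothetical; a real team would likely pull this data from its chat, ticketing, or paging tools.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Interruption:
    # Hypothetical record; adapt the fields to whatever your team can capture.
    source: str           # where it came from, e.g. "chat ping"
    minutes: int          # time spent handling it, including refocusing
    time_sensitive: bool  # True if it genuinely could not wait


def summarize(log: list[Interruption], work_minutes: int) -> dict:
    total = sum(i.minutes for i in log)
    # Minutes spent on interruptions that could have waited.
    deferrable = sum(i.minutes for i in log if not i.time_sensitive)
    by_source = Counter()
    for i in log:
        by_source[i.source] += i.minutes
    return {
        "share_of_time": total / work_minutes,
        "deferrable_minutes": deferrable,
        "minutes_by_source": by_source.most_common(),
    }


# Made-up week for one engineer, assuming a 40-hour work week.
week = [
    Interruption("chat ping", 120, False),
    Interruption("support ticket", 180, True),
    Interruption("ad hoc meeting", 90, False),
    Interruption("chat ping", 90, False),
]
summary = summarize(week, work_minutes=40 * 60)
print(summary)
```

Splitting the totals by source and by time sensitivity points at where a policy or process change would help most: a large share of deferrable minutes from a single source is a good candidate for batching or self-service.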
5. Blamelessness is key to meaningful incident learning
The survey also looked at some correlations between cultural practices and organizational success. The results were very encouraging: there was a major correlation between DORA performance (as measured in the DORA report) and blameless culture. Moreover, the survey results show that how blameless an organization’s culture is does not correlate with its size. Therefore, it isn’t just a situation where big companies are higher performing and can also “afford” to be blameless – organizations of any size can achieve blamelessness, and it always helps.
A specific practice analyzed in this context was post-incident reviews, including retrospectives or postmortems. As an organization’s level of blamelessness increased, the value it got from reviewing incidents increased as well. This makes sense, as a blameful incident review is often just pointing a finger at an individual “responsible” for the incident, with no further investigation or change.
An interesting finding is that responders in different roles gave very different answers about how blameless their organizations are. Individual practitioners reported higher levels of blamelessness than executives, with management in the middle. People further up the org chart may ultimately have to hold others accountable for long-term strategic decisions, such as layoffs, so they may not see themselves as blameless. However, we should work to teach blameless culture so that people understand that blamelessness doesn’t mean never holding people accountable, but rather that accountability alone isn’t sufficient – it must always be accompanied by systemic change.
Let’s look forward to another great year of SRE!
As we move towards 2023, we can’t be too certain of what challenges and opportunities tech will face as a whole. What we can be confident about is that an SRE mindset will continue to encourage teams to clean up toil so people have the chance to try new tech, to develop response processes ready for new crises, and to build a cultural foundation that inspires people to innovate.
Be sure to read the full report to see the details of all these findings. What did you find surprising? Let us know in our community Slack channel!