SRE Toil | What It Is & Top Tips To Reduce It

Myra Nizami

Is SRE toil affecting your team? We define what SRE toil is, discuss how it can adversely affect your productivity, and share the best techniques to reduce it.

What is SRE toil?

SRE toil is any task that has one or more of the following characteristics:

  • Manual
  • Repetitive
  • Automatable
  • Reactive
  • Lacks enduring value

One goal most teams share is to avoid working for the sake of work: they want to do work that holds value and get rid of what doesn’t. That’s where the concept of SRE toil comes in, as it helps teams identify which tasks are needed and reduce anything that fits the characteristics above.

Examples of toil include semi-manual deployments, network changes, and incidents that require human actions such as restarts, diagnostics, performance checks, and other manual work.
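As a rough illustration of the definition above, the sketch below tags tasks with the five characteristics and flags a task as toil when it has one or more of them. The `Task` class, its field names, and the sample tasks are hypothetical and only meant to show the idea; they do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """A unit of operational work, tagged with the toil characteristics."""
    name: str
    manual: bool = False
    repetitive: bool = False
    automatable: bool = False
    reactive: bool = False
    lacks_enduring_value: bool = False

    def toil_traits(self) -> list[str]:
        """Return the names of the toil characteristics this task has."""
        traits = ["manual", "repetitive", "automatable", "reactive", "lacks_enduring_value"]
        return [t for t in traits if getattr(self, t)]

    def is_toil(self) -> bool:
        """A task counts as toil if it has one or more characteristics."""
        return len(self.toil_traits()) > 0


# Hypothetical sample tasks for illustration only.
tasks = [
    Task("restart service after alert", manual=True, repetitive=True,
         automatable=True, reactive=True),
    Task("design new caching layer"),
]
toil = [t.name for t in tasks if t.is_toil()]
print(toil)
```

In practice a team would probably want a judgment call rather than a strict boolean, but even a simple tagging pass like this makes the later baseline easier to build.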

What are the benefits of reducing toil?

Within SRE and more broadly, the main benefit of reducing toil is that it cuts out the noise and helps teams focus on what’s really important. Reducing toil also helps teams feel better about their work. When workloads are filled with tasks that fit the characteristics of toil described above, discontent grows and burnout becomes more likely. Reducing toil can help eliminate errors and time-consuming tasks, freeing up teams for better, more valuable work.

And ultimately, reducing toil is about ensuring that the work your teams do has meaning. Toil shouldn’t be increased as a way to keep people busy with repetitive tasks. Instead, businesses need to focus on reducing functions that fit within toil characteristics and work towards creating opportunities and tasks that teams find valuable and help them develop. 

For businesses, eliminating toil has an immense benefit not just from a work standpoint but also from a retention standpoint. Team members are more likely to stick around and stay in roles where they can meaningfully contribute rather than just doing tedious, manual, and repetitive work. 

How can organizations reduce toil?

The next question, for DevOps toil and SRE toil alike, is how organizations can start reducing it. The first step, of course, is culture. As a team, you’ll have to take the time to think about what current workloads are like and whether team culture allows for this kind of conversation in the first place. Are your teams and managers in agreement that toil should be reduced? Or do people still prioritize “keeping people busy”? Emphasize the benefits of reducing toil to counteract these concerns.

Once you’re on the same page about eliminating toil, you need to identify where it exists. Finding tasks that fit within the toil criteria has to be the starting point because that gives teams something to track against as toilsome tasks are reduced.

Once teams have identified the types of tasks that count as toil, it’s also crucial to look at how much time these tasks take to complete. For example, how many people are tied up in repetitive and manual tasks, and how many hours does it take to get through them?

Gathering these measures can be a painful process, but it’s necessary to establish a baseline. Knowing how many team members are dedicated to these tasks and how long the tasks take gives you a reference point for eliminating toil.
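A baseline like the one described above can be as simple as a table of tasks, people, and hours. The sketch below is one possible way to total it up; the `ToilEntry` fields, the 40-hour week, and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToilEntry:
    task: str
    people: int            # team members who regularly do this task
    hours_per_week: float  # total hours per week across those people


def toil_baseline(entries, team_size, hours_per_person_week=40.0):
    """Summarize weekly toil hours and the share of team capacity they use."""
    total_hours = sum(e.hours_per_week for e in entries)
    capacity = team_size * hours_per_person_week
    return {
        "total_hours": total_hours,
        "share_of_capacity": total_hours / capacity,
        # Largest time sinks first, so they can be tackled first.
        "by_task": sorted(entries, key=lambda e: e.hours_per_week, reverse=True),
    }


# Made-up numbers for a five-person team.
entries = [
    ToilEntry("manual restarts", people=3, hours_per_week=6.0),
    ToilEntry("semi-manual deployments", people=2, hours_per_week=10.0),
    ToilEntry("performance checks", people=1, hours_per_week=4.0),
]
baseline = toil_baseline(entries, team_size=5)
print(baseline["total_hours"], baseline["share_of_capacity"])
```

Sorting by hours is a design choice, not a rule: a team might instead rank tasks by how many people they interrupt or by how error-prone they are.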

Once baselines are established, it’s time to get to work on reducing toil. SRE teams need to work closely together and gather continuous feedback on how toil-reduction measures fit with service level objectives (SLOs). Some methods of reducing toil can make processes slower or more inconsistent, and SLOs give teams a way to weigh this cost against the benefits of reducing the toil. SRE teams can also be instrumental in creating system and design processes to proactively reduce toil.
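One common way to express the SLO side of that trade-off is an error budget: the amount of failure the SLO still permits in a given window. The sketch below assumes a simple event-based SLO and uses invented numbers; the function name and parameters are hypothetical.

```python
def error_budget_remaining(slo_target, total_events, bad_events):
    """Fraction of the error budget left in the window (negative if overspent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("no error budget: check slo_target and total_events")
    return 1.0 - bad_events / allowed_bad


# Invented example: a 99.9% SLO over one million requests, measured
# before and after a toil-reducing change that made results less consistent.
before = error_budget_remaining(0.999, total_events=1_000_000, bad_events=200)
after = error_budget_remaining(0.999, total_events=1_000_000, bad_events=450)
print(f"budget left before: {before:.2f}, after: {after:.2f}")
```

If the change saves many hours of toil and the budget is still comfortably positive, the cost may be acceptable; if the budget goes negative, the team has a concrete reason to revisit the change.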

Some of the measures that can be instituted to reduce toil include:

  • Standardizing platforms and simplifying the applications and tools in use. This makes platforms easier to manage and saves engineers from wasting time figuring out the specific steps for each process.
  • Repeating and reusing fixes for commonly occurring tasks, including using automation where possible. This can also involve investing in creating runbooks as guides for common tasks.
  • Creating a proactive monitoring process that lets you see problems further ahead on the horizon. This reduces toilsome time spent coming up with last-minute responses.
  • Improving poor code to reduce failures and incidents. Investing in cleaning up tech debt and spaghetti code will save you lots of toil in the long run, as bugs will crop up less frequently.

The role of automation

Automation plays a significant role in reducing toil, although some upfront engineering time is required to ensure automation is successful in the long run. Teams need to consider the tasks they’ve identified as part of toil and where automation fits into that. 

Teams can set up external and/or internal automation to reduce manual work without sacrificing performance or reliability. For example, repetitive tasks that follow an incident, such as restarts or diagnostics, can be distilled into automated runbooks that immediately take care of these tasks without manual intervention. Teams can work together to decide and set up automation for tasks where human judgment isn’t needed and repetitive tasks that take too much time to solve manually. 
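To make the idea of an automated runbook concrete, here is a minimal sketch, assuming a runbook is just an ordered list of steps keyed by incident type. The step functions are placeholders that only return log lines; the names `RUNBOOKS`, `run_runbook`, and `checkout-api` are hypothetical, and a real setup would call your own tooling instead.

```python
from typing import Callable, Dict, List

# Each step takes the affected service name and returns a short log line.
Step = Callable[[str], str]


def collect_diagnostics(service: str) -> str:
    # Placeholder: a real step might gather logs and metrics.
    return f"collected diagnostics for {service}"


def restart_service(service: str) -> str:
    # Placeholder: a real step would call your orchestrator or init system.
    return f"restarted {service}"


def check_performance(service: str) -> str:
    # Placeholder: a real step would compare metrics against thresholds.
    return f"performance check passed for {service}"


RUNBOOKS: Dict[str, List[Step]] = {
    "service_unresponsive": [collect_diagnostics, restart_service, check_performance],
}


def run_runbook(incident_type: str, service: str) -> List[str]:
    """Run every step for a known incident type; unknown types go to a human."""
    steps = RUNBOOKS.get(incident_type)
    if steps is None:
        return [f"no runbook for {incident_type!r}; escalate to on-call"]
    return [step(service) for step in steps]


log = run_runbook("service_unresponsive", "checkout-api")
print("\n".join(log))
```

Falling back to escalation for unknown incident types keeps human judgment in the loop for anything the automation was not designed to handle.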

As you continue to reduce toil, measurement is critical. Take stock of your baseline and compare it with the numbers you see after each change. Are teams doing more valuable work than before, or is it the same? Has morale or culture shifted since the effort to reduce toil began? Expect some growing pains as automation is added into the mix, since teams need to put in the groundwork to get it going.
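The quantitative half of that comparison is simple arithmetic. The sketch below assumes you track weekly toil hours and uses made-up figures; questions about morale and culture still need surveys and conversations rather than a formula.

```python
def toil_change(baseline_hours, current_hours):
    """Relative change in weekly toil hours (negative means less toil)."""
    if baseline_hours <= 0:
        raise ValueError("baseline_hours must be positive")
    return (current_hours - baseline_hours) / baseline_hours


# Made-up figures: 20 hours of weekly toil at baseline, 14 hours now.
change = toil_change(baseline_hours=20.0, current_hours=14.0)
print(f"{change:+.0%}")
```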

While things won’t necessarily change overnight, you will start to see more efficiency moving forward and a boost in morale – which ultimately helps with achieving business objectives.

How can Blameless help?

Blameless helps teams automate toil and carry out best practices across incident management, comprehensive retrospectives, service level objectives, reliability insights, and more. With Blameless, real-time incident data and any follow-up actions are captured and synced into system records for accuracy and ease. Plus, Blameless Runbook Documentation helps teams capture essential knowledge and standardize incident response.

To learn more about how Blameless helps teams automate toil, schedule a demo today, or sign up for our newsletter for more insights.
