What is Runbook Automation? Best Practices

Myra Nizami | 4.5.2023

Looking into runbook automation? We explain how runbook automation works, with examples and tips on how to use it to streamline your incident response process.

What is runbook automation?

A runbook is a guide for handling common tasks within a specific process. Adding automation to runbooks allows the steps and checks of the runbook to execute automatically, leading to faster and more consistent results.

Runbook automation eliminates toil further by having these steps run through software triggered by certain situations (such as exceeding a threshold in your error budget policy), minimizing the amount of input you need to provide. This requires tools to execute each step, as well as a tool to orchestrate the overall runbook and determine which steps are necessary.
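As a rough illustration of such a trigger, here is a minimal sketch in Python. It assumes a hypothetical error budget policy with a single threshold; the names `ErrorBudget` and `should_trigger` and the 75% threshold are illustrative assumptions, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one service over a rolling window."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int

    def consumed_fraction(self) -> float:
        """Fraction of the allowed failures that has already been used."""
        allowed_failures = (1.0 - self.slo_target) * self.total_requests
        if allowed_failures <= 0:
            # No failures are allowed at all: any failure exhausts the budget.
            return 1.0 if self.failed_requests > 0 else 0.0
        return self.failed_requests / allowed_failures


def should_trigger(budget: ErrorBudget, threshold: float = 0.75) -> bool:
    """Return True once the consumed budget reaches the (assumed) policy threshold."""
    return budget.consumed_fraction() >= threshold


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=800)
if should_trigger(budget):
    print("Error budget threshold exceeded: starting runbook")
```

In a real setup, the request counts would come from your monitoring system, and the print statement would be replaced by a call to whichever tool orchestrates the runbook.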

Runbook automation is used for incident management, service reports, emergency protocols, and other key business processes. 

How does runbook automation work? 

Runbook automation is a way to automate workflows and reduce manual commands, so that operations procedures can be implemented with very little intervention. Additional resources on runbook automation best practices can help you design your own.

Pre-Assigned Processes

Think of runbook automation as a tool that enables teams to automatically run the correct runbook based on the task. IT systems can be configured to attach relevant runbooks based on the problem at hand. That reduces a lot of the manual work related to incident response since there is no need to look for guidance, optimizing efficiency. Taking it one step further, teams can also automate runbook tasks to become self-service operations. That way, teams only need to focus on more complex issues and let incident response be as automated as possible. 
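One way to picture attaching the right runbook to a problem is a lookup from alert type to runbook. The sketch below is a simplified assumption of how that could work: the alert fields (`type`, `service`, `host`) and the runbook functions are hypothetical, and each runbook only returns a description of what it would do.

```python
from typing import Callable, Dict


def restart_service(alert: dict) -> str:
    """Hypothetical runbook: restart the affected service."""
    return f"restarting {alert['service']}"


def clear_disk_space(alert: dict) -> str:
    """Hypothetical runbook: free up space on the affected host."""
    return f"clearing temp files on {alert['host']}"


def escalate_to_human(alert: dict) -> str:
    """Fallback when no runbook is registered for this alert type."""
    return f"no runbook for '{alert['type']}': paging on-call"


# Registry that pre-assigns a runbook to each known alert type.
RUNBOOKS: Dict[str, Callable[[dict], str]] = {
    "service_down": restart_service,
    "disk_full": clear_disk_space,
}


def dispatch(alert: dict) -> str:
    """Pick the runbook for the alert type, or escalate if none exists."""
    runbook = RUNBOOKS.get(alert["type"], escalate_to_human)
    return runbook(alert)


print(dispatch({"type": "disk_full", "host": "db-1"}))
```

The fallback matters as much as the mapping: anything the registry does not recognize goes to a person, so responders only spend time on the issues that automation cannot handle.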

Examples of Runbook Automation

Teams can set up runbook automation for simple tasks like data health checks at the end of every day or service checks throughout the day. Organizations can also use runbook automation for complex processes like incident management by setting up parameters like pre-event triggers. These allow runbooks to change the steps they take based on information they get from the incident. 
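For the simple end of that range, an end-of-day data health check might look something like the sketch below. The checks (a minimum row count and a maximum data age) and the function name `check_data_health` are assumptions chosen for illustration; scheduling is left to whatever scheduler you already use.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def check_data_health(
    row_count: int,
    last_updated: datetime,
    now: Optional[datetime] = None,
    min_rows: int = 1,
    max_age: timedelta = timedelta(hours=24),
) -> List[str]:
    """Return a list of problems found; an empty list means the data looks healthy."""
    now = now or datetime.now(timezone.utc)
    problems = []
    if row_count < min_rows:
        problems.append(f"row count {row_count} is below the minimum of {min_rows}")
    if now - last_updated > max_age:
        problems.append(f"data was last updated {now - last_updated} ago")
    return problems


# Example run with fixed timestamps so the result is reproducible.
end_of_day = datetime(2023, 4, 5, 23, 0, tzinfo=timezone.utc)
stale = end_of_day - timedelta(hours=30)
for problem in check_data_health(0, stale, now=end_of_day):
    print("health check failed:", problem)
```

Returning a list of problems rather than a single pass or fail value lets a later runbook step decide what to do with each one, such as opening a ticket or notifying the data owner.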

Why is runbook automation important?

Without runbook automation, teams will likely have a scattered process for incident management and service requests. Solving any kind of issue likely requires a deep dive into old, unstandardized process documents, or leans too heavily on ad-hoc tools that don’t really help. Teams often end up escalating the issue unnecessarily as information isn’t widely available.

The result? Bottlenecks, slow response times, unhappy customers, and more disruptions than there should be.

Faster Response Times

Runbook automation enables teams to work better and smarter when incidents occur. Instead of only having a select few on the team perform operations, runbook automation ensures that key tasks happen without needing any specific responders. If set up correctly, runbook automation has a lot of benefits for teams.

Fewer Barriers to Getting Things Done

Runbook automation means fewer barriers to getting things done. For example, you don’t necessarily need to wait on team members for approval, support, or instructions, because everything you need is already available to you. The faster you can resolve an issue, the sooner customers can return to their usual workflow, and you can mitigate any negative impacts on the business.

Frees up Team Resources

Runbook automation can free up resources and time. Service requests and incident management runbook automation means teams aren’t bogged down in repetitive work. Only the truly high-priority issues get precedence, making incident response time shorter. Runbook automation gives teams an easy way to streamline their workload without negatively impacting customers. 

A runbook template: Key steps to consider

  1. Understand and map your system architecture

To create runbooks that automatically use a variety of services, you’ll need to understand how each service functions and how they connect. Map these connections and include information on how automation tools can control each service to lay a solid foundation for future runbooks.

  2. Identify the right service owners

Once you’ve mapped out your architecture, you’ll need a repository of the owners of each service. This will help future runbook authors contact the right people for collaboration, advice, and sign-offs. Complex automated runbooks will work through many service areas, so involving the owners and experts of each space is a must.

  3. Lay out key procedures and checklist tasks

Common tasks often have common steps - subtask procedures like auditing, version control, and deployment are likely to overlap. Identify these key steps and clearly define their processes, then compile them into a list. Future runbook authors should use steps from this list when possible for consistency.

  4. Identify methods to bake into automation

Now that you have a list of key procedures that recur in many tasks, you also have a great starting point for finding automation opportunities. Look for things that can be scripted, and ways to have scripts trigger subsequent scripts. Make your automated steps modular so they can be baked into a variety of runbooks.
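To make the idea of modular steps concrete, here is a minimal sketch in which each step is a small function that takes a context and returns an updated one, so the same steps can be reused across runbooks. The step names and context keys are hypothetical, and the steps only record what they would do rather than touching real systems.

```python
from typing import Callable, Dict, List

Context = Dict[str, object]
Step = Callable[[Context], Context]


def take_backup(ctx: Context) -> Context:
    """Reusable step: record a (pretend) backup for the service."""
    return {**ctx, "backup": f"{ctx['service']}-backup"}


def deploy(ctx: Context) -> Context:
    """Reusable step: mark the requested version as deployed."""
    return {**ctx, "deployed_version": ctx["version"]}


def verify(ctx: Context) -> Context:
    """Reusable step: confirm the deployed version matches the request."""
    return {**ctx, "verified": ctx.get("deployed_version") == ctx["version"]}


def run_runbook(steps: List[Step], ctx: Context) -> Context:
    """Run steps in order, feeding each step's output into the next."""
    for step in steps:
        ctx = step(ctx)
    return ctx


# Two runbooks built from the same modular steps.
DEPLOY_RUNBOOK: List[Step] = [take_backup, deploy, verify]
HOTFIX_RUNBOOK: List[Step] = [deploy, verify]

result = run_runbook(DEPLOY_RUNBOOK, {"service": "api", "version": "1.2.0"})
print(result)
```

Because every step shares the same shape, adding a new runbook is mostly a matter of picking steps from the list and ordering them.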

  5. Continue refining, learning, and improving

Resources like the architecture map, service owner repository, and list of common tasks aren’t to be created once and left untouched. Include updating these resources as a checklist task on procedures that would modify them, and also have regular checks to ensure they’re up to date. When you revisit them, take the opportunity to learn from them again, looking for new opportunities to automate and optimize.

How to write simple runbooks for complex workflows

One of the most powerful features of automated runbooks or playbooks is their ability to navigate long conditional paths to complete complex tasks. Consider a runbook created to update the settings for a variety of development environments. This could require the automating tool to check many variables and deploy different changes for each combination, quickly creating a tree with many branches. Manually determining which branch to follow can be a tedious challenge, but the automated runbook finds the correct branch with ease.

You’ll inevitably need to change your runbook, so you’ll need some way to cut through this complexity. In order for other developers to update and refine your automated runbook, you’ll need some representation of how it actually works. This could take the form of a visual aid, like a flowchart, that shows you the steps and pathways at a glance with embedded links to the code executed at each step.

Another option is to have a simple automating language that dictates the overall structure of the runbook. Ansible provides automation tools that are controlled by instructions in a simple language that’s understandable without any special programming knowledge. This helps your runbooks remain easy to parse and update, even when they contain many steps and connections.
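The sketch below illustrates the general idea of keeping a runbook's structure in a simple, readable description that a small interpreter walks. It is not Ansible syntax; the tree layout (`check`, `branches`, `default`) and the `resolve` function are assumptions made up for this example, using the environment settings scenario described above.

```python
from typing import Dict, Union

Node = Union[str, dict]

# The whole decision tree is plain data: easy to read, diff, and update.
RUNBOOK_TREE: Node = {
    "check": "environment",
    "branches": {
        "production": {
            "check": "region",
            "branches": {
                "eu": "apply eu-production settings",
                "us": "apply us-production settings",
            },
        },
        "staging": "apply staging settings",
    },
    "default": "skip: unknown environment",
}


def resolve(node: Node, facts: Dict[str, str]) -> str:
    """Walk the tree using the given facts until an action (a string) is reached."""
    while isinstance(node, dict):
        value = facts.get(node["check"])
        fallback = node.get("default", "skip: no matching branch")
        node = node["branches"].get(value, fallback)
    return node


print(resolve(RUNBOOK_TREE, {"environment": "production", "region": "eu"}))
```

Updating the runbook then means editing the tree rather than the code that walks it, which keeps changes small and reviewable even as branches multiply.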

Make creating new automated runbooks easy

To get the most out of runbook automation, developers should be encouraged to implement them where possible to help create guardrails around specific processes. You should never assume that any area of development and operations is unable to be automated - even in the most nuanced projects, you’ll find simpler subtasks that could be automated. Likewise, consider automating even seemingly novel tasks. Your investment in automation can pay big dividends if these tasks do end up recurring.

To encourage this automation mentality, remove as many barriers as you can to creating and implementing new automated runbooks. Ideally, creating a new automatic runbook for a task shouldn’t take much longer or need many more resources than just completing the task manually. Rundeck, for example, allows users to quickly create workflows, integrating existing scripts and tools. It prides itself on being “automation for automation,” allowing you to automate as quickly as possible.

Of course, like any other aspect of development, automated runbooks should be observed and reviewed on a regular basis. The more runbooks you have running around, the more essential it is to stay on top of what they’re doing. You can help yourself out by having your automated runbooks log themselves, providing information on when they run, what choices they make, and what resources they use. This small overhead is another rewarding investment.
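A lightweight way to have runbooks log themselves is to wrap each step so that it records when it ran, how long it took, and what it decided. The sketch below uses Python's standard `logging` module; the decorator name `logged_step`, the log format, and the example step with its 5% error rate cutoff are illustrative assumptions.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
log = logging.getLogger("runbook")


def logged_step(func):
    """Wrap a runbook step so each run records its start, outcome, and duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        log.info("step=%s status=started", func.__name__)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the failure with a traceback, then let it propagate.
            log.exception("step=%s status=failed", func.__name__)
            raise
        duration = time.monotonic() - start
        log.info("step=%s status=finished duration=%.3fs result=%r",
                 func.__name__, duration, result)
        return result
    return wrapper


@logged_step
def choose_rollback(error_rate: float) -> str:
    """Hypothetical decision step: roll back if the error rate is too high."""
    return "rollback" if error_rate > 0.05 else "keep"


choose_rollback(0.2)
```

Logging the result of each decision step is what makes later review possible: you can see not only that a runbook ran, but which path it took and why.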

Integrate runbook automation into every aspect of DevOps

There are opportunities to automate and save time in even the most nuanced aspects of development and operations. To empower this, your automated runbooks should hook into every tool in your stack. One route to this connection is to have tools that can be easily controlled through things like external scripts, allowing the orchestrating runbook automation tool to deploy custom instructions.

Another route is to choose an orchestrating tool that has specific integrations with the rest of your environment. Microsoft’s Azure Automation works with every aspect of an Azure development environment, allowing Azure customers to intuitively create powerful instructions for every part of their DevOps solution.

Have automated runbooks for reliability events

One of the most helpful ways to use automated runbooks is in incident response, increasing the speed and consistency of resolution. Create automated runbooks for your common troubleshooting processes, and have them trigger in response to outages, extreme load, or other threats to your SLOs.

But remember that automated runbooks can only do so much. SRE principles teach us that there will always be incidents that fall outside of our expectations, so it’s impossible to have a runbook for everything that could go wrong. Runbooks will still be useful in these instances, though: the audit trails they generate of what didn’t work provide a great starting point to determine how to triage.

We mentioned earlier the importance of scheduled review sessions to refine your runbooks and incident playbooks, and a good SRE solution will support you here too. The resource monitoring of SRE will allow you to measure the impact of your runbooks, highlighting areas to refine and optimize. Likewise, monitoring of development resources can suggest areas that could benefit from further automation.

What are some tools I can use for runbook automation?

There are several different types of tools used for runbook automation. Ideally, you’re looking for solutions that enable event-driven automation and work both on-premises and in the cloud. Runbook automation tools should also have self-documenting capabilities so teams can check workflows and have the documentation needed for incident management.

Some of the tool types for runbook automation include:

  • Automation harness: A hub that brings together scripts, tools, or APIs and lets users configure these resources into a workflow that can execute automatically.
  • Guardrails: Guardrails are used for access control and usability. Access control guardrails are used for user permissions and auditing. Usability would lean more towards guidance and minimizing the need for extensive training. 
  • Dynamic infrastructure map: Dynamic infrastructure maps streamline the different elements of infrastructure to help create targeted automation. 

Runbook automation tools

Some of the more common runbook automation tools include:

  • Rundeck: Create automated runbooks and give selected users self-service access to handle incident management as needed.
  • IBM Runbook Automation: Supports manual, semi-automated, and fully automated workflows with pre-event triggers configuration available.
  • Azure Automation Runbook Management: Used with PowerShell Workflow. The runbooks come with predefined scenarios based on PowerShell, so teams don’t need to create runbooks from scratch, and include features that enable easy automation based on specific parameters.
  • Octopus Deploy: Users can create, manage, execute and schedule runbooks in multiple environments.
  • Fujitsu Systemwalker Runbook Automation: Includes runbook automation features and task history, including manual activities like requests and approval to build audit trails. 
  • Resolve.io Runbook Automation: Enables automation across different environments and is available both on-premises and in the cloud.

Each of these tools for runbook automation has different pros and cons, so it’s difficult to compare in a one-size-fits-all way. However, when evaluating runbook automation tools, it’s helpful to think about the processes and tasks that need to be automated, their complexity, environments, and current infrastructure.

How can Blameless help?

The Blameless SRE platform has the tools needed to enhance runbook automation. This includes checklists and reminders to create guardrails, help with creating new automated runbooks, and documentation of runbook activities in Blameless’ retrospectives (commonly referred to as postmortems). In addition, Blameless reliability insights can help identify opportunities for more automation. Learn more about how Blameless helps streamline and optimize reliability by checking out our demo.
