Here are 4 Ways SRE Helps New Employees Onboard

Emily Arnott

Onboarding is an essential yet challenging part of the hiring process. As your organization matures, more of its processes become unique. This makes it harder for new employees to get up to speed. Investing in custom processes and tooling to achieve your specific goals is a valuable practice. But, you must balance this with an investment in onboarding.

Fortunately, an investment in SRE is also an investment in onboarding, as one of the important goals of SRE is to help democratize context across software teams. At first, SRE may seem like an area with a steep learning curve. The diversity of skills expected of the SRE role can make it difficult to hire for. However, these skills help broaden engineers' abilities and understanding of their organization's systems. The SRE mentality can provide insights into many areas, including onboarding itself.

In this blog post, we’ll cover how SRE takes onboarding to the next level.

Runbooks as guides for new employees

A runbook provides a sequence of steps, checks, and recommendations to complete a specific task. Runbooks are often used by teams during incident response. When something goes wrong, team members can refer to a runbook for guidelines on how to resolve it. You can build runbooks for many common tasks, from spinning up new servers to deploying an update.

The beauty of runbooks for new employees is their simplicity. Because of their comprehensive checks and steps, they require little specific expertise to execute. Each check is broken down into something observable, and each step broken down into something actionable. Where a more complicated process is required, that process can have its own runbook. Even a brand new employee should be able to work through a runbook. In fact, the SRE mentality is to automate runbooks as much as possible.
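As a rough illustration of that structure, a runbook can be modeled as an ordered list of steps, each pairing an observable check with an action. The sketch below is a minimal Python example, not a real tool: the names Step and run_runbook and the disk-cleanup scenario are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    check: Callable[[], bool]   # observable condition: True means the action is needed
    action: Callable[[], None]  # what to do when the check fires


def run_runbook(steps: List[Step]) -> List[str]:
    """Walk the steps in order and return a log of what happened."""
    log = []
    for number, step in enumerate(steps, start=1):
        if step.check():
            step.action()
            log.append(f"{number}. {step.description}: action taken")
        else:
            log.append(f"{number}. {step.description}: skipped")
    return log


# Hypothetical scenario: a toy "server" whose disk is nearly full.
server = {"disk_used_pct": 95, "service_running": True}


def clear_tmp() -> None:
    server["disk_used_pct"] = 60


def restart_service() -> None:
    server["service_running"] = True


disk_runbook = [
    Step("Free disk space if usage is above 90%",
         check=lambda: server["disk_used_pct"] > 90,
         action=clear_tmp),
    Step("Restart the service if it is down",
         check=lambda: not server["service_running"],
         action=restart_service),
]

runbook_log = run_runbook(disk_runbook)
for line in runbook_log:
    print(line)
```

Writing each check as a yes/no question and each action as a single operation keeps the runbook readable for a new hire, and it is also what makes the steps straightforward to automate later.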

Runbooks allow new employees to contribute to solving problems immediately. As they work through them, they gain familiarity with a variety of systems in a guided way. Runbooks reveal the inner logic of your systems, showing the necessity of each step. Having new employees work through runbooks also helps improve the runbooks themselves. When reviewing runbooks, look for places that new employees stumbled. Refine your runbooks with that feedback in mind.

Incident retrospectives as a library of learning

Of course, not every incident can be covered wholly by a runbook. There will inevitably be challenges that are novel. The SRE mentality is to build documents around how teams responded. These documents are known as incident retrospectives (or postmortems). New employees can consult them to learn and grow.

Here are some common features of an incident retrospective and how they help new employees:

Summary of customer impact
- Shows connections between service areas and customer use
- Shows how the severity of incidents is tied to this impact
- Demonstrates how to triage and prioritize incidents

Follow-up actions
- Shows how incidents fit into greater DevOps cycles
- Shows how tasks are assigned and followed up on

Narrative
- Helps highlight how key decisions and insights were made
- Shows how each team member contributed to the response

Timeline of key events
- Breaks down the incident response process
- Shows the cadence of check-ins and communication

Technical background
- Shows what monitoring data is collected and its relevance
- Provides a context for the incident among other incidents (e.g. is this a recurring problem?)

Process analysis
- Demonstrates a culture of blamelessness
- Shows how processes may have changed since the incident occurred

Any single incident retrospective can provide new employees with a wealth of knowledge. This knowledge goes beyond incident response.

New employees should be encouraged to review these retrospectives. Consider making a collection of your team's "greatest hits." Choose examples that show the best of your processes, or incidents that led to major decisions. Consider having new employees sit in on retrospective review sessions, even if they weren't involved in the incident. This will demonstrate how retrospective sessions are conducted within teams.

You can also apply the incident retrospective model to other areas of your organization. Consider making a sprint retrospective for major development projects. Many of the elements of the incident retrospective will still apply. These resources will be valuable for employees of any maturity level.

SLIs, SLOs, and error budgets as focal points and confidence boosters

When new employees start, their biggest concern might be “what shouldn’t I do?” The prospect of making a mistake can intimidate new hires enough to paralyze them. With complex systems, it’s hard to know what change may knock over the dominoes. SRE tools can encourage new hires to explore.

SLIs help point new employees to the major linchpins of your system. They indicate the aspects of the service that are most vital to customers. Although they point at high-level concepts, they can be broken down into simple components. SLIs provide an easy way for new engineers to connect with users. By understanding user desires and pain points, they can more confidently explore development opportunities. 

SLOs and error budgets will help them gauge whether or not it's safe to ship any new code. SLOs show the threshold at which an SLI's metrics become unacceptable. The error budget shows the inverse: how much room is left before the SLO is breached. This can help new employees overcome anxiety. Even if they make an error, it can be budgeted for. Or, if the anticipated risk is too high, they can wait until they regain some error budget to push out their code.
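To make the arithmetic concrete, here is a minimal Python sketch of an error budget calculation for a request-based SLO. The 99.9% target, the request counts, and the 25% "safe to ship" threshold are assumed values for illustration, and error_budget and safe_to_ship are hypothetical helper names, not part of any specific product.

```python
from fractions import Fraction


def error_budget(slo_target: str, total_events: int, bad_events: int):
    """Return (allowed_bad, remaining_bad, remaining_fraction) for one SLO window.

    slo_target is a decimal string such as "0.999"; Fraction avoids
    floating-point rounding in the budget arithmetic.
    Assumes slo_target < 1 and total_events > 0.
    """
    slo = Fraction(slo_target)
    allowed_bad = total_events * (1 - slo)    # failures the SLO tolerates
    remaining_bad = allowed_bad - bad_events  # negative means the SLO is breached
    remaining_fraction = remaining_bad / allowed_bad
    return allowed_bad, remaining_bad, remaining_fraction


def safe_to_ship(remaining_fraction: Fraction, min_remaining: Fraction) -> bool:
    """Hypothetical team policy: ship only while enough budget is left."""
    return remaining_fraction >= min_remaining


# Assumed numbers: 1,000,000 requests in the window, 400 of them failed.
allowed, remaining, fraction_left = error_budget("0.999", 1_000_000, 400)
print(f"Allowed failures: {allowed}")
print(f"Remaining failures: {remaining}")
print(f"Budget left: {float(fraction_left):.0%}")
print("Safe to ship:", safe_to_ship(fraction_left, Fraction(1, 4)))
```

With these assumed numbers, a 99.9% target over one million requests tolerates 1,000 failures, so 400 failures leaves 60% of the budget. A new hire can read that as room to ship, while a nearly exhausted budget would be a signal to wait.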

Looking at the error budget can tell a new employee the story of the development project. At a glance, they can understand the priorities and challenges. Stress tests and chaos engineering experiments can be accounted for in the budget as well. By understanding these risk factors and their impacts, new engineers can become emboldened. They’re encouraged to take risks and innovate up to the error budget’s limit.

These simple, centralized, and accessible metrics are essential for new employees. By following the SLO graphs, they can quickly get a feel for the pulse of development. They can review development priorities, assess the pace, and spot the significance of issues. This fuller understanding helps them proceed confidently with their first projects.

Refining onboarding with an SRE mentality

In a talk for SRECon, Jennifer Petoff and JC van Winkel, SREs at Google, break down how they “SRE’ed an SRE training program.” Their talk reveals insightful parallels between the process of making a system more reliable and making an onboarding program more effective.

A major revelation that may seem counterintuitive is that it is possible to overspend on training. Petoff and van Winkel refer to this as "polishing a diamond." Like making a service more reliable than the SLO demands, you face diminishing returns past a certain point of onboarding effort.

This is especially true in SRE, where there are often a vast number of potential failure modes, no two of which are exactly the same. This means it's important to invest in tools that also help with onboarding and give team members the ability to make empowered decisions in the moment (context over control). Creating runbooks, libraries of learning, and centralized metrics is beneficial, as clearer context is key to making systems more reliable.

Just like our systems will never be 100% reliable, our onboarding programs won’t be 100% comprehensive. The SRE mentality is to constantly learn and revise. Get feedback from your new hires on what worked and didn’t in their onboarding program. Ask them to make notes of when they encounter unknowns in their work. Review whether those unknowns should have been taught to them sooner.

Making your onboarding process resilient is also a key lesson of SRE. Make sure training doesn't require specific people to be available. You can run chaos engineering-style experiments to see how onboarding would go if some people or resources were unavailable. Develop plan Bs and Cs to cover these worst-case scenarios.
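One lightweight way to try this on paper is to list who can lead each onboarding session and then simulate each person being away. The following Python sketch assumes a simple mapping of sessions to trainers. The session names, people, and function names are all hypothetical.

```python
from typing import Dict, List, Set


def sessions_blocked_by(absent: Set[str], trainers: Dict[str, List[str]]) -> List[str]:
    """Return the sessions nobody could lead if the `absent` people were away."""
    return sorted(
        session
        for session, people in trainers.items()
        if not set(people) - absent
    )


def single_points_of_failure(trainers: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Simulate each person being unavailable, one at a time."""
    everyone = {person for people in trainers.values() for person in people}
    report = {}
    for person in sorted(everyone):
        blocked = sessions_blocked_by({person}, trainers)
        if blocked:
            report[person] = blocked
    return report


# Hypothetical onboarding plan: session -> people able to lead it.
onboarding_plan = {
    "Incident response walkthrough": ["Ana", "Raj"],
    "Reading SLO dashboards": ["Raj"],
    "Deploy pipeline tour": ["Ana", "Mei"],
    "Retrospective review": ["Mei"],
}

spof_report = single_points_of_failure(onboarding_plan)
for person, blocked in spof_report.items():
    print(f"If {person} is unavailable: {', '.join(blocked)}")
```

Any session that appears in the report depends on a single person, which makes it a natural candidate for a backup trainer or a recorded version.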

Blameless can help new employees get up to speed with incident response processes, incident retrospectives, and SLOs. To see how it works, check out a demo.
