Onboarding is an essential yet challenging part of the hiring process. As your organization matures, more of its processes become unique, which makes it harder for new employees to get up to speed. Investing in custom processes and tooling to achieve your specific goals is a valuable practice, but you must balance it with an investment in onboarding.
Fortunately, an investment in SRE is also an investment in onboarding, as one of the important goals of SRE is to help democratize context across software teams. At first, SRE may seem like an area with a steep learning curve. The diversity of skills expected of the SRE role can make it difficult to hire for. However, these skills help broaden engineers’ abilities and their understanding of their organization’s systems. The SRE mentality can provide insights into many areas, including onboarding itself.
In this blog post, we’ll cover how SRE takes onboarding to the next level.
A runbook provides a sequence of steps, checks, and recommendations to complete a specific task. Runbooks are often used by teams during incident response. When something goes wrong, team members can refer to a runbook for guidelines on how to resolve it. You can build runbooks for many common tasks, from spinning up new servers to deploying an update.
The beauty of runbooks for new employees is their simplicity. Because of their comprehensive checks and steps, they require little specific expertise to execute. Each check is broken down into something observable, and each step broken down into something actionable. Where a more complicated process is required, that process can have its own runbook. Even a brand new employee should be able to work through a runbook. In fact, the SRE mentality is to automate runbooks as much as possible.
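To make the idea of observable checks and actionable steps concrete, here is a minimal sketch of an automated runbook in Python. Everything in it is an assumption for illustration: the `Step` class, the `run_runbook` function, and the in-memory `state` dictionary that stands in for a real system are hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One runbook entry: an observable check paired with an actionable fix."""
    description: str
    check: Callable[[], bool]   # returns True when the condition already holds
    action: Callable[[], None]  # what to do when the check fails


def run_runbook(steps: List[Step]) -> List[str]:
    """Work through the steps in order and return a log of what happened."""
    log = []
    for number, step in enumerate(steps, start=1):
        if step.check():
            log.append(f"{number}. {step.description}: ok, skipped")
            continue
        step.action()
        if step.check():
            log.append(f"{number}. {step.description}: fixed")
        else:
            # Stop here so a person can take over with full context.
            log.append(f"{number}. {step.description}: needs escalation")
            break
    return log


# Hypothetical system state, used only to make the sketch runnable.
state = {"disk_free_gb": 2, "service_up": False}


def free_disk() -> None:
    state["disk_free_gb"] = 20


def restart_service() -> None:
    state["service_up"] = True


runbook = [
    Step("Disk has at least 10 GB free", lambda: state["disk_free_gb"] >= 10, free_disk),
    Step("Service responds to health check", lambda: state["service_up"], restart_service),
]

log = run_runbook(runbook)
for line in log:
    print(line)
```

Because each step re-runs its check after acting, the log shows a new employee both what was wrong and whether the fix worked. A step that cannot be fixed automatically halts the run and flags the point where a human needs to step in.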
Runbooks allow new employees to contribute to solving problems immediately. As they work through them, they gain familiarity with a variety of systems in a guided way. Runbooks reveal the inner logic of your systems, showing the necessity of each step. Having new employees work through runbooks also helps improve the runbooks themselves. When reviewing runbooks, look for places that new employees stumbled. Refine your runbooks with that feedback in mind.
Of course, not every incident can be covered wholly by a runbook. There will inevitably be challenges that are novel. The SRE mentality is to build documents around how teams responded. These documents are known as incident retrospectives (or postmortems). New employees can consult them to learn and grow.
Here are some common features of an incident retrospective and how they help new employees:
Any single incident retrospective can provide new employees with a wealth of knowledge. This knowledge goes beyond incident response.
New employees should be encouraged to review these retrospectives. Consider making a collection of your team’s “greatest hits.” Choose examples that show the best of your processes, or incidents that led to major decisions. Consider having new employees sit in on retrospective review sessions, even if they weren’t involved in the incident. This will demonstrate how retrospective sessions are conducted within teams.
You can also apply the incident retrospective model to other areas of your organization. Consider making a sprint retrospective for major development projects. Many of the elements of the incident retrospective will still apply. These resources will be valuable for employees of any maturity level.
When new employees start, their biggest concern might be “what shouldn’t I do?” The prospect of making a mistake can intimidate new hires enough to paralyze them. With complex systems, it’s hard to know what change may knock over the dominoes. SRE tools can encourage new hires to explore.
SLIs help point new employees to the major linchpins of your system. They indicate the aspects of the service that are most vital to customers. Although they point at high-level concepts, they can be broken down into simple components. SLIs provide an easy way for new engineers to connect with users. By understanding user desires and pain points, they can more confidently explore development opportunities.
SLOs and error budgets will help new employees gauge whether or not it’s safe to ship new code. SLOs show the threshold at which an SLI’s metrics become unacceptable. The error budget shows the inverse: how much of the metric is left before the SLO is breached. This can help new employees overcome anxiety. Even if they make an error, it can be budgeted for. Or, if the anticipated risk is too high, they can wait until they regain some error budget to push out their code.
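The arithmetic behind an error budget is simple enough to sketch in a few lines of Python. The example below assumes a request-based SLI (good requests divided by total requests). The function names and the 10% reserve in `safe_to_ship` are hypothetical choices for illustration, not a standard policy.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the measurement window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    return 1.0 - bad_events / error_budget(slo_target, total_events)


def safe_to_ship(slo_target: float, total_events: int, bad_events: int,
                 reserve: float = 0.1) -> bool:
    """Assumed policy: ship only while more than `reserve` of the budget remains."""
    return budget_remaining(slo_target, total_events, bad_events) > reserve


# A 99.9% SLO over 1,000,000 requests tolerates roughly 1,000 failures.
# With 250 failures so far, about three quarters of the budget is left.
print(round(error_budget(0.999, 1_000_000)))
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
print(safe_to_ship(0.999, 1_000_000, 250))
```

A new hire reading these numbers can see how much room there is for a risky change. Rejecting a target of 1.0 is deliberate: a 100% SLO leaves no budget at all.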
Looking at the error budget can tell a new employee the story of the development project. At a glance, they can understand the priorities and challenges. Stress tests and chaos engineering experiments can be accounted for in the budget as well. By understanding these risk factors and their impacts, new engineers can become emboldened. They’re encouraged to take risks and innovate up to the error budget’s limit.
These simple, centralized, and accessible metrics are essential for new employees. By following the SLO graphs, they can instantly feel the pulse of development. They can review development priorities, assess the pace, and spot the significance of issues. This completeness of understanding helps them proceed confidently with their first projects.
In a talk for SRECon, Jennifer Petoff and JC van Winkel, SREs at Google, break down how they “SRE’ed an SRE training program.” Their talk reveals insightful parallels between the process of making a system more reliable and making an onboarding program more effective.
A major revelation that may seem counterintuitive is that it is possible to overspend on training. Petoff and van Winkel refer to this as “polishing a diamond.” As with making a service more reliable than its SLO demands, you face diminishing returns past a certain point of onboarding effort.
This is especially true in SRE, where there are often a vast number of potential failure modes, no two of which are exactly the same. This means it’s important to invest in tools that also help with onboarding and give team members the ability to make empowered decisions in the moment (context over control). Creating runbooks, libraries of learning, and centralized metrics is beneficial, as clearer context is key to making systems more reliable.
Just like our systems will never be 100% reliable, our onboarding programs won’t be 100% comprehensive. The SRE mentality is to constantly learn and revise. Get feedback from your new hires on what worked and didn’t in their onboarding program. Ask them to make notes of when they encounter unknowns in their work. Review whether those unknowns should have been taught to them sooner.
Making your onboarding process resilient is also a key lesson of SRE. Make sure training doesn’t require specific people to be available. You can run chaos engineering-esque experiments to see how onboarding would go if some people or resources were unavailable. Develop Plan Bs and Cs to cover these worst-case scenarios.
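One lightweight way to run such an experiment on paper is to model who can teach each part of your onboarding, then "remove" people and see what breaks. The sketch below is illustrative only: the module names, the people, and both function names are hypothetical.

```python
from typing import Dict, Iterable, List


def uncovered_modules(owners_by_module: Dict[str, List[str]],
                      unavailable: Iterable[str]) -> List[str]:
    """Training modules with nobody left to teach them if `unavailable` people are out."""
    out = set(unavailable)
    return sorted(
        module for module, owners in owners_by_module.items()
        if not set(owners) - out
    )


def single_points_of_failure(owners_by_module: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each person, the modules that stall if only that person is unavailable."""
    people = sorted({p for owners in owners_by_module.values() for p in owners})
    report = {}
    for person in people:
        stalled = uncovered_modules(owners_by_module, [person])
        if stalled:
            report[person] = stalled
    return report


# Hypothetical onboarding plan: module -> people able to run it.
onboarding = {
    "incident response": ["ana", "raj"],
    "deploy pipeline": ["raj"],
    "slo dashboards": ["ana", "li"],
}

print(single_points_of_failure(onboarding))
print(uncovered_modules(onboarding, ["ana", "raj"]))
```

Any person who appears in the report marks a module that needs a backup trainer or written material before the next hire starts.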
Blameless can help new employees get up to speed with incident response processes, incident retrospectives, and SLOs. To see how it works, check out a demo.