Whether you’re building an SRE team or looking for a job as an SRE, understanding the SRE job description is important. How would you define an SRE job?
What is an SRE job description?
SREs are responsible for creating scalable and reliable software systems. Generally, SREs are expected to have a Bachelor’s degree in computer science or a similar field, a proven ability to code, and an understanding of IT operations. However, as SREs can specialize in areas outside of coding, a formal background in tech may not be required.
That’s the basic foundation of any SRE job description you write. However, to write an effective description that gets you the best candidates possible, it’s crucial to have a nuanced understanding of site reliability engineer skills. Once you’ve identified the core skillsets needed, you can start to consider your specific business context and what will be most valuable there.
Site reliability skills in more detail
When developing the SRE job description, an important step is to differentiate between a site reliability engineer and a DevOps engineer, and to be clear about why you need an SRE specifically. It can help to review what SRE is before you start writing.
Site reliability engineering skills are very much rooted in continually analyzing the infrastructure and making changes to improve reliability. That could mean infrastructure optimization, performance improvement, and workflow design (or redesign).
Site reliability engineers focus on reducing downtime and the risk of major incidents, using automation where possible. As more organizations adopt DevOps practices to accelerate development, there is a growing emphasis on having SREs on the team who can execute the goals of DevOps. In addition, the measures site reliability engineers take shorten the DevOps cycle, because software delivery and CI/CD best practices become standardized and automated where possible.
Site reliability engineer responsibilities
While a lot goes into the SRE role, its responsibilities fit into a few broad categories. On a day-to-day level, SREs will be focused on automation, monitoring, incident response, and culture.
One of the primary responsibilities of SREs is to build out automation that streamlines IT operations. The goal here, and the skill candidates should be evaluated on, is reducing manual work through automation. This could include building out CI/CD pipeline automation where needed. It also includes using SRE tools to automate monitoring, incident response, and alerting, so that necessary but time-consuming tasks take less effort.
Automation goes hand-in-hand with monitoring, another core SRE responsibility. While SREs may use tools to automate monitoring, they are still responsible for overseeing the underlying infrastructure of the solution. Site reliability engineers will also need to monitor the tools they use for automation to ensure they’re working as expected.
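To make the idea of automated monitoring and alerting concrete, here is a minimal sketch of the kind of check an SRE might write. It assumes request counts are already being collected; the `Sample` type, the function names, and the 1% threshold are all hypothetical and chosen only for illustration, not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Request counts for one monitoring interval (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_rate(samples: List[Sample]) -> float:
    """Return the fraction of failed requests across all samples."""
    total = sum(s.total_requests for s in samples)
    failed = sum(s.failed_requests for s in samples)
    # With no traffic there is nothing to measure, so report 0.0.
    return failed / total if total else 0.0


def should_alert(samples: List[Sample], threshold: float = 0.01) -> bool:
    """Alert when the error rate exceeds the (assumed) threshold."""
    return error_rate(samples) > threshold
```

A candidate who can write, test, and maintain checks like this one, and who knows when a threshold is too noisy to be useful, is demonstrating the automation skill the role calls for.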
Downtime and failures are inevitable, but how SREs deal with the problem is what’s important. SREs work together with developers to troubleshoot and solve problems and to reduce customer impact where possible. After the incident, they go one step further: documenting and examining what went wrong, and developing measures such as automated runbooks to handle the issue in the future.
Many SRE practices require a cultural foundation to really function. Looking at incidents requires blamelessness to uncover systemic changes without pointing fingers. Investing time in improving infrastructure and reducing toil requires a shared understanding that it’s worth focusing on. SREs will have to champion these cultural principles, convincing management and other engineers that they’re worth adopting.
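An automated runbook can be as simple as a mapping from a known alert to the ordered steps that resolved it last time. The sketch below is one possible shape, not a reference to any specific product: the alert name, the step functions, and the `RUNBOOKS` table are hypothetical, and the steps only return descriptions rather than touching real systems.

```python
from typing import Callable, Dict, List

Step = Callable[[], str]


def clear_cache() -> str:
    # Placeholder: a real step would call your cache's own tooling.
    return "cleared cache"


def restart_service() -> str:
    # Placeholder: a real step would call your deployment tooling.
    return "restarted service"


# Hypothetical mapping from alert name to ordered remediation steps,
# written down after a past incident was reviewed.
RUNBOOKS: Dict[str, List[Step]] = {
    "high_error_rate": [clear_cache, restart_service],
}


def run_runbook(alert_name: str) -> List[str]:
    """Run each step for a known alert and return a log of what was done."""
    steps = RUNBOOKS.get(alert_name)
    if steps is None:
        # Unknown alerts are not guessed at; they go to a human.
        return [f"no runbook for {alert_name}; escalate to on-call"]
    return [step() for step in steps]
```

The returned log doubles as documentation for the next retrospective, and unknown alerts fall through to the on-call engineer instead of being handled automatically.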
How to write a site reliability engineer job description
Once the broad responsibilities and skills of the role have been established, the description becomes easier to write.
You’ll want to focus on communicating essential parts of the role, such as:
- Join the on-call rotation for incident response and take proactive measures to prevent incidents
- Document actions taken during incidents so that the response can be automated in the future
- Monitor infrastructure using SRE tools, and suggest tools as necessary
- Build monitoring alerts and incident response processes
- Improve operational processes and team practices
- Code infrastructure automation across the CI/CD pipeline
- Design, build, and maintain the core infrastructure to ensure reliability as the solution scales
- Demonstrate strong programming skills and thorough knowledge of systems
- Bring about cultural shifts to provide a foundation for process changes
The job description needs to balance the technical aspects of the role with the soft skills necessary to thrive in it.
These skills can include:
- Ability to work asynchronously (or however your team primarily works)
- Ability and willingness to collaborate
- Strong problem-solving skills and ability to think under pressure
- Strong analytical skills and management skills
- Communication and documentation skills
These are some of the general points that the SRE job description should contain. As the role becomes more senior, its responsibilities grow and the engineer is expected to be more proactive. Senior SREs also take a more active role in business growth, interacting with senior stakeholders and leadership as needed. Seniority is based on technical expertise, emergency response, and solution design, as well as how well the engineer works with the wider team.
However, seniority and description will vary widely depending on the current team structure, resources, and needs. That’s why it’s crucial to take the core SRE skillsets for the SRE job description and adapt as necessary.
Depending on the seniority of the role, SREs may have to oversee a wider team rather than doing much of the coding work themselves, so the description should include managerial and project-management skills alongside SRE skills. For more junior levels, emphasizing the responsibilities and the wider team structure can help attract more candidates while making it clear what the role entails.
Blameless can help you start or grow your SRE practice. With our suite of tools, including incident retrospectives, reliability insights, and SLOs, new SREs can get up-to-speed with the system’s current health and start making meaningful changes more quickly. To see how, check out a demo!