When you implement SRE, almost every role within your IT organization will change. One of the biggest transformations will be in your Quality Assurance (QA) teams. A common misconception is that SRE “replaces” QA. People believe SLOs and other SRE best practices render the traditional role of QA engineering obsolete, as testing and quality shift left in the software development life cycle (SDLC). This leads to QA teams resisting SRE adoption.
But QA teams can and should embrace the transformation that SRE can bring, as SRE elevates their role to a strategic partner in designing performant software and scalable practices. SRE removes silos from QA expertise, better aligning QA and engineering teams. Also, better prioritization and automation reduce the amount of toil QA teams face. In this blog post, we’ll break down how SRE transforms the role of QA and highlight the improvements it brings for the team.
In his book Implementing Service Level Objectives, Alex Hidalgo explains how SLO implementation can affect QA. He describes six stages that we’ve summarized here:
By reframing QA steps in the context of an error budget, you prove that each step is impactful. Engineers won’t see these tests as onerous because they allow engineers to keep writing new code.
SRE teaches us that failure is inevitable. There will always be bugs and edge cases that QA cannot account for. You need to prioritize testing efforts and design tests to cover the most impactful areas. SLIs, or service level indicators, can help you identify them.
Error budgets and SLOs are based on SLIs. SLIs are based on the areas of your service that have the highest customer impact. When considering the value of a QA test, SLIs can provide very valuable context. Here is a process to evaluate QA tests with an SLI:
This allows you to see the worst-case scenario that the test could prevent. If there’s little potential impact to the error budget, consider removing the test from your arsenal. If you cannot connect a test to an SLI at all, that is a sign you could be running more focused, impactful tests in its place.
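One way to make this kind of evaluation concrete is to express a test's worst-case scenario as a share of the error budget. The sketch below is only an illustration under stated assumptions: the request-based SLO, the 99.9% target, the request volumes, and the `budget_share_at_risk` helper are all hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A request-based SLO over a fixed window (all values are illustrative)."""
    target: float          # e.g. 0.999 means 99.9% of requests must succeed
    window_requests: int   # expected total requests in the SLO window

    @property
    def error_budget(self) -> float:
        """Number of requests allowed to fail within the window."""
        return (1.0 - self.target) * self.window_requests


def budget_share_at_risk(slo: Slo, worst_case_failed_requests: int) -> float:
    """Fraction of the error budget consumed if the bug this test guards
    against reached production in its worst-case form (hypothetical helper)."""
    return worst_case_failed_requests / slo.error_budget


# Assumed SLO for a checkout service: 99.9% success over 10 million requests.
checkout_slo = Slo(target=0.999, window_requests=10_000_000)

# Assumed worst cases for two tests tied to the same SLI.
payment_test_share = budget_share_at_risk(checkout_slo, 25_000)
tooltip_test_share = budget_share_at_risk(checkout_slo, 50)

print(f"payment test protects {payment_test_share:.0%} of the budget")
print(f"tooltip test protects {tooltip_test_share:.1%} of the budget")
```

With these made-up numbers, the payment test guards against a failure that would exhaust the budget several times over, while the tooltip test guards against a small fraction of it. That difference is the kind of context that helps decide which tests to keep.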
Error budgets also allow you to design new tests. Look at major bugs that significantly depleted your error budget. Review incident retrospectives to see exactly where the bug originated. Consider what types of tests would have caught the bug before production. Better yet, given the impossibility of perfectly reproducing production scenarios in staging environments, build practices that enable you to safely test in production.
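To decide where new tests would pay off most, you could total the error budget consumed by past incidents, grouped by where each bug originated. This is a minimal sketch, assuming you record an origin and a budget fraction in each retrospective; the `Incident` fields, the `budget_by_origin` helper, and the sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    """One entry from an incident retrospective (fields are illustrative)."""
    title: str
    origin: str             # where the bug was introduced, per the retrospective
    budget_consumed: float  # fraction of the window's error budget, 0.0 to 1.0


def budget_by_origin(incidents: list[Incident]) -> list[tuple[str, float]]:
    """Sum error budget consumption per origin, largest first."""
    totals: dict[str, float] = defaultdict(float)
    for incident in incidents:
        totals[incident.origin] += incident.budget_consumed
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up retrospective data for illustration only.
incidents = [
    Incident("Checkout timeouts", "database migration", 0.40),
    Incident("Login loop", "config change", 0.15),
    Incident("Slow search", "database migration", 0.20),
]

for origin, consumed in budget_by_origin(incidents):
    print(f"{origin}: {consumed:.0%} of error budget")
```

The origin at the top of the list is a reasonable first candidate for a new test, since it is where undetected bugs have cost the most budget.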
As you adopt SRE best practices, the actual function of testing is often adopted by the engineering team. QA then becomes responsible for the overall design and direction of testing. By using SLOs and SLIs, the goals of development and QA become more aligned and tests become more efficient.
Another way that SRE reduces the toil of QA is through automation. The SRE mentality is to automate wherever possible. QA teams have long advocated for automated testing as well, but SRE elevates these practices in several ways.
An automated runbook is an SRE tool that provides a list of checks and steps for different circumstances. SREs automate their runbooks step by step, reducing the cognitive load on engineering. QA testing can also be formatted as a runbook or playbook. Instead of having each test be a standalone object, each step can be isolated and standardized. This library of steps can then be combined into new tests. As you automate, the steps become useful in a variety of situations.
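The idea of a step library can be sketched in a few lines. This is an illustration only: the steps (`create_user`, `log_in`, `add_to_cart`), the shared context dictionary, and the in-memory "system" they act on are hypothetical stand-ins for real test actions.

```python
from typing import Callable

# A step takes a shared context dict and returns it, possibly updated.
Step = Callable[[dict], dict]


def create_user(ctx: dict) -> dict:
    """Hypothetical step: register a user in a fake in-memory system."""
    ctx.setdefault("users", set()).add(ctx["username"])
    return ctx


def log_in(ctx: dict) -> dict:
    """Hypothetical step: login succeeds only for known users."""
    ctx["logged_in"] = ctx["username"] in ctx.get("users", set())
    return ctx


def add_to_cart(ctx: dict) -> dict:
    """Hypothetical step: only logged-in users can add items."""
    if not ctx.get("logged_in"):
        raise AssertionError("must be logged in to add to cart")
    ctx.setdefault("cart", []).append(ctx["item"])
    return ctx


def run_steps(steps: list[Step], ctx: dict) -> dict:
    """Run each step in order, passing the context along."""
    for step in steps:
        ctx = step(ctx)
    return ctx


# Two different tests assembled from the same library of steps.
login_test = [create_user, log_in]
purchase_test = [create_user, log_in, add_to_cart]

result = run_steps(purchase_test, {"username": "ada", "item": "book"})
assert result["cart"] == ["book"]
```

Because each step is isolated and has a standard shape, a new test is just a new list. Skipping `log_in` in a sequence, for example, would make `add_to_cart` fail, which is itself a useful negative test.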
In order to use this runbook model of testing, QA must be integrated into many areas within development and operations. The QA function shouldn’t be a siloed, black-boxed area of your organization. There must be more communication between teams than code going in and test results coming out. QA needs to work alongside development to understand their goals throughout the process.
By adopting SRE best practices, teams will develop this more strategic, integrative relationship. As engineering teams begin testing their own code, they’ll collaborate with QA to build testing runbooks. These runbooks will be able to draw from the necessary contexts and perform the necessary actions to automate fragile, manual processes.
As Alex points out in Implementing Service Level Objectives, QA teams may be concerned about losing their place within an organization. Alex emphasizes that QA skills and experience will be even more important.
Alex also describes a cultural shift that occurs as QA is integrated into engineering. He says that “QA teams are often seen by engineers as ‘no’ teams or ‘roadblocks.’” QA is often “caught in the middle of the friction between engineering and operations.” But with SRE adoption, QA is elevated “from second-class roadblock to first-class partner.” Amy Tobey echoes this sentiment in a panel with Blameless. She suggests that SRE can “uplift” traditional QA teams by empowering them in “owning and nurturing the test spectrum, but extending that all the way out into production.”
The cultural lessons of SRE are centered around empathy and blamelessness. Instead of blaming individuals, incidents are viewed as opportunities, and people are encouraged to collaborate in addressing socio-technical challenges and improving resilience. A similar mentality applies to QA and engineering. Rather than testing specific pieces of code, QA can work with engineering on a systemic level to promote blamelessness.
Blameless can help you transform QA with our SLO and error budgeting tools. We take a vendor-agnostic approach and focus on the process of operationalizing SLOs in the context of other key reliability practices such as incident resolution, incident retrospectives, and more. To see how, check out our webinar on SLOs.