After talking with over 300 companies from industries like retail, finance, healthcare and SaaS, we see an emerging trend in the way product and engineering teams operate. Site reliability engineering (SRE) as a discipline is shifting left. Companies are focusing on reliability and problem prevention earlier in the software development life cycle (SDLC). Instead of reacting to incidents after the code ships to production, SREs are participating at the product planning phase to build in reliability on day zero.
This shift could change the scope of work for SREs and software developers.
SRE is not the only one; security and QA are shifting left at a rapid pace. An analogy to describe this phenomenon is ‘A stitch in time saves nine’. Imagine this in the context of clothing. Here, traditional SRE acts like the repair stitch that saves further tears. SRE shifting left strengthens the weaving process, reducing the need for future repairs.
When Amazon's website experienced partial outage during their annual Prime Day sale in 2018, frustrated customers across North America were unable to access deals on Amazon. According to an analyst report, this hour-long outage could have cost Amazon approximately $99 million in sales. Reliability is mission critical not only for E-Commerce, but also for B2B companies. Non-reliability could result in restricted access to business data. This could further impact customer satisfaction or even violate revenue impacting SLAs. As companies realize that non-reliability could negatively impact business growth, the need for SRE grows.
This incident shows the importance of reliability and SRE shifting left. If a company’s SRE function is shifting left, it will be more likely to expect spikes in traffic. Teams will be better prepared to handle an incident like Amazon's.
The adoption of microservices has resulted in complex interactions between legacy systems, next gen systems built internally, and 3rd party cloud APIs.
The exponential scale of issues means SREs are now solving more problems than ever. The availability of SREs for hire, in proportion to the rising demand, is limited. SREs with the right experience are few and expensive to come by. The only way to overcome this demand/supply challenge is to have fewer incidents. SRE shifting left is a result of this phenomenon.
The team dynamics between development and SRE is due for an update. Shifting left puts the developers and SRE in a collaborative relationship at the beginning of the SDLC. This incentivizes developers to optimize reliability. Developers will leverage a reliability driven engineering mindset to ship better code. This helps avoiding working on repeat issues. Moreover, SREs do not have to be the gatekeepers for new upgrades/ innovation. This improvement in team dynamics means SRE shifting left is here to stay.
SRE shifting left has many advantages. First, the increase in confidence for successful deployments. Often failures in production are a result of differences in the deployment procedures. The development team may create deployment procedures different from those used by SRE in production. Sometimes production procedures are more manual and may even use different tooling.
SRE shifting left allows both teams to create standard deployment procedures. This eliminates the need for different tooling. The deployment process is then practiced in test environments before reaching production. With guardrails in place, teams will feel increased confidence for successful deployments.
SRE shifting left shortens the test cycles by testing against SLOs (Service Level Objectives) before production. It takes less time to set SLOs in the beginning rather than at the end of the life cycle. SLOs are useful as they help make data-driven priority decisions with regards to reliability v/s new feature development. SLOs also provide more visibility into SRE's reliability improvements. Generally, companies take six months to set the SLOs, whereas shifting left reduces the timeline to weeks.
Shifting left improves the relationship between SRE and Dev teams. Here SRE and Dev work together right from the planning stage. Developers can now learn to focus on reliability and stability from the start. Additionally, SREs can now learn development skills. This helps them identify issues in the beginning and troubleshoot faster.
While shifting left has its benefits, few companies are ready to take the step. Companies that do not have an existing SRE practice, or are not yet mature in their SRE practice, have a long way to go before they shift left. These companies can start small by having the Dev team adopt SRE best practices. But, in some cases, it might mean added responsibility for the Dev team and cause resistance to shift left.
While talking with more than 300 companies, we saw two patterns emerging in the way companies are adopting the left shift:
SRE is shifting left and becoming a first-class citizen, much like security and QA. Yet, this does not take away job opportunities for a site reliability engineer. Shifting left allows SREs to become partners in the development process. If you spent 100% of your time on unplanned work, that workload will reduce to less than 50%. As more companies focus on reliability, SREs will have the opportunity to champion shifting left.
SRE shifting left creates opportunities for both teams. SREs are now free to build resilient, reliable systems, explore new technologies, and pursue new practices. Developers can enhance their troubleshooting skills. Adopting a blameless culture is a step in that direction. It is the beginning of making technology processes as robust as possible. The future might bring a disruption in the Dev-SRE space, but for now, shifting left is inevitable.
Conceptualized by Ashar Rizqi