After talking with over 300 companies from industries like retail, finance, healthcare and SaaS, we see an emerging trend in the way product and engineering teams operate. Site reliability engineering (SRE) as a discipline is shifting left - companies are focusing on reliability and problem prevention earlier in the software development life cycle (SDLC). Instead of reacting to incidents after the code ships to production, SREs are participating at the product planning phase to build in reliability on day zero.Shifting left could fundamentally change the scope of work for site reliability engineers and software developers.
SRE is not the only one; security and QA are shifting left at a rapid pace. An analogy to describe this phenomenon is ‘A stitch in time saves nine’. Imagine this being applied literally in the context of clothing. Here, traditional SRE acts like the repair stitch that saves further tears. SRE shifting left strengthens the weaving process, thereby reducing the need for future repairs.
When Amazon's website experienced partial outage during their annual Prime Day sale in 2018, frustrated customers across North America were unable to access deals on Amazon. According to an analyst report, this hour-long outage could have cost Amazon approximately $99 million in sales. Reliability is mission critical not only for E-Commerce, but also for B2B companies. Non-reliability could result in restricted access to business data, that could further impact customer satisfaction or even violate revenue impacting SLAs. As companies realize that non-reliability could negatively impact business growth, there arises a strong need for reliability and SRE.
This incident draws a compelling illustration of the importance of reliability and SRE shifting left. If a company’s SRE function is shifting left, they will be more likely to anticipate spikes in traffic that cause incidents like the one Amazon experienced and be better prepared to handle the incident.
The adoption of microservices has resulted in complex interactions between legacy systems, next gen systems built internally, and 3rd party cloud APIs. With the rise in scale of issues with software running in production, SREs are fighting more fires than ever.
The exponential scale of issues in development means, SREs are now solving more problems than ever. The availability of SREs for hire, in proportion to the rising demand, is limited. SREs with the right experience are few and expensive to come by. The only way to overcome this demand and supply challenge is to reduce the number of problems arising, by SREs shifting left. SRE Shifting left is a result of this phenomenon.
Lastly, the team dynamics between development and SRE is due for an update. Shifting left puts the developers and SRE in a collaborative relationship at the beginning of the SDLC. This incentivizes developers to optimize reliability. Developers will leverage a reliability driven engineering mindset to ship better code, avoiding unnecessarily being on-call and working on repeat issues. Moreover, SREs do not have to be the gatekeepers for new upgrades/ innovation. In short, shifting left helps resolve the age-old conflict between the Devs and SREs, a win-win for both teams. Therefore, SRE shifting left is here to stay.
SRE shifting left has many advantages. First, the increase in confidence for successful deployments. Often failures in production are a result of differences in the deployment procedures. The development team may create deployment procedures different from those used by SRE in production. Sometimes production procedures are more manual and may even use different tooling. SRE shifting left, allows both teams to create standard deployment procedures, eliminating the need for different tooling. The deployment process is then practiced potentially hundreds of times in test environments before reaching production, thereby resulting in increased confidence for successful deployments.
Second, SRE shifting left shortens the test cycles by testing against SLOs (Service Level Objectives) before production. It takes less time to set SLOs in the beginning rather than at the end of the life cycle. SLOs are useful as they help make data-driven priority decisions with regards to reliability v/s new feature development. SLOs also provide more visibility into SRE's reliability improvements. Generally, companies take six months to set the SLOs, whereas shifting left reduces the timeline to weeks.
Third, shifting left improves the relationship between SRE and Dev teams. Here SRE and Dev work together right from the planning stage, as opposed to SREs getting involved in prod/ post-prod. Developers can now learn to focus on reliability and stability from the start. On the other hand, SREs can now learn development skills to identify issues in the beginning and troubleshoot faster and bring in exciting new technologies.
While shifting left has its benefits, few companies are ready to take the step. Companies that do not have an existing SRE practice, or are not yet mature in their SRE practice, have a long way to go before they shift left. These companies can start small by having the Dev team adopt SRE best practices.However, In some cases, it might mean added responsibility for the Dev team and cause resistance to shift left.
While talking with more than 300 companies, we saw two patterns emerging in the way companies are adopting the left shift. First, there is no SRE team and the Dev team wears the SRE hat. The team starts by reading books on SRE best practices and understanding the methodologies. The Dev team can take SLO workshops and learn how to set SLOs for reliability. They also create standard tools and procedures for development and production. The Dev team is responsible for 100% of the support, maintenance and problem-solving.
Second, there is an existing SRE team and is involved much later in the SDLC. The SRE team now shifts left to work together with the Dev team at the planning stage. They collaborate to set the SLOs and the monitoring process from the start. They work together on problem-solving and maintenance throughout the SDLC.
SRE is shifting left and becoming a first-class citizen, much like security and QA. However, this does not take away job opportunities for a site reliability engineer. Shifting left allows SREs to become partners in the development process. If you previously spent 100% of your time on incident management and postmortems, that workload will reduce to less than 50% of your time as the responsibility gets shared with developers. As more companies and industries place reliability as a top priority, SREs will have the opportunity to champion the cause of shifting left.
SRE shifting left creates opportunities for both teams. SREs are now free to build resilient, reliable systems and explore new technologies like golang, python; and pursue new practices like continuous delivery and chaos engineering. Developers can enhance their troubleshooting skills. Overall, the left shift is an indication that companies can benefit from reducing redundancy, risks, and failures. Adopting a blameless culture is another step in that direction. It is the beginning of making technology processes as robust as possible. The future might bring a disruption in the Dev-SRE space, but for now, shifting left is inevitable.
Conceptualized by Ashar Rizqi