Implementing SRE is fundamentally about shifting culture, but it often means adding new tooling and processes to your team's workflows to support that cultural change. Teams add new steps and checks to incident response procedures. Incident responders write retrospectives and create new meetings to review them. Engineers consult new tools like monitoring dashboards and SLOs. In other words, SRE creates another layer of consideration in development and operations.
With all of these additions, it may seem inevitable that new steps would slow down the process. But investing in reliability will actually save you time. In this blog post, we'll look at how SRE tightens feedback loops and decreases friction, and how development velocity generates business value.
An incident retrospective is a document created after each incident. It captures contextual information around the incident and the details of the response. Relevant stakeholders then meet to review it. This may at first seem like overhead, given other important priorities such as shipping new features. But the learnings from retrospectives make up for this investment.
After resolving an incident, teams often discover follow-up action items. These action items could be a change in the codebase to fix a bug. Or they could be a change in operations, like increasing server resources. The incident response feeds into these new requirements. New incidents then occur in the adjusted system. This feedback loop of incident responses and development is essential to continuous improvement. To power this feedback loop, lessons learned must be translated into actionable change.
The retrospective provides a common ground to determine what changes are necessary. When reviewing the retrospective, make sure to invite all stakeholders. Investigating the incident reveals which changes could prevent a recurrence. The stakeholders can then collaborate on a timeline to complete these action items. The retrospective serves as a hub for insights and reviews on this ongoing work.
These follow-up tasks are sometimes lost in the shuffle of feature work. The retrospective process helps teams plan for action items in upcoming sprints. This secures the feedback loop of incident learnings and development. In one article, Maor Rudick discusses how much time developers spend debugging. His examples emphasize how heavily debugging time affects development velocity. Development velocity must account for the inevitability of bugs and the time spent fixing them. By prioritizing learning from incidents, you know which technical debt, such as bugs, has the highest impact. Investing time to address that technical debt improves long-term development velocity.
Another key feedback loop deals with the incident response process itself. You want your incident response process to become more efficient with each incident. Figuring out what works and why is key to further improvement. The retrospective contains the processes and communication used when resolving the incident. When you review this information, you can pinpoint places in your response to revise. Are certain runbooks out of date? Was there incomplete context in existing dashboards? Is communication between internal and external stakeholders efficient? These are all areas you can improve with the insights from a thorough retrospective.
Rudick also discusses how debilitating bottlenecks are to development velocity. Having to repeat the entire deployment cycle for each fix, for example, can slow the velocity of the overall release. Meta-reviews of your processes are useful for discovering such roadblocks. The more you improve these processes, the more time engineers can spend developing.
Development velocity is fastest when everyone’s goals align on business needs. If some teams have different priorities than others, friction can develop in several ways, including the following:
SLOs and error budgets can help ease friction by aligning all stakeholders. The most important metric to focus on is customer happiness. But getting an accurate measure of what makes your users happy can be difficult. SLIs, SLOs, and error budgets help keep everyone focused on the customers first.
SLIs identify the areas of highest customer impact. SLOs then measure the effect of development projects and incidents on the customer. In other words, SLOs set the standard for reliability. Development and operations now have a metric in common to aim for. When conflicts emerge around how to prioritize planned work, teams can refer to the SLO's error budget policy. When the error budget burns at a certain rate over a period of time, teams must take action. These remediations help strengthen the system and keep customers happy.
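To make the error budget arithmetic concrete, here is a minimal sketch in Python. The function names, the 99.9% target, and the request counts in the usage note are illustrative assumptions, not values from any particular SLO tool, and real policies typically evaluate burn rate over several time windows.

```python
def allowed_error_rate(slo_target: float) -> float:
    """Fraction of events the SLO permits to fail, e.g. about 0.001 for a 99.9% target."""
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1); a 100% SLO leaves no error budget")
    return 1.0 - slo_target


def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO allows over the measurement window."""
    return allowed_error_rate(slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_events == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    return 1.0 - bad_events / error_budget(slo_target, total_events)


def burn_rate(slo_target: float, total_events: int, bad_events: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A burn rate of 1.0 spends the budget exactly over the SLO window;
    higher values exhaust it sooner.
    """
    if total_events == 0:
        return 0.0
    observed_error_rate = bad_events / total_events
    return observed_error_rate / allowed_error_rate(slo_target)
```

Under these assumptions, a service with a 99.9% target that serves 1,000,000 requests in its window may fail roughly 1,000 of them. If 20 out of the most recent 10,000 requests failed, `burn_rate(0.999, 10_000, 20)` comes out at about 2, meaning the budget is being spent twice as fast as the SLO allows.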
Additionally, this eases the friction between development and operations. As long as development keeps within the error budget, engineers can push new code. But if the error budget is at risk of being exhausted, it's time to work on reliability in order to mitigate customer impact.
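This rule can be sketched as a small decision function. The code below is a hypothetical illustration in Python: the `Action` names and the 25% caution threshold are assumptions chosen for the example, and a real error budget policy would use whatever thresholds and responses your stakeholders have agreed on.

```python
from enum import Enum


class Action(Enum):
    SHIP_FEATURES = "ship features"
    PRIORITIZE_RELIABILITY = "prioritize reliability work"


def release_decision(budget_remaining: float, caution_threshold: float = 0.25) -> Action:
    """Decide what to work on given the unspent fraction of the error budget.

    budget_remaining: 1.0 means the budget is untouched, 0.0 means it is
    fully spent, and negative values mean it has been overspent.
    caution_threshold: hypothetical cutoff below which the budget is
    considered at risk.
    """
    if budget_remaining <= caution_threshold:
        return Action.PRIORITIZE_RELIABILITY
    return Action.SHIP_FEATURES
```

With these assumed values, a team that still has 60% of its budget keeps shipping, while a team down to 10% switches to reliability work. The point of encoding the policy is that the decision is made ahead of time, so it does not need to be renegotiated during every planning conflict.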
With SLOs and error budgets, it's easier for teams to align on what the business values most: customers.
Increasing development velocity increases business value, but it’s important to understand why. It is easy to underestimate the impact it can have on your bottom line.
Within the SaaS industry, companies that are first to market often have a huge advantage. Yet customer demands change over time, and technological advances create new opportunities. When these changes happen, companies race to offer new solutions first. Outpacing your competitors to provide customers with new features first is critical.
In an article for McKinsey, Srivastava et al. quantify the advantage better development velocity provides for a company. They surveyed executives at 440 large enterprises and interviewed 100 more experts. From this, they created an index of the 46 most critical factors of development velocity. These factors included best practices, tooling, and cultural indicators improved by SRE.
Srivastava et al. scored organizations’ development velocity with this index, which they call the Developer Velocity Index (DVI). They then compared those scores to indicators of business value. The results were conclusive. “Top-quartile DVI scores correlate with 2014–18 revenue growth that is four to five times faster than bottom-quartile DVI scores.” This suggests that investing in development velocity can affect your company’s success, to the tune of several multiples when it comes to revenue growth. With tools such as SLOs, you can increase velocity without risking customer satisfaction.