SRE Adoption | A 2-Year Retrospective (From A Business Point-Of-View)

This month I hit my 2-year anniversary with Blameless. As our industry progresses and matures, I thought it would be a good opportunity to look back at how far we have come and to ruminate on where we’re headed. Our shared vision at Blameless is to help engineering teams adopt reliability practices with ease and advance to a resilient culture. We’re already on a path to realizing this: every week I meet with engineering teams of all sizes, across multiple industries worldwide, that are steadily adopting modern reliability practices.

I’ve been working closely with software engineering teams for over a decade, and I’ve seen significant changes not only in tooling but also in people and process, changes that often start with a culture or mind-set shift. Working in SaaS is one thing; selling software to engineers brings you even closer to understanding the degree of work and investment it takes to keep our products and services humming. I have to understand both a day in the life of a software engineer and the many distinct functions within engineering, from releasing code and feature-flagging to security and governance to improving the way production apps impact the end-user experience. Cloud-native, microservices-driven architecture has complicated the discipline, yet it has also enabled us to live an all-digital existence.

From Monitoring to Observing

Since 2010 I’ve watched engineering orgs develop better methods of inspecting their own products in order to deliver a valuable user experience. APM (Application Performance Monitoring) was being widely adopted in varying flavors, whether through proprietary tools or open-source, “home-grown” software. In more recent years, this technology evolved into Observability, which gives an even deeper understanding of how code behaves in production when users interact with it. It’s a continuous learning journey to better understand how service reliability delivers business objectives. It all starts with how we measure: we need to measure the right things, and then we need to make sense of them.

When I first started at New Relic, there was a big emphasis on infrastructure monitoring across the industry. I say that, but there wasn’t a lot of monitoring going on just yet. It wasn’t unusual to see different teams, in some cases even dev and ops, working in silos and not communicating very much at all. Processes weren’t codified and ops lacked visibility into the code base. They were at the mercy of finding the right engineer when something went wrong. Some of you reading this may have lived through that era. Operations moved much slower and it was definitely more frustrating for both engineers and end-users.

The dawn of DevOps inspired teams to break down silos, automate workflows, and communicate better. Simultaneously, APM led the way in democratizing metrics monitoring. With that, performance expectations increased and we saw the beginnings of an emphasis on the application or front-end user experience. That translated to a keener focus on response times, error rates, and availability. APM adoption really took off during the big cloud migration. Shifting from monolithic to microservice architectures, teams started to lean on third-party cloud providers and to outsource their monitoring too.

Duh! It’s All About the End-User Experience 

Meanwhile, this set the stage for SRE (site reliability engineering), Google’s best-kept secret until 2016, when the SRE handbook was published. Google had developed the practice internally for a decade, and it was a natural time to introduce SRE principles into the ITIL framework. SRE takes a prescriptive approach to DevOps, with its main mantra being that your key objectives should aim toward customer happiness as the end goal. This new-fangled practice has inspired a cultural shift for engineering infrastructure teams unlike any I’ve seen before. What gets me most excited is the data-driven approach to measuring the business value of reliability. I see a lot more non-engineering stakeholders invested in and paying attention to reliability insights. As a business leader focused on delivering value to customers, I see reliability directly impact both the top line and the bottom line. Customer loyalty, brand, and growth suffer when a service doesn’t deliver predictable reliability. At the same time, when engineers become mired in complex, unstable environments where they have to spend toilsome cycles fixing and band-aiding, it causes attrition and a significant cost hit to the business.

Reliability: Think Outside the Engineering “Box”

Reliability is categorically a business metric. It encompasses more than product measures, and I believe companies should view it in a more holistic manner. To understand reliability, we need to look at what’s happening “on the front lines.” How often do we find out about a performance issue from a customer support ticket? What type of feedback do we receive from customers? How’s our reputation in the market? We pose these questions to the Support Team, Customer Success, Marketing, and Sales. In doing so, we actually start to treat reliability as a measure of the health of the business. It’s both a leading and lagging indicator. 

To take this a step further, SRE helps teams take a proactive approach by surfacing the right data insights, so they can take preventative steps and keep an issue from worsening. If teams have time to respond and react, they are happier, and the customer either never knows about the issue or carries on with their loyal usage. Before, maybe 5-7 years ago, we thought metrics monitoring and visibility across DevOps was enough. It wasn’t. We built telemetry-based dashboards and collected all relevant data points, but we didn’t go far enough: we weren’t prioritizing different parts of the service or proactively setting a metric to work towards. Without a planned number or indicator, we didn’t know how to progress or improve over time. Using the principles of SRE, we’re more diligent about continuously learning and improving. We do better at documenting what happened, with retrospective reports that inform us how the system is behaving. By aggregating that data over time, we learn how the system is advancing. We are better challenged to identify gaps and any single points of failure. Often, fixing one issue or part of the service doesn’t fix the entire system. When teams come together and agree on which specific parts of the service are absolutely critical and how to escalate when something goes awry, they get a clear sense of what’s important and where to focus. This is now popularized by SLOs (service level objectives), which are essentially KPIs for engineers that can be communicated up, down, and across the functions of any business.
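To make the idea of a “planned number” concrete, here is a minimal sketch in Python of how a team might check one service against an availability SLO and see how much room for failure is left. The 99.9% target, the request counts, and every name in the code (`SloReport`, `evaluate_slo`) are illustrative assumptions of mine, not figures or APIs from Blameless or any customer.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    target: float          # agreed objective, e.g. 0.999 (hypothetical value)
    attained: float        # fraction of requests that succeeded
    budget_total: int      # failed requests the target allows in this window
    budget_remaining: int  # allowed failures not yet spent (negative if overspent)
    met: bool              # True when attained >= target


def evaluate_slo(good_requests: int, total_requests: int, target: float) -> SloReport:
    """Compare observed success counts for one time window against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    if not 0 < target <= 1:
        raise ValueError("target must be in the range (0, 1]")

    attained = good_requests / total_requests
    # Error budget: the share of requests the target permits to fail,
    # rounded to the nearest whole request.
    budget_total = round((1 - target) * total_requests)
    failed = total_requests - good_requests
    return SloReport(
        target=target,
        attained=attained,
        budget_total=budget_total,
        budget_remaining=budget_total - failed,
        met=attained >= target,
    )


if __name__ == "__main__":
    # Made-up numbers for a 30-day window.
    report = evaluate_slo(good_requests=999_400, total_requests=1_000_000, target=0.999)
    print(f"attained {report.attained:.4%} against a target of {report.target:.1%}")
    print(f"error budget remaining: {report.budget_remaining} of {report.budget_total} requests")
```

The remaining budget is the piece that travels well outside engineering: a single number that tells Support, Sales, or the executive team how much risk the service can still absorb in the current window.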

Culture Is the Great Enabler

SRE is markedly different because of its emphasis on culture. There’s a very specific type of culture that SRE teaches us to adopt. I like to think of it as two-fold. First, you want everyone in the organization to put the customer first. In other words, you’re successful in your job when the customer is happy. Second, trust and believe that incidents and failures stem from systemic problems that require systemic solutions. The first part about focusing on the customer is hugely important to understanding why we should care about reliability in the first place. We’re all here to create cutting-edge tech that simplifies and streamlines, absolutely. But ultimately we’re also here to provide a service and be useful. If we can get the business and engineering functions to align on the same goals, speak the same language, and humanize our processes more, we’re on the right track.

We’re making great progress. This year, I’m looking forward to a few trends manifesting in our day-to-day experiences. I want to have engineers exposed to the front lines. Connect them with customers much more often. Let them hear what customers are saying and asking for. Also, I hope to see go-to-market teams get close to engineers, understand the complexities of their products, and observe them as they manage tradeoffs. Whichever side of the company we’re on, we’re all dealing with prioritization in our day-to-day. Everyone’s doing it. Adopting a holistic view and witnessing how our functions play their part in a greater ecosystem is a game changer. Third-party research from Gartner, a leading provider, states:

“IT organizations struggle to demonstrate the business value of I&O [infrastructure and operations]. As a result, business leaders often see I&O as a cost center rather than an enabler of business value.” There’s plenty of work to be done. Recommendations are wide-ranging and include:

  • Use business and IT operations outcomes and metrics that stakeholders will understand and appreciate.
  • Facilitate regular communications that highlight progress toward the identified outcomes and metrics.

With every engineering team that we work with at Blameless, we always ask about the culture mind-set and how goals map to the business needs. Improving the entire incident management process is unquestionably important, but even more critical is how we translate the outcomes of that learning to other business stakeholders in order to achieve a resilient culture. So here’s to 2 years at Blameless and to advancing how engineers deliver business value.