Application delivery is getting harder each day with the rise in complexity, the demand for services to be always-available, and the increasing pressure on teams to innovate. Service level objectives, or SLOs, can help.
In this blog, we’ll discuss how SLOs are the key to modern application delivery, how to manage and measure them, the importance of observability for your SLO solution, and how to begin the journey to reliable application delivery today.
Our world has changed. Not only in the sense of increasingly complex technology, but also on a very personal level. With remote work, COVID-19, and the extra stress of decision fatigue leading to reduced cognitive capacity, our own ability to deliver has changed. At the same time, our customers rely on us now more than ever.
The apps that customers use must be available and perform at high levels every cycle. Traditionally, development would slow feature releases to ensure that quality. But speed is crucial as well. Customers want new features and reliability, and existing processes may not support this. Digitalization has become the new mode for survival.
So what does this mean for teams? According to AppDynamics’ “The Agents of Transformation Report 2020: COVID-19 Special Edition,” there are three main reasons for the accelerated urgency of digitalization:
This response to slowdowns might be surprising, but after investigation, it makes sense. Think about how much time latency costs users versus unavailability. When an app is down, your user knows right away and moves on to other tasks. They’ve only lost a few seconds. But when an app is slow, it saps time from your customers.
Imagine you're depositing a check on your banking app. You wait for the app to open, for it to log you in, to connect to your camera, to scan the check, and so on. A slow app takes time away from users at every stage. This is why SLOs focus on latency as well as availability.
Additionally, today's environments are no longer simple, monolithic architectures. You must support hybrid architectures spanning many operating models, leading to interdependencies. This increases the risk of having islands of dissociated data. Unfortunately, tools can add to this tsunami of data, creating new islands of data.
Customer experience lives across these islands and hides within that data tsunami. And insight into this data is hard to glean. The left hand needs to know what the right hand is doing to function as a pair. Humans achieve this with a central nervous system (CNS). The CNS collects and correlates data with context, deriving insight so you can take action. Observability functions the same way for technical systems.
Observability is the central nervous system for the digital enterprise. Within organizations, it’s what connects applications to business, to security, and more. Monitoring tools integrate with your systems to decipher what’s important, eliminating data islands. From the AppDynamics perspective, below is a diagram of what this observability central nervous system looks like.
Once you have this level of observability, you need to ensure that you have a feedback loop in place. SLIs and SLOs give you this data-driven basis for decisions, closing the loop. This is especially important in the context of slowdowns.
As an industry, we’ve gotten good at responding to outages. Slowdowns are different, and we aren’t as adept at recognizing and resolving them yet. SLOs give us the means to respond more effectively to slowdowns.
Think of SLOs as a meteorology report into how your service is doing. SLOs capture what your monitoring is doing in conjunction with the most important parts of your service’s user journey. SLIs (service level indicators) are the monitoring units that pull information and pass it on to the SLO. The SLO's job is taking that information provided from the SLI and comparing it against thresholds.
Let's say you have an SLI for a latency of 250 milliseconds. You want 95% of interactions to achieve this. Your SLO is going to take these events and measure them against each other. This captures what you expect the customer to experience and reports any anomalies.
Anomalies are recorded via your error budget. Error budgets suggest the amount of unplanned system failure that you can have for your SLO. SLOs capture failure within a timespan and use those values to show where your error budget stands. If slowdowns reach a rate your customers will find unacceptable, it will trigger an alert. With this data, you can make decisions that benefit customers and internal stakeholders.
Organizations comprise people with competing priorities. Operations doesn’t want to push features to production if that means the pager buzzing all weekend. Developers don’t want to get blocked on releases by operations. Sales and marketing want new features to sell and promote. So, how can you prioritize all these needs?
SLOs can help with all this, too.
SLOs can help three core stakeholders in the following ways:
With SLOs as a unifying tool, organizations can make decisions that take all these competing priorities and needs into account.
Yet, SLOs don’t come without their challenges.
When helping our customers establish SLOs, we hear four main questions. Let’s walk through each and reflect on how to work through them:
Now that we’ve addressed these common issues, it’s time to look at how to begin SLO adoption.
According to Blameless’ own engagements with various organizations, over half of SLO implementation initiatives fail the first time. This is often because they lack the right process and culture. You’ll need both for SLOs to be a success.
As SLOs will change as your company changes, the process of SLOs is cyclical. You’ll need to iterate on them to ensure that you’re still honed in on customer experience. The image above shows the cycle you can refer to.
Culture is at the heart of SLO adoption, and necessary to mention as well.
SLOs are difficult to get right, and failure is inevitable. It’s important to create a culture of psychological safety so team members feel they can fail forward. Additionally, you’ll need to set guardrails that work against blame and attribution. There is no such thing as human error; each error can be traced back to a systemic failure.
With both culture and processes in place, you’re ready to begin your SLO journey. But there is still one thing to consider: tooling.
Think back to our analogy of the central nervous system. To act with purpose, we need visibility, insights, action, and feedback.
Together, AppDynamics and Blameless integrate all these dimensions to enable proactive service management. AppDynamics provides visibility and insight and Blameless provides action and feedback. Let's look at how this works.
A global insurance provider captures real-time service metrics with AppDynamics, which also provides real-time telemetry, logs, and other critical pieces of data to help make its system observable. AppDynamics correlates this data to the user experience and picks up early indicators of potential customer-facing impact. These insights will inform the organization’s SLOs.
The insurance provider leverages Blameless to map AppDynamics data to key user journeys within Blameless (ie overall mobile app reliability). The team can operationalize SLOs with Blameless by identifying SLIs and connecting them to relevant SLOs, creating and monitoring error budgets, and defining error budget policies.
If an SLO is breached or an error budget policy is triggered Blameless can start an incident. Collection of data and context is automated, so your team hits the ground running. Through the incident, Blameless records important tasks and messages with our automated chatbot.
After resolution, we aggregate the data for you and create an incident retrospective. This document tracks follow-up and action items, encouraging you to reinvest your learnings. This better protects your error budget and customer satisfaction in the future.
With AppDynamics providing you with visibility and insight into your systems and Blameless driving action and feedback, you’ll close the loop on your development cycles, enabling fast and reliable delivery. If you want to learn more about our integration with AppDynamics, see our integration guide here.
If you enjoyed reading this, check out these resources: