Over a year ago, Blameless launched the industry’s first end-to-end SRE platform to help software teams innovate without sacrificing reliability. As Service Level Objectives (SLOs) provide an anchor for reliability targets and corresponding decisions, they are the foundational step toward helping teams truly adopt SRE best practices. Today, we are very excited to announce our new SLO platform, giving teams a shared language on how to focus their engineering efforts.
As more organizations adopt DevOps at scale, tension is growing between operations teams and product teams over competing incentives around release velocity and risk. Organizations are at different stages of maturity in their journey to resilience, and enterprises often struggle to bridge the gap between established ITSM and fast-evolving DevOps and SRE practices.
SLOs are critical for bridging this gap, by helping otherwise siloed teams gain shared context in understanding customer experience. In this blog, we’ll share how Blameless SLOs help advance teams towards achieving production excellence through proactive management of service levels, in concert with the rest of the Blameless platform. Blameless SLOs are currently in Early Access as we work with our design partners towards refining the product.
Effective SLOs: The Stages of Implementation
An SLO is defined as the target level for the reliability of a service. In conjunction with Error Budgets—the amount of allowable unplanned system failure—SLOs provide guardrails and incentives to align teams to drive high levels of service reliability while mitigating the risks.
Successfully setting up SLOs comprises four key stages:
Step 1: Craft & Design User Journeys and SLOs
The first step of setting up an SLO requires cross-functional stakeholders to collaborate in designing the SLO. In this stage, a few critical tasks need to happen:
The Product Owner must select the relevant User Journeys that users care about and articulate why these journeys are important.
The Engineering Component Owner then identifies the service involved, documents the dependencies, selects the correct Service Level Indicators (SLIs), and documents the rationale behind these.
The Reliability Owner (typically SRE/production engineering) collaborates with the others to determine what the threshold should be and finally documents the Error Budget Policies and other relevant information like troubleshooting dashboards.
In Blameless, a guided workflow walks each stakeholder through the tasks required for SLO setup, providing a collaborative experience. This functionality is underpinned by the Blameless Services Registry, which connects the CMDB to a modern service catalog that centralizes service context.
Step 2: Connect Data (SLIs)
After designing the SLO, the data sources that supply its SLIs need to be connected. Blameless supports deep integrations with third-party data sources such as Datadog, AppDynamics, New Relic, Prometheus, and other observability platforms. This gives customers the flexibility to define simple SLOs using one data source or more complex SLOs by combining multiple SLIs.
Blameless provides complete visibility into end-to-end user journey flows, as a user journey can map across multiple services, multiple SLIs and multiple SLOs in a vendor-agnostic way. This concept of user journeys also helps teams reduce alert noise and effectively prioritize operations work, as SLO-triggered incidents are specific to the team’s most critical customer experiences.
Additionally, you can set thresholds using error budgets. Blameless can take the SLO and convert it into an error budget automatically. By monitoring, testing and proactively tracking the consumption of the error budget, teams can iteratively refine and set the appropriate error budget as well as associated policies.
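The conversion from an SLO target to an error budget is simple arithmetic, and a minimal sketch may make it concrete. This is not Blameless's implementation: the function names and the 30-day default window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed minutes of unreliability for a target such as 0.999.

    Assumes a time-based SLO measured over a fixed window of whole days.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_consumed_fraction(bad_minutes: float, slo_target: float,
                             window_days: int = 30) -> float:
    """Share of the error budget already spent (1.0 means fully depleted)."""
    return bad_minutes / error_budget_minutes(slo_target, window_days)
```

Under these assumptions, a 99.9% target over a 30-day window allows about 43.2 minutes of downtime, and 21.6 bad minutes would mean roughly half of the budget is spent.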
Step 3: Set Error Budget Policies
The third stage of successful SLO setup is making SLOs actionable. Blameless does this by triggering defined workflows when error budget thresholds are exceeded. Through bi-directional integrations with PagerDuty, Slack, ServiceNow, and other systems, Blameless allows customers to trigger incidents, collaborate, block deploys, create tickets, and more when error budget policies are violated. Error budget policies can be tested before they are enforced, giving customers context over control.
For example, consider a customer who wants to achieve an SLO of 99.95% uptime (equivalent to an error budget of about 21.6 minutes of downtime in a 30-day month) for user login activity, a critical end-user journey for their application. Using application performance data from their APM software, they determine that business transaction response time (latency) is a critical Service Level Indicator (SLI) to track. Based on this SLI, they create an SLO where a violation takes place whenever a user login request takes longer than 300ms. They can then define an error budget policy such as the following:
If more than 75% of the error budget is depleted, service owners will automatically be alerted
If 100% of the error budget is depleted, an incident within Blameless will automatically kick off
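The two rules above could be expressed as a small policy check. This is a hypothetical sketch, not the Blameless API: `PolicyAction`, `budget_consumed`, and `evaluate_policy` are illustrative names, and measuring consumption as the number of slow login requests relative to the allowed number is an assumption about how the budget is counted.

```python
from enum import Enum


class PolicyAction(Enum):
    NONE = "none"
    ALERT_OWNERS = "alert_service_owners"
    OPEN_INCIDENT = "open_incident"


def budget_consumed(latencies_ms: list[float],
                    budget_fraction: float = 0.0005,  # 1 - 0.9995 (assumed request-based)
                    threshold_ms: float = 300.0) -> float:
    """Fraction of the error budget spent by login requests slower than the threshold."""
    if not latencies_ms:
        return 0.0
    bad = sum(1 for ms in latencies_ms if ms > threshold_ms)
    allowed_bad = len(latencies_ms) * budget_fraction
    return bad / allowed_bad


def evaluate_policy(consumed: float) -> PolicyAction:
    """Map budget consumption to the action named in the example policy."""
    if consumed >= 1.0:
        return PolicyAction.OPEN_INCIDENT   # 100% of the budget depleted
    if consumed > 0.75:
        return PolicyAction.ALERT_OWNERS    # more than 75% depleted
    return PolicyAction.NONE
```

With 10,000 login requests in the window, the assumed budget allows 5 slow requests, so 4 slow requests would put consumption at about 80% and trigger the alert rule but not the incident rule.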
By tracking the full lifecycle of an incident—such as postmortem data, service metadata and changes—and linking back to error budget violations, customers can capture all the activity associated with services in a single location. This shared context creates a complete feedback loop between production and development.
Step 4: Operationalize SLOs
The final stage of implementation is to build a long-term process to get value out of SLOs. Examples include:
Having a weekly operational meeting with cross-functional leaders to review SLOs
Capturing commentary and discussions around SLO violations or trends
Capturing, assigning, and tracking follow-up action items from SLO violations
Reporting on SLOs to the Board to validate prioritization of engineering investments
Operationalizing SLOs: Why It Matters
In speaking with many organizations that have tried and struggled to implement SLOs effectively, we have found that the number one reason for failure is that the SLOs were set up in a silo, without a consistent and collaborative process across key stakeholders. The second reason is that teams are often not equipped to select and operationalize the right SLOs, which leads to manual toil and troubleshooting.
In these circumstances, SLOs often see limited adoption or can become meaningless. This creates friction between teams and can hurt trust. Blameless addresses these pain points by providing guided workflows for every stakeholder, the ability to select and test SLOs, as well as the ability to take actions based on SLOs.
Finally, Blameless has taken a unique approach to SLOs that is vendor-agnostic. In today’s complex distributed environments, customers are increasingly using multiple observability solutions at different layers of the stack. Blameless can leverage metrics from AppDynamics, New Relic, Prometheus, Datadog, and other industry-leading solutions to build a single source of truth around your most critical user journeys. While SLOs are a useful tool to help production teams gain more proactive visibility, they are not meant to be a real-time monitoring solution; rather, their true value comes from providing a unique view into the quality of your digital experience and business health.
We are actively working with our design partners to refine Blameless SLOs. If you’re interested in learning more, register for our webinar, sign up to join our alpha program waitlist, or contact us with any questions at firstname.lastname@example.org.
For more reading on SLOs, check out the following resources: