Improve your Reliability with Blameless SLOs, Now Generally Available

Blameless is excited to announce that our SLO Manager is now generally available! SLO Manager is a new product added to the Blameless platform. This product helps SRE and engineering teams proactively make data-driven decisions about reliability efforts.

According to a survey Blameless conducted, over 80% of organizations use SLOs or will in the next 1-2 years. While there are a variety of solutions available to create SLOs using application performance monitoring (APM) tools, it remains difficult to prioritize, interpret, and leverage these to drive customer satisfaction. After building SLOs, many teams are still left asking, “So what’s next?”

With Blameless’ SLO Manager, teams can create distinct user journeys that correspond to their services. Teams can monitor these services’ corresponding SLOs and gain actionable insights via error budgeting. Blameless error budgets help teams understand how much unreliability their services have experienced over a time period, and predict when their error budget will deplete. This helps teams sort services by risk levels and take proactive measures to address any degradation of reliability before it starts affecting customer satisfaction.

Embarking on a user journey

Creating user journeys is a key part of crafting an SLO. User journeys are the distinct steps and actions that users take within your system. For instance, consider an eCommerce site. One particularly important user journey is the checkout. If the service is unavailable or loads too slowly, customers may abandon their carts. By mapping these crucial user journeys, teams have a better understanding of where they most need to focus their efforts when setting SLOs.

In Blameless, user journeys can be flexibly organized. A user journey can map SLOs to a single backend service or to a distinct way your users interact with your product (checkout, account settings, user login, dashboard refresh, etc.). Once you’ve determined a user journey to embark on, it’s time to create your SLIs.

Dictating service level indicators

Service level indicators are another important part of the Blameless SLO Manager. SLIs allow teams to use their monitoring data to understand how they’re performing against customer expectations. This performance is calculated by dividing the system’s “good” events by its “valid” events and multiplying that by 100%.
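The calculation above can be sketched in a few lines of Python. This is only an illustration of the formula, not Blameless code; the function name and the event counts are hypothetical.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events / valid events * 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return good_events / valid_events * 100.0


# Hypothetical counts for a checkout journey over some time period.
checkout_sli = sli_percent(good_events=99_950, valid_events=100_000)
print(f"checkout SLI: {checkout_sli:.2f}%")
```

What counts as a "good" or a "valid" event depends on the SLI type. For availability, it might be successful requests over all non-malformed requests.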

Multiple user journeys may rely on the same underlying services and corresponding SLI metrics. To make this easier, teams can share SLI metrics across multiple user journeys, and assign each their own SLO target. Additionally, each user journey can have different reliability goals (SLO) to match different customer profiles or tiers (e.g. high-touch/large and small customers). This flexibility allows for a deeper understanding of how teams are performing against specific customer expectations.
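To make the idea of one shared SLI with several SLO targets concrete, here is a minimal sketch. The class, the journey names, and the target values are assumptions made for illustration and do not reflect the Blameless data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JourneySLO:
    journey: str           # hypothetical user journey name
    sli_name: str          # name of the SLI metric this journey shares
    target_percent: float  # SLO target for this journey


def evaluate(slos, sli_values):
    """Map each journey to True when the shared SLI meets that journey's target."""
    return {s.journey: sli_values[s.sli_name] >= s.target_percent for s in slos}


# Two customer tiers share one availability SLI but hold it to different targets.
slos = [
    JourneySLO("checkout (large customers)", "checkout_availability", 99.95),
    JourneySLO("checkout (small customers)", "checkout_availability", 99.5),
]
results = evaluate(slos, {"checkout_availability": 99.9})
for journey, met in results.items():
    print(journey, "meets SLO" if met else "misses SLO")
```

With the same measured value, the stricter tier misses its target while the other meets it, which is why per-journey targets give a more precise picture than a single goal.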

Within Blameless’ SLO Manager, teams can set SLIs on any metric related to their services. Teams can also integrate natively with leading application and infrastructure monitoring tools, such as Prometheus, New Relic, Datadog, and Pingdom. Additionally, the Blameless SLO APIs allow teams to inject metrics from other data sources.

Crafting SLOs

Service level objectives (SLOs) are goals for the reliability of services. If teams are meeting their SLO, their customers will be happy. If they are not meeting their SLO, customer satisfaction will suffer. As the best metric to determine how changes to the service affect the end user, SLOs are a critical decision-making tool. For instance, if a team is consistently meeting its reliability goals, it can increase innovation velocity. However, if the goal is not being met, the team knows that it must first work on reliability concerns before adding new features.

Blameless helps teams consolidate all these goals and knowledge into a comprehensive dashboard. This dashboard is a better depiction of overall reliability, and helps teams gain context and move away from shallow incident metrics as a proxy for reliability data.

Within Blameless’ SLO Manager, you can track SLOs on an unlimited number of SLIs per service, combining multiple SLOs across different types of SLIs such as availability, latency, throughput, and saturation.

Additionally, setting up SLOs is simplified. Wizard-based SLO creation accelerates the onboarding experience for teams looking to create and iterate on SLOs. The setup wizard guides users through creating their user journeys, attaching a new or existing SLI to the user journey, setting an SLO target on the SLI, and determining an error budget alert policy.

And, through our API, teams can automate SLO creation in Blameless and integrate with their own DevOps data.

Establishing error budgets

Error budgets are an advanced step in using SLOs, and they are the actionable part. An error budget depicts how much unreliability a system has experienced over a given time period. Teams can set alerts based on how much of the error budget has been depleted. This helps them be proactive and mitigate reliability issues before they begin to hurt the customer.

With error budgets, Blameless offers a way for teams to continuously deliver value to their customers at a fast pace while mitigating reliability risks, surpassing traditional SLO management approaches.

The SLO dashboard displays the risk level of error budget depletion for each service, as well as a prediction for when the error budget will be depleted based on burn rate. This predictive capability and the sorted risk levels give teams a better understanding of which services need their attention. Blameless also helps teams visualize when their error budgets fluctuate by generating performance graphs.
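As a rough illustration of how a burn-rate prediction can work, the sketch below assumes a simple linear projection: the budget keeps burning at the average rate seen so far. This is an assumption for the example, not a description of how Blameless computes its forecast, and all names and numbers are hypothetical.

```python
import math


def error_budget_events(expected_valid_events: int, slo_target_percent: float) -> float:
    """Bad events the SLO tolerates over the whole window."""
    return expected_valid_events * (1 - slo_target_percent / 100)


def budget_consumed(bad_events: int, budget_events: float) -> float:
    """Fraction of the error budget used so far (1.0 means fully depleted)."""
    if budget_events <= 0:
        raise ValueError("a 100% SLO target leaves no error budget")
    return bad_events / budget_events


def days_until_depleted(consumed: float, days_elapsed: float) -> float:
    """Linear projection: assume the average burn rate so far continues."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    if consumed <= 0:
        return math.inf  # nothing burned yet, so no depletion is projected
    if consumed >= 1:
        return 0.0       # budget is already gone
    burn_per_day = consumed / days_elapsed
    return (1 - consumed) / burn_per_day


# Hypothetical 28-day window: 4,000,000 expected valid events, 99.9% target.
budget = error_budget_events(4_000_000, 99.9)
consumed = budget_consumed(bad_events=1_200, budget_events=budget)
remaining_days = days_until_depleted(consumed, days_elapsed=7)
print(f"budget consumed: {consumed:.0%}, depleted in about {remaining_days:.1f} days")
```

Sorting services by the projected days remaining is one simple way to rank them by risk.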

Screenshot showing a historical graph of error budget burn over the past 28 days.

Additionally, Blameless makes SLOs actionable through alerts on error budget depletion with the option to automatically create an incident within Blameless. These alerts notify users through various channels such as Slack and email. Once an incident in Blameless has been created, teams can also set a PagerDuty alert to fire automatically. By creating incidents for at-risk services, teams can prioritize their work and mitigate further degradation of the service before it affects customer happiness.
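An alert policy of this kind can be thought of as a set of thresholds on the consumed error budget. The sketch below is purely illustrative: the threshold values and action names are assumptions, not Blameless settings or API calls.

```python
def alert_actions(consumed: float,
                  notify_at: float = 0.5,
                  incident_at: float = 0.8) -> list[str]:
    """Return the actions a hypothetical policy triggers for a consumed budget fraction."""
    actions = []
    if consumed >= notify_at:
        actions.append("notify")           # e.g. a Slack or email message
    if consumed >= incident_at:
        actions.append("create_incident")  # e.g. open an incident for the service
    return actions


for fraction in (0.3, 0.6, 0.9):
    print(f"{fraction:.0%} consumed -> {alert_actions(fraction) or 'no action'}")
```

Using two thresholds keeps early warnings lightweight while reserving incidents for services that are close to exhausting their budget.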

A complete SRE solution

One of the most useful parts of Blameless’ SLO Manager is how it brings together the rest of the platform, connecting your tools and services to customer happiness. This creates a full service life cycle SRE solution. By setting alerts on error budgets and kicking off incidents based on burn rate, you can proactively mitigate customer issues. By using this data in your retrospectives when detailing action items, you influence development and learn from your failures. By incorporating this data as a key metric in reliability insights, you can make sure you’re reporting on and making decisions with the most meaningful information.

Together, this means happier customers, happier engineers, and better ways to measure progress and make decisions. If you’re interested in trying Blameless SLOs, reach out to our team for a personalized demo today.