Error Budgets That Work for You. Plus Support for New Relic Metrics and NR Query Language
Did you know that error budget policy is the key to making SLOs actionable? In fact, Twitter’s engineering team did not successfully adopt SLOs until they introduced error budgets. SLOs enable teams to quantify customer happiness, and error budgets enable teams to make data-backed tradeoffs between reliability and feature velocity. We believe that teams optimizing for reliability must adopt both.
Therefore, Blameless is very excited to make the following enhancements to our SLO Manager generally available:
Redesigned Error Budget Policy notification service
Expanded New Relic support
New data ingest log per SLI
The fully redesigned Error Budget Policy notification service now gives you more control and flexibility. As you get closer to depleting your error budget, you can choose how and when you want to be notified. There is also added flexibility to set different reliability goals for various user journeys. You can create multiple error budget policies to define the escalation profile and proactively address the degradation of your service reliability.
With error budgets, you want to be able to forecast where your reliability is heading. Is it degrading or improving over time? How fast are you spending your error budget? Even more powerful is the ability to forecast when your reliability will reach a critical state (e.g. 100% depleted), so you have enough time to respond proactively before your customers become unhappy, whether that means shifting focus to bug fixes, paying down tech debt, or meeting to plan how to prevent customer impact. Of course, manually tracking all error budgets is not realistic or time-efficient.
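To make the forecasting question concrete, the arithmetic behind a simple linear projection can be sketched in a few lines of Python. This is only an illustration, not how SLO Manager computes its forecasts: the function name `forecast_depletion`, the time-based budget (minutes of unreliability allowed per window), and the assumption that the budget keeps burning at its average rate so far are all ours.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BudgetForecast:
    budget_minutes: float               # total "bad" minutes the SLO allows per window
    consumed_fraction: float            # share of the budget already spent (can exceed 1.0)
    burn_minutes_per_day: float         # average bad minutes per elapsed day
    days_to_depletion: Optional[float]  # None when nothing is being burned


def forecast_depletion(slo_target: float, window_days: float,
                       bad_minutes_so_far: float, elapsed_days: float) -> BudgetForecast:
    """Linear forecast: assumes the budget keeps burning at its average rate so far."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")

    # Error budget = the unreliability the SLO tolerates over the whole window.
    budget_minutes = (1.0 - slo_target) * window_days * 24 * 60
    consumed_fraction = bad_minutes_so_far / budget_minutes
    burn_per_day = bad_minutes_so_far / elapsed_days
    remaining = max(budget_minutes - bad_minutes_so_far, 0.0)

    # With no burn there is nothing to project; otherwise divide what is left by the rate.
    days_left = remaining / burn_per_day if burn_per_day > 0 else None
    return BudgetForecast(budget_minutes, consumed_fraction, burn_per_day, days_left)
```

Under these assumptions, a 99.9% target over a 28-day window allows about 40.32 minutes of unreliability. A service that has accumulated 20.16 bad minutes in its first 7 days has spent about half of its budget and, at that pace, would deplete it in roughly 7 more days.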
Blameless’ SLO Manager now offers a built-in error budget notification service that automatically notifies users at various thresholds by email, by Slack, by starting a Blameless incident, or by any combination of the three. There are two threshold types:
Both thresholds are now fully customizable (as a percentage or a number of days), and each error budget policy can notify users at one or more thresholds. Our goal is to let your escalation profile match the requirements of your service operation procedure, so you can proactively address reliability issues before your customers notice and get frustrated.
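As a rough mental model of a policy with several thresholds, consider the Python sketch below. It is not the SLO Manager data model or API: the `Threshold` class, the `kind` values `"percent_consumed"` and `"days_remaining"`, and the channel names are hypothetical, chosen only to mirror the "% or number of days" options and the email, Slack, and incident notifications described above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Threshold:
    kind: str                  # hypothetical: "percent_consumed" or "days_remaining"
    value: float               # a percentage (0-100) or a number of days
    channels: Tuple[str, ...]  # hypothetical names: "email", "slack", "incident"


def triggered(policy: List[Threshold], percent_consumed: float,
              days_remaining: Optional[float]) -> List[Threshold]:
    """Return every threshold in the policy that the current budget state has crossed."""
    hits = []
    for threshold in policy:
        if threshold.kind == "percent_consumed":
            # Fires once at least this share of the budget has been spent.
            if percent_consumed >= threshold.value:
                hits.append(threshold)
        elif threshold.kind == "days_remaining":
            # Fires once the forecast says depletion is this close (None = no forecast).
            if days_remaining is not None and days_remaining <= threshold.value:
                hits.append(threshold)
        else:
            raise ValueError(f"unknown threshold kind: {threshold.kind}")
    return hits


# An escalation profile: a gentle early warning, then louder channels as risk grows.
policy = [
    Threshold("percent_consumed", 50, ("email",)),
    Threshold("percent_consumed", 90, ("slack", "incident")),
    Threshold("days_remaining", 3, ("slack",)),
]
```

With this policy, a budget that is 60% consumed with 10 days of runway trips only the first threshold, while one that is 95% consumed with 2 days left trips all three.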
Organizations that rely on multiple monitoring tools can consolidate all their SLO management through Blameless. To support this consolidation, our SLO Manager now integrates natively with New Relic, one of the leading Application Performance Monitoring tools.
The new integration gives you full control and flexibility to create your New Relic SLIs using the powerful New Relic Query Language (NRQL). You can use NRQL to periodically retrieve metric data from any New Relic metric type, such as APM, Browser, Infrastructure, Mobile, or Synthetic. This applies to all supported SLI types (Availability, Latency, Throughput, and Saturation) in the SLO Manager.
If you are new to NRQL, you can start by exploring the following New Relic online documentation:
Example of a Latency SLI:
Additionally, since each SLI takes only a single column of data, we validate your NRQL query string to make sure it includes only one column. This ensures that whatever follows the SELECT keyword evaluates to a single field, which may itself be the result of a mathematical expression supported by the query language.
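To illustrate the idea of a single-column check, here is a simplified Python sketch. It is not the validator Blameless uses, and it does not parse full NRQL: the helper names are hypothetical, and the approach (splitting the SELECT list on commas that sit outside parentheses and single quotes) is an assumption made for illustration. The sample queries use standard New Relic `Transaction` attributes.

```python
import re
from typing import List

# Clause keywords that end the SELECT list in this simplified sketch.
_CLAUSE = re.compile(
    r"(?:FROM|WHERE|FACET|SINCE|UNTIL|TIMESERIES|LIMIT|COMPARE|WITH)\b", re.IGNORECASE
)


def _starts_clause(text: str, i: int) -> bool:
    """True when a clause keyword begins at index i and is not the tail of a longer word."""
    preceded_by_word = i > 0 and (text[i - 1].isalnum() or text[i - 1] == "_")
    return not preceded_by_word and _CLAUSE.match(text, i) is not None


def select_columns(nrql: str) -> List[str]:
    """Split the SELECT list on commas that are outside parentheses and quotes."""
    start = re.search(r"\bSELECT\b", nrql, re.IGNORECASE)
    if start is None:
        raise ValueError("query has no SELECT clause")
    rest = nrql[start.end():]
    columns, current = [], []
    depth, in_quote = 0, False
    for i, ch in enumerate(rest):
        if ch == "'":
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif depth == 0 and ch == ",":
                columns.append("".join(current).strip())
                current = []
                continue
            elif depth == 0 and _starts_clause(rest, i):
                break
        current.append(ch)
    columns.append("".join(current).strip())
    return [column for column in columns if column]


def validate_single_column(nrql: str) -> str:
    """Return the single SELECT expression, or raise if there is not exactly one."""
    columns = select_columns(nrql)
    if len(columns) != 1:
        raise ValueError(f"expected exactly one column after SELECT, found {len(columns)}")
    return columns[0]
```

In this sketch, `SELECT percentile(duration, 95) FROM Transaction` and `SELECT count(*) / uniqueCount(host) FROM Transaction` both pass, because the comma in the first sits inside a function call and the second is one arithmetic expression. `SELECT average(duration), max(duration) FROM Transaction` is rejected because it yields two columns.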
By automatically validating your queries, the Blameless interface helps you avoid errors that could delay the arrival of the meaningful data your SLIs need to be effective for your user journeys. For example:
If you are currently using New Relic or planning to use it, and you have a Blameless account, we encourage you to connect your Blameless account to your New Relic account today and start building your first SLI, SLO and User Journey using our powerful SLO wizard.
For more information on how to connect Blameless to your New Relic account, please refer to our online documentation at:
Connections between Blameless and your application performance monitoring provider (e.g. New Relic, Datadog, Pingdom, or Prometheus) are secure over the public Internet, but they can still fail at any time, which may leave gaps in the historical graphs. To protect you from missing data, Blameless always attempts to backfill the data from the time of failure.
Blameless now reports the history of data transfer issues and successes on a per-SLI basis, via a new tab showing the “data ingest” log for each SLI, starting from the initial backfilling process. This log provides a quick and easy way to troubleshoot the origin of potentially delayed or missing data.
Other issues originating from your data source that could also result in missing data include:
For more information about this new feature, please refer to our online documentation:
To learn about SLO and error budget best practices, check out these blog posts: