The Blameless Blog

Failure Is Not An Option Inevitable

Oct 19, 2020
3 Ways SRE Can Boost your Business Value

In this blog post, we’ll look at the business value of SRE through customer focus, observability, and efficiency.

Oct 16, 2020
SREview Issue #6 October 2020

BOO! Did we scare you? We couldn’t help it, we’re just so happy it’s spooky season. Here’s the October issue of SREview! This monthly zine features epic Tweets, content, and events happening in the SRE and resilience engineering community.

Oct 13, 2020
Can Security Teams Benefit from SRE? You bet!

In this blog post, we’ll break down how to use SRE to enhance your security procedures.

Oct 8, 2020
How to Construct a Reliability Model for your Organization

In this post, we’ll construct a basic reliability model and show you how to create one for your own organization.

Oct 1, 2020
This is your Guide for Implementing SRE in NOCs

In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.

Sep 30, 2020
The Ultimate, Free Incident Retrospective Template

To make the most of each incident, teams need a solid post-incident template that can help minimize cognitive load during the analysis process. Here is an example of what a comprehensive, narrative incident retrospective could look like.

Sep 24, 2020
Here's your Complete Definition of Software Reliability

In this blog post, we’ll break down what software reliability means. We’ll look at how the reliability of your software is perceived, how teams operate to improve reliability, and how to contextualize reliability with customer happiness and cultural lessons.

Sep 17, 2020
Availability, Maintainability, Reliability: What's the Difference?

In this blog post, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability.

Sep 15, 2020
SREview Issue #5 September 2020

Here’s the September issue of SREview! This monthly zine features epic Tweets, content, and events happening in the SRE and resilience engineering community.

Sep 11, 2020
SRE Leaders Panel: Testing in Production

Our panelists discussed testing in production, how feature flagging and testing can help us do that, and how to get managers to be on board with testing in production.

Sep 8, 2020
How to Improve the Reliability of a System

In this blog post, we’ll work through some helpful steps to take when improving a system’s reliability. We’ll use a development project as an example, but the essence of this advice can be applied anywhere SRE is being implemented.

Sep 3, 2020
Industry Experts Explain how to Thrive in a Post-COVID World

In a CIO panel hosted by Lightspeed Venture Partners, industry experts came together to discuss how to thrive in a post-COVID world. Here are key insights from their conversation.

Sep 2, 2020
Determining Error Budgets and Policies that Work for Your Team

In this blog, we’ll look at the basics of error budgeting, how to set corresponding policies, and how to operationalize SLOs for the long term.

Sep 1, 2020
How to Build Your SRE Team

In this blog post, we’ll look at some of the many roles an SRE can play, and how to find people with those skill sets.

Aug 26, 2020
Here are the Important Differences Between SLI, SLO, and SLA

In this blog post, we’ll cover what SLI, SLO, and SLA mean and how they contribute to your reliability goals.

Aug 25, 2020
How SLOs Enable Fast, Reliable Application Delivery

In this blog, we’ll discuss how SLOs are the key to modern application delivery, how to manage and measure them, the importance of observability for your SLO solution, and how to begin the journey to reliable application delivery today.

Aug 21, 2020
SREview Issue #4 August 2020

Here’s the August issue of SREview! This monthly zine features epic Tweets, content, and events happening in the SRE and resilience engineering community.

Aug 20, 2020
What is a Kubernetes Operator and Why it Matters for SRE

In this blog post, we’ll explain the Kubernetes Operator—the Kubernetes function at the heart of customized automation—and discuss how it can evolve your SRE solution.

Aug 19, 2020
Here are the Metrics you Need to Understand Operational Health

In this blog post, we’ll walk you through holistic measures and best practices that you can employ starting today. These will include challenges and pain points in gaining insight as well as key metrics and how they evolve as organizations mature.

Aug 14, 2020
Resilience in Action, E5: Tammy Bryant and Eric Roberts, The Importance of Glue Work

In our third episode, Amy chats with Tammy Bryant, Principal SRE at Gremlin, skateboarder, and horror movie lover, and Eric Roberts, Sr. Manager SRE at Under Armour, performer/writer/recorder of music, and coffee aficionado.
