The Blameless Blog

Failure Is Inevitable

Nov 18, 2020
The Engineer's Guide to Preparing for Black Friday 2020

In this blog post, we’ll cover how to handle a Black Friday that’s unlike any other we’ve seen thus far. We’ll explain why SLO-based alerting, runbooks, and other preparedness practices are crucial for holiday season success.

Nov 17, 2020
How Mercari Scales Vision, Culture, & Reliability

In a recent fireside chat, Mohan Bhatkar, Head of Engineering for the Customer Reliability Platform at Mercari, Inc., sat down with Blameless Co-Founder Ashar Rizqi. They talked about scaling while avoiding silos, exciting day-to-day challenges, instilling a culture of empowerment, and more. Here are their top insights and the lightly edited transcript of their conversation.

Nov 16, 2020
Blameless Book Club: Implementing Service Level Objectives, Part 1

This is a summary of key topics from Alex’s book, along with thoughts our team had while reading. In this blog post, we’ll cover part one of Implementing Service Level Objectives, “SLO Development.”

Nov 10, 2020
SREview Issue #7 November 2020

We’re drinking Pumpkin Spice Lattes, lighting candles, and wearing flannel. Oh, and reading a bunch of great stuff. Here’s the November issue of SREview! This monthly zine features epic Tweets, content, and events happening in the SRE and resilience engineering community.

Nov 4, 2020
An end-to-end incident in Blameless and PagerDuty

The bi-directional Blameless integration with PagerDuty helps teams add automation to their PagerDuty workflows, minimizing the costs of incident coordination.

Nov 3, 2020
Why I Joined with Chris Hendrix, Engineering

What I’ve found is that I’m driven by teaching emerging best practices: like a surfer chasing the biggest swells, I’m drawn to the waves of new paradigms that have the capacity to transform the way the world works. That is why I’m so excited to join Blameless as a Staff Software Engineer.

Nov 2, 2020
Engineers, Stop Hoarding your Metrics

Like The Hobbit’s dragon Smaug lying on his pile of gold, never spending and only hoarding, many of us stockpile pretty, feel-good, but useless metrics that never make a difference. In fact, they could be clouding your ability to get the context and clarity you need from your metrics.

Oct 28, 2020
Yury Niño Roa Shares her Insights on Chaos Engineering and SRE

In this interview, we’ll delve into what draws Yury to SRE and chaos engineering, how she defines resilience, as well as her predictions on emerging trends in the SRE landscape.

Oct 27, 2020
Here are 4 Ways SRE Helps New Employees Onboard

The SRE mentality can provide insights into many areas, including onboarding itself. In this blog post, we’ll cover how SRE takes onboarding to the next level.

Oct 26, 2020
This is How Blameless Integrates with JIRA

Atlassian JIRA, one of the most popular ticketing systems, allows teams to catalogue incidents, follow-up actions, bugs, stories, and more. As a common tool in any DevOps/SRE operation’s toolchain, JIRA is a key integration at Blameless. Here's how it works.

Oct 19, 2020
3 Ways SRE Can Boost your Business Value

In this blog post, we’ll look at the business value of SRE through customer focus, observability, and efficiency.

Oct 16, 2020
SREview Issue #6 October 2020

BOO! Did we scare you? We couldn’t help it, we’re just so happy it’s spooky season. Here’s the October issue of SREview! This monthly zine features epic Tweets, content, and events happening in the SRE and resilience engineering community.

Oct 13, 2020
Can Security Teams Benefit from SRE? You bet!

In this blog post, we’ll break down how to use SRE to enhance your security procedures.

Oct 8, 2020
How to Construct a Reliability Model for your Organization

In this post, we’ll construct a basic reliability model and show you how to create one for your own organization.

Oct 1, 2020
This is your Guide for Implementing SRE in NOCs

In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.

Sep 30, 2020
The Ultimate, Free Incident Retrospective Template

To make the most of each incident, teams need a solid post-incident template that can help minimize cognitive load during the analysis process. Here is an example of what a comprehensive, narrative incident retrospective could look like.

Sep 24, 2020
Here's your Complete Definition of Software Reliability

In this blog post, we’ll break down what software reliability means. We’ll look at how the reliability of your software is perceived, how teams operate to improve reliability, and how to contextualize reliability with customer happiness and cultural lessons.

Sep 17, 2020
Availability, Maintainability, Reliability: What's the Difference?

In this blog post, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability.

Sep 15, 2020
SREview Issue #5 September 2020

Here’s the September issue of SREview! This monthly zine features epic Tweets, content, and events happening in the SRE and resilience engineering community.

Sep 11, 2020
SRE Leaders Panel: Testing in Production

Our panelists discussed testing in production, how feature flagging and testing can help us do that, and how to get managers to be on board with testing in production.
