The Blameless Blog

Failure Is Not An Option Inevitable

July 2, 2020
SRE Leaders Panel: Managing Systems Complexity

Leading minds in the resilience industry discuss how SRE can manage systems complexity, and how that's tightly intertwined with business health especially in the context of current health and social crises.

July 1, 2020
SLO Adoption at Twitter

The concepts of service level objectives (SLOs) and error budgets have been key to this transformation, as SLOs shape an organization’s ability to make data-oriented decisions around reliability. (Read here for a definition of SLOs and how they transformed Evernote.) Today, the Twitter team has invested in centralized tooling to measure, track, and visualize SLOs and their corresponding error budgets.

June 30, 2020
Twitter’s Reliability Journey

We had the privilege of interviewing Brian Brophy, Sr. Staff SRE, Carrie Fernandez, Head of Site Reliability Engineering, JP Doherty, Engineering Manager, and Zachary Kiel, Sr. Staff SRE to learn about how SRE is practiced at Twitter.

June 29, 2020
How SLIs Help You Understand Users' Needs

To be effective, service level indicators must be relevant to the users’ needs and experience. By consolidating a number of internal metrics into one indicator that reflects the typical use of the service, we can ensure that meeting our SLO means keeping users happy. A good way to think about this is by looking at the user’s experience or journey.

June 26, 2020
How to Reduce Engineering Waste: Embrace Resilience

Resiliency isn’t something that just happens; it’s a result of dedication and hard work. To reach your optimal state of resilience, there are some crucial SRE best practices you should adopt to strengthen your processes.

June 26, 2020
Top Practices for Runbook Automation

Runbooks, also known as playbooks, are documents that walk you through a certain task with specific steps. Automated runbooks can be a powerful tool for time-saving and consistency. We’ll look at five best practices for getting the most out of runbook automation, some tools on the market that can help you implement them, and discuss how to integrate runbook automation into a complete SRE solution.

June 25, 2020
What is Site Reliability Engineering? A Human Approach to Systems

As organizations are made of people, any organization can foster continuous learning, blameless culture, and psychological safety so long as its people are committed to a growth mindset. Once these cultural factors are in place, it becomes much easier to implement the practices, processes, and tools that scale that culture of excellence. 

June 24, 2020
SREview Issue #2, June 2020

Here’s the second issue of SREview! This monthly zine features epic Tweets, content, and events happening in the SRE and resilience engineering community.

June 19, 2020
Best Practices for Effective Incident Management

Below are five incident management best practices that your team can begin using today to improve the speed, efficiency, and effectiveness of your incident management process.

June 17, 2020
Resilience in Action, E3: Inclusion and Integrity with Sidney Miller

In our third episode, Amy chats with Sidney Miller, Talent Acquisition Lead at Packet and Inclusion Strategist for those that can not have a voice.

June 16, 2020
At Blameless, Reliability is Personal

I was asked to talk about why is reliability important to me personally. I was up at 3:00 AM this morning, thinking through this question. So my sleep is obviously pretty unreliable and those kinds of questions will always get me going. And I thought, let me kind of walk folks through how reliability is personal to me.

June 12, 2020
Blameless Is Awarded CIOReview Top 2020 DevOps Solution Provider

We're proud to announce that we were selected by CIOReview as one of the Top 20 DevOps Solution Providers of 2020 alongside other innovators in the space such as Chef, JFrog, Splunk, and XebiaLabs. This recognition validates our vision to help teams achieve production excellence by facilitating resilience and learning.

June 11, 2020
Announcing our new integration with GoToMeeting

In addition to Zoom, Slack and Google Hangouts, Blameless has released a new integration with GoToMeeting to further extend our collaboration capabilities. With this integration, customers can automatically spin up a GoToMeeting link within the Blameless Slack incident channel.

June 9, 2020
A Journey Through Blameless from Incident to Success

Here at Blameless, every aspect of our product has SLOs (Service Level Objectives) and error budgets in order to help us understand and improve customer experience. Sometimes, these error budgets are at risk, triggering an incident.

June 5, 2020
SRE Leaders Panel: Work as Done vs. Work as Imagined

Blameless recently had the privilege of hosting some fantastic leaders in the SRE and resilience community for a panel discussion. Our panelists discussed the effects of imposter syndrome, especially during high-tempo situations, how to use it to our advantage and overcome doubt, and how culture directly affects the availability of our systems.

May 29, 2020
SREview Issue #1 May 2020

Welcome to the SREview! This zine will feature epic Tweets, content, and events happening in the SRE and resilience engineering community throughout the month.

May 26, 2020
Introducing Blameless Service Level Objectives

Over a year ago, Blameless launched the industry’s first end-to-end SRE platform to help software teams innovate without sacrificing reliability. As Service Level Objectives (SLOs) provide an anchor for reliability targets and corresponding decisions, they are the foundational step toward helping teams truly adopt SRE best practices. Today, we are very excited to announce our new SLO platform, giving teams a shared language on how to focus their engineering efforts.

May 22, 2020
Join Blameless at INS1GHTS2020!

Blameless is so excited to sponsor INS1GHTS2020. This one-day digital gathering of industry leaders in NetOps, DevOps, and application delivery provides the (virtual) space for candid conversations and presentations on navigating the present and building the infrastructure that will power the future.

May 21, 2020
Join us at Catchpoint’s SRE from Home!

If you’re interested in spending time with the resilience engineering community, chatting about how COVID-19 has affected your work, or simply just relaxing with a nice beverage while listening to some awesome speakers, make sure you save your seat today.

May 20, 2020
Fostering Psychological Safety in Remote Teams is Crucial

Psychologically safe organizations are free to create, discuss, disagree, take risks, and make mistakes. These organizations are often the ones we see as key innovators in their unique industries. In other words, cultivating a culture of psychological safety is paramount in order to succeed. So what can we do to make sure our teammates feel secure even while socially distanced?

May 12, 2020
Resilience in Action, E2: Adaptability, ego, and scaling with Tim Banks

In our second episode, Amy chats with Tim Banks, a technical account manager at Mission who has held the title of database engineer, DevOps engineer, SRE, American National and Pan American Brazilian Jiu-Jitsu champion, and professional chef during his career.

May 6, 2020
Managing Burnout During COVID-19

During this crisis, managing burnout has become more difficult with people unable to separate home from work, the increased burden of keeping everything on and heightened on-call loads, and the strain on communication. Here are tips to help combat burnout in your teams.

May 1, 2020
Deserted Island DevOps Recap

On April 30, 2020, Austin Parker, Principal Developer Advocate at Lightstep and co-host of On-Call Me Maybe, hosted a one-of-a-kind DevOps conference. Deserted Island DevOps was the first ever conference held in the world of Animal Crossing: New Horizons.

April 29, 2020
How resilience and security shift left: An interview with the EVP Product & Engineering and CISO at FOX

CEO and Co-founder of Blameless Ashar Rizqi had the privilege of interviewing Melody Hildebrandt on her fascinating personal story, as well as her thoughts on security and resilience in today’s constantly evolving world of technology.

April 28, 2020
How We Use Blameless to Power Remote Work

We’ve been relying on Blameless more and more to improve how we collaborate virtually. Here are some of the top workflows and tips on how we have been using Blameless internally to streamline remote productivity.

April 24, 2020
A "Retrospective" of Amy Tobey's "The Future of DevOps is Resilience Engineering"

April 22, 2020 at 11:20 AM PST, Amy Tobey began her talk “The Future of DevOps is Resilience Engineering” at Gremlin’s Failover Conf. During her talk, attendees registered additional questions. Requests and responses noted in timeline below.

April 23, 2020
Reflections on Gremlin's Failover Conf

With dozens of cancelled events, social distancing policies, and heightened stress due to the current crisis, it was more necessary than ever to take a moment to learn, share, and talk to one another about something we are all passionate about. We loved Failover Conf, and want to share our favorite parts with you.

April 22, 2020
Getting SRE Buy-in from C-Levels for Error Budgets and SLOs, Part 3

Two levels of management have agreed to your SRE buy-in efforts. That is a huge accomplishment! If you’re here, you're gaining great traction in adopting SRE best practices, but the battle is not won yet.

April 21, 2020
Thought Leadership Panel: What is a "real" SRE?

SRE leaders Craig Sebenik, David Blank-Edelman, and Kurt Andersen discuss how SREs can approach work as done vs work as imagined, how to define SRE and DevOps and the complementary nature of the two, the ethics of purchasing packaged versions of open source software, and more.

April 16, 2020
Incident Readiness and Observability for Production Teams: Save Your Spot!

We are very excited to partner with Lightstep to share practical steps on gaining deep observability into distributed systems, and automating toil from incident response and learning to improve production readiness in our live webinar, Incident Readiness, Observability & Learning for Production Teams.

April 14, 2020
Getting SRE Buy-in from a VP or Director for Automated Metrics and Continuous Learning, Part 2

After getting managerial approval for incident management, your SRE buy-in program is well underway. In part 2 of this blog series, we're going to share how to convince a VP or director to invest in additional SRE practices to strategically improve business results: automated metrics and continuous learning.

April 9, 2020
Getting SRE Buy-in from a Manager or Lead for Incident Response, Part 1

In this blog series, we will walk you through how to come up with a winning pitch for each level of leadership to ensure that SRE buy-in will succeed in your organization. Let’s start at the beginning with your team lead or manager.

April 8, 2020
Resilience in Action, Episode 1: Narratives in Incidents with Lorin Hochstein

April 7, 2020
Technology Innovation Snapshot: How Blameless Accelerates Team Performance

In Digital Enterprise Journal’s March Edition of its Technology Innovation Snapshot, Blameless was listed alongside 11 other companies as a promising vendor. Blameless is honored to be alongside companies such as Gremlin, Catchpoint, and Moogsoft, and excited about the future DEJ sees for the SRE space.

April 3, 2020
Blameless is a Proud Sponsor of Gremlin's Failover Conf

We’re so proud to be co-sponsoring Gremlin’s new virtual conference, Failover Conf. As Gremlin states, “We expected to gather together in person to share our knowledge and experiences when the unexpected happened. But we’re resilient. When one opportunity goes down, we create another.”

April 2, 2020
How SREs can Embrace Resilience During Crises

Blameless recently had the privilege of hosting SRE leaders Liz Fong-Jones, Dave Rensin, and Alex Hidalgo to discuss how SREs can embrace resilience during a pandemic, and how the principles of SRE intersect with global trends.

April 1, 2020
Survivor Season 41, Bay Area

Blameless is excited to announce its sponsorship of Survivor Season 41: Silicon Valley. In this season of Survivor, players will hold nothing back! It’s a season that will shock everyone, even Jeff Probst himself. Survivor season 41 will feature an age-old conflict: Developers vs Operations!

March 31, 2020
Incident Management: Roles, Challenges, and Mastering Incident Command

The goal of this piece is to provide some practical advice on how teams can coordinate and respond to complex, dynamic incidents. After all, incidents are unplanned investments that surface valuable learnings for improvement.

March 26, 2020
SRE Office Hours with Staff SRE Amy Tobey

Blameless Staff SRE Amy Tobey is lending her time to provide SRE office hours to help anyone in need get their head above water. She cares deeply about her community of SREs and wants to take what she’s learned over the 20+ years of her career to help others.

March 24, 2020
SRE for Business Continuity in the Face of Uncertainty

No, it won’t be possible to continue operating business-as-usual. For the foreseeable future, teams across the world will be dealing with cutbacks, infrastructure instability, and more. However, with SRE best practices, your team can embrace resilience and adapt through this difficult time.

March 19, 2020
Our Top 5 On-Call Practices

On-call: you may see it as a necessary evil. It isn’t a surprise that many engineers have horror stories about the difficulty of carrying a pager around the clock. But does on-call have to be so dreadful? We think not. Here are five best practices that can help your team respond quicker and build more resilient systems that minimize repetitive interruptions.

March 16, 2020
The Incident Response Approach to Remote Work

In response to recent events, many organizations are implementing social distancing programs such as remote work. Successfully transitioning to remote work does come with challenges, but the right practices and attitudes can make it much less painful (and safer for you than heading into the office).

March 12, 2020
Great Incident Response Requires 3 Major Components

Remote work is only projected to increase, and teams need to be able to adapt in order to resolve incidents quickly and efficiently, even if team members are a thousand miles away. But how can we make great incident response a reality?

March 10, 2020
How ITIL, DevOps, and SRE Work Together for your Organization

The trick is to ensure that regardless of your organization’s different operating models or toolchains, there is shared visibility, communication, and collaboration across teams. This will allow your disparate teams to stay aligned while using the best practices from ITIL, DevOps, and SRE.

March 3, 2020
Why I Joined Blameless - Afif Mohd-Amir

Growing up, I always had a pretty wild imagination, drawing up the craziest of ideas and sharing them with my friends. That process of idea sharing almost always went like this:

March 3, 2020
How do we Apply SRE Outside of Engineering with Google's Dave Rensin

In this talk from Dave Rensin, Engineering Leader at Google, you'll learn about what it looks like to apply SRE principles outside of engineering in organizations.

February 27, 2020
Using AI to Auto-Detect and Remediate Incidents

Today, the number of possible failure modes in cloud and microservices applications is exploding, making it increasingly difficult to gain true observability and take the right action across IT environments. Register for this webinar with Blameless and Zebrium to find out how AI can help with incident auto-detection.

February 19, 2020
5 Surefire Ways to Improve Your Product Reliability with Logging and Automation

Over many years of working with customers, we have come to the conclusion that there are several specific areas of focus where investment in automation can add tremendous value over the long run.

February 18, 2020
Evolving Blameless' SRE Practices with Amy Tobey

At Blameless, we drink our own champagne, and aim to adopt a mindset of continuous learning to foster resilience. We believe that the adoption of SRE practices is one of the best ways to get there.

February 12, 2020
Structuring Your Teams for Software Reliability

How well positioned is your team to ship reliable software? What are the different roles in engineering that impact reliability, and how do you optimize the ratio of software engineers to SREs to DevOps?

February 4, 2020
How to Network Effectively as an SRE

For many SREs, networking prompts a response similar to going to the dentist. You know you should do it, but you don’t really want to. But networking is much less like a root canal and more like a regular teeth cleaning; you may not want to go, but once you’re there, it’s not so bad.

January 29, 2020
New Postmortems Design and Commenting Functionality

The new comment sidebar helps drive postmortem workflows by enabling collaborators to comment on postmortems, reply to comments, and resolve comments. We’ve also updated the look and feel of postmortems so that postmortem authors can gain as well as provide important post-incident context in a simple way.

January 21, 2020
What Are Service-Level Objectives? Lessons Learned

Service Level Objectives, or SLOs, are internal goals for the essential metrics of a service, such as uptime or response speed. We’re probably familiar with this definition, but what is the value of setting these goals?

December 26, 2019
5 Best Practices on Nailing Postmortems

Reading about postmortem best practices can sometimes be quite different from seeing them in action. Postmortems are like snowflakes; no two will ever look the same.

December 18, 2019
An SRE Carol

We’re probably all familiar with Dickens’ story of Scrooge and the Three Ghosts of Christmas, written all the way back in 1843. What we may not know is that ghosts providing visions and teaching lessons is still common practice today! Let’s look into the carol of an ambitious, but unreliable, tech CEO.

December 11, 2019
Why I Joined Blameless - Simone Salman

My name is Simone Salman, and I’ve been working as a software engineer at Blameless since May 2019. In the spirit of thanks as we’re approaching the holidays, I wanted to reflect on my time at Blameless thus far, and share a few things about the culture that I’m especially grateful for.

December 10, 2019
Building Reliability Through Culture with Veteran Google SRE, Steve McGhee

It’s astonishing that despite the tremendous time we spend working on our systems, we seem to have very little control over them. If we can’t predict where the next incidents will come from, then we will be forever stuck in a reactive cycle of repair. An analogous example is the famous fable of the Three Little Pigs.

November 26, 2019
Improving Postmortem Practices with Veteran Google SRE, Steve McGhee

For many SREs, Google’s 99.999% availability seems like an untouchable dream. If anything, getting out of pager hell is already worth celebrating with all your coworkers, friends, and family. How can you get to a stage where you have time to proactively prevent incidents, and enter a mental state of calm and control?

November 21, 2019
9 Reliability Talks at AWS re:Invent 2019 that SREs Should Attend

Planning your schedule for AWS re:Invent 2019 but don’t know how to choose between the 3400 sessions? If you are passionate about all things reliability, we’re here to help you sift out the signal from the noise.

October 29, 2019
The Tipping Point: 4 Signs Software Reliability Should be a Top Priority at Your Company

Thanks to companies like Amazon, Google, Facebook, Netflix, etc., software delivery is transitioning from a novelty to a utility... When feature requests for reliability exceed 50% of all feature requests, it’s time to focus on reliability first and foremost.

September 13, 2019
Introducing Swimlanes for Incident Resolution

August 19, 2019
Trend Alert: SRE is Shifting Left

Having talked with 300 companies from industries like retail, finance, healthcare, and SaaS, we see that SRE as a discipline is shifting left in the software development life cycle... However, this does not take away job opportunities from SREs. Shifting left allows SREs to become partners in the development process.

July 1, 2019
Why Every Company Can Benefit from a Blameless Culture

When companies blame, fearful employees are not incentivized to surface issues early or ship risky changes... The fact is, complex systems fail. Rather than blaming individuals for these failures, the only way to navigate this complexity is to empower people to have the adaptive capacity that machines do not.

May 7, 2019
Blameless announces ISO 27001 certification

April 29, 2019
How to Avoid the 5 SRE Implementation Traps that Catch Even the Best Teams

Whether you are implementing SRE or DevOps, your best intentions are likely going to disappoint you on the first try. Much like with DevOps, the path to successful SRE implementation is not as elusive as it may seem. Costly mistakes and team dissatisfaction can be avoided — if you take the right steps.

April 24, 2019
See Blameless at PyCon and join the team for drinks and snacks

If you'd like to get to know us better, then join the team for drinks and appetizers at the unique and delicious restaurant Hodge's on May 4th, or if you'd like to apply for a position to work with Blameless and change the world through SRE, visit us at booth 23 at the PyCon job fair!

March 11, 2019
Blameless at SREcon19 Americas

THE conference for SRE is just around the corner, and we hope you're able to join the Blameless team there in Brooklyn. We'll be at booth 15, located in the Promenade. Stop by to chat with us about SRE and how Blameless can take care of every step in your SRE process.

February 12, 2019
Why SaaS Can't Ignore SRE

Walk the halls of any SaaS startup and you’ll hear the same thing over and over again, “before we can really scale, we have to be enterprise-ready.” While “enterprise ready” has many definitions, it’s generally defined as being secure and compliant, having the baseline features the market expects, and being consistently reliable.

February 1, 2019
Blameless at SaaStr 2019

It's that time of the year again: SaaStr time! Blameless is heading to SaaStr in San Jose and hopes you'll be able to join us. You can stop by Booth 721 to chat with us about Site Reliability Engineering (SRE) and why Blameless might be the best solution for you. Come grab some swag, chat with us, and RSVP for our VIP dinner!

November 28, 2018
Exciting News at KubeCon 2018

December is typically considered a month of reflection and anticipation for what’s to come -- it's kind of like a postmortem for your year 😉. 2018 was a great year for Blameless, and we have something even more exciting coming in 2019. So join us at KubeCon 2018 as we get ready to make an announcement that will be guaranteed to brighten your new year.

November 13, 2018
Why I Joined Blameless - Kevin Greenan

In many companies, incident management has turned into a game of hot potato. Our natural human tendency is to blame, but blaming actually discourages resolution, because whoever fixes the problem could be blamed for creating it in the first place.

October 8, 2018
Getting to 99.999% Availability with Twilio’s Tyler Wells

A remarkable milestone for any company’s site reliability engineering (SRE) is five 9s availability. That’s less than 30 seconds of service unavailability per month! Exactly what Twilio has accomplished. Tyler Wells, the Director of Engineering at Twilio, shares the key building blocks of getting to five 9s.

September 19, 2018
LinkedIn’s Kurt Andersen Uncovers Blindspots in SRE Implementations

Most feature developers don’t plan for retirement of features. Microservices give you the illusion that you can yank and replace, but that’s not really the case. It’s tough to turn off a microservice without losing an arm or leg. That’s why it’s important for SRE to have a full life cycle engagement.

August 9, 2018
Talking with Matt Klein about “A Culture of Reliability”

A discussion with David N. Blank-Edelman and Matt Klein.

August 3, 2018
But I Already Have DevOps! (how SRE figures into the picture)

I have yet to meet an organization that hasn't had to balance often conflicting needs around feature velocity and operational stability.

July 13, 2018
Getting the Most Out of SRE, SLOs, and Error Budgets with Joseph Bironas at Collective Health

Joseph currently leads the SRE team at Collective Health, a company that is transforming the employer-driven healthcare economy, redefining the way health benefits work.

July 1, 2018
Garrett Plasky Shares How SLOs Transformed Evernote

Having SLOs in place for your production services allows you to remove the emotion and ambiguity when it comes to figuring out the impact of an unplanned outage or a bug released to production.

June 4, 2018
Introducing the brand new Blameless dashboard

Introducing the Blameless dashboard

June 1, 2018
Building Blameless right from the beginning

Building Blameless right from the beginning
