
Postmortems Now Called Retrospectives in Blameless

Phoebe Wang | 3.2.2022

Something big happened at Blameless this month: our “Postmortem” feature was renamed “Retrospective”. To the naysayers, I suppose you’re thinking, “This seems trivial. Different teams call it different names anyway, so why bother making the change?” First, let me say thank you for reading our blog, and I hope you read this one through to the end. Now, allow me to explain our reasoning and why we’re excited about this update.


Retrospectives Dashboard


A fundamental tenet of SRE is treating failures as systemic problems that require systemic solutions. This mindset fosters a more positive and psychologically safe environment, and it inspired the SRE mantra “blameless culture” (you can read more on that in our post about SRE culture). One practical way to apply it is to be more intentional with post-incident analysis. For example, you can study a post-incident report to identify gaps in systems, tooling, and processes. We always encourage teams to use the opportunity to brainstorm sustainable solutions. In fact, it’s good to remind yourself that an incident is never an isolated event. There’s virtually always something more to discover that needs addressing.


We want everyone to succeed in their SRE journey, and it’s important for us to evangelize this forward-looking approach to post-incident analysis. Earlier I noted that teams use a few different terms for the part of incident management that involves a post-incident report and its analysis. This is true. Still, most of us will agree that the most common and longest-running name is postmortem. Postmortem is actually a medical term that dates back to the 1820s. In tech, it’s used metaphorically to describe reviewing an incident after its “death” and recording detailed notes. In a way, sure, I guess that makes sense. The incident is over; we killed the beast. But did we? Most of us know this usually isn’t true. And if we’re thinking ahead, we take this as a lesson and brace ourselves for what’s next.


By changing the product feature name from “Postmortem” to “Retrospective”, Blameless (the company) aims to encourage teams to view incidents as learning opportunities. A retrospective can prepare you for similar events in the future or help you identify a larger underlying issue. You’ve heard us say before that incidents are unplanned investments. Teams should cherish the post-incident process by collaborating on how to improve the system moving forward. The term postmortem implies that since the event is “over”, there’s nothing more to discuss. That’s not the case. Smart nomenclature is mindful that words are always subject to interpretation. Let’s say I’m a woodshop teacher explaining to my students that it’s important to wear goggles when cutting and to always engage the safety lock when a saw is not in use (disclaimer: I’m not a woodshop expert). I call these “safety protocols” for everyone’s protection. I don’t call them “restrictions”. I don’t even call them “rules”. Sure, they’re effectively rules and restrictions, but the purpose behind them is to protect. That’s the theme I want everyone to keep in mind. Silly example, but I hope you see where I’m going.


Semantics! Yep, and proud of it. Earlier I mentioned that “blameless culture” is heavily emphasized in the site reliability engineering community. Moving away from the term postmortem helps engender a blameless culture. We previously wrote about how and why the words we use in engineering impact the way we think and work. Language has the power to shift our perspective. It can make us excited for something or make us dread it. It can change the level of importance we attach to a thing. It can lead us to expect a situation to be combative or collaborative. In this particular case, retrospective promotes constructive conversation, discourages finger-pointing, and fosters problem-solving. By contrast, postmortem, with its association with death, implies finality and bears a strongly negative connotation. Not all incidents are Sev0, but we usually still have something important to learn. There’s no reason to associate an incident with the idea of death.


My final “battle card” is that this feature update is a request we’ve received from many Blameless customers. Several have told us that they refer to the post-incident process internally as the “retrospective” (“retros” for short) and would love to see that reflected in the Blameless product. They do this to promote collaboration, build long-term sustainable and scalable solutions, and discourage finger-pointing. In fact, our friends at HashiCorp submitted a ticket to us for this specific feature request. Martin Smith, Senior Site Reliability Engineer at HashiCorp, explains, “We believe that retrospectives create continuous reflection and improvement whereas postmortems imply a root cause, which drives the wrong outcomes vs. future improvement. Root cause analysis is disappearing as folks build more and more distributed systems and analyze incidents more like the airlines than a mainframe.” We’re excited that we can finally deliver this update to our customers and continue to partner with you on your reliability journey. To all of our customers, whether or not you requested the update, we understand this will take a bit of getting used to. Thank you for progressing with us as we continue to embrace a blameless mindset. It’s all part of moving the needle forward for more reliable services and resilient teams.




At Blameless, our goal is to be more than just a product for incident response and SLO management. We want to share everything we know about site reliability engineering and make it more accessible. One of the ways we do this is by providing a best-in-class product for engineering and on-call teams, whether that’s through functionality, service, or even (you guessed it) product nomenclature. Another way we like to share knowledge with the SRE community is through our blog *shameless plug* and other resources like our podcasts and webinars, where we record conversations with SRE experts in the community. We welcome you to check those out, and if you’re ever interested in having a chat with our experts, feel free to request a demo. We’d be happy to walk you through the product and share some of our insights. Finally, thank you to our customers who continue to partner with us as pioneers of reliability engineering!
