What is BetterCloud?
BetterCloud is a SaaS operations management platform, enabling IT professionals to transform their employee experience, maximize operational efficiency, and centralize data protection.
With no-code automation enabling zero-touch workflows, thousands of forward-thinking organizations like Walmart, Oscar Health, and Square now rely on BetterCloud to automate processes and policies across their cloud application portfolio.
Before using Blameless, BetterCloud lacked a clear process to manage incidents. The engineers faced role ambiguity, missing documentation, and a lack of meaningful reliability data. On top of that, it had become common within the company culture to view incidents as something to avoid.
Before the company became remote-first, a clock in the office kitchen counted the hours since the last incident, and the team watched it closely. Teams hesitated to declare customer-impacting events and sometimes avoided declaring them altogether.
The SRE Team Steps in to Evolve the Business
In 2021, BetterCloud’s Site Reliability Engineering (SRE) team was relaunched, expanded, and chartered with several ambitious projects, including establishing a better incident management process across the organization, with the end goal of reducing reliability-related customer churn.
Stephen M. Dick, BetterCloud’s VP of Cloud Engineering and Optimization, recalls, “BetterCloud’s customer base had expanded, and we had customers using our platform for business critical automation. Our incident management process needed to evolve to be enterprise grade for our next level of scale.”
Blameless Onboards 120+ Engineers at the Onset
After evaluating incident management and SRE tools on the market, BetterCloud chose to partner with Blameless. Stephen wanted a solution that would enable ChatOps-based incident management, track remediation steps to address underlying issues, and serve as a single source of truth for reliability metrics, including MTTD and MTTR. BetterCloud also brought in a team of external consultants to help the business transform its approach to incidents using the IMS framework, which is adopted from the public sector. Incident tooling needed to be flexible enough to accommodate these best practices, and Blameless met and exceeded expectations.
“The in-chat tagging and reporting capabilities with Blameless transformed what was perceived as an after-action chore into something our engineers enjoy using.”
“In This Together” is a company value that promotes teamwork and collaboration at BetterCloud. The company encourages this not only within the organization but also with stakeholders such as customers and partners. Accordingly, the Blameless rollout was a collaborative effort from the start. Stephen’s team worked closely with Matthew Dodge, Customer Success Manager at Blameless, to run recorded training sessions, hold launch events, and get configuration advice.
Before Blameless
- Unorganized Word documents to track incident retros
- Cumbersome manual processes
- Customer frustration
- Attrition influenced by product reliability issues
- Role ambiguity and reliance on a 28x10 RACI* chart
After Blameless
- Single source of truth in Blameless for incident and reliability data
- Reliability is a company-wide initiative
- Customers are informed faster and more often
- Higher NPS scores and positive customer feedback
- Reduced churn by 10%
- Engineers enjoy interacting with the IM process
- 1-2 hours saved per retrospective (postmortem)
- 0 repeat incidents
- Blameless Reliability Insights (e.g. MTTR, MTTD) shared at board meetings
Using Blameless Led to Positive Business Outcomes
A year after implementing Blameless, BetterCloud is already seeing the benefits. First, Blameless automatically collects incident information so that engineers can do their jobs quickly and efficiently. Blameless documents incident reports and retrospectives (postmortems), including relevant chat logs, in a searchable directory that provides holistic context and now serves as their single source of truth.
Flexible reporting language is another feature that has dramatically improved how BetterCloud’s SRE function serves the entire organization. Blameless allows BetterCloud to report on the metrics most useful to each group of stakeholders (e.g., incident response teams, customer success, and executives) and to present the data in a way that makes sense to them. In fact, Stephen created his own custom Reliability Insights dashboard in the Blameless UI and uses it for executive updates.
Using Blameless has ultimately helped BetterCloud feel confident about product reliability and their new incident response process. Implementing Blameless led to more logged incidents, paving the way for more meaningful metrics and ensuring completion of the retrospective (postmortem) report. The platform established a focused method of communication across the organization, because everyone is aligned on a standardized process.
Anecdotally, Stephen says it’s apparent that both their internal and external customers are informed more frequently and more meaningfully during incidents. Based on results from their most recent NPS survey, BetterCloud knows that improvements in customer happiness are tied to improved product reliability and to the company’s new approach and cultural mindset. Engineering teams no longer shy away from declaring incidents. Rather, “they enjoy interacting with the process.” The same goes for business teams such as customer success, marketing, and sales, who keep a pulse on product reliability. With their enterprise-grade tooling, automation, and processes, BetterCloud can now accelerate business growth to the next level of scale.
*A RACI chart is a project management tool used for tracking roles and responsibilities. RACI is an acronym that stands for responsible, accountable, consulted, and informed.