Customer Story

RevOps Says Reliability Is Critical to Their Success and Blameless Helps Them Scale Quickly

RevOps was founded in 2018 by pricing nerds Adam Ballai and John Solis from Twilio, Stripe, and PullString (acquired by Apple), with the mission of enabling businesses to build a scalable Deal Desk operation.

The small, distributed engineering team at RevOps is divided up by service ownership. Each team comprises full-stack engineers, with infrastructure operations as a shared responsibility. Not all engineers are on call, and customer support is the first line of defense for issues. An incident investigation is triggered by an automated alert, and any required escalation is handled through PagerDuty.

The Challenge

The engineering team was growing rapidly, and the customer base was expanding globally. At this stage, the company needed a scalable, data-driven process for improving reliability across engineering, product, and support teams.

At RevOps, there’s no dedicated DevOps function, so each team oversees its own service area and that area’s performance health. Incident severity is determined by the degree of customer impact: if multiple customers are impacted by any type of incident, it’s deemed Sev1. Because the team is distributed, they needed a solution that could integrate seamlessly with Slack.
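The severity rule described above can be written down as a small check. This is only an illustrative sketch: the function name `is_sev1` and its parameter are hypothetical, "multiple" is assumed to mean more than one customer, and the story does not say how lower severities are assigned, so the sketch leaves them out.

```python
def is_sev1(impacted_customers: int) -> bool:
    """Return True when an incident meets the Sev1 rule from the story.

    Assumption: "multiple customers" means more than one. How Sev2 and
    lower levels are decided is not described, so it is not modeled here.
    """
    if impacted_customers < 0:
        raise ValueError("impacted_customers must be non-negative")
    return impacted_customers > 1


# Hypothetical usage: three customers report the same failure.
print(is_sev1(3))
```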

“Technology integration is a key part of our decision to purchase Blameless. Rather than trying to manage incidents on our own across disparate tools, we now have automated steps. Pulling the right data together for reporting is also easier.”

Reliability Insights Allow RevOps to Double Click on Incident Data

When the RevOps team first started using Blameless, their incident response process wasn’t fully built out yet. Using Blameless for incident management has allowed them to establish a reliability process they can follow consistently enough to be valuable. 

John appreciates the SRE values Blameless promotes, saying it’s important to encourage a “blameless mindset” across the entire company. During incident response and retrospective analysis, his team tries to be blameless. “We don’t want to discourage people when it comes to incidents,” he explained. In fact, RevOps engineers don’t assign each other “tickets” — at least, that’s not what they’re called. They use a much more encouraging term, “betterments”.

Of course, it helps to have a clear process, coupled with an automated tool like Blameless. John described how much hidden value lies in simply having clear role assignments and follow-up tasks. He adds, “We can now answer important questions such as ‘What are the process steps for Sev1 vs. Sev2?’ or ‘Who should be commander for this incident type and service area?’”

RevOps embraced a strategic approach to incident management by starting to analyze reliability data. John revealed that they use Blameless Reliability Insights to track exactly where they tend to find out about incidents. They want to know how often incidents are discovered internally vs. when a customer raises a flag. Naturally, incidents aren’t great news, but John celebrates when the RevOps team discovers incidents before their customers.
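To make the internal-vs-customer comparison concrete, here is a minimal sketch of how such a breakdown could be computed from a list of incidents. It does not use or represent the Blameless API; the labels `"internal"` and `"customer"`, the function name `detection_shares`, and the sample data are all hypothetical.

```python
from collections import Counter


def detection_shares(sources):
    """Return the fraction of incidents per detection source.

    `sources` holds one label per incident. The labels are illustrative
    assumptions, not values taken from any real tool.
    """
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: count / total for source, count in counts.items()}


# Made-up sample: three incidents caught internally, one raised by a customer.
sample = ["internal", "internal", "customer", "internal"]
shares = detection_shares(sample)
print(shares)
```

A rising internal share over time would match the outcome John celebrates: the team finding incidents before customers do.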

The engineers review Reliability Insights bi-weekly and rely on the data to help them learn and improve. When asked which reports they look at, they named mean time to resolve and mean time to respond as the most common. John says Blameless Reliability Insights helps his team double click on incidents and practice the philosophy of the “five whys”. They can get very granular with the data to reveal areas for improvement.
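For readers unfamiliar with the two reports named above, the sketch below shows one common way to compute them. It is an assumption-laden illustration, not how Blameless calculates its reports: the `Incident` fields, the choice to measure both durations from the detection time, and the sample timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    detected_at: datetime   # when the alert fired or the issue was reported
    responded_at: datetime  # when someone started working on it
    resolved_at: datetime   # when the incident was closed


def mean_time_to_respond(incidents) -> timedelta:
    """Average of (responded_at - detected_at); needs at least one incident."""
    seconds = [(i.responded_at - i.detected_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_resolve(incidents) -> timedelta:
    """Average of (resolved_at - detected_at); needs at least one incident."""
    seconds = [(i.resolved_at - i.detected_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2023, 1, 9, 10, 0), datetime(2023, 1, 9, 10, 5), datetime(2023, 1, 9, 11, 0)),
    Incident(datetime(2023, 1, 12, 14, 0), datetime(2023, 1, 12, 14, 15), datetime(2023, 1, 12, 14, 30)),
]
print(mean_time_to_respond(incidents))
print(mean_time_to_resolve(incidents))
```

Grouping the same calculation by service area or severity is one way a team could get the granular view the story describes.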

RevOps Carries Out Effective Knowledge Transfer with Blameless

It’s been much easier for RevOps to scale their incident management, because using Blameless makes the retrospective process simple. They save time collecting incident metadata, because Blameless automatically collects it in a timeline. It’s also easier to locate retrospectives and host meetings where veteran engineers can train and educate new team members. Using Blameless is a huge opportunity for knowledge transfer.

Reliability is a critical part of the company’s success, which is why teams beyond engineering are involved in managing incidents and tracking reliability growth.

“Blameless really helped us scale. A dedicated incident management tool, when everyone knows the process, is key. Since we’re 100% remote, we needed a Slackbot integration. It’s where most people are working day to day.”