Customer Story

Iterable sees a 43% reduction in critical incidents with Blameless

Iterable’s growth marketing platform enables organizations to deliver seamless, personalized customer experiences across channels, including email, SMS, mobile push and more. In less than seven years, the platform has scaled to billions of cross-channel messages sent per month.

The company’s mission, as implied by its name, is to enable companies across the digital maturity spectrum to iterate on messaging that maximizes customer engagement. Scaling its technology as quickly as possible while protecting reliability has been core to Iterable’s explosive growth.

The SRE team at Iterable is focused on optimizing the stability of the platform. A key initiative the team is championing, for example, is working in lockstep with the go-to-market team to improve stability and predictability in scaling its database investments. By using projections to right-size its Elasticsearch indexes, the SRE team can reduce the risk of reliability issues, especially for large customers with vast amounts of data (tens of billions of data points for its largest customer).

The Challenge: Shallow Incident Learning and Platform Instability

In 2018, before Blameless, Iterable lacked a defined incident management process. Incidents were created on an ad hoc basis, and postmortem reviews were conducted only for the highest severity issues, largely due to the toil of creating postmortems. The VP of Engineering spent 2-3 hours generating each postmortem timeline. He was also spending a disproportionate amount of time on calls with unhappy customers.

The team recognized the need for automated incident coordination and streamlined learning to prevent repetitive issues. At the time, the team had been considering building similar functionality in-house, but realized the effort would cost multiple months of full-time engineering work, as well as ongoing maintenance. As a result, according to Staff Site Reliability Engineer Tenzin Wangdhen, the investment in Blameless was a “no-brainer.”

Pain Points Before Blameless

Arduous process (2-3+ hrs) to build postmortem timeline
Overall platform instability, resulting in time spent appeasing disgruntled customers
Lack of feedback loop to address incident contributing factors, creating a “treadmill” of burnout

Goals

Shift to a culture of learning from incidents
Automate incident coordination and scale response processes
Avoid building a solution internally

According to Tenzin Wangdhen, Staff SRE, “Blameless is one of those solutions where you forget it’s there because it’s so well-ingrained into our system. It’s really blended into our day-to-day toolkit due to its ease of use and intuitiveness.”

The Solution: Centralized Coordination and Actionable Insights

Since adopting Blameless, incident creation and coordination take place in seconds. One team member contrasted that with previous organizations, where simply creating an incident could take upwards of an hour.

“The improved coordination, follow-up tracking, and visibility help us actually address what caused the incident in the first place, and prevent it from happening again. Through that iterative process of having incidents, learning from them, applying the fixes and rinsing and repeating, we’ve been able to improve the stability of our platform.”

Furthermore, Blameless has helped the Iterable SRE team foster a culture of learning. The team conducts two weekly incident meetings, one with Customer Success and one with Engineering, both led by the Incident Commander for the week. In each meeting, the team reviews all the Blameless incidents from the week before, raises questions, and discusses next steps on follow-up items. Blameless’s centralized charting shows the team where to focus its reliability efforts.

“One big issue we were facing was a significant number of change-related incidents. Having numbers through tags and trends to back up the hypothesis that the incidents were being caused by deploys was key to enabling more focused work. The way Blameless helps us embrace the idea of learning from incidents is powerful.”
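As a rough illustration of how tags and trends can back up a hypothesis like this, the sketch below computes the share of incidents carrying a given tag in each month. It is a minimal example under stated assumptions: the record fields, tag names, and sample data are hypothetical and do not reflect Blameless's export format or Iterable's actual incidents.

```python
from collections import Counter

# Hypothetical incident records; the field names and tag values are
# illustrative assumptions, not a real export schema or real data.
incidents = [
    {"id": 1, "month": "2020-01", "tags": ["deploy", "api"]},
    {"id": 2, "month": "2020-01", "tags": ["database"]},
    {"id": 3, "month": "2020-02", "tags": ["deploy"]},
    {"id": 4, "month": "2020-02", "tags": ["deploy", "database"]},
    {"id": 5, "month": "2020-02", "tags": ["network"]},
]


def tag_share_by_month(records, tag):
    """Return {month: fraction of that month's incidents carrying `tag`}."""
    totals = Counter()
    tagged = Counter()
    for record in records:
        totals[record["month"]] += 1
        if tag in record["tags"]:
            tagged[record["month"]] += 1
    return {month: tagged[month] / totals[month] for month in sorted(totals)}


deploy_share = tag_share_by_month(incidents, "deploy")
for month, share in deploy_share.items():
    print(f"{month}: {share:.0%} of incidents tagged 'deploy'")
```

A rising share for one tag over several months is the kind of signal that could justify focused work on that contributing factor.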

Reliability Toolchain

Blameless
Slack
JIRA
Datadog

“The Slack integration and chatbot have been great. Other competitors with a similar solution are not as seamless.”

The Business Impact

In the two years since adopting Blameless, Iterable has seen incident frequency shift from high-severity incidents (Sev0s & 1s) to lower-severity incidents (Sev2s & 3s), which is a positive signal. Incidents used to affect a larger blast radius of customers; now that they trend smaller, impact can be isolated to a smaller subset of users. Lower-severity incidents also make it easier for the team to contain impact and identify proactive opportunities to improve platform stability.

43% reduction of Sev1s and Sev0s (over a 6-month timeframe)
Automated postmortem timeline creation, compared to 2-3 hrs previously
Flexible reporting, especially around impacted services, impacted customers, and contributing factors
Fewer repeat incidents

What’s Next

Longer term, the Iterable engineering team is interested in integrating Blameless’s SLO solution to give its customers even more granular visibility into platform reliability. SLOs create a concrete data point to validate when to work on stability instead of new features. The eventual goal is to surface real-time SLO metrics (e.g. success rate for an API endpoint) for each customer.
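To make the per-customer SLO idea concrete, here is a minimal sketch that computes a success rate for one API endpoint per customer and compares it with a target. Everything in it is an assumption for illustration: the customer names, the `/api/send` endpoint, the 99% target, and the choice to count any non-5xx response as a success. It is not Blameless's SLO implementation.

```python
from collections import defaultdict

# Hypothetical request log: (customer, endpoint, HTTP status).
# Names and values are made up for illustration only.
requests = [
    ("acme", "/api/send", 200),
    ("acme", "/api/send", 200),
    ("acme", "/api/send", 500),
    ("acme", "/api/send", 200),
    ("globex", "/api/send", 200),
    ("globex", "/api/send", 200),
]

SLO_TARGET = 0.99  # assumed target; a real SLO would be agreed per service


def success_rate_by_customer(log, endpoint):
    """Return {customer: fraction of non-5xx responses} for one endpoint."""
    counts = defaultdict(lambda: [0, 0])  # customer -> [successes, total]
    for customer, path, status in log:
        if path != endpoint:
            continue
        counts[customer][1] += 1
        if status < 500:
            counts[customer][0] += 1
    return {c: ok / total for c, (ok, total) in counts.items()}


rates = success_rate_by_customer(requests, "/api/send")
for customer, rate in sorted(rates.items()):
    verdict = "meets" if rate >= SLO_TARGET else "misses"
    print(f"{customer}: {rate:.1%} success, {verdict} the {SLO_TARGET:.0%} target")
```

A customer whose rate falls below the target is the sort of concrete data point that could tip the decision toward stability work over new features.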

According to Tenzin, there is very strong interest across the company in understanding incident and platform stability trends, reflecting the mission-critical role that embedded and core SREs play in enabling Iterable’s success as a digital-first business.

Ultimately, in partnership with Blameless, Iterable can continue focusing on scaling the platform to serve the world’s leading marketing organizations.

“Blameless is a sticky product, and without it, there would be much more manual work for everyone involved. The SRE team would be more thinly stretched, and we wouldn’t have as much bandwidth to work on key reliability initiatives.”
