Iterable’s growth marketing platform enables organizations to deliver seamless, personalized customer experiences across channels, including email, SMS, mobile push and more. In less than seven years, the platform has scaled to billions of cross-channel messages sent per month.
The company’s mission, as implied by its name, is to enable companies across the digital maturity spectrum to iterate on messaging that maximizes customer engagement. Scaling its technology as quickly as possible while protecting reliability has been core to Iterable’s explosive growth.
The SRE team at Iterable is focused on optimizing the stability of the platform. One key initiative the team is championing is working in lockstep with the go-to-market team to improve stability and predictability in scaling its database investments. By using projections to right-size its Elasticsearch indexes, the SRE team can reduce the risk of reliability issues, especially for large customers with vast amounts of data (tens of billions of data points for its largest customer).
The Challenge: Shallow Incident Learning and Platform Instability
In 2018, before Blameless, Iterable lacked a defined incident management process. Incidents were created on an ad hoc basis, and postmortem reviews were conducted only for the highest severity issues, largely due to the toil of creating postmortems. The VP of Engineering spent 2-3 hours generating each postmortem timeline. He was also spending a disproportionate amount of time on calls with unhappy customers.
The team recognized the need for automated incident coordination and streamlined learning to prevent repetitive issues. At the time, the team had been considering building similar functionality in-house, but realized the effort would cost multiple months of full-time engineering work, as well as ongoing maintenance. As a result, according to Staff Site Reliability Engineer Tenzin Wangdhen, the investment in Blameless was a “no-brainer.”
Pain Points Before Blameless
- Arduous process (2-3+ hrs) to build postmortem timeline
- Overall platform instability, resulting in time spent appeasing disgruntled customers
- Lack of feedback loop to address incident contributing factors, creating a “treadmill” of burnout
Goals
- Shift to a culture of learning from incidents
- Automate incident coordination and scale response processes
- Avoid building a solution internally
According to Tenzin Wangdhen, Staff SRE, “Blameless is one of those solutions where you forget it’s there because it’s so well-ingrained into our system. It’s really blended into our day-to-day toolkit due to its ease of use and intuitiveness.”
The Solution: Centralized Coordination and Actionable Insights
With Blameless, incident creation and coordination now take place in seconds. One team member contrasted that with previous organizations, where simply creating an incident could take upwards of an hour.
“The improved coordination, follow-up tracking, and visibility help us actually address what caused the incident in the first place, and prevent it from happening again. Through that iterative process of having incidents, learning from them, applying the fixes and rinsing and repeating, we’ve been able to improve the stability of our platform.”
Furthermore, Blameless has helped the Iterable SRE team facilitate a culture of learning. The team conducts two weekly incident meetings, one with Customer Success and one with Engineering, each led by the Incident Commander for the week. The team reviews all the Blameless incidents from the week before, bubbles up questions, and discusses next steps on follow-up items. Blameless’s centralized charting uncovers actionable insights on where to focus the team’s reliability efforts.
“One big issue we were facing was a significant number of change-related incidents. Having numbers through tags and trends to back up the hypothesis that the incidents were being caused by deploys was key to enabling more focused work. The way Blameless helps us embrace the idea of learning from incidents is powerful.”
“The Slack integration and chatbot have been great. Other competitors with a similar solution are not as seamless.”
The Business Impact
In the two years since adopting Blameless, Iterable has seen incident frequency shift from high-severity incidents (Sev0s & 1s) to lower-severity incidents (Sev2s & 3s), which is a positive signal. Incidents used to affect a larger blast radius of customers; as they now trend smaller, impact can be isolated to a smaller subset of users. Lower-severity incidents also make it easier for the team to contain impact and identify proactive opportunities to improve platform stability.
- 43% reduction of Sev1s and Sev0s (over a 6-month timeframe)
- Automated postmortem timeline creation, compared to 2-3 hrs previously
- Flexible reporting, especially around impacted services, impacted customers, and contributing factors
- Fewer repeat incidents
Longer term, the Iterable engineering team is interested in integrating Blameless’s SLO solution to give its customers even more granular visibility into platform reliability. SLOs create a concrete data point to validate when to work on stability instead of new features. The eventual goal is to surface real-time SLO metrics (e.g. success rate for an API endpoint) for each customer.
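To make the per-customer SLO idea concrete, here is a minimal sketch of how a success rate for one API endpoint could be computed per customer and checked against a target. It is illustrative only and does not represent Blameless’s SLO product or Iterable’s implementation. The function names, the sample data, the 99.9% target, and the choice to count only 5xx responses as failures are all assumptions.

```python
from collections import defaultdict

# Hypothetical SLO target (assumption): 99.9% of requests succeed.
SLO_TARGET = 0.999


def success_rate_by_customer(requests):
    """Return {customer_id: fraction of successful requests}.

    `requests` is an iterable of (customer_id, http_status) pairs for a
    single API endpoint. Only 5xx responses count as failures here,
    which is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for customer_id, status in requests:
        totals[customer_id] += 1
        if status < 500:
            successes[customer_id] += 1
    return {c: successes[c] / totals[c] for c in totals}


def meets_slo(rate, target=SLO_TARGET):
    """True when the observed success rate is at or above the target."""
    return rate >= target


# Made-up sample data for two hypothetical customers.
sample = [
    ("acme", 200), ("acme", 200), ("acme", 503),
    ("globex", 200), ("globex", 201),
]
rates = success_rate_by_customer(sample)
for customer, rate in sorted(rates.items()):
    print(customer, round(rate, 4), "OK" if meets_slo(rate) else "BELOW TARGET")
```

A customer whose rate falls below the target is the kind of signal that could justify prioritizing stability work over new features.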
According to Tenzin, there is very strong interest across the company in understanding incident and platform stability trends, reflecting the mission-critical role that embedded and core SREs play in enabling Iterable’s success as a digital-first business.
Ultimately, in partnership with Blameless, Iterable can continue focusing on scaling the platform to serve the world’s leading marketing organizations.
“Blameless is a sticky product, and without it, there would be much more manual work for everyone involved. The SRE team would be more thinly stretched, and we wouldn’t have as much bandwidth to work on key reliability initiatives.”