Eventbrite Mitigates Risk by Improving MTTA by 10X
The Challenge
Eventbrite brings people together through live experiences, allowing them to discover events that match their passions or create their own with online ticketing tools.
Before Blameless, the Site Reliability Engineering team struggled with several challenges such as:
Manual steps to create communication channels, alert key team members, assign roles, create tasks, and track status
Lack of visibility into incidents and status updates
Internal tooling that lacked refinement (because maintaining it detracted from core competencies), regular maintenance, and a customer success team
Lack of an integrated solution to tie together Postmortems, SLOs, Error Budgets, and Reliability Insights for a complete reliability process
John Shuping, Director of Site Reliability Engineering, and his team sought a solution that could go beyond incident management to also cover SLOs and error budgets. They wanted to replace internally built tooling, which took focus away from core competencies, with a modern, repeatable process for orchestrating reliability efforts. Finally, they wanted to eliminate the intensive, tedious manual effort involved in managing incidents and maintaining SLOs and error budgets.
Before Blameless, there was significant toil tied to incidents as well as maintaining SLOs and Error Budgets.
The Solution
Blameless' integrated chatbot, SLOs and Error Budgets, and Reliability Insights features helped the Eventbrite team achieve the following benefits.
Benefits of Blameless
Engage the right people and teams to stop incidents fast, ensuring customer satisfaction
Automatically bring relevant information and context to Blameless Postmortems to learn without pointing fingers, ensuring continuous improvements
Create SLOs and gain insights into Error Budget burndown, providing context to make informed decisions between releasing new features and meeting reliability requirements
Query event data across the entire DevOps stack and create custom dashboards to quickly find signals across the noise
Minimize customer impact and resolve incidents faster by allowing Incident Commanders to orchestrate parallel streams of investigations for complex incidents with Swimlanes for Incident Resolution
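To make the idea of Error Budget burndown concrete, here is a minimal sketch of the underlying arithmetic. It is not Blameless' implementation; the function name `error_budget_report` and the request counts are hypothetical and chosen only for illustration. It assumes a request-based SLO, where the budget is the share of requests allowed to fail in a window.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget usage for a request-based SLO (illustrative only)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of requests that may fail without breaching the SLO.
    allowed_failures = (1 - slo_target) * total_requests
    consumed_fraction = failed_requests / allowed_failures

    return {
        "allowed_failures": allowed_failures,
        "consumed_fraction": consumed_fraction,
        "remaining_fraction": 1 - consumed_fraction,
    }


# Hypothetical numbers: a 99.9% SLO over 1,000,000 requests with 250 failures.
report = error_budget_report(0.999, 1_000_000, 250)
print(f"Budget consumed: {report['consumed_fraction']:.0%}")
print(f"Budget remaining: {report['remaining_fraction']:.0%}")
```

When the remaining fraction approaches zero, a team would typically favor reliability work over new feature releases, which is the trade-off the burndown view is meant to inform.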
Finding a platform that finally went beyond incident management into SLOs and error budgets drove the decision to choose Blameless.
Reliability Toolchain
Here are the tools that Eventbrite relies on to maximize its reliability efforts.
Blameless
Testing frameworks
Monitoring with Datadog
Server redundancy
Orchestrators such as Kubernetes
The Results
With Blameless, Eventbrite saw the following positive business results, helping cross-functional teams improve alignment and effectiveness to deliver great software experiences.
Reduced MTTA and MTTR by as much as 10X (about a 90% reduction)
Quantified the frequency, duration, and severity of incidents
Codified internal processes, shifting focus to building great customer products rather than internal reliability tooling
Provided executive-level reporting that highlights MTTA and MTTR
Drove organization-wide adoption, powering communication within and between engineering, customer service, and IT teams
Before Blameless, it would take 5-10 minutes to get the right people on an incident. Now it's as fast as 1 minute.
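The improvement factor implied by these figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above (5 to 10 minutes before, 1 minute after); the helper name `improvement` is hypothetical.

```python
def improvement(before_minutes: float, after_minutes: float) -> tuple[float, float]:
    """Return (speedup factor, percent reduction) for a time-to-acknowledge change."""
    if before_minutes <= 0 or after_minutes <= 0:
        raise ValueError("durations must be positive")
    factor = before_minutes / after_minutes
    percent_reduction = (before_minutes - after_minutes) * 100 / before_minutes
    return factor, percent_reduction


# Figures quoted in the case study: 5-10 minutes before, 1 minute after.
for before in (5, 10):
    factor, reduction = improvement(before, 1)
    print(f"{before} min -> 1 min: {factor:.0f}X faster, {reduction:.0f}% reduction")
```

Note that a 10X improvement corresponds to a 90% reduction in time, since a duration cannot shrink by more than 100%.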