Zoopla is a residential property website and app based in the UK. With over a million properties to browse through, Zoopla helps potential buyers and renters find their dream homes. Beyond making sure to supply users with home details, their ultimate goal is to help renters and buyers feel confident about the home they choose. In their words, “We know that searching for a home is about more than just checking its price, location, and features (important as all those things are). What really matters is how it makes you feel.”
Zoopla is reimagining the property industry. Buying or selling a home can be extremely stressful, and Zoopla’s aim is to eliminate the stress and create a more delightful experience. To achieve this, the team is laser-focused on simplifying the product and providing users with as much data as possible, so that they can make informed decisions.
Across the business, Zoopla teams aim to be innovative while also remaining agile. For the SRE team, that means shipping new features to users in increments. It’s important for the SRE team to mitigate business risk and recover from failures as quickly as possible. With that in mind, they committed to making data-driven decisions and tracking key metrics that gauge performance. Will the SRE team at Zoopla pioneer a new era of reliability in the residential property space?
The Challenge: Lack of Incident Response Process and Data
Prior to adopting Blameless, the team at Zoopla experienced a high number of incidents. At the time, they coordinated incident response in a single document. It was difficult to replicate a standard process and train new team members. They needed to structure their incident response as quickly as possible, with minimal overhead for training and adoption.
Key Pain Points
- No codified incident response process, difficulty with training and on-call
- Toilsome process responding to incidents and gathering information for the retrospective
- Data was difficult to aggregate or unavailable
Key Goals
- Codify the incident response process including roles and responsibilities
- Reduce the toil of incidents through ChatOps and automation
- Easily collect and present data in order to make the right decisions
- Partner with a SaaS provider to reduce heavy lifting and focus on maturing the incident management process
“Blameless really helped us automate and simplify a 13-page incident response document into something that anyone can interact with through the Slackbot,” states Abz Mungul, Head of Site Reliability Engineering at Zoopla.
The Solution: ChatOps-based Incident Response and Detailed Insights
Now, the team collaborates and resolves incidents seamlessly thanks to Blameless’s Slack integration. It’s easy for other team members to jump into a dedicated channel and be involved in incidents. In fact, members of the product team are active incident participants, often serving the role of scribe. This transparency has promoted cross-team collaboration for Zoopla.
“Before Blameless, I don't think that there were many product owners getting involved in incidents. We now have incidents where product owners are actively getting involved and adopting a role. And a lot of product owners are attending the postmortems as well. They now understand the context and the value of fixing those problems.”*
Additionally, the team is now able to use Blameless Reliability Insights to glean more information about their system. Previously, data was scattered across various teams and tools, making it difficult to consolidate. Now, all this information lives within the Blameless product.
“Using Blameless, we've now got data that we can start using and understanding, including how much an incident is impacting developer throughput, as well as impact to the business. This was something that we previously didn't have, or if we wanted to get this data, it would take quite a long time.”*
- Google Hangouts
- New Relic
“Blameless’s Slackbot was one of the things that we really focused on and built our incident management process around. Slack was the difference between — in the global pandemic — success and failure.”*
The Business Impact
With Blameless, the team at Zoopla was able to codify their incident response process. What started as a 13-page document turned into a scalable and repeatable process that made on-call and onboarding new employees easier. Additionally, the team now has the data they need to make decisions and report to internal stakeholders. Last but not least, they’ve been able to automate key processes in their workflows, reducing time spent creating postmortems.
- The team practices an incident management process that’s scalable and easy to follow
- Time to build a postmortem has decreased from days to hours for simple incidents, and from weeks to days for complex ones
- Data aggregation that would take days now only takes a few hours
Abz Mungul explains why adopting Blameless was an easy decision: “We're partnering with Blameless as the experts to help first mitigate the business risk and then help change and influence our culture towards being more proactive with incidents to deliver faster, with higher quality, and recover from failure quickly.”
Zoopla’s reliability journey doesn’t end with incident response and data clarity for incident metrics. The team has big plans for the future, and Blameless is here to help. The team is interested in making use of their retrospectives to better prioritize action items. Additionally, they are working toward establishing SLOs to deepen their understanding of their overall reliability.
Prioritizing action items can be difficult, as they often require tradeoffs with feature work. However, by conducting thorough retrospectives and talking with product teams, they’re looking forward to being able to better plan improvements into future sprints. Product’s involvement in incidents strengthens this work and builds empathy for the on-call engineers who work through these issues.
Adopting SLOs is also an important next step. This will help with the goal of becoming a more data-driven organization. Understanding what Zoopla’s customers experience will help the team innovate and decide what to prioritize. These improvements allow the team to continue to disrupt the property industry and focus on the work that matters most to their users.
“Blameless has helped us simplify the process and our partnership has helped us move towards continuous improvement. [It] has changed our culture toward being more reliable, and encouraging involvement and collaboration around failures.”*
*Abdurrahman Mungul, Head of Site Reliability Engineering