Onboarding a new tool can be boring. Or stressful. Or both. When onboarding an incident response tool, it can be difficult to make sure that your team is getting the most from the experience. Do you opt for a run-of-the-mill meeting, or try to learn while in an incident? Neither option is ideal.
That’s why Petal’s DevOps Engineer Michael Cole found a new way to get his team using Blameless for their incident response process. Michael channeled his love for board games to create a unique experience for his team. In this blog, we’ll share how he went about this, and how your team can do the same.
Petal offers a simple, modern digital experience that encourages members to build credit, avoid debt, and spend responsibly. When the team started using Blameless, engineers didn't feel confident in using the tool right away to respond to incidents. As Michael noted, “They felt that they just wanted to solve the problem and not necessarily follow the instructions or best practices.”
Breaking old habits is hard, and adopting best practices takes time and effort. When in the midst of an incident, it can seem easier to do things the way they’ve always been done. The team at Petal wanted to get people used to Blameless. This meant getting comfortable with the tool without the pressure of having to fix something currently failing in the system at the same time.
Michael had an idea for how the team could accomplish this. Michael is a board game fanatic. He owns 78 different board games, including Gloomhaven, which weighs 20 pounds! Rather than using Gloomhaven for onboarding, Michael picked something classic and a little more lightweight. His choice was Sherlock Holmes: Consulting Detective. In this game, you:
Figuring out who committed a fictional murder has a lot in common with figuring out why a server crashed. Many of the same investigative skills one would use to solve a Sherlockian crime are useful when responding to an incident.
Blameless allows individual contributors (ICs) to add and consolidate key information. When investigating this board game crime, the team operates much like it would during an incident. ICs hold crucial information that others might not have about the state of the system (or investigation). Without Blameless, this information might be lost in DMs or logs without being bubbled up.
Locations, newspapers, and interviews all hold important details for the team solving the murder mystery. Blameless becomes communication central, just as it would in a real incident. But, how does this actually work in practice? Michael walked us through how he ran the entire game in Blameless.
Here’s how Sherlock Holmes: Consulting Detective works. You have an informant’s page, which has a large list of people you can talk to. For example, you can speak with Sir Jasper Meeks, who is the head medical examiner at Saint Bartholomew's hospital. You visit the location listed, open the scenario booklet, and read aloud what happened to the rest of the team. A Sherlock imitation is strongly encouraged.
You’ll receive clues from the interaction. Some of the information may not be relevant, but there is likely to be something of use. For instance, in the interaction with Sir Jasper Meeks, you learn two things: the victim was shot at close range, and the murder occurred between 3:00 AM and 9:00 AM.
Michael notes how consolidating information like this in a real incident is important. “During postmortems, it's better to have the condensed information. It's good having all the logs. But when you try to solve the problem, you want the distilled version.”
These distilled clues are what teams need to resolve the issue. It all hinges on communication. Murder mystery or production incident, talking to one another is the most important indicator of success. So, how do you set up this game within the Blameless platform? Here’s how the team at Petal did it.
When you're looking at your Blameless console, click the gear icon and set an incident type. For this, you’ll want to select “Add a Type.” This prevents you from accidentally creating a default incident and skewing your metrics in the reliability insights tool. Petal named this incident type “Sherlock.”
Next, Michael assigned roles. For this incident type, they had a commander and a communications lead. Additionally, Michael created tasks associated with the gameplay. The investigation phase is the first phase of an incident, and these are the tasks that the team needed to complete.
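To make the setup concrete, here is a minimal sketch of the idea in Python. It is not the Blameless API or Petal's actual configuration: the `IncidentType` class, the `counts_toward_metrics` flag, the `reportable_incidents` helper, and the task names are all hypothetical, chosen only to illustrate why a separate “Sherlock” type keeps a training game out of your reliability numbers.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentType:
    """Hypothetical model of an incident type; not a Blameless object."""
    name: str
    roles: list[str]
    investigation_tasks: list[str] = field(default_factory=list)
    counts_toward_metrics: bool = True


# A dedicated training type, mirroring the roles described above.
# The task names are illustrative examples based on the game.
sherlock = IncidentType(
    name="Sherlock",
    roles=["Commander", "Communications Lead"],
    investigation_tasks=[
        "Interview Sir Jasper Meeks",
        "Check the newspaper for leads",
    ],
    counts_toward_metrics=False,
)


def reportable_incidents(incidents):
    """Keep only (title, incident_type) pairs that should count in metrics."""
    return [
        (title, incident_type)
        for title, incident_type in incidents
        if incident_type.counts_toward_metrics
    ]
```

Because the game runs under its own type, anything that reports on real incidents can filter it out instead of treating it as a production failure.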
Before setting everyone on their individual tasks, Michael pulled the team into the Zoom instance spun up in Blameless. Once in the room, the communications lead read aloud the opening paragraph of the mystery to set the tone and get things started. Then, Michael showed the team how to complete tasks within Blameless as they went along, and how to create new tasks for themselves as needed. Michael also prepared the team with a packet of information. This contained:
Then, before Michael sent everyone on their investigations, he told them to set the status to monitoring when they felt confident that they knew who did it.
As the team investigated, the infrastructure team was also in charge of adding some pressure. One of the best practices Petal follows is that individual contributors, as they're working on a problem, should redirect everybody to talk to the communications lead for updates. So, as the infrastructure team started asking ICs questions, the team felt a little of the pressure of a real incident. Each IC had to remind the infrastructure team to direct questions to the communications lead, reinforcing the practice.
Once the team resolved the incident, Michael still had a trick up his sleeve. To complete the mystery and answer the bonus questions, the team needed to create an incident retrospective. To achieve this, the team needed to make sure that all the key information they added to Blameless was carried over in the retrospective with the :point_up: command. Once the retrospective was filled out, the team was able to complete the incident and solve the mystery!
Of course, getting to play a board game at work is tons of fun. But, the benefits of playing Sherlock Holmes: Consulting Detective extended beyond solving a fictional murder. The team members at Petal got to bond with one another, practice incident response, and learn how to leverage Blameless in the best way possible.
As Michael noted, “The importance of what we did here was not necessarily to teach people how to do incident management. The goal was to get people used to running the tool and interacting with Blameless in a less stressful environment.”
And this worked! The team saw a significant increase in Blameless usage after this training, not only within the infrastructure team, but within engineering as a whole. Now, the team has seen an increase in incidents created and retrospectives completed, and has stronger communication during incidents.
While playing games at work is rarely encouraged, this experience taught the team how to use Blameless and helped them have fun while doing it.