On January 4th, 2021, the communication service Slack suffered a major outage. Teams working remotely found their primary communication method unavailable. The incident lasted over four hours, during which some customers had intermittent or delayed service and others had no service at all. It was a reminder that even the most established tools are susceptible to downtime, and it underscored a core lesson of SRE: failure is inevitable.
SRE also teaches us the importance of planning for failure and reacting to it as resiliently as possible. Failures are not limited to our own systems, either. With the rise of microservices, our systems rely on a complex ecosystem that involves many third-party vendors and tools. Some of the most important tools teams use daily are those for collaboration. So how can we plan for failure in systems outside of our control?
In this blog post, we'll cover what happens when a communication tool goes down and how to prepare for it: minimizing single points of failure, publishing backup plans, and testing those plans with chaos engineering.
When your communication tool goes down, it can feel like the sky is falling. Whether you're on Slack, Teams, Discord, Skype, or Zoom, a lapse in communication tooling limits productivity, and an outage of this nature also has consequences that are easy to overlook.
Third-party tools are outside of your control, so it is essential to prepare for an eventual failure in order to minimize these issues.
While you can never be fully prepared for an outage, some strategies make it easier to respond to a lapse in communication. You can work with your team to minimize single points of failure (SPOFs). You can form backup plans and publish them. And you can create chaos engineering experiments that simulate a communication tool outage. Let's look at each of these methods.
A single point of failure (SPOF) is a single person, resource, or tool that, if unavailable, would cause a failure in the system. In the case of a third-party communication tooling outage, SPOFs could include access to an SME (subject matter expert).
SMEs can be extraordinary sources of knowledge. However, if these people are the only ones who know certain details about a system, or are the sole people on the team capable of resolving incidents, they're points of failure. To minimize this, you can try the Wheel of Staycation, a technique used by teams at Google: each teammate spends a period of time on project work, totally uninterrupted, while the rest of the team must function without that person's help.
If the team members reach a point where they cannot resolve an issue or complete a task without the SME’s help, they note it. The team should address these knowledge gaps later. The SME who holds the knowledge can create documentation, runbooks, or conduct training to ensure that the whole team is on the same page. In the event of a communication tooling outage, this helps teams function when access to SMEs can be difficult.
Runbooks are key to crafting resilient backup plans in the event of a communication tooling outage. Runbooks are documents that guide you through a situation with step-by-step instructions and checks. In SRE, runbooks are built and automated to respond to internal incidents, but they can also be applied to third-party outages. Your steps might cover confirming the outage, announcing a pre-agreed fallback channel, moving active discussions there, and returning to the primary tool once it recovers.
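A fallback runbook like this can be sketched as a simple, executable checklist. The step names below are illustrative assumptions for the sketch, not a prescribed sequence:

```python
# Hypothetical fallback runbook for a communication tool outage.
# The step names are illustrative assumptions, not an official sequence.
FALLBACK_RUNBOOK = [
    "Confirm the outage on the provider's status page",
    "Announce the pre-agreed fallback channel (e.g. email or a phone bridge)",
    "Move any active incident discussion to the fallback channel",
    "Monitor the provider's status page for recovery",
    "Return to the primary tool and note lessons learned",
]

def execute_runbook(steps, announce=print):
    """Walk through each step in order, announcing it so progress is visible."""
    for number, step in enumerate(steps, start=1):
        announce(f"Step {number}: {step}")

execute_runbook(FALLBACK_RUNBOOK)
```

Keeping the runbook this simple matters: it has to be usable when the tool you'd normally coordinate in is the thing that's down.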
Once you have a plan in place, it’s time to test how effective your plan is.
Chaos engineering is a technique that allows teams to simulate failure and look for ways to react to it and prevent it in the future. It helps teams answer those “what if” questions about their systems. Chaos engineering can also help teams validate their backup communication plans in the event of a tooling outage.
For instance, a team wanting to test how well they can respond to a Slack outage might create an experiment where they cannot use Slack for a full business day. At this point, the team would fall back to the runbook covering this situation and execute it. If there are any communication breakdowns, the team can note them and suggest ways to adjust the runbook in the future.
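In code-level terms, an experiment like this amounts to fault injection: wrap the chat client so that every call fails while the experiment is active, and verify the fallback path actually works. This is a minimal sketch; none of the class or method names below come from a real Slack SDK, they are assumptions for illustration:

```python
# Hypothetical chaos experiment: inject a simulated outage into the chat
# client so the team's fallback path gets exercised. All names here are
# illustrative assumptions, not a real Slack SDK.
class ChatToolUnavailable(Exception):
    """Raised when the (simulated) communication tool is down."""

class FaultInjectingClient:
    """Wraps a real chat client; while the experiment runs, every send fails."""
    def __init__(self, real_client, outage_active=False):
        self.real_client = real_client
        self.outage_active = outage_active

    def send(self, channel, text):
        if self.outage_active:
            raise ChatToolUnavailable("simulated outage for chaos experiment")
        return self.real_client.send(channel, text)

def notify(client, fallback, channel, text):
    """Try the primary tool first; on failure, use the fallback (e.g. email)."""
    try:
        return client.send(channel, text)
    except ChatToolUnavailable:
        return fallback(channel, text)
```

Running critical notifications through a wrapper like this during a gameday surfaces every code path (and every human habit) that silently assumes the primary tool is always up.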
Some services may not need to deeply analyze how a communication outage will affect their overall reliability plan. However, for services that are dependent on communication tooling (e.g., virtual-event hosting platforms reliant on video conferencing providers), you'll need to account for potential failure.
Make sure you don't plan to be more reliable than a third-party tool you depend on. Look at the service level agreement (SLA) you have with your tool provider: for any of your services that require that tool, you can't promise higher availability than the provider does. If your provider's SLA states it can provide 99.95% availability, you can't promise that your service will be available 99.99% of the time.
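This arithmetic can be checked directly. The sketch below assumes dependency failures are independent (a simplification), so the upper bound on what you can promise is the product of your own availability and each dependency's SLA; the function name is an illustration, not a standard library:

```python
from math import prod  # Python 3.8+

def composite_availability(dependency_slas, own_availability=1.0):
    """Upper bound on the availability you can promise, assuming
    independent failures: your own availability multiplied by the
    SLA of every dependency you rely on."""
    return own_availability * prod(dependency_slas)

# Even a service whose own code never fails can't honestly promise
# four nines (99.99%) on top of a 99.95% provider SLA:
bound = composite_availability([0.9995])
assert bound < 0.9999
```

Note that each additional dependency lowers the bound further: depending on a 99.95% chat provider and a 99.9% video provider together caps you below 99.9%.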
Blameless can help teams work through a communication lapse and provide a way for teammates to connect during the incident.
To see other ways Blameless can help improve your reliability, check out a demo.
If you enjoyed this blog post, check out these resources: