No, it won’t be possible to continue operating business as usual. For the foreseeable future, teams across the world will be dealing with cutbacks, infrastructure instability, and more. However, with SRE best practices, your team can embrace resilience and adapt through this difficult time.
Embrace resilience with incident management procedures
One way to think of these difficult circumstances is to envision them as an incident. Incidents are forms of unplanned work, and crises certainly fall under that category. To deal effectively with incidents within your organization, you likely refer to runbooks to guide you when you’re under pressure. These key components of good incident response are just as applicable when dealing with uncertain and difficult circumstances.
While no runbook is ready-made for this kind of crisis, the idea of a runbook is still applicable. You’ll still need accurate organization charts in order to know who works in what department, how to contact them, and how to escalate issues when necessary. You still need playbooks to execute on day-to-day issues. However, you’ll need to adjust these to better reflect the current reality.
Your new runbooks will likely need to revolve around working from home. Key information will include meeting protocols, agreed-upon hours or results required per day or week, and how to communicate with your team. This will require flexibility. Some people will prefer standard email; others will want to be contacted via Slack. You’ll also need to determine which discussions require calls, and who needs to be notified when impromptu meetings are called.
In addition to creating new runbooks, you’ll also need to review how you deal with illness and family emergencies during this time. Capacity will need to increase, not only for servers and storage but also for headcount. It will be important to plan for how your team will function when members fall ill or need to care for family and friends.
Revise your on-call schedules qualitatively
On-call will need to be reworked as well. With the increased strain on your infrastructure, incidents could spike. If some engineers happen to be on call during the times of highest usage, they could end up bearing the brunt of the issues. This mental strain could lead to burnout, and potentially to weakened immune systems due to stress. Instead of tracking only the time a person spends on call, begin a qualitative analysis. If someone spends a day on call and is paged only once, it might seem like they could help load balance against someone who was on call for an entire weekend and was paged three times. However, if the three outages lasted only one hour apiece and the single outage lasted 16 hours, the engineer who was paged once will require more rest.
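To make this comparison concrete, here is a minimal sketch in Python of ranking on-call load by time spent actively handling incidents rather than by page count. The `OnCallShift` structure, the engineer names, and the 48-hour weekend are assumptions made for illustration. Active hours are only one possible proxy for strain, and a real analysis would likely also weigh time of day and severity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OnCallShift:
    """One engineer's on-call shift (hypothetical structure for illustration)."""
    engineer: str
    shift_hours: float            # total time spent on call
    incident_hours: List[float]   # duration of each incident handled, in hours

    @property
    def pages(self) -> int:
        # Number of incidents the engineer was paged for.
        return len(self.incident_hours)

    @property
    def active_hours(self) -> float:
        # Total time spent actively working incidents.
        return sum(self.incident_hours)


# Example from the text: one day with a single 16-hour outage versus
# a weekend (assumed here to be 48 hours) with three one-hour outages.
shifts = [
    OnCallShift("engineer_a", shift_hours=24, incident_hours=[16.0]),
    OnCallShift("engineer_b", shift_hours=48, incident_hours=[1.0, 1.0, 1.0]),
]

most_paged = max(shifts, key=lambda s: s.pages)
most_loaded = max(shifts, key=lambda s: s.active_hours)

print(f"Most pages: {most_paged.engineer} ({most_paged.pages})")
print(f"Most active incident time: {most_loaded.engineer} ({most_loaded.active_hours} h)")
```

Counting pages alone points at the weekend shift, while counting active incident hours points at the engineer who handled the 16-hour outage. That gap is why the paragraph above argues for looking past time on call.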
Planning for times of uncertainty and staying flexible as you revise those plans can greatly improve your business continuity. If you need a little help getting organized, HubSpot has a business continuity template that walks you through five crucial steps to help your organization embrace resilience.
Write postmortems to understand recurring issues
Outages over the last few days have been unprecedented as companies struggle under increased demand. With the switch to WFH, even Microsoft Teams had an outage lasting two hours as workers and students began logging on for the day. Gaming, stock trading sites, and corporate VPNs are of significant concern with the influx of daily users, and incidents are cropping up at an alarming rate.
In fact, all services (from internet providers to grocery stores and health facilities) are getting stretched to capacity, and can’t afford to keep making the same mistakes. With the increased volume of incidents, it would be easy to simply skip over the postmortem. However, this is one of the worst pitfalls of firefighting. By skipping the postmortem, you lose the opportunity to learn from incidents and prevent them from occurring again. Crises don’t have a set end-date. If you don’t begin working down same-class issues soon, eventually you will be overwhelmed.
By writing postmortems and working your way through a root cause analysis, you’ll be able to identify two major ways to speed up your processes:
- Identify bottlenecks. Is there a recurring point where service improvements or incident resolution stall? Bottlenecks can be people or processes, and it’s important to know which one you’re dealing with. For example, in Gene Kim’s “The Phoenix Project,” Brent was a huge bottleneck. As a gifted engineer who dabbled in all aspects of the service, Brent was a constant go-to for any issue. This meant all his time was being stolen by unplanned work and undocumented requests, overloading him and slowing down system-wide improvements. In situations like these, it’s important to make sure engineers feel empowered to say no, focus on project work, and get some quality heads-down time. If the bottleneck is a process, you’ll need to review your workflows for that particular process. While this sort of work is less visible, it’s important to efficiency and innovation. Once bottlenecks are removed, you’ll be able to improve your service and resolve incidents faster, so it’s well worth calling a meeting to work through them. And you’ll need postmortems to refer to in order to make these informed decisions.
- Automate toil. Writing postmortems can also help you understand where you’re losing time to toil in your incident resolution process. For example, for an outage lasting 15 minutes, if 5 minutes are spent simply getting all participants filled in on the issue, about 33% of your time to resolve is toil. You could automate part of your incident response so that a communication hub is created for each incident and fills others in on the details. Additionally, how much time do you spend writing postmortems? Do you spend hours searching for disparate information to include in your timeline? This is toil as well. By using a tool to aggregate key data for you, you and your teammates are free to do the important part: learning.
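The toil arithmetic above can be sketched in a few lines of Python. This is only an illustration: the timeline format, the phase names, and the `toil_share` helper are hypothetical, and deciding which phases count as toil remains a judgment call for your postmortem.

```python
from typing import List, Tuple

# (description, minutes, is_toil): a hypothetical incident timeline format.
Phase = Tuple[str, float, bool]


def toil_share(phases: List[Phase]) -> float:
    """Return the fraction of total incident time that was marked as toil."""
    total = sum(minutes for _, minutes, _ in phases)
    if total <= 0:
        raise ValueError("timeline must contain a positive amount of time")
    toil = sum(minutes for _, minutes, is_toil in phases if is_toil)
    return toil / total


# The 15-minute outage from the text: 5 minutes spent briefing participants,
# with the remaining 10 minutes assumed to be diagnosis and resolution.
timeline = [
    ("briefing participants", 5, True),
    ("diagnosing and resolving", 10, False),
]

share = toil_share(timeline)
print(f"{share:.0%} of resolution time was toil")
```

Tracking this fraction across several postmortems can show whether automation is actually reducing the time lost to toil.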
Learn continuously to adapt
Embracing resilience also requires flexibility of thinking and learning. If you let key learning opportunities pass by, you’ll miss the chance to build that flexibility. This adaptation is crucial during times of crisis and uncertainty. Business can’t simply proceed as it used to. We need to iterate on our processes, behaviors, and mindsets in order to thrive during these difficult times.
The first step in flexibility is a mindset change. You’ll need to learn to be patient with others. Many of your coworkers are now working from home. This means there are pets, partners, and children to deal with. This isn’t an ideal working situation for most. It’s high-stress and distracting. Your team member who used to reply to your Slack messages in two minutes now might need 15 to 20 minutes before responding. And that’s okay. Meetings might be a little tougher with busy households, and that’s okay. Productivity might dip while people learn how to operate in this new normal, and that’s okay. We must be patient with each other and ourselves while we adapt.
You’ll also need to consider how to be flexible in creating new team dynamics. In the office, you know how your teammates take their coffee and what they did last weekend because you have a communal break room that allows for this level of connection. Without that, how will you keep your team talking? Fun Slack groups, team water coolers via Zoom, and virtual game nights all become important here, not only because they keep you feeling like part of the same team, but because camaraderie matters so much in this time of social distancing. Human connection keeps us motivated. Knowing that someone else counts on us can keep us working even when we feel overwhelmed.
Lastly, you’ll also need to be flexible in your learning resources. Cancelled conferences, a lack of internal continuing education, and classes either postponed or moved online mean you might be suffering from a knowledge drought. It’s more important than ever to find safe, healthy ways to learn and interact with the community. This could be attending virtual conferences, weighing in on live panels, or reading industry news. Some of our favorite resources are:
- SRE Weekly
- Lessons learned from Resilience Engineering and Community Resilience
- Gremlin’s Failover Conf
- Resilience Roundup
- Increment, and more
And if you’re interested in attending any of our virtual events, make sure you check these out: