
How to Scale for Reliability and Trust

Blameless Community
Reliability & Availability

As more people depend on your product, reliability expectations tend to grow. For a service to continue succeeding, it has to be one customers can rely upon. At the same time, as you bring on more customers, the technical demands put on your service increase as well.

Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency. It isn’t a problem that you can solve by throwing resources at it. Your organization will have to adapt its way of thinking and prioritization. In this blog post, we’ll look at how to:

  • Design services that can remain reliable while scaling
  • Balance reliability and development velocity
  • Respond to incidents using best practices
  • Build trust when incidents occur through good communication

Designing services that stay reliable while scaling

Scaling while maintaining reliability starts with the design of the service. Set your scaling goals when you first lay out design specifications. As development progresses, keep track of how each change affects scalability and reliability. It can help to involve SREs in the design process as reliability consultants.

In a discussion with Blameless, SRE Kelly Dodd of Greenlight Financial shared four main initiatives her team uses to meet scaling demands reliably:

  1. Platform stability: Make sure that all services are running in similar environments. If you're using Kubernetes, everything's in Kubernetes. Terraform as much as you can or codify in another preferred way; the important thing is to prioritize consistency. This uniformity helps de-risk the process around spinning up new environments.
  2. Fast and small releases: Invest in automated testing, as it is crucial for enabling continuous delivery. You can't manually test if you're shipping out every change to prod.
  3. Observability: Implement distributed tracing and other observability efforts, then check whether they are working by asking novel questions of your system and analyzing the accuracy of the results.
  4. Service ownership: Plug all this into on-call. When something breaks, make sure that someone who understands the product can tackle the problem — ideally the person who built the service.

Not all teams are able to build services from the ground up with these considerations in mind. Many teams need to adjust previously built systems for reliability and scalability. Making changes to a service in later stages can be difficult, but either way, these tactics can help.

Balancing reliability and development velocity

The most important lesson of SRE is that failure is inevitable. You cannot design a perfectly reliable service, even without the pressures of scaling. Aiming for 100% uptime is a futile goal. Moreover, it can unnecessarily slow development velocity. As you scale, you’ll need to deploy quickly to keep up with demand. Finding the optimal tradeoff between velocity and reliability is key. One way to do this is by determining SLIs, SLOs, and error budgets for your services.

First, determine an acceptable level of reliability for your service. This is your service level objective, or SLO. It is based on a service level indicator, or SLI. An SLI is built on the metrics that impact customer happiness the most at different points of a user journey. The SLO marks the point where the customer becomes dissatisfied due to unreliability. As you scale, your SLO for particular services could increase, because customers will expect the services they rely on most to be more reliable.
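As a rough illustration of how an SLI relates to an SLO, the sketch below treats the SLI as the fraction of "good" requests and compares it against a target. The service name, the request counts, and the 99.9% target are all hypothetical, and treating a window with no traffic as compliant is an assumption rather than a rule.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    name: str
    target: float  # e.g. 0.999 means 99.9% of events must be "good"


def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that met the user-facing success criterion."""
    if total_events == 0:
        # Assumption: a window with no traffic counts as fully reliable.
        return 1.0
    return good_events / total_events


def meets_slo(sli: float, slo: Slo) -> bool:
    """True when the measured SLI is at or above the objective."""
    return sli >= slo.target


# Hypothetical numbers for a checkout step in the user journey.
checkout_slo = Slo(name="checkout-availability", target=0.999)
sli = availability_sli(good_events=999_412, total_events=1_000_000)
print(f"{checkout_slo.name}: SLI={sli:.4%}, meets SLO: {meets_slo(sli, checkout_slo)}")
```

In practice the "good event" definition would come from the metrics that matter most to customers at that point in the journey, such as successful responses under a latency limit.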

Next, build an error budget for your SLO. An error budget shows the amount of unreliability you can safely experience within a given timeframe before your customers will become unhappy.
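The arithmetic behind an error budget can be sketched as follows, assuming a time-based availability SLO. The 99.9% target, the 30-day window, and the downtime figure are illustrative values, not recommendations.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of unreliability allowed in the window for a given SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# Illustrative values: a 99.9% SLO over 30 days allows about 43.2 minutes.
budget = error_budget_minutes(0.999, 30)
remaining = budget_remaining(0.999, 30, downtime_minutes=10.8)
print(f"budget: {budget:.1f} min, remaining: {remaining:.0%}")
```

A 100% target is rejected here on purpose: it leaves no budget at all, which matches the point that aiming for perfect uptime is futile.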

You can design policies that kick in when error budget consumption crosses certain thresholds. These policies ensure that particular actions take place to maintain or improve reliability. Such actions might include code freezes to shore up reliability, or handing the pager back to the development team if the service does not meet reliability standards. These error budget policies should remain workable as you scale up. For example, freezing all development will likely become infeasible at a certain point, and localized freezes will suffice.
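One way to picture an error budget policy is as a small table of thresholds and actions. The sketch below is only an assumption about how such a policy might be encoded; the threshold values and the mapping of actions to thresholds are hypothetical, and each team would set its own.

```python
# Hypothetical policy: fraction of error budget consumed -> action.
# Ordered from most to least severe so the first match wins.
POLICY = [
    (1.00, "freeze feature releases for this service"),
    (0.75, "hand the pager back to the development team"),
    (0.50, "prioritize reliability work in the next sprint"),
]


def policy_action(budget_consumed: float) -> str:
    """Return the action for the highest threshold that has been crossed."""
    for threshold, action in POLICY:
        if budget_consumed >= threshold:
            return action
    return "no action needed"


for consumed in (0.20, 0.60, 0.80, 1.10):
    print(f"{consumed:.0%} consumed: {policy_action(consumed)}")
```

Evaluating the policy per service rather than for the whole organization is what makes localized freezes possible as the number of services grows.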

Finally, accelerate development from within the error budget. If you’re finishing development cycles with plenty of error budget to spare, you can see it as a sign that development can safely accelerate. Resources can be reallocated to push out code faster. You can set more ambitious development goals, confident that you’ll still meet your reliability goals.

By following this process, you can meet your scaling demands as quickly as possible without causing customer pain.

Recover from outages faster with incident response

No matter how reliably you design your software or how safely you scale, incidents will still occur. Minimizing the impact of those incidents is the other half of reliability. When you recover quickly after things go wrong, customers will still perceive your service as reliable. Here are some incident response best practices, and how they’ll change as you scale.

Alerting based on SLOs. As you scale, the threshold for when you need to trigger an alert may also change. However, you’ll always want your alerts to be based on customer impact. Alerting based on SLOs ensures that your alerts stay relevant, no matter how large you scale.
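A common way to alert on SLOs is to watch the burn rate, meaning how fast the error budget is being spent relative to the rate the SLO allows. The sketch below is a minimal version that assumes error ratios for a long and a short window are already available from your monitoring system. The function names and the 14.4 threshold are illustrative: sustained for one hour, that rate would use 2% of a 30-day budget.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    allowed_error_ratio = 1.0 - slo_target
    return error_ratio / allowed_error_ratio


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # illustrative value, tune per service
) -> bool:
    """Page only if both windows burn fast, so stale spikes do not alert."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# Hypothetical readings for a service with a 99.9% SLO.
print(should_page(0.02, 0.03, slo_target=0.999))   # sustained, ongoing burn
print(should_page(0.02, 0.001, slo_target=0.999))  # burn has already subsided
```

Because the threshold is expressed relative to the SLO, the same rule keeps working when traffic volume grows or the SLO itself changes.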

Blue-green deployment. Blue-green deployment involves building two identical environments. One works as the live environment, and the other (the idle environment) hosts new deployments. Once the deployments are tested and safe, the idle environment becomes the live version. This technique remains effective as you scale. 
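The switch between environments can be sketched in a few lines. This is an in-memory model only, under the assumption that a real setup would flip a load balancer or DNS target instead. The class names, version strings, and health check are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    name: str
    version: str


class BlueGreenRouter:
    """Tracks which environment is live and which is idle."""

    def __init__(self, live: Environment, idle: Environment) -> None:
        self.live = live
        self.idle = idle

    def deploy(self, version: str, health_check: Callable[[Environment], bool]) -> bool:
        """Deploy to idle, test it, and swap only if the check passes."""
        self.idle.version = version
        if not health_check(self.idle):
            return False  # live traffic is untouched by the failed release
        # Swap roles: the old live environment stays available for rollback.
        self.live, self.idle = self.idle, self.live
        return True


router = BlueGreenRouter(Environment("blue", "1.0"), Environment("green", "1.0"))
switched = router.deploy("1.1", health_check=lambda env: env.version == "1.1")
print(switched, router.live, router.idle)
```

Keeping the previous live environment intact after the swap is what makes rollback fast: reverting is another swap rather than a redeploy.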

Automated runbooks. Runbooks assist incident responders by providing a series of checks and steps. As you scale your service, more types of incidents will emerge. You’ll need to update your classification and runbooks to capture these new incidents. Don’t see this as overhead, though. Instead, it’s an opportunity to improve and further automate your response. Automating runbooks allows for fast responses, even as you scale.
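To make the idea concrete, a runbook can be modeled as an ordered list of checks and steps that run until one fails, at which point a human takes over. The sketch below is a simplified assumption of what such automation might look like. The step descriptions and the stand-in lambdas are hypothetical placeholders for real checks.

```python
from typing import Callable, List, Tuple

# A step is a description plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order, stop at the first failure, and return a log."""
    log: List[str] = []
    for description, action in steps:
        succeeded = action()
        log.append(f"{'OK' if succeeded else 'FAILED'}: {description}")
        if not succeeded:
            log.append("Escalating to the on-call engineer")
            break
    return log


# Hypothetical runbook for a high-latency incident; the lambdas stand in
# for real checks such as querying a metrics API or restarting a worker.
high_latency_runbook: List[Step] = [
    ("Check database connection pool", lambda: True),
    ("Restart stuck worker", lambda: False),
    ("Verify latency is back under the SLO", lambda: True),
]

for line in run_runbook(high_latency_runbook):
    print(line)
```

The returned log doubles as a record for the retrospective, showing which steps ran and where automation handed off to a person.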

Refined on-call. As your services scale and become more complex, the challenge of being on-call also increases. Make sure your training keeps pace with new potential obstacles. Also work to keep your on-call policies fair and considerate of the people carrying the pager as the difficulties increase.

Reassure customers with good communication

Even the most reliable services experience outages. Customers accept that this is inevitable. However, to retain customer trust, you need to communicate. As you scale, revisit your policies for communicating during and after incidents so that your responses continue to meet customer expectations.

When an incident is detected, communicate with your customers to manage their expectations. Give them a realistic timeline for when they can expect updates from your team. Knowing that your team is working on it and will update them, for instance, every 30 or 60 minutes can reduce at least some of the frustration customers may experience. Depending on the incident, these updates can be given via status pages, emails, or, if the incident is large enough, social media. Partner with the relevant internal teams, such as Customer Success or Marketing, to make sure that communication plans are determined ahead of incidents.

Once the incident is mitigated, issue a statement summarizing the incident. The statement can be built from the incident’s retrospective or postmortem. Of course, you probably won’t be able to share every detail of the incident. Instead, you should put yourself in the customers’ shoes and consider what they most want to know. This could be technical background, a high-level timeline, learnings, and a path to a more reliable future.

This path to a more reliable future is paved with the action items uncovered from the incident. Noting at least some of these gives customers confidence that you not only understand why the incident occurred, but also have a better idea of how to prevent it from happening the same way again.

Scaling brings new challenges and greater expectations for your reliability. Help your organization grow by investing in reliability tools like Blameless. To see how we can help your team, check out a demo!
