Ensuring Five 9s Uptime (99.999%) - Is it Achievable?

Wondering about five nines availability? We explain what five nines availability is, why it’s important, how to measure it, and whether it’s an achievable goal.

What is Five Nines Availability?

Five nines availability is the goal of keeping a system fully operational 99.999% of the time, which allows an average of roughly 5 minutes and 15 seconds of downtime per year.


The availability of a service is the most basic building block of reliability. Reliability is the most important feature for users. After all, it doesn’t matter what features you have if users can’t access them. At the same time, reliability has to be balanced with gaining a competitive edge through innovation. Innovation requires change, which always brings some instability into the system. In any software service, the development effort spent on innovation (new features) is always competing with the development effort spent on stability. That is why, however tempting it may be, you cannot set a goal of 100% uptime. Some outages will inevitably happen, and trying to improve reliability beyond the point users find acceptable provides diminishing returns for growing amounts of effort.

How to Measure Availability

Availability, often referred to as uptime, describes a system’s ability to perform its intended operations at a given point in time. It can be monitored by continuously querying the service and confirming that it responds with the expected speed and accuracy. The term is also used to describe the probability that a system will perform as expected in the future.


The availability of a service is often measured in "nines": the number of 9s in its uptime percentage, so 99.9% is three nines and 99.999% is five nines. More nines mean higher uptime. Since 100% uptime is impossible, five nines (99.999%) is typically the highest availability target companies aim for.
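To make the nines concrete, here is a minimal sketch that converts a number of nines into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime allowed by an availability with `nines` nines (3 -> 99.9%)."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    percentage = 100 * (1 - 10 ** -n)
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines ({percentage:.3f}%): {minutes:,.2f} minutes of downtime per year")
```

Each extra nine cuts the allowed downtime by a factor of ten, and five nines works out to about 5.26 minutes per year.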

The availability of a service can be calculated as the ratio of total uptime to total time (uptime plus downtime) over some time period. The result is then compared against the availability SLO.
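As a rough sketch of that calculation, availability over a window is uptime divided by total observed time. The function name and the sample numbers below are illustrative assumptions, not measurements from a real system.

```python
def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Fraction of the observation window during which the service was up."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("observation window must be longer than zero seconds")
    return uptime_seconds / total


# Hypothetical example: a 30-day window with 130 seconds of downtime.
window_seconds = 30 * 24 * 3600
downtime_seconds = 130
measured = availability(window_seconds - downtime_seconds, downtime_seconds)
print(f"Measured availability: {measured:.5%}")
```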

What is Error Budget, SLO, SLA, and SLI?

The error budget is the amount of acceptable unreliability or errors that your service can accumulate without impacting customer happiness. For any business, the happiness and satisfaction of the customer are of utmost importance. However, the ultimate goal of tech companies is to innovate, which often brings about some disruption. 


In complex systems, the pursuit of perfection is fruitless, and the best you can do is be prepared for incidents. The purpose of the error budget is to leave room for some mistakes without impacting the customer. 


SLO (service level objective) is the target numeric value for your service’s availability. It is the minimum level of reliability that is required to keep the customers happy. When the error budget is running low and the SLO is at risk of being breached, the SLO can trigger policies such as code freezes.


SLA (service level agreement) is a formal agreement between the customer and the service provider on how reliable the service will be and the repercussions of failure. If the service fails to be as reliable as promised, then the service provider gives a partial refund, points, or discounts to the customer. SLOs are always set to be more strict than the SLA, in order for the SLO policies to protect the SLA.


SLI (service level indicator) is a metric that is linked to an SLO. It can be as simple as the total availability of a server, or reflect something nuanced such as a user journey. 


The error budget represents the room you have before your SLO is breached: the amount of experimentation and mistakes you can afford before impacting the customer.
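One way to picture the error budget is as the downtime an availability SLO leaves over a rolling window. The sketch below assumes a time-based SLO and a 30-day window; the function names and the example figures are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO allows over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    return 1 - downtime_minutes / error_budget_minutes(slo, window_days)


# Hypothetical example: a 99.9% SLO with 21.6 minutes of downtime so far this window.
print(f"Budget: {error_budget_minutes(0.999):.1f} minutes")
print(f"Remaining: {budget_remaining(0.999, 21.6):.0%}")
```

A 99.9% SLO over 30 days leaves about 43.2 minutes of budget, so 21.6 minutes of downtime spends roughly half of it.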

Your SLOs, SLAs, and SLIs must be communicated across every level of the organization from developers to VPs. Having a shared objective can help organizations make their product better than ever with increased reliability and innovation. 

Why Is Five Nines Availability Important?

The nines of availability are an important metric in determining the strictness of an SLA or SLO. The higher the number of nines, the more difficult it will be to manage downtime. Five nines is an attractive figure, but ensuring this level of reliability requires further resources such as having staff available 24/7/365, which can be expensive in the long run. 


The most commonly-encountered SLA response window is four hours. Response time doesn’t mean that the website will be fixed in four hours, rather that the service provider will start troubleshooting within four hours. Additionally, the SLA usually allows for equipment-driven malfunctions, planned downtime, downtime resulting from human error, and maintenance to be excluded. 


When it comes to availability targets, SLAs can get quite nuanced. For example, the Amazon EC2 cloud service boasts 99.95% availability and gives credits if that figure is not met. However, there’s more to it in the fine print:


  • Scheduled outages are excluded. 
  • Events outside of Amazon’s control are not included. 
  • Downtime is measured from when the cloud service is in a Region Unavailable state (when Amazon detects that the service is down).
  • Downtime is measured on a monthly basis, i.e. last month’s system failure is not included in the current month’s statistics. 
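The effect of such exclusions can be sketched with a small calculation. This is an illustration only and not Amazon's actual formula: the function name, the event format, and the choice to simply ignore excluded events are all assumptions.

```python
def sla_monthly_uptime(total_minutes: float, downtime_events: list[tuple[float, bool]]) -> float:
    """Monthly uptime fraction, ignoring events the SLA excludes.

    Each event is (minutes_down, excluded), where `excluded` marks scheduled
    outages or events outside the provider's control (hypothetical format).
    """
    counted = sum(minutes for minutes, excluded in downtime_events if not excluded)
    return 1 - counted / total_minutes


minutes_in_month = 30 * 24 * 60  # 43,200 minutes in a 30-day month
events = [
    (120, True),   # scheduled maintenance: excluded by the SLA
    (10, False),   # unplanned outage: counts against the SLA
]

with_exclusions = sla_monthly_uptime(minutes_in_month, events)
raw = sla_monthly_uptime(minutes_in_month, [(m, False) for m, _ in events])
print(f"Uptime as the SLA counts it: {with_exclusions:.4%}")
print(f"Uptime as users experienced it: {raw:.4%}")
```

In this made-up month the provider meets a 99.95% target on paper, while the uptime users actually experienced falls below it.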

Should You Strive for Five Nines Availability?

According to a study by Compuware and Forrester Research in 2011, mainframe outages cost businesses an average of $14,000 per minute. Using that figure, a system with five nines availability (about 5.26 minutes of downtime per year) incurs roughly $73,500 per year in outage costs. For a system with four nines, the cost rises to about $735,000, and for three nines to about $7.3 million.


Considering the figures above, going from four to five nines saves about $660,000 per year in outage costs, so the extra hardware, software, and resources needed to get there must cost less than that to pay off. When we consider the costs of mainframe hardware, $660,000 does not buy much. A jump from three nines to four nines is much easier and has a greater business impact.
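The arithmetic behind these figures can be reproduced with a short sketch. It takes the $14,000-per-minute average from the study cited above and assumes a 365-day year; the function name is made up for illustration.

```python
COST_PER_MINUTE = 14_000          # average outage cost from the cited 2011 study
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def yearly_outage_cost(nines: int) -> float:
    """Expected yearly outage cost if all allowed downtime is actually used."""
    downtime_minutes = MINUTES_PER_YEAR * 10 ** -nines
    return downtime_minutes * COST_PER_MINUTE


for n in (3, 4, 5):
    print(f"{n} nines: ${yearly_outage_cost(n):,.0f} per year")

savings = yearly_outage_cost(4) - yearly_outage_cost(5)
print(f"Saved by going from four to five nines: ${savings:,.0f} per year")
```

The unrounded results are slightly higher than the rounded figures in the text, but the conclusion is the same: the last nine saves roughly $660,000 per year, which is the ceiling on what it is worth spending to get it.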


It’s also possible for businesses to become victims of their own reliability. If there have been no outages for a couple of years, then customers will expect the same level of service moving forward. But the fact remains that outages will happen, no matter how reliable your system has been. Even big organizations such as the Royal Bank of Scotland (2013) and Air New Zealand (2009) have been victims of outages from time to time.


You also want to avoid overspending on reliability. If your users are satisfied with a given level of availability, even if it’s just two or three nines, then any further improvements are likely to not be appreciated or even noticed. As the cost of increasing reliability grows faster and faster, but the value to customers diminishes, it becomes important to know when to stop focusing energy on improving availability.


The point where users are satisfied with your reliability should be where you set your SLOs. That way you’ll have an error budget that represents how much you can prioritize other things over reliability. When the error budget is low, you know you need to focus on keeping the services available. When the error budget is high, you can feel secure that occasional outages won’t cause customer dissatisfaction.

What Does Unavailability Mean?

The definition of unavailable is rather complex and varies with the application. For example, an ATM is expected to authorize transactions within 20 seconds. If it takes longer, then the system is effectively down, even though it is technically still working, just slowly.
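A latency-aware availability indicator captures this idea by counting slow responses as failures. The sketch below is one possible way to express it; the function name, the input format, and the sample data are assumptions, with the 20-second threshold taken from the ATM example.

```python
from typing import Optional, Sequence


def availability_sli(
    latencies_seconds: Sequence[Optional[float]],
    threshold_seconds: float = 20.0,
) -> float:
    """Fraction of requests that succeeded within the latency threshold.

    A latency of None stands for a request that failed outright.
    """
    if not latencies_seconds:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(
        1
        for latency in latencies_seconds
        if latency is not None and latency <= threshold_seconds
    )
    return good / len(latencies_seconds)


# Hypothetical sample: two fast responses, one slow response, one failure.
sample = [1.0, 25.0, None, 3.0]
print(f"Availability SLI: {availability_sli(sample):.0%}")
```

Here the 25-second response counts against availability just like the outright failure, so only half of the sample requests are considered good.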


In another scenario, an end-user may consider the system unavailable if one of its functions is not working or if they can’t access the system for some time. Likewise, if an end-user cannot use the service because of a lack of training or a complicated user interface, then they will also consider it unavailable. Ultimately, reliability depends on what the user expects from the service.

Human Error, Black Swans, and Five Nines Availability

According to Gartner, up to 95% of cloud breaches happen due to human error. This figure can be minimized by proper training, change control, monitoring, and incident retrospectives or postmortems, but cannot be eliminated altogether. Furthermore, you cannot predict Black Swan events such as cyber-attacks, viruses, or rogue weather.


When you aim for about five minutes of downtime per year, you’re limiting downtime to minor incidents that are resolved automatically by monitoring systems. A human cannot realistically detect, identify, and fix an issue within five minutes. As soon as an incident requires human intervention, the five minutes are gone for the year.

How to Achieve Five Nines Availability?

Monitoring plays a key role in achieving five nines availability. Downtime often occurs because organizations fail to efficiently monitor their IT systems or the system relies too heavily on processes that require human intervention.


There are three engineering principles that organizations can use in designing a high availability system:


  1. Avoid any single point of failure that can cause the entire application to crash by adding redundancies. 
  2. Make crossover (failover) points reliable, so that the crossover itself does not become a single point of failure.
  3. Make sure that any failure is detectable as soon as possible. 
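The first two principles can be illustrated with the standard availability formulas for redundant and chained components. This is a simplified sketch that assumes components fail independently, which real systems often violate; the function names and the example figures are hypothetical.

```python
import math


def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies, assuming independent failures.

    The group is down only when every replica is down at the same time.
    """
    return 1 - (1 - component_availability) ** replicas


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


# Hypothetical example: two 99.9% servers behind a 99.99% crossover point.
servers = parallel_availability(0.999, replicas=2)
system = serial_availability(servers, 0.9999)
print(f"Redundant servers alone: {servers:.6%}")
print(f"Servers plus crossover:  {system:.6%}")
```

Under these assumptions the redundant pair alone would exceed five nines, but the whole system ends up just under 99.99% because the crossover point is now the weakest link.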

How can Blameless Help?

Five nines availability is an attractive figure, but an ambitious one. There is little to no room for error, and it’s almost impossible to completely eliminate black swan events. Most businesses don’t need more than three or four nines of availability. However, no matter what your reliability goals are, Blameless offers products such as a comprehensive reliability insights tool, an SLO manager, and an automated incident resolution tool to help organizations increase their system reliability. To learn more about our products and services, request a demo or sign up for our newsletter.
