In the reliability era, many services are migrating from in-house servers to the cloud. The cloud model allows your service to capitalize on the benefits of large hosting providers such as AWS, Microsoft Azure, or Google Cloud. These providers’ servers can be more reliable than in-house servers for several reasons.
However, cloud providers present their own risks and challenges. Teams will want to take advantage of the benefits while accounting for these limitations. To do this, your DevOps practices must be built with the cloud in mind. In this blog post, we’ll look at how SRE helps with migrating to and operating in the cloud, and share some tips on how to maximize reliability.
In an AWS blog post, AWS GM Stephen Orban describes six strategies for migrating a service to the cloud. All strategies note a transition period where the service must be reevaluated or changed. Even if you could rehost a service on the cloud with no changes, you should make changes to optimize it for the cloud.
This transition period can affect reliability. Developers will likely need to change several aspects of the service in a short period of time. Operators will need to incorporate development changes and adjust procedures. It can be difficult to understand the impact of the risks you take when making these changes. You may anticipate that some services will have an outage, or that incidents will take longer to fix. How do you know if this is an acceptable decrease in reliability?
SRE helps answer this question with SLIs (service level indicators) and SLOs (service level objectives). SLIs are the metrics that most directly reflect customer satisfaction. SLOs set a goal for those metrics, placed at the point where customer happiness starts to decrease. The room between perfect reliability and that objective is your error budget. As long as you’re within that budget, you can be confident that decreases in reliability won’t affect your customers.
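To make the error budget concrete, here is a minimal sketch of the arithmetic for an availability SLO. The function names, the 30-day window, and the choice to measure the budget in minutes of downtime are illustrative assumptions, not part of any particular tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO tolerates over the window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining_minutes(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Budget left after the downtime already observed in the window.

    A negative result means the SLO has been violated.
    """
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% SLO over 30 days tolerates roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After a 20-minute outage, roughly 23.2 minutes remain.
    print(round(budget_remaining_minutes(0.999, 20.0), 1))
```

The same calculation works for request-based SLIs if you count failed requests instead of minutes.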
Break down your migration plan into potential incident scenarios. Think about how confident you are in your estimates for each incident type. As you refactor the code, monitor your burn rate. At your current burn rate, would your remaining error budget absorb the worst-case outcome of each scenario? Have a plan in place for violations, such as rollbacks or shifting efforts to reliability work. A slow but reliable transition to the cloud is better than a rushed one that introduces a large surface area of simultaneous failures.
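One common way to express burn rate is the observed error rate divided by the error rate the SLO allows, so that a value of 1.0 spends the budget exactly over the SLO window. The sketch below assumes that definition and a request-based SLI; the function names and the scenario figures are hypothetical.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate relative to the error rate the SLO allows.

    1.0 means the budget lasts exactly the SLO window; 2.0 means it is
    spent in half the window.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0.0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    return (bad_events / total_events) / allowed_error_rate


def can_absorb(remaining_budget_minutes: float, worst_case_minutes: float) -> bool:
    """True if the remaining budget covers a scenario's worst-case outage."""
    return worst_case_minutes <= remaining_budget_minutes


if __name__ == "__main__":
    # 5 failed requests out of 1,000 against a 99.9% SLO.
    print(round(burn_rate(5, 1000, 0.999), 2))
    # Hypothetical migration scenarios and their worst-case outage estimates.
    scenarios = {"database cutover": 30.0, "DNS switch": 10.0}
    for name, worst_case in scenarios.items():
        print(name, can_absorb(remaining_budget_minutes=23.2, worst_case_minutes=worst_case))
```

If a scenario fails the check, that is a signal to slow the migration step down or to schedule it after the budget has recovered.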
Once you have services in the cloud, make sure your SLOs account for your provider’s limitations. If the cloud host can only guarantee 99.99% availability, don’t promise that your service will offer 99.999%. These third-party dependencies can affect your customers’ happiness. Even though these incidents aren’t your fault, they still impact your customers’ perception of your reliability. You need to account for them when setting your reliability targets.
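A rough way to sanity-check an SLO against provider guarantees is to multiply the availabilities of everything the service strictly depends on. This sketch assumes the dependencies fail independently and that the service is down whenever any one of them is down, which is a simplification; the function names are illustrative.

```python
from math import prod
from typing import Iterable


def composite_availability(availabilities: Iterable[float]) -> float:
    """Upper bound on availability for a chain of hard dependencies.

    Assumes independent failures and no redundancy between components.
    """
    return prod(availabilities)


def slo_is_feasible(slo_target: float, availabilities: Iterable[float]) -> bool:
    """True if the target does not exceed what the dependencies can deliver."""
    return slo_target <= composite_availability(availabilities)


if __name__ == "__main__":
    provider = 0.9999  # the provider's guaranteed availability
    # Promising 99.999% on top of a 99.99% provider is not feasible.
    print(slo_is_feasible(0.99999, [provider]))
    # A 99.9% target leaves room for the provider and your own failures.
    print(slo_is_feasible(0.999, [provider, 0.9995]))
```

Redundancy across zones or regions can raise the achievable figure, which is one reason to model your architecture rather than rely on a single number.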
Operating your service on the cloud comes with both its own challenges and advantages. As you adjust your DevOps procedures for a cloud-first environment, you’ll find that you may naturally be implementing many SRE best practices. Rastko Vukašinović explains how in an article on his blog. He breaks down the two scales of DevOps: one that focuses on the deployment of the service, and one focusing on the service running in production.
For cloud-first environments, many aspects of deployment are often standardized. Resource management becomes more automated and flexible. This changes the way the deployment scale and production scale integrate. Rastko explains how SRE focuses more on applying DevOps to live services than to deployment. Instead of a traditional model of discrete deployments, deploying to the cloud often involves working with the live service directly. This enables what Rastko calls “a seamless flow of delivery”.
Rastko emphasizes that such a smooth flow isn’t magic, but “really hard work”. It can be helpful to build a model of how development and operations function in your organization. Look at each link in that model. How will it need to change for a cloud or microservices-based environment? Which steps can you automate or streamline within a cloud-based environment?
Make sure that the transition to the cloud doesn’t create information disconnects. SRE means learning from failure and creating a feedback loop. The cloud model may increase technical dependencies, and change how your operators deploy code and resolve incidents. SRE tools like incident retrospectives ensure that you’re still sharing learning no matter how processes change or scale. This helps keep the feedback loop in motion.
Review and refactor your monitoring data. Monitoring data is the lifeblood of SRE. The scale and context of data from private vs. public cloud environments can be significantly different. Make sure you’re still able to capture what you need. At the same time, look for new data sources available to you, such as ways to integrate your monitoring tools with the cloud service.
Test your service against cloud issues. Although many cloud hosting providers adhere to SLAs and outages are relatively uncommon, cloud hosting services are still prone to disruption, just like any other service. If possible, test in production and take advantage of chaos engineering to simulate potential incident scenarios, so you can be as prepared as possible for the real thing. Prepare for emergencies by investing in architectural redundancy and failover plans that minimize the impact and cost of hosting provider outages. These incidents are out of your control, but responding well will make a difference in how customers perceive your reliability.
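As a small illustration of testing a failover plan, the sketch below injects a simulated regional outage and checks that requests fall through to a secondary backend. Everything here is hypothetical: `RegionDown`, `make_backend`, and `call_with_failover` stand in for whatever clients and chaos tooling you actually use.

```python
from typing import Callable, Optional, Sequence


class RegionDown(Exception):
    """Simulated provider outage for one region."""


def make_backend(name: str, healthy: bool) -> Callable[[], str]:
    """Build a fake backend; an unhealthy one always raises RegionDown."""
    def backend() -> str:
        if not healthy:
            raise RegionDown(name)
        return f"served by {name}"
    return backend


def call_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Try each backend in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for backend in backends:
        try:
            return backend()
        except RegionDown as err:
            last_error = err  # remember the failure and try the next backend
    raise RuntimeError("all backends failed") from last_error


if __name__ == "__main__":
    # Inject a failure in the primary region and confirm the secondary serves.
    primary = make_backend("primary", healthy=False)
    secondary = make_backend("secondary", healthy=True)
    print(call_with_failover([primary, secondary]))
```

A real experiment would inject the fault into live infrastructure under controlled conditions; the point of the sketch is only that the failover path is exercised before a real outage does it for you.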
Run your incident response in cloud environments. Incident response tools, like automated runbooks, will likely also need to change when switching to the cloud. They should include checks for whether it is a local or cloud-based issue. If it’s the latter, they should detail how to engage help from the cloud provider. On-call responders will need to have this information at hand.
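The local-versus-cloud check described above could be encoded as the first step of an automated runbook. This is a minimal sketch under stated assumptions: the `IncidentSignals` fields and the suggested steps are hypothetical, and in practice the signals would come from your monitoring and your provider’s status reporting.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IncidentSignals:
    provider_reports_outage: bool  # hypothetical: taken from the provider's status reporting
    recent_deploy: bool            # hypothetical: taken from your deploy history


def triage(signals: IncidentSignals) -> List[str]:
    """Return ordered next steps for the on-call responder."""
    steps: List[str] = []
    if signals.provider_reports_outage:
        # Cloud-side issue: engage the provider and reduce exposure.
        steps.append("Open a support case with the cloud provider")
        steps.append("Fail over to a healthy region if one is available")
    if signals.recent_deploy:
        # Local change is a likely cause: undo it first.
        steps.append("Roll back the most recent deploy")
    if not steps:
        steps.append("Investigate service logs and dashboards")
    return steps


if __name__ == "__main__":
    for step in triage(IncidentSignals(provider_reports_outage=True, recent_deploy=False)):
        print(step)
```

Keeping the provider escalation path in the runbook itself means on-call responders do not have to search for it during an incident.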
The transition to the cloud is a major one. Investing in SRE practices and tooling can help. Blameless’ features spanning SLOs, incident retrospectives, runbooks, and more can help you stay reliable both during and beyond this mission-critical initiative. To see how the platform works in action, check out a demo.