
3 High Availability Cloud Concepts You Should Know

From scaling to VM placement strategies

Photo by Josep Castells on Unsplash

Having a publicly available solution usually means you need to deploy it and keep it working "somewhere". Nowadays "somewhere" is very often a cloud environment. It is a flexible option: you can start small and increase capacity as your business grows. However, regardless of what kind of system you own, you need to make it highly available so users can rely on it.

Cloud environments make it possible to build reliable systems, but that doesn’t mean the clouds themselves are immune to failures. It doesn’t work that way. You need to be aware of this and design your system to deal with failures, rather than assuming that all the cloud components you use are always available.

Let’s go through the main cloud concepts that are crucial to making your systems highly available.

Scaling

Vertical vs horizontal scaling (image by Author, using monitor image by 1117826 on Pixabay).

Making your system ready for a changing load while keeping the minimal needed capacity is one way of ensuring high availability. When you start small, a big load is not an issue; however, using cloud mechanisms like scale sets is still a good idea. They keep up the minimal number of virtual machines your system needs. In case of unexpected events, like a machine being taken down, the scale set rule should spin up a new instance for you. There are two main kinds of scaling: horizontal and vertical.

Horizontal scaling means that you add or remove instances of the same type (like a virtual machine instance or a container running an application) in your stack. A new instance is the same as the other instances in terms of the resources it uses and its capability to handle the load. Increasing the number of instances is also called scaling out, and decreasing it scaling in. To scale horizontally, your system needs to be ready for it: every single instance has to be capable of working independently. This is particularly important for stateful systems, where some kind of synchronization may be needed.

Vertical scaling means increasing the resources of one of your instances. You can add more RAM, CPUs, GPUs, disk space, or any other resource. It’s like making your machine more powerful, as the picture at the beginning of this section shows. Vertical scaling is also known as scaling up (adding resources) or scaling down (removing resources). The main drawback of this type of scaling is that it sometimes requires stopping an instance, adding resources, and starting it again, which can cause disruption. That is not the case with horizontal scaling.

Scaling your system is not an easy task. If the load fluctuates a lot, it may be hard to keep costs low while staying ready for high demand. Usually, you build scaling rules from metrics of the system: CPU utilization, memory usage, or disk space are common examples. You can also use metrics like the number of requests reaching the system, the length of queues (if you use them), or any application-specific signal related to load changes. For more complex cases, there are even Machine Learning algorithms that help find optimal scaling rules.
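As a rough illustration, a metric-based horizontal scaling rule can be sketched in a few lines. All thresholds and limits below are made-up values for the sake of the example, not recommendations:

```python
# A minimal sketch of a threshold-based horizontal scaling rule.
# The thresholds and instance limits are illustrative, made-up values.

def desired_instances(current: int, cpu_utilization: float,
                      scale_out_at: float = 0.75, scale_in_at: float = 0.25,
                      min_instances: int = 2, max_instances: int = 10) -> int:
    """Return the new instance count for a scale set, given average CPU load."""
    if cpu_utilization > scale_out_at:
        current += 1   # scale out: add one more instance
    elif cpu_utilization < scale_in_at:
        current -= 1   # scale in: remove one instance
    # never drop below the minimal capacity the system needs to stay up
    return max(min_instances, min(current, max_instances))
```

Note how the rule enforces a floor: even at near-zero load, `desired_instances(2, 0.05)` stays at 2 instances, which is exactly the "keep the minimal number of virtual machines" behavior described above.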

Multi-region deployments

Basic cloud units (image by Author using cloud image by camelia_sasquana on Pixabay).

In cloud environments, systems are deployed in units called regions. A region is a data center, or a set of data centers, located close to each other. There is also a more granular unit inside a region, called an availability zone. Each availability zone is a separate data center within one region.

Both regions and availability zones serve the availability of a system well. What’s more, when you deploy the system in different regions, like West Europe and East US, users can benefit from lower latency, as they’ll connect to the nearest instance.

Having your system deployed in different regions and/or different availability zones makes it more resistant to region failures; it simply adds more redundancy to your architecture. When a given cloud service you use is down in one region, you still have another region that works well. That’s the main idea behind multi-region deployments.

Sometimes a whole region, or even a couple of regions, may go down. It’s rare, but it happens, and in such cases you can’t do much if you use a single cloud provider. On the other hand, using multiple providers and multiple regions is costly, so you need to calculate what the best choice is for you.

How many regions are available in different cloud environments? Let’s look at Azure and AWS.

In Azure [1] there are:

  • 51 regions
  • 12 regions with at least 3 availability zones

In AWS [2] you can use:

  • 25 regions
  • 72 availability zones
  • each region except Osaka has at least 2 availability zones

As you can see, these providers chose different strategies. Microsoft expanded by adding regions, equipping only some of them with availability zones, while Amazon equipped nearly every region with availability zones but runs fewer regions. Whichever solution you choose, it can fit your high availability requirements.

Scale sets vs availability zones vs regions (image by Author, using images by camelia_sasquana, 1117826, OpenClipart-Vectors on Pixabay).

If you decide to go for a multi-region strategy, you need to consider whether your architecture fits it. Let’s say you have a system that can be deployed in multiple regions and each deployment can work on its own. In such a case there is no problem: you can choose whatever regions or availability zones fit your users’ needs.

However, if your system has components that need to communicate with each other and send a lot of data through the network, multi-region deployment may harm its performance. In the picture, you can see a tradeoff between availability and latency.

If you choose to have only one region and one availability zone, your instances will all be in one data center. That gives you the lowest availability, because if the data center goes down, the whole system goes down. However, the components will be placed close to each other, so the system will have the lowest latency.

If you choose to deploy your system in several availability zones inside one region, you’ll spread your instances across different data centers. That increases the theoretical availability of your system. It adds some latency, but availability zones are connected with a fast fiber network, so the impact should be small.

Finally, if you go for a multi-region deployment, your components will communicate across large distances. This setup has the highest availability: regions are far away from each other, so a natural disaster should not affect more than one of them. But latency between regions will be much higher than in the other cases.
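A back-of-the-envelope calculation shows why spreading instances raises availability. Assuming replicas (zones or regions) fail independently, which real ones only approximate, the system is down only when every replica is down at the same time:

```python
# Back-of-the-envelope estimate of how replication raises availability.
# Assumes zones/regions fail independently, which real ones only
# approximate -- treat the result as an upper bound, not a guarantee.

def system_availability(replica_availability: float, replicas: int) -> float:
    # the system is down only when every replica is down simultaneously
    prob_all_down = (1 - replica_availability) ** replicas
    return 1 - prob_all_down
```

One deployment at 99.9% availability allows roughly 8.8 hours of downtime a year; under the independence assumption, three such deployments together are down only when all three fail at once, a probability of 0.001³, i.e. about 99.9999999% availability.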

VM placement strategies

AWS virtual machines placement groups (image by Author using server image by Clker-Free-Vector-Images on Pixabay).

No matter which strategy you choose, you’ll end up with at least one deployment in one data center. That’s why it’s important to understand failure domains and update domains.

Your instances can land in the same or in different failure domains. A failure domain is basically a rack of machines sharing a power source; if the domain goes down, all instances on that rack go down too. You can check how many failure domains are available in the region you chose.

Update domains work in a similar way, but they help when cloud providers introduce changes (like patching operating systems) or need to run maintenance. If instances are spread across different update domains, you can be sure that during maintenance only a part of them (those inside one update domain) will be unavailable at a time.
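A hypothetical sketch of how a scheduler might spread instances across failure (or update) domains is simple round-robin assignment; the function name and round-robin strategy here are illustrative, not how any specific cloud implements it:

```python
# Hypothetical sketch: round-robin placement of instances across
# failure (or update) domains. Real cloud schedulers are more complex.

def spread_across_domains(instances: list[str], domains: int) -> dict[int, list[str]]:
    placement: dict[int, list[str]] = {d: [] for d in range(domains)}
    for i, name in enumerate(instances):
        placement[i % domains].append(name)   # next instance, next domain
    return placement
```

With 6 instances and 3 domains, each domain ends up holding 2 instances, so one rack failure or one maintenance batch takes out at most a third of the capacity.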

In AWS you can also define placement groups for your virtual machines. There are three types. The cluster type packs your instances close together inside one availability zone to minimize network latency. The partition type lets you create logical partitions that don’t share racks and decide which instances go to which partition. Finally, the spread type places each instance on distinct hardware, so the instances are less vulnerable to correlated outages. (If you don’t specify a placement group, EC2 makes no such placement guarantees.)

Why should I care?

Okay, but why do I need to understand and take care of all this? I’m just a cloud user; I pay for services and I want them to work. As I wrote before, using the cloud doesn’t mean you can ignore how it works under the hood. You are responsible for the architecture of your system and for making it highly available.

Let’s describe it with an example. In our theoretical system, we have an Apache Zookeeper cluster [3]. It’s a tool that supports the coordination of distributed systems, helping with distributed configuration and state.

Zookeepers need to work in a quorum, and a very particular one: you need 2N + 1 Zookeepers, where N is a natural number. The minimal highly available setup contains three Zookeepers: one can go down and the cluster still works, but if two go down, the whole cluster is down. Let’s say you didn’t care much and deployed your Zookeepers in a data center with two failure domains (racks). The Zookeepers will be spread like this:

Zookeepers spread on two failure domains (image by Author, using Apache Zookeeper logo).

Now there is a 50% chance that your cluster goes down when one of the racks breaks. So your system is not really highly available; a single failure can take it down:

Rack with two Zookeepers goes down (image by Author, using Apache Zookeeper logo).

Okay, so maybe we can scale the Zookeepers? Let’s go for N = 2; in such a setup, two Zookeepers can go down without any issue:

Zookeepers cluster of 5 instances (image by Author, using Apache Zookeeper logo).

But wait a moment: if you have only two racks, the instances can be spread like this (in the best case):

Rack with three Zookeepers goes down (image by Author, using Apache Zookeeper logo).

As you can see, it doesn’t help much. Your cluster is as vulnerable as it was with three Zookeepers. This shows why it’s important to understand what’s going on behind the scenes.
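The example above can be checked with a small sketch. The first function encodes the 2N + 1 quorum rule; the second checks whether a given split of nodes across racks survives the loss of a single rack (both functions are illustrative helpers written for this article, not Zookeeper APIs):

```python
# Sketch of the Zookeeper example: a majority-quorum cluster of
# 2N + 1 nodes tolerates N failures, but only if no single rack
# holds a majority of the nodes.

def tolerated_failures(cluster_size: int) -> int:
    # majority quorum: a cluster of 2N + 1 nodes survives the loss of N
    return (cluster_size - 1) // 2

def survives_rack_failure(rack_sizes: list[int], failed_rack: int) -> bool:
    total = sum(rack_sizes)
    alive = total - rack_sizes[failed_rack]
    return alive > total // 2   # a strict majority must stay alive
```

Three nodes split 2 + 1 across two racks lose the quorum when the bigger rack fails; five nodes split 3 + 2 fare no better. With only two racks, scaling the cluster does not improve fault tolerance; a third rack (e.g. a 2 + 2 + 1 split) does.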

Be ready!

We went through three basic yet useful concepts that help make your system highly available. Implementing them, together with a well-designed architecture, will increase the availability of your services and make them ready for unexpected, yet possible, events. And trust me, if you use the cloud, you’ll experience them sooner or later.

Bibliography:

  1. https://azure.microsoft.com/en-us/global-infrastructure/geographies/
  2. https://aws.amazon.com/about-aws/global-infrastructure/regions_az/
  3. https://zookeeper.apache.org/
