
Availability is one of the critical aspects of distributed systems. In this modern age, most systems need to guarantee availability. Put simply, availability is the proportion of time a system remains functional and able to perform its required task over a given period.
According to Wikipedia, availability is generally defined as uptime divided by total time (uptime plus downtime).
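As a quick illustration of that definition, here is a minimal sketch in Python. The function name `availability` and the sample numbers (6 hours of downtime in a 30-day month) are made up for this example.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a fraction between 0 and 1.

    Availability = uptime / (uptime + downtime).
    """
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical example: a service that was down for 6 hours
# during a 30-day month (720 hours in total).
monthly = availability(uptime_hours=714, downtime_hours=6)
print(f"{monthly:.4%}")  # roughly 99.17%
```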
We sometimes use the terms reliability and availability interchangeably, but they are not the same. Reliability is availability over time if we consider the full range of possible real-world situations that can occur. If a system is reliable, you can say it is available. However, if it is available, it may not necessarily be reliable.
★Why is Availability critical?
Let’s say we are launching a tutorial site. We have to make sure the tutorial videos, coding exercises, and reading materials are available. Whenever clients visit the site, they expect the system to be fully operational. If it is not, we may lose existing clients, and it will be hard to get new ones. A broken system gives you bad publicity.
If the system is unavailable often, users will be dissatisfied. They may even switch to other competitor systems that provide the same service. So, for system designers, ensuring availability matters a lot.
Now, if these web services are not available, it may harm their reputation, revenue, and so on. But it will not be the end of the world. No system is available all of the time.
Now take the example of software that supports an airplane, the software that helps it function properly while it is in the air. For this kind of system, being unavailable even for a few minutes would be unacceptable. So, an aircraft that can be flown for many hours can be said to have extremely high availability. As you can see, this can be a life or death situation.
But we don’t need to go that far in our imagination to look for a life or death situation for high availability scenarios.
Take a platform like YouTube. Imagine how many users would be affected if YouTube went down. Hundreds of millions of people use YouTube every day, and many of them depend on it for their livelihood. The same goes for cloud providers like AWS or Google Cloud Platform. Can you imagine how many services would be blocked if one of them went down? You might not be able to read this article right now! Yes, the effect is that bad.
One outage at such a cloud provider can have huge and far-reaching repercussions. All of these examples show us that availability matters a lot for system design.
★How to Measure Availability?
Ok, we get it, availability is essential, but how do we measure availability? It can be calculated as the percentage of time that a system or service remains operational under normal conditions.
Let’s say we measure a system’s availability based on the percentage of its uptime in a year. So, if a system is up and operational for six months of a year, it will have 50% availability. Now, imagine if Facebook or Uber were down 50% of the year! That’s really bad; it would not be acceptable. Nobody would use them, right?
When it comes to availability, we need to deal with very high percentages. To be honest, even 90% availability isn’t good enough. At this rate, Facebook would be down for about 2.4 hours every day. That’s 36.5 days a year. Would you use it then? I doubt even Zuckerberg would use it if that happened. No product can survive in today’s market with only 90% availability.
★What are Nines in Availability?
If a system has availability of 99%, it is called two nines availability, as the number nine appears two times. Even a 99% available system gives about 3.65 days of downtime a year, which is unacceptable for services like Facebook or Google.
Percentages of availability are sometimes referred to by the number of nines or "class of nines" in the digits.

99.9% availability is known as three nines, and 99.99% as four nines. Five nines availability (99.999%) allows only a little over 5 minutes of downtime in a year, which you can say is the gold standard of high availability.
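The downtime figures for each class of nines follow directly from the percentage. Here is a small sketch that works them out, assuming a 365-day year (8,760 hours); the function name `downtime_per_year_hours` is only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_per_year_hours(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, allowed by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


levels = [
    ("one nine", 90.0),
    ("two nines", 99.0),
    ("three nines", 99.9),
    ("four nines", 99.99),
    ("five nines", 99.999),
]

for label, percent in levels:
    hours = downtime_per_year_hours(percent)
    print(f"{label:<12} {percent:>7}%  {hours:9.3f} hours/year  ({hours * 60:9.1f} minutes)")
```

Under that assumption, 90% works out to 876 hours (36.5 days), 99% to 87.6 hours, 99.9% to 8.76 hours, 99.99% to about 52.6 minutes, and 99.999% to about 5.3 minutes of downtime a year.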
High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.
High availability comes with trade-offs like higher latency or lower throughput. As a system designer, it’s your job to think about these trade-offs and decide how highly available your system needs to be. Maybe only a part of your system needs to be highly available, not the whole system.
Take Medium, for example; it has many parts. Does the stats page in the profile need high availability? Not really. On the other hand, writing articles, reading articles, and loading the homepage need to be highly available. If you cannot see the stats page for some time, say 5–10 minutes, does it matter? I don’t think so. So, that part of the system can be less available (but of course, not 50% availability).
★What is the solution for high availability?
Now, we need to know how we can improve the availability of a system. The first and most straightforward rule is that your system should not have a single point of failure, that is, a component whose failure brings down the entire system.

Redundancy is the solution to a single point of failure. Redundancy is the act of duplicating or multiplying a component of the system.
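To see why duplicating a component helps, here is a rough sketch of the arithmetic. It assumes the copies fail independently of each other and that one working copy is enough to serve requests; real systems often break that assumption (shared power, shared network, the same bug in every copy), so treat the result as an upper bound. The function name `redundant_availability` is made up for this example.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent replicas when one working replica is enough.

    The group is down only if every replica is down at the same time,
    so availability = 1 - (1 - single) ** copies.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - single) ** copies


# One, two, and three copies of a component that is 99% available on its own.
for n in (1, 2, 3):
    print(n, f"{redundant_availability(0.99, n):.6f}")
```

Under these assumptions, two copies of a two-nines component give four nines, and three copies give six nines.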
Now let’s say a system has one server to handle client requests. It’s a single point of failure.
So, we need to add more servers to handle the requests. But then we need a load balancer to distribute the load between the servers. Now, if the load balancer is out, the system will not be available.

So, we need multiple load balancers to remove the single point of failure of the system. Now, if one of the servers is down, other servers can handle the client requests. If a load balancer is down, other load balancers can take requests and send them to the server.
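The failover behaviour described above can be sketched in a few lines. This is a simplified illustration, not how any particular load balancer is implemented: the class name `BackendPool`, its methods, and the server names are all hypothetical, and a real setup would detect failures through health checks rather than manual calls.

```python
class BackendPool:
    """Round-robin over a list of backends, skipping any marked as down."""

    def __init__(self, backends):
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._healthy = set(self._backends)
        self._next = 0  # index of the backend to try next

    def mark_down(self, backend):
        """Record that a backend has failed (e.g. after a failed health check)."""
        self._healthy.discard(backend)

    def mark_up(self, backend):
        """Record that a known backend has recovered."""
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none is left."""
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


servers = BackendPool(["server-a", "server-b", "server-c"])
servers.mark_down("server-b")  # simulate one server failing
picks = [servers.pick() for _ in range(4)]
print(picks)  # requests keep flowing to server-a and server-c
```

The same idea applies one level up: a pool of load balancers can be treated the same way, so that losing one of them does not take the system down.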

Conclusion:
Availability is a key property of a distributed system. If you want to make a system available, you need to eradicate any single point of failure in that system. And you can do so by making that part of the system redundant. Besides, you need a process for handling failures. Some failures require human intervention to get the system back up, so the system must make sure the responsible people are notified of a failure within a defined timeframe.
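One common way to get that notification is a heartbeat check: the service reports in regularly, and an alert fires when it has been silent for too long. The sketch below is only an illustration under that assumption. The class `HeartbeatMonitor` and its fields are hypothetical, timestamps are passed in as plain seconds so the logic is easy to follow, and actually sending the alert (email, pager, chat) is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Decides when a human should be alerted about a silent service."""

    timeout_seconds: float             # how long silence is tolerated
    last_seen: Optional[float] = None  # time of the most recent heartbeat
    alerted: bool = False              # whether this outage was already reported

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat and clear any previous alert state."""
        self.last_seen = now
        self.alerted = False

    def should_alert(self, now: float) -> bool:
        """Return True once per outage, when the silence exceeds the timeout."""
        if self.last_seen is None:
            return False  # nothing has reported in yet, so there is no baseline
        overdue = (now - self.last_seen) > self.timeout_seconds
        if overdue and not self.alerted:
            self.alerted = True
            return True
        return False


monitor = HeartbeatMonitor(timeout_seconds=60)
monitor.heartbeat(now=0)
print(monitor.should_alert(now=30))   # still within the timeframe
print(monitor.should_alert(now=90))   # silent for too long, alert a human
print(monitor.should_alert(now=120))  # same outage, no repeated alert
```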