
Let’s imagine we have a service handling a scale of a few thousand requests per month, with a single DB server serving all the read/write requests coming from the application servers.
Things have been working fine so far, but over time you start observing spikes in the incoming traffic, and you are unable to reduce the latencies and database failures even after trying the following:
- Have a database with plenty of memory and high-performance flash storage.
- Have implemented optimal caching to reduce latencies.
- Have proper indexing and optimizations over database queries.
So now you need to fix your DB infrastructure so that it can withstand the incoming traffic.
Let’s look at how database sharding and replication can provide a highly reliable and performant database system that can withstand a scale of millions.
Replication
Data replication is the process of storing the same data on more than one database server, which leads to reduced latencies, read scalability, easier data recovery, and higher availability of data.
Generally, each copy of the data is stored on a separate hardware component, because if that hardware crashes you don’t want your replica to crash along with it.
The most commonly used replication strategy is Single Leader Replication.
Single Leader Replication
In this kind of replication, there is a leader node that takes all the writes and handles replication of the data to all the other nodes. The leader node is generally termed the master, and the other dependent nodes are termed the slaves.
Both reads and writes are enabled on the master, while the slaves can serve read queries only.
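As a rough illustration, the read/write split can be sketched as a tiny routing layer in front of the cluster. This is a minimal sketch with made-up server names, not a real driver:

```python
import random

# Hypothetical single-leader router: writes always go to the master,
# reads are spread across the replicas (server names are illustrative).
class SingleLeaderRouter:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def route(self, query_type):
        if query_type == "write":
            return self.master             # all writes hit the leader
        return random.choice(self.slaves)  # reads go to any replica

router = SingleLeaderRouter("db-master", ["db-slave-1", "db-slave-2"])
print(router.route("write"))  # db-master
```

Real drivers (e.g. for MySQL or PostgreSQL) implement this routing for you; the point is only that the leader is the single entry point for writes.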

Now, to update the data on the slaves, we can opt for either synchronous or asynchronous replication. Both of these have their own pros and cons.
1. SYNCHRONOUS REPLICATION: Whenever a write comes to the master, it sends the same request to all of the slaves and waits for an acknowledgment from each of them. Only once it has received all the acknowledgments does it complete the write and notify the client.
Benefits of Synchronous replication
- It provides data consistency by assuring that all the nodes have the updated data, as all the slaves are in sync with the master.
- If the master node goes down, any of the slaves can easily become the master without any data loss.
Disadvantages of Synchronous replication
- The system can be highly latent at times, as the master waits for an acknowledgment from every slave.
- If any of the slaves is unavailable or goes down, the master blocks all writes until the synchronous replica is available again, which leads to lower availability.
2. ASYNCHRONOUS REPLICATION: Whenever a write comes to the master, it sends the same request to all of the slaves but doesn’t wait for an acknowledgment before completing the transaction.
This way, the master can continue to take writes even if all of the slave nodes fail.
Benefits of Asynchronous replication
- It provides low latency, as the master doesn’t wait for the data to be updated on the slaves.
- Higher availability, as the failure of any slave node doesn’t affect the master.
Disadvantages of Asynchronous replication
- It can lead to data inconsistency due to replication lag, i.e. master-to-slave replication can take a significant amount of time. Because of this, if you run the same query on the master and a slave concurrently, you can get different results.
- If the master node fails in an asynchronous system, the newly appointed master might not have all the writes from the previous master. If the previous master comes back online and becomes a slave, it will have conflicting writes.
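The difference between the two modes can be made concrete with a toy model. This is only a sketch, not a real database: sync mode blocks until every replica has applied the write, while async mode confirms immediately and lets replicas catch up later:

```python
# Toy model contrasting synchronous and asynchronous replication.
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Master:
    def __init__(self, replicas, synchronous=True):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Block until every replica has acked the write.
            for replica in self.replicas:
                replica.apply(key, value)
        # In async mode, replication would happen in the background;
        # until it does, a read on a replica can return stale data
        # (this window is the replication lag described above).
        return "committed"

replicas = [Replica(), Replica()]
master = Master(replicas, synchronous=True)
master.write("user:1", "alice")
print(replicas[0].data)  # {'user:1': 'alice'} -- replicas already in sync
```

With `synchronous=False`, the same `write` call returns before the replicas have the data, which is exactly where the inconsistency and conflicting-write problems come from.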
So replication gives the database read scalability. But how do we add write scalability? The answer is sharding.
Sharding
Let’s say you have a pizza and you can’t eat the entire thing by yourself, so you cut it into slices and call your friends.
Each friend gets one slice: what you have effectively done is partitioned the pizza according to each friend’s share. In just the same way, we can have multiple database servers, each taking a portion of the incoming request load.

So, technically, in sharding we divide the complete DB load across N database servers.
Each subset of the incoming requests is served by an individual server. Each partition sits on an entirely different physical machine, under the control of a separate database instance with the same database schema.
This is termed sharding. It can also be called horizontal partitioning, because sharding is essentially horizontal partitioning across different physical machines/nodes.
Sharding provides higher availability, reduces read/write DB latencies, and can handle a high DB load efficiently.
There are multiple things that need to be considered while enabling sharding for your database. The two main ones are as follows:
Shard-Key:
We need some identifier to decide which database server a particular request should be served from, i.e. there should be a key based on which we decide which shard a request should go to. We call this a shard_key or partition_key. Generally, we choose a column of the table as the shard_key.
Hash-Function:
Hashing is the process of converting an input of any length into a fixed-size string of text using a mathematical function. This means that any text, no matter how long, can be converted into an array of numbers and letters.
The mathematical function used to generate the hash value is termed the hash function.
So, using the hash function, we calculate the hash value for the shard_key and then map the hash value to a DB server.
In the below diagram, the hash function takes the shard_key (row-id) as input and returns the value by which we can identify to which shard it should go.
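The shard_key → hash → shard mapping can be sketched in a few lines. The shard count and the choice of MD5 here are illustrative assumptions; any deterministic hash works:

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size for this sketch

def shard_for(shard_key):
    """Map a shard key (e.g. a row id) to a shard via a hash function."""
    digest = hashlib.md5(str(shard_key).encode()).hexdigest()
    # Deterministic: the same key always lands on the same shard.
    return int(digest, 16) % NUM_SHARDS

print(shard_for(12), shard_for(29), shard_for(27))
```

The essential property is determinism: both the writer and every later reader compute the same shard for the same key.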

Now, NoSQL DBs like MongoDB provide an inbuilt standard implementation of sharding. Other databases that provide sharding at the database level are Cassandra, Hadoop, Apache HBase, etc.
In relational DBs like MySQL, on the other hand, there is no standard implementation of sharding. Hence, the common practice is to implement application-level sharding for relational databases.
Problems that may arise due to sharding
1. Redistribution of Data:
If one of the database servers in a cluster goes down, or you are adding a new database shard to the cluster, then you might need to redistribute or rehash the data.
This overhead depends directly on the hash function we use to distribute data between the different shards.
If the hash function is a simple modulo function that takes the row-id modulo the number of shards to decide which shard a request should go to, then it will lead to a near-complete rehashing of the data, as shown below:

In the above image, with 4 shards the mapping of data is based on the id modulo 4 (the number of shards):
12 % 4 -> shard 0, 29 % 4 -> shard 1, 27 % 4 -> shard 3
With 5 shards, the mapping is based on the id modulo 5:
12 % 5 -> shard 2, 29 % 5 -> shard 4, 27 % 5 -> shard 2
So we observe that most of the rows need to be remapped when we add a DB shard to an existing four-shard cluster.
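You can measure how bad plain modulo rehashing is with a quick experiment over a synthetic key range (the key count is arbitrary):

```python
# How many keys change shard when a modulo scheme grows from 4 to 5 shards?
keys = range(10_000)
moved = sum(1 for k in keys if k % 4 != k % 5)
print(f"{moved / 10_000:.0%} of keys must be remapped")  # 80% of keys must be remapped
```

Only keys where `k % 4 == k % 5` (those with `k % 20 < 4`) stay put, so 80% of the data has to move for a single added shard.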
And this kind of redistribution generally requires system downtime, which could be risky for your system.
How to Avoid:
- To avoid this, we can use a special hashing technique named _consistent hashing_ that generally requires remapping of only k/N keys, where k is the number of keys and N is the number of DB servers (more specifically, the maximum of the initial and final number of servers).
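A minimal consistent-hashing ring can be sketched as follows. This is a bare-bones illustration without virtual nodes (which real implementations add for evenness); servers sit at hashed positions on a ring, and each key is served by the first server clockwise from its own hash:

```python
import bisect
import hashlib

def _h(value):
    # Hash anything to an integer position on the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        # Place each server on the ring at its hashed position.
        self.ring = sorted((_h(s), s) for s in servers)
        self.positions = [h for h, _ in self.ring]

    def server_for(self, key):
        # First server clockwise from the key's hash (wrapping around).
        idx = bisect.bisect(self.positions, _h(key)) % len(self.ring)
        return self.ring[idx][1]

ring4 = Ring(["shard-0", "shard-1", "shard-2", "shard-3"])
before = {k: ring4.server_for(k) for k in map(str, range(1000))}

# Add a fifth shard: only the keys falling on the new shard's arc move.
ring5 = Ring(["shard-0", "shard-1", "shard-2", "shard-3", "shard-4"])
moved = sum(1 for k in before if ring5.server_for(k) != before[k])
print(f"only {moved} of 1000 keys moved")
```

Compare this with the modulo scheme above, where adding a fifth shard forced 80% of the keys to move; here every key that moves goes to the newly added shard, and the rest stay where they are.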
2. Data Denormalisation (Relational Databases)
With a single database server, queries that require joins can be done easily, but in a distributed architecture the data is spread across multiple machines. Hence, a join over tables that reside on multiple machines requires a lot of processing and would be significantly latent.
Hence you might need to denormalize the data to avoid joins as much as you can.
How to Avoid:
- You can also go for a NoSQL DB if you want to avoid this, but in some cases you might have a hard requirement for a relational database.
- Co-sharded joins: To avoid cross-shard joins in relational databases, we generally try to place the FK-mapped rows of a different table in the same shard as the row they reference, so that all the rows that need to be joined are located on the same shard.
To achieve this, we choose the column on which we join as the shard_key. Rows from different tables having the same value of the shard_key column will then end up lying in the same shard.

In the above example of a hospital database, we choose patient-id as the shard key. Hence, all the entries of different tables with the same patient-id lie in the same shard.
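A sketch of the co-sharded placement, following the hospital example (the table contents and the three-shard cluster are made up for illustration):

```python
# Every table routes by the SAME shard key (patient_id) with the SAME
# function, so joinable rows always land on the same shard.
NUM_SHARDS = 3  # assumed cluster size

def shard_for(patient_id):
    return patient_id % NUM_SHARDS

patients      = [{"patient_id": 7, "name": "A. Kumar"}]
prescriptions = [{"patient_id": 7, "drug": "ibuprofen"}]

# Both rows for patient 7 land on the same shard, so a join between
# patients and prescriptions never has to cross shards.
print(shard_for(patients[0]["patient_id"]),
      shard_for(prescriptions[0]["patient_id"]))  # 1 1
```

The design choice is that the join column and the shard key must be the same column in every table you intend to join.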
3. Hot Key Issue
Excessive access to a particular key, or to keys that lie in the same shard, can cause node exhaustion:
most of the requests end up being served by a small subset of shards while the other shards remain idle most of the time. This can lead to node failure or overload, and hence a resulting spike in latencies and failures.

In the above diagram, we can see that most of the requests are served by shard-1, so it’s heavily loaded, whereas shard-2 and shard-3 are almost idle.
This can result in increased latencies and failures for all the requests served by shard-1.
A real-world example: let’s say Facebook chooses user_name as the shard_key for its user table, and the most accessed profiles, like Cristiano Ronaldo’s, Lady Gaga’s, and Justin Bieber’s, all end up in the same shard. That particular shard would then become a hotspot of requests compared to the other shards.
How to Avoid:
- To avoid this, we need to choose the shard key and the hash function in such a way that the complete application load is distributed almost equally among all the database shards.
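One practical sanity check, sketched below with synthetic keys, is to hash a sample of real key values before committing to a shard key and look at how evenly the requests would spread (the key format and shard count are assumptions):

```python
from collections import Counter
import hashlib

NUM_SHARDS = 4  # assumed cluster size

def shard_for(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

# Sample of candidate shard-key values (synthetic here; use real data).
sample_keys = [f"user-{i}" for i in range(10_000)]
load = Counter(shard_for(k) for k in sample_keys)
print(load)  # a healthy shard key gives roughly 2500 keys per shard
```

If one shard's count dominates the others on real traffic samples, the candidate key (or hash function) is likely to produce a hot shard in production.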
Conclusion
Sharding is an excellent way to withstand incoming traffic in a smart way, but you need to identify when sharding is actually required, because as you can see, data partitioning and sharding bring a bunch of extra complexities.
So before deciding to shard the database, you should consider other options to make your database more performant, such as indexing or caching.
And if you still decide to design a distributed database system, there are other trade-offs you will have to consider as well.
Thanks for reading. Happy coding!