Database Replication Explained

Part 1 — Single Leader Replication

Zixuan Zhang
Towards Data Science


Introduction

In the modern era of big data, data replication is everywhere. From bank accounts to Facebook profiles to your beloved Instagram pictures, any data that people deem important is almost certainly replicated across multiple machines to ensure availability and durability. In this article, we will explore one of the most common replication strategies: single leader replication.

Single leader replication, also known as primary-backup (or primary-secondary) replication, is a straightforward way to increase the availability and durability of data. If you asked a college student to devise a replication strategy, she would probably come up with something similar: a single node (the leader) handles all write traffic and actively broadcasts it to the other nodes (the followers).

Theory

Below is a chart that describes the data flow of single leader replication. The idea is simple: the client sends all write requests to the leader (we’ll talk more about reads later). The leader then broadcasts each write to all followers. Once the followers have processed the write, an ACK is sent back to the client.

Fig 1 — Single Leader Replication, image by author
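
To make the data flow concrete, here is a minimal Python sketch of the write path. The Leader and Follower classes and their methods are hypothetical stand-ins invented for illustration, not the API of any real database.

# A toy single-leader write path (hypothetical classes, for illustration only).

class Follower:
    def __init__(self):
        self.log = []                  # replicated write log

    def apply(self, write):
        self.log.append(write)         # follower applies the leader's write
        return "ACK"

class Leader:
    def __init__(self, followers):
        self.log = []
        self.followers = followers

    def handle_write(self, write):
        self.log.append(write)         # 1. the leader writes locally
        for f in self.followers:       # 2. broadcast the write to all followers
            f.apply(write)
        return "ACK"                   # 3. acknowledge the client

leader = Leader([Follower(), Follower()])
print(leader.handle_write({"key": "likes:42", "value": 1}))  # -> ACK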

Deep Dive

Single leader replication is a straightforward idea, but it is not as simple as it seems. There are plenty of considerations an engineer should understand before choosing single leader replication. I glossed over a few details in the last paragraph, and we are going to examine them closely.

Sync or async, that is the question.

Firstly, when do we send the ACK to the client? Does the leader wait for all followers to process the request (synchronous replication), or does it reply as soon as its own local write succeeds (asynchronous replication)? Synchronous replication works well for certain applications, but it inherently reduces the availability of the system. The leader may need to wait for ages if the network between it and the followers is interrupted, which happens all the time with public network infrastructure (Fig 2). To make things worse, imagine what happens if a follower dies in a synchronous setting: the leader can’t process any requests simply because it never gets an ACK from the dead follower.

Fig 2 — Sync replication with long response time, image by author
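
To see the difference in code, here is a hedged sketch reusing the hypothetical Leader and Follower objects from the earlier snippet: the synchronous version only acknowledges the client after every follower has applied the write, while the asynchronous version acknowledges right away and replicates in the background.

# Sync vs. async acknowledgement, reusing the toy Leader/Follower from above.
import threading

def handle_write_sync(leader, write):
    leader.log.append(write)
    for f in leader.followers:
        f.apply(write)                 # blocks until every follower has applied it
    return "ACK"                       # the client waits for the slowest follower

def handle_write_async(leader, write):
    leader.log.append(write)
    for f in leader.followers:
        # fire-and-forget: replication happens in the background
        threading.Thread(target=f.apply, args=(write,)).start()
    return "ACK"                       # the client is acknowledged immediately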

Who’s Open for Business?

We know all write requests go to the leader, but what about read requests? There are two options: either let the leader serve reads as well, or expose the followers to the client. Putting all the pressure on the leader isn’t a good idea for read-heavy applications (if the workload is mostly writes, there’s not much difference). However, if the followers are allowed to serve read requests, consistency takes a hit, as demonstrated in the figure below.

Fig 3 — serving reads from followers compromises consistency, image by author

Fig 3 exposes a fundamental trade-off between availability and consistency, as dictated by the CAP theorem. Problems arise because of the unavoidable replication lag between the leader and followers introduced by async replication. However, we still have a few tricks up our sleeves to achieve stronger consistency guarantees.

Trick 1: using timestamps for read-after-write consistency.

In many applications, we want to ensure that a writer never sees a stale value (for instance, if you click the like button and refresh, you certainly don’t want to see it undone). Such a guarantee can be implemented by forcing the client to remember the timestamp of its last write. The database then only serves the read from a replica that has caught up to that timestamp; if no such replica is available, the read falls back to the leader.
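
Here is a minimal sketch of the idea in Python. The Replica class and its fields (such as last_applied_ts) are hypothetical stand-ins; the point is simply that the client carries its last write timestamp and refuses to read from a replica that hasn’t caught up.

# Read-after-write consistency via a client-remembered write timestamp
# (hypothetical in-memory Replica/Client, not a real database API).
import time

class Replica:
    def __init__(self):
        self.data = {}
        self.last_applied_ts = 0.0          # how far this replica has caught up

    def apply(self, key, value, ts):
        self.data[key] = value
        self.last_applied_ts = ts

    def get(self, key):
        return self.data.get(key)

class Client:
    def __init__(self, leader, replicas):
        self.leader, self.replicas = leader, replicas
        self.last_write_ts = 0.0            # timestamp of this client's latest write

    def write(self, key, value):
        self.last_write_ts = time.time()
        self.leader.apply(key, value, self.last_write_ts)   # replication not shown

    def read(self, key):
        for r in self.replicas:
            # only trust a replica that has caught up to our last write
            if r.last_applied_ts >= self.last_write_ts:
                return r.get(key)
        return self.leader.get(key)         # otherwise fall back to the leader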

Trick 2: sticky routing for monotonic read consistency.

With the naïve implementation, clients can read from any replica. This is sometimes problematic, as different replicas hold different versions of the data. A client may see versions go backwards across successive reads: version 2 from replica A, then version 1 from replica B, then version 2 again from replica A. To address this issue, a sticky routing load balancer can be used to make sure each client always reads from the same follower (if that follower dies, reads are routed to another one, of course).
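
One plausible way to implement sticky routing is to hash the client’s identifier onto a fixed replica and only fail over when that replica is down. The sketch below assumes hypothetical replica objects with an is_alive() method.

# Sticky routing: each client is pinned to one replica for its reads.
import hashlib

def pick_replica(client_id, replicas):
    # the deterministic hash keeps the same client on the same replica
    start = int(hashlib.md5(client_id.encode()).hexdigest(), 16) % len(replicas)
    for i in range(len(replicas)):
        candidate = replicas[(start + i) % len(replicas)]
        if candidate.is_alive():        # fail over to the next replica if it is down
            return candidate
    raise RuntimeError("no live replica available")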

Who Is the Chosen One?

Thirdly, what about leader failure? The current design relies exclusively on the health of the leader node. In a big data center, server outages happen all the time. If the leader goes down, a failover must occur: a follower is promoted to handle write requests for the sake of availability.

Failover under synchronous replication is not a big issue when it comes to data durability. However, in a completely asynchronous setting, a leader failure will probably lead to data loss, because the followers may not have the latest writes. From the clients’ perspective, it is as if their writes vanished into thin air. To make things worse, consider what happens if the old leader rejoins the group: it may very well have writes that conflict with those of the new leader, which has processed additional requests in the meantime.

Fig 4 — Async replication can lead to data loss, image by author
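
To limit such loss, failover logic typically promotes the most up-to-date follower. A minimal sketch, assuming each follower exposes a hypothetical replicated_offset attribute and an is_alive() check:

# Failover: promote the follower that has replicated the most writes.
def elect_new_leader(followers):
    live = [f for f in followers if f.is_alive()]
    if not live:
        raise RuntimeError("no follower available for promotion")
    # the most up-to-date follower loses the least data
    return max(live, key=lambda f: f.replicated_offset)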

Another fun thing to consider is what people call split-brain: a situation in which two nodes both believe they are the leader because of a network partition. This is a truly dangerous situation, as multiple leaders invariably produce conflicting writes. Without proper conflict resolution code, the system will never reach consensus, and the data ends up corrupted.

Fig 5 — Split brain when network is partitioned, image by author

There is no simple way to avoid split-brain automatically. The most common solution is manual failover: literally unplugging the power cord of the old leader. However, if you are one of those people who prefer algorithms, there are a few tools to try out.

Trick: quorum and fencing.

If the network is partitioned into two subsets, the subset with the majority of nodes remains active while shooting down the minority subset (I did not make this up; the approach is literally called STONITH, Shoot The Other Node In The Head) by sending a special signal to its power supply controller.
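
As a rough sketch of the quorum idea (the fence_others() and step_down() calls are hypothetical, not a real cluster manager’s API): a node keeps serving writes only if it can still reach a strict majority of the cluster; otherwise it steps down, and the majority side fences the minority.

# Quorum check: only the side that still sees a majority keeps serving writes.
def has_quorum(reachable_nodes, cluster_size):
    # count ourselves plus every node we can still reach
    return (1 + len(reachable_nodes)) > cluster_size // 2

def on_partition(node, reachable_nodes, cluster_size):
    if has_quorum(reachable_nodes, cluster_size):
        node.fence_others()   # e.g. STONITH: cut power to the unreachable nodes
    else:
        node.step_down()      # the minority side stops accepting writes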

Summary

Like many engineering designs, picking the right replication strategy is all about trade-offs. When using single leader architecture, you should carefully evaluate the following questions to come up with the best configuration:

  1. Consistency vs. Availability? Serving reads from followers is a double-edged sword. If you choose it, the application code must be carefully designed to handle stale reads and other consistency issues.
  2. Availability vs. Durability? Async replication is fast, but it can lead to data loss when the leader fails. Sync replication trades time for durability, but it might not be a good deal for certain applications. Some people decide to stay in the middle — having one sync replica and a bunch of async replicas.
  3. Labor vs. Automation? Handling leader node failure is no fun; engineers don’t want to be paged at 3 AM. However, automatic failover is a rabbit hole that goes on forever: split-brain, conflicting writes, data loss, you name it.
