A Fault-tolerant Kafka Replication Tool across Multiple Datacenters that scales

Introduction of a new cross-datacenter replication tool for Apache Kafka

Ning.Zhang

Published in Towards Data Science, Jan 5, 2021 (5 min read)

Image Courtesy of sumanley on Pixabay.com

Introduction

Apache Kafka is the de facto data streaming platform for high-performance data pipelines, streaming analytics, and mission-critical applications. As an enterprise's business grows, many scenarios require it to evolve from one Kafka instance to multiple instances. For example, critical services can be migrated to dedicated instances for better performance and isolation, which helps them meet a Service Level Agreement (SLA) or Service Level Objective (SLO).

Another example is Disaster Recovery (DR): the instance in a primary datacenter is continuously mirrored to a backup datacenter. When a disaster strikes the primary instance, applications (also called “services” here) quickly fail over to the backup datacenter and continue to operate with minimal downtime.

Last but not least, when a business operates in multi-datacenter mode, data is first routed to the geographically nearest datacenter for locality, then transferred to a central cluster in a remote datacenter, called the “aggregate cluster”, for a holistic and complete view of the data.

Each of the above scenarios demands a tool that replicates real-time data and meets five requirements: (1) fault tolerance and horizontal scalability, (2) low latency and high performance, (3) replication across datacenters, (4) a strong message delivery guarantee, and (5) simple failover and failback that are transparent to applications.

Product Survey

A legacy open-source tool called MirrorMaker can copy data from one Kafka instance to another. However, it has several shortcomings that make it challenging to maintain a low-latency multi-datacenter deployment and to build a transparent failover and failback plan, mainly the following:

There is no clean mechanism to migrate producers or consumers between mirrored Kafka instances, because a consumer offset on one instance has no meaning on the other.
Because it uses the high-level consumer API, rebalancing causes latency spikes, which may in turn trigger further rebalances.
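The offset problem can be illustrated with a toy model. This is a simplified sketch, not Kafka code: `PartitionLog` is a hypothetical stand-in for one topic partition, and the starting offset of 1000 on the source is an assumed value representing older records that retention has already removed.

```python
class PartitionLog:
    """Hypothetical stand-in for one topic partition; not a Kafka API."""

    def __init__(self, first_offset=0):
        self.first_offset = first_offset
        self.records = []

    def append(self, value):
        """Store a record and return the offset this log assigned to it."""
        self.records.append(value)
        return self.first_offset + len(self.records) - 1

    def read(self, offset):
        """Return the record stored at the given offset."""
        return self.records[offset - self.first_offset]


# Source partition whose first 1000 records have already expired (assumed).
source = PartitionLog(first_offset=1000)
# A freshly created mirror partition starts numbering at 0.
mirror = PartitionLog(first_offset=0)

source_offsets = {}
mirror_offsets = {}
for value in ["a", "b", "c", "d"]:
    source_offsets[value] = source.append(value)
    # The mirroring tool re-produces the record, so the destination
    # assigns its own, unrelated offset.
    mirror_offsets[value] = mirror.append(value)

print(source_offsets["c"], mirror_offsets["c"])  # 1002 2
```

A consumer that committed offset 1002 on the source cannot reuse that number on the mirror, where the same record sits at offset 2.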

uReplicator from Uber solved some of MirrorMaker's problems, but it relies on Apache Helix, which requires additional domain knowledge and maintenance.

Confluent Replicator should be a better solution, but it is proprietary enterprise software.

We want to promote MirrorMaker 2 (also called MM2), a new Kafka component that replaces the legacy MirrorMaker. It satisfies all five of the above requirements for replicating data between Kafka instances across datacenters.

In the following, we will discuss three major practical use cases of MM2:

Migrate to new Kafka instance

As the workload grows over time, the following risks will eventually be exposed on a single Kafka instance:

any turbulence in the Kafka instance impacts all services and applications
resource contention: services compete for shared resources
unpredictable SLO: a service could take an unbounded amount of resources, starving other services and causing them to miss their SLOs
slower recovery and maintenance: rebalancing data partitions in the Kafka instance becomes slower as the data volume and workload grow
no “one size fits all”: one set of configurations cannot satisfy the “conflicting” expectations of different services (e.g. stability over performance, performance over consistency)
any maintenance of the Kafka instance (e.g. upgrade, node swap) must be communicated to all engineering teams

To mitigate the above risks, consider running critical services on dedicated Kafka instances. For migrating from one Kafka instance to another, AWS has a tutorial for its managed Kafka service (Amazon MSK) that can be generalized to open-source Apache Kafka.

Amazon MSK Labs

Using a custom Replication Policy and a background process to sync MM2 checkpointed offsets to the __consumer_offsets…

amazonmsk-labs.workshop.aws
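The idea behind syncing checkpointed offsets can be sketched as follows. MM2 periodically records, per consumer group and partition, a pair of equivalent offsets on the source and target clusters, and its Java client utilities expose this through `RemoteClusterUtils.translateOffsets`. The Python below is only a simplified model of that lookup: the `Checkpoint` fields, the group names, and the topic name `primary.orders` (which assumes MM2's default topic naming and a source cluster aliased `primary`) are illustrative assumptions.

```python
from typing import Dict, List, NamedTuple, Tuple


class Checkpoint(NamedTuple):
    """Simplified model of an MM2 checkpoint; field names are illustrative."""
    group: str
    topic: str              # topic name on the target cluster
    partition: int
    upstream_offset: int    # committed offset on the source cluster
    downstream_offset: int  # equivalent offset on the target cluster


def translate_offsets(
    checkpoints: List[Checkpoint], group: str
) -> Dict[Tuple[str, int], int]:
    """Return, per target topic partition, the offset where `group` should resume."""
    latest: Dict[Tuple[str, int], Checkpoint] = {}
    for cp in checkpoints:
        if cp.group != group:
            continue
        key = (cp.topic, cp.partition)
        # Keep the checkpoint that reflects the most recent committed progress.
        if key not in latest or cp.upstream_offset > latest[key].upstream_offset:
            latest[key] = cp
    return {key: cp.downstream_offset for key, cp in latest.items()}


# Made-up checkpoints for two consumer groups.
checkpoints = [
    Checkpoint("billing", "primary.orders", 0, 1002, 2),
    Checkpoint("billing", "primary.orders", 0, 1050, 50),
    Checkpoint("billing", "primary.orders", 1, 980, 12),
    Checkpoint("audit", "primary.orders", 0, 1010, 10),
]
resume_at = translate_offsets(checkpoints, "billing")
print(resume_at)
```

A migrated consumer in the `billing` group would then seek to the returned offsets on the new instance instead of reusing its old offsets.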

Disaster Recovery

Although data in a Kafka instance is stored as 3 replicas across brokers, the whole instance can still become unavailable when all brokers are located in the same region and that region suddenly goes offline, or when a majority of the brokers are offline because of an outage of some racks. To achieve higher availability, it becomes important to set up a backup Kafka instance and continuously replicate from the primary to the backup instance. When the primary is not available, all services are routed to the backup.

The simplest approach is for producers not to send new data to the backup while the primary is down. More realistically, producers are redirected and continue producing new data to the backup. Once the primary recovers from the disaster with its data intact, only the new data generated during the disaster needs to be mirrored back from the backup to the primary instance by MM2.
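One way to see why only the new data flows back is MM2's default topic naming, which prefixes a mirrored topic with the alias of its source cluster and skips topics that originated on the target. The sketch below models that rule in Python under simplifying assumptions: the aliases `primary` and `backup` and the topic `orders` are made up, and topic names and aliases are assumed to contain no dots of their own.

```python
SEPARATOR = "."


def remote_topic(source_alias: str, topic: str) -> str:
    """Name a mirrored topic as '<source alias>.<topic>'."""
    return f"{source_alias}{SEPARATOR}{topic}"


def originates_from(topic: str, alias: str) -> bool:
    """True if `topic` carries `alias` as a prefix, so mirroring it back would form a cycle."""
    prefixes = topic.split(SEPARATOR)[:-1]
    return alias in prefixes


def topics_to_mirror(topics, source_alias, target_alias):
    """Map each topic on the source cluster to its name on the target, skipping cycles."""
    plan = {}
    for topic in topics:
        if originates_from(topic, target_alias):
            continue  # this data came from the target in the first place
        plan[topic] = remote_topic(source_alias, topic)
    return plan


# Topics on the backup cluster after a failover:
#   "primary.orders" holds data mirrored from the primary before the outage,
#   "orders" holds data written by redirected producers during the outage.
backup_topics = ["orders", "primary.orders"]
failback_plan = topics_to_mirror(
    backup_topics, source_alias="backup", target_alias="primary"
)
print(failback_plan)  # {'orders': 'backup.orders'}
```

In this model, the data the primary already owns is never copied back; only the records produced to the backup during the outage appear on the primary, under `backup.orders`.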

Active-Active Replication across Multi-Datacenters

In the active-active design shown below, one MM2 instance copies data from origin DC-1 to destination DC-2, and another MM2 copies data from origin DC-2 to destination DC-1.

active-active Replication across datacenters by MirrorMaker 2. Image by Author
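As a rough sketch of how the two replication flows might be configured, the helper below renders an MM2 properties file that enables mirroring in both directions. The property keys (`clusters`, `<alias>.bootstrap.servers`, `<source>-><target>.enabled`, `<source>-><target>.topics`) follow the sample configuration shipped with Kafka; the function name, the aliases `dc1` and `dc2`, and the broker addresses are placeholders chosen for illustration.

```python
from itertools import permutations


def mm2_properties(bootstrap_servers: dict, topics: str = ".*") -> str:
    """Render MM2 properties that mirror every listed cluster to every other one."""
    aliases = list(bootstrap_servers)
    lines = ["clusters = " + ", ".join(aliases)]
    for alias, servers in bootstrap_servers.items():
        lines.append(f"{alias}.bootstrap.servers = {servers}")
    # One replication flow per ordered pair of clusters, e.g. dc1->dc2 and dc2->dc1.
    for source, target in permutations(aliases, 2):
        lines.append(f"{source}->{target}.enabled = true")
        lines.append(f"{source}->{target}.topics = {topics}")
    return "\n".join(lines) + "\n"


config = mm2_properties({
    "dc1": "kafka-dc1.example.com:9092",  # placeholder address
    "dc2": "kafka-dc2.example.com:9092",  # placeholder address
})
print(config)
```

The rendered text could be saved to a file and passed to the `connect-mirror-maker.sh` script that ships with Kafka; a real deployment would add settings such as replication factors and security options.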

“Producer 1” writes to “Topic 1” in its local DC-1, and “Producer 2” writes to “Topic 2” in its local DC-2.

“Consumer 1” can read data from “Topic 1” that is produced by “Producer 1” in DC-1, and can also read data from “Topic 2 mirrored” that is originally produced by “Producer 2” in DC-2 and then replicated to DC-1. The reverse holds for “Consumer 2” in DC-2.

In the event of a disaster that causes DC-1 to fail, “Producer 2” and “Consumer 2” in DC-2 can continue operating. If the outage in DC-1 lasts only a short time and the data produced to “Topic 1” is not too critical, it is not always necessary to aggressively fail “Producer 1” over to DC-2, since DC-2 is still operating. When DC-1 recovers, the two instances of MM2 will catch up and continue to replicate data across datacenters.

From the application’s point of view, an active-active deployment increases both availability and performance, as “Consumer 1” and “Consumer 2” receive the same data (though possibly not in the same order) almost in real time. One datacenter can fail completely without impacting data consumption at the other datacenter.
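The remark about ordering can be made concrete with a small simulation. It is a toy model with made-up numbers, not a measurement: records produced locally are assumed to be visible immediately, while records from the other datacenter are assumed to arrive after a fixed replication delay of 5 time units.

```python
def arrival_order(events, local_dc, replication_delay):
    """Order in which a consumer in `local_dc` sees records from both datacenters.

    `events` is a list of (produce_time, dc, value). Records produced in the
    other datacenter arrive `replication_delay` time units later.
    """
    arrivals = []
    for produce_time, dc, value in events:
        delay = 0 if dc == local_dc else replication_delay
        arrivals.append((produce_time + delay, value))
    return [value for _, value in sorted(arrivals)]


# Producer 1 writes a1, a2 in dc1; Producer 2 writes b1, b2 in dc2.
events = [
    (0, "dc1", "a1"),
    (1, "dc2", "b1"),
    (2, "dc1", "a2"),
    (3, "dc2", "b2"),
]
seen_in_dc1 = arrival_order(events, "dc1", replication_delay=5)
seen_in_dc2 = arrival_order(events, "dc2", replication_delay=5)

print(seen_in_dc1)  # ['a1', 'a2', 'b1', 'b2']
print(seen_in_dc2)  # ['b1', 'b2', 'a1', 'a2']
```

Both consumers end up with the same four records, and the order within each producer's stream is preserved, but the interleaving across the two topics differs between datacenters.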

Summary

In the next few blogs, we will introduce several follow-up topics, including:

Please stay tuned for more articles!