Opinion

Change Data Capture (CDC) for Data Lake Data Ingestion

Farhan Siddiqui
Towards Data Science
6 min read · Feb 2, 2021


As data volumes continue to explode and business users increasingly demand continuous access to insights, extracting all of the underlying data in full is no longer practical. Instead, solutions must be deployed that identify and replicate a changelog to support near real-time analytics applications.

A proper data ingestion strategy is critical to any data lake's success. This blog post will make the case that Change Data Capture (CDC) tools like Oracle Golden Gate, Qlik Replicate, and HVR are best suited for data ingestion from frequently refreshed RDBMS data sources.

Fig 1: Typical data ingestion landscape for a data lake. Image by Author

Large data transfer times limit data lake adoption

Large data transfer times over networks are a critical factor that can limit a data lake's adoption. In a typical large enterprise today, many transactional data sources still reside in on-premises data centers, while most analytical platforms have migrated to the cloud to leverage utility compute and cloud providers' on-demand scale. A redundant, dedicated 10 Gbps network link to a cloud provider is typical for most large enterprises and costs tens of thousands of dollars per month. As good as that sounds, a 10 Gbps connection has significant limitations when dealing with datasets larger than a few terabytes.

Fig 2: Large dataset transfer time. Image by Author using Google data (reference below)

The data transfer times depicted in the above graph are optimistic estimates. Other factors like network latency, congestion, number of hops, and server response times can slow real-world transfers substantially. The bottom line is that full load replication is not practical for large datasets.
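
To put this in perspective, here is a minimal back-of-the-envelope calculation (assuming ideal conditions and ignoring protocol overhead, latency, and congestion) of how long a full load would take over a 10 Gbps link:

```python
def transfer_time_hours(dataset_tb: float, link_gbps: float = 10.0, efficiency: float = 1.0) -> float:
    """Estimate transfer time in hours for a dataset over a network link.

    dataset_tb  -- dataset size in terabytes (decimal TB)
    link_gbps   -- link bandwidth in gigabits per second
    efficiency  -- fraction of the nominal bandwidth actually achieved
    """
    bits = dataset_tb * 8 * 1e12                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)   # bits / (bits per second)
    return seconds / 3600

# A 100 TB full load over an ideal 10 Gbps link already takes ~22 hours;
# at 50% effective throughput it takes ~44 hours.
for size_tb in (1, 10, 100):
    print(f"{size_tb:>4} TB: {transfer_time_hours(size_tb):6.1f} h ideal, "
          f"{transfer_time_hours(size_tb, efficiency=0.5):6.1f} h at 50% efficiency")
```

Even under these ideal assumptions, a 100 TB full load ties up the link for the better part of a day, which is exactly why incremental replication becomes necessary.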

Limitations of Homegrown Change Data Capture (CDC) solutions

Many data engineering teams develop homegrown CDC solutions. There are two main patterns for homegrown CDC solutions:

  • CDC Calculations using timestamps: This pattern requires Create, Update, and Expire timestamps on source tables. Any process that inserts, updates, or deletes a row must also update the corresponding timestamp column, and hard deletes are not allowed. Once these conditions are met, calculating the changelog becomes trivial (a minimal sketch of this pattern follows the list below). However, enforcing these constraints is not trivial; in practice, only a tiny percentage of tables in a typical database have these timestamps or follow this pattern. It also creates tight coupling between the source table and the Extract, Transform & Load (ETL) code.
Fig 3: CDC calculation using timestamp columns. Image by Author
  • CDC Calculations using a minus query: This is more of an anti-pattern than a pattern. However, that does not seem to stop its proliferation in CDC calculations for many production systems. A DB link is created between the source and the target database, and a minus SQL query is executed to calculate the changelog. This pattern results in a large amount of data flowing between the source and target databases. Moreover, it can only work if both source and target databases are of the same kind, e.g., Oracle to Oracle.
Fig 4: CDC calculation using a minus query. Image by Author
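
As a concrete illustration of the timestamp pattern from the first bullet above, the sketch below pulls only rows changed since the last successful extract. The orders table, its timestamp columns, the watermark handling, and the bind-parameter style are all hypothetical and will differ per source system and driver:

```python
from datetime import datetime, timezone

# Hypothetical incremental extract for the timestamp pattern.
# `conn` is any DB-API 2.0 connection to the source database; the table,
# columns, and named-parameter style (:wm) are illustrative only.
CHANGELOG_SQL = """
    SELECT order_id, status, created_ts, updated_ts, expired_ts
    FROM   orders
    WHERE  created_ts > :wm OR updated_ts > :wm OR expired_ts > :wm
"""

def extract_changes(conn, last_watermark: datetime):
    """Return rows changed since the last extract, plus the new watermark."""
    new_watermark = datetime.now(timezone.utc)  # capture before querying so no change is missed
    cursor = conn.cursor()
    cursor.execute(CHANGELOG_SQL, {"wm": last_watermark})
    rows = cursor.fetchall()
    return rows, new_watermark
```

Note that this extract is only correct if every writer maintains the timestamp columns and never hard-deletes rows; a single batch job that bypasses them silently breaks the changelog, which is the tight coupling the bullet above warns about.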

Both CDC methods discussed above cause a significant load on the source database. Additionally, the minus query pattern also causes a significant network load.

Fig 5: Adverse network, CPU, and disk impact from homegrown CDC calculations. Image by Author

How do CDC tools calculate the changelog?

CDC tools minimize the load on both the network and the source database by mining database logs to calculate the changelog.

Every database that supports transactions first writes each change (insert, update, or delete) to a database log before applying it to the data files. This write-ahead logging protects the integrity of transactions against unexpected events, such as a power failure, while a transaction is still in flight.

The database logs cycle between active log (redo log) and archive logs based on file size or time interval events. Depending on the target database's data latency requirements, the CDC tool can access the active log or archive logs for the source database.

Fig 6: Basic architecture of a typical CDC tool. Image by Author, inspired by Qlik blog. (reference below)

Since the CDC tool only reads from the source database logs, it adds little to no load on the source database. Moreover, the changelog is typically many orders of magnitude smaller than the source table, so streaming it over the network does not cause any network load issues.

Fig 7: CDC tools minimize load on the source database and the network. Image by Author

Since database logs only contain information about the most recent inserts, updates, and deletes, changelog calculation becomes a trivial undertaking. However, since there are many database vendors, each with its own (sometimes proprietary) log format, such tasks are best left to commercial CDC tool vendors.
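
For intuition, a logical change record mined from the logs typically carries the operation type, the changed row images, and the source commit position. The field names and the target.upsert/target.delete helpers below are purely illustrative, not any vendor's actual format or API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeRecord:
    """Illustrative shape of a logical change event mined from database logs.

    Field names are made up for this sketch; every vendor (Golden Gate,
    Qlik Replicate, HVR) uses its own physical format.
    """
    operation: str           # "INSERT", "UPDATE", or "DELETE"
    table: str               # fully qualified source table name
    key: dict                # primary-key columns and values
    before: Optional[dict]   # row image before the change (updates/deletes)
    after: Optional[dict]    # row image after the change (inserts/updates)
    commit_position: str     # e.g., an SCN or LSN, used for ordering and restartability

def apply_change(target, rec: ChangeRecord) -> None:
    """Merge a single change into the target: upsert for INSERT/UPDATE, delete for DELETE."""
    if rec.operation == "DELETE":
        target.delete(rec.table, rec.key)
    else:
        target.upsert(rec.table, rec.key, rec.after)
```

Because each record describes exactly one committed change in commit order, replaying the stream as upserts and deletes keeps the target in near real-time sync with the source.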

CDC tools like Oracle Golden Gate, Qlik Replicate, and HVR allow near real-time synchronization of transactional data sources with analytical databases in a network-efficient manner while minimizing the impact on source systems. As such, they are best suited for data ingestion from frequently refreshed RDBMS data sources.

Additional benefits of commercial CDC tools

In addition to minimizing the impact on the source system and the network, most commercial CDC tools offer other benefits like:

  • Enterprise-grade resiliency
  • Reduced ETL compute footprint
  • Reduced ETL codebase
  • On-the-fly lightweight transformations
  • Loose coupling between the source database and data ingestion code

So why are CDC tools not more popular?

The number one reason why CDC tools are not adopted more widely is cost. CDC tool vendors invest heavily in understanding log formats of many databases, which themselves may continue to evolve. This requirement for continued investment limits the true CDC tool landscape to a few key players, allowing them to charge a premium for their product.

The number two reason CDC tools are not adopted more widely is the reluctance of source system DBAs to grant database log level access (privileged access) to third-party CDC tools for security reasons.

Both of the above concerns are valid but can be surmounted with patience and hard work. Engage CDC vendors in price negotiations to get favorable pricing terms, and have detailed discussions with your DBAs to apprise them of the value and address their security concerns.

There is significant enterprise value to be unlocked by implementing a CDC tool for ingesting data into a data lake, and I would encourage you to embark on that journey. CDC tools are one of the best solutions for managing exploding data volumes and business users' demands for near real-time access to insights, at least for frequently refreshed, RDBMS-sourced data.

Image credits: Fig 2: Google, Fig 6: Qlik Blog

Disclaimer: This is a personal blog post. The opinions expressed here represent my own and not those of my current or former employers. All content is provided for educational purposes only, without any warranty of suitability.

