Data Integration: The Definitive Guide

Simply put, it is the process of moving data

Arpit Choudhury
Towards Data Science

--

All images in the post are by the author

What is Data Integration?

A quick Google search says, “data integration is the process of combining data from different sources into a single, unified view”. Sounds simple, right? But hey — since you’re reading this, you already know that such a narrow definition of data integration is, to say the least, reckless.

Before I jump into describing what data integration really means, let me assure you that this guide is not meant to convince you that a particular solution or technology is better than the others. Instead, the goal of this guide is to provide you with a comprehensive, unbiased, 360-degree overview of the data integration landscape.

To that end, I will cover all the technologies that comprise the data integration landscape — iPaaS, CDP, ETL, ELT, and finally, reverse ETL.

Even if you are a seasoned data person, I hope that this guide becomes a ready reckoner for you every time you’re tasked with choosing the appropriate method to move data.

OK, so what the heck is data integration?

To put it simply, data integration is the process of moving data between databases — internal, external, or both. Here, databases include production DBs, data warehouses (DWs) as well as third-party tools and systems that generate and store data.

It’s good to keep in mind that all integration tools use the same underlying technology — APIs. If you’d like to learn more about APIs, here’s an in-depth guide, a video, and a course.

So many moves…

  • iPaaS or Integration Platform as a Service: data moves between cloud apps directly with little to no transformation taking place in the iPaaS
  • CDP or Customer Data Platform: data moves between cloud apps via a central hub which enables moderate transformation capabilities
  • ETL or Extract, Transform and Load: data moves from cloud apps to a data warehouse via a robust transformation layer built into the ETL tool
  • ELT or Extract, Load, and Transform: data moves from cloud apps to a data warehouse directly, after which transformation and data modelling take place in the warehouse via SQL. The main difference is that with ETL, transformation takes place before data is loaded into the warehouse, whereas with ELT, it takes place afterwards
  • Reverse ETL: data moves from a data warehouse to cloud apps. Typically, the core transformation takes place in the warehouse prior to the reverse ETL process, but the reverse ETL tool may have a minimal transformation layer to fit data to an external system’s schema

Let us now look at each of these technologies in more detail — their pros and cons, the audience each one caters to, and the key players operating in the market today. Once again, do keep in mind that the commentary is wholly based on the technology in question and not the companies or products operating under it.

iPaaS or Integration Platform as a Service

The term iPaaS was reportedly coined in 2008 by Boomi, an enterprise iPaaS vendor. Since then, iPaaS has seen wide adoption, resulting in a proliferation of companies offering iPaaS solutions in various shapes and packages.

Among the more popular ones today are Tray and Workato, focused on enterprises, and Zapier, Integromat, and Automate.io, catering to SMBs.

Needless to say, the range of integrations and capabilities differs from vendor to vendor, but fundamentally, they all do the same thing — perform actions based on a trigger. A trigger is essentially an event taking place in system A that is transmitted to the integration platform (via an API call or a webhook), which then performs one or more predefined actions.

The simplest example: every time you receive an email (an event or trigger), you also receive a Slack message with the contents of that email (action one), and the email is then marked as read (action two). Here, your email client delivers this event to the integration platform as soon as a new email lands in your inbox, and the iPaaS then does what it does.
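The push-style flow above can be sketched in a few lines of Python. This is purely illustrative — real iPaaS tools wire these steps together visually, and all function names here are made up, with Slack and the email client simulated by in-memory objects.

```python
# A minimal sketch of an iPaaS-style trigger -> actions pipeline.
# The email client calls handle_new_email_webhook() via a webhook
# whenever a new email arrives (the trigger).

def send_slack_message(text, outbox):
    # Action one: forward the email's contents to a Slack channel
    # (simulated here by appending to an in-memory outbox).
    outbox.append(text)

def mark_as_read(email):
    # Action two: flag the email as read in the source system.
    email["read"] = True

def handle_new_email_webhook(email, outbox):
    """Entry point invoked by the source system on each new email."""
    send_slack_message(f"New email: {email['subject']}\n{email['body']}", outbox)
    mark_as_read(email)

outbox = []
email = {"subject": "Q3 report", "body": "Numbers attached.", "read": False}
handle_new_email_webhook(email, outbox)
```

The key property is that the source system initiates the flow: the iPaaS does nothing until an event is delivered to it.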

An interesting thing to keep in mind about iPaaS solutions is that they can also check for changes in system A on a schedule (like every hour) and if a change is detected, the specified actions are performed.

Taking the same example as above, the integration platform can knock on your inbox every hour and if there is a new email, it can perform the actions, and if not, then it does nothing. Here too, the trigger is a new email being found in your inbox but this time, the iPaaS had to fetch this information from the email client instead of having it delivered.
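The scheduled (polling) variant flips the direction: the platform fetches changes on a cadence instead of being notified. A rough sketch, again with made-up names and an in-memory inbox standing in for the email client's API:

```python
# A sketch of scheduled polling: the platform checks the inbox on a
# fixed cadence and fires actions only for emails it hasn't seen yet.

def poll_inbox(inbox, last_seen_id):
    """Return emails that arrived since the last poll."""
    return [e for e in inbox if e["id"] > last_seen_id]

def run_poll_cycle(inbox, state, on_new_email):
    new_emails = poll_inbox(inbox, state["last_seen_id"])
    for email in new_emails:
        on_new_email(email)               # same trigger, different delivery
        state["last_seen_id"] = email["id"]
    return len(new_emails)                # 0 when nothing has changed

inbox = [{"id": 1, "subject": "hello"}, {"id": 2, "subject": "world"}]
state = {"last_seen_id": 0}
seen = []
handled = run_poll_cycle(inbox, state, seen.append)        # finds both emails
handled_again = run_poll_cycle(inbox, state, seen.append)  # finds nothing new
```

Note that the platform has to persist some state (here, `last_seen_id`) between cycles so it doesn't re-process the same events — which is exactly why scheduled checks tend to count against an iPaaS's operation-based pricing.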

How data moves through an iPaaS

iPaaS solutions can also be used to move data between internal systems where events take place. However, it’s rather uncommon since integration platforms generally charge based on the number of operations or tasks they perform and it can get pretty expensive pretty quickly.

One of the biggest drivers of the adoption of iPaaS is the fact that they offer a visual interface to build integrations, enabling business teams to take control of their workflow automation needs. In essence, iPaaS brings the simplest of coding concepts into a UI that is easy to maneuver with little to no technical knowledge, which is truly amazing!

Additionally, one of the areas where iPaaS solutions shine is the depth of integrations they offer with third-party SaaS tools. As long as an API endpoint is exposed by a SaaS vendor, an iPaaS can pull data from it or push data to it, enabling some really complex integrations that would otherwise require writing and maintaining a lot of code.

That said, like any other technology, iPaaS has a set of limitations and is not the ideal solution for all types of integration needs. Event-based integrations particularly fall short when there is a need to move data that does not rely on, well, events. Lastly, since iPaaS solutions are so flexible and free-form, they push a lot of complexity onto the user.

CDP or Customer Data Platform

Customer data platforms collect and collate customer data from different sources and send that data to different destinations. It’s important to note here that besides moving data, CDPs also enable data collection via proprietary SDKs and APIs. However, I intend to keep this guide focused on data integration and thus will leave data collection for another day and guide.

Like iPaaS solutions, CDPs have also gained steam in the last few years, resulting in horizontal CDP vendors like Segment, mParticle, Lytics, and Tealium as well as vertical CDPs like Amperity (retail and hospitality) and Zaius (ecommerce) focused on solving the needs of specific industries.

How data moves through a Customer Data Platform

In terms of moving data, Customer Data Platforms rely on predefined data models and offer limited or shallow integrations with third-party vendors, making them less than ideal for a lot of data integration use cases.

What’s important to keep in mind about Customer Data Platforms is that by definition, a CDP does a lot more than move data between tools. It enables marketing and growth teams to build segments based on user behaviour and user traits, and sync these segments to third-party tools to deliver personalized experiences — all without relying on engineering or data teams. If you’re curious to learn more about CDPs, check out this guide from Data-led Academy.
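To make the segment-building idea concrete, here is a tiny illustration: select users matching a trait and a behavioural condition, producing a segment that could then be synced to a downstream tool. All field names are invented for illustration and bear no relation to any particular CDP's data model.

```python
# A toy "segment" over user profiles: combine a trait (plan) with a
# behavioural condition (login count) to pick a cohort of user IDs.

users = [
    {"id": "u1", "plan": "pro",  "logins_last_30d": 12},
    {"id": "u2", "plan": "free", "logins_last_30d": 0},
    {"id": "u3", "plan": "pro",  "logins_last_30d": 1},
]

def build_segment(users, min_logins, plan):
    """Select users matching a trait (plan) and a behaviour (logins)."""
    return [u["id"] for u in users
            if u["plan"] == plan and u["logins_last_30d"] >= min_logins]

# e.g. "active pro users" to receive a personalized campaign
active_pro = build_segment(users, min_logins=5, plan="pro")
```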

On the other hand, if you’re already using a Data Warehouse such as Redshift, Snowflake, or BigQuery, you might not even need a CDP — read this in-depth piece that argues why your Customer Data Platform should be the Data Warehouse and not an off-the-shelf product.

ETL or Extract, Transform and Load

ETL is a traditional data integration process whose origins, as per Wikipedia, date back to the 1970s. However, it was only in the early nineties that Informatica made ETL commonplace in the enterprise. Talend, which launched in 2005, is a leader in the ETL space today.

Under the ETL paradigm, data is first extracted from first-party databases and third-party sources (primarily SaaS tools for sales, marketing, and support), transformed to meet the needs of analysts and data scientists, and finally loaded into a Data Warehouse.

How data moves through an ETL/ELT solution

The transformation step is particularly resource-intensive and time-consuming, which significantly increases the time between the extraction and the loading of data.

However, due to advancements in warehousing technology and its ecosystem, ETL is being replaced by ELT, which is faster and more flexible — read on to learn more.

ELT or Extract, Load, and Transform

ELT is the modern approach to ETL, largely fueled by cloud data warehouses such as Redshift, Snowflake, and BigQuery becoming extremely fast and reliable, which enables transformation to take place inside the warehouse itself. Add to that the flexibility and cost benefits of an architecture that separates compute from storage, and data warehousing becomes a no-brainer.

Fivetran, Stitch, and Matillion are companies leading the new ELT paradigm, and Talend, a leading ETL provider, acquired Stitch in 2018 to embrace ELT. Hence, it is fair to say that ELT is the new ETL, though both terms remain in use today.

Under the ELT paradigm, data is extracted from source systems and loaded into the warehouse without any transformation taking place. In fact, modern ELT tools don’t even offer in-built transformation capabilities but integrate well with services like dbt that are purpose-built solutions to handle the transformation layer within the data warehouse, doing away with the need to execute transformations elsewhere.
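The load-then-transform order can be demonstrated end to end in a few lines, using Python's built-in `sqlite3` as a stand-in for a cloud warehouse. The raw rows and table names are invented; the point is that raw data lands untouched, and the "T" happens afterwards as plain SQL inside the warehouse — the kind of query a dbt model would template and orchestrate.

```python
# A toy ELT run: load raw rows as-is, then transform with SQL
# inside the (stand-in) warehouse.
import sqlite3

raw_orders = [  # pretend this came from a SaaS API extract
    ("o1", "alice", 120.0),
    ("o2", "bob", 80.0),
    ("o3", "alice", 50.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id TEXT, customer TEXT, amount REAL)")

# Load: dump the extracted data into the warehouse with no transformation.
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: a SQL model materialized inside the warehouse.
conn.execute("""
    CREATE TABLE customer_revenue AS
    SELECT customer, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY customer
""")
rows = conn.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
).fetchall()
```

Because the transformation is just SQL against tables already in the warehouse, it can be re-run, versioned, and tested independently of the extract-and-load pipeline.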

ELT is fast, affordable, and, most importantly, requires no coding — all of which is fueling the shift from ETL to ELT.

Reverse ETL

The benefits of warehousing data from various sources using ELT tools are plenty. Moreover, the ability to transform data in the warehouse using tools like dbt makes the data warehouse the true source of truth for all types of data, particularly customer data, which is typically scattered across internal and external systems. As a result, a new breed of tools has emerged, and, for lack of a better term, reverse ETL was born.

A reverse ETL tool like Hightouch or Census takes care of the following:

  • Extract data from a warehouse or a database on a regular cadence and load it into sales, marketing, and analytics tools
  • Trigger a webhook or make an arbitrary API call every time the data changes
  • Move extracted rows of data to a production database to deliver personalized experiences

How data moves through a Reverse ETL solution
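The first of the capabilities above — extract modelled rows from the warehouse and load them into a downstream tool — can be sketched as follows. `sqlite3` again stands in for the warehouse, and the destination field names and the fake `send` step are illustrative assumptions, not any real reverse ETL tool's API.

```python
# A rough sketch of one reverse ETL sync: read modelled rows from the
# warehouse and reshape them to a destination tool's schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_revenue (customer TEXT, revenue REAL)")
conn.executemany("INSERT INTO customer_revenue VALUES (?, ?)",
                 [("alice", 170.0), ("bob", 80.0)])

def to_crm_payload(row):
    # Minimal transformation layer: fit warehouse columns to the
    # external system's expected field names.
    customer, revenue = row
    return {"external_id": customer, "properties": {"ltv": revenue}}

def sync(conn, send):
    """One sync run: push every warehouse row to the destination."""
    sent = 0
    for row in conn.execute(
        "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
    ):
        send(to_crm_payload(row))  # in reality, an API call per row or batch
        sent += 1
    return sent

payloads = []
count = sync(conn, payloads.append)
```

Note that the heavy lifting (the revenue model itself) already happened in the warehouse; the reverse ETL step only maps columns to fields and handles delivery.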

So how does this new approach benefit data teams, you ask?

Well, since data teams are already maintaining the data warehouse as the source of clean and consistent customer data for analytics purposes, moving this data to cloud apps from the same source of truth is a no-brainer — data engineers can finally maintain a single data pipeline for teams to analyze as well as act upon data.

But does this only benefit data teams? Hell no!

Sales, marketing, and analytics teams are positively impacted by this new data workflow as they are finally able to analyze and act upon the same, consistent, reliable data. This builds trust in the data and harmony amongst the teams.

Conclusion

I hope this article helps you get a handle on the overall data integration landscape and enables you to understand the pros and cons of the different technologies. It should be obvious that there are multiple ways to fulfill certain integration requirements and what is best for you truly depends on your use cases and the resources available to you.

I’d love to hear from you if you have any feedback or would like to learn more about data integration technologies; you can reach out on databeats.


Building databeats to help organizations turn good data into growth, and in the process, beat the gap between data people and non-data people for good.