5 Helpful Extract & Load Practices for High-Quality Raw Data

Immutable raw areas, no transformations, no flattening, and no dedups before finishing your excavations

Sven Balnojan
Towards Data Science
8 min read · Apr 4, 2023

--

Excavator - photo by Dmitriy Zub on Unsplash.

This post is an updated version of the original published on the Meltano blog.

ELT is becoming the default choice for data architectures, yet many best practices focus primarily on “T”: the transformations.

But data quality is determined before the transformations, in the extract and load phase. As the saying goes, "Garbage in, garbage out."

Robust EL pipelines provide the foundation for delivering accurate, timely, and error-free data.

Luckily we have a community full of data experts who have worked with Meltano, Stitch, Airbyte, Fivetran, and all the big extract and load tools on the market. So we asked them to provide their most crucial extract and load practices!

We distilled 5 data practices used & loved by the community that will drive up quality for all your data sets, no matter what tool you use.

But wait, why aren't these "best practices"? Because we consider them practices to pick and choose from. If you're tackling a new project or don't yet have many practices inside your extract and load processes, you can implement all of them. If you already have some, complement them with the ones that make sense.

Setting the Stage: We need E&L practices because “copying raw data” is more complex than it sounds.

The idea of ELT as a pattern sounds easy: "just copy over the source data first and then run your transformations over the raw data inside your own space." However, "copy" and "raw data" are two terms with hidden obstacles.

"Copying" sounds easy. But source data changes, and unless you know what changed, "copying" is more complicated than you think. Imagine a production table of 150 million "orders" that comes with a creation timestamp but no "modified" timestamp. And yes, these exist all over the place. So how do you know whether orders got modified, and if so, which ones? For instance, how would you know which orders got "canceled," an operation that usually modifies the existing data record in place?
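To make this concrete, here is a minimal sketch in Python (with invented table and column names such as orders and created_at) of an incremental extract keyed only on a creation timestamp; any order that is later canceled in place never shows up again:

```python
# Hypothetical incremental extract keyed only on a creation timestamp.
# Table and column names (orders, created_at, status) are invented for illustration.
last_run = "2023-04-01T00:00:00Z"  # bookmark stored from the previous run

query = f"""
    SELECT id, status, created_at
    FROM orders
    WHERE created_at > '{last_run}'  -- only rows *created* after the bookmark
"""

# An order created in March and canceled in April keeps its original created_at,
# so this query never returns it again: the in-place cancellation is silently missed.
# Without a modified_at column (or change data capture), "copying" cannot tell you
# which of the 150 million rows changed.
print(query)
```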

"Raw data" sounds clear. And yet the idea of extracting & loading implicitly means that you usually copy between two different technical systems A and B, where you need to adjust the data to match system B. You ingest from REST APIs and put the data into Snowflake, or you ingest from an Oracle database into Redshift. Every time you change systems, you will need to modify the "raw data" to adhere to the new system's rules. You need to do typecasting; you need to think about whether you want to "flatten JSON" or whether you want to add additional metadata to your data.

Simply "copying raw data" will raise new questions every time you add a new data source or target to your list, even if it is just a new table from the same production database you've always been copying from.

These practices will guide you when you use a new source for ingesting data into your data system.

1. Make each EL run uniquely identifiable — timestamp everything

We start with arguably the most important best practice: Make every bit of data you load into your data system identifiable and traceable back to the process that got it there.

Typical ways of doing this are to include metadata values that capture:

  • Ingestion time: the timestamp indicating when the load process started.
  • Ingestion process: a unique identifier representing the load process and its instance.
  • Source system: metadata about where the data was extracted from.

Picture by author.

Add any or all of these metadata fields to each row/entry of the data you ingest. We recommend using the start time of your ingestion run as the "ingestion time," as it simplifies the process. The "identifier of your ingestion instance" should be unambiguous: don't just provide "Airflow-OracleDB-Ingester" as the process, but "Airflow-OracleDB-Ingester-ID1244," where ID1244 identifies the specific run of the ingestion.
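As a minimal sketch of this practice (plain Python; the column names _ingested_at, _ingestion_run_id, and _source_system are assumptions for illustration, not a convention of any particular tool), stamping every record before it is written could look like this:

```python
import uuid
from datetime import datetime, timezone

# Illustrative sketch: stamp every extracted record with ingestion metadata.
# Column names (_ingested_at, _ingestion_run_id, _source_system) are assumptions.

def stamp_records(records, source_system):
    run_started_at = datetime.now(timezone.utc).isoformat()  # one timestamp per run
    run_id = f"Airflow-OracleDB-Ingester-{uuid.uuid4().hex[:8]}"  # unique per run instance
    for record in records:
        yield {
            **record,
            "_ingested_at": run_started_at,
            "_ingestion_run_id": run_id,
            "_source_system": source_system,
        }

rows = [{"order_id": 1, "status": "open"}, {"order_id": 2, "status": "canceled"}]
for row in stamp_records(rows, source_system="oracle_prod.orders"):
    print(row)
```

Note that the run identifier and timestamp are generated once per run, not per record, so every row of the same load can be traced back to the same process instance.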

One benefit of having the source system as metadata is that you can quickly debug problems in downstream dashboards and identify their source. This is also useful metadata for other use cases.

Suppose you have a legacy and a new customer registration component. In that case, you can provide the source as a filter option inside dashboards, allowing users to filter for customers from just one system.

Picture by author.

2. Deduplicate data at a level beyond the raw level

There are usually three cases of duplicate data hitting your data systems that you will want to "deduplicate." But no matter the case, don't do it at the raw/landing level!

The first case is intentional duplicate data, where a source system contains something your end-users or you consider duplicated. For instance, your CRM system might have two entries for a certain customer who canceled and signed up again. Deduplicating at the raw level means either merging the two or deleting one, both of which remove data that is present in the source system.

The second case is unintentional duplicate data, where the source system either deletes a record you still have in your data warehouse or unintentionally produces duplicate data it will likely delete in the future. Even though this is an "error," I don't recommend deleting this data in your raw ingestion area but rather filtering it out further down the line, for instance, in the next stage of your modeling. Otherwise, you add logic to your ingestion that is hard to follow up on later.

The third case is duplication happening due to technical restrictions. It might be that your ingestion tooling prefers an "at least once delivery" strategy, or it might even be a bug in an ingestion process. With "at least once delivery" incremental load strategies, you are guaranteed to get every data row but might duplicate some. Again, we recommend keeping the duplicate data at the raw level and filtering it out at a later stage.

Picture by author.

No matter the case, don’t deduplicate on load. Load everything, and keep it there as it is. Do your deduplication later at the next stage.
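Here is a minimal sketch of that idea in plain Python (the column names are carried over from the illustration in practice 1): the raw records stay exactly as loaded, and a staging step picks the most recently ingested version per business key:

```python
# Illustrative staging-level deduplication; the raw records stay untouched.
# Keeps the most recently ingested version per business key (here: order_id).

raw_records = [
    {"order_id": 1, "status": "open",     "_ingested_at": "2023-04-01T00:00:00Z"},
    {"order_id": 1, "status": "canceled", "_ingested_at": "2023-04-02T00:00:00Z"},
    {"order_id": 2, "status": "open",     "_ingested_at": "2023-04-02T00:00:00Z"},
]

def dedupe_latest(records, key, order_by):
    latest = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or record[order_by] > current[order_by]:
            latest[record[key]] = record
    return list(latest.values())

staging = dedupe_latest(raw_records, key="order_id", order_by="_ingested_at")
print(staging)  # one row per order_id; raw_records still contain every version
```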

3. Don’t flatten during EL, do it one stage later

Many source systems you ingest will return arrays, JSONs, or other nested objects with some hierarchy you want to break down for further processing. But not at the ingestion level. Take the raw data and dump it as it is in your data system, then have a process do your “flattening.”

A very typical example is JSON objects. Your source might contain a large JSON object, and you would like to have it processed into individual columns inside your Snowflake database. This practice suggests first having a raw table with just your metadata columns and one “JSON_blob” column containing the JSON object. In a second step, you can then process this data into columns.

Picture by author.

The reason for this is that flattening involves business logic. It involves you knowing what properties are “always there.” If you flatten on ingestion, your ingestion process might break because one JSON object is NULL, or one JSON object doesn’t come with one expected value. It is always easier to take the already ingested data and rerun your flattener than to run your ingestion + flattening process together.
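A minimal sketch of the two-step approach (the blob column name and the expected fields are assumptions for illustration): the raw table keeps the untouched JSON, and a separate flattener pulls out columns defensively, so a NULL blob or a missing property does not break ingestion:

```python
import json

# Raw rows as loaded: metadata columns plus one untouched JSON blob column.
raw_rows = [
    {"_ingested_at": "2023-04-01T00:00:00Z", "json_blob": '{"id": 1, "email": "a@x.io", "plan": "pro"}'},
    {"_ingested_at": "2023-04-01T00:00:00Z", "json_blob": '{"id": 2}'},  # missing fields
    {"_ingested_at": "2023-04-01T00:00:00Z", "json_blob": None},         # NULL blob
]

def flatten(row):
    """Second-step flattener: tolerant of NULL blobs and missing properties."""
    blob = json.loads(row["json_blob"]) if row["json_blob"] else {}
    return {
        "_ingested_at": row["_ingested_at"],
        "customer_id": blob.get("id"),  # .get() instead of ["id"] so nothing breaks
        "email": blob.get("email"),
        "plan": blob.get("plan"),
    }

flat_rows = [flatten(r) for r in raw_rows]
print(flat_rows)  # if the expected shape changes, rerun this step only, not the ingestion
```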

Additional tip: the same reasoning applies to typecasting. Avoid it on ingestion if possible; we recommend doing typecasting after ingestion instead.

4. Have an immutable raw level

At some point, you will switch to incremental updates of your data. That is the time to remember to create an "immutable raw level": an area where you never, I repeat, never modify or delete data.

“immutable raw level”: an area where you never modify or delete data.

Not deduplicating is one part of it (see practice 2), but there is more: in your immutable raw level, you do not remove records that were removed upstream, and you do not modify data that was modified upstream. You simply load new data in; that is it.

A great example I still painfully remember is a north star metric dashboard. It showed the development of the north star metric of the company I worked at, based on customer behavior over the last couple of months. The dashboard and the numbers looked great, moving upwards. Product & management decisions were made on this basis. New record values of the north star metric were broadcast each week.

Then suddenly, one day, the dashboard looked different. Our north star metric lost 10% of its value, and 30% in a specific segment.

One huge customer left, and the records were completely wiped.

Since we were modifying our raw data, we had irreversibly messed up the north star metric.

From that day on, we used snapshots of all the raw data that might be subject to change.

Picture by author.

Do yourself a favor, and have your raw level immutable.

Note: an immutable raw level is effectively what you create with full table syncs if you do not delete data. Also, GDPR and other privacy requirements are a valid reason to make an exception here.
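One way to picture an immutable raw level (a sketch assuming simple file-based snapshots; paths and names are illustrative): every sync appends a new snapshot under its own run timestamp, and nothing is ever updated or deleted in place:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Illustrative append-only snapshot layout for a raw area.
# Each run writes a new snapshot; existing snapshots are never modified or deleted.

def write_snapshot(records, table, base_dir="raw"):
    snapshot_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(base_dir) / table / f"snapshot_{snapshot_ts}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    # "x" mode fails if the file already exists -- no accidental overwrite.
    with path.open("x") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path

write_snapshot([{"customer_id": 1, "mrr": 5000}], table="customer_mrr")
# A customer deleted upstream simply stops appearing in *new* snapshots;
# the history in older snapshots stays intact (GDPR deletions being the exception).
```

With this layout, a wiped customer shows up as an absence in the latest snapshot, while the metric history computed from earlier snapshots stays reproducible.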

5. Do not transform data on ingestion, not even slightly, unless you have to

There are good reasons to transform data on the fly during ingestion, but almost every case you can think of also works without it. Legal and security requirements are two good reasons for transforming data on the fly. For everything else, you should first ingest the data and then run a small transformation on the ingested data.

If you opt to do an "on the fly" transformation while ingesting data, make sure it is fail-proof. Try only to add or remove data, not to modify it.

If you want to transform data by default, it is always better to ingest it first. Then you can, for instance, create a mapping table and do the join there.

Picture by author.

You can do so by using mechanisms like “dbt seeds” or ingesting Google Sheets maintained by an external contributor.
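As a sketch of what that looks like (the country mapping is a made-up example of something you might maintain as a dbt seed or a Google Sheet): the raw rows keep the original source values, and the join against the mapping happens one stage later:

```python
# Raw rows keep the original source values untouched.
raw_rows = [
    {"customer_id": 1, "country_code": "DE"},
    {"customer_id": 2, "country_code": "XX"},  # unknown code stays visible in raw
]

# Mapping table maintained outside the pipeline (e.g. a dbt seed or a Google Sheet).
country_mapping = {"DE": "Germany", "US": "United States"}

# The "join" happens one stage later, not during ingestion.
enriched = [
    {**row, "country": country_mapping.get(row["country_code"], "Unknown")}
    for row in raw_rows
]
print(enriched)
```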

Summary

We all know about “Garbage in, garbage out,” yet we often fail to recognize it in the extract and load world.

These practices focus on reducing garbage at the very beginning of our processes. They will help you to fix problems faster and increase your data quality in the long run.

Look at the graphic below if you want the short version of all the practices.

5 Helpful extract and load practices, image by the author.


Head of Marketing @ Arch | Data PM | “Data Mesh in Action” | Join my free data newsletters at http://thdpth.com/ and http://finishslime.com