
Transitioning from ETL to ELT

How cloud computing and analytics engineering forced the transition from ETL to ELT

ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) are two terms commonly used in the realm of Data Engineering and more specifically in the context of data ingestion and transformation.

While these terms are often used interchangeably, they refer to slightly different concepts and have different implications for the design of a data pipeline.

In this post, we will clarify the definitions of ETL and ELT processes, outline the differences between the two, and discuss the advantages and disadvantages both have to offer to engineers and data teams in general.

Most importantly, I am going to describe how recent changes in the way modern data teams are formed have impacted the landscape of the ETL vs ELT debate.


Understanding Extract, Load and Transform independently

The main question when it comes to comparing ETL and ELT is obviously the sequence in which the Extract, Load and Transform steps are executed within a data pipeline.

For now, let’s ignore the execution sequence, focus on the actual terminology, and discuss what each individual step is supposed to do.

Extract: This step refers to the process of pulling data from a persistent source. This data source could be a database, an API endpoint, a file, or really anything that contains any form of data, whether structured or unstructured.

Transform: In this step, the pipeline is expected to perform some changes to the structure or format of the data in order to achieve a certain goal. A transformation could be an attribute selection, a modification of records (e.g. transforming 'United Kingdom' into 'UK'), a data validation, a join to another source, or really anything else that changes the format of the input raw data.

Load: The load step refers to the process of copying the data (either the raw or the transformed version) into the target system. Usually, the target system is a Data Warehouse (i.e. an OLAP system used for analytics purposes) or an application database (i.e. an OLTP system).
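To make the three steps concrete, here is a minimal Python sketch of each of them. The API endpoint, the `customers` table, and its `id`/`country` fields are hypothetical stand-ins for a real source and target system.

```python
import sqlite3

import requests


def extract() -> list[dict]:
    # Extract: pull raw records from a persistent source (here, a REST API).
    response = requests.get("https://api.example.com/customers")
    response.raise_for_status()
    return response.json()


def transform(records: list[dict]) -> list[dict]:
    # Transform: normalise the country attribute, e.g. 'United Kingdom' -> 'UK'.
    mapping = {"United Kingdom": "UK", "United States": "US"}
    return [
        {**r, "country": mapping.get(r["country"], r["country"])}
        for r in records
    ]


def load(records: list[dict], target: sqlite3.Connection) -> None:
    # Load: copy the records (raw or transformed) into the target system.
    target.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER, country TEXT)")
    target.executemany(
        "INSERT INTO customers (id, country) VALUES (?, ?)",
        [(r["id"], r["country"]) for r in records],
    )
    target.commit()
```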

Unavoidably, the sequence in which we execute these three steps matters. And with the increasing volumes of data that need to be handled, the order of execution matters a lot. Let’s discuss why!


Extract Transform Load (ETL)

ETL stands for Extract-Transform-Load, and the term refers to a process where the data extraction step is followed by the transformation step, with the load step coming last.

Extract > Transform > Load – Source: Author

The data transformation step in ETL processes occurs in a staging environment outside of the target system, where the data is transformed just before it gets loaded to the destination.
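Reusing the hypothetical functions sketched earlier, the ETL ordering is simply:

```python
# ETL: transform in a staging environment, then load the result.
# The target system only ever sees the already-transformed data.
raw_records = extract()
clean_records = transform(raw_records)  # happens outside the target system

with sqlite3.connect("warehouse.db") as conn:  # "warehouse.db" is a stand-in
    load(clean_records, conn)
```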


Extract Load Transform (ELT)

On the other hand, ELT, which stands for Extract-Load-Transform, refers to a process where the extraction step is followed by the load step, and the data transformation step happens at the very end.

Extract > Load > Transform – Source: Author

In contrast to ETL, in ELT no staging environment/server is required since data transformation is performed within the destination system, which is usually a Data Warehouse or Data Lake hosted on the Cloud.
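In code, the same hypothetical pipeline in ELT ordering loads the raw records first and then runs the transformation where the data already lives, expressed here as plain SQL against the target:

```python
# ELT: load the raw data first, then transform it inside the destination
# system using its own engine (SQLite stands in for a cloud warehouse).
with sqlite3.connect("warehouse.db") as conn:
    load(extract(), conn)  # raw records land in the warehouse untouched
    conn.execute("""
        CREATE TABLE customers_clean AS
        SELECT id,
               CASE country
                   WHEN 'United Kingdom' THEN 'UK'
                   WHEN 'United States' THEN 'US'
                   ELSE country
               END AS country
        FROM customers
    """)
```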


How to choose between ETL and ELT

ETL and ELT come with their pros and cons, and the chances are you will come across both in your day-to-day work given that they are typically used for different use-cases.

ETL is best suited to use cases where data resides on-premises and needs to be structured before being loaded into the target database or warehouse. Therefore, the ETL process is usually preferred when smaller amounts of data are involved and/or when complex transformations need to be performed.

Furthermore, since ETL transforms the data before the load step, sensitive data can be masked, encrypted or completely removed before it’s loaded. This aspect of ETL can help companies and teams enforce and implement compliance to various regulations (e.g. GDPR) more easily.
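As a sketch of that idea, the hypothetical transform step below hashes an email field so the raw value never reaches the target system. (In practice you would salt or key the hash, e.g. with an HMAC, since bare hashes of low-entropy values can be reversed by dictionary attack.)

```python
import hashlib


def mask_pii(records: list[dict]) -> list[dict]:
    # Replace the raw email with an irreversible digest before the load
    # step, so downstream users can still count or join on the column
    # without ever seeing the sensitive value itself.
    masked = []
    for r in records:
        r = dict(r)  # avoid mutating the caller's records
        r["email"] = hashlib.sha256(r["email"].encode()).hexdigest()
        masked.append(r)
    return masked
```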

On the downside, since the transformation happens on an intermediate (staging) server, there is an overhead in moving the transformed data into the target system. In addition, the target system won’t contain the raw data (i.e. the data in its form prior to the transformation). This means that whenever additional transformations are required, we have to pull the raw data once again.

On the other hand, ELT provides more flexibility than ETL, given that the latter was historically intended for structured (relational) data. Modern cloud architectures and warehouses enable ELT for both structured and unstructured data.

As mentioned earlier, ETL is better suited to small volumes of data. With ELT, loading is fast regardless of data size, and transformations are performed on an as-needed basis using the destination system’s compute.

In addition, when the data is loaded before the transformation as part of an ELT process, it means that users and systems still have access to the raw data. This means that if more transformations are required at a later stage, we already have the raw data within the Data Warehouse that can be accessed any time. The only downside would be the additional storage required to store this raw data, but given the constantly decreasing storage cost I don’t think this is a major issue, anyway.
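Continuing the hypothetical sketch, a transformation requested months after the initial load is just another statement over the stored raw table, with no new extraction from the source:

```python
# A new requirement arrives later: customer counts per country.
# Because the raw table is already in the warehouse, this is just a query.
with sqlite3.connect("warehouse.db") as conn:
    conn.execute("""
        CREATE TABLE customers_per_country AS
        SELECT country, COUNT(*) AS n_customers
        FROM customers  -- the untouched raw table
        GROUP BY country
    """)
```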

Now that we all have a good understanding regarding the technical implications of both ETL and ELT processes, let me question something. When it comes to choosing one over the other, is it just a matter of technical implementation?

It’s not just about when to perform a Transformation

The data realm has been constantly evolving, and data roles are no exception to this fast-paced environment. ETL vs ELT is not just about where the transformation step occurs; it’s (also) about who is supposed to perform it.

A transformation step usually involves some sort of business logic. Traditional ETL processes used to be executed by Data Warehouse Engineers (not sure if this is still a thing, to be honest), which means these people were in charge of crafting the business logic, too.

On the other hand, ELT processes evolved out of the nature and formation of modern data teams. The EL (Extract-Load) steps are usually performed by Data Engineers, whereas the Transformation step is executed by the so-called Analytics Engineers.

And it makes so much sense to me. A Data Engineer is a purely technical person concerned with efficiency, scalability, readiness and availability (among a million other things). An Analytics Engineer, on the other hand, is still a technical person, but one with a much better business understanding. It therefore makes more sense for the Analytics Engineer to be responsible for transforming the data, given that transformation (usually) corresponds to business value.

Modern cloud architectures, data stacks (including cloud-based OLAP systems) and team formations have made ELT processes more relevant and effective. From my personal experience, I would say there’s an ongoing shift from ETL to ELT, even though ETL is still relevant and useful.


Modern data stacks and teams favour ELT processes

While ETL is not dead, in my opinion modern data stacks and technologies favour ELT processes. As an example, let’s consider dbt (data build tool), one of the hottest additions to the data realm, which has become the de facto transformation tool for analysts and engineers.

Usually, we want to bring the raw data (i.e. without applying any sort of transformation) from external or internal data sources into the Data Warehouse. Then, on top of these data models (in dbt we usually call them staging models), we build additional models (intermediate and mart models) that are the result of transformation processes occurring within the Data Warehouse.

In such workflows, it therefore makes more sense to load the data into the Warehouse before transforming it. This also enables access to the raw data at any given time, so that future use-cases can be supported.
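To illustrate the pattern dbt automates (each model would normally live in its own SQL file, with dbt managing dependencies and materialisation), here is a hand-written sketch of the staging-to-mart layering against the hypothetical warehouse from earlier:

```python
# Staging model: lightly clean/rename the raw table, no business logic yet.
# Mart model: a business-facing table built on top of the staging layer.
with sqlite3.connect("warehouse.db") as conn:
    conn.execute("""
        CREATE VIEW IF NOT EXISTS stg_customers AS
        SELECT id AS customer_id, country
        FROM customers
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS mart_customers_per_country AS
        SELECT country, COUNT(*) AS n_customers
        FROM stg_customers
        GROUP BY country
    """)
```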

If you are interested in gaining a deeper understanding of how dbt works and how its different components come together to transform raw data and build meaningful data models that support decision-making, I would recommend the following article.

Final Thoughts

Designing data pipelines is a challenging task, and many factors need to be carefully taken into account when doing so. When it comes to ingesting data from data sources into a Data Warehouse, there are typically two approaches you can take.

In this article we discussed how ETL and ELT perform a sequence of steps in order to transform and load (or load and transform) data into a destination system.

Depending on the landscape of the organisation and your specific use-case, you may want to choose one over the other. I hope this article has provided all the details you need in order to choose the best and most effective approach when it comes to ingesting and transforming data.

