ETL vs. ELT and all that is wrong with the ‘E’

Understanding the implications of performing transformations during data ingestion and of extracting data that is not yet ready or meant for external consumption

Prajwalan Karanjit
Towards Data Science


Image via Pixabay under Pixabay License

ETL, or Extract Transform Load, has become synonymous with the process of data integration, particularly in data warehouses. ETL has been quite successful, with many tools focusing on just that. But product and tooling companies sell their products, and what they sell is not necessarily the better architectural choice. This article addresses some of the issues with the concepts in ETL: the transformation T in the middle, and the extraction E.

The Almighty T

The industry has realized that performing transformations on the way can lead to issues; this point has been made in many articles and references. At first, it may sound like a nice thing: we prepare the data and load it so that the subsequent analysis and reporting become easier. And sure enough, it delivers that. All good, except when things go bad or new analytical work needs to be performed.

  1. When things go bad, e.g. there is a mismatch in the BI reports or analytics output so large that the deviation is an obvious error, there are three places where it could have gone wrong: first, during the analysis; second, during the transformation T; and third, in the source systems themselves. Even worse is when blame games start between those working with the business systems and those in data analytics. If there had been no T, it would be much easier to locate the issue: it is either in the analytics or in the source.
  2. When the data prepared for one analytical use case is reused as the source for another analytical use case, it may simply be the wrong starting point. Even worse is when this T is based on some organization-wide canonical data model. Without T in the pipeline, each analytics team would perform its own context-specific transformations.
  3. The T performed during ingestion may modify the data so that it no longer matches the source data. This can lead to misinterpretation: the people working with the source systems will not be able to explain the data, and if there is no proper lineage and documentation of T, the data ingestion team may not be able to explain it either. This is likely another cause of errors.
  4. Moreover, business is always changing: customer requirements change, and therefore the IT systems and the underlying data change too. Incorporating such change requires the data ingestion team to adapt the T they placed during ingestion, and that adaptation ripples through all the downstream analytics. Avoiding the ripple would require all the analytical work to adapt to the new T for new incoming data, which in turn means greatly reduced agility.
  5. The people performing the T need domain knowledge of both the source systems and the target system (e.g. the data warehouse). For such reasons, a data ingestion team performing T on data they do not own resembles a dubious integration team (a separate topic of its own).

Can we do it better? Perhaps.

Keep data pipelines simple

ELT, or Extract Load Transform, on the other hand, ensures that the data loaded into staging is always the same as the data in the source systems. It is then up to every analytical workload to take that data and apply the context-specific T, and the keyword here is “context-specific”. The starting point for every workload is the original data, and therefore each analytical team interacts with the source system teams for explanations or any other cooperation. The middle-man from the ingestion team is gone from such meetings.

It also makes the job of the ingestion team easier, as they are no longer the middle-man, which most likely nobody wants to be. They only create the data pipelines, monitor them, and ensure they do just that.
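
As a rough illustration, an ELT pipeline could look something like the sketch below (the table names and connection strings are hypothetical, and it assumes pandas and SQLAlchemy are available): the load step copies the data verbatim, and the context-specific T lives with the analytical workload, not in the pipeline.

```python
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("postgresql://source-host/appdb")       # hypothetical source system
staging = create_engine("postgresql://warehouse-host/staging")  # hypothetical staging area

def load(table: str) -> None:
    """E and L only: copy the table as-is, with no transformation on the way."""
    df = pd.read_sql_table(table, source)
    df.to_sql(table, staging, if_exists="replace", index=False)

def revenue_per_country(orders: pd.DataFrame) -> pd.DataFrame:
    """Context-specific T, owned by the analytical workload."""
    return orders.groupby("country", as_index=False)["amount"].sum()

load("orders")
report = revenue_per_country(pd.read_sql_table("orders", staging))
```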

Compare this to a postman delivering letters, where the postman is an analogy for data ingestion. The postman just delivers the letters “as-is”. If he or she started opening the letters, fixing typos, rewriting the text, and doing translations, I guess nobody would like that. It is the job of the recipient to do any of that, and only as required.

So ELT is better. But then again, there is still a problem and that problem is with the concept of E.

Why not E?

With E, or Extract, the data pipelines are “extracting” data from the source systems, which in most cases happens directly from the underlying database or data store. Do you see the problem now?

I see several.

  1. Exposing and sharing the application’s underlying data creates serious dependencies. The application team needs to be aware that some data lake or warehouse is consuming this data and that changing the data model may cause trouble there. It takes away independence and drains agility from the application team, keeping them from reacting quickly to new customer requirements.
  2. This dependency leads to a lack of clarity in the ownership of the application platform and the data platform. Should the data models and data be updated because there is a new requirement in the business process, or because there is a new analytics requirement? In either case, the other party gets unnecessarily affected.
  3. E means that the data on the outside is very likely the same as the data on the inside, so the consumer takes on the application’s (and the domain’s) internal representations. Certain data may not be standardized, e.g. using some application-specific date format instead of UNIX seconds (or milliseconds) or an ISO date format. Identifiers might also be internal, e.g. the application’s internal customer id instead of the customer id from the customer master data system (see the sketch after this list).
  4. Further, the data on the inside is “mutable”, as there is an application running on top of it. Data may be updated and the model may change (especially with schema-less databases such as MongoDB, Cosmos DB, or DynamoDB). This creates mismatches, and the data pipelines fail with errors. Not every change should or needs to be reflected outside, but by exposing the data on the inside, every change in the application demands an update in the data pipeline, the staging areas, and likely further in the analytical workloads.
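
To make point 3 concrete, here is a small, purely illustrative sketch (the field names, the date format, and the master-data mapping are all hypothetical): data read straight from the application’s store carries internal conventions, while data intended for the outside should use standardized ones.

```python
from datetime import datetime, timezone

# What a pipeline extracting directly from the application's database might see:
internal_record = {
    "cust_ref": "APP-00042",        # application-internal customer id
    "created": "17/03/2023 09:15",  # application-specific date format
}

# What a standardized, externally consumable record could look like:
def to_external(record: dict, master_data_ids: dict) -> dict:
    created = datetime.strptime(record["created"], "%d/%m/%Y %H:%M")
    return {
        "customer_id": master_data_ids[record["cust_ref"]],              # master-data id
        "created_at": created.replace(tzinfo=timezone.utc).isoformat(),  # ISO 8601
    }

print(to_external(internal_record, {"APP-00042": "CUST-98765"}))
# {'customer_id': 'CUST-98765', 'created_at': '2023-03-17T09:15:00+00:00'}
```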

A better way

Instead of data pipelines pulling and extracting data, they should just read what has been made available. Applications should expose the data that they prepare for “external consumption”, in a way that meets the implicit expectations (e.g. quality, trustworthiness, interoperability) placed on any such exposed data.

The application is obligated to apply product thinking and ensure that the data exposed is of good quality, is standardized according to whatever data standard is applicable, and follows the associated schema definition. Practically, the approach could be APIs, events, or even an intermediate secondary database containing a copy of the data meant for external consumption.
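
As one possible shape of the API variant (a minimal sketch with hypothetical endpoint and field names, assuming Flask; events or a secondary database would serve the same purpose), the application publishes only curated, standardized records and keeps its internals to itself.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def externally_ready_orders() -> list[dict]:
    """Returns only curated, standardized records meant for external consumption."""
    # In a real application this would read from a table or view maintained
    # specifically for external consumption, not from the internal model.
    return [
        {
            "customer_id": "CUST-98765",                # master-data id, not internal
            "created_at": "2023-03-17T09:15:00+00:00",  # ISO 8601
            "amount": 120.0,
            "currency": "EUR",
        }
    ]

@app.get("/external/orders")
def orders():
    return jsonify({"orders": externally_ready_orders()})
```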

Whatever the approach, the main thing is that the application exposes the data it has made “ready” for “external” consumption, rather than the data pipeline extracting whatever it wants directly from the source database.

The data pipelines no longer perform E. They only load the data into staging, for example by calling the API or subscribing to the events.
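
Continuing the sketch above (the URL, table name, and connection string are hypothetical, assuming requests, pandas, and SQLAlchemy), the pipeline simply loads what the application has published:

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

staging = create_engine("postgresql://warehouse-host/staging")  # hypothetical staging area

def load_orders() -> None:
    """L only: fetch the published data and land it unchanged in staging."""
    response = requests.get("https://orders-app.example.com/external/orders")
    response.raise_for_status()
    df = pd.DataFrame(response.json()["orders"])
    df.to_sql("orders", staging, if_exists="append", index=False)

load_orders()
```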

Coming back to the postman analogy: the postman delivers what has been handed over, packaged in an envelope with a proper stamp. The postman does not walk into people’s houses and grab letters while you are still writing them. You as a writer can write and rewrite as many letters as you like, but you hand over only the one that must be delivered, packaged in an envelope with a stamp. The postman picks up just that.

In other words, a postman neither performs E, nor T.

So, then it is just LT, with T being applied later by whoever needs to do that.
