Making Sense of Big Data

How to Build Data Engineering Pipelines at Scale

You are at a beach and see lots of shells around you. Let’s use this to learn ETL / ELT, Spark and Airflow.

Sanket Gupta
Towards Data Science
9 min read · Aug 20, 2021


Imagine you are at a beach, watching the waves come and go and noticing all the shells scattered around you. And you get an idea: how about collecting these shells and making necklaces to sell?

You want to make necklaces from shells at a beach. Let’s learn data engineering concepts from this example. [Source]

Well, how would you go about doing this? Maybe you'd collect a few shells, make a small necklace, and show it to your friend. This is an example of an ETL process: you first Extract the shells that look nice (the raw data), then you Transform them by building a necklace, and finally you Load the result by showing it to your friend. If she liked it, you could come back, collect a lot more shells, and make a lot more necklaces.

We will use this example to learn all about building data engineering pipelines. I find that a relatable example makes complex concepts much easier (and more fun!) to understand.

1. Why data engineering?

To create pristine data sources and gain insights

Internet companies create a lot of data. Datasets keep getting larger and messier: there are records for every action and interaction users have with your website, along with product descriptions, images, time series, comments, and more.

It is said that data is the new oil. But just as natural oil has to be drilled and refined before it is useful, data has to be mined and made sense of before it becomes valuable.

On one hand, executives and management want to get insights from these datasets.

On the other hand, data scientists and ML practitioners want clean datasets to model with.

There are some really interesting trade-offs to make along the way, and knowing about them can help you in your own journey as a data scientist or ML practitioner working with large data. Wherever you are in that journey, I think you will find them useful.

2. ETL vs ELT

Make necklaces on beach itself vs transport shells to a warehouse first with trucks

Most companies store disparate data sources (JSONs, CSVs, Parquet files, etc.) in a data lake. A data lake can be built on Amazon S3 or Google Cloud Storage: it is highly available, durable, fully managed by the cloud provider, and really cheap, which is ideal for datasets you may or may not use in the future. The data in your data lake is like the shells on the beach. We need to transform and load it into a clean, pristine structure that executives and data scientists can use. The place to keep clean, pristine, canonical data is a data warehouse. Data warehouses also support analytical queries (COUNT, SUM, GROUP BY) with really low latency because they store data in a columnar format: think of all the transaction amounts co-located in the same place in memory.

We want to transform data in our data lake into a pristine and canonical truth in a data warehouse [Source]
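To make the analytical-query idea concrete, here is a minimal PySpark sketch that runs a COUNT / SUM / GROUP BY directly over columnar Parquet files sitting in a data lake; the same kind of query is what a warehouse runs over its own columnar storage. The bucket path and column names are placeholders.

```python
# A minimal sketch of an analytical query over columnar data.
# The S3 path and column names below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-analytics").getOrCreate()

# Parquet is columnar, so only the columns referenced below are actually read.
transactions = spark.read.parquet("s3a://my-data-lake/transactions/")

summary = (
    transactions
    .groupBy("customer_id")
    .agg(
        F.count("*").alias("num_transactions"),
        F.sum("amount").alias("total_amount"),
    )
)
summary.show()
```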

Coming back to the shells example. You sent your friend a necklace made from shells and she loved it. She now wants you to make 1,000 necklaces to sell. How can we do this?

You could continue with the ETL process, making, say, 1,000 necklaces on the beach itself and transporting only the finished products. The con is that if a necklace had a problem, say a broken shell, you'd have to discard the whole necklace, go back to the beach, and start from scratch. In data land, this means you transformed the data first and then loaded it into a data warehouse. If somebody later finds an issue or a bug in the transformation code, you'd have to track down your raw sources again and start from scratch. So is that bad? What's the alternative?

Well, it is ELT. You take all the shells that look interesting, load them into trucks first, and make the necklaces later. If there is a problem with a shell, well, you have entire trucks of shells waiting for you. Sounds awesome, right? The con is that you need a lot of storage: giant trucks to carry the shells to your warehouse, where a team processes them.

3. When to use ETL vs ELT

For sensitive data, use ETL. When storage cost is not a problem, use ELT.

ETL is basically transforming the shells into necklaces and then loading them onto trucks bound for the warehouse. ELT is loading the interesting shells first and transforming them later inside the warehouse. When would we use one over the other? Well, if storage is an issue, ETL is better. Nowadays, though, warehouse storage is very cheap, so storage cost is usually no longer the deciding factor. A more valid reason to choose ETL over ELT is sensitive data. Imagine some of the shells are poisonous: we don't want them anywhere near our warehouse, so we'd clean and process them first and only then make necklaces. The same goes for sensitive data: we wouldn't want to load PII into a warehouse to be transformed later.

ELT is better when you expect your transformations to change frequently. Transformations can be anything from filling in or imputing missing data to calculating statistics, aggregations, or other derived info. If you expect changes, ELT lets you keep both the pre-transformed and post-transformed data around so you can compare them easily. A common place where ELT shines is web analytics: you load all of the customer interactions and transactions into a warehouse first, transform them there, and finally serve them to business intelligence tools. dbt is a great tool for ELT; it lets you version control the SQL code of the transform step.
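To make the ELT idea concrete, here is a hedged sketch of a transform that runs entirely inside the warehouse, in this case Snowflake via its Python connector. In practice a tool like dbt would own and version control this SQL; the connection parameters and table names are placeholders.

```python
# ELT sketch: raw rows are already loaded into the warehouse, and the
# transform step is plain SQL executed inside the warehouse itself.
# Connection parameters and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

transform_sql = """
CREATE OR REPLACE TABLE clean_orders AS
SELECT
    customer_id,
    COUNT(*)    AS num_orders,
    SUM(amount) AS total_spent
FROM raw_orders
GROUP BY customer_id
"""

cur = conn.cursor()
cur.execute(transform_sql)
cur.close()
conn.close()
```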

4. Big data processing with Spark

To handle massive datasets in a distributed manner

So far in your necklace business, we have assumed it is just you, a few trucks, and a small warehouse. What if people really love your necklaces and you have to process thousands or millions of shells? You'd need to rent or buy several processing machines that take different shells and all work in parallel to build necklaces, because you wouldn't want your customers waiting a long time to receive their necklaces. This is what a big data framework like Spark does.

Spark is a framework for handling massive datasets. It lets you process big data in a distributed manner: it spins up several workers that handle and process chunks of your large dataset, all orchestrated by a driver node. The term "big data" became popular in the late 2000s and early 2010s, along with frameworks like Hadoop, so you might be wondering how Hadoop differs from Spark. Both use MapReduce-style techniques underneath, but Spark keeps chunks of data in memory while Hadoop writes intermediate data to physical files on disk. This makes Spark much faster.

Spark lets you deal with massive scales of data [Source]

Spark lets you do ETL or ELT at scale for billions of records, and it can read from places like S3 and write to S3 or data warehouses. You can also do a hybrid, where one stage extracts raw data and loads it to S3, and another stage transforms that S3 data, imputes missing values, adds new info, and loads it into a warehouse. That is a combination of ETL and ELT.
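Here is a hedged PySpark sketch of that hybrid flow: one stage extracts raw JSON and loads it into the data lake untouched, and a second stage transforms it and writes a columnar copy. All paths and column names are placeholders.

```python
# Hybrid EL + T sketch with PySpark; paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hybrid-el-t").getOrCreate()

# Stage 1: extract raw events and load them into the data lake as-is.
raw = spark.read.json("s3a://my-source-bucket/events/2021/08/")
raw.write.mode("overwrite").json("s3a://my-data-lake/raw/events/2021/08/")

# Stage 2: transform the raw copy, impute missing values, add derived columns.
clean = (
    spark.read.json("s3a://my-data-lake/raw/events/2021/08/")
    .withColumn("amount", F.coalesce(F.col("amount"), F.lit(0.0)))
    .withColumn("event_date", F.to_date("event_timestamp"))
)

# Write a columnar copy that a warehouse (or Spark itself) can query efficiently.
clean.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://my-data-lake/clean/events/"
)
```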

Spark also supports parallel machine learning through MLlib, which divides the data across machines and periodically aggregates parameter updates during gradient descent (sometimes described as an allreduce-style technique). MLlib doesn't cover every ML algorithm yet, but it is slowly catching up.
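As a small illustration, here is a sketch of training a distributed logistic regression with MLlib's DataFrame API; the path, feature columns, and label column are placeholders.

```python
# MLlib sketch: the model is trained on data partitioned across the workers.
# Path, feature columns, and label column are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

df = spark.read.parquet("s3a://my-data-lake/clean/events/")

# Assemble the feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["amount", "num_items"], outputCol="features")
train = assembler.transform(df).select("features", "label")

# Each training iteration aggregates gradient updates computed on the workers.
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
model = lr.fit(train)
print(model.coefficients)
```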

You can either run Spark yourself or use a fully managed service like Amazon EMR or Databricks (read more in my other Medium post). The Learning Spark book by Jules Damji is a really good way to learn about Spark.

5. Data Orchestration with Airflow

You need a systematic way to orchestrate the sequence of tasks.

Finally, when your parallel shell processing becomes a really large operation, you may need an orchestrator: someone who signals that the trucks should start moving, or that processing is delayed, and so on. This orchestrator can be a person or a bot that tracks the individual processes. In data land, the orchestrator can be a set of custom cron jobs or a tool like Airflow. Airflow lets you express everything as a DAG (directed acyclic graph), where each step depends on previous ones, and you control when it runs using cron statements.

We need a way to orchestrate and schedule different tasks [Source]
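Here is a minimal Airflow DAG sketch for the beach pipeline: an extract-and-load task followed by a transform task, scheduled with a cron statement. The task callables are stand-ins for real work such as triggering a Spark job or a warehouse load.

```python
# Minimal Airflow DAG sketch; the Python callables are placeholders for
# real work such as triggering a Spark job or a warehouse load.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_load():
    print("Copy raw shells (files) from the source into the data lake")


def transform():
    print("Run the Spark job that turns raw files into clean tables")


with DAG(
    dag_id="shell_necklace_pipeline",
    start_date=datetime(2021, 8, 1),
    schedule_interval="0 6 * * *",  # cron statement: every day at 06:00
    catchup=False,
) as dag:
    el_task = PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
    t_task = PythonOperator(task_id="transform", python_callable=transform)

    # The DAG edge: transform only runs after extract_and_load succeeds.
    el_task >> t_task
```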

Airflow needs somewhere to be hosted so it can run its webserver. Amazon offers the Managed Workflows for Apache Airflow (MWAA) service. It is a great option if you are in the Amazon ecosystem, since it only requires you to upload your DAGs to S3, but its support for the latest Airflow releases tends to lag. If being on the latest release matters to you, try Astronomer.io; they are great.

You might also want monitoring for such a large system. Airflow lets you orchestrate and see which stages fail, but you might want Spark's Ganglia UI to see per-worker metrics such as CPU usage, free memory, and network bandwidth. You can also use Datadog to monitor these.

6. Choice of storage formats and data warehouses

While CSV and JSON are great for general data science because they are human readable, for large datasets it is better to use file formats like Parquet, Avro, or Delta. Parquet is a columnar storage format, while Avro is row based. Parquet is common with big datasets because it lets you query specific columns quickly. Another option is the Delta format, which sits on top of Parquet and adds a transaction log, a history of changes, and other features. You can use any of these with Spark and store the data in S3.
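As a sketch, here is how the same DataFrame could be written as plain Parquet and as Delta with Spark. The Delta write assumes the delta-spark package is available on the cluster and the session is configured with the Delta extensions; the paths are placeholders.

```python
# Writing the same data as Parquet and as Delta; paths are placeholders.
# The Delta configs assume the delta-spark package is on the cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("storage-formats")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

df = spark.read.json("s3a://my-data-lake/raw/events/")

# Plain columnar Parquet: fast column-level reads, no transaction log.
df.write.mode("overwrite").parquet("s3a://my-data-lake/clean/events_parquet/")

# Delta: Parquet files underneath, plus a transaction log that adds
# history, time travel, and safer concurrent writes.
df.write.format("delta").mode("overwrite").save("s3a://my-data-lake/clean/events_delta/")
```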

You can periodically load this data into a data warehouse. There are various options: Redshift is the AWS offering, BigQuery is the GCP offering, and Snowflake is another option. I have compared Redshift and Snowflake extensively and have been really impressed with how fast and smooth Snowflake is. It also supports a pay-as-you-use model where you are charged only for the queries you run, whereas Redshift charges you as long as the instances are running, whether you use them or not.
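For the loading step, here is a hedged sketch of writing a Spark DataFrame into Snowflake using the Spark-Snowflake connector; it assumes the connector is available on the cluster, and the connection options, paths, and table name are placeholders.

```python
# Sketch of loading a cleaned Spark DataFrame into a Snowflake table.
# Assumes the Spark-Snowflake connector is on the cluster; connection
# options, paths, and the table name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-to-snowflake").getOrCreate()
clean = spark.read.parquet("s3a://my-data-lake/clean/events/")

sf_options = {
    "sfURL": "my_account.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "LOAD_WH",
}

(
    clean.write.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "CLEAN_EVENTS")
    .mode("overwrite")
    .save()
)
```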

Yet another option is the more recent lakehouse architecture, which combines the data lake and the warehouse in one place. Delta Lake is one such offering from Databricks: it builds on the Delta storage format underneath and provides a way to query data much like a warehouse.

Summary

An ETL pipeline can be built where raw data from a data lake (S3) is transformed in Spark and then loaded into a data warehouse like Snowflake or Redshift, which in turn powers a BI tool like Tableau, Looker, or Power BI. This meets the use case of managers looking to make data-driven decisions.

You can also collect data from disparate sources in a warehouse first, use Spark to transform these large datasets, load them into S3 as Parquet files, and then read them from SageMaker, in case you prefer SageMaker over Spark's MLlib. This meets the use case of data scientists and ML engineers.

What have we learnt?
1. We started with just a few shells and a necklace to understand the trade-offs of ETL vs. ELT: make necklaces on the beach itself, or transport shells to a warehouse first with trucks.
2. We scaled this operation up and understood the use case for Spark.
3. We hooked everything up using an orchestrator like Airflow.

This concludes the blog post on understanding data engineering pipelines. Thanks for reading this far! For any questions or comments, reach out to me on LinkedIn.

P.S. Check out my new Substack!
I have started a new Substack where you can read more about my musings on ML, MLOps and LLMs. Follow me here to get articles right in your inbox.

Join me on Substack at sanketgupta.substack.com



At the intersection of machine learning, design and product. Host of The Data Life Podcast. Opinions are my own and do not express views of my employer.