Building a data pipeline from scratch on AWS

Rayane de Araújo
Towards Data Science
7 min read · Sep 5, 2019


When you start diving into the data world, you will see that there are many approaches you can take and many tools you can use. It may feel a little overwhelming at first.

In this post, I will try to help you understand how to pick the appropriate tools and how to build a fully working data pipeline in the cloud using the AWS stack, based on a pipeline I recently built.

The pipeline discussed here supports all data stages, from data collection to data analysis. The intention is to walk you through the whole process I went through to build my first data pipeline, so that by the end of this post you will be able to design your own architecture and justify your choices.

Which tools should I use?

Let’s tackle the first question that might come to your mind: what are the right tools for building that pipeline? The answer I found while building mine was:

There is no single right tool or architecture; it will always depend on your needs!

If you need to process streaming data, Kinesis may be a good fit, but if you have budget limitations and do not mind taking care of the infrastructure yourself, you can go for Kafka. If you only need to process historical data, you will not need streaming tools at all, and Glue, on the other hand, can be a great friend. In other words, your needs will be the judge of what is best for you. The important thing here is to understand your challenge and know your limitations in order to choose the right tools.

The challenge

By the time I joined the company, there was a big problem: the data was too isolated. Analyzing data was so slow and difficult that people could not find the motivation to do it. The challenge was to centralize that data and promote data democratization across the company in order to empower people! A big challenge, right?

What was the scenario?

The data sources we had at the time were diverse. There was data we had to collect from the Facebook Ads API, the AdWords API, Google Analytics, Google Sheets and an internal system of the company. In order to collect data from those sources, I built a Node.js application, since Node.js handles asynchronous I/O well, which speeds things up when collecting data from several APIs in that scenario.

The pipeline

The proposed pipeline architecture to fulfill those needs is presented in the image below, with a few improvements that we will be discussing.

Data Ingestion

The first step of the pipeline is data ingestion. This stage is responsible for running the extractors that collect data from the different sources and load it into the data lake.

To run the Node.js scripts that do exactly this, we were using an EC2 instance on AWS, but a great improvement I recommend is to use Lambda to run those scripts instead. Lambda is a great serverless solution provided by AWS. By using Lambda, you do not need to worry about maintaining a server, nor pay for a server running 24 hours a day that you will only use for a few hours.
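
To make this more concrete, here is a minimal sketch of what such an extractor could look like as a Lambda function. My extractors were written in Node.js; this example is in Python just to illustrate the idea, and the bucket name, source URL and key layout are all hypothetical:

```python
import json
import os
from datetime import datetime, timezone

import boto3
import urllib3

s3 = boto3.client("s3")
http = urllib3.PoolManager()

# Placeholders: point these at your data lake bucket and the API you extract from.
BUCKET = os.environ.get("DATA_LAKE_BUCKET", "my-data-lake")
SOURCE_URL = os.environ.get("SOURCE_URL", "https://example.com/api/report")


def handler(event, context):
    """Pull one report from an API and drop the raw payload into the data lake."""
    response = http.request("GET", SOURCE_URL)
    records = json.loads(response.data)

    # Store the payload untouched in the raw layer, keyed by extraction date/time.
    now = datetime.now(timezone.utc)
    key = f"raw/example_source/{now:%Y/%m/%d}/extract_{now:%H%M%S}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records))

    return {"records": len(records), "s3_key": key}
```

You can then trigger the function on a schedule with an EventBridge (CloudWatch Events) rule, one rule per extractor if you like.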

Data Storage

But where should I load that data? S3 is a great storage service provided by AWS. It is both highly available and cost-efficient, and it can be a perfect foundation for your data lake. Once the scripts extracted the data from the different data sources, it was loaded into S3.

It is important to think about how you want to organize your data lake. For this pipeline, since we would not have a team of scientists and analysts working on that data, and since our data arrived from the sources fairly well organized, I created only a raw partition on S3 where I stored data in its true form (the way it came from the source), with just a few adjustments made in the Node.js script.

However, if you would like to have data scientists and analysts working on that data, I advise you to create other partitions in order to store data in a form that suits each of those users. You can create three directories here, like this (see the example layout after the list):

  • Raw: here you will store data in its true form, the way it came from the source, without modifications.
  • Transformed: after transforming the data and treating issues such as standardization and missing values, it will be loaded here. This data will be useful for data scientists.
  • Enriched: for analysis you will have to enrich the data. You may want to create a One Big Table (OBT) that suits your business rules, so that all the information an analyst needs is in one place. That is the data that will be stored in this layer.
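
As an illustration, a data lake organized this way might end up with S3 keys like the following (the bucket and dataset names are made up):

```
s3://my-data-lake/raw/facebook_ads/2019/09/05/extract_120000.json
s3://my-data-lake/transformed/facebook_ads/part-00000.parquet
s3://my-data-lake/enriched/marketing_obt/part-00000.parquet
```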

Now you may ask: how will I transfer data from one stage to another? And the answer is: it depends! If you have a small volume of data that will not exceed 3,008 MB of memory and 15 minutes of execution time (those were the Lambda limits when I wrote this post; check whether they still apply), Lambda can be a good solution. You can create transformation and enrichment functions, so that you can process data from one stage and load it into the next. If your data exceeds those limits, you can go for Glue, which is a very useful tool for that. On this pipeline, I used Glue to perform the transformations on the data, but since I did not implement the transformed and enriched stages, I used it to load data directly into the data warehouse. If you do need those three (or more) stages, Glue can also be a nice solution for them. However, if you need to handle a really large volume of data, an EMR cluster may be a better fit. It will depend on the volume of data you are processing, how fast you need to process it, and how much you can spend.
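
To illustrate the Glue option, here is a minimal sketch of a PySpark Glue job that reads raw JSON from the data lake, applies a simple cleanup, and writes Parquet to a transformed layer. The bucket, prefixes and column names are hypothetical, not the ones from my pipeline:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw JSON files from the data lake (path is a placeholder).
raw = spark.read.json("s3://my-data-lake/raw/facebook_ads/")

# Example cleanup: standardize a column name and drop rows missing it.
transformed = (
    raw.withColumnRenamed("campaign_name", "campaign")
       .dropna(subset=["campaign"])
)

# Write the result to the transformed layer as Parquet, ready for
# further enrichment or for querying with Redshift Spectrum.
transformed.write.mode("overwrite").parquet(
    "s3://my-data-lake/transformed/facebook_ads/"
)

job.commit()
```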

Data warehouse

Now that your data is already in your data lake, transformed and enriched, it is time to send it to a data warehouse! I have been using Redshift for a while now and have had a great experience with it. It is a performant and reliable solution at a fair price. Redshift also provides a great feature called Redshift Spectrum, which makes it possible to query data directly from your data lake on S3.

For my solution, since the volume of data was not a problem, I stored all the data on Redshift and gained in performance. However, if you have a large volume of data, it can become expensive to maintain all the historical data in Redshift, so it is better to store only the most recent data on Redshift and leave the historical data on S3. In addition, keep in mind that storing that historical data on S3 in a columnar format such as Parquet will greatly decrease the cost of querying it with Redshift Spectrum.
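
To give an idea of how Spectrum fits in, here is a small sketch using psycopg2 (Redshift speaks the PostgreSQL wire protocol). It assumes an external schema named spectrum has already been created and pointed at your data catalog; the connection details, table, columns and S3 location are all placeholders:

```python
import psycopg2

# Connection details are placeholders for your own cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="awsuser",
    password="change-me",
)
# External table DDL cannot run inside a transaction block.
conn.autocommit = True
cur = conn.cursor()

# Expose the Parquet files on S3 as an external table via Redshift Spectrum.
cur.execute("""
    CREATE EXTERNAL TABLE spectrum.ads_history (
        campaign   VARCHAR(256),
        spend      DOUBLE PRECISION,
        event_date DATE
    )
    STORED AS PARQUET
    LOCATION 's3://my-data-lake/transformed/facebook_ads/';
""")

# Query the historical data sitting on S3 as if it were a regular table.
cur.execute("""
    SELECT campaign, SUM(spend) AS total_spend
    FROM spectrum.ads_history
    GROUP BY campaign
    ORDER BY total_spend DESC;
""")
for campaign, total_spend in cur.fetchall():
    print(campaign, total_spend)

conn.close()
```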

Data Visualization

What is data worth if people cannot access it? As the last step, you will need to integrate a visualization tool into your pipeline. The tool I chose for that was Metabase.

Metabase is a great open source visualization tool. It offers an intuitive, user-friendly interface, so users with no knowledge of SQL or query languages can explore data and create charts and dashboards to visualize their results. Metabase also allows users to define notifications via email and Slack, receive scheduled emails about chosen metrics or analyses, create collections that group data by company division, create dashboards to present analyses, restrict access by user group, and so on.

Wrapping Up

In this post, we discussed how to implement a data pipeline using AWS solutions. I shared some of the things I used to build my first data pipeline and some of the things I learned from it. Of course, there is a lot more you can add to improve it, such as logging, but this is already a big step to get started. I hope by now you have a very good idea of how to start building your own pipeline!

Thanks for reading! If you have any questions or suggestions, just let me know and I will be happy to discuss them with you :)

