5 Steps to Build Efficient Data Pipelines with Apache Airflow

Uncovering best practices to optimise big data pipelines

Vachan Anand
Towards Data Science


Apache Airflow

Airflow is an open-source workflow orchestration tool. Although it is used extensively to build data pipelines, Airflow can manage a wide variety of workflows.

Simply put, if we need a scalable system that performs a set of tasks in an orderly fashion and interacts with different components along the way, we can manage such a workflow with Airflow. We define these operations as tasks inside DAGs (directed acyclic graphs).
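As a minimal sketch (the DAG name, schedule and task logic here are placeholder choices, not code from this blog), a workflow is declared as a DAG and its tasks are wired together with dependency operators:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    # Placeholder for an arbitrary transformation step.
    print("transforming data")


with DAG(
    dag_id="example_workflow",        # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extracting data'",
    )
    process = PythonOperator(
        task_id="transform",
        python_callable=transform,
    )

    # The dependency below is what makes this a directed acyclic graph.
    extract >> process
```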

In this blog, we will uncover some practices to build workflows efficiently.

Developing Cost-Efficient Data Pipelines

As a data engineer, one of the major concerns on any project is the efficiency of a data pipeline that has to process terabytes of data. Although the solution is often straightforward, there are instances where the pipeline architecture, the infrastructure or the underlying data model hinders the pipeline's ability to run efficiently, rendering the solution impractical.

Let us look at some strategies we can use to tackle enormous data volumes without compromising the effectiveness of the pipeline.

1. Data Modelling


Building a data lake or a data platform is not equivalent to dumping the data onto a modern stack. Instead, it usually requires the engineers to identify models that support the efficient use of data. The modelling usually includes normalisation and standardisation of data to remove duplicates, enforce data standards and build new entities by grouping data based on business domains. The benefits of an appropriate data model are threefold:

  • It increases the capabilities of data platforms by reducing the unnecessary load on the infrastructure.
  • It simplifies the management of the organisation's data assets, supporting efficient data analytics and other frontend applications.
  • Because the data is grouped by business domain, it also helps with data governance and cataloguing.

Although data modelling is not exclusive to Apache Airflow, it plays a crucial role in building effective data pipelines.

2. Choosing the Optimal Operator

As in many other fields, there are several ways to solve any given problem in data engineering. Airflow offers similar flexibility by providing a range of operators. Choosing the correct operator can improve pipeline performance and significantly reduce development time.

The right operator varies from use case to use case, and choosing it usually depends on the following:

  • Data Source and Destinations
  • Operator Flexibility
  • Operator Efficiency
  • Cost Overhead

a. Data Source and Destinations

While developing a data pipeline, the data usually passes through several stages. In the case of a data lake, the data might have to go through a landing zone and a transformed zone before making it into the curated zone.

Therefore, the case may arise where an Airflow operator needs to export data from a SQL database to an S3 bucket in our data lake, or read files from an S3 bucket and load them into a database or another S3 bucket, depending on the architecture of our data platform.

Each operator's source and destination can narrow the search space. For instance, if the task is to export data from a SQL Server database to an S3 bucket, it can be done with a PythonOperator or a BashOperator, but SqlToS3Operator is a better candidate due to its simplicity and performance.

Additionally, we get even better performance with this operator if the Airflow cluster runs in the AWS cloud and the source is, for instance, an RDS instance, as the data never leaves the AWS environment and moves between services over AWS's internal infrastructure.
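As a rough sketch of such a task (the connection IDs, bucket, table and key below are placeholders, and the operator requires the Amazon provider package), SqlToS3Operator can move a query result straight into the landing zone:

```python
from airflow.providers.amazon.aws.transfers.sql_to_s3 import SqlToS3Operator

# Defined inside a DAG; "mssql_default" and "aws_default" are assumed to be
# connections already configured in Airflow.
export_orders = SqlToS3Operator(
    task_id="export_orders_to_s3",
    sql_conn_id="mssql_default",          # source database connection (placeholder)
    aws_conn_id="aws_default",            # AWS credentials connection (placeholder)
    query="SELECT * FROM dbo.orders",     # hypothetical source table
    s3_bucket="my-data-lake-landing",     # hypothetical landing-zone bucket
    s3_key="orders/orders.csv",
    file_format="csv",
    replace=True,
)
```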

b. Operator Flexibility

Although Airflow has the advantage of hooks and operators contributed by several providers, there are cases where the out-of-the-box operators are not viable due to the complexity of the task. In such cases, we can fall back on the PythonOperator or BashOperator, which give us enough flexibility to produce the desired results.

Both PythonOperator and BashOperator have their benefits and pitfalls. On the one hand, PythonOperator is easy to understand and implement, which keeps development overhead low. On the other hand, although BashOperator is notorious for being less intuitive, it makes up for it with better performance.

c. Operator Efficiency

Although several operators can achieve the desired ETL/ELT job, not all of them are optimal; some perform better than others. Building on the previous point about PythonOperator vs BashOperator, this section compares the two operators' performance.
We create the following pipeline to evaluate the time taken to complete a simple export task.

(Image by author: a DAG with two tasks, a BashOperator export and a PythonOperator export.)

Both operators in the DAG above perform the same task: each exports a table with more than 2 million records from an RDS instance running MS SQL Server to a CSV file.

The BashOperator drives its export through a bash script, while the PythonOperator does the work in Python code.
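As a rough approximation of that setup (the bcp-based bash command, the hook-plus-pandas Python export, and all connection details, table names and file paths below are assumptions, not the exact code behind the benchmark):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

TABLE = "my_database.dbo.big_table"   # hypothetical table with 2M+ records

# bcp invocation for the BashOperator; server, credentials and paths are placeholders.
BCP_EXPORT = (
    f"bcp {TABLE} out /tmp/big_table_bash.csv "
    "-S my-rds-endpoint -U my_user -P my_password -c -t,"
)


def export_with_python():
    """Read the full table through a database hook and dump it to CSV."""
    hook = MsSqlHook(mssql_conn_id="mssql_default")
    df = hook.get_pandas_df(f"SELECT * FROM {TABLE}")
    df.to_csv("/tmp/big_table_python.csv", index=False)


with DAG(
    dag_id="operator_benchmark",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    bash_export = BashOperator(task_id="bash_export", bash_command=BCP_EXPORT)
    python_export = PythonOperator(
        task_id="python_export", python_callable=export_with_python
    )
```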

The evaluation of the operators is as follows:

  • We can observe that the BashOperator took about 316 seconds to export the 2 million records.
  • On the contrary, the PythonOperator took about 470 seconds to perform the same export.
  • The PythonOperator therefore showed an almost 50% increase in runtime for the same export on the same infrastructure.

That said, the PythonOperator comes with its own benefits. Python is easy to learn and write, and its extensive ecosystem of libraries is convenient when developing pipelines for machine learning use cases.
Therefore, my rule of thumb is: if we have to export or import small chunks of data, or need publicly available libraries for development speed and flexibility, the PythonOperator is the way to go. However, for bulk loads with heavy transformations, the BashOperator serves us well thanks to its performance.

d. Cost Overhead

Finally, we need to consider any additional cost associated with using an operator. For instance, AWS provides Database Migration Service (DMS) operators to migrate data from databases or external sources to the cloud platform. DMS offers ongoing replication, minimal downtime and other features that make data migration easy. However, that convenience comes at a price. Therefore, if price is not an issue and the data platform is on AWS, the DMS operators become suitable candidates.

3. Batching

In the section above, we looked at several operators for loading data into our data platform. Even if we choose the operators that execute each task most efficiently, every operator has a limit to the performance it can achieve.
While building data pipelines, there are times when, even with the best possible operators, the pipelines are still relatively slow. More often than not, the poor performance is because the pipeline is not utilising the underlying infrastructure's full potential. And sometimes the time taken to process the data is critical enough to determine the project's fate.

In such cases, we can tackle the problem by designing the pipeline itself more efficiently.

Divide and Conquer!

For illustration, if the task at hand is to export or import an enormous database with terabytes of data, rather than exporting full tables as described in the sections above, we can move the data in batches.

(Image by author: the same export split into multiple batch tasks instead of a single full-table task.)

Some of the things to look out for while designing a pipeline with batches are as follows (a sketch of such a DAG follows the list):

  • The pipeline should maintain data integrity. Care should be taken to ensure that the batching logic does not create duplicates or lose records.
  • Although parallelism is good, it should only go as far as it does not disrupt other applications sharing the same infrastructure. The degree of parallelism can be tuned based on metrics such as CPU utilisation, memory usage, etc.
  • Efficient queries should be constructed to retrieve batches from the source system, as they can significantly impact the pipeline's performance.
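A minimal sketch of this pattern, assuming a known row count and a hypothetical export_batch.sh wrapper around the actual export command (covered in the next section), generates one task per batch and caps how many run at once:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

ROW_COUNT = 100_000_000   # assumed, known row count of the source table
BATCH_SIZE = 1_000_000

with DAG(
    dag_id="batched_export",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
    max_active_tasks=8,   # cap parallelism so other workloads are not starved
) as dag:
    for batch, first_row in enumerate(range(1, ROW_COUNT + 1, BATCH_SIZE)):
        last_row = first_row + BATCH_SIZE - 1
        # One independent export task per batch; batches run in parallel
        # up to the max_active_tasks limit.
        BashOperator(
            task_id=f"export_batch_{batch}",
            bash_command=f"/opt/scripts/export_batch.sh {first_row} {last_row}",
        )
```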

4. Efficient Queries & Scripts

A data pipeline's performance depends not only on the architecture, i.e. the chosen operators and the degree of task parallelism, but also on the queries and transformation logic it runs. Inefficient queries increase the pipeline's processing time and add unnecessary load to the underlying infrastructure.

This problem gets particularly aggravated when designing a pipeline with batching. This section explores two ways to export data in batches and compares their performance.

  • In the first method, we use bcp's -F (first row) and -L (last row) flags to export data in batches, as sketched below.
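A sketch of this approach (server, credentials, table name and paths are placeholders), looping over batches and letting bcp skip to the right rows with -F and -L:

```python
import subprocess

ROW_COUNT = 100_000_000   # assumed total number of records
BATCH_SIZE = 1_000_000
TABLE = "my_database.dbo.big_table"   # placeholder
BCP_AUTH = "-S my-rds-endpoint -U my_user -P my_password -c -t,"

for batch, first_row in enumerate(range(1, ROW_COUNT + 1, BATCH_SIZE)):
    last_row = first_row + BATCH_SIZE - 1
    # -F/-L make bcp skip the rows before the batch, but it still has to read
    # through all of them, so later batches take longer and longer.
    subprocess.run(
        f"bcp {TABLE} out /tmp/batch_{batch}.csv "
        f"-F {first_row} -L {last_row} {BCP_AUTH}",
        shell=True,
        check=True,
    )
```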


  • In the second method, we perform the same task more efficiently by selecting each batch with an optimised SQL query, as sketched below.
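A sketch of this second approach (again with placeholder names, and assuming a dense, indexed integer id column), where each batch is selected by a key range so bcp never scans the preceding rows:

```python
import subprocess

ROW_COUNT = 100_000_000
BATCH_SIZE = 1_000_000
TABLE = "my_database.dbo.big_table"   # placeholder
BCP_AUTH = "-S my-rds-endpoint -U my_user -P my_password -c -t,"

for batch, start_id in enumerate(range(1, ROW_COUNT + 1, BATCH_SIZE)):
    end_id = start_id + BATCH_SIZE - 1
    # The WHERE clause on an indexed id column lets SQL Server seek straight to
    # the batch, so every batch export takes roughly the same time.
    query = f"SELECT * FROM {TABLE} WHERE id BETWEEN {start_id} AND {end_id}"
    subprocess.run(
        f'bcp "{query}" queryout /tmp/batch_{batch}.csv {BCP_AUTH}',
        shell=True,
        check=True,
    )
```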


Note: Although we test the batching performance with a single script, each iteration is independent of the previous ones, so the results are a fair estimate of how the operation would behave if parallelised. Both approaches use bcp and the same infrastructure and differ only in their batching logic.

Although both methods look similar, the runtimes show that the second is much faster than the first. The difference becomes even more noticeable with large datasets containing hundreds of millions of records, because the export time of the first method grows with every additional batch.
The first method slows down because bcp's -F/-L flags still scan the result set from the beginning: if we have 100 batches of a million records each, exporting the 100th batch forces bcp to read through 99 million records before it can even start writing data. As the number of batches increases, so does the time taken to export each one.
With the second method, however, the query seeks directly to the desired ID range and reads only the records needed for that batch, so each batch export takes roughly the same time.

5. Choosing The Right Infrastructure


Data pipelines can only perform as well as the underlying infrastructure supporting them. A well-designed data pipeline on below-par infrastructure will deliver inferior results, and vice versa. Therefore, it is essential to understand and provision the infrastructure needed to build efficient pipelines.

There are many ways to achieve adequate infrastructure. For instance, Airflow integrates well with technologies like Kubernetes, and managed Kubernetes on cloud platforms such as AWS, GCP or Azure can help realise the full potential of data pipelines. These platforms become especially powerful when integrated with Airflow because they can automatically scale the infrastructure up or down based on metrics like CPU utilisation and memory usage. A small example of this pattern follows the list below.

The methodology described above can help create a robust infrastructure because of the following:

  • Kubernetes can scale the pods (one per task in our data pipeline) up or down on the available hardware.
  • Additionally, cloud platforms can scale the Kubernetes cluster itself by adding or removing hardware resources, growing as the load from the data pipelines increases.
  • Moreover, the two technologies work together to make the infrastructure more reliable.
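As a small illustration of this setup (image name, namespace and resource figures are placeholders, and parameter names and import paths vary slightly between versions of the cncf.kubernetes provider), a heavy task can be pushed onto the Kubernetes cluster as its own pod, giving the cluster autoscaler a clear signal of the capacity it needs:

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from kubernetes.client import models as k8s

# Defined inside a DAG; the task runs in its own pod on the Kubernetes cluster.
heavy_transform = KubernetesPodOperator(
    task_id="heavy_transform",
    name="heavy-transform",
    namespace="airflow",                       # placeholder namespace
    image="my-registry/transform-job:latest",  # hypothetical job image
    cmds=["python", "transform.py"],
    # Explicit resource requests let Kubernetes schedule the pod sensibly and
    # tell the cloud autoscaler when to add nodes.
    container_resources=k8s.V1ResourceRequirements(
        requests={"cpu": "1", "memory": "2Gi"},
        limits={"cpu": "2", "memory": "4Gi"},
    ),
    get_logs=True,
)
```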

Conclusion

Building data pipelines is a blend of many concerns. This blog explored the importance of well-designed infrastructure, architecture and data modelling in building efficient pipelines.
