
What is data orchestration?
DataOps teams use data pipeline orchestration to centralize the administration and oversight of end-to-end data pipelines.
The process of automating a data pipeline is known as data orchestration.
Managing data pipelines well matters because it affects almost everything: data quality, processing speed and data governance.
What makes data pipeline management effective?
- Transparency and visibility. Everyone on the team should know exactly where the data comes from, how it is transformed and where the transformation process ends.
- Faster deployments. The elements of a data pipeline should be continuously reproducible. Consider these elements as building blocks: when a new data pipeline is required, it should be easy to replicate the existing building blocks for each new data process instead of creating them from scratch.
- Efficient data governance. With a structured data flow, processes are managed and source controlled by the relevant teams, and the data itself remains easy to access and manage.
Common problems while handling data workflows
ETL workflow complexity
My experience suggests that the most typical issue when managing a data pipeline is the complexity of the ETL process. A data pipeline is usually defined by a set of data transformation steps that move data from its source to its destination. There are numerous tools, frameworks, approaches and techniques for doing this, and I previously wrote about it here:
So it makes sense to source control data pipelines and the steps they consist of. Documenting everything is also very important so that the rest of the team knows exactly what the data solution does.
It is always useful to have a visual representation of the data pipeline.
If you deploy pipelines with Airflow, DBT, Dataform, Jinja or AWS Step Functions, these tools usually provide great dependency graph functionality.
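For example, Airflow builds its graph view directly from the dependencies declared between tasks. A minimal sketch of an Airflow 2.x DAG (the task names and schedule are illustrative, not taken from a real pipeline):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**_):
    """Pull raw data from the source system (placeholder)."""


def transform(**_):
    """Apply transformations to the extracted data (placeholder)."""


def load(**_):
    """Load the transformed data into the warehouse (placeholder)."""


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # These dependencies are exactly what Airflow renders in its graph view.
    extract_task >> transform_task >> load_task
```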

Hard to replicate and deploy changes
Data pipelines are often complex, and changing the associated resources and redeploying them can become a very time-consuming task for data and machine learning engineers.
It is also essential to cover all parts of the data pipeline, not only the data transformation part. For example, Airflow is a great tool to orchestrate a data pipeline and we might want to use an S3 bucket connector for the data lake there, but we might also want to describe the S3 resource itself and keep it in GitHub.
This is where Infrastructure as Code (IaC) becomes useful.
With IaC tools like Terraform and AWS CloudFormation, we can describe all the resources our data pipeline needs, not only the ones that actually perform the data transformation. For example, we can define not just the pipeline resources that transform the data, i.e. AWS Lambda functions and other services, but also data storage resources, notification settings and alarms for those microservices. Some IaC solutions are platform agnostic (Terraform), some are not (AWS CloudFormation), and all have their pros and cons depending on the data stack at hand and the DataOps team’s skills.
Data pipeline solutions that can be continuously reproduced and deployed in different environments are great because they are source controlled and fit naturally into CI/CD. All of this helps to avoid potential human errors and to reduce data engineering time and costs.
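As an illustration, here is a minimal sketch using the AWS CDK for Python, which synthesizes to CloudFormation; the bucket, function and alarm names are placeholders. The point is simply that storage, compute and monitoring can live together in one source-controlled template:

```python
from aws_cdk import App, Duration, Stack, aws_lambda as _lambda, aws_s3 as s3
from constructs import Construct


class EtlStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Storage resource: the data lake bucket, not a transformation step.
        raw_bucket = s3.Bucket(self, "RawDataBucket", versioned=True)

        # Compute resource: the Lambda function that performs the transformation.
        etl_fn = _lambda.Function(
            self,
            "EtlFunction",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="app.handler",
            code=_lambda.Code.from_asset("src"),
            timeout=Duration.minutes(5),
        )
        raw_bucket.grant_read_write(etl_fn)

        # Monitoring resource: an alarm on function errors, kept in the same template.
        etl_fn.metric_errors().create_alarm(
            self, "EtlErrorsAlarm", threshold=1, evaluation_periods=1
        )


app = App()
EtlStack(app, "etl-stack")
app.synth()
```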
So a modern approach to data pipeline management and orchestration is one that reduces all of the potential issues mentioned above: it minimizes human errors, makes data pipeline resources easy to replicate, visualizes dependencies and improves data quality.
Data orchestration done right increases the availability and accessibility of data for analytics.
Typical data flow that needs to be managed
Typically, by data pipeline we mean a collection of data-related resources that help us deliver and transform data from point A to point B.
By resources we mean tools, and at the conceptual level there are a few main data tasks that we must be able to perform effectively:
– Data Storage
This can be a data lake in Google Cloud Storage or AWS S3, any relational or non-relational database, or even third-party resources available via APIs. They all serve one purpose: to store the data. And in the majority of cases, this is where the data will be coming from on our data pipeline design diagram.
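A minimal sketch of reading from such a storage layer with boto3; the bucket name and prefix are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder data lake location.
BUCKET = "my-data-lake"
PREFIX = "raw/orders/2024/"

# List the raw files landed by upstream systems.
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Read one object as the starting point of the pipeline.
payload = s3.get_object(Bucket=BUCKET, Key=PREFIX + "part-000.json")["Body"].read()
print(len(payload), "bytes read")
```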
– Data loading or ingestion
This can be managed tools (Fivetran, Stitch, etc.) or something bespoke like serverless microservices built with AWS Lambda or GCP Cloud Functions. They all serve one purpose: to perform ETL and load the data into the destination, i.e. a data warehouse or another data lake, depending on our data platform architecture type. Sometimes it might be more efficient to use tools that scale well, e.g. Spark.
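For example, a bespoke ingestion microservice can be nothing more than a Lambda handler that pulls a batch of records from an API and lands it in the data lake. A sketch, assuming a hypothetical endpoint and bucket:

```python
import json
from datetime import datetime, timezone

import boto3
import urllib3  # available in the AWS Lambda Python runtime via botocore

http = urllib3.PoolManager()
s3 = boto3.client("s3")

# Placeholders for the source endpoint and the data lake bucket.
SOURCE_URL = "https://api.example.com/orders"
DESTINATION_BUCKET = "my-data-lake"


def handler(event, context):
    """Pull a batch of records from an API and land it in the data lake."""
    response = http.request("GET", SOURCE_URL)
    records = json.loads(response.data)

    # Partition the landing path by load timestamp.
    now = datetime.now(timezone.utc)
    key = f"raw/orders/{now:%Y/%m/%d}/batch_{now:%H%M%S}.json"

    s3.put_object(
        Bucket=DESTINATION_BUCKET,
        Key=key,
        Body=json.dumps(records).encode("utf-8"),
    )
    return {"records_loaded": len(records), "key": key}
```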
– Data transformation tools
Historically, the most natural way to transform data is SQL. It is a common data manipulation language recognised by all teams: business intelligence (BI) and data analysts, software engineers and data scientists. So there is now a variety of tools that offer reliable, source-controlled data transformation with SQL queries, e.g. Jinja, DBT, Dataform, AWS Step Functions, etc.
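As a simple illustration, here is a sketch that runs a source-controlled SQL transformation with the BigQuery Python client; the dataset and table names are made up. Tools like DBT or Dataform essentially template, test and schedule this kind of statement for you:

```python
from google.cloud import bigquery

client = bigquery.Client()

# A source-controlled SQL transformation; dataset and table names are illustrative.
TRANSFORM_SQL = """
CREATE OR REPLACE TABLE analytics.daily_orders AS
SELECT
    DATE(created_at) AS order_date,
    COUNT(*)         AS orders,
    SUM(amount)      AS revenue
FROM raw.orders
GROUP BY order_date
"""

# Run the transformation and wait for it to complete.
job = client.query(TRANSFORM_SQL)
job.result()
print("Transformation finished, job id:", job.job_id)
```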
– Business intelligence
These tools help to deliver analyses and insights. Some of them are free community tools, e.g. Looker Studio, while others are subscription-based and paid only. All have pros and cons, and it might be wise to choose one based on company size. For example, SMEs don’t usually need to pay extra for BI OLAP cube features where data is additionally analyzed and transformed; all they need is a daily dashboard with the main KPIs emailed to them.
The list of tools is massive, and some great BI solutions are also available with free features.
A non-exhaustive list:
- AWS Quicksight
- Mode
- Sisense
- Looker
- Metabase
- PowerBI
The problem is that no single data tool comes with all of these features fully integrated.
How do we link and connect data services together?
Wrapping our data processes into one solution can also be achieved with a little bit of programming and Infrastructure as Code techniques.
Conceptually we might want to deploy a data pipeline like this:

How do we make this solution robust and cost-effective?
Without a doubt, the most cost-effective way would be to use a serverless architecture. Consider the data pipeline below, for example. It can be deployed with IaC (AWS CloudFormation or Terraform), and all the main parts are integrated into one complete data solution.
We have AWS Lambda functions, AWS Step Functions, a relational database, data lake storage buckets and a data lakehouse solution (AWS Athena).

Consider a data pipeline example that consists of an AWS Lambda function. Standalone, a microservice like this is just a Lambda function, but there is so much more we can do with it: extract data from APIs, export data from relational databases, invoke and trigger other services, and so on.
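One possible shape of such a microservice is a handler that exports a table from a relational database into the data lake. The sketch below assumes an Aurora cluster with the Data API enabled; all ARNs and names are placeholders, not part of the original example:

```python
import csv
import io

import boto3

rds_data = boto3.client("rds-data")
s3 = boto3.client("s3")

# Placeholders: an Aurora cluster with the Data API enabled and the
# Secrets Manager secret that stores its credentials.
CLUSTER_ARN = "arn:aws:rds:eu-west-1:123456789012:cluster:orders-db"
SECRET_ARN = "arn:aws:secretsmanager:eu-west-1:123456789012:secret:orders-db-creds"
DESTINATION_BUCKET = "my-data-lake"


def handler(event, context):
    """Export a table from a relational database into the data lake."""
    result = rds_data.execute_statement(
        resourceArn=CLUSTER_ARN,
        secretArn=SECRET_ARN,
        database="shop",
        sql="SELECT id, amount, created_at FROM orders",
    )

    # Flatten the Data API response (a list of typed fields) into CSV rows.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["id", "amount", "created_at"])
    for record in result["records"]:
        writer.writerow([next(iter(field.values())) for field in record])

    s3.put_object(
        Bucket=DESTINATION_BUCKET,
        Key="raw/orders/full_export.csv",
        Body=buffer.getvalue().encode("utf-8"),
    )
    return {"rows_exported": len(result["records"])}
```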
We can deploy the pipeline with just one AWS CLI command using infrastructure as code:

With this, we should be able to create a simple Step Function and use it as a template to extend and improve our ETL pipeline.
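A minimal Step Functions sketch, expressed in the Amazon States Language and registered here with boto3 (in practice the definition would usually live in the CloudFormation or Terraform template); the Lambda and IAM role ARNs are placeholders:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# A minimal Amazon States Language definition: two Lambda tasks in sequence.
DEFINITION = {
    "Comment": "Simple ETL workflow, to be extended later",
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:extract",
            "Next": "Transform",
        },
        "Transform": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:transform",
            "End": True,
        },
    },
}

response = sfn.create_state_machine(
    name="simple-etl",
    definition=json.dumps(DEFINITION),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-etl-role",
)
print(response["stateMachineArn"])
```

Adding a new step to the pipeline is then just a matter of adding another state to the definition and redeploying.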
Should we choose to be platform agnostic, we can deploy the complete pipeline with Terraform and use various data services across different cloud platforms. I often use this data pipeline design pattern with AWS Lambda functions to run SQL queries in my BigQuery data warehouse or to perform other ETL tasks. Very useful.
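A sketch of that cross-cloud pattern: an AWS Lambda handler that submits a SQL statement to BigQuery. The table names are illustrative, and the assumption is that the BigQuery client library and Google credentials are packaged with the function (e.g. via a Lambda layer and GOOGLE_APPLICATION_CREDENTIALS):

```python
from google.cloud import bigquery

# Assumes google-cloud-bigquery is bundled with the function and Google
# credentials are supplied through the environment configuration.
client = bigquery.Client()


def handler(event, context):
    """Run a SQL statement in BigQuery from an AWS Lambda function."""
    sql = event.get(
        "sql",
        "CREATE OR REPLACE TABLE analytics.daily_orders AS "
        "SELECT DATE(created_at) AS order_date, COUNT(*) AS orders "
        "FROM raw.orders GROUP BY order_date",
    )
    job = client.query(sql)
    job.result()  # Block until the query finishes.
    return {"job_id": job.job_id, "state": job.state}
```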
Conclusion
Wrapping our data processes into one solution might be challenging and would require some imagination.
Infrastructure as Code sounds like the right way to go. Indeed, it simplifies data solution deployments and replication, eliminates human errors and helps to perform data engineering and MLOps tasks faster. Although it might look complex and sophisticated, it all comes with experience and requires time to learn. Don’t hesitate to invest a bit of yours in learning it. It is a very rewarding skill.
It offers a multitude of options for data pipeline orchestration.
With just a little bit of coding and an understanding of how APIs work, the opportunities for data pipeline design and management become endless.
Recommended read
Create MySQL and Postgres instances using AWS Cloudformation
Provision Infrastructure as Code – AWS CloudFormation – AWS
Serverless Workflow Orchestration – AWS Step Functions – Amazon Web Services