
What is a data environment? A data environment is a set of applications and the related physical infrastructure resources that enable data storage, transfer, processing and transformation to support company goals and objectives. Data engineers split infrastructure resources into live and staging to create isolated areas (environments) where they can test ETL services and data pipelines before promoting them to production.
This story provides **an overview of CI/CD tech available for data and a working example of a simple ETL service built in Python and deployed with Infrastructure as Code (IaC) using GitHub Actions**.
Continuous integration and continuous delivery (CI/CD)
Continuous integration and continuous delivery (CI/CD) is a software development strategy in which all developers collaborate on a common repository of code, and when changes are made, an automated build process is used to discover any potential code problems.

CI/CD benefits
One of the primary technical advantages of CI/CD is that it improves overall code quality and saves time.
Automated CI/CD pipelines using Infrastructure as Code solve a lot of problems.
Deliver faster
Adding new features numerous times each day is not an easy task, but with a streamlined CI/CD workflow it is definitely achievable.
Using CI/CD tools such as GoCD, AWS CodePipeline, Docker, Kubernetes, CircleCI, Travis CI, etc., dev teams can now build, test, and deploy changes independently and automatically.
Reduce errors
Finding and resolving code issues late in the development process is time-consuming and, therefore, expensive. When features with errors are released to production, this becomes even more costly.
By testing and deploying code more often using a CI/CD pipeline, testers will be able to see problems as soon as they arise and correct them right away. This helps to mitigate risks in real time.
Less manual effort and more transparency
Tests should run automatically for new code to ensure that neither the new code nor the new features break any already-existing features. Throughout this process, we would want to get regular updates on the development, test, and deployment cycles.
Easy rollbacks
If something is wrong with a new release or feature, the most recent successful build is normally deployed immediately to prevent downtime in production. Easy rollbacks are another great CI/CD feature.
Extensive logs
Knowing the deployment process is essential. Understanding why our code fails is even more important. Observability is one of the most important parts of DevOps and CI/CD integration, so being able to read extensive logs for our builds is definitely a must-have feature.
When do we use CI/CD for data platforms?
Managing data resources and infrastructure: With CI/CD techniques and tools we can provision, deploy and manage the infrastructure resources we might need for data pipelines, e.g. cloud storage buckets, serverless microservices that perform ETL tasks, event streams and queues. Tools like AWS CloudFormation and Terraform can manage infrastructure with ease and provision resources for test, staging and live environments.
SQL unit testing: CI/CD helps with data transformation. If we have a data pipeline that transforms data following the ELT pattern, we can automate SQL unit tests to test the logic behind it. A good example would be a GitHub Actions workflow that compiles our SQL scripts and runs unit tests, for example:
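A minimal sketch of such a workflow, assuming purely for illustration that our SQL transformations live in a dbt project under ./dbt and that a test warehouse profile is available to the runner (the file name, paths and adapter below are not part of the original repo):
# .github/workflows/sql_unit_tests.yaml (hypothetical file name)
name: SQL unit tests
on:
  pull_request:
    branches: [ master ]
jobs:
  sql-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dbt
        run: pip install dbt-core dbt-bigquery   # adapter choice is an assumption
      - name: Compile SQL models
        run: dbt compile --project-dir ./dbt --profiles-dir ./dbt
      - name: Run SQL unit tests
        run: dbt test --project-dir ./dbt --profiles-dir ./dbt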
Validating ETL processes: Many data pipelines rely heavily on ETL (Extract, Transform, Load) operations. We would want to ensure that any changes we commit to our GitHub repository do the right job with the data. This can be achieved by implementing automated integration tests. Here is a simple example of how to do it:
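One possible shape for this is a workflow that runs integration tests with pytest against the staging resources on each pull request. The secret names, region, test folder and bucket variable below are assumptions for this sketch:
# .github/workflows/integration_tests.yaml (hypothetical file name)
name: ETL integration tests
on:
  pull_request:
    branches: [ master ]
jobs:
  integration-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install test dependencies
        run: pip install pytest boto3
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-west-1                     # assumed region
      - name: Run integration tests against the staging stack
        run: pytest tests/integration --maxfail=1   # hypothetical test folder
        env:
          DATA_BUCKET: datalake-staging.aws         # hypothetical staging bucket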
Monitoring data pipelines: A great example would be using CI/CD and Infrastructure as Code to provision notification topics and alarms for ETL resources, e.g. Lambda functions. We can receive notifications via the channels of our choice if something goes wrong with our ETL processing service, for instance, if the number of errors reaches a threshold. Here is an AWS CloudFormation example of how to do it:
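A minimal CloudFormation sketch, assuming the ETL Lambda is called pipeline-manager-live and that alerts go to an email address (both values are placeholders):
AWSTemplateFormatVersion: '2010-09-09'
Description: Notification topic and error alarm for the ETL Lambda (sketch)
Resources:
  AlarmTopic:
    Type: AWS::SNS::Topic
    Properties:
      Subscription:
        - Protocol: email
          Endpoint: data-alerts@example.com     # placeholder address
  EtlErrorsAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Too many errors in the ETL Lambda
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: pipeline-manager-live          # assumed function name
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 5
      ComparisonOperator: GreaterThanOrEqualToThreshold
      AlarmActions:
        - !Ref AlarmTopic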
How to set up a CI/CD pipeline for a data platform?

Step 1. Create a repository
This is a fundamental step. A version control system is required. We would want to ensure that every change in our code is version controlled, saved somewhere in the cloud and can be reverted if needed.
Step 2. Add a build step
Now that we have a repository, we can configure our CI/CD pipeline to actually build the project. Imagine we have an ETL microservice that loads data from AWS S3 into a data warehouse. This step would involve building a Lambda package in an isolated environment, e.g. on a GitHub-hosted runner. During this step, the CI/CD service must be able to collect all required code packages to compile our service. For example, if we have a simple AWS Lambda to perform an ETL task then we would want to build the package:
# This bash script can be added to CI/CD pipeline definition:
PROFILE=Named_AWS_profile
# Get date and time for our build package:
date
TIME=`date +"%Y%m%d%H%M%S"`
# Get the current directory to name our package file:
base=${PWD##*/}
zp=$base".zip"
echo $zp
# Tidy up if any old files exist:
rm -f $zp
# Install required packages:
pip install --target ./package pyyaml==6.0
# Go inside the package folder and add all dependencies to zip archive:
cd package
zip -r ../${base}.zip .
# Go to the previous folder and package the Lambda code:
cd $OLDPWD
zip -r $zp ./pipeline_manager
# Upload the Lambda package to the S3 artifact bucket (we can deploy our Lambda from there):
aws --profile $PROFILE s3 cp ./${base}.zip s3://datalake-lambdas.aws/pipeline_manager/${base}${TIME}.zip
Step 3. Run tests
We would want to ensure that the changes we deploy for our data pipeline work as expected. This can be achieved by writing good unit and integration tests. Then we would configure our CI/CD pipeline to run them, for example, every time we commit changes or merge into the master branch. For instance, we can configure GitHub Actions to run **pytest test.py** or **npm run test** for our AWS Lambda. If tests are successful we can proceed to the next step.
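For illustration, a test job in a GitHub Actions workflow could look roughly like this (the Python version and dependency list are assumptions):
# A test workflow that can be added to the pipeline definition:
name: Run tests
on:
  pull_request:
    branches: [ master ]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install pytest pyyaml==6.0
      - name: Run unit tests for the Lambda
        run: pytest test.py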
Step 4. Deploy staging
In this step, we continue to implement continuous integration. We have a successful build for our project and all tests have passed, so now we want to deploy to the staging environment. By environment we mean a dedicated set of resources. The CI/CD pipeline can be configured to use the settings relevant to this particular environment using Infrastructure as Code and finally deploy. Here is an example for our Lambda. This bash script can be added to the relevant step of the CI/CD pipeline:
STACK_NAME=PipelineManagerStaging
aws --profile $PROFILE cloudformation deploy \
  --template-file stack_simple_service_and_role.yaml \
  --stack-name $STACK_NAME \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides "StackPackageS3Key"="pipeline_manager/${base}${TIME}.zip"
# Additionally we might want to provide any infrastructure resources relevant only for staging. They must be declared in our CloudFormation stack file stack_simple_service_and_role.yaml
Step 5. Deploy live
This is the final step and typically it is triggered manually when we are 100% sure everything is okay.

CI/CD would use IaC settings for the production environment. For instance, we might want to provide infrastructure resources relevant only for production, e.g. our Lambda function name should be pipeline-manager-live. These resource parameters and configuration settings must be declared in our CloudFormation stack file. For example, we might want our ETL Lambda to be triggered by a CloudWatch event from an S3 bucket every time a new S3 object is created there. In this case, we would want to provide the name of this S3 bucket in the parameters. Another example would be Lambda settings such as memory and timeout. There is no need to over-provision memory for the staging service, but on live we would want it to be able to process larger amounts of data.
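For illustration, these environment-specific settings could be expressed in the CloudFormation template roughly like this. The parameter names, memory and timeout values, handler and runtime below are assumptions, not the contents of the actual stack_cicd_service_and_role.yaml:
AWSTemplateFormatVersion: '2010-09-09'
Description: ETL Lambda with environment-specific settings (sketch)
Parameters:
  Environment:
    Type: String
    AllowedValues: [ staging, live ]
    Default: staging
  StackPackageS3Key:
    Type: String
Mappings:
  EnvironmentSettings:
    staging:
      MemorySize: 128
      Timeout: 60
    live:
      MemorySize: 1024
      Timeout: 900
Resources:
  LambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
  PipelineManagerLambda:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Sub 'pipeline-manager-${Environment}'
      Handler: pipeline_manager.app.lambda_handler   # assumed handler path
      Runtime: python3.10                            # assumed runtime
      Role: !GetAtt LambdaRole.Arn
      MemorySize: !FindInMap [ EnvironmentSettings, !Ref Environment, MemorySize ]
      Timeout: !FindInMap [ EnvironmentSettings, !Ref Environment, Timeout ]
      Code:
        S3Bucket: datalake-lambdas.aws               # the artifact bucket used in the build step
        S3Key: !Ref StackPackageS3Key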
CI/CD Live step example:
STACK_NAME=SimpleCICDWithLambdaAndRoleLive
aws cloudformation deploy \
  --template-file stack_cicd_service_and_role.yaml \
  --stack-name $STACK_NAME \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides \
  "StackPackageS3Key"="pipeline_manager/${base}${TIME}.zip" \
  "Environment"="live" \
  "Testing"="false"

Rollbacks, version control and security can be handled via CI/CD service settings and IaC.
CI/CD pipeline example with infrastructure as code and AWS Lambda
Let’s imagine we have a typical repo with some ETL service (an AWS Lambda) deployed with AWS CloudFormation.
It could be a data pipeline manager application or anything else that performs ETL tasks.
Our repo folder structure will be the following:
.
├── LICENSE
├── README.md
└── stack
    ├── .github
    │   └── workflows
    │       ├── deploy_staging.yaml
    │       └── deploy_live.yaml
    ├── deploy.sh
    ├── event.json
    ├── package
    ├── pipeline_manager
    │   ├── app.py
    │   ├── config
    │   └── env.json
    └── stack_cicd_service_and_role.yaml
We will define our CI/CD pipeline with deploy_staging.yaml and deploy_live.yaml in the .github/workflows folder.
On any pull request, we would want to run tests and deploy to staging.
Then, if everything is okay, we will promote our code to production and deploy the stack to the live environment.

This pipeline will use GitHub repository secrets, where we will copy-paste our AWS credentials.
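A minimal sketch of what .github/workflows/deploy_staging.yaml could look like. The secret names, region, Python version, test module and deploy.sh arguments are assumptions:
# .github/workflows/deploy_staging.yaml (sketch)
name: STAGING AND TESTS
on:
  pull_request:
    branches: [ master ]
jobs:
  test-and-deploy-staging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-west-1              # assumed region
      - name: Run unit tests
        working-directory: ./stack
        run: |
          pip install pytest pyyaml==6.0
          pytest test.py                     # hypothetical test module
      - name: Build and deploy the staging stack
        working-directory: ./stack
        run: bash deploy.sh staging          # assumes deploy.sh wraps the build and deploy commands shown earlier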

After STAGING AND TESTS has executed successfully and everything has passed, we can manually promote our code to live. We can use workflow_dispatch for that:
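And a minimal deploy_live.yaml sketch using workflow_dispatch (again, the secret names, region and deploy.sh argument are assumptions):
# .github/workflows/deploy_live.yaml (sketch)
name: DEPLOY LIVE
on:
  workflow_dispatch:
jobs:
  deploy-live:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-west-1              # assumed region
      - name: Build and deploy the live stack
        working-directory: ./stack
        run: bash deploy.sh live             # assumes deploy.sh wraps the build and deploy commands shown earlier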

CI/CD tools available in the market
There are various CI/CD solutions that may be used to automate data pipeline testing, deployment, and monitoring. GitHub Actions is a great tool, but sometimes we might need more and/or something different.
This is not an extensive list but some popular tech to try:
AWS CodePipeline: A solid tool for $1.5 a month per pipeline, with lots of features including automated builds and deployments via infrastructure as code.
Circle CI: Circle CI is a cloud-based CI/CD system for automated data pipeline testing and deployment. It has a number of connectors and plugins that make it simple to set up and operate.
Jenkins: Jenkins is a free and open-source automation server for continuous integration and deployment. It offers a diverse set of plugins and connectors, making it a powerful data pipeline management solution.
GitLab CI/CD: GitLab CI/CD is a cloud-based system that allows teams to manage changes to their code and data pipelines in a single location. It has an easy-to-use interface for creating, testing, and deploying data pipelines.
Travis CI: Travis CI is a cloud-based CI/CD system for automated data pipeline testing and deployment. It is simple to set up and utilize, making it a popular choice for teams with little automation expertise.
GoCD: GoCD is a free and open-source build and release tool. It relies on bash scripts a lot.
Conclusion
One of the main benefits of CI/CD is that it improves code quality. Continuous integration and deployment bring a lot of benefits for data platform engineers and ML Ops. Every step of our data pipeline deployments can be easily monitored and managed to ensure faster delivery with no errors in production. It saves time and helps engineers to be more productive.
I hope the simple example given in this story is useful for you. Using it as a template, I was able to create robust and flexible CI/CD pipelines for containerized applications. Automation in deployment and testing is pretty much a standard these days, and we can do so much more with it, including ML Ops and provisioning resources for data science. There are a lot of CI/CD tools available in the market. Some of them are free and some aren't, but the paid ones often bring more flexible setups that might become a better fit for your data stack. My advice for beginners would be to start with free tools and try to implement the example from this story. It describes a process that can be reproduced for any data service later.
Recommended read
- https://docs.github.com/en/actions
- https://stackoverflow.com/questions/58877569/how-to-trigger-a-step-manually-with-github-actions
- https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html
- https://medium.com/gitconnected/infrastructure-as-code-for-beginners-a4e36c805316
- https://betterprogramming.pub/great-data-platforms-use-conventional-commits-51fc22a7417c