
Data science promises to generate immense business value across all industries through the compelling capabilities of machine learning.
However, a recent report by Gartner revealed that most data science projects fail to progress beyond experimentation despite having sufficient data and intent.
To unlock the full potential of data science, machine learning models need to be deployed in the real world as scalable end-to-end systems that are managed automatically.
This article explores the concepts behind data science pipelines and how to leverage Kedro to create an anomaly detection pipeline for financial data.
This article was first published on neptune.ai.
Contents
(1) What is a Data Science Pipeline?
(2) What is Kedro?
(3) Why Kedro?
(4) Step by Step Guide – Data Science Pipeline for Anomaly Detection
(1) What is a Data Science Pipeline?
As the name suggests, a Data Science pipeline involves the seamless linkage of various components to facilitate the smooth movement of data as intended.
If we were to do an online search for data science pipelines, we would see a dizzying array of pipeline designs. The good news is that we can boil these pipelines down to six core elements:
- Data Retrieval and Ingestion
- Data Preparation
- Model Training
- Model Evaluation and Tuning
- Model Deployment
- Monitoring

The above diagram illustrates how these six components are connected to form a pipeline where machine learning models are primed to deliver the best results in production settings.
(1) Data Retrieval and Ingestion
Data is the lifeblood of all data science projects, so the first step is to identify the relevant raw data from various data sources.
This step is more challenging than it sounds, as data is often stored in different formats across different silos (e.g., third-party sources, internal databases).
Once the required datasets are correctly identified, they are extracted and consolidated for downstream processing.
(2) Data Preparation
The quality of insights from data depends on the quality of the data itself. Therefore, it is no surprise that data preparation takes up the most time and effort.
The techniques used for data preparation depend on the task at hand (e.g., classification, regression) and include categories such as data cleaning, data transformation, feature selection, and feature engineering.
(3) Model Training
Model training is where the model prowls through the data and learns the underlying pattern. The trained model will be represented as a statistical function that captures the pattern information from the data.
The selection of machine learning models to implement is dependent on the actual task, nature of the data, and business requirements.
(4) Model Evaluation and Tuning
Once model training is complete, it is vital to evaluate the model's performance. The evaluation is done by tasking the model to run predictions on data that it has not seen before, which serves as a proxy for how well it will perform in the real world.
The evaluation metrics help guide the changes needed to optimize model performance (e.g., select different models, adjust hyperparameter configurations, etc.).
The machine learning development cycle is highly iterative because there are many ways to adjust the model based on the metrics and error analysis.
(5) Deployment
Once we are confident that our model can deliver excellent predictions, we expose the model to real action by deploying it into production settings.
Model deployment is the critical step of integrating the model into a production environment where it takes in actual data and generates output for data-driven business decisions.
(6) Monitoring
To maintain a robust and continuously operating data science pipeline, we must monitor how well it is performing after deployment.
Beyond model performance and data quality, the monitoring metrics can also include operational aspects such as resource utilization and model latency.
In a mature MLOps setup, we can trigger new iterations of model training based on predictive performance or the availability of new data.

(2) What is Kedro?
The importance of data science pipelines has spurred the development of numerous frameworks for building and managing them effectively. One such framework is Kedro, which is the focus of this article.
Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. It helps to accelerate data pipelining, enhance data science prototyping, and promote pipeline reproducibility.
Kedro applies software engineering concepts to developing production-ready machine learning code to reduce the time and effort needed for successful model deployment.
This impact is achieved by eliminating re-engineering work caused by low-quality code and by standardizing project templates for seamless collaboration.
Let’s take a look at the applied concepts within Kedro:
- Reproducibility: Ability to recreate the steps of a workflow across different pipeline runs and environments accurately and consistently.
- Modularity: Breaking down large code chunks into smaller, self-contained, and understandable units that are easy to test and modify.
- Maintainability: Use of standard code templates that allow teammates to readily comprehend and maintain the setup of any project, thereby promoting a standardized approach to collaborative development.
- Versioning: Precise tracking of the data, configuration, and machine learning model used in each pipeline run.
- Documentation: Clear and structured information for easy understanding.
- Seamless Packaging: Allowing data science projects to be documented and shipped efficiently into production (with tools like Airflow or Docker).
(3) Why Kedro?
The path of bringing a data science project from pilot development to production is fraught with challenges. Some of the significant difficulties include:
- Code that needs to be rewritten for production environments, leading to significant project delays
- Disorganized project structures that make collaboration challenging
- Data flow that is hard to trace
- Functions that are overly lengthy and difficult to test or reuse
- Relationships between functions that are hard to understand
The QuantumBlack team developed Kedro to tackle the challenges above. It was born out of the belief that data science code should be production-ready from the get-go.
(4) Step by Step Guide: Building a Data Science Pipeline for Anomaly Detection
Let us get to the exciting part where we work through a practical hands-on project.
The project use case revolves around financial fraud detection. We will build an anomaly detection pipeline to identify anomalies in credit card transactions, using isolation forest as the primary machine learning model.
The credit card transaction data is obtained from a collaboration between Worldline and the Machine Learning Group. It is a realistic simulation of real-world credit card transactions and has been designed to include complicated fraud detection issues.
The following visualization shows our final anomaly detection pipeline and serves as a blueprint for what we will build in the following sections.

Feel free to check out the GitHub repo for this project as you follow along.
Step 1 – Installing Kedro and Kedro-Viz
It is recommended to create a virtual environment so that each project has its own isolated environment with the relevant dependencies. To work with Kedro, the official documentation recommends that users install Anaconda.
Because my Python version is above 3.10, Anaconda makes it easy to create an environment (using conda instead of venv) with a Python version that is compatible with Kedro's requirements (i.e., Python 3.6 – 3.8 at the time of writing).
In particular, this is the command (in Anaconda Powershell Prompt) to generate our Kedro environment:
conda create --name kedro-env python=3.7 -y
Once the virtual environment is set up and activated with conda activate kedro-env, we can use pip to install Kedro and the Kedro-Viz plugin:
pip install kedro kedro-viz
We can check whether Kedro is correctly installed by changing the directory to our project folder and entering kedro info. If installed correctly, we should see the following:

At this point, we can install the other packages needed for our project.
pip install scikit-learn matplotlib
If we wish to initialize this project as a Git repository, we can do so with:
git init
Step 2 – Project Setup
One of the key features of Kedro is the creation of standard, modifiable and easy-to-use project templates. We can initialize a new Kedro project with:
kedro new
After providing the relevant names to the series of prompts, we will end up with a highly-organized project directory that we can build upon:

The project structure can be grouped into six main folders:
/conf: Contains configuration files that specify details such as data sources (i.e. Data Catalog), model parameters, credentials, and logging information.
/data: Contains the input, intermediate and output data. It is organized into an eight-layer data engineering convention to clearly categorize and structure how data is processed.
/docs: Contains the files relating to the project documentation.
/logs: Contains the log files generated when pipeline runs are performed.
/notebooks: Contains Jupyter notebooks used in the project e.g., for experimentation or initial exploratory data analysis.
/src: Contains the source code for the project, such as Python scripts for the pipeline steps, data processing, and model training.
Step 3 – Data Setup
Data comes before science, so let us start with the data setup. The raw data (70 CSV files of daily transactions) is first placed inside _data/01_raw_.
Based on our project blueprint earlier, we know what data will be generated and utilized along the pipeline. Therefore, we can translate this information into the Data Catalog, a registry of data sources available for the project.
The Data Catalog provides a consistent way of defining how data is stored and parsed, making it easy for datasets to be loaded and saved from anywhere within the pipeline.
We can find the Data Catalog in the .yml file – conf/base/catalog.yml.

The above image is a snippet of the data sources defined in the Data Catalog. For example, we first expect our raw CSV files to be read and merged into an intermediate CSV dataset called merged_data.csv.
Kedro has built-in data connectors (e.g., pandas.CSVDataSet, matplotlib.MatplotlibWriter) to accommodate the different data types.
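For illustration, here is a partial sketch of what such catalog entries could look like. Apart from merged_data.csv and the data/06_models location mentioned in this article, the entry names, file paths, and the use of a PartitionedDataSet for the raw files are assumptions; the project's actual catalog.yml is in the repo.

```yaml
# conf/base/catalog.yml (partial sketch; entry names and paths other than
# merged_data.csv and data/06_models are assumptions)
raw_daily_data:
  type: PartitionedDataSet        # one way to read the 70 daily CSV files
  path: data/01_raw
  dataset: pandas.CSVDataSet

merged_data:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/merged_data.csv

isolation_forest_model:
  type: pickle.PickleDataSet
  filepath: data/06_models/isolation_forest.pkl

evaluation_plot:
  type: matplotlib.MatplotlibWriter
  filepath: data/08_reporting/evaluation_plot.png
```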
Step 4 – Create Pipelines
Once our Data Catalog is defined, we can build our pipelines. Firstly, there are two key concepts to understand: Nodes and Pipelines.
- Nodes are the building blocks of pipelines. They are Python functions representing data transformations, e.g., data pre-processing, modeling.
- Pipelines are sequences of nodes connected to deliver a workflow. A pipeline organizes the nodes' dependencies and execution order, and connects inputs and outputs while keeping the code modular.
The complete pipeline for anomaly detection can be divided into three smaller modular pipelines, which we will eventually connect:
- Data Engineering Pipeline
- Data Science Pipeline
- Model Evaluation Pipeline
We can instantiate these modular pipelines with the following commands based on the names we assign:
kedro pipeline create data_engineering
kedro pipeline create data_science
kedro pipeline create model_evaluation
While the pipelines are empty at this stage, their structures have been nicely generated inside the /src folder.

Each pipeline folder contains the same set of files, including nodes.py (code for the nodes) and pipeline.py (code for the pipeline).
Step 5 – Build Data Engineering Pipeline
Let’s first look at the data engineering pipeline, where we process the data for downstream machine learning. More specifically, there are three preprocessing tasks to be performed:
- Merge raw datasets into an intermediate merged dataset
- Process the merged dataset by keeping only the predictor columns and creating a new date column for subsequent train-test split
- Perform chronological 80:20 train-test split and drop unnecessary columns
We start by scripting the tasks as three separate node functions inside nodes.py:
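Below is a condensed sketch of what these node functions could look like; the actual implementations are in the repo, and details such as the PartitionedDataSet-style raw input and column names (e.g., TX_DATETIME) are illustrative assumptions.

```python
# src/<package>/pipelines/data_engineering/nodes.py (condensed sketch;
# column names and raw-file handling are assumptions, not the repo's exact code)
from typing import Callable, Dict, List, Tuple

import pandas as pd


def merge_data(partitioned_input: Dict[str, Callable[[], pd.DataFrame]]) -> pd.DataFrame:
    """Merge the daily raw CSV files into a single intermediate dataset."""
    daily_frames = [load() for load in partitioned_input.values()]
    return pd.concat(daily_frames, ignore_index=True)


def process_data(merged_data: pd.DataFrame, predictor_cols: List[str]) -> pd.DataFrame:
    """Keep only the predictor columns and derive a date column for the split."""
    df = merged_data.copy()
    df["TX_DATE"] = pd.to_datetime(df["TX_DATETIME"]).dt.date  # assumed column name
    return df[predictor_cols + ["TX_DATE"]]


def train_test_split(processed_data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Chronological 80:20 split, dropping the helper date column afterwards."""
    df = processed_data.sort_values("TX_DATE")
    cutoff = int(len(df) * 0.8)
    train_data = df.iloc[:cutoff].drop(columns=["TX_DATE"])
    test_data = df.iloc[cutoff:].drop(columns=["TX_DATE"])
    return train_data, test_data
```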
We then import these node functions into pipeline.py to link them in the correct sequence.
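Here is a sketch of how pipeline.py could wire these nodes together; apart from node_process_data (referenced below), the node and dataset names are illustrative.

```python
# src/<package>/pipelines/data_engineering/pipeline.py (sketch; apart from
# node_process_data, the node and dataset names are illustrative)
from kedro.pipeline import Pipeline, node

from .nodes import merge_data, process_data, train_test_split


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=merge_data,
                inputs="raw_daily_data",
                outputs="merged_data",
                name="node_merge_data",
            ),
            node(
                func=process_data,
                inputs=["merged_data", "params:predictor_cols"],
                outputs="processed_data",
                name="node_process_data",
            ),
            node(
                func=train_test_split,
                inputs="processed_data",
                outputs=["train_data", "test_data"],
                name="node_train_test_split",
            ),
        ]
    )
```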
Notice that in each node wrapper node(...), we specify a name, a function (imported from nodes.py), and the input and output datasets defined in the Data Catalog (see Step 3).
The arguments in the node wrappers should match the dataset names in the Data Catalog and the arguments of the node functions.
For the node node_process_data, the list of predictor columns is stored in the parameters file found in conf/base/parameters.yml.
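For reference, the parameters file could look something like this; the column names shown are placeholders, and the actual predictor list is defined in the repo.

```yaml
# conf/base/parameters.yml (illustrative; the actual predictor list is in the repo)
predictor_cols:
  - TX_AMOUNT
  - TX_TIME_SECONDS
  - TX_TIME_DAYS
```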
Our data engineering pipeline setup is complete, but it is not ready to run because it has not yet been registered. We will cover registration in Step 8, so let's continue building the two remaining pipelines.
Step 6 – Build Data Science Pipeline
The anomaly detection model for our pipeline is isolation forest. Isolation forest is an unsupervised algorithm that is built using decision trees.
It ‘isolates’ observations by randomly selecting a feature and then choosing a split value between that feature's maximum and minimum values. Since anomalies are few and different, they are expected to be easier to isolate than normal observations.
We will use the scikit-learn implementation for isolation forest modeling. There are two tasks (and nodes) to be created – (i) model training and (ii) model predictions (aka inference).
The contamination value for the model is set at 0.009, matching the proportion of fraud cases observed in the original dataset (i.e., 0.9%).
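A minimal sketch of these two node functions, assuming scikit-learn's IsolationForest with the contamination value mentioned above; the exact arguments and column handling in the repo may differ.

```python
# src/<package>/pipelines/data_science/nodes.py (condensed sketch; the exact
# arguments and column handling in the repo may differ)
import pandas as pd
from sklearn.ensemble import IsolationForest


def train_model(train_data: pd.DataFrame) -> IsolationForest:
    """Fit an isolation forest, with contamination matching the known fraud rate."""
    model = IsolationForest(contamination=0.009, random_state=42)
    model.fit(train_data)
    return model


def predict(model: IsolationForest, test_data: pd.DataFrame) -> pd.DataFrame:
    """Flag anomalies (-1) and attach a raw anomaly score for each transaction."""
    predictions = test_data.copy()
    predictions["ANOMALY_LABEL"] = model.predict(test_data)
    # score_samples is higher for normal points, so negate it to get an anomaly score
    predictions["ANOMALY_SCORE"] = -model.score_samples(test_data)
    return predictions
```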
Like before, we link the nodes together within a pipeline function in pipeline.py.
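A sketch of the corresponding pipeline.py (dataset and node names are illustrative):

```python
# src/<package>/pipelines/data_science/pipeline.py (sketch; dataset and node
# names are illustrative)
from kedro.pipeline import Pipeline, node

from .nodes import predict, train_model


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=train_model,
                inputs="train_data",
                outputs="isolation_forest_model",
                name="node_train_model",
            ),
            node(
                func=predict,
                inputs=["isolation_forest_model", "test_data"],
                outputs="predictions",
                name="node_predict",
            ),
        ]
    )
```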
As seen in the Data Catalog, we will be saving our trained isolation forest model as a pickle file in _data/06_models_.
Step 7 – Build Model Evaluation Pipeline
Although isolation forest is an unsupervised model, we can still evaluate its performance if we have ground truth labels.
In the original dataset, there is the TX_FRAUD variable that serves as an indicator of fraudulent transactions.
With the ground truth labels and predicted anomaly scores, we can obtain and present the evaluation metrics as AUC and AUCPR plots.
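As a rough sketch, such an evaluation node could compute the curves with scikit-learn and return a matplotlib figure for a MatplotlibWriter dataset. The test_labels input holding the ground-truth TX_FRAUD column is an assumption, and the repo's implementation may differ.

```python
# src/<package>/pipelines/model_evaluation/nodes.py (condensed sketch; the
# repo's implementation, dataset names, and column names may differ)
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import auc, precision_recall_curve, roc_curve


def evaluate_model(predictions: pd.DataFrame, test_labels: pd.DataFrame) -> plt.Figure:
    """Plot ROC (AUC) and precision-recall (AUCPR) curves from the anomaly scores."""
    y_true = test_labels["TX_FRAUD"]
    y_score = predictions["ANOMALY_SCORE"]

    fpr, tpr, _ = roc_curve(y_true, y_score)
    precision, recall, _ = precision_recall_curve(y_true, y_score)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    ax1.plot(fpr, tpr)
    ax1.set_title(f"ROC curve (AUC = {auc(fpr, tpr):.4f})")
    ax1.set_xlabel("False positive rate")
    ax1.set_ylabel("True positive rate")
    ax2.plot(recall, precision)
    ax2.set_title(f"Precision-Recall curve (AUCPR = {auc(recall, precision):.4f})")
    ax2.set_xlabel("Recall")
    ax2.set_ylabel("Precision")
    return fig
```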
Here is the pipeline.py script to run the model evaluation node.
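Sketched below (dataset and node names are illustrative, consistent with the sketches above):

```python
# src/<package>/pipelines/model_evaluation/pipeline.py (sketch; dataset and
# node names are illustrative)
from kedro.pipeline import Pipeline, node

from .nodes import evaluate_model


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=evaluate_model,
                inputs=["predictions", "test_labels"],
                outputs="evaluation_plot",
                name="node_evaluate_model",
            ),
        ]
    )
```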
This model evaluation step is separated from the data science pipeline seen in Step 6. This separation is because we are using an unsupervised anomaly detection algorithm, and we do not expect to always have ground truth data.
Step 8 – Registering All Pipelines in Pipeline Registry
At this point, all the hard work in pipeline creation has been accomplished. We now need to conclude by importing and registering all three modular pipelines in the pipeline registry.
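Here is a sketch of what the pipeline registry (src/<package>/pipeline_registry.py) could look like; the short aliases such as de are assumptions chosen to match the pipeline name used later in Step 10, and the repo's registry may differ.

```python
# src/<package>/pipeline_registry.py (sketch; the short aliases such as "de"
# are assumptions chosen to match the pipeline name used in Step 10)
from typing import Dict

from kedro.pipeline import Pipeline

from .pipelines import data_engineering as de
from .pipelines import data_science as ds
from .pipelines import model_evaluation as me


def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines and the default run sequence."""
    data_engineering_pipeline = de.create_pipeline()
    data_science_pipeline = ds.create_pipeline()
    model_evaluation_pipeline = me.create_pipeline()

    return {
        "de": data_engineering_pipeline,
        "ds": data_science_pipeline,
        "me": model_evaluation_pipeline,
        "__default__": data_engineering_pipeline
        + data_science_pipeline
        + model_evaluation_pipeline,
    }
```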
The __default__ entry in the return statement indicates the default sequence of modular pipelines to run, which in our case is all three modular pipelines: data_engineering, data_science, and model_evaluation.
The beauty of Kedro is that its modular structure gives us flexibility in structuring our pipeline. For example, if we do not have ground truth labels, we can exclude model_evaluation from the default pipeline run.
Step 9 – Visualize the Pipeline
Before running the pipeline, it would be good to examine what we have built so far. The fantastic Kedro-Viz plugin allows us to readily visualize the entire pipeline structure and dependencies.
Given its ease of use, clarity, and aesthetic display, it is no surprise that many QuantumBlack clients expressed their delight at this feature.
We can easily generate the visualization with this command:
kedro viz
A new tab will open in our browser, and we will be greeted with a beautiful visualization tool to explore our pipeline structure. This visualization can also be easily exported as a .png image file.

Step 10 – Run the Pipeline
We are finally ready to run our pipeline. The following command will execute the default pipeline that we registered earlier.
kedro run
Upon running, the pipeline will populate the respective directories with the generated data, including the anomaly predictions and model evaluation plots.
We can also run specific pipelines registered in the pipeline registry. For example, if we wish to run only the data engineering modular pipeline (registered as de), we can add --pipeline=<NAME> to the command:
kedro run --pipeline de
Step 11 – Evaluating Pipeline Output
Finally, it is time to assess the output of our anomaly detection pipeline. In particular, let us review the evaluation plots (saved in _data/08_reporting_) to see how performant the model is.

The plots show that the isolation forest model achieves an AUC of 0.8486, which is a pretty good performance for a baseline machine learning model.
Additional Features
Congratulations on making it this far and successfully creating an anomaly detection pipeline with Kedro!
Beyond the fundamental functionalities, Kedro has other useful features for managing data science projects. Here are several capabilities worth mentioning:
(1) Experiment Tracking
Kedro makes it easy to set up experiment tracking and access logged metrics from each pipeline run. Besides its internal experiment tracking capabilities, Kedro integrates well with other MLOps services.
For example, the Kedro-Neptune plugin lets users enjoy the benefits of a nicely organized pipeline together with a powerful Neptune user interface for metadata management.

(2) Pipeline Slicing
The pipeline slicing capabilities of Kedro allow us to execute specific portions of the pipeline as we desire. For example, we can define the start and end nodes in the pipeline slice we wish to run:
kedro run --from-nodes train-test-split --to-nodes train_model
(3) Project Documentation, Packaging, and Deployment
We can generate project-specific documentation (built on the Sphinx framework) by running this command in the project's root directory:
kedro build-docs
Next, to initiate the packaging of the project as a Python library, we run the following command:
kedro package
Lastly, we can deploy these packaged data science pipelines via first-party plugins such as Kedro-Docker and Kedro-Airflow.
Conclusion
There are other interesting Kedro features and tutorials available, so check out the official documentation for further exploration. Also, go ahead and take a look at the GitHub repo containing all the codes of this project.
If you would like to learn how Kedro compares with other pipelining tools in the market, then check out the comprehensive NeptuneAI article below:
Kedro vs ZenML vs Metaflow: Which Pipeline Orchestration Tool Should You Choose? – neptune.ai
Before You Go
I welcome you to join me on a data science learning journey. Follow this Medium page and check out my GitHub to stay in the loop of practical and educational data science content. Meanwhile, have fun building data science pipelines with Kedro!
End-to-End AutoML Pipeline with H2O AutoML, MLflow, FastAPI, and Streamlit
Financial Fraud Detection with AutoXGB