
How To Use Kedro to Orchestrate Your MLOps Pipelines

Start your hands-on journey with MLOps here

Photo by Mizan M Latief on Unsplash

Introduction

MLOps is a difficult set of concepts to learn at home. If you don’t have a production environment, it can be hard to simulate what happens in the real world. I’ve been investigating ways to develop MLOps at home and have selected some tools which implement the concepts and just work. This article covers Kedro, a tool developed by QuantumBlack, a McKinsey company.

The official documentation describes it as…

An open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts from software engineering best-practice and applies them to machine-learning code; applied concepts include modularity, separation of concerns and versioning.

In this article, we’ll take a look at how Kedro helps us build repeatable, automated code known as a pipeline.

Get The Notebook

I’ve prepared a notebook with nicely refactored code, ready to turn into a Kedro project. You can use your own, but you may need to do some additional wrangling and refactoring.

To transition smoothly from a notebook to a pipeline, you need to make sure your code is refactored so that every transformation is written as a function. These functions will become the nodes in our Kedro pipeline.

To find out how to refactor your code, check out my article here:

Want to Implement MLOps at Home? Refactor Your Code

If you just want the notebook, say no more:

mlops-at-home/3 Refactored.ipynb at main · AdamShafi92/mlops-at-home

The final repo containing the complete Kedro project is located here:

mlops-at-home/2-pipelines/hpmlops at main · AdamShafi92/mlops-at-home

Create a Kedro Project

We’re now ready to start moving into MLOps.

Create a new folder somewhere on your computer. Navigate to this folder and open a terminal or bash window.

Start by creating a fresh environment; I recommend using Conda to manage your environments. Kedro recommends having one environment per project.

conda create -n house_prices_mlops python=3.8 anaconda

Make sure you activate the environment

conda activate house_prices_mlops

Then install Kedro using

conda install -c conda-forge kedro
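
To confirm the installation worked, you can ask Kedro to print its version banner:

kedro info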

Now initialise a new Kedro project. You will be asked for three names for different parts of the project; you can call all of these house_prices_mlops.

kedro new

Finally, you can initialise a git repo if needed. I haven’t done so this time.

You should end up with the following folder structure inside a folder with your project name, which we’ll refer to as project_name. It does look pretty complex, but there is a lot we can ignore for now. Very little actually needs to be configured to get started.

├── conf
   ├── base
      ├── catalog.yml
      ├── logging.yml
      ├── parameters.yml
   ├── local 
      ├── credentials.yml
├── data
   ├── 01_raw
   ├── 02_intermediate
   ├── 03_primary
   ├── 04_feature
   ├── 05_model_input
   ├── 06_models
   ├── 07_model_output
   ├── 08_reporting
├── docs
   ├── source
      ├── conf.py
      ├── index.rst
├── logs
   ├── journals
├── notebooks
├── src
   ├── project_name
      ├── __pycache__
      ├── pipelines
         ├── __init__.py 
      ├── __init__.py
      ├── __main__.py
      ├── cli.py
      ├── hooks.py
      ├── pipeline_registry.py
   ├── tests
      ├── pipelines
         ├── __init__.py
      ├── __init__.py
      ├── test_run.py 
   ├── requirements.in
   ├── requirements.txt
   ├── setup.py
├── pyproject.toml
├── README.md
├── setup.cfg

If you’ve downloaded my notebook, you can add it to the notebooks folder, ./house_prices_mlops/notebooks.

Add Python Packages to Requirements.txt

For every package imported into your notebooks, you need an entry in the requirements.txt file. This is a file containing all the Python packages used by the project. These are installed as part of Kedro’s initialisation steps.

If you’d like an easy way to extract requirements from a Jupyter Notebook, take a look here:

Generating a Requirements.txt File from a Jupyter Notebook

Once you’ve created your file, copy the contents into Kedro’s equivalent, which is located here:

├── conf
├── data
├── docs
├── logs
├── notebooks
├── src
   ├── requirements.txt

Note we are using the .txt file, not .in.

My finished requirements.txt file is located in the main repo, here.
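
For reference, the file simply lists one package per line. A minimal, illustrative excerpt (the exact packages and versions depend on your own notebook) might be:

kedro
pandas
numpy
scikit-learn

From the project root, you can also install everything it lists with pip install -r src/requirements.txt.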

Configure the Data Catalog

The data catalog is a .yml file used by Kedro to load and save data. It means we can access data using a dataset name instead of a filepath.

Every dataset you read in or save needs to be registered in the data catalog.

The catalog is located here:

├── conf
   ├── base
      ├── catalog.yml
      ├── logging.yml
      ├── parameters.yml
├── data
├── docs
├── logs
├── notebooks
├── src

First, copy your raw data file into the directory below.

├── conf
├── data
   ├── 01_raw
├── docs
├── logs
├── notebooks
├── src

Then, register your raw data in the catalog.yml file. My dataset is named house_prices, so the lines needed in the catalog.yml file are:

house_prices:
  type: pandas.CSVDataSet
  filepath: data/01_raw/train.csv

We can now reference this dataset in other areas of Kedro.

At this point you can save and close the catalog.yml file.
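
If you want to sanity-check the entry, Kedro makes the catalog available in an interactive session. A quick check, assuming you start the session from the project root with kedro ipython (or kedro jupyter notebook):

# the `catalog` object is pre-loaded in a Kedro interactive session
df = catalog.load("house_prices")  # load by catalog name, not by filepath
df.head()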

Create a New Pipeline

A pipeline is a series of transformations known as nodes. To create one, we need two Python files:

  • A nodes file, which contains the Python functions for the individual transformations.
  • A pipeline file, which specifies the order of those functions and their inputs and outputs.

We need to create some files and folders here to get started. Each pipeline should have its own pipeline and nodes file in its own folder. It is common to create separate pipelines for separate stages e.g. EDA, data processing, model training. Let’s create a data processing pipeline.

Create folders and empty Python files to match the structure below; the data_processing folder and the files inside it are the ones that need to be created.

├── conf
├── data
├── docs
├── logs
├── notebooks
├── src
   ├── project_name
      ├── pipelines
         ├── __init__.py
         ├── data_processing
            ├── __init__.py
            ├── nodes.py
            ├── pipeline.py

Start by opening the newly created __init__.py file inside data_processing and adding the line of code below. Then save and close it.

from .pipeline import create_pipeline

Creating the Nodes

A node is Kedro’s term for three things bundled together: a Python function, its inputs (the data and parameters it needs) and its outputs. If you are using my notebook above, the steps are already turned into simple functions which we can easily turn into nodes.

To create the nodes.py file,

  • Open the empty nodes.py file you created earlier.
  • Copy all Python package imports into the top of the file.
  • Copy every function from your notebook into the nodes.py file.

Here’s an example of one of the functions in the notebook.

import pandas as pd

def remove_outliers(train):
    train = train.copy()
    train = train.drop(
        train[(train['GrLivArea'] > 4000) &
              (train['SalePrice'] < 300000)].index)
    return train

Note: we may want to vary values such as the GrLivArea threshold. In this tutorial, we are going to do this using the parameters dictionary, a central dictionary that contains all the values used by our functions. This lets us change them from one location when running experiments.

We’ll add this function to a pipeline before updating the parameters dictionary.

Building the Pipeline

The pipeline just stitches nodes together and defines what should be used as keyword arguments for each function. It also defines where the output of each function should be saved. You don’t have to save between every step, but I find this easier. Let’s take a look at one example.

node(
    func=remove_outliers,
    inputs=["house_prices", "parameters"],
    outputs="house_prices_no_outliers",
    name="outliers_node",
)

  • func: Name of the function to use, from the nodes.py file.
  • inputs: In this example, a list containing the data source (using the name defined in the Data Catalog) and the parameters.yml file, which Kedro exposes under the reserved name "parameters".
  • outputs: Another Kedro data source, defined in the Data Catalog
  • name: Name of the node

Because we’ve specified an output name, this name also needs to be in the Data Catalog for Kedro to understand. As we’ll be creating a dataset for every node, we’ll update this in a batch later on.

Each node sits within the pipeline as follows, with each node() call containing the details in the format shown above between the brackets.

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(),
            node(),
            node(),
        ]
    )
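
Putting the pieces together, a minimal sketch of our data_processing/pipeline.py might look like this, assuming the only node so far is the outlier removal (Pipeline and node both come from kedro.pipeline, and remove_outliers from our own nodes.py):

from kedro.pipeline import Pipeline, node

from .nodes import remove_outliers


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=remove_outliers,
                inputs=["house_prices", "parameters"],
                outputs="house_prices_no_outliers",
                name="outliers_node",
            ),
            # further processing nodes are added to this list in the same way
        ]
    )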

The Parameters Dictionary

The parameters dictionary is a clean way for us to manage any parameter in our pipeline such as a function keyword argument, model hyperparameters or a loaded filepath.

The dictionary is a .yml file located here:

├── conf
   ├── base
      ├── catalog.yml
      ├── logging.yml
      ├── parameters.yml
├── data
├── docs
├── logs
├── notebooks
├── src

To utilise this, we need three things:

  • A key-value pair in the dictionary containing the information we need.
  • The parameters dictionary specified as an input in our nodes and pipeline files.
  • The dictionary accessed inside each function that uses it.

Let’s say we want to experiment with dropping outliers at different levels.

In the parameters dictionary, we need to give the values a name. I have nested these by putting the GrLivArea and SalePrice thresholds inside outliers.

outliers: 
  GrLivArea: 4000
  SalePrice: 300000

In Python, you could access this dictionary in the standard way, e.g.

parameters['outliers']['GrLivArea']
>>> 4000

We can now update our function to include the parameters dictionary.

  • Include the parameters dictionary as an argument of the function.
  • Index into it in the relevant part of the function.

def remove_outliers(train, parameters):
    train = train.copy()
    train = train.drop(
        train[(train['GrLivArea'] > parameters['outliers']['GrLivArea']) &
              (train['SalePrice'] < parameters['outliers']['SalePrice'])].index)
    return train

At the moment, ‘parameters’ is just an argument name. We supply the dictionary itself in the pipeline.py file. The parameters.yml file has its own reserved name in Kedro, so we simply include the string ‘parameters’ in the node’s inputs.

node(
    func=remove_outliers,
    inputs=["house_prices", "parameters"],
    outputs="house_prices_no_outliers",
    name="outliers_node",
)

Completing Nodes.py, Pipeline.py and the Parameters Dictionary

You should now have updated these files with a single function to remove outliers. The original notebook has many more functions we use to process the data, and these should also be registered.

Nodes.py

Add each function to the nodes file and ensure you include all relevant imports.

We register anything we want to vary in the parameters dictionary and make sure parameters is an argument of each function that uses it.

The finished nodes file is here.

Pipeline.py

We can now register each node in the pipeline. This is fairly straightforward – we just have to add the name of each node and the inputs/outputs, as we did before.

The finished pipelines file is here.

Parameters Dictionary

Ensure anything you’ve added as a parameter is also in the dictionary; my finished file is here.

Completing the Data Catalog

You’ll notice that in the nodes, we use strings to define the inputs and outputs of each function. These could be .csv files (or another compatible file type). We have to tell Kedro what each file is and give it a name. This is all specified in a single location, the Data Catalog. We already added one file but let’s look at it in a bit more detail.

The data catalog is a .yml file located here:

├── conf
   ├── base
      ├── catalog.yml
      ├── logging.yml
      ├── parameters.yml
├── data
├── docs
├── logs
├── notebooks
├── src

The minimum needed to define a data source is three things: a name, a data type and the filepath to the data. We previously added our house_prices dataset, which should look something like this.

house_prices:
  type: pandas.CSVDataSet
  filepath: data/01_raw/train.csv

However, in our first node, we also defined an output dataset, "house_prices_no_outliers". Kedro also needs to know where to save this, and it is defined in the same way.

house_prices_no_outliers:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/house_prices_no_outliers.csv

We’ll be saving CSV files after each processing step, so let’s define the rest now.

y_train:
  type: pandas.CSVDataSet
  filepath: data/05_model_input/y_train.csv
house_prices_drop:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/house_prices_drop.csv
house_prices_no_na:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/house_prices_no_na.csv
house_prices_clean:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/house_prices_clean.csv

Registering Pipelines

We’ve now defined all our data sources and parameters, created nodes and tied them together in a pipeline.

We can now put this together in the pipeline registry.

This is a .py file stored in:

├── conf
├── data
├── docs
├── logs
├── notebooks
├── src
   ├── project_name
      ├── pipeline_registry.py

In this file, we simply map each pipeline to a name. If there were any other pipelines, they would also be registered here, but we only have one at the moment.
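
A minimal sketch of pipeline_registry.py, assuming the package was named house_prices_mlops when you ran kedro new, might look like this:

from typing import Dict

from kedro.pipeline import Pipeline

# import our modular pipeline package and give it a short alias
from house_prices_mlops.pipelines import data_processing as dp


def register_pipelines() -> Dict[str, Pipeline]:
    # map pipeline names to Pipeline objects
    data_processing_pipeline = dp.create_pipeline()
    return {
        "__default__": data_processing_pipeline,  # what a plain `kedro run` executes
        "dp": data_processing_pipeline,
    }

With this in place, kedro run executes the "__default__" entry, and kedro run --pipeline=dp runs the data processing pipeline by name.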

Run the Project

We now have everything we need to run our project and process the data!

To actually run your pipelines, simply navigate to your root directory (the one containing the conf, data, docs, etc folders) and enter the following in your command line:

kedro run

Kedro will now run all your pipelines. You’ll most likely get a few warnings or error messages here due to deprecated packages or not having a git repo; for now, these can be ignored. You should see the following in the command line:

INFO - Completed 5 out of 5 tasks
INFO - Pipeline execution completed successfully

A handy way to understand your steps is to produce a visualisation. You’ll need to pip install kedro-viz to do this. You can read more about it here.

pip install kedro-viz

Simply run the following command in your command line to get the below interactive visualisation. There are filters and settings that I haven’t included in the screenshot below, but yours should look very similar.
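
kedro viz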

A flow chart showing each step in our Kedro pipeline. Image by Author.

Conclusions

We have now set up a way to automate pipelines. Kedro provides a more robust, standardised way to process data than a simple Jupyter Notebook, and the code is highly reusable. However, the full benefits may not be clear yet; for that, we need to start integrating other tools to automate preprocessing, testing and model tracking.

The diagram below shows my personal plan for developing MLOps at home. You can see that we have developed a framework and started to manage pipelines. Things aren’t fully automated and we have a long way to go, but this is a good start!

Developing MLOps at home. Image by Author.

Learn More

TabNet: The End of Gradient Boosting?

