Kedro: A New Tool For Data Science

A new Python library for production-ready data pipelines

Jo Stichbury
Towards Data Science

--

In this post, I will introduce Kedro, a new open source tool for data scientists and data engineers. After a brief description of what it is and why it is likely to become a standard part of every professional’s toolchain, I will describe how to use it in a tutorial that you should be able to complete in fifteen minutes. Strap in for a spaceflight to the future!

Image by stan (licensed under CC BY-SA 2.0)

Suppose you are a data scientist working for a senior executive who makes key financial decisions for your company. She asks you to provide an ad-hoc analysis, and when you do, she thanks you for delivering useful insights for her planning. Great!

Three months down the line, the newly-promoted executive, now your CEO, asks you to re-run the analysis for the next planning meeting…and you cannot. The code is broken because you’ve overwritten some of the key sections of the file, and you cannot remember the exact environment you used at the time. Or maybe the code is fine, but it’s in one big Jupyter notebook with all the file paths hard-coded, meaning you have to laboriously check and change every one of them for the new data inputs. Not so great!

Some everyday principles of software engineering are unfamiliar to data scientists, not all of whom have an extensive background in programming. In fact, many data scientists are self-taught, learning to program as part of a research project or as the need arises in their jobs. And, maybe, a few data scientists will argue that their code does not need to be of a high standard because they aren’t working on a production system.

Any code that feeds some business decision process should be considered as production code

Data scientists may not consider the primary output of their work to be code, but this doesn’t mean that the code they write shouldn’t follow the standards expected of a software engineering team. In fact, it should have the following minimum characteristics:

  • Versioning — Use git or a similar tool to save your changes regularly, whether you work alone or in a team.
  • Reproducibility — You should be able to transfer a project to another computer and run it without significant effort or reconfiguration.
  • Standards — Follow a common project structure, standard coding conventions and tooling.
  • Documentation — Automate documentation generation so that it stays up to date with your code.
  • Modularity — Structure code so that it can be executed and tested easily.

For more detail on these principles, take a look at a useful blog post by Thomas Huijskens from QuantumBlack.

What is Kedro?

Kedro is a development workflow tool that allows you to create portable data pipelines. It applies software engineering best practices to your data science code so it is reproducible, modular and well-documented. If you use Kedro you can worry less about how to write production-ready code (Kedro does the heavy lifting for you) and you’ll standardise the way your team collaborates, allowing you all to work more efficiently.

Many data scientists need to perform the routine tasks of data cleaning, processing and compilation that may not be their favourite activities but form a large percentage of their day to day tasks. Kedro makes it easier to build a data pipeline to automate the “heavy lifting” and reduce the amount of time spent on this kind of task.

Kedro was created by QuantumBlack, an advanced analytics firm that was acquired by the consultancy giant McKinsey & Company in 2015. QuantumBlack have used Kedro on more than 60 projects with McKinsey and have now decided to open source it. Their goal is to enable staff, clients and third-party developers to use Kedro and build upon it. By extending this open tool for their own needs, data professionals can complete their regular tasks more effectively and potentially share their enhancements with the community.

The team from QuantumBlack explain Kedro

Is Kedro a workflow scheduler?

Kedro is not a workflow scheduler like Airflow or Luigi. Kedro makes it easy to prototype your data pipeline, while Airflow and Luigi are complementary frameworks that are great at managing deployment, scheduling, monitoring and alerting. A Kedro pipeline is like a machine that builds a car part; Airflow and Luigi tell the different machines to switch on or off and work together to produce a car. QuantumBlack have built a Kedro-Airflow plugin, providing faster prototyping and reducing the barriers to entry associated with moving pipelines onto a workflow scheduler.

What do I need to know?

Kedro’s documentation has been designed to help beginners get started creating their own Kedro projects (in Python 3.5+). If you have only a very basic knowledge of Python, you might find the learning curve steeper, but bear with it and follow the documentation for guidance.

A fifteen minute spaceflight with Kedro

My easy-to-follow fifteen-minute tutorial is based on the following scenario:

It is 2160 and the space tourism industry is booming. Globally, there are thousands of space shuttle companies taking tourists to the moon and back. You have been able to source three (fictional) datasets covering the amenities offered in each space shuttle, customer reviews and company information. You want to train a linear regression model that uses these datasets to predict the price of shuttle hire. Before you can train the model, however, you need to do some data engineering: preparing the data for model building by cleaning it and combining it into a master table.

Getting set up

The first step is to pip install kedro into a new Python 3.5+ virtual environment (we recommend conda), and to confirm that it is correctly installed by typing:

kedro info

You should see the Kedro graphic as shown below:

The command also reports the version of Kedro that you are using. I wrote this tutorial against version 0.14.0 in May 2019, and since Kedro is an active open source project, changes to the codebase may, over time, cause minor differences in the example project. The example code will be kept up to date, although this blog post will not, so consult the Kedro documentation and release notes if you hit any inconsistencies.
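
Putting the setup together, the commands look something like this (the environment name is arbitrary, and conda is just one option; any Python 3.5+ virtual environment will do):

conda create --name spaceflights python=3.6 -y
conda activate spaceflights
pip install kedro
kedro info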

Kedro workflow

When building a Kedro project, you will typically follow a standard development workflow as shown in the diagram below. In this tutorial, I will walk through each of the steps.

Set up the project template

To keep things simple, the QuantumBlack team have provided all the code you need, so you can simply clone the example project from its GitHub repo. You do not need to set up a project template for this tutorial.

Once you have the code for the Spaceflights project, to make sure you have the necessary dependencies for it, please run the following in a Python 3.5+ virtual environment:

kedro install

Set up the data

Kedro uses configuration files to make a project’s code reproducible across different environments, where it may need to reference datasets in different locations.

To set up data for a Kedro project, you will typically add datasets to the data folder and configure the registry of data sources that manages the loading and saving of data.

This tutorial makes use of three datasets for spaceflight companies shuttling customers to the moon and back, and uses two data formats: .csv and .xlsx. The files are stored in the data/01_raw/ folder of the project directory.

To work with the datasets provided, all Kedro projects have a conf/base/catalog.yml file which acts as a registry of the datasets in use. Registering a dataset is as simple as adding a named entry to the .yml file, including the file location (path), parameters for the given dataset, the type of data and versioning. Kedro comes with support for a number of data types, such as CSV and Excel, and if you look at the catalog.yml file for the Spaceflights tutorial, you will see the following entries for the datasets used:

companies:
  type: CSVLocalDataSet
  filepath: data/01_raw/companies.csv

reviews:
  type: CSVLocalDataSet
  filepath: data/01_raw/reviews.csv
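
The third dataset, shuttles, is an Excel file rather than a CSV, so its entry uses a different dataset type. Assuming the ExcelLocalDataSet class that ships with this version of Kedro, its registration looks something like this:

shuttles:
  type: ExcelLocalDataSet
  filepath: data/01_raw/shuttles.xlsx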

To check whether Kedro loads the data correctly, and to inspect the first five rows of data, open a terminal window and start an IPython session in the Kedro project directory (just type exit() when you want to end the session and continue with the tutorial):

kedro ipython
io.load('companies').head()

Create and run the pipeline

The next part of the workflow is to create a pipeline from a set of nodes, which are Python functions that perform distinct, individual tasks. In a typical project, this stage comprises three steps:

  • Write and document your node functions
  • Build your pipeline(s) by specifying inputs and outputs for each of its nodes
  • Run Kedro, choosing whether to run the nodes sequentially or in parallel

Data engineering pipeline

We have reviewed the raw datasets for the Spaceflights project, and it is now time to consider the data engineering pipeline that processes the data and prepares it for the model within the data science pipeline. This pipeline preprocesses two datasets and merges them with a third into a master table.

In the file data_engineering.py inside the nodes folder you’ll find the preprocess_companies and preprocess_shuttles functions (specified as nodes within the pipeline in pipeline.py). Each function takes a dataframe as input and outputs a preprocessed dataset, preprocessed_companies and preprocessed_shuttles respectively.
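
To give a flavour of what such a node function contains, here is a minimal sketch of a preprocessing node written in the same style. The column names and cleaning steps are illustrative assumptions rather than the exact tutorial code:

import pandas as pd


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Clean up string-encoded columns in the companies data."""
    companies = companies.copy()
    # Hypothetical cleaning step: strip the '%' sign and cast ratings to float
    companies["company_rating"] = (
        companies["company_rating"].str.replace("%", "").astype(float)
    )
    # Hypothetical cleaning step: map textual flags ('t'/'f') to booleans
    companies["iata_approved"] = companies["iata_approved"] == "t"
    return companies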

When Kedro runs the data engineering pipeline, it determines whether datasets are registered in the data catalog (conf/base/catalog.yml). If a dataset is registered, it is persisted automatically to the path specified without a need to specify any code in the function itself. If a dataset is not registered, Kedro stores it in memory for the pipeline run and removes it afterwards.

In the tutorial, the preprocessed data is registered in conf/base/catalog.yml:

preprocessed_companies:
  type: CSVLocalDataSet
  filepath: data/02_intermediate/preprocessed_companies.csv

preprocessed_shuttles:
  type: CSVLocalDataSet
  filepath: data/02_intermediate/preprocessed_shuttles.csv

CSVLocalDataSet is chosen for its simplicity, but you can choose any other available dataset implementation class to save the data elsewhere, for example to a database table or cloud storage (such as S3 or Azure Blob Storage).
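
For example, to persist a dataset to S3 instead of the local filesystem you would only need to change its catalog entry. A hedged sketch, assuming Kedro’s CSVS3DataSet class, a hypothetical bucket name and a credentials key defined in conf/local/credentials.yml:

preprocessed_shuttles:
  type: CSVS3DataSet
  filepath: preprocessed_shuttles.csv
  bucket_name: my-spaceflights-bucket
  credentials: dev_s3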

The data_engineering.py file also includes a function, create_master_table, which the data engineering pipeline uses to join the companies, shuttles and reviews dataframes into a single master table.
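
A master table of this kind is typically built with a couple of joins. Here is a sketch of what create_master_table might look like; the join keys are hypothetical, since the real datasets define their own id columns:

import pandas as pd


def create_master_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame
) -> pd.DataFrame:
    """Join the preprocessed dataframes into a single master table."""
    # Hypothetical join keys: attach reviews to shuttles, then companies to shuttles
    rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
    master_table = rated_shuttles.merge(companies, left_on="company_id", right_on="id")
    # Drop rows with missing values so the model sees complete records only
    return master_table.dropna()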

Data science pipeline

The data science pipeline builds a price prediction model, using the simple LinearRegression implementation from the scikit-learn library. The code is found in the file src/kedro_tutorial/nodes/price_prediction.py.
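
For illustration, here is a minimal sketch of the kind of code these nodes contain. The function names and their inputs and outputs match the pipeline definition shown later in the post, but the feature columns and implementation details are assumptions rather than the exact tutorial code:

import logging

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def split_data(data: pd.DataFrame, parameters: dict):
    """Split the master table into training and test sets."""
    # Hypothetical feature columns; the tutorial's master table defines its own
    X = data[["engines", "passenger_capacity", "crew"]].values
    y = data["price"].values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return [X_train, X_test, y_train, y_test]


def train_model(X_train: np.ndarray, y_train: np.ndarray) -> LinearRegression:
    """Fit a simple linear regression model to the training data."""
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor


def evaluate_model(regressor: LinearRegression, X_test: np.ndarray, y_test: np.ndarray):
    """Log the coefficient of determination (R^2) on the test set."""
    score = regressor.score(X_test, y_test)
    logging.getLogger(__name__).info("Model has an R^2 of %.3f on test data.", score)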

The test size and random state parameters for the prediction model are specified in conf/base/parameters.yml, and Kedro adds them to the catalog as the parameters dataset so they can be fed into the pipeline when it is executed.
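
The parameters file is plain YAML. A sketch of what the relevant entries might look like (the exact values are illustrative):

test_size: 0.2
random_state: 3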

Combining the pipelines

The project’s pipelines are specified in pipeline.py:

def create_pipeline(**kwargs):
    de_pipeline = Pipeline(
        [
            node(preprocess_companies, "companies", "preprocessed_companies"),
            node(preprocess_shuttles, "shuttles", "preprocessed_shuttles"),
            node(
                create_master_table,
                ["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                "master_table",
            ),
        ],
        name="de",
    )

    ds_pipeline = Pipeline(
        [
            node(
                split_data,
                ["master_table", "parameters"],
                ["X_train", "X_test", "y_train", "y_test"],
            ),
            node(train_model, ["X_train", "y_train"], "regressor"),
            node(evaluate_model, ["regressor", "X_test", "y_test"], None),
        ],
        name="ds",
    )

    return de_pipeline + ds_pipeline

The de_pipeline will preprocess the data, then use create_master_table to combine preprocessed_shuttles, preprocessed_companies and reviews into the master_table dataset. The ds_pipeline will then create features, train and evaluate the model. The first node of the ds_pipeline outputs four objects: X_train, X_test, y_train and y_test, which are then used to train the model and, in a final node, to evaluate it.

The two pipelines are merged together in de_pipeline + ds_pipeline. The order in which you add the pipelines together is not significant since Kedro automatically detects the correct execution order for all the nodes. The same overall pipeline would result if you specified ds_pipeline + de_pipeline.

Both pipelines will be executed when you invoke kedro run, and you should see output similar to the following:

If you have any problems getting the tutorial code up and running, you should check that you are working in a Python 3.5+ environment and have all the dependencies installed for the project. If there are still problems, head over to Stack Overflow for guidance from the community and, if the behaviour you’re observing appears to be a problem with Kedro itself, please feel free to raise an issue on Kedro’s GitHub repository.

Kedro runners

There are two different ways to run a Kedro pipeline. You can specify:

  • SequentialRunner — runs your nodes sequentially: once a node has completed its task then the next one starts
  • ParallelRunner — runs your nodes in parallel: independent nodes are able to run at the same time, allowing you to take advantage of multiple CPU cores, and be more efficient when there are independent branches in your pipeline. Note that ParallelRunner performs task parallelisation (i.e. can run multiple tasks at once), but does not parallelise the calculations within the task.

By default, Kedro uses a SequentialRunner when you invoke kedro run. Switching to use ParallelRunner is as simple as providing an additional flag as follows:

kedro run --parallel
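
To make the two runners a little more concrete, here is a small, self-contained sketch that builds a toy two-node pipeline and runs it programmatically. It assumes the DataCatalog, MemoryDataSet, Pipeline, node and runner classes exposed by this version of Kedro; the functions themselves are made up for the example:

from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner


def double(numbers):
    return [n * 2 for n in numbers]


def total(numbers):
    return sum(numbers)


# Register a single in-memory dataset as the pipeline's raw input
catalog = DataCatalog({"numbers": MemoryDataSet(data=[1, 2, 3])})

pipeline = Pipeline(
    [
        node(double, "numbers", "doubled"),
        node(total, "doubled", "result"),
    ]
)

# SequentialRunner executes the nodes one after another; for pipelines with
# independent branches you could swap in ParallelRunner from kedro.runner.
print(SequentialRunner().run(pipeline, catalog))  # e.g. {'result': 12}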

That brings us to the end of this run through the basics of Kedro. There’s plenty more documentation on the various configuration options and optimisations you can use in your own projects.

Contributions welcome!

If you start working with Kedro and want to contribute an example or changes to the Kedro project, we would welcome the chance to work with you. Please consult the contribution guide.

Acknowledgements

The Spaceflights example is based on a tutorial written by the Kedro team at QuantumBlack Labs (Yetunde Dada, Ivan Danov, Dmitrii Deriabin, Lorena Balan, Gordon Wrigley, Kiyohito Kunii, Nasef Khan, Richard Westenra and Nikolaos Tsaousis), who have kindly given me early access to their code and documentation in order to produce this tutorial for TowardsDataScience. Thanks all!

--

Technical content creator writing about data science and software. Old-school Symbian C++ developer, now accidental cat herder and goose chaser.