Video Tutorial

The ultimate guide to building maintainable Machine Learning pipelines using DVC

Learn the principles for building maintainable Machine Learning pipelines using DVC

Déborah Mesquita
Towards Data Science
10 min read · Jul 6, 2020

When my ML projects start to evolve I usually get anxious: everything starts to get messy, and although I’m aware it’s becoming a mess, I don’t know what to do to improve it. I love working with open source tools and frameworks because, as the projects evolve, the knowledge of the contributors gets “embedded” in them. To me, this means that even if I don’t have much experience building reproducible machine learning pipelines, by using a tool created by people who do, I’m actually applying their principles to my project (and learning as well).

DVC is an open-source version control system for Machine Learning projects. At first, I thought it was just a Git for large files, but the system actually addresses all my needs for experiment and pipeline management. They recently released DVC 1.0 along with a new Get Started Guide, which I used as a starting point for this tutorial.

Today I’ll show you how to build reproducible Machine Learning pipelines using DVC. You can check the final code here. Enough said, let’s get started!

Thanks @realaxer for this cool picture! ;)

What we are going to build

Let’s build a model to classify the 20newsgroups dataset. To keep the evaluate phase simple we’ll only use two categories. This is the main script to do that:

Main script to train a classifier

A good workflow for writing clean and maintainable code is to keep improving the quality of the code as you go. It’s ok to experiment and see if something works (like we did in main.py above). Now that we know what we’re going to build, we can take the next step and make the code more maintainable.

💡 The principle to build pipelines using DVC

If we take a closer look at main.py we can break down the script into those famous Machine Learning steps:

1 — Gather data

2 — Generate the features

3 — Train the model

4 — Evaluate the model

Now that we have the steps, these are the principles to build maintainable pipelines using DVC:

  • Write a python script for each of these steps
  • Save the parameters each script uses in a yaml file
  • Specify the files each script depends on
  • Specify the files each script generates

Let’s install DVC and see how to implement those steps.

🔨 Installing DVC

I’m using Linux and we’re going to install DVC as a Python library. To follow along you should have Python 3, pip and Git installed in your environment. We’re going to create a virtual environment for the project (tip: if you care about managing your projects you should always do this). DVC works best in a Git repository, so we’ll initialize one before initializing the DVC project:

$ mkdir dvc_tutorial
$ cd dvc_tutorial
$ python3 -m venv .env
$ source .env/bin/activate
(.env)$ pip3 install dvc
(.env)$ git init
(.env)$ dvc init

And we’re ready to go. Now let’s implement each step.

📍 1 —The ‘gather data’ step

We’re using scikit-learn’s fetch_20newsgroups() function to load the data into memory, but I want to save it to disk so I can use it as a dependency for the next step of the pipeline. So in this script I’ll gather the data and save it to CSV files.

We’ll set the name for this stage as prepare.

I was using three categories at first (['comp.graphics', 'sci.space', 'rec.sport.baseball']), but when I got to the evaluation phase I thought the tutorial would be simpler if I used only two categories. I then had to build the dataset again, specifying the new list of categories. This showed me that it would be a good idea to make categories a parameter of this script. DVC uses a params.yaml file as the default parameters file, so let’s create one and define the categories there:

# file params.yaml
prepare:
  categories:
    - comp.graphics
    - sci.space

prepare is the name of the stage, categories is the name of the parameter and to create a list using yaml we add a - before each item.
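To make the naming concrete, here is a minimal sketch of how a dotted name such as prepare.categories maps onto the nested structure of params.yaml. The dictionary is written by hand to mirror the file; in a real script it would come from a YAML parser (for example PyYAML’s yaml.safe_load). The get_param helper is a hypothetical name I’m using only to illustrate the lookup; it is not part of DVC.

```python
from functools import reduce

# Hand-written mirror of params.yaml: one top-level key per stage,
# with that stage's parameters nested underneath.
params = {
    "prepare": {
        "categories": ["comp.graphics", "sci.space"],
    },
}


def get_param(tree, dotted_name):
    """Walk a nested dict following a name like 'prepare.categories'."""
    return reduce(lambda node, key: node[key], dotted_name.split("."), tree)


categories = get_param(params, "prepare.categories")
print(categories)
```

The same stage.parameter naming shows up again when we tell DVC which parameters a stage depends on.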

I want to save the data to a data/prepared folder, so we’ll use the script to do that. Here is the final prepare.py file:

To execute this stage of the pipeline we only depend on the script file itself, and we will save the files generated by the script in the data/prepared folder. Now we have all the components to build this step:

  • Write a python script: prepare.py
  • Save the parameters: categories inside params.yaml
  • Specify the files the script depends on: prepare.py
  • Specify the files the script generates: the folder data/prepared

To keep things organized we’ll save the scripts inside a src folder, so let’s create one:

(.env)$ mkdir src  # now save the prepare.py file inside src/

The contents of your dvc_tutorial folder should look like this:

├── params.yaml
└── src
    └── prepare.py

All set, now let’s learn how to build this stage using DVC.

⏺ dvc run — Building stages using DVC

DVC saves the pipeline stages in a dvc.yaml file (human-readable) and a dvc.lock file (for DVC’s own use). To create a pipeline stage we use the dvc run command. These are the main options:

-n <stage>: specify a name for the stage generated by this command
-p [<path>:]<params_list>: specify a set of parameter dependencies the stage depends on
-d <path>: specify a file or a directory the stage depends on
-o <path>: specify a file or directory that is the result of running the command

After setting these options we add a command argument where we specify how to actually run this step of the pipeline. In this step the command will be python3 src/prepare.py. Let’s first install the dependencies prepare.py needs:

(.env)$ pip install pyyaml scikit-learn pandas

And now let’s run the dvc run command to generate the stage:

(.env)$ dvc run -n prepare -p prepare.categories -d src/prepare.py -o data/prepared python3 src/prepare.py
Running stage 'prepare' with command:
    python3 src/prepare.py
Creating 'dvc.yaml'
Adding stage 'prepare' in 'dvc.yaml'
Generating lock file 'dvc.lock'
To track the changes with git, run:
    git add dvc.yaml data/.gitignore dvc.lock
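If it helps to see how the four ingredients of the recipe line up with the command-line options, here is a small sketch that assembles the same dvc run call from a plain description of the stage. Stage and dvc_run_args are hypothetical names of my own, not part of DVC; the sketch only builds the argument list and prints it, it doesn’t execute anything.

```python
import shlex
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stage:
    """Plain description of one pipeline stage (illustrative only)."""
    name: str
    cmd: List[str]
    params: List[str] = field(default_factory=list)
    deps: List[str] = field(default_factory=list)
    outs: List[str] = field(default_factory=list)


def dvc_run_args(stage: Stage) -> List[str]:
    """Translate a Stage into the argument list for `dvc run`."""
    args = ["dvc", "run", "-n", stage.name]
    for param in stage.params:
        args += ["-p", param]   # parameter dependencies
    for dep in stage.deps:
        args += ["-d", dep]     # file or directory dependencies
    for out in stage.outs:
        args += ["-o", out]     # outputs produced by the command
    return args + stage.cmd     # the command itself always goes last


prepare = Stage(
    name="prepare",
    cmd=["python3", "src/prepare.py"],
    params=["prepare.categories"],
    deps=["src/prepare.py"],
    outs=["data/prepared"],
)

print(shlex.join(dvc_run_args(prepare)))
```

The printed line matches the command we typed by hand: name, parameters, dependencies, outputs, and finally the command to run.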

Now your folder should look like this:

├── data
│   └── prepared
│       ├── test.csv
│       └── train.csv
├── dvc.lock
├── dvc.yaml
├── params.yaml
└── src
    └── prepare.py

And these are the contents of the dvc.yaml file auto-generated by DVC:

stages:
  prepare:
    cmd: python3 src/prepare.py
    deps:
    - src/prepare.py
    params:
    - prepare.categories
    outs:
    - data/prepared

Pretty straightforward, right?

🔬 dvc dag — Visualize the pipeline using DVC

The dvc dag command displays the stages of a pipeline. We only have one stage so far but let’s see it:

(.env)$ dvc dag
+---------+
| prepare |
+---------+
~
~
/tmp/tmpixsrsfo0 (END)

You can hit q to hide the visualization.

⏯ dvc repro — Reproducing the pipelines using DVC

The dvc repro command reproduces complete or partial pipelines by executing the commands defined in their stages. As the docs say:

DVC caches relevant data artifacts along the way and recursively searches in pipeline stages to determine which ones have changed. Then it executes the corresponding commands. Outputs are deleted from the workspace before executing the stages command that produces them. — repro docs

Cool, let’s test it then (I didn’t make any changes yet):

(.env)$ dvc repro
Stage 'prepare' didn't change, skipping
Data and pipelines are up to date.

What if I change the name of a category?

# file params.yaml
prepare:
  categories:
    - comp.graphics
    - rec.sport.baseball # it was 'sci.space'

And then run the command again?

(.env)$ dvc repro
Running stage 'prepare' with command:
    python3 src/prepare.py
Updating lock file 'dvc.lock'
To track the changes with git, run:
    git add dvc.lock

It runs the stage again because we added -p prepare.categories as a parameter for this stage. DVC then saw that we changed this parameter and ran the stage again. The dvc.yaml file is still the same but if you check the dvc.lock file you’ll see that the parameters changed there. Amazing right?
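To get an intuition for what “DVC saw that we changed this parameter” means, here is a deliberately simplified sketch of change detection: hash the stage’s dependencies together with its parameter values, and re-run only when the hash differs from the one recorded last time. This is my own illustration of the idea, not DVC’s actual implementation; fingerprint and needs_rerun are hypothetical names, and the use of SHA-256 over a temporary file is an assumption made just to keep the example self-contained.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(dep_paths, params):
    """Hash the bytes of every dependency plus the parameter values."""
    digest = hashlib.sha256()
    for path in sorted(dep_paths):
        digest.update(Path(path).read_bytes())
    # sort_keys makes the serialized parameters independent of dict order
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def needs_rerun(dep_paths, params, recorded):
    """A stage is stale when its current fingerprint differs from the recorded one."""
    return fingerprint(dep_paths, params) != recorded


with tempfile.TemporaryDirectory() as workdir:
    script = Path(workdir) / "prepare.py"
    script.write_text("print('gathering data')\n")

    old_params = {"categories": ["comp.graphics", "sci.space"]}
    recorded = fingerprint([script], old_params)  # plays the role of dvc.lock

    new_params = {"categories": ["comp.graphics", "rec.sport.baseball"]}
    unchanged = needs_rerun([script], old_params, recorded)
    changed = needs_rerun([script], new_params, recorded)

print(unchanged, changed)  # False True
```

With the same script and the same parameters nothing needs to run; swapping one category is enough to mark the stage as stale.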

So these are the basics for creating, running and visualizing pipeline stages:

  • dvc run
  • dvc dag
  • dvc repro

Now let’s go to the next pipeline step.

📍 2 — The ‘generate features’ step

We’ll set the name for this stage as featurize.

In this step, we’ll use scikit-learn’s TfidfVectorizer and save the transformed matrices as pickle files inside the data/features folder. We’ll depend on the script file and the data/prepared folder, so this is our recipe:

  • Write a python script: featurize.py
  • Save the parameters: (we don’t need any)
  • Specify the files the script depends on: featurize.py and data/prepared
  • Specify the files the script generates: the folder data/features

This is the final featurize.py file:

You should save it inside the src folder and then create the stage using DVC like this:

(.env)$ dvc run -n featurize -d src/featurize.py -d data/prepared -o data/features python3 src/featurize.py data/prepared data/features

Go ahead and run the dvc dag command to check the new step of the pipeline.

📍 3 — The ‘train the model’ step

In this stage, we’ll finally train the model and save it to a pickle file. We’ll use the Naive Bayes classifier and only set the alpha parameter. This is the recipe:

  • Write a python script: train.py
  • Save the parameters: alpha inside params.yaml
  • Specify the files the script depends on: train.py and data/features
  • Specify the files the script generates: the file model.pkl

First, let’s add the new alpha parameter:

# file params.yaml
prepare:
  categories:
    - comp.graphics
    - rec.sport.baseball
train:
  alpha: 0.1

Then save the train.py script inside the src folder:

And finally, create the stage using DVC:

(.env)$ dvc run -n train -p train.alpha -d src/train.py -d data/features -o model.pkl python3 src/train.py data/features model.pkl

📍 4 — The ‘evaluate the model’ step

In this step, we’ll get to know two new dvc run options: --metrics and --plots:

-m <path>: specify a metrics file produced by this stage. This option behaves like -o but registers the file in a metrics field inside the dvc.yaml stage
--plots <path>: specify a plot metrics file produced by this stage. This option behaves like -o but registers the file in a plots field inside the dvc.yaml stage

We’ll use the Area Under the Curve (AUC) metric and compute the precision-recall pairs for different probability thresholds so we can plot the precision-recall curve. To do that we’ll create a script that saves a scores.json file with the AUC score and a plots.json file with the precision/recall/threshold values:

Go ahead and save this evaluate.py script inside the src folder.
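As a rough picture of what the two files contain, here is a minimal sketch of the part of an evaluation script that writes them. It is not the evaluate.py from this tutorial: write_metrics is a hypothetical helper, the numbers are made up, and the "proc" key simply follows the plots.json excerpt shown later in this article. It assumes the AUC and the precision/recall/threshold sequences were already computed elsewhere.

```python
import json
import tempfile
from pathlib import Path


def write_metrics(auc, precision, recall, thresholds,
                  scores_path, plots_path, curve_key="proc"):
    """Write a one-value metrics file and a list-of-points plots file."""
    Path(scores_path).write_text(json.dumps({"auc": auc}, indent=4))

    # zip stops at the shortest sequence, so extra trailing precision/recall
    # values without a matching threshold are dropped.
    points = [
        {"precision": p, "recall": r, "threshold": t}
        for p, r, t in zip(precision, recall, thresholds)
    ]
    Path(plots_path).write_text(json.dumps({curve_key: points}, indent=4))


# Made-up values, only to show the shape of the output files.
with tempfile.TemporaryDirectory() as workdir:
    scores_file = Path(workdir) / "scores.json"
    plots_file = Path(workdir) / "plots.json"
    write_metrics(0.9, [0.5, 1.0, 1.0], [1.0, 0.5, 0.0], [0.3, 0.7],
                  scores_file, plots_file)
    scores = json.loads(scores_file.read_text())
    plots = json.loads(plots_file.read_text())

print(scores)
print(len(plots["proc"]), "points")
```

One flat JSON object per metrics file and one list of records per plots file keeps both easy to diff in Git.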

Regarding the metrics and plots files, we have two options:

  • Let DVC track the metrics/plots files or
  • Track the metrics/plots files using Git

Since this tutorial hasn’t covered how to use DVC to track files, we’ll go with the second option and track the files ourselves using Git. To accomplish that we’re going to use the --metrics-no-cache and --plots-no-cache options. This is the recipe for this stage:

  • Write a python script: evaluate.py
  • Save the parameters: (we don’t need any)
  • Specify the files the script depends on: evaluate.py, model.pkl and data/features
  • Specify the files the script generates: (none)
  • (NEW) Specify the metrics and plots files: scores.json and plots.json

Ok, now let’s create this last step:

(.env)$ dvc run -n evaluate -d src/evaluate.py -d model.pkl -d data/features --metrics-no-cache scores.json --plots-no-cache plots.json python3 src/evaluate.py model.pkl data/features scores.json plots.json
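With all four stages defined, the order in which they have to run follows from the files alone: a stage that depends on a path must come after the stage that produces it. Here is a small sketch of that idea using the standard library’s graphlib (Python 3.9+). It is an illustration under my own assumptions, not how DVC is implemented; the stages dictionary is transcribed by hand from the dvc run commands above and execution_order is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependencies and outputs of each stage, copied from the dvc run commands.
stages = {
    "prepare": {
        "deps": ["src/prepare.py"],
        "outs": ["data/prepared"],
    },
    "featurize": {
        "deps": ["src/featurize.py", "data/prepared"],
        "outs": ["data/features"],
    },
    "train": {
        "deps": ["src/train.py", "data/features"],
        "outs": ["model.pkl"],
    },
    "evaluate": {
        "deps": ["src/evaluate.py", "model.pkl", "data/features"],
        "outs": ["scores.json", "plots.json"],
    },
}


def execution_order(stages):
    """Order stages so that every producer runs before its consumers."""
    # Map each output path to the stage that produces it.
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    # For each stage, collect the stages that produce its dependencies.
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))
# ['prepare', 'featurize', 'train', 'evaluate']
```

Script files such as src/train.py are dependencies too, but since no stage produces them they don’t add edges to the graph.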

🏋🏿 dvc metrics— Comparing metrics using DVC

The dvc metrics command lets us display and compare metrics. dvc metrics show prints the metric values, and dvc metrics diff shows the difference between metric values when the metrics files have changed (before committing them with Git). Let’s see this in action.

The dvc metrics show command lets us see the current scores:

(.env)$ dvc metrics show
    scores.json:
        auc: 0.9993366236676577

(☕️ hmm, that smell of overfitting… but that’s not our point today so let’s go on 😜)

Let’s check the parameters we’re currently using:

# file params.yaml
prepare:
  categories:
    - comp.graphics
    - rec.sport.baseball
train:
  alpha: 0.1

The dvc metrics diff command compares the metrics from a previous commit with the current state of the workspace, so let’s first commit this experiment:

(.env)$ git add src/ params.yaml dvc.yaml dvc.lock scores.json plots.json
(.env)$ git commit -m "exp: alpha=0.1"

Now let’s change the alpha parameter:

# file params.yaml
prepare:
  categories:
    - comp.graphics
    - rec.sport.baseball
train:
  alpha: 0.9

And then run everything again:

(.env)$ dvc repro
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Restored stage 'train' from run-cache
Skipping run, checking out outputs
Updating lock file 'dvc.lock'
Restored stage 'evaluate' from run-cache
Skipping run, checking out outputs
Updating lock file 'dvc.lock'
To track the changes with git, run:
    git add dvc.lock

We can see the params diff with dvc params diff:

(.env)$ dvc params diff
Path         Param        Old    New
params.yaml  train.alpha  0.1    0.9

And finally, see how the scores change with dvc metrics diff:

(.env)$ dvc metrics diff
Path         Metric    Value    Change
scores.json  auc       0.99869  -0.00064
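The Value column is the metric in the current workspace and Change is that value minus the committed one. A minimal sketch of the arithmetic, with made-up numbers and a hypothetical metrics_diff helper (this is not DVC’s code):

```python
def metrics_diff(old, new):
    """For each metric present in both versions, report the new value and the change."""
    return {
        name: {"value": value, "change": value - old[name]}
        for name, value in new.items()
        if name in old
    }


# Made-up scores: 'committed' stands for the last Git commit,
# 'workspace' for the scores.json currently on disk.
committed = {"auc": 0.90}
workspace = {"auc": 0.85}

diff = metrics_diff(committed, workspace)
for name, row in diff.items():
    print(f"{name}  {row['value']:.5f}  {row['change']:+.5f}")
```

A negative change means the metric went down compared to the committed experiment, which is what happened to our AUC when alpha went from 0.1 to 0.9.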

📊 dvc plots — Visualize and compare metrics using DVC

The dvc plots command generates plots as HTML files that can be opened with a web browser. These HTML files use Vega-Lite. Let’s plot a precision-recall curve. The plots.json file looks like this:

# file plots.json
{
    "proc": [
        {
            "precision": 0.927570093457944,
            "recall": 1.0,
            "threshold": 0.4513363759032511
        },
        {
            "precision": 0.927400468384075,
            "recall": 0.9974811083123426,
            "threshold": 0.45201756623495926
        },
        # [...]
    ]
}

And we want a graph with precision on the y-axis and recall on the x-axis, so let’s build one:

(.env)$ dvc plots show -y precision -x recall plots.json
The precision recall curve

We can even plot the difference between the precision scores for alpha=0.1 and alpha=0.9:

(.env)$ dvc plots diff --targets plots.json -y precision
Different precision scores for different alpha values

And that’s the end of our tour. You can check more plots options and configurations here: https://dvc.org/doc/command-reference/plots.

The final code for this tutorial is here

Final remarks

Using DVC to track experiments and manage Machine Learning pipelines can really take our projects to the next level. The key to making your ML projects reproducible is to create a separate Python script for each step and specify the parameters, inputs and outputs used by each script. You can do that with a simple dvc run command and use dvc repro to run the pipeline whenever you like.

Besides experiment and pipeline management, DVC also provides version control, plus deployment and collaboration features. You can read more about them here and here.

That’s it for today, thanks for reading! 😁

Award-winning Data Scientist 👩🏾‍💻 Loves to write and explain things in different ways✨ - http://deborahmesquita.com/