The ultimate guide to building maintainable Machine Learning pipelines using DVC
Learn the principles for building maintainable Machine Learning pipelines using DVC
When my ML projects start to evolve I usually get anxious: everything becomes messy, I know it’s becoming a mess, and I don’t know how to improve it. I love working with open source tools and frameworks because, as the projects evolve, the knowledge of the contributors gets “embedded” in them. To me, this means that even if I don’t have much experience building reproducible machine learning pipelines, by using a tool created by people who do, I’m actually applying their principles in my project (and learning as well).
DVC is an open-source version control system for Machine Learning projects. At first, I thought it was just a Git for large files, but the system actually addresses all my needs for experiment and pipeline management. They recently released DVC 1.0 along with a new Get Started Guide, which I used as a starting point for this tutorial.
Today I’ll show you how to build reproducible Machine Learning pipelines using DVC. You can check the final code here. Enough said, let’s get started!
What we are going to build
Let’s build a model to classify the 20newsgroups dataset. To keep the evaluate phase simple we’ll only use two categories. This is the main script to do that:
A good workflow for writing clean and maintainable code is to keep improving its quality all the time. It’s ok to experiment and see if something works (like we did in main.py above). Now that we know what we’re going to build, we can take the next step and make the code more maintainable.
💡 The principle to build pipelines using DVC
If we take a closer look at main.py we can break down the script into those famous Machine Learning steps:
1 — Gather data
2 — Generate the features
3 — Train the model
4 — Evaluate the model
Now that we have the steps, these are the principles to build maintainable pipelines using DVC:
- Write a Python script for each of these steps
- Save the parameters each script uses in a yaml file
- Specify the files each script depends on
- Specify the files each script generates
Let’s install DVC and see how to implement those steps.
🔨 Installing DVC
I’m using Linux and we’re going to install DVC as a Python library. To follow along you should have Python 3, pip and Git installed in your environment. We’re going to create a virtual environment for the project (tip: if you care about managing your projects you should always do this). DVC works best in a Git repository, so we’ll initialize one before initializing the DVC project:
$ mkdir dvc_tutorial
$ cd dvc_tutorial
$ python3 -m venv .env
$ source .env/bin/activate
(.env)$ pip3 install dvc
(.env)$ git init
(.env)$ dvc init
And we’re ready to go. Now let’s implement each step.
📍 1 — The ‘gather data’ step
We’re using scikit-learn’s fetch_20newsgroups() function to load the data into memory, but I want to save it to a file so I can use it as a dependency for the next step of the pipeline. So in this script I’ll gather the data and save it to csv files.
We’ll set the name for this stage as prepare.
I was using three categories at first (['comp.graphics', 'sci.space', 'rec.sport.baseball']), but when I got to the evaluation phase I thought that the tutorial would be simpler if I used only two categories. I then had to build the dataset again, specifying the new list of categories. This showed me that it would be a good idea to make categories a parameter of this script. DVC uses a params.yaml file as the default parameters file, so let’s create one and define the categories there:
# file params.yaml
prepare:
  categories:
    - comp.graphics
    - sci.space
prepare is the name of the stage, categories is the name of the parameter, and to create a list in yaml we add a - before each item.
I want to save the data to a data/prepared folder, so we’ll use the script to do that. Here is the final prepare.py file:
To execute this stage of the pipeline we only depend on the script code file, and we will save the files generated by the script in the data/prepared folder. Now we have all the components to build this step:
- Write a Python script: prepare.py
- Save the parameters: categories inside params.yaml
- Specify the files the script depends on: prepare.py
- Specify the files the script generates: the folder data/prepared
To keep things organized we’ll save the scripts inside a src
folder, so let’s create one:
(.env)$ mkdir src
(.env)$ cd src # now save the prepare.py file here, then go back to the project root with cd ..
The contents of your dvc_tutorial folder should look like this:
├── params.yaml
└── src
    └── prepare.py
All set, now let’s learn how to build this stage using DVC.
⏺ dvc run — Building stages using DVC
DVC saves the pipeline stages in a dvc.yaml file (human readable) and a dvc.lock file (this one is for DVC use only). To create a pipeline stage we use the dvc run command. These are the main options:
-n <stage>: specify a name for the stage generated by this command
-p [<path>:]<params_list>: specify a set of parameter dependencies the stage depends on
-d <path>: specify a file or a directory the stage depends on
-o <path>: specify a file or directory that is the result of running the command
After setting these options we add a command argument where we specify how to actually run this step of the pipeline. In this step the command will be python3 src/prepare.py. Let’s first install the dependencies prepare.py needs:
(.env)$ pip install pyyaml scikit-learn pandas
And now let’s run the dvc run command to generate the stage:
(.env)$ dvc run -n prepare -p prepare.categories -d src/prepare.py -o data/prepared python3 src/prepare.py
Running stage 'prepare' with command:
    python3 src/prepare.py
Creating 'dvc.yaml'
Adding stage 'prepare' in 'dvc.yaml'
Generating lock file 'dvc.lock'
To track the changes with git, run:
    git add dvc.yaml data/.gitignore dvc.lock
Now your folder should look like this:
├── data
│   └── prepared
│       ├── test.csv
│       └── train.csv
├── dvc.lock
├── dvc.yaml
├── params.yaml
└── src
    └── prepare.py
And these are the contents of the dvc.yaml file auto-generated by DVC:
stages:
  prepare:
    cmd: python3 src/prepare.py
    deps:
    - src/prepare.py
    params:
    - prepare.categories
    outs:
    - data/prepared
Pretty straightforward, right?
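To make the link between the four-item recipe and the command line explicit, here is a small sketch that assembles the dvc run arguments from a description of the stage. The Stage class and its dvc_run_argv() method are hypothetical helpers I made up for illustration (they are not part of DVC); the sketch only uses the -n, -p, -d and -o options described above and assumes the script is always listed as a dependency of its own stage.

```python
from dataclasses import dataclass, field
import shlex


@dataclass
class Stage:
    """Hypothetical description of one pipeline step (not a DVC class)."""
    name: str
    script: str
    params: list = field(default_factory=list)  # e.g. ["prepare.categories"]
    deps: list = field(default_factory=list)    # extra files/folders it reads
    outs: list = field(default_factory=list)    # files/folders it writes
    args: list = field(default_factory=list)    # arguments passed to the script

    def dvc_run_argv(self):
        """Build the `dvc run` argument list using the -n/-p/-d/-o options."""
        argv = ["dvc", "run", "-n", self.name]
        for param in self.params:
            argv += ["-p", param]
        # Assumption: the script is always a dependency of its own stage.
        for dep in [self.script, *self.deps]:
            argv += ["-d", dep]
        for out in self.outs:
            argv += ["-o", out]
        return argv + ["python3", self.script, *self.args]


prepare = Stage(
    name="prepare",
    script="src/prepare.py",
    params=["prepare.categories"],
    outs=["data/prepared"],
)
# Prints the same command we typed by hand for the prepare stage.
print(shlex.join(prepare.dvc_run_argv()))
```

Writing the recipe down as data first is just a way to check that nothing is missing before running the command; typing the command by hand works equally well.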
🔬 dvc dag — Visualize the pipeline using DVC
The dvc dag command displays the stages of a pipeline. We only have one stage so far but let’s see it:
(.env)$ dvc dag
+---------+
| prepare |
+---------+
~
~
/tmp/tmpixsrsfo0 (END)
You can hit q
to hide the visualization.
⏯ dvc repro — Reproducing the pipelines using DVC
The dvc repro command reproduces complete or partial pipelines by executing the commands defined in their stages. As the docs say:
DVC caches relevant data artifacts along the way and recursively searches in pipeline stages to determine which ones have changed. Then it executes the corresponding commands. Outputs are deleted from the workspace before executing the stages command that produces them. — repro docs
Cool, let’s test it then (I didn’t make any changes yet):
(.env)$ dvc repro
Stage 'prepare' didn't change, skipping
Data and pipelines are up to date.
What if I change the name of a category?
# file params.yaml
prepare:
  categories:
    - comp.graphics
    - rec.sport.baseball # it was 'sci.space'
And then run the command again?
(.env)$ dvc repro
Running stage 'prepare' with command:
    python3 src/prepare.py
Updating lock file 'dvc.lock'
To track the changes with git, run:
    git add dvc.lock
It runs the stage again because we registered prepare.categories as a parameter of this stage (with -p prepare.categories). DVC saw that we changed this parameter and ran the stage again. The dvc.yaml file is still the same, but if you check the dvc.lock file you’ll see that the parameters changed there. Amazing, right?
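If you are curious about the idea behind this "did anything change?" check, here is a toy sketch of it. This is not how DVC is implemented: the fingerprint() and needs_rerun() functions, the dictionary used as a lock file and the fake script contents are all assumptions made for illustration. The point is only that a stage can be skipped when a hash of its dependencies and parameters matches the one recorded after the last run.

```python
import hashlib
import json


def fingerprint(dep_contents, params):
    """Hash a stage's dependency contents together with its parameter values."""
    digest = hashlib.sha256()
    for path in sorted(dep_contents):
        digest.update(path.encode())
        digest.update(dep_contents[path])
    # sort_keys makes the serialization independent of dict ordering.
    digest.update(json.dumps(params, sort_keys=True).encode())
    return digest.hexdigest()


def needs_rerun(lock, stage, dep_contents, params):
    """Return True when the stage's inputs differ from the recorded fingerprint."""
    return lock.get(stage) != fingerprint(dep_contents, params)


# Stand-in for the bytes of the real script (hypothetical content).
deps = {"src/prepare.py": b"print('prepare')\n"}
old_params = {"prepare.categories": ["comp.graphics", "sci.space"]}
new_params = {"prepare.categories": ["comp.graphics", "rec.sport.baseball"]}

# Toy "lock file": the fingerprint recorded after the first run.
lock = {"prepare": fingerprint(deps, old_params)}

print(needs_rerun(lock, "prepare", deps, old_params))  # inputs unchanged
print(needs_rerun(lock, "prepare", deps, new_params))  # params.yaml edited
```

The first call reports that nothing changed, the second one reports that the stage has to run again, which mirrors the two dvc repro runs above.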
So these are the basics for creating, running and visualizing pipeline stages:
dvc run
dvc dag
dvc repro
Now let’s go to the next pipeline step.
📍 2 — The ‘generate features’ step
We’ll set the name for this stage as featurize.
In this step, we’ll use scikit-learn’s TfidfVectorizer and save the transformed matrices to pickle files inside the data/features folder. We’ll depend on the script file and the data/prepared folder, so this is our recipe:
- Write a Python script: featurize.py
- Save the parameters: (we don’t need any)
- Specify the files the script depends on: featurize.py and data/prepared
- Specify the files the script generates: the folder data/features
This is the final featurize.py file:
You should save it inside the src folder and then create the stage using DVC like this:
(.env)$ dvc run -n featurize -d src/featurize.py -d data/prepared -o data/features python3 src/featurize.py data/prepared data/features
Go ahead and run the dvc dag command to check the new step of the pipeline.
📍 3 — The ‘train the model’ step
In this stage, we’ll finally train the model and save it to a pickle file. We’ll use the Naive Bayes classifier and only set the alpha parameter. This is the recipe:
- Write a Python script: train.py
- Save the parameters: alpha inside params.yaml
- Specify the files the script depends on: train.py and data/features
- Specify the files the script generates: the file model.pkl
First, let’s add the new alpha parameter:
# file params.yaml
prepare:
  categories:
    - comp.graphics
    - sci.space
train:
  alpha: 0.1
Then save the train.py script inside src:
And finally, create the stage using DVC:
(.env)$ dvc run -n train -p train.alpha -d src/train.py -d data/features -o model.pkl python3 src/train.py data/features model.pkl
📍 4 — The ‘evaluate the model’ step
In this step, we’ll get to know two new dvc run options: --metrics and --plots:
-m <path>: specify a metrics file produced by this stage. This option behaves like -o but registers the file in a metrics field inside the dvc.yaml stage
--plots <path>: specify a plot metrics file produced by this stage. This option behaves like -o but registers the file in a plots field inside the dvc.yaml stage
We’ll use the Area Under the Curve (AUC) metric and compute the precision-recall pairs for different probability thresholds to draw the plots. To do that we’ll create a script that saves a scores.json file with the AUC score and a plots.json file with the precision/recall/threshold values:
Go ahead and save this evaluate.py script inside the src folder.
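To give an idea of the shape of these two files, here is a minimal sketch of the part of such a script that writes them. It is not the evaluate.py from this tutorial: write_metrics() is a hypothetical helper, the numbers in the usage part are made up, and the "proc" key is simply copied from the plots.json excerpt shown later in this post. I assume the AUC and the precision/recall/threshold arrays were already computed (for instance with scikit-learn's roc_auc_score and precision_recall_curve).

```python
import json
import os
import tempfile


def write_metrics(auc, precision, recall, thresholds, scores_path, plots_path):
    """Save the AUC score and the precision/recall/threshold points as JSON."""
    with open(scores_path, "w") as fd:
        json.dump({"auc": auc}, fd)

    # zip() stops at the shortest sequence, so trailing precision/recall
    # values that have no matching threshold are dropped.
    points = [
        {"precision": p, "recall": r, "threshold": t}
        for p, r, t in zip(precision, recall, thresholds)
    ]
    with open(plots_path, "w") as fd:
        json.dump({"proc": points}, fd, indent=2)


# Usage with made-up numbers, written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    scores_path = os.path.join(tmp, "scores.json")
    plots_path = os.path.join(tmp, "plots.json")
    write_metrics(0.99, [0.5, 0.75, 1.0], [1.0, 0.5, 0.0], [0.2, 0.8],
                  scores_path, plots_path)
    with open(scores_path) as fd:
        scores = json.load(fd)
    with open(plots_path) as fd:
        plots = json.load(fd)

print(scores)
print(len(plots["proc"]))
```

The usage passes three precision/recall values but only two thresholds on purpose: scikit-learn's precision_recall_curve returns one more precision/recall value than thresholds, and zip() quietly takes care of that.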
Regarding the metrics and plots files, we have two options:
- Let DVC track the metrics/plots files or
- Track the metrics/plots files using Git
Since in this tutorial we haven’t seen how to use DVC to track files, we’ll go with the second option and track the files ourselves using Git. To accomplish that we’re going to use the --metrics-no-cache and --plots-no-cache options. This is the recipe for this stage:
- Write a Python script: evaluate.py
- Save the parameters: (we don’t need any)
- Specify the files the script depends on: evaluate.py, model.pkl and data/features
- Specify the files the script generates: (none)
- (NEW) Specify the metrics and plots files: scores.json and plots.json
Ok, now let’s create this last step:
(.env)$ dvc run -n evaluate -d src/evaluate.py -d model.pkl -d data/features --metrics-no-cache scores.json --plots-no-cache plots.json python3 src/evaluate.py model.pkl data/features scores.json plots.json
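Now that all four stages exist, it may help to see how a pipeline graph like the one dvc dag draws can be derived from nothing more than the dependencies and outputs we declared. The sketch below is a toy illustration, not DVC's code: the STAGES dictionary and the upstream_stages() function are my own, the metrics and plots files are treated as plain outputs of evaluate, and it needs Python 3.9+ for graphlib.

```python
from graphlib import TopologicalSorter

# deps and outs as passed to the four `dvc run` commands in this tutorial
# (scores.json and plots.json are treated as outputs of `evaluate`).
STAGES = {
    "prepare": {
        "deps": ["src/prepare.py"],
        "outs": ["data/prepared"],
    },
    "featurize": {
        "deps": ["src/featurize.py", "data/prepared"],
        "outs": ["data/features"],
    },
    "train": {
        "deps": ["src/train.py", "data/features"],
        "outs": ["model.pkl"],
    },
    "evaluate": {
        "deps": ["src/evaluate.py", "model.pkl", "data/features"],
        "outs": ["scores.json", "plots.json"],
    },
}


def upstream_stages(stages):
    """Map each stage to the set of stages that produce one of its dependencies."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }


graph = upstream_stages(STAGES)
# TopologicalSorter expects {node: predecessors}, which is what we built.
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['prepare', 'featurize', 'train', 'evaluate']
```

A stage is connected to another one whenever it depends on something the other stage generates, which is why being explicit about inputs and outputs is what makes the whole pipeline reproducible.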
🏋🏿 dvc metrics— Comparing metrics using DVC
The dvc metrics command lets us display and compare metrics. dvc metrics show prints the metric values and dvc metrics diff shows the difference between metric values when the metrics files change (before committing them with Git). Let’s see this in action.
The dvc metrics show command lets us see the current scores:
(.env)$ dvc metrics show
scores.json:
    auc: 0.9993366236676577
(☕️ hmm, that smells like overfitting… but that’s not our point today so let’s go on 😜)
Let’s check the parameters we’re currently using:
# file params.yaml
prepare:
  categories:
    - comp.graphics
    - rec.sport.baseball
train:
  alpha: 0.1
The dvc metrics diff command compares the metrics of a previous commit with the current state of the workspace, so let’s first commit this experiment:
(.env)$ git add src/ params.yaml dvc.yaml dvc.lock scores.json plots.json
(.env)$ git commit -m "exp: alpha=0.1"
Now let’s change the alpha parameter:
# file params.yaml
prepare:
  categories:
    - comp.graphics
    - rec.sport.baseball
train:
  alpha: 0.9
And then run everything again:
(.env)$ dvc repro
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Restored stage 'train' from run-cache
Skipping run, checking out outputs
Updating lock file 'dvc.lock'
Restored stage 'evaluate' from run-cache
Skipping run, checking out outputs
Updating lock file 'dvc.lock'
To track the changes with git, run:
    git add dvc.lock
We can see the params diff with dvc params diff:
(.env)$ dvc params diff
Path         Param        Old  New
params.yaml  train.alpha  0.1  0.9
And finally, see how the scores change with dvc metrics diff:
(.env)$ dvc metrics diff
Path         Metric  Value    Change
scores.json  auc     0.99869  -0.00064
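The Value and Change columns are easy to reason about: Value is the metric in the workspace and Change is that value minus the committed one. Here is a tiny sketch of that arithmetic. The metrics_diff() function and the numbers are hypothetical (I picked round values so the subtraction is exact), and this is only meant to illustrate the idea, not to reproduce what DVC does internally.

```python
def metrics_diff(old, new):
    """Return {metric: (new_value, change)} for metrics present in both dicts."""
    return {
        name: (new[name], new[name] - old[name])
        for name in new
        if name in old
    }


# Hypothetical contents of scores.json at the last commit and in the workspace.
committed = {"auc": 0.75}
workspace = {"auc": 0.5}

for name, (value, change) in metrics_diff(committed, workspace).items():
    # Show the sign explicitly so improvements and regressions stand out.
    print(f"{name}  {value:.5f}  {change:+.5f}")
```

A negative change, like the one we got after raising alpha to 0.9, means the new experiment scored lower than the committed one.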
📊 dvc plots — Visualize and compare metrics using DVC
The dvc plots command generates plots as HTML files that can be opened with a web browser. These HTML files use Vega-Lite. Let’s plot a precision-recall curve. The plots.json file looks like this:
# file plots.json
{
  "proc": [
    {
      "precision": 0.927570093457944,
      "recall": 1.0,
      "threshold": 0.4513363759032511
    },
    {
      "precision": 0.927400468384075,
      "recall": 0.9974811083123426,
      "threshold": 0.45201756623495926
    },
    # [...]
  ]
}
And we want a graph with precision on the y-axis and recall on the x-axis, so let’s build one:
(.env)$ dvc plots show -y precision -x recall plots.json
We can even plot the difference between the precision scores for alpha=0.1 and alpha=0.9:
(.env)$ dvc plots diff --targets plots.json -y precision
And that’s the end of our tour. You can check more plots options and configurations here: https://dvc.org/doc/command-reference/plots.
The final code for this tutorial is here
Final remarks
Using DVC to track experiments and manage Machine Learning pipelines can really take our projects to the next level. The key to making your ML projects reproducible is to create a single Python script for each step and specify the parameters, inputs and outputs used by each script. You can do that with a simple dvc run command and use dvc repro to run the pipeline as you like.
Besides experiment and pipeline management, DVC also provides version control, deployment and collaboration features. You can read more about them here and here.
That’s it for today, thanks for reading! 😁