Version Control ML Model

Tianchen Wu
Towards Data Science
3 min readAug 11, 2019

--

dvc workflow from https://dvc.org

Machine Learning operations (let’s call it mlOps under the current buzzword pattern xxOps) are quite different from traditional software development operations (devOps). One of the reasons is that ML experiments demand large dataset and model artifact besides code (small plain file).

This post presents a solution to version control machine learning models with git and dvc (Data Version Control).

And following features will come along with this solution:

  1. ML models are persisted with scalability, security, availability, virtually unlimited storage space
  2. ML model are under version control i.e. any specific version of it can be conveniently tagged and accessed
  3. Auditability, transparency, reproducibility of ML model can be ensured by version controlling dataset, analytical code alongside it
  4. New experiemts can be initialized quickly and collaboratively based on the existing model setting

Git and DVC

The solution includes two layers of version control:

  1. git: handles code and metadata (of dataset and model artifact)
  2. dvc: handles large dataset and model artifact

First we have the project folder prepared and the tools installed.

# Download code
git clone https://github.com/iterative/example-versioning.git
cd example-versioning
# install and initialize git
# install and initialize dvc (https://dvc.org/doc/get-started/install)

then connect dvc to backend storage, where dataset and model artifact will be actually stored (in this example AWS S3).

dvc remote add -d s3remote s3://my-test-bucket-dvc/myproject 

Now in the example-versioning folder, where our ML experiments will happen, should contain two subfolders for metadata.

.dvc/
.git/

Workflow

Next step, we train a model using data and script from dvc.org

cd example-versioning# Install dependencies
pip install -r requirements.txt
# Download data
cd example-versioning
wget https://dvc.org/s3/examples/versioning/data.zip
unzip data.zip
rm -f data.zip
# Build the ML model
python train.py

After getting the model (model.h5), put it under version control using dvc + git workflow.

step 1: add model metadata to dvc

dvc add model.h5

output:

It can be observed that:

  • the “real” model is stored under .dvc/cache/40
  • model metadata model.h5.dvc records where it is

step 2: persist model by pushing it to backend storage

dvc push model.h5.dvc

in s3, we can check the model is stored exactly under the instruction of model metadata

step 3: persist model metadata with git

It is the model metadata that can lead us to the real model object which is stored in backend storage. To prevent from losing the metadata, it should be added to version control using git.

git add .gitignore model.h5.dvc data.dvc metrics.json
git commit -m "model first version, 1000 images"
git tag -a "v1.0" -m "model v1.0, 1000 images"

“git tag” can be used here to record version of the model.

step 4: access the model anytime

It is easy to fetch a specific version of the model by searching for the tag on git branch. From git we can check out the model metadata.

git checkout tags/<tag_name> -b <branch_name>

Following the metadata we can find the model object and download it into current workspace with command

dvc pull model.h5.dvc

Conclusion

In a similar way, the problem of version controlling large dataset for machine learning experiment can also be solved. Other comparable tools to solve ML pipeline challenges are for example mlflow, datanami and sacred.

--

--