How to Track Machine Learning Experiments using DagsHub

A tutorial on using DagsHub to enhance the machine learning model training pipeline with experiment tracking

Bijil Subhash
Towards Data Science


Source: Unsplash (Scott Graham)

Table of Contents

  1. Motivation
  2. How do we Track Machine Learning Experiments?
  3. Tutorial
  4. Final Thoughts

1. Motivation

All machine learning problems are unique. That’s why when it comes to building machine learning models, there is no one-size-fits-all solution.

Building a machine learning model is a time-consuming process, often requiring hundreds of iterations (experiments). Each of these experiments involves trying out different data preprocessing steps, model architectures, and hyperparameters, in the hope of finding a configuration that satisfies the desired outcome.

If you live in a world where time and money spent on computation are not a concern, then this is hardly an issue worth addressing. For most of us that is not the case, which makes it necessary to find ways to optimize the model development pipeline.

An obvious solution, one that is in practice at many data-driven organizations, is to put systems in place to track the experiments. By tracking different configurations and their outcomes, we can:

  • Design better experiments, driven by the insights from previous experiments
  • Improve reproducibility by keeping track of the configurations that performed well

In the case of online learning, we could also use the experiment tracking pipeline to catch the early onset of model degradation from concept drift, allowing us to keep the model generalizing well to live data.

The end benefit of a system that can track machine learning experiments is improved productivity.

2. How do we Track Machine Learning Experiments?

An easy solution would be to put together a spreadsheet and use it to track the experiments. There are numerous reasons why this is inefficient, most of them boiling down to the fact that manually tracking hundreds of parameters is cumbersome, error-prone, and frustrating.

An alternative is to use GitHub, as it is, after all, designed for version control. But GitHub is not friendly for storing large amounts of data. A workaround would be to use Git Large File Storage (LFS), but even that has proven to be a poor fit for managing big data. We could use Data Version Control (DVC), a Python library that stores data and model files outside of Git while preserving almost the same user experience as if they were stored in Git. This, however, requires setting up cloud storage services, which calls for considerable DevOps know-how.

What if we could have it all, and more, on one platform, powered by a Git-like command-line experience?

Enter DagsHub.

DagsHub is a web platform based on open source tools, optimized for data science and oriented towards the open source community. - dagshub.com/about

In other words, DagsHub has a few tools up its sleeve that can make your life easier if you are a data scientist or machine learning practitioner. Going into the weeds of every DagsHub feature is outside the scope of this article; I am more interested in demonstrating how we can use DagsHub to log and visualize experiments. More specifically, this blog is meant to serve as a practical tutorial, complementing the DagsHub documentation, focusing on how to optimize the machine learning model building pipeline with experiment tracking.

3. Tutorial

In this tutorial, we will use the DagsHub logger for tracking experiments. The workflow here assumes that you have a basic understanding of Git and sklearn. We will start off with some basic requirements in sections 3.1 and 3.2, followed by a demonstration of how to use DagsHub for experiment tracking in section 3.3.

3.1 File Management System

You do not need a specific file management system to use DagsHub; I am only imposing one here to keep the article coherent and easy to follow. With that out of the way, let's get into the structure of the file management system we will be using for the rest of the article.

│   model.ipynb
│
├───data
│
└───models

In the working directory, we have model.ipynb, which holds the function (more on this later) for generating different models. In addition to that, we also have two folders, named data and models, inside our working directory. Within data, we will have the dataset we will be working with, and the models folder is a placeholder for the pickle files of the various models we will create during each experiment run.
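If you want to recreate this layout from scratch, one way to do it is shown below (a minimal sketch; any equivalent way of creating the two folders works just as well).

import os

#create the empty data and models folders used in this tutorial
os.makedirs('data', exist_ok=True)
os.makedirs('models', exist_ok=True)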

3.2 Experiment Pipeline

In this section, we are going to build a function that can churn out different models for our experiments. The dataset we are going to use is the Iris dataset, in which each entry is classified into one of three classes of iris plant based on its physical characteristics. Shown below is the function I am using for our experiment pipeline, broken down piece by piece.

The code is divided into three sections.

Dependencies imports the Python packages that are needed for running the function.

import pandas as pd
import dagshub
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score
import pickle

The function to log the model parameters has a few moving parts.

It starts with reading the data using pandas, followed by splitting the dataset into features and labels.

#reading the data
df = pd.read_csv('data/iris.csv')
#splitting into features and labels
X = df[['SepalLengthCm','SepalWidthCm','PetalLengthCm']]
y = df['Species']

The dataset is then split into train and test data, where 30% of the total data is used for testing, using sklearn’s train_test_split.

#test train split    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)

Up next, we introduce the DagsHub logger for recording the model hyperparameters and metrics. The model's fit function is called on the training set, and the trained model is saved into the models folder using the Python pickle module.

#model definition
model = model_type(random_state=42)
#log the model parameters
logger.log_hyperparams(model_class=type(model).__name__)
logger.log_hyperparams({'model': model.get_params()})
#training the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
#log the model's performance
logger.log_metrics({'accuracy': round(accuracy_score(y_test, y_pred), 3)})
#saving the model
file_name = model_type.__name__ + '_model.sav'
pickle.dump(model, open('models/' + file_name, 'wb'))
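Putting these pieces together, the full model_run function looks roughly like the sketch below. It assumes the dagshub_logger context manager from the dagshub package, which writes the logged hyperparameters and metrics to params.yml and metrics.csv in the working directory by default.

def model_run(model_type):
    #reading the data
    df = pd.read_csv('data/iris.csv')
    #splitting into features and labels
    X = df[['SepalLengthCm','SepalWidthCm','PetalLengthCm']]
    y = df['Species']
    #test train split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)
    #open the DagsHub logger so hyperparameters and metrics are written to params.yml and metrics.csv
    with dagshub.dagshub_logger() as logger:
        #model definition
        model = model_type(random_state=42)
        #log the model parameters
        logger.log_hyperparams(model_class=type(model).__name__)
        logger.log_hyperparams({'model': model.get_params()})
        #training the model
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        #log the model's performance
        logger.log_metrics({'accuracy': round(accuracy_score(y_test, y_pred), 3)})
        #saving the model
        file_name = model_type.__name__ + '_model.sav'
        pickle.dump(model, open('models/' + file_name, 'wb'))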

We can call the function using the following command.

#running an experiment
model_run(RandomForestClassifier)

If everything has gone as it should, we should have metrics.csv and params.yml in the working directory and a .sav file in the models folder. Below is the file structure to verify.

│   model.ipynb
│   metrics.csv
│   params.yml
│
├───data
│       iris.csv
│
└───models
        RandomForestClassifier_model.sav

3.3 Experiment Tracking using DagsHub Logger

Before we start pushing our files to DagsHub, we need to establish a baseline understanding of how files are tracked on DagsHub. DagsHub uses Git and Data Version Control (DVC) for tracking files. DVC allows us to store large data and model files in DVC remote storage. The data and models stored in DVC remote storage are tracked using lightweight .dvc files that are stored in Git. In other words, a .dvc file holds the metadata that corresponds to data and models stored outside of Git. By using DagsHub, we can store both the .dvc files and the associated data and models in one place, using DagsHub as the DVC remote storage.

DagsHub operates quite similarly to GitHub, in the sense that you can manage a DagsHub repo from the command line. Before we start pushing our files to DagsHub, we need to set up a remote repository and initialize both Git and DVC in our working folder.

To set up a DagsHub repository, click on ‘Create’, located at the top right of the DagsHub dashboard.

Image obtained from https://dagshub.com/dashboard

Follow the instructions on the page to create the repository. You can leave most fields empty except ‘Repository Name’.

Image obtained from https://dagshub.com/repo/create

Once the remote repository is set up, we can use the following commands to initialize Git in the working directory.

git init
git remote add origin https://dagshub.com/bijilsubhash/dagshub-tutorial.git

Assuming you have installed DVC using pip, we can proceed to initialize DVC, with a few additional steps to configure DagsHub as the DVC remote storage.

pip install dvc
dvc init
dvc remote add origin https://dagshub.com/bijilsubhash/dagshub-tutorial.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user bijilsubhash
dvc remote modify origin --local password your_token

The steps above will create DVC configuration files, which, for the purpose of experiment tracking, must be tracked using Git.

git add .dvc .dvcignore
git commit -m 'dvc initialized'
git push -u origin master

Up next, we can push the files associated with our most recent experiment to Git.

git add metrics.csv params.yml
git commit -m 'RandomForestClassifier'
git push -u origin master

We can then proceed to add the model and data files to the DVC remote storage on DagsHub.

dvc add 'data/iris.csv' 'models/RandomForestClassifier_model.sav'
dvc push -r origin

After adding the files to DVC, you will notice a few .dvc files and .gitignore files in the data and models folders. You have to push those to Git as well. The .dvc files that we are pushing to Git serve as the metadata that maps to the corresponding data and models stored in the DVC remote storage.
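As a concrete example, a .dvc file contains little more than a content hash and a relative path that point DVC to the actual file in remote storage. The generated data/iris.csv.dvc would look something like the following (the hash and size values here are illustrative, not the real ones).

outs:
- md5: 3b6d1f0e8c9a4b2d7e5f1a0c9d8b7e6f
  size: 5107
  path: iris.csv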

git add 'data/iris.csv.dvc' 'models/RandomForestClassifier_model.sav.dvc' 'models/.gitignore' 'data/.gitignore'
git commit -m '.dvc and .gitignore'
git push -u origin master

If you go to DagsHub, you will notice that all the files have been updated and are available in the repository.

Image obtained from https://dagshub.com/bijilsubhash/dagshub-tutorial

At the same time, you will notice an ‘Experiments’ tab at the top left, which will have our first experiment, using the random forest classifier, as its first entry. Each row corresponds to an experiment, and each column is typically a hyperparameter, metric, or piece of metadata for that experiment.

Image obtained from https://dagshub.com/bijilsubhash/dagshub-tutorial/experiments

There is no limit on how many experiments you can run. For the sake of brevity, I am going to run just two more models (logistic regression and support vector machine), as shown below.
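Since the model_run function takes the model class as an argument, the two additional experiments are just two more function calls, using the LogisticRegression and svm imports from the dependencies section.

#running two more experiments
model_run(LogisticRegression)
model_run(svm.SVC)

As before, each run is followed by committing metrics.csv, params.yml, and the new .dvc files to Git and pushing the new .sav file to the DVC remote, so that every experiment shows up as its own entry.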

Here is the file structure in our working directory after running the two additional models and pushing them to DagsHub.

│   model.ipynb
│   .dvcignore
│   metrics.csv
│   params.yml
│
├───data
│       iris.csv
│       .gitignore
│       iris.csv.dvc
│
└───models
        RandomForestClassifier_model.sav
        .gitignore
        RandomForestClassifier_model.sav.dvc
        LogisticRegression_model.sav
        LogisticRegression_model.sav.dvc
        SVC_model.sav
        SVC_model.sav.dvc

As you can see, we have two more .sav files, along with their associated .dvc files, for the logistic regression and support vector machine models.

In the repository, under the ‘Experiments’ tab, we can also see two more entries from the most recent experiments.

Image obtained from https://dagshub.com/bijilsubhash/dagshub-tutorial/experiments

Within the ‘Experiments’ tab, you can start investigating (visualize, compare, label, etc.) each experiment in more detail, but I will leave that for you to explore on your own terms. Here is a link to the detailed documentation outlining the various features you can find in the ‘Experiments’ tab.

Lastly, here is the link to DagsHub repository that was used in this tutorial for your reference - https://dagshub.com/bijilsubhash/dagshub-tutorial

4. Final Thoughts

In this article, you were introduced to the idea of using DagsHub for experiment tracking with Git and DVC. Though we were limited to just three experiments with a single data preprocessing pipeline, you can easily scale this workflow to hundreds of experiments with multiple versions of data, all seamlessly managed using DagsHub, making it an ideal tool for collaboration and improved productivity when building machine learning models. You may have noticed that we had to push to Git and DVC separately, which can feel repetitive and error-prone. To circumvent this issue, we could use FastDS, but that is a topic for another day.
