Getting Started

How I Learned to Stop Worrying and Track my Machine Learning Experiments

Keep your machine learning projects under control

Felipe de Pontes Adachi
Towards Data Science
Nov 17, 2020 · 12 min read


Photo by Annie Spratt on Unsplash

To track and reproduce

One thing I’ve realized from personal experience is that tracking machine learning experiments is important. That realization was eventually followed by another: tracking machine learning experiments is hard.

Consider these situations:

  • You are tuning your model. During the process, you find an error in your training pipeline. Or maybe you get hold of a bigger, improved input dataset. You can’t compare apples to oranges, and so you have to repeat all your previous experiments. Repeat that 3 or 4 times, and you can see how easy it is to end up with a bunch of results without being able to remember the exact combination of parameters that generated each of them.
  • Some months later, someone (or you) wants to pick up where you left off and try to reproduce the previous results. Of course, you don’t remember what the last version of the input data or of the training code was. You also can’t remember the exact training environment you used. Ah, well, better to start from scratch.
  • Your model is finally deployed. Maybe you passed it to another team, and eventually, a few months later, they begin to notice changes in the model’s prediction behavior. They ask for your help, but you don’t remember your training environment, and therefore can’t tell if it matches the production environment. You also didn’t store your feature transformation steps, and can’t compare to the current version. The actual problem might be none of that, but not being able to rule it out certainly doesn’t make the problem any easier to solve.

In this article, I want to share my take on tracking machine learning experiments. Everything discussed here can be found in the project’s GitHub repository.

Nowadays, you have various platforms at your disposal to track your experiments; here is an article comparing some of them. What follows is just one way among several to do it, so consider it a beginner’s attempt at doing things in an organized fashion, and please share your thoughts if you can think of any improvements!

The task at hand

First, we need an experiment to track. For this purpose, I decided to train a fake news classifier, using a Kaggle dataset that you can find here. It contains CSV files for both fake and real news articles, with information such as the title, text, subject, and date of each article. All of the articles were posted between 2015 and 2018.

This is, therefore, a text classification task.

What to keep track of

Let’s first define what is important to keep track of. We should ask ourselves: which pieces of information do I need to have available so that, in the future, I’ll have everything required to debug and/or reproduce this text classification model?

The answer I gave to myself was this:

  • Code: This involves the whole training pipeline related to your experiment: fetching the data, preprocessing features, splitting the data, tuning your model, and so on. You should also know the exact environment you used when you trained your model: operating system, Python version, and all the packages installed, including the version of each package.
  • Data: The dataset used to train and test your model. If not already split into the relevant groups (train-test or train-eval-test, maybe), then your code should have the info on how you split it, and it should be deterministic (fix your seeds!).
  • Model: Any file that defines your final model: that could be a pickle or joblib file, or maybe weights and architecture files for neural networks. You can also save intermediary models, such as when doing a grid search, but the final model that you’ll eventually use in production is the critical one.
  • Metrics: Every kind of metric you think is relevant for your task: accuracy, f-score, confusion matrices, and classification reports in general. This can also include plots, graphs, CPU/GPU/network usage, and so on. You should keep track of these for every model you train, during every validation and test step.
  • Metadata: Any additional metadata, such as date, version identifiers, tags, and whatnot.

Overview

Image by author

This is a very simple diagram, just to show the main components I used in this project. I store my artifacts — dataset and models — in an S3 bucket. I used W&B as a central dashboard to log and visualize my experiments. In there, we’ll have our plots, metrics, and everything we mentioned previously. Since I want to have everything centralized at W&B, I also logged the artifacts in there, but as references to objects located at my bucket. During my experiments, when I need to fetch data, I get the references I need from W&B.

The experiments are run on my local machine, but I could have used Google Colab or any other cloud environment.

Uploading the dataset to S3

The first thing we should do is get our dataset ready. Initially, we have two files from Kaggle: one for real news (True.csv) and another for fake news (Fake.csv). The next step would be an exploratory data analysis, but I will skip it in this article. Not because it is not important (it is!), but because it has already been thoroughly done in this Kaggle notebook.

From there, we should arrange the data into the desired structure and split it into train and test sets. The following piece of code does exactly that. A category label is created to be used as our target, and the input is the article’s text, which is concatenated with the title, forming the text feature.
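
A minimal sketch of that step follows; the file paths, the 80/20 split ratio, the label encoding, and the seed are assumptions on my part, not necessarily the original implementation.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load both CSVs and label them: here real news arbitrarily gets 0 and fake news gets 1
real = pd.read_csv("data/raw/True.csv")
fake = pd.read_csv("data/raw/Fake.csv")
real["category"] = 0
fake["category"] = 1

df = pd.concat([real, fake], ignore_index=True)

# Concatenate the title and the article body into a single text feature
df["text"] = df["title"] + " " + df["text"]

# Deterministic split: fix the seed so the split can be reproduced later
train, test = train_test_split(
    df[["text", "category"]], test_size=0.2, random_state=42, stratify=df["category"]
)

train.to_csv("data/processed/train.csv", index=False)
test.to_csv("data/processed/test.csv", index=False)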

At this point, we have our train/test splits in a local folder. Any subsequent run, with possible variations, would overwrite the existing files, so we need to upload them to S3. To do that, I used Boto3, AWS’s SDK for Python. Aside from following the Quickstart section of the docs, I also created the bucket in AWS’s console. Then I created an IAM user and gave it permission to access the bucket. You can find more about that in AWS’s documentation here. When creating your bucket, make sure to enable bucket versioning!

The following piece of code uploads the train and test files to the bucket. Additionally, it stores each file’s version ID and access URL, which will be used in the next step, when logging the artifacts at W&B.
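
A minimal sketch of that upload, assuming a hypothetical bucket name and the local paths from the previous step:

import boto3

BUCKET = "fn-experiments-data"  # hypothetical bucket name
s3 = boto3.client("s3")

uploaded = {}
for filename in ["train.csv", "test.csv"]:
    key = f"dataset/{filename}"
    # put_object returns the object's VersionId when bucket versioning is enabled
    with open(f"data/processed/{filename}", "rb") as f:
        response = s3.put_object(Bucket=BUCKET, Key=key, Body=f)
    uploaded[filename] = {
        "version_id": response["VersionId"],
        "url": f"https://{BUCKET}.s3.amazonaws.com/{key}",
    }

# The version IDs and URLs are kept to be logged as artifact metadata at W&B
print(uploaded)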

In our bucket, we can check that the files were uploaded and are properly versioned:

Screenshot by author

Note: Since I’m working with publicly available data and this is a personal project, data privacy issues are not being considered. As you might have noticed, to make the artifacts downloadable by W&B, the artifacts from my S3 bucket were set to public read.

If privacy is an issue for you, W&B also tracks references to private buckets, as stated in the official documentation. It might be a good idea to get in contact with them to get more info.

Logging artifacts at W&B

Now we’re set to log those objects as artifacts at W&B. If you don’t have an account yet, the wandb.login() command will prompt you to create one. Then, a run is initiated at your project of choice, which will be created for you if it doesn’t exist already.
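
A sketch of how that logging might look; the bucket name and metadata keys are carried over from the upload sketch and are hypothetical, while the artifact name, job type, and tag follow what is described next.

import wandb

wandb.login()

# job_type distinguishes this run as a dataset producer
run = wandb.init(project="fn_experiments", job_type="dataset_producer")

artifact = wandb.Artifact(
    "train-test-dataset",
    type="dataset",
    metadata={
        "train_version_id": "<s3-version-id>",  # taken from the upload step
        "test_version_id": "<s3-version-id>",
        "train_url": "https://fn-experiments-data.s3.amazonaws.com/dataset/train.csv",
        "test_url": "https://fn-experiments-data.s3.amazonaws.com/dataset/test.csv",
    },
)

# Log references to the S3 objects instead of uploading the files to W&B itself
artifact.add_reference("s3://fn-experiments-data/dataset/train.csv")
artifact.add_reference("s3://fn-experiments-data/dataset/test.csv")

run.log_artifact(artifact, aliases=["unprocessed"])
run.finish()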

You can see above that we can define a job type, which is useful to distinguish each job according to its function. In this case, it’s a dataset producer. The unprocessed tag is added to the dataset because I decided to preprocess the text at a later stage, during the grid search, as I wanted to assess its impact on performance. Additionally, W&B allows inserting metadata, in case you want to store any extra information about the artifact.

At our project’s dashboard, we can check the artifact and its versions:

Screenshot by author

W&B performs a checksum and automatically creates new versions when needed. The latest tag is also applied automatically as new versions come in, so you can reference an artifact either by a specific version or by its tags.
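
For example, a later run could pin an exact version of the dataset artifact or just take whatever is tagged latest (a hypothetical usage sketch):

import wandb

run = wandb.init(project="fn_experiments", job_type="grid_search")

# Pin a specific version of the dataset artifact...
dataset = run.use_artifact("train-test-dataset:v0")
# ...or always fetch whatever is currently tagged as latest
dataset = run.use_artifact("train-test-dataset:latest")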

The metadata is also nicely displayed on your dashboard. I added the S3 version IDs to make sure the artifacts stay appropriately synced between S3 and W&B.

Screenshot by author

In the Graph view tab, there is a graph denoting the relationship between artifacts and runs. In this example, you can see that one run of the dataset_producer type outputs an artifact of the dataset type, which in turn is used as input to several jobs of the types grid_search and final_model_trainer. You can also explode this view so you can see the names of each run.

Image by author

Parameter Tuning

Alright, now we have our dataset under control. Time to use it to start training our models. Since performance is not my main concern here, I’ll leave the fancy transformer models aside for the moment and experiment with some classifiers available in sklearn.

In this experiment, I chose two parameters to assess: the type of model and the text preprocessing step. The text preprocessing can be done in several different ways, so I figured it would be nice to include it as a varying parameter. Here, though, I just tested with and without it, for demonstration purposes. I didn’t include the preprocessing code in this article, but you can check it in the project’s repository under src/features/denoise; it was mainly copied from here.

The models I’m testing are:

  • sgd: Regularized linear model with stochastic gradient descent learning
  • nb: Multinomial Naive-Bayes
  • svc: Support vector classification
  • rf: Random forest

The dataset is fairly balanced, so I adopted accuracy as the performance score to log to W&B and compare the different combinations. I held out the test set to be used only for the final model, so for the tuning stage I did k-fold (k=5) cross-validation using only the train split. The logged accuracy is the average over the k iterations.

The train_and_log function takes as parameters the data, type of model, and whether to apply the denoising function or not. Every time the function is called, it logs the resulting score to the dashboard.
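
A rough sketch of what train_and_log could look like; the TF-IDF vectorizer and the default model hyperparameters are my assumptions, and denoise_text stands in for the preprocessing under src/features/denoise.

import numpy as np
import wandb
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

from src.features.denoise import denoise_text  # hypothetical import path

MODELS = {
    "sgd": SGDClassifier(),
    "nb": MultinomialNB(),
    "svc": SVC(),
    "rf": RandomForestClassifier(),
}

def train_and_log(X, y, model_type, denoise):
    # Optionally apply the text preprocessing step
    if denoise:
        X = [denoise_text(text) for text in X]

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", MODELS[model_type]),
    ])

    # 5-fold cross-validation on the train split only; log the mean accuracy
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    wandb.log({"accuracy_score": np.mean(scores)})  # assumes wandb.init() was called by the caller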

The function is supposed to be called by the script below, which is run from the command line like this:

python -m src.models.grid_search --model svc --denoise True

grid_search.py will then fetch the train-test-dataset artifact to access its metadata, get the required URL from the bucket, and proceed to call the training function.
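
A sketch of what grid_search.py might contain; the metadata key names and module paths are assumptions.

import argparse

import pandas as pd
import wandb

from src.models.train import train_and_log  # hypothetical module path

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="svc")
# Parse "True"/"False" strings explicitly (argparse's type=bool is a known pitfall)
parser.add_argument("--denoise", type=lambda s: s.lower() == "true", default=False)
args = parser.parse_args()

run = wandb.init(project="fn_experiments", job_type="grid_search", config=vars(args))

# Fetch the dataset artifact; its metadata carries the S3 URL and version ID
artifact = run.use_artifact("train-test-dataset:latest")
train_df = pd.read_csv(artifact.metadata["train_url"])

train_and_log(
    train_df["text"], train_df["category"],
    model_type=args.model, denoise=args.denoise,
)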

Well, calling grid_search directly will only run one combination. To actually run the grid search, we use W&B Sweeps. We define the parameters of the sweep with a .yaml configuration file, like this:
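
The stripped config would look roughly like this, reconstructed from the description that follows, so treat the exact keys and values as an approximation:

program: src/models/grid_search.py
method: grid
project: fn_experiments
metric:
  name: accuracy_score
  goal: maximize
parameters:
  model:
    values: ["sgd", "nb", "svc", "rf"]
  denoise:
    values: [true, false]
command:
  - ${env}
  - python
  - -m
  - src.models.grid_search
  - ${args}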

This tells us that we’ll be making a sweep using the grid search strategy by calling grid_search, varying the model and denoise parameters as defined above. The goal is to maximize accuracy_score, and the sweep should be linked to the fn_experiments project.

Now we can run the following command:

wandb sweep src/models/sweep.yaml

This will launch the sweep server at W&B and give you a sweep_id, which you’ll use to run an agent on your local machine. The agent will interact with the server, asking for the next set of parameters to try. To run the agent, in my case, the command was:

wandb agent felipeadachi/fn_experiments/<sweep_id>

That will start the sweep, and by the end of it, you should have some nice plots displayed on your dashboard, like this one:

Image by author

This is a parallel coordinates plot, which maps parameter values to the metric of interest. From it, we can clearly see the best model types, and how much of an impact our preprocessing step has on the outcome. Apparently, the degree of influence of the preprocessing varies with the model type.

Some other examples of plots:

Screenshot by author

In this case, since we didn’t use that many combinations, the correlations are pretty obvious, but for more complex setups the parameter importance plot above can be very helpful.

Now that we have found that svc yielded the best results for our case, we could perform further experiments to tune the parameters of the svc model, like the kernel type and the regularization strength. Since the accuracy is already pretty good, I chose to train the final model as is.

Those are just some examples of plots available at W&B. You can plot learning curves, calibration curves, and a lot more.

Train Final Models

Once the model and parameters are defined, we can train the final model. This time, we’ll use the whole train set to train it. It’s not that different from what we did previously. In this particular case, if you think you might need the prediction probabilities later, just remember to set probability=True when training the model; it might come in handy for monitoring purposes.

For the final model, I plot the confusion matrix and the classification reports given by sklearn. There are other options, but these are enough for me.

Then I fetch the data as before and run the final training. I can get the name of the current run with wandb.run.name, which I will use to name the generated model before saving it as a joblib file.
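
A rough sketch of that final training step; the TF-IDF vectorizer, the class name ordering, the output paths, and the metadata keys are assumptions carried over from the earlier sketches.

import os

import joblib
import pandas as pd
import wandb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

run = wandb.init(project="fn_experiments", job_type="final_model_trainer")

# Fetch the train/test splits through the dataset artifact, as in the grid search step
artifact = run.use_artifact("train-test-dataset:latest")
train_df = pd.read_csv(artifact.metadata["train_url"])
test_df = pd.read_csv(artifact.metadata["test_url"])

# probability=True so prediction probabilities are available later for monitoring
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC(probability=True)),
])
model.fit(train_df["text"], train_df["category"])

# Evaluate on the held-out test set
preds = model.predict(test_df["text"])
print(classification_report(test_df["category"], preds))  # persisted in the run's logs
wandb.log({"conf_mat": wandb.plot.confusion_matrix(
    y_true=test_df["category"].tolist(),
    preds=preds.tolist(),
    class_names=["real", "fake"],  # assumes real=0, fake=1 as in the split sketch
)})

# Name the model file after the current run and save it with joblib
os.makedirs("models", exist_ok=True)
model_path = f"models/{wandb.run.name}.joblib"
joblib.dump(model, model_path)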

As with the dataset, I can upload the model to my bucket and reference it at W&B:
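
Continuing the sketch above, the upload and the reference logging could look like this (the bucket and artifact names are hypothetical):

import boto3
import wandb

BUCKET = "fn-experiments-data"  # hypothetical bucket name, as before
run = wandb.run                 # the final-training run from the sketch above
key = f"models/{run.name}.joblib"

s3 = boto3.client("s3")
with open(f"models/{run.name}.joblib", "rb") as f:
    response = s3.put_object(Bucket=BUCKET, Key=key, Body=f)

model_artifact = wandb.Artifact(
    "fake-news-classifier",  # hypothetical artifact name
    type="model",
    metadata={"s3_version_id": response["VersionId"]},
)
model_artifact.add_reference(f"s3://{BUCKET}/{key}")
run.log_artifact(model_artifact)
run.finish()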

Let’s see the results on our dashboard. Here’s an example of the model versioning:

Screenshot by author

And the confusion matrix according to the category label:

Image by author

Another important feature of W&B is that it stores your run’s logs. So I just printed the classification reports during the run, and they persist on my dashboard:

Screenshot by author

We have some additional information about the run, such as the date, OS, and Python version. It also creates a branch in your project’s git, letting you see the exact state of your code when the run was executed. That was all part of our requirements, so that’s nice.

Screenshot by author

In the files section, we have access to requirements.txt, which gives us more information about the environment. If you want to capture only the packages that are actually needed for your experiment, it is recommended to create a dedicated environment for it.

Screenshot by author

So, I think that’s it. There’s much more information on the dashboard, but I think that covers all of our requirements: we have data, code, and models appropriately traced. If I forgot about something, please let me know!

Next Steps

In this article, I wanted to talk about tracking your ML experiments. This is a very important stage of your ML project’s life cycle. You carefully tracked the steps of your project during its infancy, but now it’s time for you to let it reach adulthood and go into production.

But just deploying it is not enough. You have to monitor it constantly after it is deployed, and much of what was done in this article is precisely to make that monitoring process easier.

Technical debt is an important matter in machine learning systems. Tracking your experiments will certainly help to reduce this debt, but there are also other aspects involved. This is brilliantly discussed in this paper by Google: Hidden Technical Debt in Machine Learning Systems.

That’s it for now! I intend to keep going with this project to learn more about deploying and monitoring, and when I do, I’ll make sure to share my attempts!
