
What is this about?
As I’m writing this, the model library on Huggingface consists of 11,256 models, and by the time you’re reading this, that number will only have grown. With so many models to choose from, it is no wonder that many people get overwhelmed and no longer know which model to pick for their NLP tasks.
It’d be great if there were a convenient way to try out different models for the same task and compare them against each other on a variety of metrics. Sagemaker Experiments does exactly that: it lets you organize, track, compare, and evaluate NLP models very easily. In this article we will pit two NLP models against each other and compare their performance.
All the code is available in this Github repository.
Data Preparation
The data preparation for this article can be found in this Python script. We will use the IMDB dataset from Huggingface, which is a dataset for binary sentiment classification. The data preparation is pretty standard; the only thing to note is that we need to tokenize the data separately for each model. We then store the data in S3 folders, one per model.
The models we are comparing in this article will be distilbert-base-uncased and distilroberta-base. Obviously, Sagemaker Experiments is not limited to two models and actually allows you to track and compare several NLP models at once.
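As a rough sketch (not necessarily identical to the script in the repository), the preparation could look like the following; the bucket name and local paths are placeholders:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from sagemaker.s3 import S3Uploader

# hypothetical S3 location; replace with your own bucket/prefix
s3_prefix = "s3://<your-bucket>/imdb"

dataset = load_dataset("imdb")

for model_name in ["distilbert-base-uncased", "distilroberta-base"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(batch):
        # each model brings its own tokenizer, so we tokenize the data once per model
        return tokenizer(batch["text"], padding="max_length", truncation=True)

    tokenized = dataset.map(tokenize, batched=True)

    # save locally, then upload into one S3 folder per model
    tokenized["train"].save_to_disk(f"data/{model_name}/train")
    tokenized["test"].save_to_disk(f"data/{model_name}/test")
    S3Uploader.upload(f"data/{model_name}", f"{s3_prefix}/{model_name}")
```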
Metric definitions
First, it is important to understand how Sagemaker Experiments captures the metrics that we will then use to compare the models. The values for these metrics are collected from the logs that are produced during model training. This usually means that the training script has to write out these metrics explicitly.
In our example we will use Huggingface’s Trainer object, which takes care of writing the metrics into the log for us. All we have to do is define the metrics in the training script. The Trainer object will then automatically write them out into the training log (note that the loss metric is written out by default and that all metrics have the prefix "eval_"):

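The full training script lives in the repository; a minimal sketch of such a metric function (here using scikit-learn, which is an assumption on my part) could look like this:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = predictions.argmax(axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }

# passed to the Trainer, which logs them as eval_accuracy, eval_f1, ... during evaluation:
# trainer = Trainer(..., compute_metrics=compute_metrics)
```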
That means we can capture these metrics during the training job via regular expressions, which we can define as follows:
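The exact regular expressions depend on how the metrics appear in the training log; assuming the Trainer’s default dictionary-style log lines, the definitions could look roughly like this:

```python
# one entry per metric we want Sagemaker to capture from the training log
metric_definitions = [
    {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9]+(\\.|e\\-)[0-9]+),?"},
    {"Name": "eval_accuracy", "Regex": "'eval_accuracy': ([0-9]+(\\.|e\\-)[0-9]+),?"},
    {"Name": "eval_f1", "Regex": "'eval_f1': ([0-9]+(\\.|e\\-)[0-9]+),?"},
    {"Name": "eval_precision", "Regex": "'eval_precision': ([0-9]+(\\.|e\\-)[0-9]+),?"},
    {"Name": "eval_recall", "Regex": "'eval_recall': ([0-9]+(\\.|e\\-)[0-9]+),?"},
]
```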
We will pass these metric definitions to the estimator that we create further down, which will allow us to compare the different NLP models.
Running a Sagemaker Experiment
To organize and track the models we need to create a Sagemaker Experiment object:
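A sketch using the sagemaker-experiments package; the experiment name and description are placeholders:

```python
import time
import boto3
from smexperiments.experiment import Experiment

sm_client = boto3.client("sagemaker")

nlp_experiment = Experiment.create(
    experiment_name=f"nlp-model-comparison-{int(time.time())}",
    description="Comparing NLP models on the IMDB dataset",
    sagemaker_boto_client=sm_client,
)
```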
Once that is done, we can kick off the training. We use ml.p3.2xlarge instances for the Sagemaker Training jobs, which complete the fine-tuning in about 30 minutes. Note that we create a Trial object for each training job. These trials get associated with the experiment we created above, which is what allows us to track and compare the models:
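A sketch of what this could look like with the Hugging Face estimator, reusing the names from the sketches above (nlp_experiment, sm_client, s3_prefix, metric_definitions); the entry point, source directory, framework versions, and hyperparameters are assumptions and would need to match your own setup:

```python
import time
import sagemaker
from sagemaker.huggingface import HuggingFace
from smexperiments.trial import Trial

role = sagemaker.get_execution_role()

for model_name in ["distilbert-base-uncased", "distilroberta-base"]:
    # one trial per training job, attached to the experiment created above
    trial = Trial.create(
        trial_name=f"{model_name}-{int(time.time())}",
        experiment_name=nlp_experiment.experiment_name,
        sagemaker_boto_client=sm_client,
    )

    estimator = HuggingFace(
        entry_point="train.py",          # hypothetical training script name
        source_dir="scripts",            # hypothetical source directory
        instance_type="ml.p3.2xlarge",
        instance_count=1,
        role=role,
        transformers_version="4.6.1",    # example versions; use a supported combination
        pytorch_version="1.7.1",
        py_version="py36",
        hyperparameters={"model_name": model_name, "epochs": 3},
        metric_definitions=metric_definitions,
    )

    estimator.fit(
        {
            "train": f"{s3_prefix}/{model_name}/train",
            "test": f"{s3_prefix}/{model_name}/test",
        },
        experiment_config={
            "TrialName": trial.trial_name,
            "TrialComponentDisplayName": "Training",
        },
        wait=False,  # don't block, so both jobs run in parallel
    )
```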
The code above kicks off two training jobs (one for each model) in parallel. If that is not possible in your AWS account (for example, because the number of concurrent training instances is limited), you can also run the training jobs sequentially. As long as they are associated with the same experiment via their Trial objects you will be able to evaluate and compare the models.
Comparing the models
After around 30 minutes both models have been trained and it is time to retrieve the results:
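One way to do this is via the ExperimentAnalytics class, roughly like so (again assuming the nlp_experiment object from above):

```python
import sagemaker
from sagemaker.analytics import ExperimentAnalytics

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker.Session(),
    experiment_name=nlp_experiment.experiment_name,
)

# one row per trial component, with metric statistics as columns
analytics_df = trial_component_analytics.dataframe()
```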
The resulting dataframe holds all the information required to compare the two models. For example, we can retrieve the average values for all the metrics we defined like this:

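The metric columns follow the pattern "&lt;metric name&gt; - &lt;statistic&gt;"; a sketch of pulling out the averages, assuming the dataframe from above and the metric names we defined earlier:

```python
avg_columns = [
    "eval_accuracy - Avg",
    "eval_f1 - Avg",
    "eval_precision - Avg",
    "eval_recall - Avg",
]

# compare the average metric values of the two training jobs side by side
print(analytics_df[["TrialComponentName"] + avg_columns])
```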
We can see that distilroberta-base performed slightly better with respect to recall and distilbert-base-uncased performed better with respect to F1 score, precision, and accuracy. There are many more columns in the dataframe which I will leave to the reader to explore further.
Conclusion
In this article we have created a Sagemaker Experiment to track and compare NLP models. We created a Trial for each model and collected various evaluation metrics. After the models had been fine-tuned, we were able to access these metrics via a Pandas dataframe and compare the models in a convenient way.