
A better comparison of TensorBoard experiments

An existing TensorBoard limitation is that it only considers the last epoch when ranking experiments. Here is how to better evaluate…

Image from Pixabay

If you use TensorFlow to develop and experiment with Machine Learning algorithms, you have most likely come across TensorBoard and used it to record the results of each run so you can compare them visually. You may even have done some hyperparameter tuning by recording various metrics and comparing your results.

However, as of this writing, TensorBoard can only compare results based on the metrics obtained in the last epoch. Despite having the full history of values for each run, when sorting the results by a specific metric it only considers the last epoch. Depending on the use case, this can lead to an incorrect ranking of model runs across different hyperparameters.

The purpose of this article is to present a better way of comparing the results logged in TensorBoard, covering not only hyperparameter tuning but also comparisons of ordinary scalar runs. Specifically, we will access these metrics through the Python TensorBoard library, transform them into a Pandas DataFrame, and from there group the runs and compare them according to a sub-selection of metrics.

Uploading the experiments to TensorBoard.Dev

First, we will upload our directory of experiments to TensorBoard.dev. For those of you who are not familiar with this tool, it allows us to upload and share our ML experiments with anyone and, best of all, with zero setup.

If you do not have it installed, you can do it through:

pip install -U tensorboard

Then, you need to run the following command, specifying your logdir and optionally a name for the experiment you want to upload:

tensorboard dev upload --logdir REPLACE_LOG_DIR \
  --name "(optional) My latest experiment"

The first execution of that command will prompt you to authenticate. Once that is completed, you will get a link that looks like "https://tensorboard.dev/experiment/EXPERIMENT_ID".

Loading the experiments into a Pandas DataFrame

The next step is to load the uploaded experiments into a local Pandas DataFrame. This can currently be done by passing the experiment ID to the TensorBoard API call _tensorboard.data.experimental.ExperimentFromDev()_ and then fetching the experiment scalars.

Loading TensorBoard.Dev experiment into a Pandas DataFrame. Image by author.
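As a minimal sketch of this step, the snippet below loads the scalars of an uploaded experiment; "EXPERIMENT_ID" is a placeholder for the ID returned by the upload command:

import tensorboard as tb

# Placeholder: replace with the ID returned by `tensorboard dev upload`.
experiment_id = "EXPERIMENT_ID"

experiment = tb.data.experimental.ExperimentFromDev(experiment_id)

# Long-form DataFrame with one row per (run, tag, step) and the columns:
# run, tag, step, value.
df = experiment.get_scalars()
print(df.head())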

To convert the DataFrame to a more usable wide-form format, we could call _get_scalars(pivot=True)_. However, this will not work if the experiments do not all log the same metrics and we still want to compare runs even when some of them are missing a certain metric. In that case, we have to pivot the table manually by running:

Converting DataFrame to wide-form format. Image by author.
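A minimal sketch of the manual pivot, assuming the long-form DataFrame `df` from the previous step:

# Manual pivot to wide form: unlike get_scalars(pivot=True), runs that are
# missing a metric are tolerated -- the corresponding cells simply become NaN.
df_wide = df.pivot_table(index=["run", "step"], columns="tag", values="value")
df_wide = df_wide.reset_index()
print(df_wide.head())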

Ranking the results

Once we have the DataFrame prepared and correctly formatted, the only thing left to do is to rank the results considering not only the last epoch but all the epochs of each run. Also, when using multiple metrics to compare the runs and looking for the model that performs best across all of them, we will combine them with the harmonic mean, since it fits ratio-like metrics better than the simple arithmetic mean.
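To make the harmonic-mean step concrete, here is a small sketch assuming the wide-form DataFrame `df_wide` from above; the metric column names are hypothetical:

from scipy.stats import hmean

# Hypothetical metric columns; assumes all values are positive (e.g. accuracies).
metrics = ["epoch_accuracy", "epoch_precision"]

# The harmonic mean penalizes runs where any single metric is low,
# which a simple arithmetic mean would partly hide.
df_wide["overall"] = hmean(df_wide[metrics], axis=1)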

For this, I have created a gist, available below, where all these functionalities are wrapped into a single function. To call it, only two arguments are needed: the ID of the experiment we uploaded to TensorBoard.Dev and the metrics to be used to compare the results. There are also optional arguments to consider only the validation runs, to sort the values based on overall performance, or to format the returned metrics as percentages.
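As a rough sketch of what such a function could look like (not the gist itself: the name `rank_tensorboard_runs`, its signature, and the choice of taking the per-run maximum of each metric are my own assumptions):

import tensorboard as tb
from scipy.stats import hmean


def rank_tensorboard_runs(experiment_id, metrics, only_validation=True,
                          sort_by_overall=True, as_percentage=False):
    """Rank the runs of a TensorBoard.Dev experiment over all epochs, not just the last."""
    df = tb.data.experimental.ExperimentFromDev(experiment_id).get_scalars()

    if only_validation:
        # Keras-style validation runs usually contain "validation" in their name.
        df = df[df["run"].str.contains("validation")]

    # Manual pivot so that runs missing a metric become NaN instead of failing.
    wide = (df.pivot_table(index=["run", "step"], columns="tag", values="value")
              .reset_index())

    # Best value of each metric over all epochs (assumes higher is better).
    best = wide.groupby("run")[metrics].max()

    # Combine the selected metrics with the harmonic mean.
    best["overall"] = hmean(best[metrics], axis=1)

    if sort_by_overall:
        best = best.sort_values("overall", ascending=False)
    if as_percentage:
        best = best.applymap(lambda v: f"{v:.2%}")
    return best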

As an example, here is a run on some experiments I performed on the DeepFashion dataset.

Example of call on DeepFashion experiments. Image by author.
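For reference, a hypothetical call along these lines, using the sketch above; the experiment ID and metric names are placeholders rather than the actual DeepFashion values:

# Placeholder experiment ID and metric names for illustration only.
ranking = rank_tensorboard_runs(
    "EXPERIMENT_ID",
    metrics=["epoch_accuracy", "epoch_top_k_categorical_accuracy"],
    only_validation=True,
    as_percentage=True,
)
print(ranking)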
Output of executing previous code. Image by author.

Final Words

In this article, I have presented a more accurate way to evaluate your TensorFlow experiments, even when looking at multiple metrics, by manually accessing the TensorBoard data. I hope you liked this article and found it helpful! 😄 You can also check out my latest work in:

