
Preparing the ImageNet dataset with TensorFlow

Conveniently setting up 150 GB of image data with TensorFlow Datasets

Without a doubt, the ImageNet dataset has been a critical factor in developing advanced Machine Learning algorithms. Its sheer size and many classes have been challenging to handle, and these challenges have led to better data-handling tools and novel Neural Network architectures.

Photo by Hunter Harritt on Unsplash

TensorFlow Datasets is one such data-handling tool. With its help, you can conveniently access a variety of datasets from many categories. In most cases, the library downloads the data for you automatically.

However, ImageNet is an exception; it requires a manual setup. While there are some instructions on how to achieve that, they are somewhat vague. As a result, it took me some time to prepare the dataset, but I ended up with a concise script. In the following, I'll guide you through it so that you can experiment with the ImageNet dataset as well.

Downloading ImageNet

Before we do any preparation, we need to obtain the dataset. To do so, go to the sign-up page and create an account. After you have done this and requested access to the ImageNet dataset, proceed to the download page. Under "ImageNet Large-scale Visual Recognition Challenge (ILSVRC)", select the 2012 version. This will lead you to a new page:

ImageNet download page. Screenshot by the author.

On this page, we need to download the "Training Images (Task 1 & 2)" and "Validation Images (all tasks)" files. Because they add up to roughly 150 GB, this will take some time. If you have access to a university network, I recommend using it; such networks often have excellent download speeds.

Afterward, you have two files. The first one, ILSVRC2012_img_train.tar, contains the training images, organized into one sub-archive per class, which is how the labels are encoded. The second one, ILSVRC2012_img_val.tar, contains the validation images; TensorFlow Datasets adds the matching labels during preparation. With these archives at hand, we can now prepare the dataset for actual use.

Preparing the ImageNet dataset

The script to prepare the dataset is only a few lines long; we walk through it step by step below. Adapt any directory paths to your case.

To install TensorFlow Datasets, run

pip install tensorflow-datasets

After the necessary installations and imports, we define the path where the ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar files are located.
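A minimal sketch of this step could look as follows; the download_dir name and the path are only placeholders for wherever you stored the archives:

import os

import tensorflow_datasets as tfds

# Directory that holds ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar.
download_dir = os.path.expanduser("~/imagenet/downloads")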

Then, we set some configuration parameters. The manual_dir parameter is the key here: it ensures that TensorFlow Datasets searches for the downloaded archives at our specified location. The extract_dir is a temporary directory used during dataset preparation; you can delete it afterward.
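Reusing the download_dir variable from above (with an arbitrary "extracted" subdirectory as scratch space), the configuration might look like this:

download_config = tfds.download.DownloadConfig(
    # Where TensorFlow Datasets looks for the manually downloaded archives.
    manual_dir=download_dir,
    # Temporary scratch space used while unpacking; safe to delete afterward.
    extract_dir=os.path.join(download_dir, "extracted"),
)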

Lastly, we implement the actual building process. The preparation is done with a DatasetBuilder object, which "knows" how to set up a particular dataset. To get the builder for ImageNet, we instantiate it by passing the name "imagenet2012".

We then call the actual preparation method, download_and_prepare(), on this object. The only thing we do here is pass our configuration object.
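Continuing the sketch from above, these final lines could look like this:

# The builder "knows" how to parse the ImageNet archives.
builder = tfds.builder("imagenet2012")

# Reads the archives from manual_dir and writes the prepared dataset
# to the TFDS data directory (~/tensorflow_datasets/ by default).
builder.download_and_prepare(download_config=download_config)

Once the build has finished, you can sanity-check the result by loading a split like any other TFDS dataset, for example with tfds.load("imagenet2012", split="validation").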

That’s all we have to do on the Python side.

Running the script

To run the script (saved here as prepare_imagenet.py; adapt the name to your file), we type

python prepare_imagenet.py

This builds the ImageNet dataset in the default directory, ~/tensorflow_datasets/. To change this, we can call the script with

TFDS_DATA_DIR=/path/of/your/choice python prepare_imagenet.py

We prepend TFDS_DATA_DIR to set the environment variable responsible for the build location to a directory of our choice. This is mainly useful on compute clusters, where multiple workers access the same datasets.
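For example, on a cluster with a shared dataset directory (the path is purely illustrative):

TFDS_DATA_DIR=/shared/tensorflow_datasets python prepare_imagenet.py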


Summary

We have worked through setting up the ImageNet dataset. Unfortunately, we cannot set up the test split as conveniently. Further, no test labels are provided, which prevents models from being tuned on the test data. Therefore, the only way to assess one’s model is by uploading a file with an image-to-predicted-label mapping to the grading servers. To get the test images, download the test archive from the same download page as before and extract it to a directory of your choice. Any further processing then follows a typical data pipeline; however, that is beyond this article’s scope. To go further, look at Keras’ functionality to achieve this or get an impression of working with datasets in this post.
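For reference, unpacking the test archive is a single command; the archive name and target directory below are only examples, so check the exact file name on the download page:

mkdir -p ~/imagenet/test
tar -xf ILSVRC2012_img_test.tar -C ~/imagenet/test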

