When you have a good working algorithm and you want to test your masterpiece on some dataset, you almost always have to spend quite a lot of time on the actual loading and preprocessing of data. It would be quite nice if we could have all the data in a single format and a consistent way of accessing it (e.g. always storing training images under the key "train/image").
Here I’ll be sharing a GitHub repo I wrote that converts several popular datasets into HDF5 format. It currently supports the following datasets:
- ILSVRC ImageNet
- CIFAR-10 and CIFAR-100 datasets
- SVHN dataset
What does this code do?
This repository does quite a few things, so first let me describe its organization. The code base is pretty simple: there is a single file per dataset that preprocesses the data and saves it as HDF5 (preprocess_imagenet.py for ImageNet, preprocess_cifar.py for CIFAR-10 and CIFAR-100, and preprocess_svhn.py for SVHN). Essentially, each file does the following (a rough sketch of this workflow appears after the list):
- Load the original data into memory
- Perform any reshaping required to bring the data to the proper dimensionality (e.g. the CIFAR datasets give each image as a flat vector, which needs to be reshaped into a 3-dimensional matrix)
- Create an HDF5 file to save the data in
- Use the Python multiprocessing library to process each image according to the user's specifications
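To make those four steps concrete, here is a minimal sketch using a single CIFAR-10 batch. The file paths, the per-image processing function, and the output keys are illustrative assumptions rather than the repository's actual implementation.

import pickle
import numpy as np
import h5py
from multiprocessing import Pool

def load_cifar_batch(path):
    # CIFAR-10 batches are pickled dicts holding one flat 3072-element vector per image
    with open(path, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')
    images = np.asarray(batch[b'data'], dtype=np.uint8)    # shape (N, 3072)
    labels = np.asarray(batch[b'labels'], dtype=np.int32)  # shape (N,)
    # Reshape each flat vector into a 3-dimensional (32, 32, 3) image
    images = images.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, labels

def process_image(image):
    # Per-image processing step; here just a cast and rescale to [0, 1]
    return image.astype(np.float32) / 255.0

if __name__ == '__main__':
    images, labels = load_cifar_batch('cifar-10-batches-py/data_batch_1')

    # Process images in parallel with the multiprocessing library
    with Pool() as pool:
        processed = np.stack(pool.map(process_image, images))

    # Create an HDF5 file and save the processed data under fixed keys
    with h5py.File('cifar10.hdf5', 'w') as f:
        f.create_dataset('/train/images', data=processed)
        f.create_dataset('/train/labels', data=labels)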
Below I’ll walk through what the ImageNet file does, since it is the most complicated one; the others are quite straightforward.
The preprocess_imagenet.py script saves a subset of the ImageNet data as an HDF5 file. The subset consists of data belonging to a number of natural classes (e.g. plant, cat) and artificial classes (e.g. chair, desk). Furthermore, you can normalize the data while saving it.
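As an example of the kind of normalization meant here, one common option is per-image standardization. The helper below is only an illustration of the idea, not the script's exact behavior.

import numpy as np

def standardize_image(image):
    # Zero-mean, unit-variance normalization of a single image (an illustrative choice)
    image = image.astype(np.float32)
    return (image - image.mean()) / (image.std() + 1e-8)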
The save_imagenet_as_hdf5(...) function takes over as soon as you run the script. It first creates a mapping between the validation dataset filenames and their labels (i.e. build_or_retrieve_valid_filename_to_synset_id_mapping(...)). Next, it isolates the classes related to the ImageNet classification problem (1,000 classes) with write_art_nat_ordered_class_descriptions(...) or retrieve_art_nat_ordered_class_descriptions(...). Then the selected artificial and natural class information is written to an XML file using the write_selected_art_nat_synset_ids_and_descriptions(...) method.
Next, we sweep through all the subdirectories in the training data and load the related data points into memory. We then create HDF5 files to save the data; this is done with the save_train_data_in_filenames(...) function. The data will be saved under the following keys:
/train/images/
/train/labels/
/valid/images/
/valid/labels/
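Since the ImageNet subset is typically too large to write in one shot, the datasets under these keys can be pre-allocated and filled batch by batch. The file name, shapes, dtypes, and chunk sizes below are illustrative assumptions rather than the repository's actual settings.

import h5py
import numpy as np

num_train, num_valid = 100000, 10000
height, width, channels = 224, 224, 3

with h5py.File('imagenet_subset.hdf5', 'w') as f:
    # Pre-allocate chunked datasets under the keys listed above
    train_images = f.create_dataset('/train/images', (num_train, height, width, channels),
                                    dtype='float32', chunks=(64, height, width, channels))
    train_labels = f.create_dataset('/train/labels', (num_train,), dtype='int32')
    valid_images = f.create_dataset('/valid/images', (num_valid, height, width, channels),
                                    dtype='float32', chunks=(64, height, width, channels))
    valid_labels = f.create_dataset('/valid/labels', (num_valid,), dtype='int32')

    # Preprocessed images can then be written one batch at a time, e.g.:
    batch = np.zeros((64, height, width, channels), dtype=np.float32)  # placeholder batch
    train_images[0:64] = batch
    train_labels[0:64] = np.zeros(64, dtype=np.int32)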
Accessing and Loading Data Later
You can access the saved data later as follows:
import os
import h5py

dataset_file = h5py.File("data" + os.sep + "filename.hdf5", "r")
train_dataset, train_labels = dataset_file['/train/images'], dataset_file['/train/labels']
test_dataset, test_labels = dataset_file['/test/images'], dataset_file['/test/labels']
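Because HDF5 datasets are read lazily, you can then slice mini-batches straight from disk without loading the whole array into memory; for example (the batch size here is an arbitrary choice):

batch_images = train_dataset[0:128]  # numpy array holding the first 128 images
batch_labels = train_labels[0:128]
dataset_file.close()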
Code and Further Reading
The code is available here, and you can find a full description of what it does and how to run it in my blog post.
Note: If you run into any issues or errors while running the code, please let me know in a comment or by opening an issue on the GitHub page. That will help me improve the code and get rid of any pesky bugs.
Cheers!