HDF5 Datasets For PyTorch

Branislav Holländer
Towards Data Science
4 min read · Apr 12, 2019

If you work in the area of Computer Vision, you have certainly heard of HDF5. The Hierarchical Data Format (HDF) version 5 is a popular format for storing and exchanging unstructured data such as images, videos or volumes in raw format for use in various areas of research or development. It is supported by many programming languages and APIs and is therefore becoming increasingly popular. This also applies to storing data for use in machine learning workflows. In this post I present a possible approach (including ready-to-use code) to using HDF5 data for training deep learning algorithms in PyTorch.

HDF5 File Format

An HDF5 file consists of two major types of objects: datasets and groups. Datasets are multidimensional arrays of a homogeneous type, such as 8-bit unsigned integers or 32-bit floating-point numbers. Groups, on the other hand, are hierarchical structures designed for holding datasets or other groups, building a file system-like hierarchy of datasets. Additionally, groups and datasets may have metadata in the form of user-defined attributes attached to them.

Python supports the HDF5 format via the h5py package. This package wraps the native HDF5 C API and supports almost the full functionality of the format, including reading and writing HDF5 files.
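To get a feeling for the API, here is a minimal sketch of writing and reading a file with h5py (the file name, group layout, and array shapes are made-up examples):

    import h5py
    import numpy as np

    # Write a file containing one group with a 'data' and a 'label' dataset.
    with h5py.File('example.h5', 'w') as f:
        group = f.create_group('sample_0')
        group.create_dataset('data', data=np.random.rand(64, 64).astype(np.float32))
        group.create_dataset('label', data=np.array([1], dtype=np.uint8))
        group.attrs['source'] = 'synthetic'  # a user-defined attribute

    # Read it back; [()] pulls a whole dataset into memory as a NumPy array.
    with h5py.File('example.h5', 'r') as f:
        data = f['sample_0/data'][()]
        print(data.shape, f['sample_0'].attrs['source'])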

If you need to view or edit your HDF5 files in a visual editor, you can download the official HDFView application. The application supports viewing datasets of different formats in a tabular way or as an image. Furthermore, it enables you to edit the file by creating new groups and datasets and renaming and moving the existing ones.

PyTorch datasets

In order to load your data into PyTorch efficiently, you have to write your own Dataset class (or use one of the predefined ones). This is accomplished by inheriting from torch.utils.data.Dataset and overriding the functions __len__ (so that calling len() on the Dataset returns the length of the dataset) and __getitem__ (to enable indexing).
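In its simplest form, the pattern looks like this (a toy sketch; the class and its contents are purely illustrative):

    from torch.utils.data import Dataset

    class SquaresDataset(Dataset):
        """A toy Dataset returning (index, index**2) pairs."""
        def __init__(self, n):
            self.n = n

        def __len__(self):
            return self.n  # what len(dataset) returns

        def __getitem__(self, index):
            return index, index ** 2  # what dataset[index] returns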

The class torch.utils.data.DataLoader is then used to sample from the Dataset in a predefined way (e.g. shuffling the Dataset randomly, choosing the batch size, etc.). The main advantage (and the magic) of data loading in PyTorch lies in the fact that the data loading may happen in a parallel fashion without you ever having to deal with worker processes and synchronization mechanisms yourself. This works by simply setting the parameter num_workers in the DataLoader constructor to the desired number of worker processes. As an example of using the Dataset and DataLoader classes in PyTorch, look at the code snippet below, showing how to use the HDF5 Dataset in your program. We will look at how to actually implement the Dataset in the next section.
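A sketch of such usage might look as follows (the constructor arguments mirror the HDF5Dataset class implemented in the next section; the path and hyperparameters are placeholders):

    from torch.utils.data import DataLoader

    # Placeholder path and loader parameters.
    dataset = HDF5Dataset('/path/to/hdf5/files', recursive=True,
                          load_data=False, data_cache_size=4,
                          transform=None)
    data_loader = DataLoader(dataset, batch_size=16, shuffle=True,
                             num_workers=4)

    for data, label in data_loader:
        pass  # feed the batch to your model here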

The HDF5 Dataset Class

I designed the HDF5 Dataset class with multiple goals in mind:

  1. use folders (including subfolders) containing HDF5 files as a data source,
  2. maintain a simple HDF5 group hierarchy in the Dataset,
  3. enable lazy data loading (i.e. upon request by the DataLoader) in order to allow working with Datasets which do not fit into memory,
  4. maintain a data cache to speed up the data loading process, and
  5. allow custom transforms of the data.

I decided to enforce a simple structure on the dataset, with the individual datasets placed in separate groups, like this:
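(Schematically; the file, group, and dataset names are illustrative.)

    file1.h5
    ├── group_0
    │   ├── data   (e.g. an image or volume)
    │   └── label  (e.g. a ground truth segmentation)
    ├── group_1
    │   ├── data
    │   └── label
    └── ...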

This reflects the usual structure of data for many machine learning tasks. Usually, there is one dataset in every group containing the data and one or more datasets containing the labels. For instance, a dataset for image segmentation might consist of the image to be segmented (one dataset) as well as the ground truth segmentation (another dataset). These are placed in one group so that it is clear which labels belong to which data. Furthermore, higher-level semantic hierarchies might be constructed by placing different types of data in different HDF5 files (e.g. segmentations performed by different users).

Without further ado, here is the actual code:
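What follows is a possible implementation along these lines; the class and method names (HDF5Dataset, get_data, _load_data, _add_to_cache) follow the description below, but treat it as a sketch rather than the exact original, and note that it evicts a whole file from the cache at once, a slight simplification:

    import random
    from pathlib import Path

    import h5py
    import torch
    from torch.utils.data import Dataset

    class HDF5Dataset(Dataset):
        """Represents an HDF5 dataset spread over one or more files.

        file_path:       directory containing the HDF5 files
        recursive:       also search sub-directories
        load_data:       load everything into memory right away
        data_cache_size: number of files to keep in the cache
        transform:       optional transform applied to the data
        """

        def __init__(self, file_path, recursive=True, load_data=False,
                     data_cache_size=3, transform=None):
            super().__init__()
            self.data_info = []   # one entry per chunk (HDF5 dataset) found
            self.data_cache = {}  # file path -> list of loaded arrays
            self.data_cache_size = data_cache_size
            self.transform = transform

            # Search for all HDF5 files in the directory (and sub-directories).
            p = Path(file_path)
            files = sorted(p.glob('**/*.h5' if recursive else '*.h5'))
            if not files:
                raise RuntimeError(f'No HDF5 files found in {p}')
            for h5_path in files:
                self._add_data_infos(str(h5_path), load_data)

        def __len__(self):
            return len(self.get_data_infos('data'))

        def __getitem__(self, index):
            # Fetch data and label, apply the transform, return tensors.
            x = self.get_data('data', index)
            x = self.transform(x) if self.transform else torch.from_numpy(x)
            y = torch.from_numpy(self.get_data('label', index))
            return x, y

        def get_data_infos(self, data_type):
            return [di for di in self.data_info if di['type'] == data_type]

        def get_data(self, data_type, index):
            """Return one chunk, loading its file into the cache first if
            necessary. This is where the lazy loading happens."""
            info = self.get_data_infos(data_type)[index]
            if info['file_path'] not in self.data_cache:
                self._load_data(info['file_path'])  # also sets info['cache_idx']
            return self.data_cache[info['file_path']][info['cache_idx']]

        def _add_data_infos(self, file_path, load_data):
            # Record the file of origin, type, shape and cache slot of every
            # chunk; cache_idx stays -1 as long as the chunk is not loaded.
            with h5py.File(file_path, 'r') as h5_file:
                for group in h5_file.values():
                    for dname, ds in group.items():
                        idx = self._add_to_cache(ds[()], file_path) if load_data else -1
                        self.data_info.append({'file_path': file_path,
                                               'type': dname,
                                               'shape': ds.shape,
                                               'cache_idx': idx})

        def _load_data(self, file_path):
            """Load a whole file into the cache and evict a random other
            file if data_cache_size was exceeded."""
            with h5py.File(file_path, 'r') as h5_file:
                for group in h5_file.values():
                    for dname, ds in group.items():
                        idx = self._add_to_cache(ds[()], file_path)
                        # The chunks of one file are contiguous in data_info
                        # and iterated in the same order as above, so the
                        # idx-th cached array belongs to the idx-th entry.
                        first = next(i for i, di in enumerate(self.data_info)
                                     if di['file_path'] == file_path)
                        self.data_info[first + idx]['cache_idx'] = idx
            if len(self.data_cache) > self.data_cache_size:
                victim = random.choice(
                    [fp for fp in self.data_cache if fp != file_path])
                del self.data_cache[victim]
                for di in self.data_info:
                    if di['file_path'] == victim:
                        di['cache_idx'] = -1

        def _add_to_cache(self, data, file_path):
            # Append one array to the per-file cache and return its slot index.
            self.data_cache.setdefault(file_path, []).append(data)
            return len(self.data_cache[file_path]) - 1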

As you can see, the Dataset is initialized by searching for all HDF5 files in a directory (and its sub-directories) and building a data_info structure containing information about each chunk of data, such as which file it comes from, which type it has (‘data’ or ‘label’ in this example, but you can define others) and its shape. The shape is often needed to determine the size of the dataset, so it is important to store. Additionally, for each chunk we also store its data cache index. The index is ≥ 0 if the data is currently loaded and -1 if we did not load it yet.

If the DataLoader now requests some data, the __getitem__ function is called, which in turn calls the get_data function. Note that we cannot simply index into an array here, because we first have to make sure that the data is actually in memory. In get_data, we therefore look the data up in the cache or, if it is not in the cache, load it and return it to the caller. This happens in the _load_data function, which does two things: it loads the data and adds it to the cache, and it removes a random chunk of data from the cache if data_cache_size was exceeded.

After obtaining the data, it is transformed according to the transformation you provided in the constructor and converted to a torch.Tensor.

Conclusion

In this post I presented a simple but powerful HDF5 Dataset class which you can use for loading HDF5 datasets in PyTorch. I hope it will be of some use to you. If you happen to have any questions or further suggestions, do not hesitate to drop a comment below.
