
Introduction
When training a neural network, one of the most common speed-related bottlenecks is the data loading module. If we are bringing the data over the network, there aren't many easy optimizations we can apply beyond prefetching and caching.
However, if the data is in local storage, we can optimize the file reading operations by combining the entire dataset into a single file and mapping it into main memory. This way we don't need to make a costly system call for each file read; instead, we let the virtual memory manager handle the memory accesses.
The stops on this short journey:
- What is a memory-mapped file
- What is a PyTorch Dataset
- Implementing our custom Dataset
- Benchmark
- Conclusion
What is a memory-mapped file?
A memory-mapped file is a file whose contents are assigned directly to a segment of virtual memory. This lets us perform any operation on that segment just as we would on any other portion of main memory the current process has access to.
Due to the additional abstraction layer provided by virtual memory, we can map files that are much larger than the physical memory of our machine. The segments of memory (called pages) required by the running process are fetched from external storage and copied into main memory automatically by the virtual memory manager.
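For a quick intuition, here is a minimal sketch using Python's built-in mmap module; the file name data.bin is just a placeholder (it is assumed to exist and be non-empty):

```python
import mmap

# Map an existing, non-empty file into memory and read from it
# as if it were an ordinary in-memory bytes object.
with open("data.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header = mm[:16]          # touches only the page(s) backing these bytes
        print(len(mm), header)    # total mapped size and the first 16 bytes
```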
Benefits of using a memory-mapped file:
- Increased I/O performance: a normal read/write operation that goes through a system call is much slower than modifying memory that is already mapped into the process
- The file is loaded in a "lazy" fashion, usually just one page at a time, so actual RAM utilization stays minimal even for larger files.
What is a PyTorch Dataset?
PyTorch provides two main modules for handling the data pipeline when training a model: Dataset and DataLoader.
DataLoader is mainly used as a wrapper over the Dataset; it provides many configurable options such as batching, sampling, prefetching, and shuffling, and abstracts away a lot of complexity.
The Dataset is the part where we have most of the control and where we actually write how the data is made available to the training process, including loading the samples into memory and applying any necessary transformations.
From a high-level perspective, we have to implement three functions: __init__, __len__, and __getitem__; we will see a concrete example in the next section.
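As a reference point before adding memory mapping, a minimal sketch of such a class could look like the following; the class name InMemoryDataset and the toy data are illustrative choices, not part of the article's implementation:

```python
from torch.utils.data import Dataset, DataLoader

class InMemoryDataset(Dataset):
    # __init__ prepares the data source and keeps an optional transform around.
    def __init__(self, samples, transform=None):
        self.samples = samples
        self.transform = transform

    # __len__ reports how many samples the dataset contains.
    def __len__(self):
        return len(self.samples)

    # __getitem__ returns a single (optionally transformed) sample by index.
    def __getitem__(self, idx):
        sample = self.samples[idx]
        return self.transform(sample) if self.transform else sample

# The DataLoader wraps the Dataset and adds batching, shuffling, workers, etc.
loader = DataLoader(InMemoryDataset(list(range(10))), batch_size=4, shuffle=True)
for batch in loader:
    print(batch)  # each batch is a tensor of up to 4 elements
```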
Implementing our custom Dataset
Next, we will see the implementations for the three functions mentioned above.
The most important part is in __init__, where we use the np.memmap() function from NumPy to create an ndarray backed by a memory buffer that is mapped to a file.
The ndarray is populated from an iterable (preferably a generator, to keep memory utilization to a minimum), which keeps the Dataset flexible with respect to the modality and type of data it supports. We can also provide a transformation function that will be applied to the data when it is retrieved from the dataset.
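A minimal sketch of how these pieces can fit together is shown below; the class name MMapDataset and parameters such as num_samples and sample_shape are assumptions made for this example, and the actual implementation linked below may differ:

```python
import numpy as np
from torch.utils.data import Dataset

class MMapDataset(Dataset):
    def __init__(self, data_iter, path, num_samples, sample_shape,
                 dtype=np.float32, transform=None):
        self.transform = transform
        # Create (or overwrite) the backing file and map it into virtual memory.
        self.data = np.memmap(path, dtype=dtype, mode="w+",
                              shape=(num_samples, *sample_shape))
        # Populate the memmap from the iterable; with a generator only one
        # sample is materialized in RAM at a time.
        for i, sample in enumerate(data_iter):
            self.data[i] = sample
        self.data.flush()  # make sure everything is written back to the file

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Copy this one sample out of the mapped file into regular memory.
        sample = np.array(self.data[idx])
        if self.transform is not None:
            sample = self.transform(sample)
        return sample
```

Passing a generator that decodes the samples one by one (for example, reading and resizing images) keeps peak memory usage low while the backing file is being built.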
For a more comprehensive view and other examples, the full project is also available on GitHub here.
The repository also contains two utility functions used by the implementation.
Benchmark
To present a real example of the performance gain, I compared the memory-mapped Dataset implementation with a normal one that reads the files from disk in the classic lazy fashion. The dataset used here is composed of 350 JPG images. The code for the benchmark can be seen here.
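At its core, such a comparison just times one full pass over each Dataset. A rough sketch of that measurement (not the linked benchmark itself; mmap_dataset and standard_dataset are placeholders) could look like this:

```python
import time
from torch.utils.data import DataLoader

def time_one_epoch(dataset, batch_size=32):
    loader = DataLoader(dataset, batch_size=batch_size)
    start = time.perf_counter()
    for _ in loader:
        pass  # iterate over every batch once, discarding the data
    return time.perf_counter() - start

# mmap_dataset and standard_dataset stand in for the two implementations:
# print(f"memmap:   {time_one_epoch(mmap_dataset):.3f}s")
# print(f"standard: {time_one_epoch(standard_dataset):.3f}s")
```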
From the results below, we can see that our Dataset is over 30 times faster than the normal one:

Conclusion
The implementation presented in this article is by no means production grade, but the idea behind it is quite valid, and there are many more uses one can find for a memory-mapped file approach when consuming medium to large files.
Thank you for reading! I hope you find this article helpful, and if you want to stay up to date with the latest programming and Machine Learning news and some good quality memes :), you can follow me on Twitter or connect on LinkedIn [here](https://www.linkedin.com/in/tudor-marian-surdoiu/).