Introduction to Memory Mapped IO

Super-fast and parallel file writing with memory mapped IO

Anuradha Wickramarachchi
Towards Data Science



What is a Memory Mapped File?

A memory mapped file is a segment of virtual memory. Virtual memory is an abstraction over physical memory that the operating system (OS) provides to each process. Should the OS run out of memory or see a process idling for long, such virtual memory segments are “paged” out to a location on the physical disk (the “swap”). So “swap” is essentially the portion of virtual memory that lives on disk.

More precisely, a memory mapped file is a mirror of a portion of a file (or the entire file) in virtual memory, managed completely by the operating system.

In other words, whenever a file is memory mapped, the operating system maps its contents to a region of virtual memory. You can think of this as some RAM space. Whenever the OS evicts that content back to disk (to accommodate another task), the next access causes a page fault; that is when the OS reloads the content into RAM. Our programs never see this happening directly; they just face occasional delays in memory mapped IO.

In this article, I will only talk about memory mapped writing. Reading is for another time! I will take a Bioinformatics-related task as the running example. Feel free to adapt it to your own domain and play around.

Perks of Memory Mapped File Writing

There are plenty of reasons to use memory mapped files, and plenty not to. I will only talk about the pluses!

Parallelism: A memory mapped region can be thought of as a huge byte array. You can read and write it from as many threads as you like, or even share it between processes, as long as race conditions do not occur.

Performance: Memory mapped writing is often fast because no stream/file buffers are used. The OS does the actual file writing, usually in blocks of several kilobytes at once.

One downside: unless you are writing sequentially, page faults can slow your program down.

Let’s Start Coding!


In this example, our task is as follows:

Given 1 million DNA sequences, vectorize them and write them out in text form for future machine learning tasks. Each vector has a fixed size of 136 dimensions and is obtained using a function vectorize(sequence); how it works is not relevant here. To make it more interesting, let's incrementally improve our program, starting with single-threaded conventional file writing.

Single Threaded Operation

We will simply read each sequence and vectorize it. The resulting vector is converted to a string and written to the output buffer. The summary C++ code is available here.

while (seq := has_sequence(input_file))
    vector := vectorize(seq)
    output << to_string(vector)
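
To make this concrete, here is a minimal single-threaded C++ sketch. The file names and the body of vectorize are assumptions (the real vectorizer is domain specific); a dummy body is included only so the sketch compiles and runs:

#include <fstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the real, domain-specific vectorizer:
// maps a sequence to a fixed 136-dimensional vector.
std::vector<float> vectorize(const std::string &seq) {
    std::vector<float> vec(136, 0.0f);
    for (char c : seq) vec[static_cast<unsigned char>(c) % 136] += 1.0f;
    return vec;
}

int main() {
    std::ifstream input("sequences.txt");   // assumed: one sequence per line
    std::ofstream output("vectors.txt");
    std::string seq;
    while (std::getline(input, seq)) {      // read one sequence at a time
        for (float v : vectorize(seq)) output << v << ' ';
        output << '\n';                     // buffered stream write
    }
}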

Multi Threaded Operation

In multi-threaded operation, we need to ensure that our outputs are in the same order as the input. Otherwise, we might lose track of which data point is which. Therefore, I use a queue for reading and a consumer that assembles batches and processes them in parallel. These two run asynchronously so that the queue stays full.

Queue := {}

thread_1:
    while (seq := has_sequence(input_file) and Queue.size() < 20000)
        Queue := Queue + seq

thread_2:
    while Queue.size() != 0
        batch := dequeue(Queue, 10000 items)
        batch_vectors := vectorize_batch(batch)
        output << to_string(batch_vectors)

Note that for clarity I have omitted mutexes and other fine-grained synchronisation tricks. The summary C++ code is available here. I have used condition variables to guard access to the queue and to terminate the threads.
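
A condensed C++ sketch of this producer/consumer pattern with condition variables might look as follows. The bounds (20,000 and 10,000) mirror the pseudocode above; the batch-processing step is left as a comment since it matches the single-threaded case:

#include <condition_variable>
#include <istream>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

std::queue<std::string> q;
std::mutex m;
std::condition_variable cv;
bool done = false;                          // set by the producer at end of input

void producer(std::istream &input) {
    std::string seq;
    while (std::getline(input, seq)) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return q.size() < 20000; });  // keep the queue bounded
        q.push(seq);
        cv.notify_all();
    }
    std::lock_guard<std::mutex> lk(m);
    done = true;
    cv.notify_all();
}

void consumer() {
    for (;;) {
        std::vector<std::string> batch;
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !q.empty() || done; });
            while (!q.empty() && batch.size() < 10000) {   // dequeue a batch
                batch.push_back(std::move(q.front()));
                q.pop();
            }
            if (batch.empty() && done) return;             // input exhausted
        }
        cv.notify_all();                                   // wake the producer
        // vectorize_batch(batch) and write the results, keeping input order
    }
}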

Memory Mapped Operation

There are a few things to keep in mind with memory mapped operations. When a new memory mapped file is created, we must specify its size. This is because we cannot dynamically grow the region assigned to us by the OS (we can increase the size by copying, like array resizing, but this can be painfully slow). Therefore, we need to have an idea of the total output size. This is easy for us, as we know the vector size, and we can get the total count of data items by quickly iterating once over the dataset. So our code would be as follows:

num_sequences := iterate_once_get_seqs()
vector_size := 136
float_size := 8
size_of_line := vector_size * (float_size + 1)
size_of_file := size_of_line * num_sequences
mmap_file := mmap.open(output, size_of_file, write_mode)
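
With the Boost mapped_file device this article uses (see the concluding remarks), the mmap.open step might look like the sketch below; the output file name is an assumption:

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>

int main() {
    const std::size_t vector_size   = 136;
    const std::size_t float_size    = 8;                     // characters per value
    const std::size_t size_of_line  = vector_size * (float_size + 1); // + separator
    const std::size_t num_sequences = 1000000;               // from one quick pass

    boost::iostreams::mapped_file_params params;
    params.path          = "vectors.txt";                    // assumed output name
    params.new_file_size = static_cast<boost::iostreams::stream_offset>(
        size_of_line * num_sequences);                       // size fixed up front
    params.flags         = boost::iostreams::mapped_file::readwrite;

    boost::iostreams::mapped_file mmap_file(params);         // creates and maps the file
    char *data = mmap_file.data();                           // the huge byte array
    (void)data;                                              // workers write here later
    mmap_file.close();
}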

Note that at this point we have created the memory mapped file. We also fix the length of each floating point value (either padding with zeros or truncating the decimal precision), which makes life easy when writing in a fixed-width fashion.

vectorize_function:
    lock(mutex)
    seq := has_sequence(input_file)
    unlock(mutex)
    if (seq):
        vec := vectorize(seq)
        mmap_file.write(offset, to_string(vec))

thread_pool(many threads)
for _ in range(num_workers):
    thread_pool.add_worker(vectorize_function, mmap_file)

Note that we have an offset parameter. It can be obtained by the following equation:

offset := size_of_line * sequence_id

The sequence id is the index of the sequence in the dataset, starting from zero. We move the memory map pointer from the beginning to the offset and copy the string vector there. The operating system takes care of the rest.

Note that we only have a mutex around file reading! No more blocking calls or batching of sequences.
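
Putting this together, a worker in C++ might look like the sketch below. It assumes the mapped_file from the earlier snippet, the dummy vectorize from the single-threaded sketch, and a hypothetical read_next helper that hands out sequences with their indices under an internal mutex; the "%8.6f" format assumes values in [0, 10):

#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical: returns false at end of input; serialises reads with a mutex.
bool read_next(std::string &seq, std::size_t &sequence_id);
std::vector<float> vectorize(const std::string &seq);

void worker(char *data) {                        // data = mmap_file.data()
    const std::size_t vector_size  = 136;
    const std::size_t float_size   = 8;
    const std::size_t size_of_line = vector_size * (float_size + 1);
    std::string seq;
    std::size_t sequence_id;
    while (read_next(seq, sequence_id)) {        // the only locked operation
        std::vector<float> vec = vectorize(seq);
        char *line = data + size_of_line * sequence_id;  // offset into the map
        for (std::size_t i = 0; i < vector_size; ++i) {
            char buf[16];
            std::snprintf(buf, sizeof buf, "%8.6f ", vec[i]);   // fixed width
            std::memcpy(line + i * (float_size + 1), buf, float_size + 1);
        }
        line[size_of_line - 1] = '\n';           // the last separator ends the line
    }
}

Because every sequence id maps to a disjoint byte range, the workers never need to synchronise on the map itself.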

Performance of Each Method

I used the /usr/bin/time -v program.o command to obtain the CPU and wall times for each method. The results were as follows.

Performance results: CPU usage, CPU time, wall time and RAM usage of each method (image by author)

Note that the CPU usage has increased from 80% to 420%, almost double that of the batch-wise multi-threaded approach. CPU time has remained almost the same, since we have more or less the same amount of computation to do. Wall time has been reduced to a quarter. However, RAM usage has increased from 2MB to almost 1GB, since the OS keeps the memory mapped file in RAM. Because I have sufficient RAM, the OS seems to have kept the file in RAM the entire time. The outputs were verified to be identical using the Linux diff command.

We have used almost all 4 cores of my computer! That's a win! And we have saved a lot of time as well.


Concluding Remarks

We can optimize this program further with various other tricks (like using memory maps for reading the file as well). However, those gains come at the cost of extra work and coding. In this example, I used the C++ Boost library (#include <boost/iostreams/device/mapped_file.hpp>). Boost has a portable memory mapping implementation that should, ideally, let you forget which platform you are running on.

Python also has a similar implementation (refer to the docs). So it should be fairly simple even for a Python programmer.

I hope you enjoyed reading this article and believe this knowledge will help you write better and faster programs. Find the complete project (intended for bioinformaticians) below.

Have a nice day! 😃
