A hands-on guide to TFRecords
TensorFlow’s custom data format TFRecord is really useful. The files are supported natively by the blazing-fast tf.data API, support distributed datasets, and leverage parallel I/O. But they are somewhat overwhelming at first. This post serves as a hands-on introduction.
Overview
In the following, we’ll use artificial data to go over the concept behind TFRecord files. With this in mind, we can then go on to work with images; we will use both a small and a large dataset. Expanding our knowledge, we then work with audio data. The last large domain is the text domain, which we’ll cover as well. To combine all this, we create an artificial multi-data-type dataset and, you guessed it, write it to TFRecords as well.
TFRecord’s layout
When I started my Deep Learning research, I naively stored my data scattered over the disk. To make things worse, I polluted my directories with thousands of small files, in the order of a few KB. The cluster I was then working on was not amused. And it took quite some time to get all these files loaded.
This is where TFRecords (or large NumPy arrays, for that matter) come in handy: Instead of storing the data scattered around, forcing the disks to jump between blocks, we simply store the data in a sequential layout. We can visualize this concept in the following way:

The TFRecord file can be seen as a wrapper around all the single data samples. Every single data sample is called an Example and is essentially a dictionary storing the mapping between a key and our actual data.
Now, the seemingly complicated part is this: When you want to write your data to TFRecords, you first have to convert your data to a Feature. These features are then the inner components of one Example:

So far, so good. But how does this differ from storing your data in a compressed NumPy array or a pickle file? Two things: First, the TFRecord file is stored sequentially, enabling fast streaming due to low access times. Second, TFRecord files are natively integrated into TensorFlow's tf.data API, which makes batching, shuffling, caching, and the like easy.
As a bonus, if you ever have the chance and the computing resources to do multi-worker training, you can distribute the dataset across your machines.
On a code level, the feature creation happens with these convenient methods, which we will talk about later on:
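A sketch of these helpers: the first three closely follow the official TensorFlow documentation, and serialize_array (the name is our choice) is a thin wrapper around tf.io.serialize_tensor that we'll use throughout:

```python
import tensorflow as tf

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_array(array):
    """Serializes an array (e.g. an image) into a byte string."""
    return tf.io.serialize_tensor(array)
```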
To write data to TFRecord files, you first create a dictionary that says
I want to store this data point under this key
When reading from TFRecord files, you invert this process by creating a dictionary that says
I have these keys; fill this placeholder with the value stored at this key
Let us see how this looks in action.
Image data, small

Images are a common domain in deep learning, with MNIST [1] and ImageNet [2] being two well-known datasets. There is a multitude of ways to get your images from disk into the model: writing a custom generator, using Keras' built-in tools, or loading them from a NumPy array. To make loading and parsing image data efficient, we can resort to TFRecords as the underlying file format.
The procedure is as follows: We first create some random images – that is, using NumPy to randomly fill a matrix of given image shape: width, height, and colour channels:
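A minimal sketch with NumPy; the value range is illustrative, but we keep the int16 dtype since the parsing code below relies on it:

```python
import numpy as np

# 100 random images of shape 250x250 with 3 colour channels
images = np.random.randint(low=0, high=256, size=(100, 250, 250, 3), dtype=np.int16)
print(images.shape)
```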
The output is as expected; we have 100 images of shape 250×250, with three channels each:
(100, 250, 250, 3)
We also create some artificial labels:
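For example, one random class label between 0 and 4 per image:

```python
labels = np.random.randint(low=0, high=5, size=(100, 1))
print(labels.shape)
print(labels[:10])
```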
As a result, we have a label array of shape (100,1), storing one label per image. The first ten labels are printed out:
(100, 1)
[[2] [4] [3] [3] [2] [4] [2] [3] [3] [0]]
To get these {image, label} pairs into the TFRecord file, we write a short method, taking an image and its label. Using our helper functions defined above, we create a dictionary to store the shape of our image in the keys height, width, and depth – we need this information to reconstruct our image later on. Next, we also store the actual image as raw_image. For this, we first serialize the array (think building a long list) and then convert it to a bytes feature. Lastly, we store the label for our image.
All these key:value mappings make up the features for one Example, as described above:
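A sketch of this parse_single_image function, using the helpers from above:

```python
def parse_single_image(image, label):
    # define the mapping between keys and our actual data
    data = {
        'height': _int64_feature(image.shape[0]),
        'width': _int64_feature(image.shape[1]),
        'depth': _int64_feature(image.shape[2]),
        'raw_image': _bytes_feature(serialize_array(image)),
        'label': _int64_feature(int(label)),
    }
    # wrap all features into a single Example
    return tf.train.Example(features=tf.train.Features(feature=data))
```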
Now that we have defined how we can create an Example from one pair of {image, label}, we need a function to write our complete dataset to a TFRecord file.
We begin by creating a TFRecordWriter, which is subsequently used to write the Examples to disk. For each image and corresponding label, we then use the function above to create such an object. Before writing it to disk, we have to serialize it. After we have consumed our data, we close our writer and print the number of files we have just parsed:
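A sketch of such a function; the name write_images_to_tfr_short is our choice:

```python
def write_images_to_tfr_short(images, labels, filename: str = "images"):
    filename = filename + ".tfrecords"
    writer = tf.io.TFRecordWriter(filename)  # creates and opens the file
    count = 0

    for index in range(len(images)):
        current_image = images[index]
        current_label = labels[index][0]

        # create an Example and serialize it before writing it to disk
        out = parse_single_image(image=current_image, label=current_label)
        writer.write(out.SerializeToString())
        count += 1

    writer.close()
    print(f"Wrote {count} elements to TFRecord")
    return count
```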
That is all that’s required to write images to a TFRecord file:
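For instance (the file name is illustrative):

```python
count = write_images_to_tfr_short(images, labels, filename="small_images")
```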
The output is as expected, as we have just parsed one hundred {image, label} pairs:
Wrote 100 elements to TFRecord
Having this file on our disk, we might also be interested in reading it later on. This is also possible and simply works the other way around.
Earlier, we defined a dictionary that we used to write our content to disk. We use a similar structure now, but this time to read the data. Previously, we said that the key width contains data of type int. Consequently, when we create our dictionary, we assign a placeholder of type int as well. And because we are dealing with features of fixed length (which we work with most of the time; sparse tensors are infrequently used), we say:
Give me the data that we have stored in the key ‘width’, and fill this placeholder with it
Similarly, we define the key:placeholder mappings for the other stored features. We then let the placeholders be filled by parsing our element with parse_single_example. Given that we are dealing with a dictionary, we can afterwards extract all values as usual by accessing the corresponding keys.
In the last step, we have to parse our image back from a serialized form to the (height, width, channels) layout. Notice that we want the out_type to be int16, which is required since we created the images with int16, too:
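Putting the read side together, this parse_tfr_element function could look as follows:

```python
def parse_tfr_element(element):
    # use the same keys as above, but with placeholders instead of data
    data = {
        'height': tf.io.FixedLenFeature([], tf.int64),
        'width': tf.io.FixedLenFeature([], tf.int64),
        'depth': tf.io.FixedLenFeature([], tf.int64),
        'raw_image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    }
    content = tf.io.parse_single_example(element, data)

    # parse the image back from its serialized form ...
    image = tf.io.parse_tensor(content['raw_image'], out_type=tf.int16)
    # ... and restore the original (height, width, channels) layout
    image = tf.reshape(image, shape=[content['height'], content['width'], content['depth']])
    return (image, content['label'])
```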
To create a dataset out of the parsed elements, we simply leverage the tf.data API. We create a TFRecordDataset by pointing it to the TFRecord file on our disk and then apply our previous parsing function to every extracted Example. This returns a dataset:
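A sketch; get_dataset_small is our name:

```python
def get_dataset_small(filename: str = "small_images.tfrecords"):
    dataset = tf.data.TFRecordDataset(filename)
    # apply the parsing function to every extracted Example
    dataset = dataset.map(parse_tfr_element)
    return dataset
```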
We can explore the content of our dataset by taking a single data point:
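For example:

```python
dataset_small = get_dataset_small()
for image, label in dataset_small.take(1):
    print(image.shape)
    print(label.shape)
```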
The output is
(250, 250, 3)
()
The first line is the shape of one image; the second line is the shape of a scalar element, which has no dimension.
This marks the end of parsing a small dataset. In the next section, we have a look at parsing a larger dataset, creating multiple TFRecord files on the way.
Image data, large

In the previous section, we wrote a fairly small dataset to a single TFRecord file. For larger datasets, we might consider sharding our data across multiple such files.
First, let’s create a random image dataset:
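A sketch, with shapes chosen to match the output below; note that this array alone takes roughly 900 MB of memory:

```python
# 500 random images of shape 400x750 with 3 colour channels
images_large = np.random.randint(low=0, high=256, size=(500, 400, 750, 3), dtype=np.int16)
```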
The corresponding labels are created in the next step:
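Again, one random label per image:

```python
labels_large = np.random.randint(low=0, high=5, size=(500, 1))
```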
Since we are dealing with a larger dataset now, we first have to determine how many shards we need. To calculate this, we need both the total number of samples and the number of samples we want to store within a single shard. Consider the case where we have, e.g., 64 images and 10 samples per shard: integer division gives 6 shards (6×10) but misses the last 4 samples. We avoid this by adding an additional shard in advance and removing it again if the division leaves no remainder – with 60 images, 60//10 leaves none:
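Sketched in code, with max_files set to 30 samples per shard to match the output below:

```python
max_files = 30  # number of samples stored per shard
n_samples = len(images_large)

splits = (n_samples // max_files) + 1  # add one shard in advance ...
if n_samples % max_files == 0:
    splits -= 1  # ... and remove it again if the division leaves no remainder
```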
In the next step, we iterate over the splits/shards. We create a new file and writer for each split, updating the name of the file accordingly. The naming follows {out_dir}{current_shard_number}_{number_total_shards}{filename}.
For each shard, we create a temporary count to keep track of all the elements we have stored within it. The index of the next {image, label} pair is calculated as split_number × max_files + current_shard_count. With 10 samples per shard, the index would go from 0 to 9 for the first shard; for the second shard, from 10 to 19, and so on. If the index equals the number of elements, we simply break the loop.
With our index ready, we can get both the image and the label from the corresponding arrays. We reuse the parse_single_image function that we wrote earlier since we have only changed the dimensions of our dataset, not the layout. In the next step, we write the returned Example object to the TFRecord file. Lastly, we increase the shard-level and the global counter; once a shard is full, we close its writer:
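A sketch of the complete sharded writer; the function name and the tqdm progress bar are our choices:

```python
from tqdm import tqdm

def write_images_to_tfr_long(images, labels, filename: str = "large_images",
                             max_files: int = 30, out_dir: str = ""):
    # determine the number of shards, as described above
    splits = (len(images) // max_files) + 1
    if len(images) % max_files == 0:
        splits -= 1
    print(f"Using {splits} shard(s) for {len(images)} files, with up to {max_files} samples per shard")

    file_count = 0
    for i in tqdm(range(splits)):
        # naming scheme: {out_dir}{current_shard_number}_{number_total_shards}{filename}
        current_shard_name = f"{out_dir}{i+1}_{splits}{filename}.tfrecords"
        writer = tf.io.TFRecordWriter(current_shard_name)

        current_shard_count = 0
        while current_shard_count < max_files:
            # index of the next sample in the arrays
            index = i * max_files + current_shard_count
            if index == len(images):
                break  # all samples have been consumed

            # reuse the Example creation from the small dataset
            out = parse_single_image(image=images[index], label=labels[index][0])
            writer.write(out.SerializeToString())

            current_shard_count += 1
            file_count += 1

        writer.close()

    print(f"Wrote {file_count} elements to TFRecord")
    return file_count
```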
As previously, we can create the TFRecord files with a single function call:
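For instance:

```python
count = write_images_to_tfr_long(images_large, labels_large, filename="large_images")
```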
The output from the print statements is
Using 17 shard(s) for 500 files, with up to 30 samples per shard
100%|██████████| 17/17 [00:03<00:00, 5.07it/s]
Wrote 500 elements to TFRecord
Akin to our small dataset, we can read the larger files back from disk. Since we have not changed the keys that we stored, we can reuse our parse_tfr_element() method. The only difference is that we now have multiple TFRecord files rather than only one. We can handle this by getting a list of all files that fit a pattern; we simply search for all TFRecord files whose name contains the string large_images:
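One way to do this is with the glob module, wrapped into a small function (get_dataset_large is our name):

```python
import glob

def get_dataset_large(tfr_dir: str = "", pattern: str = "*large_images*.tfrecords"):
    # collect all shards whose file name matches the pattern
    files = glob.glob(tfr_dir + pattern)
    # the TFRecordDataset happily accepts a list of files
    dataset = tf.data.TFRecordDataset(files)
    dataset = dataset.map(parse_tfr_element)
    return dataset
```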
We can get a dataset and query one element with the following code:
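For example:

```python
dataset_large = get_dataset_large()
for image, label in dataset_large.take(1):
    print(image.shape)
    print(label.shape)
```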
The output is as expected: one image is of shape (400, 750, 3), and the label is a scalar with empty shape:
(400, 750, 3)
()
This marks the end of parsing a larger image dataset into multiple TFRecord files and getting the data out, too. In the next section, we cover storing audio data.
Audio data

Audio is a second frequently used data type; there is a variety of large datasets available. Christopher Dossman compiled a list of more than 1 TB of audio data – and that's only some of the larger publicly available datasets.
However, we won't deal with terabytes right from the start in this section. Instead, we will focus on a smaller dataset.
Let's begin by creating it; for this, we need the librosa package. We use librosa's provided example files and assign some artificial labels. For each sample, we store the raw audio and the sampling rate:
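A sketch using librosa's bundled example files; the chosen example names and the arange-based labels are illustrative (the files are downloaded on first use):

```python
import librosa
import numpy as np

# each entry stores [raw_audio, sampling_rate]
audio_data = np.array(
    [list(librosa.load(librosa.ex(name)))
     for name in ["trumpet", "brahms", "nutcracker", "vibeace"]],
    dtype=object,
)
# one artificial label per audio sample
audio_labels = np.arange(len(audio_data)).reshape(-1, 1)
```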
As we have done previously, we write a short method that helps us get the data into the TFRecord file. Since we packed the audio data and the sampling rate into a common array, we simply have to query the entries: the first entry is the audio data, and the second entry holds the sampling rate.
With this at hand, we then create and return an Example object. There is nothing completely new here; the approach is similar to before:
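A sketch (parse_single_audio is our name); we also store the sampling rate, even though we won't use it when reading the data back below:

```python
def parse_single_audio(audio, label):
    raw_audio = audio[0]      # first entry: the audio data
    sampling_rate = audio[1]  # second entry: the sampling rate

    data = {
        'audio': _bytes_feature(serialize_array(raw_audio)),
        'sampling_rate': _int64_feature(int(sampling_rate)),
        'label': _int64_feature(int(label)),
    }
    return tf.train.Example(features=tf.train.Features(feature=data))
```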
The previous function returns a single sample, ready to be written to TFRecord. The next function iterates over all samples and writes them to disk:
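Sketched analogously to the image case:

```python
def write_audio_to_tfr(audio_data, labels, filename: str = "audio"):
    filename = filename + ".tfrecords"
    writer = tf.io.TFRecordWriter(filename)
    count = 0

    for index in range(len(audio_data)):
        out = parse_single_audio(audio=audio_data[index], label=labels[index][0])
        writer.write(out.SerializeToString())
        count += 1

    writer.close()
    print(f"Wrote {count} elements to TFRecord")
    return count
```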
We then write our complete audio dataset to disk with a single call:
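For instance:

```python
count = write_audio_to_tfr(audio_data, audio_labels, filename="audio")
```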
Having it stored in a file, we can proceed as before: We write a function that inverts the procedure of writing an Example to TFRecord, reading it instead. This closely follows the functions that we used to parse images; only the keys have different names:
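A sketch of this reading function; since librosa loads audio as float32, we parse the tensor back with that type:

```python
def parse_tfr_audio_element(element):
    data = {
        'audio': tf.io.FixedLenFeature([], tf.string),
        'sampling_rate': tf.io.FixedLenFeature([], tf.int64),
        'label': tf.io.FixedLenFeature([], tf.int64),
    }
    content = tf.io.parse_single_example(element, data)

    # librosa loads audio as float32, so that's our out_type here
    audio = tf.io.parse_tensor(content['audio'], out_type=tf.float32)
    return (audio, content['label'])
```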
As before, to create a dataset, we simply apply this parsing function to every element in the TFRecord file:
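Analogous to the image datasets:

```python
def get_audio_dataset(filename: str = "audio.tfrecords"):
    dataset = tf.data.TFRecordDataset(filename)
    dataset = dataset.map(parse_tfr_audio_element)
    return dataset
```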
To query our dataset, we then call this function and inspect the first element:
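For example:

```python
audio_dataset = get_audio_dataset()
for audio, label in audio_dataset.take(1):
    print(audio.shape)
    print(label)
```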
The output is
(117601,)
tf.Tensor(0, shape=(), dtype=int64)
The first entry is the shape of an audio file; the second entry is the corresponding label.
That marks the end of working with audio data and TFRecord files. In the next section, we have a look at handling text data.
Text data

As the last large domain, we have text data. Considering the success of NLP research in the last three or four years – the Transformer [3], GPT [4], … – that's no wonder.
We begin by creating a dummy dataset:
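A sketch; we simply repeat two sentences to get ten {text, label} pairs, matching the output below:

```python
text_data = np.array([
    "Hey, this is a sample text. We can use many different symbols.",
    "A point is exactly what the folks think of it; after Gauss.",
] * 5)
text_labels = np.array([0, 1] * 5)
```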
We can then query the first five elements of this dataset:
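For example:

```python
print(text_data[:5].tolist())
```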
This gives us
['Hey, this is a sample text. We can use many different symbols.',
'A point is exactly what the folks think of it; after Gauss.',
'Hey, this is a sample text. We can use many different symbols.',
'A point is exactly what the folks think of it; after Gauss.',
'Hey, this is a sample text. We can use many different symbols.']
Now we write a function that creates an Example object from our text data. The procedure is as before: We store the non-scalar data – the text – as a bytes feature and the label as a scalar:
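A sketch of this function (parse_single_text is our name):

```python
def parse_single_text(text, label):
    data = {
        # the text -- non-scalar data -- is serialized and stored as a bytes feature
        'text': _bytes_feature(serialize_array(text)),
        # the label is stored as a scalar
        'label': _int64_feature(int(label)),
    }
    return tf.train.Example(features=tf.train.Features(feature=data))
```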
With the next function, we iterate over the text dataset and the labels and write them to a single TFRecord file:
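Following the familiar pattern:

```python
def write_text_to_tfr(texts, labels, filename: str = "text"):
    filename = filename + ".tfrecords"
    writer = tf.io.TFRecordWriter(filename)
    count = 0

    for index in range(len(texts)):
        out = parse_single_text(text=texts[index], label=labels[index])
        writer.write(out.SerializeToString())
        count += 1

    writer.close()
    print(f"Wrote {count} elements to TFRecord")
    return count
```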
Getting our artificial text dataset to disk is then a single call:
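Namely:

```python
count = write_text_to_tfr(text_data, text_labels, filename="text_data")
```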
With the next function, we reverse this process, getting the data out of the TFRecord file. A notable difference is that we want our feature – the text data – to be of type string; we therefore set the out_type argument to tf.string:
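A sketch:

```python
def parse_tfr_text_element(element):
    data = {
        'text': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    }
    content = tf.io.parse_single_example(element, data)

    # the text comes back as a serialized tensor of type tf.string
    text = tf.io.parse_tensor(content['text'], out_type=tf.string)
    return (text, content['label'])
```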
As before, we map every single element to this function:
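Sketched as before:

```python
def get_text_dataset(filename: str = "text_data.tfrecords"):
    dataset = tf.data.TFRecordDataset(filename)
    dataset = dataset.map(parse_tfr_text_element)
    return dataset
```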
We then get a dataset and inspect the first two elements:
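For example:

```python
text_dataset = get_text_dataset()
for text, label in text_dataset.take(2):
    print(text.numpy())
    print(label)
```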
The output is
b'Hey, this is a sample text. We can use many different symbols.'
tf.Tensor(0, shape=(), dtype=int64)
b'A point is exactly what the folks think of it; after Gauss.'
tf.Tensor(1, shape=(), dtype=int64)
This marks the end of writing and reading text data in the context of TFRecord files. In the next section, we will combine all previous domains.
Multiple data types
We have examined single domains so far, but of course nothing stops us from combining multiple domains! For the following, consider this outline:
We have multiple images:
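For instance (the number of samples is illustrative; the shape matches the output below):

```python
combined_images = np.random.randint(low=0, high=256, size=(4, 256, 256, 3), dtype=np.int16)
```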
Secondly, we have a short description of each image, capturing the scenery that it shows:
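Sketched with a single repeated description and label, matching the output below:

```python
combined_texts = np.array([
    "This image shows a house on a cliff. The house is painted in red and brown tones.",
] * 4)
combined_text_labels = np.array([3] * 4)
```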
Lastly, we also have an auditive description of the scenery. We reuse the dummy audio data from above:
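Assuming the arrays from the audio section are still around:

```python
combined_audio = audio_data[:4]
combined_audio_labels = audio_labels[:4]
```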
Now, let's combine them into the TFRecord files. We write a function that takes these data types and returns an Example object. This is a further benefit of the TFRecord format: even though we are dealing with multiple data types, we can store everything together in one object:
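A sketch of such a parse_single_combined function (the name is ours), merging the dictionaries we used before:

```python
def parse_single_combined(image, text, text_label, audio, audio_label):
    raw_audio = audio[0]  # the entry again holds [raw_audio, sampling_rate]

    # one Example can hold features of several data types at once
    data = {
        'height': _int64_feature(image.shape[0]),
        'width': _int64_feature(image.shape[1]),
        'depth': _int64_feature(image.shape[2]),
        'raw_image': _bytes_feature(serialize_array(image)),
        'text': _bytes_feature(serialize_array(text)),
        'text_label': _int64_feature(int(text_label)),
        'audio': _bytes_feature(serialize_array(raw_audio)),
        'audio_label': _int64_feature(int(audio_label)),
    }
    return tf.train.Example(features=tf.train.Features(feature=data))
```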
As previously, we iterate over all data samples and write them to disk:
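The writer follows the same pattern as before:

```python
def write_combined_to_tfr(images, texts, text_labels, audio_data, audio_labels,
                          filename: str = "combined"):
    filename = filename + ".tfrecords"
    writer = tf.io.TFRecordWriter(filename)
    count = 0

    for index in range(len(images)):
        out = parse_single_combined(
            image=images[index],
            text=texts[index],
            text_label=text_labels[index],
            audio=audio_data[index],
            audio_label=audio_labels[index][0],
        )
        writer.write(out.SerializeToString())
        count += 1

    writer.close()
    print(f"Wrote {count} elements to TFRecord")
    return count
```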
Creating a TFRecord file is only a single function call:
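Namely:

```python
count = write_combined_to_tfr(combined_images, combined_texts, combined_text_labels,
                              combined_audio, combined_audio_labels, filename="combined_data")
```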
Now that we have written such Examples to disk, we read them back by extracting the features. The key difference to the previous sections is that we now have multiple features – text, image, and audio data – so we have to parse them separately:
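A sketch, parsing each feature back with its own dtype:

```python
def parse_tfr_combined_element(element):
    data = {
        'height': tf.io.FixedLenFeature([], tf.int64),
        'width': tf.io.FixedLenFeature([], tf.int64),
        'depth': tf.io.FixedLenFeature([], tf.int64),
        'raw_image': tf.io.FixedLenFeature([], tf.string),
        'text': tf.io.FixedLenFeature([], tf.string),
        'text_label': tf.io.FixedLenFeature([], tf.int64),
        'audio': tf.io.FixedLenFeature([], tf.string),
        'audio_label': tf.io.FixedLenFeature([], tf.int64),
    }
    content = tf.io.parse_single_example(element, data)

    image = tf.io.parse_tensor(content['raw_image'], out_type=tf.int16)
    image = tf.reshape(image, shape=[content['height'], content['width'], content['depth']])
    text = tf.io.parse_tensor(content['text'], out_type=tf.string)
    audio = tf.io.parse_tensor(content['audio'], out_type=tf.float32)

    return (image, text, content['text_label'], audio, content['audio_label'])
```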
The code to get a combined dataset is then quite simple:
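In a sketch:

```python
def get_combined_dataset(filename: str = "combined_data.tfrecords"):
    dataset = tf.data.TFRecordDataset(filename)
    dataset = dataset.map(parse_tfr_combined_element)
    return dataset
```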
Let’s have a look at the first element in the dataset:
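For example:

```python
combined_dataset = get_combined_dataset()
for sample in combined_dataset.take(1):
    print(sample)
```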
The output is
(<tf.Tensor: shape=(256, 256, 3), dtype=int16, numpy=
array([[[160, 224, 213],
...
[189, 253, 65]]], dtype=int16)>,
<tf.Tensor: shape=(), dtype=string, numpy=b'This image shows a house on a cliff. The house is painted in red and brown tones.'>,
<tf.Tensor: shape=(), dtype=int64, numpy=3>,
<tf.Tensor: shape=(117601,), dtype=float32, numpy=
array([-1.4068224e-03, -4.4607223e-04, -4.1098078e-04, ...,
7.9623060e-06, -3.0417003e-05, 1.2765067e-05], dtype=float32)>,
<tf.Tensor: shape=(), dtype=int64, numpy=0>)
The first element is the image, the second element is the textual description of this image, and the third element is the text’s label. The last two elements are the audio data and the label for the audio data.
That marks the end of the section on writing multiple data types to TFRecord files.
Summary
We covered writing image, audio, and text data to TFRecord files. We also covered reading this data back.
Regardless of the actual content, the procedure is always as follows:
- Define a dictionary for the data that gets stored in the TFRecord file
- Reconstruct the data by replicating this dictionary when parsing the data
- Map every element to the parsing function
Slight modifications are only required when you are dealing with large datasets. In this case, you have to write your data to multiple TFRecord files, which we have covered in the section on dealing with large image data.
A Colab notebook with all the code is available here.
If you are interested in seeing this file format in action, you can read my post on classifying custom audio data. There, I use TFRecord files to store my dataset and train a neural network directly on them.
Literature
[1] Y. LeCun et al., Gradient-based learning applied to document recognition (1998), Proceedings of the IEEE
[2] J. Deng et al., ImageNet: A large-scale hierarchical image database (2009), IEEE Conference on Computer Vision and Pattern Recognition
[3] A. Vaswani et al., Attention is all you need (2017), NIPS
[4] A. Radford et al., Improving language understanding by generative pre-training (2018), OpenAI