Feeding the Beast: The Data Loading Path for Deep Learning Training

Optimize your deep learning training process by understanding and tuning data loading from disk to GPU memory

Assaf Pinhasi
Towards Data Science



Deep learning experimentation speed is important for delivering high-quality solutions on time.

The data loading path, i.e. getting training examples from storage into the GPU, is an often-overlooked area, despite having a major impact on experimentation speed.

In this post, I will introduce the components of the data loading path, show how to formalize their performance goals, and discuss how to design a solution that meets those goals.

Working backwards — from the GPU to the storage

Rationale — Keeping the GPU busy

GPUs are expensive to buy or lease, so during training you want to keep them at 100% utilization, crunching those matrices and updating weights.

GPUs work most efficiently on mini-batches — i.e. applying the network’s operators to a number of training examples at the same time, represented as a single multi-dimensional tensor. From a pure computational-efficiency perspective, bigger batches are better (the main limitation being GPU memory).

To keep the GPUs at high utilization, we need to make sure that by the end of a training step on a mini-batch (forward + backward pass), the next mini-batch is ready to be transferred into the GPU’s memory.

Drops in GPU utilization indicate data loading is too slow. Image by the author
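Before tuning anything, it’s worth measuring whether data loading is actually the bottleneck. Here is a minimal sketch of one way to do this in a PyTorch-style training loop; model, loader, optimizer and loss_fn are placeholders for your own objects.

    # Roughly split each training step into "waiting for data" vs. "compute".
    import time
    import torch

    def profile_epoch(model, loader, optimizer, loss_fn, device="cuda"):
        data_time, compute_time = 0.0, 0.0
        end = time.perf_counter()
        for inputs, targets in loader:
            start = time.perf_counter()
            data_time += start - end             # time spent waiting for the next mini-batch
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
            torch.cuda.synchronize()             # include the asynchronous GPU work in the timing
            end = time.perf_counter()
            compute_time += end - start
        print(f"waiting for data: {data_time:.1f}s, compute: {compute_time:.1f}s")

If the "waiting for data" share is significant, the rest of this post is for you.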

Data loading performance requirements (for a single GPU)

Define:

n = mini-batch size
t = mini-batch GPU processing time

In a typical training regime, these values are fixed for the entire training process.

Goal:

99% of mini-batches should load n examples in t seconds or less.
A simpler pair of bounds to measure can be:

  • max latency (single example) ≤ t
  • min throughput (examples/s) ≥ n/t
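As a tiny, illustrative helper (the numbers in the example call are made up), these two targets follow directly from n and t:

    # Turn the training parameters (n, t) into data-loading targets.
    def data_loading_targets(n, t):
        """n = mini-batch size, t = GPU processing time per mini-batch (seconds)."""
        return {
            "max_single_example_latency_s": t,        # each example must arrive within t
            "min_throughput_examples_per_s": n / t,   # n examples every t seconds
        }

    print(data_loading_targets(n=32, t=0.05))
    # {'max_single_example_latency_s': 0.05, 'min_throughput_examples_per_s': 640.0}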

What impacts n and t?

First and foremost, n is a hyperparameter that you need to tune so that your network actually converges and learns what you want it to learn.

Typically you will choose lower batch sizes (e.g. 16/32) if you have a small amount of training data, and larger values if you have a lot of training data.

Choosing the right batch size causes the network to converge faster. Image by author

t is a function of the amount of computation (FLOPs) the GPU needs to perform on a mini-batch; it is dependent on the GPU model, network complexity and n.

Lastly, n is capped by the amount of available GPU memory. The memory needs to hold the state of the network for the entire mini-batch.

The memory footprint for the network state grows linearly with the size of the batch, as it contains per-example activations and feature maps needed for back-propagation.

Confused? Let’s look at some examples.

ImageNet on ResNet 50

Resnet architecture. Source: kaggle.com
  • Data representation — 224×224×3, single precision (32-bit float)
  • Computation for a single example — 4 GFLOPs (~4 billion floating-point operations)
  • GPU model — V100 with 16 GB of memory, capable of ~7 TFLOPS (~7 trillion floating-point operations per second)
  • Size of a single example’s state in GPU memory: 103 MB
  • Size of model params in memory: 98 MB (~25 million trainable params)

Analysis:

  • A V100 is capable of ~7 TFLOPS.
    As a result, it should handle 7 TFLOPS / 4 GFLOPs = 1750 examples/s.
    Note: the number published by NVIDIA is 1457, probably due to the overhead of getting examples from CPU into GPU memory.
  • Memory footprint for n = 128: 103 MB × 128 + 98 MB = 12.97 GB.
    This means that n = 256 would not fit in the GPU memory.
  • Result: n = 128, t = 128/1457 ≈ 0.087s
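For transparency, here is the same back-of-the-envelope arithmetic as a short script; every constant is simply one of the estimates listed above.

    # Reproducing the ResNet 50 / V100 estimates above.
    gpu_flops_per_s     = 7e12    # V100, ~7 TFLOPS
    flops_per_example   = 4e9     # ~4 GFLOPs per example
    peak_examples_s     = gpu_flops_per_s / flops_per_example   # 1750 examples/s
    measured_examples_s = 1457    # NVIDIA's published number

    example_state_mb = 103
    model_params_mb  = 98
    n = 128
    memory_gb = (example_state_mb * n + model_params_mb) / 1024  # ~12.97 GB, fits in 16 GB
    t = n / measured_examples_s                                  # ~0.0878 s per mini-batch
    print(round(peak_examples_s), round(memory_gb, 2), round(t, 3))  # 1750 12.97 0.088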

It follows that to train ImageNet on a V100 with a ResNet 50 network, we require the data loading path to provide the following:

Max latency for a single image (t): ≤ 87 milliseconds

Read throughput (n/t): ~1457 images/second

ImageNet on ResNet 152

To give you an idea about the performance impact of changing the network architecture, here is the same experiment repeated with Resnet 152:

  • Size of a single example’s state in memory: 219 MB (about double that of ResNet 50)
  • Size of model params in memory: 209 MB (~50 million trainable params)
  • Single example processing cost — 8 GFLOPs
  • n = 64 (64 × 219 MB + 209 MB ≈ 13.9 GB)
  • t = 0.087s (roughly the same total amount of computation per batch)

Max latency for a single image (t): ≤ 87 milliseconds

Read throughput (n/t): ~730 images/second

So, a “small” architecture change can lead to a dramatic change in data loading performance requirements.

Rules of thumb for guesstimating data loading needs

Small data
If you have little data, you will typically use a smaller-capacity network, or fine-tune a large network for relatively few epochs.

To make the most out of each training example, you want smaller n values.

In this case:

  • t will be very short (tens of millis)
  • n will be small (e.g. 16/32)
  • throughput (n/t) can still be high

Large data

If you have a lot of data, you typically use a network with larger capacity and train it for more epochs. In this case you want a large n: the network can handle it without hurting convergence, and with a small n training would take forever.

While you want a high n, you’re faced with a problem — the size of each example’s state in GPU memory will be large (high-complexity network) — which limits n (and prolongs t).

To address this, common solutions are to:

  • Use distributed training, where the overall large batch is processed using multiple GPUs (same machine/cluster)
  • Reduce the memory footprint of the network (e.g. mixed-precision — see the sketch after this list)

As a result:

  • t can be relatively relaxed (say 2 seconds)
  • n can be very large (say 1024)
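As an illustration of the second option, here is a minimal mixed-precision sketch using PyTorch’s AMP utilities; model, loader, optimizer and loss_fn are placeholders for your own objects, and the actual memory savings depend on the network.

    # Mixed-precision training loop sketch (PyTorch AMP). Running the forward
    # pass in float16 where safe roughly halves the activation memory,
    # leaving headroom for a larger n.
    import torch

    scaler = torch.cuda.amp.GradScaler()
    for inputs, targets in loader:
        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(inputs), targets)
        scaler.scale(loss).backward()   # loss scaling avoids float16 underflow
        scaler.step(optimizer)
        scaler.update()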

Breaking down the data loading process

Since the GPU must be kept busy doing the learning itself, an obvious choice would be to load and prepare the data using the CPUs — which are attached to every machine anyway and have almost nothing to do otherwise.

Loading a mini-batch using CPU workers, while the previous batch is being processed on the GPU. Image by author

Preparing the mini-batch for the GPU includes the following steps:

  1. Deciding which examples need to be loaded (typically by shuffling the dataset)
  2. Loading the examples from storage (I/O)
  3. Transforming them, e.g. pre-processing or augmentation (CPU)
  4. Storing them in RAM (CPU)
  5. Transferring the tensors into the GPU memory (CPU → GPU)
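Steps 1–4 map naturally onto the dataset abstractions of common frameworks. Here is a minimal sketch using PyTorch and torchvision; the flat directory of .jpg files and the absence of labels are simplifications for illustration.

    # A minimal dataset covering steps 2-4 (reading, transforming, holding in RAM).
    # Step 1 (shuffling) and step 5 (transfer to the GPU) happen outside this class.
    from pathlib import Path
    from PIL import Image
    import torch
    from torchvision import transforms

    class JpegDataset(torch.utils.data.Dataset):
        def __init__(self, root):
            self.paths = sorted(Path(root).glob("*.jpg"))
            self.transform = transforms.Compose([
                transforms.RandomResizedCrop(224),   # step 3: augmentation
                transforms.ToTensor(),               # step 3: pixels -> float tensor
            ])

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            img = Image.open(self.paths[idx]).convert("RGB")   # step 2: read + decode
            return self.transform(img)                         # steps 3-4: CPU work, result held in RAM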

Using parallelism to achieve throughput
A large amount of I/O, medium-to-high latency per example, and strong machines with many idle cores all suggest it’s wise to use parallelism in data loading.

To further reduce sensitivity to spikes in latency, idle workers can eagerly load examples from the next batch ahead of time (taking care not to blow up the RAM…), as shown in the sketch below.
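In PyTorch, for example, both the parallelism and the eager loading are exposed as DataLoader arguments; the values and the data path below are purely illustrative.

    # Parallel loading + prefetching with CPU worker processes.
    from torch.utils.data import DataLoader

    loader = DataLoader(
        JpegDataset("/data/train"),   # the dataset sketched above; the path is hypothetical
        batch_size=128,
        shuffle=True,        # step 1: shuffle which examples form each mini-batch
        num_workers=8,       # CPU worker processes loading and transforming in parallel
        prefetch_factor=2,   # batches each worker loads ahead of time (bounds RAM usage)
        pin_memory=True,     # speeds up the later RAM -> GPU copy (step 5)
    )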

We will now drill deeper into the main parts of the data loading, working backwards from CPU processing of examples to the actual IO for reading data off of the storage.

For this post, we will not cover step (5), moving data from RAM into the GPU — while it is also important, it has its own set of optimizations (PCIe vs. NVLink, pinned memory, etc.).

Stage 3 — Raw input examples to training examples

The GPU expects to receive a tensor for its mini-batch.
However, often what we’ve loaded from storage is not a tensor, but some other binary representation of the data, e.g. a .jpg file.

In the simple case, transforming the raw input example into a training example is as simple as decoding a .jpg into pixels. In other cases, we do various transformations (e.g. downsampling) and augmentations.

The transformations may even involve a many-to-many relationship between the raw input examples and the training examples. The sky is the limit, but of course, all of this costs us time on the CPU.

There is also a memory footprint that needs to be considered, especially if you cannot downsample the data while still in its compressed form, but we’ll leave this for now.

source: albumentations project (MIT License)

In case your processing is heavy, you can optimize it by doing the processing ahead of time in batch and storing the processed examples on disk, thus reducing the CPU load and maybe even the file size on disk.
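A minimal sketch of this idea, assuming image data and hypothetical directory paths: decode and downsample once, up front, and save the resulting tensors for fast loading at training time.

    # One-off offline preprocessing: pay the heavy CPU cost once instead of every epoch.
    from pathlib import Path
    from PIL import Image
    import torch
    from torchvision import transforms

    preprocess = transforms.Compose([
        transforms.Resize(256),   # e.g. downsample once, ahead of time
        transforms.ToTensor(),
    ])

    src, dst = Path("/data/raw"), Path("/data/processed")
    dst.mkdir(parents=True, exist_ok=True)
    for path in src.glob("*.jpg"):
        tensor = preprocess(Image.open(path).convert("RGB"))
        torch.save(tensor, dst / (path.stem + ".pt"))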

Stage 2 — Loading examples from storage

In the majority of cases, I/O is the largest cost in data loading.
The main storage parameters we need to pay attention to are:

  • The total size of storage in GB — based on our training-set size
  • Read IOPS = n/t (assuming CPU processing time is negligible, and one read per example)
  • Block size — the typical example size on disk in MB, denoted es
  • Read bandwidth (MB/s) = es × n/t
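These are easy to compute up front. A tiny, illustrative helper (the function name is mine, and it assumes one read per example):

    # Derive the storage requirements from batch size, step time and example size.
    def storage_requirements(n, t, example_size_mb):
        iops = n / t                  # one read per example
        return {
            "read_iops": round(iops),
            "block_size_mb": example_size_mb,
            "read_bandwidth_mb_s": round(iops * example_size_mb),
        }

    # e.g. the four-V100 ImageNet example below: n = 512, t = 0.087 s, es = 64 KB
    print(storage_requirements(n=512, t=0.087, example_size_mb=0.0625))
    # {'read_iops': 5885, 'block_size_mb': 0.0625, 'read_bandwidth_mb_s': 368}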

Note: in case we are thinking of using shared storage for multiple concurrent experiments:

  • For the total size of training data (in GB) — use the union of all datasets you plan to train on
  • Select a challenging combination of t, n and es from one of the training workloads you are planning
  • Use n = (# concurrent experiments) × (typical batch size)

Example — four V100s training ImageNet with ResNet 50

  • Total dataset size = ~150 GB
  • n = 128 × 4 = 512
  • t = 0.087s
  • es = a 469×387-pixel .jpg, say 64 KB

Result:

  • IOPS = 512/0.087 = 5885
  • Block size = 64 KB
  • Read bandwidth = 367 MB/s

These requirements fit a modern SSD storage device.
To achieve similar performance on spinning disks, we would need to spread the load on ~45 disks concurrently.

IOPS and block size for modern spinning disks. source: wikipedia.com

When you’re a single researcher training a network on a stable dataset of a few tens of GB, local SSD drives are your friend. They are not expensive, you can get the data onto them in a reasonable time, and you get great performance without needing to think much about I/O.

Scaling the storage with your training operations

What happens when you have a team of 15 engineers and researchers, running tens of experiments concurrently on a training cluster and going through terabytes of training data per month?

Copying datasets to an attached storage device (e.g. a local disk or block device) before training is not really an option:

  • Either the data is too big, or it takes too long to copy
  • If you’re training on a cluster, you need to be able to “move” between machines easily, which attached storage does not allow

Larger DL training operations require a remote storage solution which is “mounted” or accessed using an API from the training machines.

Typical solutions include scale-out NAS storage with a combination of HDD and SSD. These solutions offer parallel reading at high throughput and in many cases, acceptable latency.

Decent NFS can work with larger files and relaxed latency

Let’s take a look at a theoretical DL shop’s storage needs:

  • 5–10 TB of data (only data that is in active use)
  • 5–10 concurrent experiments
  • 4 GPUs per experiment
  • n = 64 per GPU
  • es = 500 KB
  • t = 2s (heavier network)

These will require:

  • IOPS — 640–1280
  • Read bandwidth — ~320–640 MB/s

In the cloud, this type of performance is achievable with managed NFS offerings, e.g. GCP Filestore or AWS EFS.

Small inputs / short latency is an expensive problem

There is a proverbial storage problem known as “the small file problem”. Essentially, it means that doing fast random access and returning small blocks is expensive.

Here is an example:

  • 5–10 TB of data (only data that is in active use)
  • 5–10 concurrent experiments
  • 4 GPUs per experiment
  • n = 256 per GPU
  • es = 64 KB
  • t = 0.2s

This will require:

  • IOPS = 25600–51200
  • Read bandwidth = 1.6–3.2 GB/s

Loading many small files very fast is problematic on standard NFS solutions — you either need to spread your data across a very large number of drives, or go for expensive SSD arrays and other high-end solutions.

Understanding the workload characteristics

From a storage perspective, data loading for training has quite a unique set of characteristics:

  • The size of a single “raw input example” can vary wildly depending on the modality of the data — from tens of KB to many MB per raw input example
  • Data is read, processed and written in “bulk” once, then read many times
  • Reads are relatively tolerant of latency spikes, and more concerned with throughput
  • Data is accessed at “random”, since the dataset is shuffled and examples land in different mini-batches each epoch
  • Once a single example is read, it’s almost guaranteed to be read again within a relatively short timeframe (once per epoch)

These can be leveraged to optimize the storage in creative ways.

Leveraging load characteristics to optimize storage

To optimize the storage solution, consider using some or all of the following:

First — use a compact representation format, which reduces the single-example size as much as possible. There is no need to support high bandwidth if the data is going to be downsampled a second later!

Second — you can use a local, partial caching solution (e.g. on SSD), exploiting the guarantee of a cache hit in future epochs. Even lazily caching a random 20% of the dataset this way can give you a speedup without much effort.
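As a rough illustration of what such lazy, partial caching could look like, here is a sketch of a wrapper around an existing dataset; the class name, the cache directory and the cap on cached items are hypothetical, and a real implementation would need to coordinate between DataLoader worker processes more carefully.

    # Lazily cache examples to a local SSD; later epochs hit the local copy.
    from pathlib import Path
    import torch

    class SSDCachedDataset(torch.utils.data.Dataset):
        def __init__(self, dataset, cache_dir, max_cached=50_000):
            self.dataset = dataset              # the slow, remote-backed dataset
            self.cache_dir = Path(cache_dir)
            self.cache_dir.mkdir(parents=True, exist_ok=True)
            self.max_cached = max_cached        # keep the cache within the local SSD budget
            self._num_cached = 0                # per-process approximation of cache size

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, idx):
            path = self.cache_dir / f"{idx}.pt"
            if path.exists():
                return torch.load(path)         # fast path: local SSD hit
            example = self.dataset[idx]         # slow path: remote storage
            if self._num_cached < self.max_cached:
                torch.save(example, path)       # cache lazily for future epochs
                self._num_cached += 1
            return example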

Lastly — try to move away from very short latency / many random IOPS towards longer latency and sequential reads.

The extreme case discussed above (very short latency, many small random IOPS) should frankly be avoided — e.g. by using weaker GPUs, or by sharing GPUs between different training jobs.

But even for more moderate cases — what if we could batch examples together on disk? Instead of reading n small files and doing many seeks, we would read one larger file sequentially.

This is an attractive idea; the challenge is to support random shuffle per epoch — but it turns out this can be solved if you try hard enough…

I’ll try and cover these in a separate post in the near future.

Summary

Keeping the GPU at full utilization requires understanding and optimizing the data loading path — from disk to GPU memory.

The specifics of doing so depend on various parameters, such as the size of your data, the compute resources you have, and the networks you are training.

In this post, I’ve tried to cover the basics of data loading flow and optimization, and offer some solutions.

Stay tuned for the next instalment, which will focus on storage optimization techniques.
