Diving into DALI: How to Use NVIDIA’s GPU-Optimized Image Augmentation Library

James Dellinger
Towards Data Science
12 min read · Jun 18, 2019


Salvador Dalí. The Persistence of Memory. Credit: The Museum of Modern Art

Deep learning image augmentation pipelines typically offer speed or flexibility, but never both at the same time. Computationally efficient, production-ready computer vision pipelines tend to be written in C++ and require developers to specify all the nuts and bolts of image transform algorithms to such an extent that these pipelines end up not terribly amenable to further on-the-fly tweaking. On the other end of the spectrum, popular Python libraries like Pillow offer high-level APIs that let practitioners choose from seemingly unlimited combinations of tweaks that can be applied to a vast repository of image transform algorithms. Unfortunately, this freedom carries with it the cost of a steep drop-off in performance.

The DALI Library attempts to give practitioners the best of both worlds. Its image transform algorithms are themselves written in C++ code that squeezes every last drop of performance out of NVIDIA GPU chips, making it possible to perform image transforms in parallel on a per-batch basis, across however many GPUs a user has access to. The C++ source code is wired up to a user-friendly Python API, through which practitioners can define image transform pipelines that play nice with both the PyTorch and TensorFlow frameworks.

In an attempt to ascertain whether DALI indeed delivers both the speed and flexibility it advertises, I spent the better part of one week running a series of my own experiments with the library. Spoiler alert: while DALI absolutely brings the speed, flexibility is still somewhat lacking.

DALI’s Promise

Nonetheless, taking the time to get acquainted with DALI is absolutely worthwhile. The benefits of doing image augmentation on the GPU are self-evident, and my DALI image pipeline ran much faster than any other comparable image augmentation pipeline I’ve ever written.

Moreover, I just completed the most recent offering of part II of the fast.ai deep learning course, and attempted to build a DALI pipeline that is compatible with the new and improved version of the fastai library that we built from scratch as part of our coursework. This turned out to be a meaningful exercise because, according to core fastai developer Sylvain Gugger, the soon-to-be-released version 2 of the official fastai library will contain many paradigms that were introduced in our class, such as a training loop with a far more flexible callback integration.

Over the next few paragraphs I’ll walk through the ABCs of building DALI pipelines, and point out how to connect them to fastai’s v2.0 training loop. You’ll get to see just how fast DALI runs (it’s really impressive), as well as one really weird workaround I came up with in an attempt to get around a striking shortcoming of the DALI library.

Setting the Stage

My DALI augmentation pipeline includes the random crop & resize, flip, perspective warp, and rotation transforms that I learned to implement from scratch in the 2019 fast.ai part II course. To set a baseline and gauge whether each DALI transform helps improve results, I created a simple, four-layer CNN model whose task was to perform image classification on the Imagenette dataset. Imagenette was created by Jeremy Howard as a much slimmed-down subset of ImageNet that lets practitioners get a feel for how their model would perform if trained on ImageNet, without actually having to train on all of ImageNet from scratch. I like to use Imagenette for quick sanity checks during early iterations on prototypes, and it's become an indispensable part of my experiments.

How to Build a DALI Pipeline

The backbone of all DALI pipelines is a Python class called Pipeline. I decided to create specialized pipeline classes for my training and validation data that each inherit from this class. To create each class I had to define two methods. The first method, __init__(), is the place to specify the hyperparameters of each and every image transform operation. In addition to augmentations like image rotations and flips, these operations can also include initial image loading, resizing, normalizing, tensor reshaping, and data type casting.

The second method, define_graph(), is where you define the order in which your pipeline's image transforms will be executed. This method is also where you call DALI's random number generators so that you can pass their outputs as arguments to image transform operations that support randomized augmentations. define_graph() returns a tuple containing the transformed images and their corresponding labels.
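
To make the structure concrete, here's a minimal sketch of what a pipeline subclass looks like. The class name, data path, and resize dimensions are my own placeholders, and note that older DALI releases name the decoder ops.nvJPEGDecoder rather than ops.ImageDecoder:

import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

class SimpleTrainPipeline(Pipeline):
    def __init__(self, data_dir, batch_size=64, num_threads=4, device_id=0):
        super().__init__(batch_size, num_threads, device_id)
        # Configure every op, and its hyperparameters, up front.
        self.input = ops.FileReader(file_root=data_dir, random_shuffle=True)
        self.decode = ops.ImageDecoder(device='mixed', output_type=types.RGB)
        self.resize = ops.Resize(device='gpu', resize_x=128., resize_y=128.)

    def define_graph(self):
        # Call the ops in the order the pipeline should execute them.
        jpegs, labels = self.input()
        images = self.decode(jpegs)
        images = self.resize(images)
        return images, labels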

Example: Including a Rotation Transform

Here's how I used DALI's ops.Rotate function to add randomly generated image rotations to my training pipeline:

Step 1

In my pipeline class’ __init__() method I created variables for the rotation operation, ops.Rotate, as well as for two random number generators. The first random number generator, ops.Uniform, will produce a list that’s as long as my batch size. This list will contain floats that specify the angles (in degrees) by which ops.Rotate will rotate the batch's images. Each angle is randomly chosen from a uniform distribution that spans the range [-7, 7]. The second random number generator, ops.CoinFlip, will create a list containing ones and zeros that is also as long as the batch size. The ones appear at random indices with an overall frequency of 7.5%. Passing this list to the rotation transform will ensure that any image in a batch will have a 7.5% chance of getting rotated:

self.rotate = ops.Rotate(device='gpu', interp_type=types.INTERP_NN)
self.rotate_range = ops.Uniform(range=(-7, 7))
self.rotate_coin = ops.CoinFlip(probability=0.075)

Step 2

Inside the define_graph() method is where I actually call the ops.Uniform and ops.CoinFlip random number generators to create fresh sets of random numbers for each batch:

angle_range = self.rotate_range()
prob_rotate = self.rotate_coin()

Still inside define_graph(), I call ops.Rotate at the point in my pipeline where I'm ready to perform image rotations, passing the two lists of random numbers above to its angle and mask arguments, respectively:

images = self.rotate(images, angle=angle_range, mask=prob_rotate)

DALI will now rotate about 7.5% of the images in each training batch, each by an angle between -7 and 7 degrees. All of a batch's rotations happen on the GPU in parallel!

Here are my training and validation pipeline classes, in their entirety:

DALI Imagenette Train & Val Pipelines
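
The full classes live in the gist above. Reusing the imports from the earlier sketch, a condensed version of the training pipeline (with placeholder crop size and ImageNet normalization stats, and omitting several of the twelve ops) looks roughly like this:

class ImagenetteTrainPipeline(Pipeline):
    def __init__(self, data_dir='imagenette/train', batch_size=64,
                 num_threads=4, device_id=0):
        super().__init__(batch_size, num_threads, device_id, seed=42)
        self.input = ops.FileReader(file_root=data_dir, random_shuffle=True)
        self.decode = ops.ImageDecoder(device='mixed', output_type=types.RGB)
        self.rrc = ops.RandomResizedCrop(device='gpu', size=(128, 128))
        self.rotate = ops.Rotate(device='gpu', interp_type=types.INTERP_NN)
        self.rotate_range = ops.Uniform(range=(-7, 7))
        self.rotate_coin = ops.CoinFlip(probability=0.075)
        self.flip_coin = ops.CoinFlip(probability=0.5)
        # Normalize, randomly mirror, and emit NCHW float tensors.
        self.cmn = ops.CropMirrorNormalize(
            device='gpu', output_dtype=types.FLOAT, output_layout=types.NCHW,
            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
            std=[0.229 * 255, 0.224 * 255, 0.225 * 255])

    def define_graph(self):
        jpegs, labels = self.input()
        images = self.decode(jpegs)
        images = self.rrc(images)
        images = self.rotate(images, angle=self.rotate_range(),
                             mask=self.rotate_coin())
        images = self.cmn(images, mirror=self.flip_coin())
        return images, labels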

Building DALI Data Loaders

Once training and validation pipeline classes have been written, all that’s left to do is create their respective data loaders (DALI calls them “iterators”). It takes just three lines of code to build a data loader that works with PyTorch:

pipe = ImagenetteTrainPipeline()
pipe.build()
train_dl = DALIClassificationIterator(pipe, pipe.epoch_size('r'), stop_at_epoch=True)
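
Iterating over the result then looks much like looping over a PyTorch data loader, except that each batch arrives as a list of dicts, one per GPU (a sketch):

for batch in train_dl:
    xb = batch[0]['data']   # transformed images, already on the GPU
    yb = batch[0]['label']  # corresponding integer labels
    # ...training step goes here...
train_dl.reset()            # rewind the iterator before the next epoch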

DALI Speed Test

DALI pipeline objects have a run() function that grabs a batch of images, sends it through the pipeline, and returns the transformed images and their labels. Timing this function is the easiest way to measure DALI’s speed.
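
A minimal timing harness might look like this (averaging over many batches, since the first few run() calls can be slower while buffers warm up):

import time

pipe = ImagenetteTrainPipeline()
pipe.build()

n_batches = 100
start = time.perf_counter()
for _ in range(n_batches):
    images, labels = pipe.run()  # pull one transformed batch through the pipeline
elapsed = time.perf_counter() - start
print(f'{1000 * elapsed / n_batches:.1f} ms per batch')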

As far as speed goes, DALI can fly.

I ran my speed test on an AWS p2.xlarge compute instance with a single GPU, and a mini-batch size of 64 images. I found that my Imagenette training pipeline, which contains twelve image operations, runs in just over 40 ms! This works out to 625 µs per image for all twelve operations that are in the pipeline. By comparison, in the image augmentation lesson of the fast.ai course, we saw that the main choke point of using Pillow for image transforms was the 5 ms it took for Pillow to load a single image.

We also used PyTorch JIT to implement an image rotation algorithm that, similar to DALI, transforms batches on the GPU. It ran around 4.3 ms per batch. Assuming that a JIT implementation of any transform would take the same duration (possibly a stretch), a quick back-of-the-envelope calculation indicates that JIT performance is likely similar to DALI (4.3 x 12 = 51.6 ms). The beauty of DALI is that while it took twelve lines of code to define the script that carries out our JIT rotation transform, DALI gives us all the same functionality and speed with just a single function call!

DALI + fastai v2.0

For folks going through the 2019 fast.ai deep learning part II course, here are three tricks to get DALI’s data loaders to mesh seamlessly with the new-and-improved Learner() objects.

Trick 1

Modify the Learner class so that it properly indexes into the tensors returned by DALI data loaders. Images and labels are contained under the 'data' and 'label' keys, respectively:

xb = to_float_tensor(batch[0]['data'])          # images live under the 'data' key
yb = batch[0]['label'].squeeze().cuda().long()  # labels live under the 'label' key

Also be sure to reset DALI train and val data loaders after each epoch:

self.data.train_dl.reset()
self.data.valid_dl.reset()

Trick 2

Change the AvgStats class so that the all_stats() method returns self.tot_loss and not self.tot_loss.item().
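
Sketched against the course's AvgStats class, the change is one line (everything else stays as the notebooks define it):

class AvgStats():
    # ...__init__, reset, accumulate, etc. unchanged from the course...
    @property
    def all_stats(self):
        # Return the running-loss tensor itself rather than calling .item()
        return [self.tot_loss] + self.tot_mets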

Trick 3

Cap the maximum value of the variable that combine_scheds(), the hyperparameter schedule builder, uses when tracking the current iteration's position relative to the training cycle's length: pos = min(1 - 1e-7, pos).

The original intention was that this value would always be below 1.0 during training. However, when using DALI its value at the beginning of the final iteration would at times be 1.0 or slightly greater. This causes an IndexError, as the scheduler is forced to index into a schedule phase that actually doesn't exist!
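
Here's where the cap lands in a sketch of the course's combine_scheds(); only the min() line is new, the rest follows the course notebook:

import torch

def combine_scheds(pcts, scheds):
    # pcts: phase lengths as fractions of training, e.g. [0.3, 0.7]
    assert sum(pcts) == 1.
    pcts = torch.tensor([0.] + pcts).cumsum(0)
    def _inner(pos):
        pos = min(1 - 1e-7, pos)  # keep pos strictly below 1.0
        idx = (pos >= pcts).nonzero().max()
        actual_pos = (pos - pcts[idx]) / (pcts[idx + 1] - pcts[idx])
        return scheds[idx](actual_pos)
    return _inner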

Feel free to refer to my notebook to see a working version of a training loop that includes these three modifications.

DALI’s Most Glaring Deficiency

I’ll wrap up this dive through the DALI library by spending some time on what I believe is its most notable shortcoming: some of its image transform operations are not capable of generating randomized outputs. I found this to be particularly ironic in light of the fact that the DALI website devotes an entire section to preaching the benefits of image augmentations that are able to randomly perturb input images, stating:

“Rotating every image by 10 degrees is not that interesting. To make a meaningful augmentation, we would like an operator that rotates our images by a random angle in a given range.”

By that standard, DALI's warp affine image transform ought to be deemed "not that interesting," since it can't generate random image warps. What's even more frustrating is that, although I wrote my own logic to generate random affine transforms according to the convention expected by the matrix parameter of DALI's warp affine operation, there was no way to coax my DALI pipeline into executing that logic on a mini-batch's images at runtime.

Unsurprisingly, someone requested support for randomized warp affines but a DALI team member explained that warp affine wasn’t currently a priority, as the team was focused “on providing operators that are used in the most common networks.” Now, as someone who was a software product manager in another life, I’m certainly sympathetic to the idea of prioritizing features. However, seeing as how the DALI team wasn’t too hesitant to be loud and proud about the benefits of randomized image rotations, it’s hard for me to see how randomized warp affines couldn’t be a priority.

That said, one saving grace was that the DALI team member did encourage open source contributions to make up for this feature deficit. This is a good thing, and perhaps one day soon I'll try to port my random affine transform logic and submit a pull request.

“Synthetic Random” Warp Affine Transforms

I wasn’t ultimately content to omit perspective warps from my augmentation pipeline, nor was I okay with applying the same single, solitary warp affine transform to any image in any batch. After trying and failing to execute the logic that would randomize the affine transform performed by DALI’s ops.WarpAffine operation, I decided to try out an admittedly unconventional workaround that suddenly popped into my brain. I’ve taken to calling this a “synthetic random” warp affine transform. Here’s how it works:

  1. Write a function that generates random affine transformations that can be passed to ops.WarpAffine (see the sketch after this list). My algorithm ensures that a randomly generated affine transform will tilt an image's perspective, but won't unnaturally squish or stretch the contents of the image.
  2. Add somewhere between two and twenty DALI ops.WarpAffine operations to my pipeline. (I ran experiments to determine the right number and found that seven worked best.)
  3. Generate a unique affine transformation for each ops.WarpAffine operation that I include in my pipeline.
  4. Apply each of the pipeline’s warp affine transforms to a particular image with a probability somewhere between 0.3 and 0.025. (I found that 0.025 worked best.)
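
Here's a sketch of the whole scheme. The random_warp_matrix() helper is a hypothetical stand-in for the matrix-generating logic from step 1, and I'm assuming ops.WarpAffine accepts the same mask argument that ops.Rotate does:

import numpy as np
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

def random_warp_matrix(max_tilt=0.125):
    # Hypothetical helper: a 2x3 row-major affine matrix in the convention
    # ops.WarpAffine expects -- a mild perspective tilt, no squash or stretch.
    return [1.0, np.random.uniform(-max_tilt, max_tilt), 0.0,
            np.random.uniform(-max_tilt, max_tilt), 1.0, 0.0]

class SyntheticRandomWarpPipeline(Pipeline):
    def __init__(self, data_dir, n_warps=7, warp_prob=0.025,
                 batch_size=64, num_threads=4, device_id=0):
        super().__init__(batch_size, num_threads, device_id)
        self.input = ops.FileReader(file_root=data_dir, random_shuffle=True)
        self.decode = ops.ImageDecoder(device='mixed', output_type=types.RGB)
        # Each warp op is frozen with its own random matrix at definition
        # time, because the matrix can't be regenerated per batch.
        self.warps = [ops.WarpAffine(device='gpu', matrix=random_warp_matrix())
                      for _ in range(n_warps)]
        self.warp_coin = ops.CoinFlip(probability=warp_prob)

    def define_graph(self):
        jpegs, labels = self.input()
        images = self.decode(jpegs)
        # Each warp hits any given image with probability warp_prob, so it's
        # rare for one image to receive more than one of the fixed warps.
        for warp in self.warps:
            images = warp(images, mask=self.warp_coin())
        return images, labels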

My intuition was that with an adequately chosen number of warp affine operations, balanced with a suitable probability that each operation would be applied, I could simultaneously:

  • Maximize the variety of perspective warp transforms applied to a mini-batch’s images.
  • Adequately minimize the chance that a single image would have two or more warp transforms applied to it during each mini-batch.

By way of a sequence of experiments that are chronicled in my notebook, I found that including seven consecutive warp affine transforms, each with a 0.025 probability of being applied to any image in a batch, led to the largest increase in average validation accuracy over ten runs. This regime’s performance exceeded that of a baseline that didn’t contain any warp affine transforms. My “synthetic random” warp affines also bested a pipeline that contained just one warp transform that tilted any image in every batch exactly the same way, which would seem to be how DALI currently expects practitioners to use this operation.

One bright side throughout all of this was, again, DALI’s speed: adding an extra two, or twenty, warp affine operations to my pipeline didn’t appreciably lengthen the time it took to process each mini-batch.

To be clear, I don't describe my "synthetic random" warp transform in hopes that other practitioners will try a similar approach. Rather, my point is that however unconventional my workaround may seem, a warp transform that doesn't support randomization is every bit as unconventional.

Three Smaller Quibbles

  1. Prospective DALI users who've grown used to the dynamic nature of PyTorch should expect a decidedly static, TensorFlow v1.0-like experience. The DALI team wasn't kidding around when they named the Pipeline class's core method define_graph(), so don't expect to be able to run any custom arithmetic inside of it, as I attempted when trying to add randomness to ops.WarpAffine. The currently recommended approach is to create and compile a custom C++ operator instead. This doesn't strike me as terribly "flexible," and hopefully DALI will expand its breadth of augmentation options, which would obviate the need for practitioners to create custom ops.
  2. Speaking of which, DALI lacks support for reflection padding. My hypothesis is that this is a big reason that adding rotation transforms to my pipeline didn't improve model performance until I curtailed the range of my rotation angles from [-30, 30] degrees to [-7, 7] degrees. While DALI does allow practitioners to specify the single color used to pad the empty pixels found in image corners after a rotation, I'm skeptical that using, say, all-green or all-white padding instead of the all-black default would meaningfully improve my model's performance.
  3. I had intended to center-crop and then resize validation set images. While DALI's ops.Crop operation lets us set the coordinates of the crop window's upper left-hand corner relative to an input image's width and height, there doesn't appear to be any way to make the width and height of the cropping window also scale relative to each input image's dimensions (see the sketch after this list).
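
To illustrate that last quibble, here's the shape of the problem sketched with ops.Crop (fragments of a pipeline class, not a full example; the 160-pixel window is a placeholder):

# In __init__(): the crop window is specified in absolute pixels...
self.crop = ops.Crop(device='gpu', crop=(160, 160))

# In define_graph(): ...while the window's position is relative (0..1).
images = self.crop(images, crop_pos_x=0.5, crop_pos_y=0.5)

# There's no analogous way to make the (160, 160) window itself scale to,
# say, 80% of each input image's width and height.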

In Conclusion

DALI offers a concise Python API that plays nice with PyTorch and TensorFlow and, with just three tweaks, also works swimmingly with what's shaping up to be the training loop we'll see in version 2.0 of the fastai library. By running image augmentations in parallel using GPU-optimized code, DALI more than delivers on its promise of speed, and it obviates the need to write several-line JIT scripts, which was the only other way I knew to get image augmentations running in batches on the GPU. Unfortunately, not all DALI image transforms support randomization, a feature that, ironically, even the DALI team acknowledges is a must-have. While DALI claims to be flexible, my attempt to build randomness into DALI's warp affine operation revealed that this flexibility only extends to folks willing and able to write and compile custom C++ operators. In 2019, I'm not sure that anything that requires dinking and dunking with C++ can still be termed "flexible."

Even so, while DALI’s narrower feature set may make it harder for final versions of my models to reach SOTA or climb a Kaggle leaderboard, I still plan to use the DALI library in the earlier stages of my model prototyping process. Its Python API is easy to use and DALI just runs so darn fast. After all, we’re talking about augmenting images in batches on the GPU, here! I’m hopeful that the library will continue to fill in its gaps and improve.

References

  1. Feel free to view the notebook where I experimented with the pipeline discussed in this article.
  2. The PyTorch ImageNet training example on DALI’s GitHub page, created by Janusz Lisiecki, Joaquin Anton, and Cliff Woolley, was indispensable as a template for helping me figure out how to write my own training and validation pipeline classes from scratch.
