Discontinuity in CNN Training Time with Increasing Batch Size

Yuqi Li
Towards Data Science
Jun 8, 2022


Credit: Xin Zhang & Yuqi Li @ Aipaca Inc.


Background

Model training time analysis has become an important topic as new machine learning models keep growing in size. Setting aside supergiant models like the GPTs, even computer vision models are slow to train for regular end users such as data scientists and researchers. A computer vision model’s training time can range from a couple of hours to a couple of weeks, depending on the task and the data.

In this article, we discuss one of many interesting findings from our research on model training time. To be clear about what we mean by training time: we want to know how long it takes a given GPU configuration to train a model on one batch of data. Obviously, this depends on many variables, such as model structure, optimizer, batch size, and so on. However, given enough knowledge about the configuration and model setup, once the training time for a batch is known, we can calculate the training time for an epoch, and therefore the overall training time for a given number of epochs.
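As a concrete example of that arithmetic, here is a minimal sketch; the dataset size, batch size, and measured batch time below are made-up illustration values, not results from our experiments:

```python
import math

# Hypothetical numbers purely for illustration
num_samples = 50_000   # size of the training set
batch_size = 32        # chosen batch size
batch_time_s = 0.12    # measured time to train on one batch (seconds)
num_epochs = 90        # planned number of epochs

batches_per_epoch = math.ceil(num_samples / batch_size)
epoch_time_s = batches_per_epoch * batch_time_s
total_time_s = epoch_time_s * num_epochs

print(f"{batches_per_epoch} batches/epoch, "
      f"{epoch_time_s / 60:.1f} min/epoch, "
      f"{total_time_s / 3600:.1f} h total")
```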

While collecting training time data for CNN models, we fixed the model structure, input and output size, optimizer, and loss function, but not the batch size. In other words, we wanted to learn how increasing the batch size affects training time with everything else held fixed.

Before running the experiment, the one thing we knew for sure was that batch size and batch training time are positively correlated. What we were not sure about was the shape of the relationship: is it linear or non-linear? If it is linear, what is the slope? If it is non-linear, is it quadratic or cubic? With these questions in mind, we ran the experiment and observed something we had not anticipated.

Experiment

We ran TensorFlow’s VGG16 on a Tesla T4 cloud instance, with the default input shape (224, 224, 3), the default optimizer, and CategoricalCrossentropy as the loss function. We stepped the batch size from 1 to 70. The result is shown below: the x-axis is the batch size, and the y-axis is the corresponding batch training time. Interestingly, our expectation turned out to be only partially correct. We do observe a positive linear relation between batch size and batch training time; however, at batch sizes of 16, 32, 48, and 64 we observe a “jump” in batch training time.
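For reference, a minimal sketch of the model setup described above, assuming the VGG16 that ships with tf.keras (when no optimizer is passed to compile, Keras uses "rmsprop" by default, which we spell out here for clarity):

```python
import tensorflow as tf

# VGG16 with its default input shape (224, 224, 3) and 1000 output classes,
# randomly initialized so timing is not affected by downloading weights.
model = tf.keras.applications.VGG16(
    weights=None, input_shape=(224, 224, 3), classes=1000)

model.compile(
    optimizer="rmsprop",  # Keras's default optimizer
    loss=tf.keras.losses.CategoricalCrossentropy(),
)
```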

Image Made by the Author

This is a plot of batch size versus batch training time for VGG16 on a Tesla T4. The overall slope of the relationship is essentially unchanged across the range, which strongly suggests a linear relation. However, at certain batch sizes, specifically 16, 32, 48, and 64, the linear relation breaks down and discontinuities appear.

The values 16, 32, 48, and 64 are clearly not random: they are multiples of 16, which happens to be the PCIe link max width (16x) of the GPU. PCIe is short for PCI Express; to quote Wikipedia, “The PCI Express electrical interface is measured by the number of simultaneous lanes. (A lane is a single send/receive line of data. The analogy is a highway with traffic in both directions.)” In simple terms, the wider the PCIe link, the more data can be transferred at the same time.
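If you want to check the PCIe link width of your own GPU, nvidia-smi can report it directly; here is a small sketch that wraps the query in Python (the same query also works from a plain shell):

```python
import subprocess

# Query the maximum and current PCIe link width reported by the NVIDIA driver.
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=name,pcie.link.width.max,pcie.link.width.current",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
# Example output line (values depend on your instance): "Tesla T4, 16, 16"
```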

Our assumption about the training process is as follows. During VGG16 training, at each batch training step every data point in the batch is assigned to one of the PCIe lanes. If the batch size is less than or equal to 16, no additional round is needed; the results from the individual lanes are combined, and we get a linear relation. When the batch size is larger than 16 but at most 32, another round is needed to process the whole batch, which causes a “jump” in the training time (we assume the extra round incurs some additional overhead, shifting the curve upward).
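Under this assumption, the number of “rounds” per batch would simply be ceil(batch_size / 16), and the jumps should appear exactly where that count increments. A quick sanity check of the idea (the lane count comes from the article; the helper below is purely illustrative):

```python
import math

LANES = 16  # PCIe link max width of the GPUs we tested

def rounds(batch_size: int, lanes: int = LANES) -> int:
    """Number of lane-assignment rounds our hypothesis predicts for a batch."""
    return math.ceil(batch_size / lanes)

for b in (15, 16, 17, 31, 32, 33, 48, 49, 64, 65):
    print(f"batch_size={b:>2} -> rounds={rounds(b)}")
# The predicted round count increments right after 16, 32, 48, and 64,
# matching where the jumps appear in the plot.
```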

Trials on different GPUs

To validate this observation, we conducted the same experiment on different GPU setups. The diagrams below show the results for the Tesla K80, Tesla P100, and Tesla V100.

Two things stand out from the plots. First, in terms of speed, V100 > P100 > T4 > K80, since for the same batch size the batch time of the V100 < P100 < T4 < K80. Second, they all show the “jump” at 16, 32, 48, and 64, and all four GPUs have a PCIe link max width of 16x. (We wanted to compare against a GPU whose PCIe link max width is not 16x; however, every GPU cloud instance we could find on Google has a PCIe link max width of 16x.)

Trials on different model structures

To test our findings on different models, we ran the same experiment on the V100 for VGG19, MobileNet, and ResNet50.

The results are interesting. For VGG19 we find exactly the same pattern as for VGG16, with slightly longer training times, which is expected. However, for MobileNet and ResNet50 we no longer observe the pattern. In fact, the training times for MobileNet and ResNet50 fluctuate much more than for the VGGs.
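For completeness, swapping architectures in the same timing setup only requires changing the model constructor; a sketch assuming the tf.keras.applications versions of each model:

```python
import tensorflow as tf

# Same experiment, different architectures: only the constructor changes.
ARCHITECTURES = {
    "VGG19": tf.keras.applications.VGG19,
    "MobileNet": tf.keras.applications.MobileNet,
    "ResNet50": tf.keras.applications.ResNet50,
}

def build(name: str) -> tf.keras.Model:
    model = ARCHITECTURES[name](
        weights=None, input_shape=(224, 224, 3), classes=1000)
    model.compile(optimizer="rmsprop",
                  loss=tf.keras.losses.CategoricalCrossentropy())
    return model
```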

We do not yet have a good explanation for this. What we can say for now is that for regular CNN structures similar to the VGGs, the “jumping” behavior holds; for other CNN structures, we no longer observe it. Further investigation and research are in progress.

Image Made by the Author

Final Remarks

This research came from an open-source research project called Training Cost Calculator (TCC). The project’s goal is to understand the factors that impact machine learning training time (TT) by producing a large database of ML experiments. Based on this database, TCC can predict a training job’s TT on different cloud servers, thereby matching the optimal server to your specific ML model. If this field interests you, please join us as a contributor.

In this article we showed the “jumping” phenomenon for VGG-like CNN models. Our tentative explanation is that PCIe lane assignment causes it.

There are still open issues with this explanation: if a batch size of 32 requires two rounds of the same computation, why don’t we see the training time double? And why does the pattern appear only for VGG16 and VGG19, but not for MobileNet and ResNet50? These questions need further investigation and research.

Code to replicate the experiments
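The original code embed is not reproduced here; below is a minimal sketch of how the measurement can be replicated with tf.keras. It uses random data, one warm-up step per batch size so graph tracing and memory allocation are not counted, and the median of a few repeats; the details of the original script may differ.

```python
import time
import numpy as np
import tensorflow as tf

NUM_CLASSES = 1000
INPUT_SHAPE = (224, 224, 3)

def time_one_batch(batch_size: int, repeats: int = 5) -> float:
    """Median time (seconds) to train VGG16 on a single batch of random data."""
    model = tf.keras.applications.VGG16(
        weights=None, input_shape=INPUT_SHAPE, classes=NUM_CLASSES)
    model.compile(optimizer="rmsprop",
                  loss=tf.keras.losses.CategoricalCrossentropy())

    x = np.random.rand(batch_size, *INPUT_SHAPE).astype("float32")
    y = tf.keras.utils.to_categorical(
        np.random.randint(NUM_CLASSES, size=batch_size), NUM_CLASSES)

    model.train_on_batch(x, y)  # warm-up: graph tracing + memory allocation

    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        model.train_on_batch(x, y)
        times.append(time.perf_counter() - start)
    return float(np.median(times))

if __name__ == "__main__":
    # Step the batch size from 1 to 70 and print CSV rows: batch_size,seconds
    for b in range(1, 71):
        print(f"{b},{time_one_batch(b):.4f}")
```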
