
How to reduce training parameters in CNNs while keeping accuracy >99%

Good-performing neural networks do not have to be unnecessarily large

Photo by Uriel SC on Unsplash

Research in the field of image recognition is focused on deep learning techniques and has made good progress in recent years. Convolutional neural networks (CNNs) are very effective at perceiving structure in images and thus at automatically extracting distinctive features. Nevertheless, the usually very large networks require massive computational power and long training times to achieve the highest possible accuracy.

In this work, I will show you 3 approaches to keeping the number of parameters in a convolutional network as small as possible without compromising accuracy.

We will use the "Modified National Institute of Standards and Technology (MNIST)" (source) dataset in this experiment.

Hint: A basic knowledge of machine learning is needed for a better understanding of this experiment. We will deal, for example, with pooling, normalization, regularization, and pruning, so it is advisable to familiarize yourself a little with these techniques before continuing.


Motivation

Most of the existing approaches that effectively solve the MNIST dataset reach high accuracy, but the parameter count of the respective models is often in the range of 100,000–200,000 or even higher [1, 2]. The reason for this is that these approaches usually use a large number of convolutional layers with large feature maps as feature extractors, while dense layers are used for classification. Accordingly, these architectures result in many parameters. A typical structure of a CNN for classification is shown in the following figure:

Fig. 1: Typical CNN architecture, source [2]

More specifically, the challenge of this experiment is to reduce the model size to a parameter number of less than 10,000 (i.e., less than 10% of most architectures) while maintaining accuracy in the 99%+ range.


Implementation

Preparation of the data set

The training data is first brought into the format expected by the model: the pixel values are normalized and the labels are one-hot encoded. The dataset is then loaded into memory for better performance. Next, the training data is shuffled so that each training run does not always see the same order of samples, and the data is divided into equally sized batches so that every batch in an epoch has the same size. The test data goes through the same steps, except for the shuffling.
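As a rough illustration, the pipeline described above could look like the following tf.data sketch. It assumes the standard Keras MNIST loader and the batch size of 125 that is used later in the experiments; the author's exact preprocessing code may differ.

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

def prepare(images, labels, shuffle=False, batch_size=125):
    # Scale pixel values to [0, 1] and add the single grayscale channel
    images = images.astype("float32")[..., None] / 255.0
    # One-hot encode the ten digit classes
    labels = tf.keras.utils.to_categorical(labels, num_classes=10)
    ds = tf.data.Dataset.from_tensor_slices((images, labels)).cache()  # keep in memory
    if shuffle:
        ds = ds.shuffle(len(images))  # new sample order in every epoch
    # drop_remainder=True keeps every batch the same size
    return ds.batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

train_ds = prepare(x_train, y_train, shuffle=True)
test_ds = prepare(x_test, y_test)
```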

Approaches

In this section, the architectures used for the experiment are presented. The resulting models are then trained on the MNIST dataset, and training and test accuracy are measured after each epoch. These are observed for 40 epochs in each approach.

Different architectures are created and then tested for their performance. As a requirement of the experiment, each architecture has fewer than 10,000 trainable parameters. (Trainable parameters in Keras are those that can be updated during training; layers such as activations, MaxPooling, Flatten, and Dropout contribute no trainable parameters.)
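For reference, the trainable and non-trainable parameter counts reported by `model.summary()` can also be computed directly; a minimal sketch, where `model` stands for any of the Keras models built below:

```python
import tensorflow as tf

def count_parameters(model: tf.keras.Model):
    # Parameters updated during training (conv kernels, biases, BN scale/shift)
    trainable = sum(tf.keras.backend.count_params(w) for w in model.trainable_weights)
    # Parameters not touched by the optimizer (e.g., BatchNorm moving statistics)
    non_trainable = sum(tf.keras.backend.count_params(w) for w in model.non_trainable_weights)
    return trainable, non_trainable
```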

Approach 1

This approach follows the "typical architecture" (see Figure 1) of CNNs in a reduced variant. Due to the restriction on the number of trainable parameters, only convolutional layers were used for feature extraction and only a single Dense layer for classification. As a result, this model has only 7,968 trainable parameters.

Architectural details:

Figure 2: Keras model summary of approach 1, Source: own figure

The model consists of 11 layers in total. The data is acquired through the input layer in the form (28,28,1) since the images have a size of 28×28 pixels and a grayscale channel for the color value. This is followed by two successive blocks, each consisting of the following layers: A Conv2D layer with ReLU activation, followed by Batch Normalization, and followed by the MaxPooling layer.

MaxPooling contributes greatly to parameter reduction, as it downsamples the input data. The convolutional layer in the first block consists of 32 different filters with a small kernel size of 3×3. In the second block, the number of filters is reduced to 14. No padding was used, since it is assumed that the essential parts of the image are located in the center rather than at the edges. The step size (strides) was set to 1. ReLU was used as the activation function, since it has become the standard choice. Batch Normalization provides regularization to prevent overfitting. Each MaxPooling layer with a pool size of 2 reduces the outputs of the previously applied filters by a factor of 2. As a transition to the fully connected part, a Flatten layer follows, which reduces the tensor to a vector of length 350 (5×5×14). Afterward, a Dropout of 10% is added, which is also used for regularization. The last layer is a Dense layer with Softmax activation. The model has a total of 8,060 parameters, of which 7,968 are trainable.
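Put together, the architecture could be sketched in Keras roughly as follows. This is a reconstruction from the description and the parameter counts in Figure 2, not the author's original code.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_approach_1():
    inputs = keras.Input(shape=(28, 28, 1))
    # Block 1: Conv2D + ReLU, Batch Normalization, MaxPooling
    x = layers.Conv2D(32, kernel_size=3, strides=1, padding="valid", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    # Block 2: fewer filters to keep the parameter count small
    x = layers.Conv2D(14, kernel_size=3, strides=1, padding="valid", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    # Classification head: Flatten -> Dropout -> single Dense layer
    x = layers.Flatten()(x)            # 5 x 5 x 14 = 350 values
    x = layers.Dropout(0.1)(x)
    outputs = layers.Dense(10, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model_1 = build_approach_1()
model_1.summary()   # 8,060 parameters, 7,968 of them trainable
```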

_Configuration:_ In order not to train the model more than necessary, early stopping is used. The test accuracy is monitored over 10 epochs and should not fall below the lower limit of 98%. If the test accuracy does not improve, the best weights are restored ("restore best weights" in Keras). Reduce Learning Rate on Plateau is set to 4 epochs at a time, i.e., if the test accuracy does not improve for 4 epochs, the learning rate is reduced by a factor of 0.5. The loss is calculated with categorical crossentropy, and Adam is used as the optimizer. The batch size is 125, so that the data divides into equally sized batches, and the initial learning rate is 0.01. Thanks to Batch Normalization, the accuracy converges faster, i.e., the learning process is accelerated and a lower learning rate is not necessary.
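Translated into Keras callbacks and compile settings, this configuration could look roughly like the sketch below. It assumes the test set is passed as validation data, so `val_accuracy` corresponds to the test accuracy monitored in the text.

```python
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_accuracy",
    patience=10,                 # observe the test accuracy over 10 epochs
    baseline=0.98,               # do not fall below the 98% lower limit
    restore_best_weights=True,   # "restore best weights"
)

reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_accuracy",
    patience=4,                  # no improvement for 4 epochs ...
    factor=0.5,                  # ... halves the learning rate
)

model_1.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.01),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

history = model_1.fit(
    train_ds,                    # already batched with batch size 125
    validation_data=test_ds,
    epochs=40,
    callbacks=[early_stop, reduce_lr],
)
```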

Results:

Figure 3: Training curve plot of approach 1, Source: own figure

On average, the test accuracy already exceeds 98% after the 2nd epoch and subsequently converges towards 99%. Using Reduce Learning Rate on Plateau stabilizes the training process. Early stopping was triggered on average between epoch 30 and 40. The network produced a best test accuracy of 99.24%. From this result, it can be concluded that the network has a high generalization capability and reaches the targeted accuracy despite the smaller number of trainable parameters.


Approach 2

In this approach, only convolutional layers were used for feature extraction. For classification, Global Average Pooling was used, which replaces the Dense layer. This saves a significant number of parameters that would normally be generated by the Dense layer. It has already been noted in [4] and [5] that narrow and deep networks should have a better generalization capability. Therefore, this architecture was designed with more layers than approach 1.

Architectural details:

Figure 4: Keras model summary of approach 2, Source: own figure

As mentioned earlier, a very deep network was chosen for the architecture in this approach. It consists of 5 "conv blocks", each of which consists of the following consecutive layers: Convolutional layer, ReLU, Batch Normalization. The number of filters in the conv layers is 4–8–16–16–10. The kernel sizes are always 3×3, no padding was used, and the step size is 1. Max pooling is performed after the second block, which helps to reduce the number of parameters. The following two blocks have the most parameters, due to the high number of filters; however, these are important for the feature extraction and the resulting accuracy of the model. The last block has only 10 filters, which according to several tests is quite sufficient. The last layer of each of the last three blocks is a 10% dropout, which is used for regularization. The model has a total of 5,490 parameters, of which 5,382 are trainable.
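A possible Keras sketch of this architecture is shown below. The exact positions of the MaxPooling and Dropout layers are taken from the description above; with these settings, the summary reports 5,382 trainable parameters.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_conv_gap_model(filters=(4, 8, 16, 16, 10), dropout_rate=0.1):
    """All-convolutional network with Global Average Pooling as the classifier."""
    inputs = keras.Input(shape=(28, 28, 1))
    x = inputs
    for i, n_filters in enumerate(filters):
        # Conv block: Convolution, ReLU, Batch Normalization
        x = layers.Conv2D(n_filters, kernel_size=3, strides=1, padding="valid",
                          activation="relu")(x)
        x = layers.BatchNormalization()(x)
        if i == 1:
            x = layers.MaxPooling2D(pool_size=2)(x)   # max pooling after the second block
        if dropout_rate > 0 and i >= len(filters) - 3:
            x = layers.Dropout(dropout_rate)(x)       # dropout at the end of the last three blocks
    # Global Average Pooling replaces the Dense classifier:
    # the last block has 10 filters, one per digit class
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Activation("softmax")(x)
    return keras.Model(inputs, outputs)

model_2 = build_conv_gap_model()
model_2.summary()   # 5,490 parameters, 5,382 of them trainable
```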

_Configuration:_ In this network, the configuration of Early Stopping, Reduce Learning Rate on Plateau, loss, optimizer, batch size, and learning rate is identical to that of approach 1.

Figure 5: Training curve plot of approach 2, Source: own figure

_Observation:_ After initial fluctuations, the test accuracy slowly converges towards 99% after 10 epochs. The network produced a best test accuracy of 99.47%.

Approach 3

The architecture is essentially similar to approach 2. In this approach, too, only convolutional layers are used for feature extraction and the global average pooling method was also used for classification. However, it consists of fewer layers and has no dropout. The configuration was not changed.

Architectural details:

Figure 6: Keras model summary of approach 3, Source: own figure

The architecture is also a deep network with 5 blocks, each consisting of a convolutional layer with ReLU activation and subsequent Batch Normalization. The number of filters is defined in ascending order as 2–4–8–16–10. Thus, the number of filters in the first three convolutional layers is halved compared to approach 2, while the last two are identical. The second major difference to approach 2 is the removal of the dropout layers, since it was found that they caused both the training and the test accuracy to drop significantly. It can be concluded that too frequent use of dropout in small networks can lead to too much regularization. The model has a total of 3,170 parameters, of which 3,090 are trainable.
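Using the builder sketched for approach 2 above, approach 3 then only changes the filter counts and drops the dropout:

```python
# Approach 3: halved filters in the first three blocks, no dropout
model_3 = build_conv_gap_model(filters=(2, 4, 8, 16, 10), dropout_rate=0.0)
model_3.summary()   # 3,170 parameters, 3,090 of them trainable
```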

Figure 7: Training curve plot of approach 3, Source: own figure

_Observation:_ The reduction of the number of filters in the first three convolutional layers does not have as high an impact on accuracy as expected. The network still achieves an accuracy of over 99%.

Figure 8: Comparison of the approaches, Source: own figure

Conclusion: In this section, 3 models with different architectures were tested for their learning success. The model from approach 2, with approximately 5,400 trainable parameters, achieved the best accuracy of 99.47%. It has enough filters to extract features efficiently, and the use of regularization and normalization ensures a stable training process.


Pruning with the help of the "lottery ticket hypothesis"

The purpose of this section is to conduct a short experiment using the "Lottery ticket hypothesis" and evaluate the results. This hypothesis is based on the assumption that a subnetwork exists within the original network, which is responsible for the majority of the result. This means that the accuracy is primarily determined by this subnetwork. Therefore, it should be tested whether a "winning ticket" exists for the network in approach 2.

The method used is "global pruning", which is also applied in [3] to deep convolutional networks. There it is described that pruning of deeper networks takes place globally by removing the smallest weights collectively across all convolutional layers. In deeper networks, some layers have far more parameters than others. When all layers are pruned at the same rate, the smaller layers become bottlenecks that prevent identifying the smallest possible "winning tickets"; global pruning avoids this pitfall [3, p. 7]. The network in this experiment also has large differences in the number of parameters per layer, so this method is used here as well.

_Iterative pruning:_ The network from approach 2 is trained with randomly initialized weights. As described in the paper by Frankle & Carbin [3], it is trained until it reaches its best accuracy. After that, pruning takes place over n rounds: in each round, p^(1/n) % of the remaining weights are pruned and a mask is computed. The remaining weights are then reset to their initial values and the network is trained again.
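A minimal sketch of this procedure is shown below. It reuses the model, datasets, and callbacks from the earlier sketches; the helper names (`global_prune_masks`, `ApplyMasks`) are my own, and the masks are simply re-applied after every weight update so that the pruned weights stay at zero during retraining.

```python
import numpy as np
from tensorflow import keras

def conv_layers(model):
    return [l for l in model.layers if isinstance(l, keras.layers.Conv2D)]

def global_prune_masks(model, fraction, old_masks=None):
    """Build masks that remove the smallest `fraction` of the surviving conv kernel
    weights, with the magnitude threshold computed globally over all Conv2D layers."""
    kernels = [l.kernel.numpy() for l in conv_layers(model)]
    if old_masks is None:
        old_masks = [np.ones_like(k) for k in kernels]
    surviving = np.concatenate([np.abs(k[m == 1]) for k, m in zip(kernels, old_masks)])
    threshold = np.percentile(surviving, fraction * 100)
    return [((np.abs(k) > threshold) & (m == 1)).astype(np.float32)
            for k, m in zip(kernels, old_masks)]

class ApplyMasks(keras.callbacks.Callback):
    """Zero out the pruned conv weights before training and after every weight update."""
    def __init__(self, masks):
        super().__init__()
        self.masks = masks
    def _apply(self):
        for layer, mask in zip(conv_layers(self.model), self.masks):
            layer.kernel.assign(layer.kernel * mask)
    def on_train_begin(self, logs=None):
        self._apply()
    def on_train_batch_end(self, batch, logs=None):
        self._apply()

# One pruning round: train, prune 20% of the conv weights, rewind, retrain
model_2.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
                loss="categorical_crossentropy", metrics=["accuracy"])
initial_weights = model_2.get_weights()            # store the original initialization
model_2.fit(train_ds, validation_data=test_ds, epochs=40,
            callbacks=[early_stop, reduce_lr])
masks = global_prune_masks(model_2, fraction=0.20)
model_2.set_weights(initial_weights)               # rewind to the initial values
model_2.fit(train_ds, validation_data=test_ds, epochs=40,
            callbacks=[early_stop, reduce_lr, ApplyMasks(masks)])
```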

This experiment is conducted over two iterations. Early Stopping and Reduce Learning Rate on Plateau are used to create the same conditions as in the previous experiments. During the training process, the number of pruned weights per iteration is recorded in order to compare with the original network (see Table 2). Only the weights of the convolutional layers are affected by the pruning process (prunable parameters). In the following table, the remaining parameters are listed under "parameters", and "reduction" refers to the number of parameters removed in that iteration.

Source: own figure

_Observation:_ At each iteration, the accuracy decreases by about 0.1%. The figure below shows that at the beginning of the training process in the first iteration (20% pruning) there is higher fluctuation than in the original, but it stabilizes somewhat after 15 epochs. In the second iteration (55.77% pruning), even higher fluctuations in the test accuracy are observed at the beginning, but these also stabilize more and more later on. In this iteration, early stopping was triggered after 31 epochs. Overall, the desired test accuracy of >99% is achieved in each iteration. The paper mentions that pruned networks learn faster [3, p. 4] and can even outperform the original accuracy [3, p. 5]. In contrast, the pruned network here does not learn faster and does not outperform the original network. In large networks, the reduction is considerably larger and therefore has a correspondingly greater effect on training, whereas here it takes place only to a small extent.

Comparing all learning curves (blue=original, orange=20%, green=55.78%), Source: own figure

Results: Starting from 5,220 parameters that could be used for pruning, a reduction to 2,309 prunable parameters was achieved. This results in 2,471 trainable parameters in the final model after pruning (the 5,382 trainable parameters reported by Keras, minus the 5,220 prunable weights, plus the 2,309 weights remaining after pruning) and a test accuracy of 99.2%.

Summary and outlook

In practice, neural networks tend to be overparameterized. For problems like the MNIST dataset used here, fewer parameters are needed than most architectures use. This saves training time and computational power. As shown in this work, a suitable architecture can help to extract enough information without sacrificing (much) accuracy. In addition, methods like pruning can help to reduce the remaining weights in such a way that only the essential part of the network is left, the so-called "winning ticket".

References

[1] Ahlawat, Savita; Choudhary, Amit; Nayyar, Anand; Singh, Saurabh; Yoon, Byungun: Improved handwritten digit recognition using convolutional neural networks (CNN). In: Sensors 20 (2020), no. 12, p. 3344

[2] Siddique, Fathma; Sakib, Shadman; Siddique, Md Abu B.: Recognition of handwritten digit using convolutional neural network in Python with TensorFlow and comparison of performance for various hidden layers. In: 2019 5th International Conference on Advances in Electrical Engineering (ICAEE), IEEE, 2019, pp. 541–546

[3] Frankle, Jonathan; Carbin, Michael: The lottery ticket hypothesis: Finding sparse, trainable neural networks. In: arXiv preprint arXiv:1803.03635 (2018)

[4] Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron: Deep learning. MIT Press, Cambridge, 2016

[5] Simonyan, Karen; Zisserman, Andrew: Very deep convolutional networks for large-scale image recognition. In: arXiv preprint arXiv:1409.1556 (2014)


If you like my content about Data Science, don’t forget to follow me here on Medium.

Want to connect? You can also reach me on social media (link ). I’m happy to connect with new people and I’m always open for a quick chat :).

