Different Types of Normalization in TensorFlow

Learn about batch, group, instance, layer, and weight normalization in TensorFlow, with explanations and implementations.

Vardan Agarwal
Towards Data Science



When normalization was introduced in deep learning, it was the next big thing and improved performance considerably. Getting it right can be a crucial factor for your model. Ever wondered what the hell batch normalization actually is and how it improves performance? And are there any substitutes for it? If you, like me, had these questions but never bothered to look into them and just used batch normalization for the sake of it, this article will clear things up.

Table of Contents

  • Batch Normalization
  • Group Normalization
  • Instance Normalization
  • Layer Normalization
  • Weight Normalization
  • Implementation in TensorFlow

Batch Normalization


The most widely used technique, and one that works wonders for performance. What does it do? Batch normalization normalizes the activations in a network across the mini-batch: it computes the mean and variance of each feature over the mini-batch, then subtracts the mean and divides by the mini-batch standard deviation. It also has two learnable parameters, a scale (gamma) and an offset (beta), which let the network rescale and shift the normalized activations, so a layer is not forced to produce outputs with exactly zero mean and unit standard deviation.
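To make this concrete, here is a minimal sketch (my own illustration, not the layer's actual implementation, which also tracks running statistics for use at inference time) of what batch normalization computes at training time for activations shaped (batch, height, width, channels); the tensor and parameter names are illustrative.

import tensorflow as tf

x = tf.random.normal((64, 28, 28, 32))   # a mini-batch of activations
gamma = tf.ones((32,))                   # learnable scale
beta = tf.zeros((32,))                   # learnable offset
eps = 1e-3

# One mean/variance per channel, computed over the batch and spatial axes.
mean, var = tf.nn.moments(x, axes=[0, 1, 2], keepdims=True)
x_hat = (x - mean) / tf.sqrt(var + eps)  # normalize
y = gamma * x_hat + beta                 # scale and shift back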

All this seems simple enough, but why did it have such a big impact on the community, and how does it achieve it? The answer is not completely settled. Some say it reduces internal covariate shift, while others disagree. What we do know is that it makes the loss surface smoother, and that the activations of one layer can be controlled more independently of other layers, preventing the weights from flying all over the place.

If it is so great, why do we need anything else? When the batch size is small, the mean and variance of the mini-batch can be far from the global mean and variance, which introduces a lot of noise. With a batch size of 1, batch normalization cannot be applied at all, and it also does not work well in RNNs.

Group Normalization


It computes the mean and standard deviation over groups of channels for each training example, so it is essentially independent of the batch size. Group normalization matched the performance of batch normalization with a batch size of 32 on the ImageNet dataset and outperformed it at smaller batch sizes. When the image resolution is high and a large batch size cannot be used because of memory constraints, group normalization is a very effective technique.
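As a rough sketch of the idea (my own illustration, not the article's or the library's code), group normalization can be written as a reshape into groups followed by per-group statistics; the real layer also learns a per-channel scale and offset, which are omitted here for brevity.

import tensorflow as tf

def group_norm(x, groups=8, eps=1e-5):
    # x has shape (batch, height, width, channels)
    n, h, w, c = x.shape
    # Split the channels into groups and normalize within each group,
    # independently for every example in the batch.
    x = tf.reshape(x, (n, h, w, groups, c // groups))
    mean, var = tf.nn.moments(x, axes=[1, 2, 4], keepdims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    return tf.reshape(x, (n, h, w, c))

y = group_norm(tf.random.normal((4, 28, 28, 32)), groups=8)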

Instance normalization and layer normalization (which we will discuss later) are both inferior to batch normalization for image recognition tasks, but group normalization is not. Layer normalization considers all the channels, while instance normalization considers only a single channel, and that is where they fall short: not all channels are equally important (think of the center of an image versus its edges), yet they are not completely independent of each other either. Group normalization thus combines the best of both worlds while leaving out their drawbacks.

Instance Normalization


As discussed earlier, it computes the mean and variance across each channel of each training image. It is used in style-transfer applications and has also been suggested as a replacement for batch normalization in GANs.
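A similar sketch for instance normalization, again with illustrative names and without the learnable scale and offset: the statistics are taken over the spatial axes only, so every channel of every image is normalized on its own.

import tensorflow as tf

x = tf.random.normal((4, 28, 28, 32))
mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)  # per image, per channel
y = (x - mean) / tf.sqrt(var + 1e-5)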

Layer Normalization


While batch normalization normalizes the inputs across the batch dimension, layer normalization normalizes the inputs across the features. Like group and instance normalization, it works on a single example at a time, i.e. its mean and variance are calculated independently of the other examples. Experimental results show that it performs well in RNNs.
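Sketched the same way (illustrative names, learnable scale and offset again omitted), layer normalization takes its statistics over all the feature axes of a single example, so the result does not depend on the batch at all.

import tensorflow as tf

x = tf.random.normal((4, 28, 28, 32))
mean, var = tf.nn.moments(x, axes=[1, 2, 3], keepdims=True)  # one mean/var per example
y = (x - mean) / tf.sqrt(var + 1e-5)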

Weight Normalization


I think the best way to describe it is to quote its paper's abstract.

By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time.
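In other words, each weight vector w is rewritten as a direction v and a length g, with w = g * v / ||v||, and both v and g are learned. A minimal sketch of that idea for a single dense layer (illustrative names, not the paper's code):

import tensorflow as tf

v = tf.Variable(tf.random.normal((784, 128)))  # direction, one column per output unit
g = tf.Variable(tf.ones((128,)))               # length of each weight vector

def dense_weightnorm(x):
    # Normalize each column of v to unit length, then rescale it by g.
    w = g * v / tf.norm(v, axis=0)
    return tf.matmul(x, w)

y = dense_weightnorm(tf.random.normal((4, 784)))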

Implementation in TensorFlow

What’s the use of understanding the theory if we can’t implement it? So let’s see how to implement these layers in TensorFlow. Batch normalization and layer normalization are available in stable TensorFlow through tf.keras.layers; for the others, we need to install TensorFlow Addons.

pip install -q --no-deps tensorflow-addons~=0.7

Let’s create a model and add these different normalization layers.

import tensorflow as tf
import tensorflow_addons as tfa

model = tf.keras.models.Sequential()

# Batch Normalization
model.add(tf.keras.layers.BatchNormalization())

# Group Normalization (the preceding convolution supplies the channels to group)
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu'))
model.add(tfa.layers.GroupNormalization(groups=8, axis=3))

# Instance Normalization
model.add(tfa.layers.InstanceNormalization(axis=3, center=True, scale=True, beta_initializer="random_uniform", gamma_initializer="random_uniform"))

# Layer Normalization
model.add(tf.keras.layers.LayerNormalization(axis=1, center=True, scale=True))

# Weight Normalization (wraps another layer)
model.add(tfa.layers.WeightNormalization(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu')))

When assigning the number of groups in group normalization, make sure its value evenly divides the number of feature maps present at that point in the network. In the code above that number is 32, so any of its divisors can be used as the number of groups.

Now that we know how to use them, why not try them out? We will use the MNIST dataset with a simple network architecture.

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv2D(16, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu'))
#ADD a normalization layer here
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu'))
#ADD a normalization layer here
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss=tf.keras.losses.categorical_crossentropy,
              optimizer='adam', metrics=['accuracy'])
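For completeness, here is one way the training could be run, continuing from the model defined above. The exact preprocessing and training settings are not given in the article, so treat the choices below (scaling pixels to [0, 1], one-hot labels, 5 epochs) as assumptions.

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # add a channel axis and scale to [0, 1]
x_test = x_test[..., None] / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)  # one-hot for categorical_crossentropy
y_test = tf.keras.utils.to_categorical(y_test, 10)

model.fit(x_train, y_train, batch_size=32, epochs=5,
          validation_data=(x_test, y_test))
model.evaluate(x_test, y_test)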

I tried all the normalizations with five different batch sizes, namely 128, 64, 32, 16, and 8. The results are shown below.

[Figures: training results and testing accuracies for each normalization layer across the batch sizes]

I won’t read too much into the results because of factors like dataset bias and plain luck: train the models again and you will see different numbers.

If you want to read about these techniques in more detail or discover more normalization methods, you can refer to this article, which was a great help to me in writing this one. If you would like to further improve your network, you can read this one as well.
