
Hi guys, today I am going to talk about how to use a VGG model as a pre-trained model. Let's take tiny steps.
What are these VGG Models?
- VGG models are a type of CNN Architecture proposed by Karen Simonyan & Andrew Zisserman of Visual Geometry Group (VGG), Oxford University, which brought remarkable results for the ImageNet Challenge.
- They experimented with 6 models with different numbers of trainable layers. Based on the number of weight layers, the two most popular models are VGG16 and VGG19.
Before we proceed, we should answer what this CNN architecture is, and also say a bit about ImageNet.
For interested readers, you can refer to the following table to know about all the ConvNet families that the authors experimented with.

What is this CNN Architecture?
Well, CNN is a specialized deep neural network model for handling image data.
- It does not need the traditional image processing filters like edge, histogram, texture, etc.; rather, in a CNN the filters are learnable, so they do not need to be determined through trial and error.
- A CNN has two parts: the first is a feature learning part, followed by a classification part (often referred to as the Fully Connected Layer)
- The two main building blocks of the feature learning part are the convolution layer and the pooling layer
- Convolution Layer: The learnable filters or the feature extractors we talked about.
- Pooling Layer: This does some spatial compression and also brings about invariance. A car will be a car, even if it is rotated a little bit.
Figure 2 gives an architectural overview of a CNN. Convolutions create feature maps; pooling is achieved through subsampling.
In case you need a more detailed explanation, you can look here.
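To make this two-part structure concrete, here is a minimal Keras sketch of a small CNN; the filter counts, dense size, and the 10 output classes are arbitrary placeholders for illustration, not VGG's actual configuration.
from tensorflow.keras import layers, models
# Feature learning part: convolution (learnable filters) + pooling (subsampling)
toy_cnn = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    # Classification part: the fully connected layers
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax')
])
print(toy_cnn.summary())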

Why and what of the pre-trained model?
- These are networks with a large number of parameters (a case in point is VGG16, which has about 138 million parameters)
- Generally, training such a network is time- and resource-consuming
- The pre-trained models for CV are mostly pretty general-purpose too
- We can directly use these models if the classes we want to predict are among the 1000 classes they were trained on
- Even if our task is a little bit different, we can remove the top layer and train the weights of only that layer (Transfer Learning), as sketched below
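Here is a minimal sketch of that idea, assuming a hypothetical 5-class target task (the class count and the size of the new dense layer are placeholders):
from tensorflow.keras import layers, models
from tensorflow.keras.applications.vgg16 import VGG16
# Load VGG16 without its 1000-class top layer and freeze the convolutional base
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False
# New classification head, trained on the target data only
transfer_model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(5, activation='softmax')
])
transfer_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])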
What is this ImageNet dataset?
This was an initiative started in 2006 by Stanford Professor Fei-Fei Li, built on top of the WordNet hierarchy, with crowdsourced image annotations. This actually made the testbed for computer vision tasks really robust, large, and extensive. Based on ImageNet, a 1000-class classification challenge started under the name ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
Actually, this competition is responsible for the birth of most of the prominent CNN models.

Now the implementations
Step 1: Import the model
from tensorflow.keras.applications.vgg16 import VGG16
model = VGG16(weights='imagenet')
print(model.summary())
There are many other CNN models available, which can be found here.
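For example (just as an illustration, not used in the rest of this post), loading another pre-trained architecture such as ResNet50 follows exactly the same pattern:
# ResNet50 is another ImageNet pre-trained model bundled with Keras
from tensorflow.keras.applications.resnet50 import ResNet50
resnet = ResNet50(weights='imagenet')
print(resnet.summary())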

Step 2: Loading a sample image
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input,decode_predictions
import numpy as np
img_path = '/kaggle/input/images/dog.jpg'
#There is an interpolation method to match the source size with the target size
#image loaded in PIL (Python Imaging Library)
img = image.load_img(img_path,color_mode='rgb', target_size=(224, 224))
display(img)
The test image that we are using is a Golden Retriever; also please note the image is loaded in Python Imaging Library (PIL) format.

Step 3: Making the image size compatible with VGG16 input
# Converts a PIL Image to a 3D NumPy array (height, width, channels)
x = image.img_to_array(img)
x.shape
# Adding the fourth dimension, for the number of images
x = np.expand_dims(x, axis=0)
Here, the PIL image is converted to a 3D array first (an image in RGB format is a 3D array). Then another dimension is added for the number of images, so the input is actually a 4D array.
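As a quick sanity check (a tiny sketch continuing from the variables above), we can print the shapes before and after adding the batch dimension:
print(image.img_to_array(img).shape)  # (224, 224, 3)  -> height, width, channels
print(x.shape)                        # (1, 224, 224, 3) -> batch, height, width, channels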
Step 4: Making the prediction
# channel-wise mean centering with respect to the ImageNet dataset
x = preprocess_input(x)
features = model.predict(x)
p = decode_predictions(features)
In this step a simple pre-processing (mean centering) is done, then the prediction is made, and finally the prediction, which is a probability distribution over the 1000 classes, is decoded into comprehensible class names. We have used the default top-5 probable class mode.
Output
[[('n02099601', 'golden_retriever', 0.8579672),
('n02099267', 'flat-coated_retriever', 0.018425034),
('n04409515', 'tennis_ball', 0.01615624),
('n02099712', 'Labrador_retriever', 0.015078514),
('n02099849', 'Chesapeake_Bay_retriever', 0.012522769)]]
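decode_predictions also accepts a top argument if we want a different number of candidate classes; here is a small sketch (continuing from features above) that keeps only the 3 most likely classes:
# Keep only the 3 most likely classes and print them one per line
for class_id, class_name, score in decode_predictions(features, top=3)[0]:
    print(f"{class_id}: {class_name} ({score:.4f})")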
If we plot these probabilities as a bar chart, this is how it will look:
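A minimal matplotlib sketch for such a chart (assuming p holds the decoded predictions from Step 4) could be:
import matplotlib.pyplot as plt
# p[0] is a list of (class_id, class_name, probability) tuples for our single image
names = [name for _, name, _ in p[0]]
scores = [score for _, _, score in p[0]]
plt.barh(names[::-1], scores[::-1])  # most probable class on top
plt.xlabel('Probability')
plt.title('Top-5 VGG16 predictions')
plt.show()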

So, without creating a model and training it, we could classify an image of a Golden Retriever perfectly.
Endnote:
- The pre-trained models are like magic: we can just download them and start using them, even without any data or training.
- If the source task and the target task are different but there is some similarity between the domains, we may have to train a few layers; still, it will not be as extensive as training from scratch and will need much less data.
Reference:
[1] https://www.kaggle.com/saptarsi/using-pre-trained-vgg-model
[2] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).