
This article introduces single-object localization using pre-trained CNNs, along with a few adaptations that help identify the best-performing model for the context at hand.
Let’s start with what localization means. In simple terms, it is locating an object in an image: after the object is identified, it is enclosed in a bounding box, i.e. a rectangle that marks where it sits. Related tasks include single-object localization, multi-object localization and semantic segmentation, which pursue the same broad goal in different ways.
Here I will stick to single-object localization: identifying the object of interest in an image and then locating it using a CNN. I will use MobileNet, ResNet and Xception separately as pre-trained convolutional networks and build the full classifier-plus-localizer on top of each. Along the way, IOU (Intersection over Union) will be introduced and printed for each prediction, and finally we will see which pre-trained network performs best on the dataset we use.
For this project, the Oxford Pet dataset is a good fit; you can download it from the link below.
http://www.robots.ox.ac.uk/~vgg/data/pets/
Now let’s analyse the dataset. It contains images of animals, and each image contains a single animal: cats and dogs of various breeds. The alignment, position and pose of the animal differ from image to image, which gives us a varied training set and helps towards more accurate results. From the link above we can download both the images and the ground-truth data; once downloaded, we end up with two folders: images and annotations. The annotations folder holds the XML annotations, the class list and related files. With all of this in hand, let’s move on to object localization using different pre-trained CNN models.
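As the parsing code later in the article implies, the class list (annotations/list.txt) starts with a few comment lines beginning with '#', and each data row holds the image name, class id, species (1 = cat, 2 = dog) and breed id. A row for an Abyssinian cat, for example, looks roughly like this (shown for illustration):
#Image CLASS-ID SPECIES BREED-ID
Abyssinian_100 1 1 1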
Before we properly begin, let’s introduce IOU as a metric. Intersection over Union (IOU) measures how much the predicted bounding box deviates from the real one; it is a good way to understand how well our prediction matches.
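For example, if the predicted and ground-truth boxes each cover 100x100 pixels and overlap over a 50x100 region, the IOU is 5000 / (10000 + 10000 - 5000) = 1/3; identical boxes give an IOU of 1 and non-overlapping boxes give 0.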
PS: for every picture plotted with bounding boxes after training with each pre-trained network, the IOU will be printed with it.
First, we need to import all the necessary libraries and packages.
from collections import namedtuple
import csv
import os
import xml.etree.ElementTree as ET  # XML parsing library for reading the annotation files

import numpy as np
import cv2
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image, ImageOps

import tensorflow as tf
from tensorflow.keras import backend as K
# Note: each of the following imports re-binds the name `preprocess_input`,
# so the last one imported (Xception's) is the one actually used below.
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Dropout, Flatten
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import to_categorical

# define the `Detection` object used when reporting IOU
Detection = namedtuple("Detection", ["image_path", "gt", "pred"])
Now, let’s import the data. I normally use Google Colaboratory, so I mount my Google Drive to Colab (you can load the data in whatever way is convenient, from wherever you have saved it).
We also set the target size to (224, 224), and we will use MobileNet, ResNet and Xception as the pre-trained networks to compare.
from google.colab import drive
drive.mount('/content/drive')
data_images = '/content/drive/MyDrive/AI_dataset_pets/images'
data_ClassList = '/content/drive/MyDrive/AI_dataset_pets/annotations/list.txt'
data_xmlAnnotations = '/content/drive/MyDrive/AI_dataset_pets/annotations/xmls'
TARGET_SIZE = (224, 224)
Now let’s define the bounding box structure we will use to locate the animal in each image.
#BoundingBox
Bounding_Box = namedtuple('Bounding_Box', 'xmin ymin xmax ymax')
# The following function reads the xml annotation and returns xmin, ymin, xmax, ymax
# for building the bounding box.
def get_bounding_box(path_to_xml_annotation):
    tree = ET.parse(path_to_xml_annotation)
    root = tree.getroot()
    path_to_box = './object/bndbox/'
    xmin = int(root.find(path_to_box + "xmin").text)
    ymin = int(root.find(path_to_box + "ymin").text)
    xmax = int(root.find(path_to_box + "xmax").text)
    ymax = int(root.find(path_to_box + "ymax").text)
    return Bounding_Box(xmin, ymin, xmax, ymax)
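As an optional quick check, you can parse one annotation file and look at the box it returns; here I use the Abyssinian_10 sample that is also plotted further below:
# Quick check: parse a single annotation and print the resulting Bounding_Box
sample_xml = os.path.join(data_xmlAnnotations, 'Abyssinian_10.xml')
print(get_bounding_box(sample_xml))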
Next, let’s pad each image to make it a perfect square and apply the corresponding adjustments to the bounding box to account for the padding and rescaling.
The following function also converts the image into a NumPy array.
def resize_image_with_bounds(path_to_image, bounding_box=None, target_size=None):
    image = Image.open(path_to_image)
    width, height = image.size
    w_pad = 0
    h_pad = 0
    bonus_h_pad = 0
    bonus_w_pad = 0
    # The following determines whether padding is needed and on which sides.
    # If width > height, pad the top and bottom; if the difference is odd, one extra pixel goes on one side.
    # If height > width, pad the left and right in the same way.
    # If neither case applies, the image is already square and no padding is needed.
    if width > height:
        pix_diff = (width - height)
        h_pad = pix_diff // 2
        bonus_h_pad = pix_diff % 2
    elif height > width:
        pix_diff = (height - width)
        w_pad = pix_diff // 2
        bonus_w_pad = pix_diff % 2
    # When we pad the image to a square, we must shift the bounding box by the amounts added on the left or top.
    # The "bonus" pads always go on the bottom and right, so they do not affect the box.
    image = ImageOps.expand(image, (w_pad, h_pad, w_pad + bonus_w_pad, h_pad + bonus_h_pad))
    if bounding_box is not None:
        new_xmin = bounding_box.xmin + w_pad
        new_xmax = bounding_box.xmax + w_pad
        new_ymin = bounding_box.ymin + h_pad
        new_ymax = bounding_box.ymax + h_pad
    # We also need to apply the same scale to the bounding box that we use when resizing the image.
    if target_size is not None:
        # Width and height have changed because of the padding.
        width, height = image.size
        image = image.resize(target_size)
        width_scale = target_size[0] / width
        height_scale = target_size[1] / height
        if bounding_box is not None:
            new_xmin = new_xmin * width_scale
            new_xmax = new_xmax * width_scale
            new_ymin = new_ymin * height_scale
            new_ymax = new_ymax * height_scale
    # getdata() is row-major, so the array is reshaped to (height, width, 3).
    image_data = np.array(image.getdata()).reshape(image.size[1], image.size[0], 3)
    # The image data is a 3D array with 3 RGB channels at the target size (RGB values are 0-255).
    if bounding_box is None:
        return image_data, None
    return (image_data, Bounding_Box(new_xmin, new_ymin, new_xmax, new_ymax))
So, from the raw input we now have a resized image together with a bounding box adjusted to match.
def setting_sample_from_name(sample_name):
    path_to_image = os.path.join(data_images, sample_name + '.jpg')
    path_to_xml = os.path.join(data_xmlAnnotations, sample_name + '.xml')
    original_bounding_box = get_bounding_box(path_to_xml)
    image_data, bounding_box = resize_image_with_bounds(path_to_image, original_bounding_box, TARGET_SIZE)
    return (image_data, bounding_box)
Notice that in the prediction plots, the yellow box is the predicted bounding box and the blue box is the ground-truth (real) one.
Now let’s write the function that plots the image data along with the bounding boxes and computes the Intersection over Union (IOU) of the two boxes, where IOU = Area of Overlap / Area of Union.
The code is in the function plot_with_box below.
def plot_with_box(image_data, bounding_box, compare_box=None):
    fig, ax = plt.subplots(1)
    ax.imshow(image_data)
    # Create a rectangle patch for the predicted box
    boxA = patches.Rectangle((bounding_box.xmin, bounding_box.ymin),
                             bounding_box.xmax - bounding_box.xmin,
                             bounding_box.ymax - bounding_box.ymin,
                             linewidth=3, edgecolor='y', facecolor='none')
    # Add the patch to the Axes
    ax.add_patch(boxA)
    # Create another rectangle patch for the ground-truth box
    if compare_box is not None:
        boxB = patches.Rectangle((compare_box.xmin, compare_box.ymin),
                                 compare_box.xmax - compare_box.xmin,
                                 compare_box.ymax - compare_box.ymin,
                                 linewidth=2, edgecolor='b', facecolor='none')
        # Add the patch to the Axes
        ax.add_patch(boxB)
        # FOR FINDING INTERSECTION OVER UNION
        xA = max(bounding_box.xmin, compare_box.xmin)
        yA = max(bounding_box.ymin, compare_box.ymin)
        xB = min(bounding_box.xmax, compare_box.xmax)
        yB = min(bounding_box.ymax, compare_box.ymax)
        interArea = max(0, xB - xA + 1) * max(0, yB - yA + 1)
        boxAArea = (bounding_box.xmax - bounding_box.xmin + 1) * (bounding_box.ymax - bounding_box.ymin + 1)
        boxBArea = (compare_box.xmax - compare_box.xmin + 1) * (compare_box.ymax - compare_box.ymin + 1)
        iou = interArea / float(boxAArea + boxBArea - interArea)
        print('intersection over union =', iou)
    plt.show()
Now let’s plot a sample image to check that the resizing and the adjusted bounding box work as expected.
sample_name = 'Abyssinian_10'
image, bounding_box = setting_sample_from_name(sample_name)
plot_with_box(image, bounding_box)

Thus we have the bounding box drawn around the object.
Now let’s process the whole dataset, skip the images that have no annotations, and turn the result into a NumPy array.
processed_data = []
with open(data_ClassList) as csv_list_file:
    csv_reader = csv.reader(csv_list_file, delimiter=' ')
    for row in csv_reader:
        if row[0].startswith('#'): continue
        # Unpack for readability
        sample_name, class_id, species, breed_id = row
        # Not every image has a bounding box; some annotation files are missing, so skip those.
        try:
            image, bounding_box = setting_sample_from_name(sample_name)
        except FileNotFoundError:
            # This actually happens quite a lot, as you can see in the output.
            print(f'cannot find annotations for {sample_name}: so skipped it')
            continue
        # cat = 0 and dog = 1.
        data_tuple = (image, int(species) - 1, bounding_box)
        processed_data.append(data_tuple)
print(f'Processed {len(processed_data)} samples')
# dtype=object because each element is an (image, label, box) tuple of mixed types
processed_data = np.array(processed_data, dtype=object)
Once this is done, let’s sanity-check the result with 6 random images.
# for checking, let's print 6 of them
for _ in range(6):
    i = np.random.randint(len(processed_data))
    image, species, bounding_box = processed_data[i]
    if species == 0:
        print(i, "it is a cat")
    elif species == 1:
        print(i, "it is a dog")
    else:
        print("ERROR FOUND: This is of an invalid species type")
    plot_with_box(image, bounding_box)
And the results look like this.

Now let’s split the data into training and validation sets, keeping the class labels and the bounding boxes as separate targets.
x_train = []
y_class_train = []
y_box_train = []
x_validation = []
y_class_validation = []
y_box_validation = []
validation_split = 0.2
for image, species, bounding_box in processed_data:
    if np.random.random() > validation_split:
        x_train.append(preprocess_input(image))
        y_class_train.append(species)
        y_box_train.append(bounding_box)
    else:
        x_validation.append(preprocess_input(image))
        y_class_validation.append(species)
        y_box_validation.append(bounding_box)
x_train = np.array(x_train)
y_class_train = np.array(y_class_train)
y_box_train = np.array(y_box_train)
x_validation = np.array(x_validation)
y_class_validation = np.array(y_class_validation)
y_box_validation = np.array(y_box_validation)
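Before training, it is worth checking the array shapes; with a 0.2 split, roughly 80% of the samples should land in the training arrays:
# Sanity check: images are (N, 224, 224, 3), class labels (N,), boxes (N, 4)
print(x_train.shape, y_class_train.shape, y_box_train.shape)
print(x_validation.shape, y_class_validation.shape, y_box_validation.shape)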
Next, we use some pre-trained models via transfer learning.
First, I’m using MobileNet, building both the classifier and the localizer on top of it.
base_model = MobileNetV2(weights='imagenet', include_top=False, input_shape=(TARGET_SIZE[0], TARGET_SIZE[1], 3))
chopped_mobilenet = Model(inputs=[base_model.input], outputs=[base_model.layers[90].output])
classification_output = GlobalAveragePooling2D()(chopped_mobilenet.output)
classification_output = Dense(units=1, activation='sigmoid')(classification_output)
localization_output = Flatten()(chopped_mobilenet.output)
localization_output = Dense(units=4, activation='relu')(localization_output)
model = Model(inputs=[chopped_mobilenet.input], outputs=[classification_output, localization_output])
model.summary()
Printing the summary above gives a detailed description of the model built on MobileNet.
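The compile-and-fit step for the MobileNet model is not shown separately here, but the history1 object is used below; a sketch that mirrors the settings used later for ResNet and Xception (Adam optimizer, binary cross-entropy for the class output, MSE for the box output, loss weights [800, 1], 10 epochs) looks like this:
# Sketch: compile and fit the MobileNet-based model with the same settings as the later models
model.compile(optimizer='adam', metrics=['accuracy'], loss=['binary_crossentropy', 'mse'], loss_weights=[800, 1])
# let's run it through 10 epochs
history1 = model.fit(x_train, [y_class_train, y_box_train], validation_data=(x_validation, [y_class_validation, y_box_validation]), epochs=10, verbose=True)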
Now, let’s plot the accuracy and loss of the model for every epoch (using the plot_training_history helper defined in the ResNet section below).
plot_training_history(history1, model)
And the plots are:

The true box is in blue, and the predicted box is in yellow
for _ in range(18):
    i = np.random.randint(len(processed_data))
    img, species, true_bounding_box = processed_data[i]
    pred = model.predict(np.array([preprocess_input(img)]))
    if pred[0][0] < .5:
        print("it is a Cat")
    else:
        print("it is a dog")
    plot_with_box(img, Bounding_Box(*pred[1][0]), true_bounding_box)
Results are:


Notice that here all the images were classified as cats by MobileNet. Let’s take a couple of extra samples to check how well the model really does:
some_random_samples = ['Abyssinian_174', 'american_bulldog_59']
for sample_name in some_random_samples:
    path_to_image = os.path.join(data_images, sample_name + '.jpg')
    print(path_to_image)
    img, _ = resize_image_with_bounds(path_to_image, target_size=TARGET_SIZE)
    pred = model.predict(np.array([preprocess_input(img)]))
    if pred[0][0] < .5:
        print("Yes, it's a cat")
    else:
        print("Yes, it's a dog")
    # No ground-truth box is loaded here, so only the predicted box is drawn.
    plot_with_box(img, Bounding_Box(*pred[1][0]))

The IOU values with MobileNet are not bad, but for pictures that contain some ambiguity the IOU values are quite small.
Let’s see how it goes with ResNet and Xception.
Now let’s try the same using the ResNet pre-trained network.
base_model1 = ResNet50(weights='imagenet', include_top=False, input_shape=(TARGET_SIZE[0], TARGET_SIZE[1], 3))
chopped_resnet1 = Model(inputs=[base_model1.input], outputs=[base_model1.layers[90].output])
classification_output1 = GlobalAveragePooling2D()(chopped_resnet1.output)
classification_output1 = Dense(units=1, activation='sigmoid')(classification_output1)
localization_output1 = Flatten()(chopped_resnet1.output)
localization_output1 = Dense(units=4, activation='relu')(localization_output1)
model1 = Model(inputs=[chopped_resnet1.input], outputs=[classification_output1, localization_output1])
model1.summary()
Please go through the summary once it is printed; it will help you gain a clear understanding of the network.
Now we move on to compiling and fitting the ResNet-based model.
model1.compile(optimizer='adam', metrics=['accuracy'],loss=['binary_crossentropy', 'mse'],loss_weights=[800, 1] )
#lets run it through 10 epochs
history2=model1.fit(x_train, [y_class_train, y_box_train], validation_data=(x_validation, [y_class_validation, y_box_validation]),epochs=10,verbose=True)
history2
I do not reproduce the summary or the per-epoch validation accuracy and loss here; you will see them when you run the code. The accuracy and loss plots for every epoch are given below.
def plot_training_history(history, model):
    # The metric keys ('dense_3_...') come from Keras' auto-generated layer names,
    # so they may differ in your run; check history.history.keys() if needed.
    plt.plot(history.history['dense_3_accuracy'])
    plt.plot(history.history['val_dense_3_accuracy'])
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['training', 'validation'], loc='best')
    plt.show()
    plt.plot(history.history['dense_3_loss'])
    plt.plot(history.history['val_dense_3_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['training', 'validation'], loc='best')
    plt.show()
plot_training_history(history2, model1)

Let’s plot some images with the boxes predicted by the ResNet model, together with the IOU (Intersection over Union) of the two boxes, to see how good the prediction is.
for _ in range(3):
    i = np.random.randint(len(processed_data))
    img, species, true_bounding_box = processed_data[i]
    pred = model1.predict(np.array([preprocess_input(img)]))
    if pred[0][0] < .5:
        print("it is a Cat by ResNet")
    else:
        print("it is a dog by ResNet")
    plot_with_box(img, Bounding_Box(*pred[1][0]), true_bounding_box)

It sometimes detects a dog as a cat, but the IOU values are pretty good!
Let’s try out Xception pre-trained network now. The code for its implementation is given below.
base_model2 = Xception(weights='imagenet', include_top=False, input_shape=(TARGET_SIZE[0], TARGET_SIZE[1], 3))
chopped_Xception = Model(inputs=[base_model2.input], outputs=[base_model2.layers[90].output])
classification_output2 = GlobalAveragePooling2D()(chopped_Xception.output)
classification_output2 = Dense(units=1, activation='sigmoid')(classification_output2)
localization_output2 = Flatten()(chopped_Xception.output)
localization_output2 = Dense(units=4, activation='relu')(localization_output2)
model2 = Model(inputs=[chopped_Xception.input], outputs=[classification_output2, localization_output2])
model2.summary()
Compiling and fitting the Xception-based model:
model2.compile(optimizer='adam', metrics=['accuracy'],loss=['binary_crossentropy', 'mse'],loss_weights=[800, 1] )
#lets run it through 10 epochs
history3=model2.fit(x_train, [y_class_train, y_box_train], validation_data=(x_validation, [y_class_validation, y_box_validation]),epochs=10,verbose=True)
history3
Plotting the accuracy and loss for it.
# Redefine the helper with the metric key names generated for the Xception model.
def plot_training_history(history, model):
    plt.plot(history.history['dense_9_accuracy'])
    plt.plot(history.history['val_dense_9_accuracy'])
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['training', 'validation'], loc='best')
    plt.show()
    plt.plot(history.history['dense_9_loss'])
    plt.plot(history.history['val_dense_9_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['training', 'validation'], loc='best')
    plt.show()
plot_training_history(history3, model2)


Let’s plot some images with the boxes predicted by the Xception model, together with the IOU of the two boxes:
for _ in range(6):
    i = np.random.randint(len(processed_data))
    img, species, true_bounding_box = processed_data[i]
    pred = model2.predict(np.array([preprocess_input(img)]))
    if pred[0][0] < .5:
        print("it is a Cat by Xception")
    else:
        print("it is a dog by Xception")
    plot_with_box(img, Bounding_Box(*pred[1][0]), true_bounding_box)
Results:

Now, we will test the model with a few random samples.
# testing with random samples
some_random_samples = ['Abyssinian_174', 'american_bulldog_59']
for sample_name in some_random_samples:
    path_to_image = os.path.join(data_images, sample_name + '.jpg')
    print(path_to_image)
    img, _ = resize_image_with_bounds(path_to_image, target_size=TARGET_SIZE)
    pred = model2.predict(np.array([preprocess_input(img)]))
    if pred[0][0] < .5:
        print("Yes, it's a cat by Xception")
    else:
        print("Yes, it's a dog by Xception")
    # Again, no ground-truth box is loaded here, so only the predicted box is drawn.
    plot_with_box(img, Bounding_Box(*pred[1][0]))
Results:

Xception performs well and gives quite accurate predictions.
The IOU values look good with Xception, MobileNet and ResNet.
The final validation accuracy at the final layer obtained for MobileNet = 0.8125
The final validation accuracy at the final layer obtained for ResNet = 0.7969
The final validation accuracy at the final layer obtained for Xception = 0.8438
(You will see the exact final validation accuracy in the per-epoch results when training each model.)
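If you want to read these values out programmatically rather than from the training logs, you can inspect the history objects; note that the exact metric keys depend on Keras’ auto-generated layer names, so list them first:
# List the available metric keys for each history, then pick the classification head's
# validation-accuracy key and read its last value, e.g. history3.history['val_dense_9_accuracy'][-1]
print(history1.history.keys())
print(history2.history.keys())
print(history3.history.keys())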
All the pre-trained networks, MobileNet, ResNet and Xception, performed satisfactorily.
Accuracy-wise, MobileNet and Xception did well, but in terms of IOU the predictions fluctuated across all three networks.
For pictures with some ambiguity the IOU varies accordingly, but it was quite good in most of the pictures.
Measuring with IOU makes it clear for which pictures the prediction was poor, and to what extent the predicted bounding box differs from the real one.
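If you prefer a single summary number per model instead of reading the IOU off each plot, a small helper along these lines (the names box_iou and mean_iou are mine, reusing the same IOU arithmetic as plot_with_box) can average the IOU over the validation set:
# Sketch: average IOU over the validation set for each trained model
def box_iou(a, b):
    xA, yA = max(a.xmin, b.xmin), max(a.ymin, b.ymin)
    xB, yB = min(a.xmax, b.xmax), min(a.ymax, b.ymax)
    inter = max(0, xB - xA + 1) * max(0, yB - yA + 1)
    areaA = (a.xmax - a.xmin + 1) * (a.ymax - a.ymin + 1)
    areaB = (b.xmax - b.xmin + 1) * (b.ymax - b.ymin + 1)
    return inter / float(areaA + areaB - inter)

def mean_iou(m, x_val, y_box_val):
    box_preds = m.predict(x_val)[1]  # second output head holds the box coordinates
    return np.mean([box_iou(Bounding_Box(*p), Bounding_Box(*t)) for p, t in zip(box_preds, y_box_val)])

for name, m in [('MobileNet', model), ('ResNet', model1), ('Xception', model2)]:
    print(name, 'mean IOU:', mean_iou(m, x_validation, y_box_validation))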
Observations from the accuracy and loss plots of each model
For the first pre-trained model, MobileNet, the training accuracy climbed above the validation accuracy after the initial epochs, and the validation loss was also high.
For the second pre-trained model, ResNet, the training accuracy was well above the validation accuracy from the very first epoch, and the validation loss was very high.
For the third pre-trained model, Xception, the training accuracy moved above the validation accuracy from around the 4th epoch, and the validation loss was also high.
That is, the models are getting overtrained (overfitting).
Going by IOU, all these models performed quite well except on a few confusing images.
And with Xception, some pictures give very good IOU values too!
Overall, Xception was observed to be performing well for this dataset.
I hope this article has given you an idea of how to perform object localization on a dataset and how to experiment with various pre-trained networks such as MobileNet, ResNet and Xception.
The results may vary with the dataset you choose, but this approach will help you make the most of your data by running such experiments and settling on the best pre-trained network for the context of your study.