
Object detection is one of the most popular and widely used computer vision methods today. Unlike common classification problems, the goal is not only to determine whether an object is present in the image but also to point out its location, which makes it the right approach when multiple objects may appear simultaneously in the same image.
One of the challenges of this method is creating the dataset: the position of every object in every image has to be set by hand, which takes a lot of time across a large number of observations.
This process is inefficient, expensive, and time-consuming, especially in problems that require labeling dozens of objects per image or demand specialized knowledge.
Based on this, I created a TensorFlow Semi-supervised Object Detection Architecture (TSODA) to iteratively train an object detection model and use it to automatically label new images based on a confidence threshold, aggregating them into the subsequent training process.
In this article, I’ll show you the steps needed to reproduce this approach in your object detection project. With it, you’ll be able to create labels for your images automatically while measuring model performance!
Table of contents:
- How TSODA Works
- Example Application
- Implementation
- Results
- Conclusion
How TSODA works
TSODA works like any other semi-supervised method: training relies on both labeled and unlabeled data, unlike the most common, fully supervised approach.
An initial model is trained on the strongly labeled data created by hand, learns features from it, and then runs inference on the unlabeled data so the newly labeled images can be aggregated into a new training process.
The whole idea can be illustrated by the following image:

This operation is repeated until a stop criterion is reached: either a fixed number of executions or no remaining unlabeled data.
As we saw in the schema, a confidence threshold of 80% was initially configured. This is an important parameter, because the new images will feed a new training process, and incorrectly labeled images would create undesirable noise, undermining model performance.
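In code terms, the loop can be sketched as below. This is just a minimal illustration of the idea, assuming hypothetical train_and_checkpoint and run_inference helpers rather than the actual scripts:

```python
# Hand-labeled pairs of (image, annotations) and images with no labels yet.
# These names and helpers are placeholders, not the repository's real code.
CONFIDENCE_THRESHOLD = 0.8  # predictions below this score are discarded
MAX_ITERATIONS = 5          # stop criterion: maximum number of executions

for iteration in range(MAX_ITERATIONS):
    # train on every labeled example available so far and save a checkpoint
    model = train_and_checkpoint(labeled_data)

    # let the current model try to label the images a human never touched
    newly_labeled = []
    for image in list(unlabeled_images):
        detections = run_inference(model, image)
        confident = [d for d in detections if d.score >= CONFIDENCE_THRESHOLD]
        if confident:
            newly_labeled.append((image, confident))
            unlabeled_images.remove(image)

    if not newly_labeled:  # second stop criterion: nothing new to aggregate
        break

    # aggregate the machine-labeled images into the next training round
    labeled_data.extend(newly_labeled)
```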
The purpose of TSODA is to introduce a simple and fast way to use semi-supervised learning in your object detection project.
Example Application
To exemplify the approach and test whether everything works properly, a random sample of 1,100 images from the Asirra dataset was drawn, with 50% per class.
The images were labeled manually for later comparison; you can download the same data on Kaggle.
I used the Single Shot MultiBox Detector (SSD) as the object detection architecture and Inception as the base network, instead of VGG-16 as in the original paper.
SSD and Inception offer a good trade-off between training speed and accuracy, so I think they are a great starting point. Speed matters here because in each iteration TSODA needs to save a checkpoint of the trained model, infer new images, and load the model to train it again, so faster training means more iterations and more images aggregated into the learning process.
Testing performance
To test TSODA’s performance, just 100 labeled images of each class were provided to split into training and test, while 900 were left unlabeled, simulating a situation where only a little time was spent creating the labeled dataset. The obtained results were compared to a model trained with all the manually labeled images.
The data were randomly split into 80% of images for training and 20% for testing.
Implementation
As the name suggests, the whole architecture is built on the TensorFlow environment, in version 2.x.
This new TF version is not yet fully compatible with the Object Detection API, and some parts were difficult to adapt, but in the next months it will become the default and most used version of TF in all projects, which is why I think it’s important to adapt the code to use it.
To create TSODA, new scripts and folders were added in a fork of the TF Model Garden repository, so you can clone it and run your semi-supervised project with just small modifications, besides it being a familiar structure for those who work with TF.
You can clone my repository to easily follow these steps or adapt your TF model repository.
The work was done inside models/research/object_detection, where you will find the following folders and files:
- inference_from_model.py: executed to use the trained model to infer new images.
- generate_xml.py and generate_tfrecord.py: both used to create the train and test TF records used in the training of the object detection model (these scripts are adapted from the raccoon dataset).
- test_images and train_images folders: hold the JPG images and XML files that will be used.
- unlabeled_images and labeled_images folders: contain, respectively, all images without labels and the images automatically labeled by the algorithm, which are later divided into the training and test folders to keep the proportion ratio.
Inside the utils folder there are also two relevant files:
- generate_xml.py: responsible for taking the model inference and generating a new XML file, which is stored inside the labeled_images folder.
- visualization_utils.py: this file has some modifications to capture the model inference and pass it to the generateXml class.
That’s it, this is all you need to have in your repository!
Preparing Environment
To run this project you will need... nothing!
The training process runs in a Google Colab notebook, so it’s fast and simple to train your model: you’ll literally just need to replace my images with yours and choose another base model if you want.
Make a copy of the original Colab Notebook to your Google Drive and execute it.
If you really want to run TSODA on your own machine, you’ll find the installation requirements at the beginning of the Jupyter notebook; just follow them, but don’t forget to also install TF 2.x. I recommend creating a virtual environment.
Understanding the code
The inference_from_model.py script is responsible for loading the saved_model.pb created during training and using it to make new inferences on the unlabeled images. Most of the code was adapted from object_detection_tutorial.ipynb, found in the colab_tutorials folder.
If you don’t want to use Colab for training, you’ll need to replace the paths at the beginning of the file.
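For context, the core of that inference step in TF 2.x boils down to something like the sketch below; the paths are illustrative, and the real ones sit at the top of the file:

```python
import numpy as np
import tensorflow as tf
from PIL import Image

# load the exported detection model (illustrative path)
detect_fn = tf.saved_model.load("exported_model/saved_model")

# run one unlabeled image through the model (illustrative path)
image = np.array(Image.open("unlabeled_images/dog_001.jpg"))
input_tensor = tf.convert_to_tensor(image)[tf.newaxis, ...]
detections = detect_fn(input_tensor)

# boxes are normalized to [0, 1]; scores come sorted in descending order
boxes = detections["detection_boxes"][0].numpy()
scores = detections["detection_scores"][0].numpy()
classes = detections["detection_classes"][0].numpy().astype(np.int64)
```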
Another important method in this file is partition_data, which is responsible for splitting the inferred images (placed in the labeled_images folder) into training and test sets to keep the same ratio.
One change you may want to make is the split ratio: in my case I chose an 80/20 proportion, but if you want something different, you can set it in the method parameter.
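A simplified version of that logic looks like this, assuming JPG images with matching XML annotations (the real method lives in inference_from_model.py):

```python
import os
import random
import shutil

def partition_data(labeled_dir, train_dir, test_dir, train_ratio=0.8):
    """Split automatically labeled images into train/test, keeping the ratio."""
    images = [f for f in os.listdir(labeled_dir) if f.lower().endswith(".jpg")]
    random.shuffle(images)
    n_train = int(len(images) * train_ratio)
    for i, name in enumerate(images):
        dest = train_dir if i < n_train else test_dir
        # move the image together with its XML annotation file
        shutil.move(os.path.join(labeled_dir, name), dest)
        shutil.move(os.path.join(labeled_dir, name[:-4] + ".xml"), dest)
```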
visualization_utils.py is where the bounding boxes are drawn onto the image, so we hook into it to get the boxes’ positions, class names, and file name, and pass them to our XML generator. The sketch below shows the gist of the process (variable and function names are illustrative, not the exact code):
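```python
# Simplified hook inside visualization_utils.py; variable names follow the
# TF OD API conventions, but this is a sketch, not the exact repository code.
detected_boxes, detected_classes = [], []
for i in range(boxes.shape[0]):
    if scores[i] >= min_score_thresh:  # confidence threshold check
        detected_boxes.append(boxes[i])  # (ymin, xmin, ymax, xmax), normalized
        detected_classes.append(category_index[classes[i]]["name"])

# only hand things to the XML generator if a confident box was found
if detected_boxes:
    generate_xml(file_name, image_width, image_height,
                 detected_boxes, detected_classes)
```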
An XML file is generated whenever a box is detected in the image with a confidence level higher than the specified threshold.
All the information arrives in generate_xml.py, and the XML is created using ElementTree.
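A condensed example of that generation with ElementTree, following the Pascal VOC layout the TF record scripts expect (this helper is a sketch, not the exact class from the repo):

```python
import xml.etree.ElementTree as ET

def write_voc_xml(file_name, width, height, boxes, class_names, out_path):
    """Write a Pascal VOC-style annotation file for one image."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = file_name
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    # one <object> entry per detected box, in pixel coordinates
    for (xmin, ymin, xmax, ymax), name in zip(boxes, class_names):
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        box = ET.SubElement(obj, "bndbox")
        ET.SubElement(box, "xmin").text = str(int(xmin))
        ET.SubElement(box, "ymin").text = str(int(ymin))
        ET.SubElement(box, "xmax").text = str(int(xmax))
        ET.SubElement(box, "ymax").text = str(int(ymax))
    ET.ElementTree(root).write(out_path)
```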
Inside the code there are comments that will help you understand how everything works.
Results
To evaluate model performance, the mean Average Precision (mAP) was used; if you have any doubt about how it works, check out this.
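In short, AP is the area under the precision-recall curve for a single class at a given IoU threshold, and mAP averages it over the N classes:

$$\mathrm{AP}_c = \int_0^1 p_c(r)\,dr \qquad \mathrm{mAP} = \frac{1}{N}\sum_{c=1}^{N}\mathrm{AP}_c$$

where $p_c(r)$ is the precision of class $c$ at recall level $r$.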
The first test trained a model for 4,000 epochs, using all the strongly labeled images.
The training took about twenty-one minutes and the results are shown in Table 1.

As expected, the model achieved a high mAP, mainly at the lower IoU threshold.
The second test used the same configuration, but with TSODA and just the 100 labeled images per class. In each iteration, the model was trained for 1,000 epochs and then used to infer and create new labeled images. The results are shown in Figure 2.

The whole training process took thirty-eight minutes, about seventeen minutes more than the previous one, and the model reached a worse final mAP, as shown in Table 2:

As Table 3 reveals, most images were successfully annotated in the first iteration and aggregated into the training. This could mean the minimum confidence threshold isn’t high enough, since during the first 1,000 epochs the model hasn’t properly converged yet, possibly creating wrong annotations.

TSODA requires more time and epochs to improve model performance and get close to the original method. This happens because adding new images to the training set leads to a temporary loss in mAP, since the model needs to learn how to generalize the new patterns, as shown in Figure 2: the mAP decreases as new images are included, before increasing again once the model learns the new features.
Figure 3 shows some examples of automatically annotated images. Notably, some labels are not marked very precisely, but they are enough to give the model more information.

Some new experiments were performed considering a different epoch-increment behavior as well as a higher confidence threshold. The results are presented in Table 4:

Setting the confidence threshold to 90% ensures a higher chance of correct labels in the predictions, an important factor for model convergence. In addition, training ran for 2,500 epochs in the initial iteration instead of just 1,000, since the first iteration is where most images are labeled, and the model needs to learn enough features to beat the higher confidence threshold. After the first iteration, each subsequent one increments the count by 1,500 epochs, up to a limit of 8,500. These new configurations improved the final results.
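In other words, the per-iteration epoch budget follows a simple schedule; the sketch below is my reading of that configuration:

```python
def epochs_for_iteration(i, first=2500, increment=1500, limit=8500):
    """Epochs to train at iteration i (0-indexed) before inferring new labels."""
    return min(first + i * increment, limit)

# iterations 0..5 -> 2500, 4000, 5500, 7000, 8500, 8500
```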
TSODA may perform differently depending on the kind of object of interest and its complexity. The results could be improved by training for more epochs or setting a higher confidence threshold, with the drawback of increasing training time. The epoch increment per iteration must also change depending on the problem, to control model convergence based on the number of unlabeled images and the threshold.
Nevertheless, this is a good alternative, since training time is cheaper than the manual labeling time that requires a human, and TSODA was built in such a way that, with just a few modifications, it’s possible to train a completely new large-scale model from scratch.
The auto-created labels can also be manually adjusted in some images, which improves overall performance and is still faster than creating all the labels manually.
Conclusion
The proposed TSODA can achieve satisfactory results in creating new labels for unlabeled images, reaching results similar to a strongly labeled training approach but with considerably less human effort. The solution is also adaptable to any other CNN detector architecture and is easy and fast to implement, helping the dataset creation process while measuring overall object detector performance.
References
For more details and context about this semi-supervised project, see my preprint.