
“Nano-YOLO” – insights on the multi-part loss function of a simplified YOLO v1



What we would have loved to know when starting to experiment with object detection

(Image by author)

In this article, we describe our approach and the lessons we learned while training a simplified YOLO v1 model (we call it "Nano") to detect objects. Our main driver for doing so was curiosity – we wanted to see if we were able to dig deep enough into object detection to build and train our own model with our own loss function on our own data. We used TensorFlow 2.3.0 for building and training the model.

Since we wanted to learn and try things out, we chose to make our own decisions whenever we did not fully understand what the original YOLO v1 paper aimed for. We chose an input size of 112×112 pixels to train the model (the original YOLO v1 used 448×448), we adapted the model architecture to the changed size, and we changed parts of the loss function. Our goal was to detect three simple objects on our desk – a toy car, a sharpener and a toolset.

In the following sections, we describe how we

  • generated our datasets for training and validation including automated labeling,
  • built the model and implemented the loss function,
  • trained the model to achieve our desired accuracy.

With our "Nano-YOLO" model, we were able to detect three different objects (toy car, sharpener and toolset) on a constant plain background (our apartment floor) with an accuracy of 0.81 (VOC2007 mAP@0.8). That is not much, but it gave us massive insights into how object detection works, and we want to share this with everyone who has not made this step yet.

Some words on YOLO v1 (if you haven’t heard of it before)

YOLO v1 tries to detect objects within an image with a single forward pass through a model, predicting the object class, its center position and its bounding box within the image.

This is done by breaking down the whole image into grid cells (typically 7×7) and answering, for each of these grid cells, which object class it holds, together with a best guess (from the viewpoint of that cell) of where the object is centered within that cell and how it is bounded across the whole image. As many grid cells might claim larger objects, post-processing is needed to suppress grid cells which claim to have detected an object but are not the best ones compared to others.

This leads to quite good detections of objects within an image.
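
This suppression step is not covered further in the article; as a minimal sketch (the function name, tensor shapes and thresholds are assumptions), TensorFlow's built-in non-max suppression can be used:

import tensorflow as tf

def suppress_duplicates(boxes, scores, max_detections=10, iou_threshold=0.5):
    # boxes: (N, 4) as (y_min, x_min, y_max, x_max); scores: (N,) confidences
    # taken from the grid cells that claim an object.
    keep = tf.image.non_max_suppression(boxes, scores, max_detections, iou_threshold)
    # Keep only the strongest, non-overlapping claims.
    return tf.gather(boxes, keep), tf.gather(scores, keep)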

Creating the training set

We created our own training set with pictures of the objects (toy car, toolset and sharpener) and different pictures of the background (apartment floor). As we wanted to work with a minimal set of images, we used data generation to place the same objects at random positions on the background, creating a set of 6000 images.

To achieve this, we cut out the objects with OpenCV methods (not part of this article), chose a random part of a background picture and placed the object at a random position on that picture. The object position, the placed object type and the bounding box are stored in a text file to be used as our label later.
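
Since the cutout step itself is not part of this article, the following sketch assumes the object is already available as an RGBA cutout (alpha channel as mask) and that the background picture is larger than the generation size; all names and values are illustrative:

import random
import cv2
import numpy as np

def compose_sample(background, object_rgba, class_id, gen_size=448, out_size=112):
    # Crop a random square patch from the background picture.
    h, w = background.shape[:2]
    y0, x0 = random.randint(0, h - gen_size), random.randint(0, w - gen_size)
    canvas = background[y0:y0 + gen_size, x0:x0 + gen_size].copy()

    # Paste the object cutout (RGBA, alpha channel as mask) at a random position.
    oh, ow = object_rgba.shape[:2]
    py, px = random.randint(0, gen_size - oh), random.randint(0, gen_size - ow)
    alpha = object_rgba[:, :, 3:4] / 255.0
    roi = canvas[py:py + oh, px:px + ow]
    canvas[py:py + oh, px:px + ow] = (alpha * object_rgba[:, :, :3]
                                      + (1.0 - alpha) * roi).astype(np.uint8)

    # Downsize to the model input size and scale the bounding box accordingly.
    image = cv2.resize(canvas, (out_size, out_size), interpolation=cv2.INTER_AREA)
    scale = out_size / gen_size
    label = (class_id, px * scale, py * scale, (px + ow) * scale, (py + oh) * scale)
    return image, label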

Additionally, we augmented the created images to increase the robustness of the trained model. We used the imgaug library, which has very good documentation showing the effect of the different types of augmentations. Care must be taken not to augment in a way that invalidates the created labels – e.g., perspective transformation – or that generates images which will never occur in the intended usage of the model – e.g., flipping upside down, as our toy car will not be glued to a ceiling.

We used the following augmentations (a sketch of such an imgaug pipeline follows the list):

  • Flipping the object left to right (before placing on the background)
  • Motion blur to apply effects of movement to the image
  • Brightness to change colors slightly to appear in different light settings
  • Sharpen edges of the object
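
A pipeline of this kind could look roughly as follows with imgaug; the probabilities and parameter ranges are illustrative assumptions, not the exact values used:

import numpy as np
import imgaug.augmenters as iaa

# The left-right flip is applied to the object cutout before it is placed on
# the background, so it is not part of this pipeline.
augmentation = iaa.Sequential([
    iaa.Sometimes(0.3, iaa.MotionBlur(k=7)),                  # movement effects
    iaa.Sometimes(0.3, iaa.MultiplyBrightness((0.8, 1.2))),   # different light settings
    iaa.Sometimes(0.3, iaa.Sharpen(alpha=(0.0, 0.5))),        # sharpen object edges
])

generated_images = np.zeros((4, 112, 112, 3), dtype=np.uint8)  # placeholder batch
augmented_images = augmentation(images=generated_images)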

The following picture shows the images created: the original and the augmentations motion blur, brightness and sharpen.

Images: original, motion blur, color brightness and sharpen (Image by author)

Each augmentation is applied to each image with a certain probability, so each generated image is unique with high probability.

For our example training, we used 4 original images – one image of the apartment floor and one for each object. To achieve a high quality in image generation, the images were generated at 448×448 pixels and afterwards downsized to the desired input size of 112×112.

Model setup

We implemented the model architecture roughly as described in the original YOLO v1 paper (link) – though we applied some changes:

  1. We reduced the input size of the images and adapted the number of layers and max poolings accordingly to get from 112×112 (the image size) to 7×7 (the grid size)
  2. We added a class for "empty" to depict our background without any object on it and changed the loss function accordingly
  3. We removed most of the hyperparameters in the loss function and replaced those with calculated parameters based on statistics of the current batch

In the following, we will describe in detail how we set up the model and how we understood and implemented the loss function.

Model architecture

We set up 2 models:

  • "classification", for classification pretraining
  • "detection", for detecting the objects

Both models share a "common base" model, which makes up the biggest part of the graph. The head for classification or detection is then simply added to that common base model (watch out: the ‘head’ – which generates the output of a model – is typically shown at the bottom of a drawn model graph, or on the right if the graph is drawn horizontally).

"Common base" model for classification and detection (Image by author)
"Common base" model for classification and detection (Image by author)

Our common base model uses stride-1 convolutions, each followed by a max pooling to reduce the spatial resolution. This is the reduction technique used in the original YOLO v1 paper. One could also achieve this reduction with stride-2 convolutions.
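
A minimal sketch of such a base, assuming illustrative filter counts (the exact layer configuration is not listed here), could look like this:

import tensorflow as tf
from tensorflow.keras import layers

def build_common_base(input_size=112):
    # Filter counts and the number of convolutions per block are assumptions;
    # the pattern (stride-1 convolution followed by max pooling) is the point.
    inputs = tf.keras.Input(shape=(input_size, input_size, 3))
    x = inputs
    for filters in (32, 64, 128, 256):   # four poolings: 112 -> 56 -> 28 -> 14 -> 7
        x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
        x = layers.LeakyReLU(alpha=0.1)(x)
        x = layers.MaxPooling2D(pool_size=2)(x)
    return tf.keras.Model(inputs, x, name="common_base")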

Adding the classification head

For the classification model, we flattened the output of the common base model and applied a dropout with 0.5 before finishing with a dense layer of 4 (our classes "nothing", "toy car", "sharpener", "toolset") with a softmax activation.

The softmax was possible as we did not generate images with more than one object in them.

The original YOLO v1 paper featured an additional dense layer and an average pooling layer for the head, which we tried first but skipped – we did not see any benefit or improvement from these additional layers when training on our dataset.
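
Putting the description above into code, a sketch of the classification head could look like this (built on top of the common base sketch from the previous section):

import tensorflow as tf
from tensorflow.keras import layers

def add_classification_head(common_base, num_classes=4):
    # Flatten, dropout 0.5 and a softmax over the four classes
    # ("nothing", "toy car", "sharpener", "toolset").
    x = layers.Flatten()(common_base.output)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(common_base.input, outputs, name="classification")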

Adding the detection head

For the detection model, we added the following layers to the base model (a sketch of this head follows the list):

  1. Two convolutional layers with a leaky ReLU activation function, weight decay, and 512 and 128 filters respectively
  2. Flatten and dropout of 0.6
  3. Dense layer (fully connected) of size 1024 with a tanh activation function
  4. Dense layer of size 686 (that is 7 vertical grid cells × 7 horizontal grid cells × (4 one-hot class encodings per cell + 2 boxes × (1 confidence + 4 box dimensions)) = 7 × 7 × 14 = 686)
  5. Custom reshape layer to output classes, boxes and box confidences with different value ranges (classes with softmax, boxes and box confidences with sigmoid)
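
A sketch of this detection head, following the list above (the weight decay value and the stand-in for our custom reshape layer are assumptions):

import tensorflow as tf
from tensorflow.keras import layers, regularizers

def add_detection_head(common_base, grid=7, num_classes=4, num_boxes=2):
    wd = regularizers.l2(5e-4)   # the weight decay value is an assumption
    x = layers.Conv2D(512, 3, padding="same", kernel_regularizer=wd)(common_base.output)
    x = layers.LeakyReLU(alpha=0.1)(x)
    x = layers.Conv2D(128, 3, padding="same", kernel_regularizer=wd)(x)
    x = layers.LeakyReLU(alpha=0.1)(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.6)(x)
    x = layers.Dense(1024, activation="tanh")(x)
    x = layers.Dense(grid * grid * (num_classes + num_boxes * 5))(x)   # 7*7*14 = 686
    x = layers.Reshape((grid, grid, num_classes + num_boxes * 5))(x)

    # Stand-in for the custom reshape layer: softmax over the class entries,
    # sigmoid over box confidences and box dimensions.
    classes = layers.Activation("softmax")(x[..., :num_classes])
    boxes = layers.Activation("sigmoid")(x[..., num_classes:])
    outputs = layers.Concatenate(axis=-1)([classes, boxes])
    return tf.keras.Model(common_base.input, outputs, name="detection")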

The convolutional layers are needed to learn features for detection on top of the features of the common base model. There must be some room for this learning, and these additional convolutions provide it. The dense layer brings all the data together and provides the capacity to learn the required detection outputs.

Loss function

Now, let’s have a look at the loss function. For object detection based on YOLO v1, a prediction will at first be far from perfect, and errors are made in three different ways:

  1. Wrong prediction of the object class
  2. Wrong prediction of the position and bounding box of the object
  3. Wrong confidence about which internal box representation is the correct one for the object

Each error needs to be penalized by a loss to train the model, so the overall loss is composed of these three parts – a so-called Multi-Part Loss Function.

In each training step, this loss is calculated together with the gradients with respect to the model parameters. Taking one step against the direction of the gradient then decreases the loss.

One thing we dived deeply into while building our "Nano-YOLO" was how the loss function is composed of these three parts and how the parts interact with each other. It turns out that most hyperparameters described for YOLO v1 are just weightings for mixing these three parts into a single loss value.

Visualizing the parts of the loss function

As the typical "loss" value in the training history only shows the combined loss of the multi-part loss function, we added additional data writers to log these partial losses.

Within the class holding our actual loss function ‘yolo_loss’, we added an attribute "self.lastClassLoss" which stores the last value of the class loss before this loss is summed up with all the other partial losses at the end of ‘yolo_loss’.

class YOLOLossFunction():
   def __init__(self, ...):
      ...
      self.lastClassLoss = 0
      ...
   def yolo_loss(self, groundtruth, prediction):
      ...
      # keep the partial loss so it can be reported as a metric later
      self.lastClassLoss = class_loss
      ...
      return (class_loss + ... )

To be able to write this "lastClassLoss" to the history while training, we need a simple function that takes labels and predictions as input and returns just this attribute:

class YOLOLossFunction():
   ...
   def classLoss(self, y_true, y_pred):
      return self.lastClassLoss
   ...

We just need to add this function to the metrics parameter while compiling the model to get the return values written to the history with each training step:

yololoss = YOLOLossFunction(...)
detmodel.compile(
  ...,
  metrics=[..., yololoss.classLoss, ...]
)

We did this with each part of the multi-part loss function.

Plot of validation loss of each part of the multi-part loss function (Image by author)

With this information you can clearly see how your multi-part loss function works – which part is optimized first and which part does not move at all. As the optimizer strives for the easiest way to lower the loss, some parts might completely outweigh others if they are not leveled. With these insights, you can adjust the hyperparameters to your needs for your specific dataset and steer what should be optimized first. Within our loss function, we managed to start by optimizing the class prediction before moving on to the center position and then the box and its confidence (the confidence is still not working correctly in our example, but maybe only because there is just no need to "specialize" the two boxes in one grid cell on different classes, as mentioned in the original YOLO v1 paper).

Class loss

The class loss is the first to be optimized by the loss function – it is the easiest part, as all features are already available in the common base model and the new head just needs to learn some new "routings" for these features. We also shaped our dataset to make this easier by using a nearly uniform background that does not change much within the dataset. This enabled us to introduce an "empty" class to depict the background without an object on it.

We also had images within the dataset without any object on them at all.

We implemented the class prediction, including the empty class, with a softmax activation on the last layer – not sigmoid or linear as described in the YOLO v1 paper.

Since most of the picture consists of empty grid cells (only one grid cell is "hot" within our training data), we do not want the model to simply guess "empty" for all cells to minimize the loss. Therefore, we calculate the weighting between "object" and "empty" on the fly: as a normalization, we divide the "empty" class prediction loss by the number of empty grid cells and the class loss for any other class by the number of grid cells responsible for objects (those containing an object center) – of course over all items in a batch.
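
A sketch of how such a normalization could be computed in TensorFlow – the squared-error formulation and the tensor layout are assumptions for illustration:

import tensorflow as tf

def weighted_class_loss(class_true, class_pred, object_mask):
    # class_true / class_pred: (batch, 7, 7, 4) one-hot / softmax class vectors,
    # object_mask: (batch, 7, 7) with 1.0 where a cell contains an object center.
    per_cell = tf.reduce_sum(tf.square(class_true - class_pred), axis=-1)

    n_object_cells = tf.maximum(tf.reduce_sum(object_mask), 1.0)
    n_empty_cells = tf.maximum(tf.reduce_sum(1.0 - object_mask), 1.0)

    # Normalize each part by its own cell count over the whole batch, so that
    # the many empty cells cannot outweigh the few object cells.
    object_part = tf.reduce_sum(per_cell * object_mask) / n_object_cells
    empty_part = tf.reduce_sum(per_cell * (1.0 - object_mask)) / n_empty_cells
    return object_part + empty_part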

Box loss

As a reminder, the box loss depicts how well the center of an object is found and how well the predicted bounding box covers the object. The second part reminded us of IoU (Intersection over Union) and the first part of the Euclidean distance between center points – and this is exactly how we implemented it. IoU alone would not do the job, as there might be no overlap at all at the beginning of training; minimizing the distance first leads to an overlap, which can then be optimized.
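
A sketch of this combination (equal weighting of the two terms is an assumption; boxes and centers are given in relative image coordinates):

import tensorflow as tf

def iou(box_true, box_pred):
    # Boxes as (x_min, y_min, x_max, y_max), relative to the image.
    x1 = tf.maximum(box_true[..., 0], box_pred[..., 0])
    y1 = tf.maximum(box_true[..., 1], box_pred[..., 1])
    x2 = tf.minimum(box_true[..., 2], box_pred[..., 2])
    y2 = tf.minimum(box_true[..., 3], box_pred[..., 3])
    intersection = tf.maximum(x2 - x1, 0.0) * tf.maximum(y2 - y1, 0.0)
    area_true = (box_true[..., 2] - box_true[..., 0]) * (box_true[..., 3] - box_true[..., 1])
    area_pred = (box_pred[..., 2] - box_pred[..., 0]) * (box_pred[..., 3] - box_pred[..., 1])
    return intersection / tf.maximum(area_true + area_pred - intersection, 1e-7)

def box_loss(center_true, center_pred, box_true, box_pred):
    # The Euclidean distance between centers gives a gradient even when the
    # boxes do not overlap yet; (1 - IoU) then refines width and height.
    distance = tf.norm(center_true - center_pred, axis=-1)
    return distance + (1.0 - iou(box_true, box_pred))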

During training, you can see how the distance loss is minimized first, estimating the correct center of the object, before the height and width of the bounding box are learned. This leads to results which prefer a correct center location over a correct bounding box. Again, we weight the loss by counting only images within the batch which contain objects at all.

Confidence loss

The confidence indicates how sure the algorithm is about its estimation. Within YOLO v1, it is used to encode whether a box within a grid cell contains an object at all. We have replaced this with our concept of the "empty" class, so the confidence only indicates which of the two boxes per grid cell should be evaluated in the end.

As mentioned before, the YOLO v1 paper argues that this also enables cells to "specialize" on certain classes. We have seen examples of this (when looking at the values of the other box), while our overall results show that most of the time this box confidence just answers "it does not matter which box you choose, as both are quite good".

We again weight this loss part by the number of objects within a batch.

Weighting partial losses according to the objects in images and batches

While experimenting with our model and dataset, we figured out that unbalanced numbers of objects (their presence in images, the types of classes) can lead to unbalanced losses. Therefore, we added our concept of (what we call) "data composition loss normalization", which really improved our results and – more importantly – made it feel natural to understand what the model tries to learn: as there are many empty grid cells within one image and even more empty grid cells over the whole batch (which may contain completely empty images), it is no achievement to predict a cell as "empty". This part therefore needs to be weighted down against the few occurrences of small objects within the images, which in turn boosts the loss on those few small objects significantly.

Summing up the parts to one loss per batch

Finally, the loss function sums up the three loss values – the loss for predicting the class label, the loss for predicting the object position and size (box loss) and the loss for predicting the box confidence value.

Training

Training "Nano-YOLO" with our own data from scratch is a two-step approach – first, we trained the classification model to classify the different objects ("nothing", "toy car", "sharpener", "toolset"), then we continued with the detection (object location and size).

Pretrain the classifier

We trained the classifier with the ADAM optimizer and the sparse categorical crossentropy loss function predefined by TensorFlow. We left the configuration of the ADAM optimizer at its defaults: a learning rate of 0.001, an exponential decay rate of 0.9 for the first moment, an exponential decay rate of 0.999 for the second moment, and a numerical stability constant epsilon of 1e-07. We did not use the AMSGrad variant of the optimizer.
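
In Keras terms, this pretraining setup corresponds roughly to the following (clsmodel is built from the sketches above; the actual code differs in detail):

import tensorflow as tf

# Classification model assembled from the earlier sketches.
clsmodel = add_classification_head(build_common_base())
clsmodel.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                       beta_2=0.999, epsilon=1e-07, amsgrad=False),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=["accuracy"],
)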

After 6 epochs (6000 images, batch size 64) – we reached a validation accuracy of 0.99. The model can classify similar but unseen images.

This result looks impressive but must be taken with caution as we generated our data set out of very few actual images.

"Nano-YOLO" – detection

We kept the trained "common base" model and added the detection head to train our "Nano-YOLO".

We trained the model with a stochastic gradient descent (SGD) optimizer and the previously explained loss function. For all training epochs, we used the optimizer with the initially set values – a momentum of 0.6 and a learning rate of 0.003. All other settings were left at their default values.
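
In Keras terms, this setup corresponds roughly to the following (detmodel is the detection model, YOLOLossFunction the loss class shown earlier with its constructor arguments elided):

import tensorflow as tf

yololoss = YOLOLossFunction(...)
detmodel.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.003, momentum=0.6),
    loss=yololoss.yolo_loss,
    metrics=[yololoss.classLoss],
)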

After 250 epochs with a batch size of 32 and 6000 images, we reached an accuracy of 0.81 (VOC2007 mAP@0.8 = 0.81).

To reach this training result, we tried several learning rates from 0.001 to 0.01 as well as momentum values from 0.3 to 0.8. Additionally, we tried the ADAM optimizer with similar adjustments to the learning rate. We experienced the ADAM optimizer to be quite unstable at the beginning and – once the loss was small and close to the optimum – prone to random, sudden explosions in the loss values (the problem of "exploding gradients" is well known and described, but we have not dived into it yet and wanted to continue with our object detection approach).

Results

The object detection works quite well, though the classification of the toolset and the toy car could be more accurate and robust, especially on "real world" images with more than one object in them. E.g., the result image below was taken with different lighting as well as three objects in one image – something the model has not seen during training. In this image, the model detected and classified the sharpener and the toolset correctly – though the toy car was classified as "toolset".

Detected bounding boxes on image never seen during training (Image by author)

The perspective of the car is completely different from all car perspectives seen during training. The bounding box, though, is still quite accurate.

To measure the accuracy of our Nano-YOLO model, we implemented the mean average precision from the VOC challenge (year 2007) as a TensorFlow metric (vectorized calculation), taking into account our special case of only one object per image.
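
As a plain NumPy reference (our metric itself is a vectorized TensorFlow implementation), the VOC2007 11-point interpolation looks roughly like this; the recall and precision arrays come from sorting detections by confidence and marking each one as true or false positive at the chosen IoU threshold, and the mAP is the mean of the per-class AP values:

import numpy as np

def voc2007_average_precision(recall, precision):
    # 11-point interpolation from the VOC2007 challenge: average the maximum
    # precision reached at the recall levels 0.0, 0.1, ..., 1.0.
    ap = 0.0
    for threshold in np.linspace(0.0, 1.0, 11):
        above = precision[recall >= threshold]
        ap += (above.max() if above.size else 0.0) / 11.0
    return ap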

With this metric, our Nano-YOLO version has an accuracy of 0.81 at a minimum required IoU threshold of 0.8 (VOC2007 mAP@0.8 = 0.81). The graph below shows how the mean average precision changes for different IoU thresholds in steps of 0.1, over all classes. We did not plot the recall-over-precision curves for each threshold as usually seen in articles and papers.

Mean average precision plotted for several IoU thresholds between 0 and 1 (Image by author)

Learnings

"Described hyperparameters have been chosen for a specific dataset/challenge – estimate your own or calculate them on the fly."

For our dataset, the hyperparameters depend on the images selected in each batch and required an adaptation depending on the number of objects and "no objects" in the batch. The coverage of the images by the objects also matters – in VOC2007 the objects cover a large part of the images compared to our small objects. We tried the original hyperparameters from the YOLO v1 paper and could not get good results on our dataset.

"An ’empty’ class makes things easier if the background stays consistently empty."

For our dataset and approach, it simplified things to use an empty class.

On other datasets (like the VOC2007 set), the varying backgrounds will not work with our concept of an empty class. Here, it would probably be better to use the original YOLO v1 behavior with sigmoid activations – allowing all classes to predict a confidence close to zero to indicate that no known object can be recognized in an image cell.

"Start with a small model to experiment fast."

Our overall approach – having a small model working on a small dataset – helped us a lot to dive into the concepts of object detection and to learn the craftsmanship of writing custom loss functions. In particular, it enabled us to try out different optimizers, learning rates, hyperparameter adaptations and model layers quite fast. A small model with small images let us run most experiments within minutes and complete a full training in less than an hour on our own machine.

"You have to try."

We started to learn as soon as we stopped replicating and started to develop things on our own. We encourage everyone to do the same. Have fun trying.

