YOLOv2 to detect your own objects using Darkflow

Park Chansung
Towards Data Science
5 min read · Jul 26, 2018

The previous story showed the most basic usage of Darkflow, one of the implementations of the YOLOv2 object detection model.

This story introduces the basic steps for object detection on your own custom dataset. As an example, I did it myself for soccer ball detection. In brief, I am going to show how to (1) prepare the dataset, (2) train the model, and (3) predict objects.

I have written a Jupyter notebook on GitHub related to this story. In case you want to try it yourself, please visit here to get the full dataset and training procedure.

Preparing the Dataset

The first thing to do in this step is to collect media (image or video) files for your own interest. In my case, I chose the soccer ball because the 2018 FIFA World Cup Russia was ongoing while I was working on this, and it was easy to find highlight video clips on YouTube from which I could produce a lot of screenshots (following this LINK, you will find a way to produce bulk screenshots from a video file). If you are collecting only image files from the Internet, that is fine too. Just remember that all your image files, whether downloaded or captured as screenshots, have to come from a similar (or, if possible, the same) distribution, since we are not building a general-purpose model.
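As a sketch of one way to do bulk screenshot extraction (the helper below is hypothetical and assumes opencv-python is installed; the story's LINK describes its own method):

```python
def extract_frames(video_path, out_dir, every_n=30):
    """Save every `every_n`-th frame of a video as a JPEG screenshot.

    Hypothetical helper; assumes opencv-python is installed.
    Returns the number of frames written.
    """
    import os
    import cv2  # deferred so the sketch can be read without OpenCV

    os.makedirs(out_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    index = saved = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of video
            break
        if index % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:04d}.jpg"), frame)
            saved += 1
        index += 1
    capture.release()
    return saved
```

With a ~30 fps highlight clip, `every_n=30` gives roughly one screenshot per second.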

Secondly, it is time to pre-process the collected image dataset. The YOLO model is trained with images and associated annotations. Each image file is linked to an annotation file, and an annotation file can contain multiple annotations. An annotation indicates where an object is located in the image, given as the corner coordinates of its bounding box (xmin, ymin, xmax, ymax). These annotations have to be written by hand, which can take a long time. Fortunately, there are great tools for easing this task; I chose THIS tool (labelImg).
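For reference, a single annotation file in the Pascal VOC format that labelImg produces looks roughly like this (filename and coordinates are made up):

```xml
<annotation>
  <folder>images</folder>
  <filename>frame_0001.jpg</filename>
  <size>
    <width>1280</width>
    <height>720</height>
    <depth>3</depth>
  </size>
  <object>
    <name>ball</name>
    <bndbox>
      <xmin>512</xmin>
      <ymin>300</ymin>
      <xmax>560</xmax>
      <ymax>348</ymax>
    </bndbox>
  </object>
</annotation>
```

Each `<object>` element is one annotation, so an image with several objects has several `<object>` blocks in the same file.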

labelImg TOOL for bounding boxes

As seen in the screenshot above, you can easily draw bounding boxes around the objects you are interested in. I drew boxes around soccer balls only, but multiple object classes can be annotated too. Just remember that “PascalVOC” should be displayed, since that means the annotation format used in this story is selected. Even with a tool like this, the process takes quite a long time.

In case you want my dataset…

I have uploaded and shared my pre-processed image dataset and its associated annotations. You can download it here. You may want it to experiment with the training/prediction tasks yourself, or to add more data to increase the model’s performance.

Predict the Object

In this step, before prediction, a trained model has to be prepared for inference. To obtain a trained model, here is what you should do.

  1. Find a pre-trained model on COCO or VOC dataset
  2. Change configurations to fit the model into your own situation
  3. Build the model
  4. Train the model

First, in my case, I used darkflow (an open-source implementation of YOLOv2) and its pre-trained parameters. Please visit my GitHub repository to see the full instructions for using these.

Second, there should be a file specifying the model’s configuration; if there isn’t one, you have to change the code itself. For darkflow, there is a configuration file (*.cfg) for each model, and ultimately you need to change some values in the last layers. For example, in my case, I wanted only a soccer ball to be detected, so I changed the value of “classes” to 1 (the number of classes is 20 for the VOC dataset and 80 for COCO). Since the number of units (classes) in the last layer changed, some associated parameters have to be changed accordingly as well. According to darkflow’s documentation, I needed to change “filters” in the [convolutional] layer (the second-to-last layer) to num * (classes + 5), which is 30.
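The filter count can be sanity-checked with a quick calculation (num = 5 is the number of anchor boxes in YOLOv2’s [region] layer):

```python
# Number of anchor boxes in the YOLOv2 [region] layer.
num = 5
# One class (soccer ball) in my case; 20 for VOC, 80 for COCO.
classes = 1
# Each anchor predicts 4 box coordinates + 1 objectness score + class scores.
filters = num * (classes + 5)
print(filters)  # 30
```

With the original 20 VOC classes the same formula gives 5 * (20 + 5) = 125, which is the value shipped in the stock config.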

The third and fourth steps probably differ across implementations. In darkflow, the code below shows how to build the model.
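A minimal sketch of the build step, assuming darkflow is installed; the file paths are hypothetical and the hyperparameter values are illustrative:

```python
options = {
    "model": "cfg/tiny-yolo-voc-1c.cfg",  # modified config: classes=1, filters=30
    "load": "bin/tiny-yolo-voc.weights",  # pre-trained VOC weights
    "batch": 8,                           # illustrative batch size
    "epoch": 100,
    "gpu": 0.8,                           # fraction of GPU memory to use
    "train": True,
    "annotation": "./annotations",        # folder of Pascal VOC .xml files
    "dataset": "./images",                # folder of the matching images
}

def build_model(options):
    # Deferred import so this sketch can be read without darkflow installed.
    from darkflow.net.build import TFNet
    return TFNet(options)  # constructing TFNet prints the layer summary
```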

Building the model prints its architecture, including the changes that differ from the pre-trained model. Then the following one line of code starts the training process.
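With darkflow, that one line is a call to `train()` on the built network (wrapped in a function here so the sketch stands on its own; `tfnet` is the TFNet instance from the build step):

```python
def start_training(tfnet):
    # Runs darkflow's full training loop on the annotated dataset,
    # periodically writing checkpoint files to ./ckpt.
    tfnet.train()
```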

I ran 100 epochs, and each epoch had 23 steps, so a total of 2,300 steps were completed. For your reference, I used an NVIDIA GTX 1080 Ti, and it took about an hour to finish the training process.

Every 250 steps, darkflow creates a checkpoint file. By specifying a checkpoint file, you can resume training from where you left off, and you can also distribute your trained model with it. The code below shows how to load a checkpoint for the prediction process (-1 indicates the latest checkpoint).
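A sketch of loading the latest checkpoint and running a prediction (paths are hypothetical; assumes darkflow and opencv-python are installed):

```python
options = {
    "model": "cfg/tiny-yolo-voc-1c.cfg",
    "load": -1,        # -1 loads the most recent checkpoint from ./ckpt
    "threshold": 0.3,  # illustrative confidence cutoff
}

def predict(image_path, options):
    # Deferred imports so this sketch can be read without the libraries.
    import cv2
    from darkflow.net.build import TFNet

    tfnet = TFNet(options)
    image = cv2.imread(image_path)
    # Each result is a dict with 'label', 'confidence',
    # 'topleft' and 'bottomright' keys.
    return tfnet.return_predict(image)
```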

Finally, you have your own model trained to detect custom objects. The detailed description of how to display images or video with bounding boxes is given in the previous story. Please also visit the GitHub repository for a more detailed explanation.

Prediction Result on images and video

I experimented with random images from the internet (above), and it worked just as I expected. The video below shows the prediction result on a video file.

Final Thoughts

  • I found that ball detection is pretty hard in a wide view of a soccer game, since the object is very tiny (smaller than your fingertip).
  • All data should come from a very similar distribution. Since photos of soccer games are taken from a variety of different angles, prediction is sometimes very hard.
  • Collect as much data as possible from different situations.
  • I found that the dataset should include the objects in many arbitrary positions. Because a soccer ball appears in almost every location during a game, unseen locations are hard to predict.
  • Finally, a soccer ball sometimes appears between players or is partially hidden by them, which also makes prediction hard. The dataset should reflect this particular situation.
