Detecting Pikachu in videos using TensorFlow Object Detection

Juan De Dios Santos
Towards Data Science
7 min read · May 13, 2018


Deep inside the many functionalities and tools of TensorFlow lies a component named the TensorFlow Object Detection API. The purpose of this library, as the name says, is to train a neural network capable of recognizing objects in a frame, for example, an image.

In a previous work of mine, found here, I walked through the procedure I followed to detect Pikachu on Android devices using this TensorFlow package. Moreover, I gave an introduction to the library and discussed the different architectures and features it provides, as well as a demonstration of how to evaluate the training process using TensorBoard.

Moving forward, a couple of months later, I took on the task of improving my previously trained Pikachu detection model with the purpose of detecting them straight from a video, using Python, OpenCV, and of course, TensorFlow Object Detection. The code is available on my GitHub: https://github.com/juandes/pikachu-detection

Pikachu

This article is about the steps I followed to achieve this. Firstly, I will state the issues I noticed in my original model and what I did to improve them. Then, I will describe how, using this new and improved model, I built a detection system for videos. Finally, you will be able to see two videos with several Pikachu detections.

But before we start, here is a short gif showing some quick detections.

Pikachu being detected
That’s Pikachu

Improvements to the model

As mentioned before, in a previous work, I did the original training of a Pikachu detection model with the goal of using it on Android devices as well as in a Python notebook. However, I was not completely pleased with how the model was performing, which motivated me to improve the system and, thus, to write this article.

The main concern I had was the number of Pikachu images I used to build the system: 230. Of these, around 70% were used for training while the remaining 30% were used for testing, so not that many for training. While this is technically not an issue (since the model was performing “okayish”), I added 70 more pictures to the training set (not that many, but better than nothing).

Because I now had more images, I had to extend the training of the model. Instead of retraining from scratch, I used the training checkpoint of my earlier model and continued from there; the former was trained at 15000 epochs, and the new one at 20000. The next two graphs show the total loss and precision (taken from TensorBoard); it’s easy to notice that there was not much change from epoch 15000 to epoch 20000 (particularly in the loss).
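For those curious about the mechanics of resuming, the TensorFlow Object Detection API handles this through the pipeline configuration file, where the checkpoint of a previous run can be set as the starting point (the API counts these iterations as training steps). The following is a minimal sketch of the relevant train_config fields; the path and values are placeholders, not the exact ones from my project:

```
train_config {
  # Hypothetical path to the checkpoint produced by the earlier run
  fine_tune_checkpoint: "training/old_run/model.ckpt-15000"
  from_detection_checkpoint: true
  # Continue training up to 20000 steps in total
  num_steps: 20000
}
```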

Plot of loss
Plot of precision

The last (and small) correction I made was modifying the detection threshold of the Android app, increasing it from the default value of 0.6 to 0.85.

Did the improvements change anything? Even trying to put aside my confirmation bias, I’d say yes. I noticed a small improvement. The biggest change was a reduction in the number of false positives in the Android app caused by objects that look like a yellow blob; of course, this could simply be because the threshold was increased.

Now, with a newer and (hopefully) improved model, I was ready to use it to detect Pikachu in videos. Before moving on, I’d like to mention that I will skip the process of freezing and importing the model, as this was addressed in my earlier work.

Detection from videos

Performing object detection on a video is not as hard or fancy as it sounds. In layman’s terms, a video is a sequence of images, so the detection process is fairly similar to detecting from a normal image. Why only fairly similar? Well, due to the nature of a video, there are several preprocessing and frame-preparation steps that have to be handled before feeding each frame to the detection model. In the next lines, I will explain this, plus the procedure I followed to perform the detections and how I created a new video to display them.

Most of my code is based on a Python notebook provided in the TensorFlow Object Detection repo; this code does most of the hard work as it includes many functions that ease the detection process. Also, I’d recommend taking a look at my script and using it as a guide as you read the following paragraphs.

From a high-level perspective, the code I wrote has three main tasks:

Loading the resources

To begin with, the frozen model, the data labels, and the video have to be loaded. For simplicity’s sake, I recommend a short or medium-sized video, since it can take a bit of time to process the whole movie.
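As a reference, here is a minimal sketch of what this loading step could look like, following the pattern of the TensorFlow Object Detection notebook (TensorFlow 1.x); the file names are placeholders, not the ones from my repo:

```python
import cv2
import tensorflow as tf
from object_detection.utils import label_map_util

PATH_TO_FROZEN_GRAPH = 'frozen_inference_graph.pb'  # frozen detection model
PATH_TO_LABELS = 'label_map.pbtxt'                   # maps class ids to names
PATH_TO_VIDEO = 'pikachu_video.mp4'                  # input video

# Load the frozen TensorFlow graph into memory
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as fid:
        od_graph_def.ParseFromString(fid.read())
        tf.import_graph_def(od_graph_def, name='')

# Load the label map (only one class here: Pikachu)
label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=1, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

# Open the video with OpenCV
cap = cv2.VideoCapture(PATH_TO_VIDEO)
```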

Iterating over video

The principal functionality of the script is a loop that iterates over every frame of the video. At each iteration, a frame is read and its color space is changed. Next, the actual detection is performed to find all those nice and yellow Pikachus. As a result, the coordinates of the bounding box where the Pikachu is located (if any was found) and the confidence value of the detection are returned. Subsequently, a copy of the frame is created, containing the bounding box where the Pikachu should be, as long as the confidence score is above the given threshold. For this project, I set the confidence threshold very low, at 20%, because I noticed that the number of false positives detected in the video was really small, so I decided to “risk” performance just to have more Pikachu detections.
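A sketch of that loop, assuming the detection_graph, category_index, and cap objects from the loading step above, might look like this; the drawing helper comes from the Object Detection visualization utilities:

```python
import cv2
import numpy as np
import tensorflow as tf
from object_detection.utils import visualization_utils as vis_util

CONFIDENCE_THRESHOLD = 0.20  # deliberately low, as discussed above

with detection_graph.as_default():
    with tf.Session(graph=detection_graph) as sess:
        # Input and output tensors of the frozen detection graph
        image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
        boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
        scores = detection_graph.get_tensor_by_name('detection_scores:0')
        classes = detection_graph.get_tensor_by_name('detection_classes:0')

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break  # no more frames

            # OpenCV reads frames as BGR; the model expects RGB
            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frame_expanded = np.expand_dims(rgb_frame, axis=0)

            # Run the detection on the current frame
            (out_boxes, out_scores, out_classes) = sess.run(
                [boxes, scores, classes],
                feed_dict={image_tensor: frame_expanded})

            # Draw the boxes on a copy of the frame, keeping only
            # detections whose score is above the threshold
            annotated = frame.copy()
            vis_util.visualize_boxes_and_labels_on_image_array(
                annotated,
                np.squeeze(out_boxes),
                np.squeeze(out_classes).astype(np.int32),
                np.squeeze(out_scores),
                category_index,
                use_normalized_coordinates=True,
                min_score_thresh=CONFIDENCE_THRESHOLD)
```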

Creation of new video

All the newly created copies of the frames carrying the detection boxes discussed in the preceding step are used to construct a new video. To build it, a VideoWriter object is needed, and at each iteration of the loop, the annotated copy of the frame is written into this object (without any sound).
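The writer setup could be sketched like this, reusing the cap object to copy the input’s frame rate and size; the output file name and codec are only examples:

```python
import cv2

# Match the output video to the input's frame rate and dimensions
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('pikachu_detections.mp4', fourcc, fps, (width, height))

# Inside the detection loop, each annotated frame is appended to the output:
#     out.write(annotated)

# Once the loop finishes, release both the reader and the writer
cap.release()
out.release()
```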

Results and discussion

These two videos showcase how the model performs.

The detections from the first video were pretty good. Even though Pikachu kept holding a ketchup bottle through the whole video, the model was able to detect it in most scenes. On the other hand, there is an instance at 0:22 where it was not detected; moreover, the shot where Scyther (the green, mantis-looking thing) broke the ketchup bottle (0:40 to 0:44) produced a false positive.

The performance of the model on the second video was not as good as on the first one, the central problem being the scenes where there are two Pikachus in the frame. In these cases, the model seems to detect the two of them as a single Pikachu, instead of producing one detection for each; a clear example is at 0:13, where both Pikachus are slapping each other (sad scene :( , I know).

Conclusion and recap

In this article, I talked about how we can use the TensorFlow Object Detection package to detect Pikachus in videos. In the beginning, I discussed a bit of my previous work in which I used an earlier version of a model to do the detections on an Android device. Said model, even though it was doing its job, had some problems that I wished to work on; those improvements led me to do this project and to build a detection model for videos.

The new model works as intended. Sure, there are some hiccups here and there, related to false positives and missed Pikachus, but it does what it has to do. As future work, I’d like to add more images of Pikachu from different angles to my training set, for example, side and back views, just to have more diversity in the data and, thus, a better outcome.

Thanks for reading. I really hope this guide helped some of you. If you have any questions, comments, or doubts, or just want to chat, leave a comment and I will be happy to help.


Data storyteller, Trust and Safety Software Engineer, and fan of quantifying my life. Also, I like Pokemon. https://juandes.com, @jdiossantos.