YOLOv3 PyTorch on Google Colab

Object detection video processing in your browser

Hervind Philipe
Towards Data Science


For computer vision enthusiasts, YOLO (You Only Look Once) is an extremely popular real-time object detection concept, since it is very fast and performs very well.

In this article, I will share code for processing a video to get the bounding boxes of every object in each frame, all inside Google Colab.

We will not discuss the YOLO concept or architecture, since plenty of good articles on Medium already cover that. Here we only discuss the working code.

Let's get started.

Photo by Wahid Khene on Unsplash

You can try it yourself in this Google Colab notebook.

We start from a well-written GitHub repo from Ultralytics, one of my favorites. Although the repo already covers processing video with YOLOv3 by simply running python detect.py --source file.mp4, I would like to break the code down and simplify it by removing several lines that are unnecessary for this case, and add a way to show the processed video in a Google Colab / Jupyter notebook.

Prepare YOLOv3 and Load the Model

First, clone the Ultralytics YOLOv3 repository, then import the common packages and the repo's functions.
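
As a minimal sketch, the setup cell could look like this (the module paths follow the archived Ultralytics YOLOv3 codebase and may differ between repo versions):

    !git clone https://github.com/ultralytics/yolov3  # clone the repo
    %cd yolov3

    import numpy as np
    import torch
    import cv2

    # helpers from the repo (paths per the archived YOLOv3 codebase)
    from models import Darknet
    from utils.datasets import letterbox
    from utils.utils import non_max_suppression, scale_coords, plot_one_box
    from utils import torch_utils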

Set up the argument parser, initialize the device (CPU / CUDA), initialize the YOLO model, and then load the weights.

We are using the yolov3-spp-ultralytics weights, which the repo says are far better than the other YOLOv3 weights in mean average precision (mAP).

The function torch_utils.select_device() will automatically find an available GPU, unless the input is 'cpu'.

The object Darknet initializes the YOLOv3 architecture in PyTorch, and the weights then need to be loaded from a pre-trained checkpoint (we don't want to train the model at this time).
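
Put together, the loading step looks roughly like this (the original gist sets these options via an argument parser; plain variables are used here for brevity, and the cfg and weights paths are assumptions based on the repo's defaults):

    cfg = 'cfg/yolov3-spp.cfg'                     # YOLOv3-SPP architecture definition
    weights = 'weights/yolov3-spp-ultralytics.pt'  # pre-trained checkpoint
    img_size = 416

    device = torch_utils.select_device('')  # '' picks a GPU if available; 'cpu' forces CPU

    model = Darknet(cfg, img_size)           # build the YOLOv3 architecture
    model.load_state_dict(torch.load(weights, map_location=device)['model'])
    model.to(device).eval()                  # inference only; no training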

Predict Object Detection on Video

Next, we will read the video file and rewrite the video with bounding boxes around the objects. The following three GitHub Gists are parts of a function predict_one_video that will be used at the end.

We write the new video in MP4 format, as explicitly stated in vid_writer, while fps, width and height are taken from the original video.
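
A sketch of that setup (video_path and save_path are placeholder names):

    cap = cv2.VideoCapture(video_path)           # open the input video
    fps = cap.get(cv2.CAP_PROP_FPS)              # keep the original frame rate
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))   # keep the original width
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))  # ... and height

    # write the output explicitly as MP4
    vid_writer = cv2.VideoWriter(save_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (w, h))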

Start looping over each frame of the video to get predictions.

The input image size for this model is 416. A function named letterbox resizes the image and pads it, so that one of the width or height becomes 416 and the other is less than or equal to 416 but still divisible by 32.

In the second part, we convert the image to RGB format and put the channels in the first dimension (C,H,W). We move the image data to the device (GPU or CPU) and scale the pixels from 0-255 to 0-1. Before we feed the image into the model, we call img.unsqueeze(0), because the model expects a 4-dimensional input (N,C,H,W), where N is the number of images, which is 1 in this case.

After preprocessing the image, we feed it into the model to get the prediction boxes. The model predicts a lot of overlapping boxes, so we need non-maximum suppression (NMS) to filter and merge them; a sketch of the whole per-frame loop follows below.
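
A sketch of that per-frame loop, combining the letterbox resize, the tensor preprocessing, and NMS (letterbox and non_max_suppression come from the repo's utils; the confidence and IoU thresholds here are typical defaults, not tuned values):

    while cap.isOpened():
        ret, frame0 = cap.read()                  # frame0: original BGR frame
        if not ret:
            break

        # letterbox: resize so the long side is 416, pad so both sides are divisible by 32
        img = letterbox(frame0, new_shape=416)[0]
        img = img[:, :, ::-1].transpose(2, 0, 1)  # BGR -> RGB, (H,W,C) -> (C,H,W)
        img = np.ascontiguousarray(img)

        img = torch.from_numpy(img).to(device).float()
        img /= 255.0                              # scale pixels 0-255 -> 0-1
        img = img.unsqueeze(0)                    # add batch dim: (N,C,H,W), N = 1

        with torch.no_grad():
            pred = model(img)[0]                  # raw predictions: many candidate boxes
        pred = non_max_suppression(pred, conf_thres=0.3, iou_thres=0.6)  # filter and merge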

Non-maximum Suppression (NMS). Image source

Draw the Bounding Boxes and Labels, then Write the Video

We loop over all the predictions (pred) after NMS to draw the boxes. Since the image was resized to 416 pixels, we need to scale the box coordinates back to the original size using the function scale_coords, then we draw the boxes using the function plot_one_box.
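
A sketch of this drawing step, continuing inside the frame loop above (names is assumed to be the list of class names loaded from the repo's data config):

        for det in pred:                          # detections for this frame
            if det is not None and len(det):
                # rescale boxes from the 416 letterboxed image back to the original frame
                det[:, :4] = scale_coords(img.shape[2:], det[:, :4], frame0.shape).round()
                for *xyxy, conf, cls in det:
                    label = '%s %.2f' % (names[int(cls)], conf)
                    plot_one_box(xyxy, frame0, label=label)

        vid_writer.write(frame0)                  # append the annotated frame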

Show Video on Colab

The video is written in MP4 format by the function predict_one_video. After it is saved as MP4, we compress it into H.264, so the video can be played directly in Google Colab / Jupyter.

Show Raw Video

We show the video using IPython.display.HTML with a width of 400 pixels. The video is read as binary data.
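
A sketch of that display step, using the common base64 data-URL embedding:

    from IPython.display import HTML
    from base64 import b64encode

    mp4 = open(video_path, 'rb').read()    # read the video file as binary
    data_url = 'data:video/mp4;base64,' + b64encode(mp4).decode()

    HTML("""
    <video width=400 controls>
        <source src="%s" type="video/mp4">
    </video>
    """ % data_url)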

Compress and Show Processed Video

The output of the OpenCV video writer is an MP4 video roughly three times larger than the original, and it cannot be displayed in Google Colab the same way. One solution is to compress the video first (source).

We compress the MP4 video into H.264 using ffmpeg -i {save_path} -vcodec libx264 {compressed_path}.
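
Inside a notebook cell, that can be done like this (compressed_path is a placeholder; the -y flag overwrites an existing output file):

    import os

    # re-encode the OpenCV output as H.264 so it plays inline in Colab/Jupyter
    compressed_path = save_path.replace('.mp4', '_compressed.mp4')
    os.system(f'ffmpeg -i {save_path} -vcodec libx264 -y {compressed_path}')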

Result

Left: the raw video. Right: the video processed with the code above.

Try on your own video

Go to the Google Colab file on GitHub HERE.

  1. Upload your video into the input_video folder
  2. Run the last cell (predict & show video)

Source

Thank you for reading, I hope it was helpful.

Next stories:

  • YOLOv3 using a webcam on Google Colab
  • YOLOv3 training for hand detection
  • YOLOv3 training for safety helmet detection

Cheers!!!
