Webcam Object Detection with Mask R-CNN on Google Colab

How to use Mask R-CNN for Object Detection with a live camera stream on Google Colaboratory

Published in Towards Data Science · Jan 29, 2020 · 6 min read

Mask R-CNN algorithm in low light — thinks it sees a cat ¯\_(ツ)_/¯

There are plenty of approaches to object detection. YOLO (You Only Look Once) is the algorithm of choice for many because it passes the image through a fully convolutional neural network (FCNN) only once. This makes inference fast: about 30 frames per second on a GPU.

Object bounding boxes predicted by YOLO (Joseph Redmon)

Another popular approach uses a Region Proposal Network (RPN). RPN-based algorithms have two components. The first proposes Regions of Interest (RoIs), i.e. where in the image objects might be. The second classifies the objects in these proposed regions. This approach is slower. Mask R-CNN is a framework by Facebook AI that uses an RPN for object detection, and it runs at about 5 frames per second on a GPU. We will use Mask R-CNN.

Why use a slow algorithm when there are faster alternatives? Glad you asked!

In addition to class labels and bounding boxes, Mask R-CNN outputs a pixel-level mask for each detected object.

Object masks and bounding boxes predicted by Mask R-CNN (Matterport)

The following sections explain the code and concepts behind object detection with Mask R-CNN and working with camera input on Colab. It’s not a step-by-step tutorial, but hopefully it will be just as effective. At the end of this article, you will find the link to the Colab notebook to try it yourself.

Matterport has a great implementation of Mask R-CNN using Keras and TensorFlow. They provide notebooks to play with Mask R-CNN, to train it on your own dataset, and to inspect the model and weights.

Why Google Colab

If you don’t have a GPU machine or don’t want to go through the tedious task of setting up the development environment, Colab is the best temporary option.

In my case, I recently lost my favorite laptop, so I am on my backup machine, a Windows tablet with a keyboard. Colab lets you work in a Jupyter notebook in your browser, connected to a virtual machine in Google Cloud with a powerful GPU or a TPU (Tensor Processing Unit). The VM comes pre-installed with Python, TensorFlow, Keras, PyTorch, fastai and many other important machine learning tools. All for free. Beware that an idle session is disconnected after a short period of inactivity, and its progress is lost.

Getting started with Google Colab

The Welcome to Colaboratory guide gets you started easily. The Advanced Colab guide comes in handy when taking input from the camera, communicating between different cells of the notebook, and communicating between Python and JavaScript code. If you don’t have time to look at them, just remember the following.

A cell in a Colab notebook usually contains Python code. By default, the code runs inside the /content directory of the connected virtual machine. Colab VMs run Ubuntu, and you can execute a system command by starting the line with !.

The following command will clone the repository.

!git clone https://github.com/matterport/Mask_RCNN

To run several system commands in one cell, put %%shell on the first line of the cell, followed by the commands. The cell then runs as a single shell script, so cd applies to the commands after it. The following cell clones the repository, changes the directory to Mask_RCNN and sets up the project.

%%shell
# clone Mask_RCNN repo and install packages
git clone https://github.com/matterport/Mask_RCNN
cd Mask_RCNN
python setup.py install

Import Mask R-CNN

The following code comes from the demo notebook provided by Matterport. We only need to change ROOT_DIR to ./Mask_RCNN, the project we just cloned.

The Python statement sys.path.append(ROOT_DIR) adds the Mask_RCNN directory to Python's module search path, so the subsequent import statements can find the Mask R-CNN implementation there. The code imports the necessary libraries and classes and downloads the pre-trained Mask R-CNN model. Go through it. The comments make the code easier to understand.
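
For reference, here is a minimal sketch of that setup, not the demo notebook verbatim. The names MODEL_DIR, COCO_MODEL_PATH and IMAGE_DIR follow the Matterport demo as I recall it, and ensure_coco_weights is a hypothetical helper added for illustration. The mrcnn import is deferred into the helper so that the path setup itself runs even before the repository is cloned.

```python
import os
import sys

# Root of the cloned repository; the article clones it into ./Mask_RCNN.
ROOT_DIR = os.path.abspath("./Mask_RCNN")

# Add the repo (for `import mrcnn`) and its COCO sample (for `import coco`)
# to the module search path. This does not change the working directory.
for path in (ROOT_DIR, os.path.join(ROOT_DIR, "samples", "coco")):
    if path not in sys.path:
        sys.path.append(path)

# Assumed locations, following the layout of the Matterport demo notebook.
MODEL_DIR = os.path.join(ROOT_DIR, "logs")                      # logs and checkpoints
COCO_MODEL_PATH = os.path.join(ROOT_DIR, "mask_rcnn_coco.h5")   # pre-trained weights
IMAGE_DIR = os.path.join(ROOT_DIR, "images")                    # sample images


def ensure_coco_weights(weights_path=COCO_MODEL_PATH):
    """Download the COCO weights if they are missing (hypothetical helper)."""
    # Imported lazily: this only works once the repo is on sys.path.
    from mrcnn import utils

    if not os.path.exists(weights_path):
        utils.download_trained_weights(weights_path)
    return weights_path
```

On Colab, calling ensure_coco_weights() after the clone step should fetch the weights file once and reuse it afterwards.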

Create Model from Trained Weights

The following code creates a model object in inference mode, so we can run predictions. Then it loads the weights of the pre-trained model that we downloaded earlier into the model object.
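
A rough sketch of this step, assuming the Matterport API (coco.CocoConfig, modellib.MaskRCNN and load_weights with by_name=True). The function names make_inference_config and build_model are hypothetical; the base config class is passed in so the override logic can be tried without TensorFlow installed.

```python
def make_inference_config(base_config_cls):
    """Return a config instance that processes one image at a time."""

    class InferenceConfig(base_config_cls):
        # Batch size is GPU_COUNT * IMAGES_PER_GPU, so this gives a batch of 1,
        # which suits running detection on a single frame.
        GPU_COUNT = 1
        IMAGES_PER_GPU = 1

    return InferenceConfig()


def build_model(model_dir, weights_path):
    """Create a Mask R-CNN model in inference mode and load trained weights."""
    # Lazy imports: these need the cloned Mask_RCNN repo on sys.path.
    import coco                      # samples/coco/coco.py in the repo
    import mrcnn.model as modellib

    config = make_inference_config(coco.CocoConfig)
    model = modellib.MaskRCNN(mode="inference", model_dir=model_dir, config=config)
    model.load_weights(weights_path, by_name=True)
    return model
```

With the paths from the import step, this would be called as model = build_model(MODEL_DIR, COCO_MODEL_PATH).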

Run Object Detection

Now we test the model on some images. The Mask_RCNN repository has a directory named images that contains... you guessed it... some images. The following code takes an image from that directory, passes it through the model and displays the result in the notebook along with bounding box information.
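
As a sketch of what this step involves: in the Matterport implementation, model.detect returns one dictionary per image with the keys rois, masks, class_ids and scores, where each box is (y1, x1, y2, x2). The helpers summarize_detections and detect_and_show below are hypothetical names, and class_names is assumed to be the COCO label list from the demo notebook (with "BG" at index 0).

```python
def summarize_detections(result, class_names, min_score=0.0):
    """Turn one detection result into a list of readable rows."""
    rows = []
    for box, class_id, score in zip(result["rois"], result["class_ids"], result["scores"]):
        if score < min_score:
            continue  # skip low-confidence detections
        y1, x1, y2, x2 = (int(v) for v in box)
        rows.append({
            "label": class_names[int(class_id)],
            "score": float(score),
            "box": (y1, x1, y2, x2),
        })
    return rows


def detect_and_show(model, image, class_names):
    """Run detection on one RGB image array, draw it, and return a summary."""
    from mrcnn import visualize  # lazy import: needs the cloned repo

    result = model.detect([image], verbose=1)[0]  # one result per input image
    visualize.display_instances(
        image, result["rois"], result["masks"],
        result["class_ids"], class_names, result["scores"],
    )
    return summarize_detections(result, class_names)
```

The summary is handy when you want the labels and boxes as data rather than only as a drawn image.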

The result of the prediction

Working with Camera Images

Colab's advanced usage guide provides code that captures an image from a webcam in the notebook and forwards it to the Python code.

Colab notebooks come with a pre-installed Python package called google.colab, which contains handy helper methods. One of them, output.eval_js, evaluates JavaScript code and returns the output to Python. And in JavaScript, there is a method called getUserMedia() that lets us capture the audio and/or video stream from the user's webcam and microphone.
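
To make the Python side of that round trip concrete, here is a small sketch. eval_js is the real helper from google.colab.output; build_js_call and call_js are hypothetical wrappers added here, and they use json.dumps so that Python arguments become valid JavaScript literals.

```python
import json


def build_js_call(function_name, *args):
    """Build a JavaScript call expression, e.g. 'takePhoto(0.8)'."""
    # json.dumps yields literals JavaScript can parse: numbers, strings, booleans.
    js_args = ", ".join(json.dumps(arg) for arg in args)
    return "{}({})".format(function_name, js_args)


def call_js(function_name, *args):
    """Evaluate a JavaScript function call in the browser and return its result."""
    # Lazy import: google.colab is only available inside a Colab runtime.
    from google.colab.output import eval_js

    return eval_js(build_js_call(function_name, *args))
```

On Colab, call_js("Math.max", 3, 7) would evaluate Math.max(3, 7) in the browser and hand the result back to Python.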

Have a look at the following JavaScript code. Using the getUserMedia() method of the browser's WebRTC API, it captures the video stream of the webcam and draws the individual frames on an HTML canvas. Just like the google.colab Python package, there is a google.colab library available to us in JavaScript. Its kernel.invokeFunction function lets us invoke a Python method from our JavaScript code.

Each frame captured from the webcam is encoded as a Base64 string. This Base64 image is passed to a Python callback method, which we will define later.

We already discussed that putting %%shell on the first line of a Colab notebook cell makes the cell run as terminal commands. Similarly, you can write JavaScript in the whole cell by starting the cell with %%javascript. But we will simply put the JavaScript code we wrote above inside the Python code. Like this:
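
A condensed sketch of the idea, not the notebook's exact code: the JavaScript lives in a Python string, gets injected with IPython's Javascript display object, and is started through eval_js. The JavaScript function name streamFrames and the max_frames parameter are assumptions made for this example; take_photo and notebook.run_algo are the names used in this article.

```python
# JavaScript kept in a Python string so a single Python cell can inject and run it.
CAPTURE_JS = """
async function streamFrames(quality, maxFrames) {
  const video = document.createElement('video');
  video.style.display = 'block';
  document.body.appendChild(video);

  // Ask the browser for the webcam stream (the user has to grant permission).
  const stream = await navigator.mediaDevices.getUserMedia({video: true});
  video.srcObject = stream;
  await video.play();
  google.colab.output.setIframeHeight(document.documentElement.scrollHeight, true);

  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;

  for (let i = 0; i < maxFrames; i++) {
    // Draw the current frame and encode it as a Base64 JPEG data URL.
    canvas.getContext('2d').drawImage(video, 0, 0);
    const frame = canvas.toDataURL('image/jpeg', quality);
    // Pass the frame to the Python callback registered as notebook.run_algo.
    await google.colab.kernel.invokeFunction('notebook.run_algo', [frame], {});
  }

  stream.getVideoTracks()[0].stop();
  video.remove();
  return maxFrames;
}
"""


def take_photo(quality=0.8, max_frames=20):
    """Inject the JavaScript above and start the capture loop (Colab only)."""
    from IPython.display import display, Javascript
    from google.colab.output import eval_js

    display(Javascript(CAPTURE_JS))
    return eval_js("streamFrames({}, {})".format(float(quality), int(max_frames)))
```

Capping the loop with max_frames is a design choice of this sketch, so the cell finishes on its own instead of streaming indefinitely.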

Python — JavaScript Communication

The JavaScript code we wrote above invokes the notebook.run_algo method of our Python code. The following code defines a Python method run_algo that accepts a Base64 image, converts it to a NumPy array and passes it through the Mask R-CNN model we created above. Then it shows the output image and processing stats.

Important! Don’t forget to wrap the body of your callback method in a try / except block and log any exception. The callback is invoked from JavaScript, so otherwise there will be no sign of what error occurred while calling it.

Let’s register run_algo as notebook.run_algo. Now it can be invoked from the JavaScript code. We also call the take_photo() Python method we defined above, to start the video stream and object detection.
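
Here is a sketch of such a guarded callback. It assumes the frame arrives as a data URL ("data:image/jpeg;base64,...") and decodes it with Pillow and NumPy, both of which are assumptions of this example (the notebook may decode differently). make_run_algo, data_url_to_bytes, data_url_to_array and the on_result and decode parameters are hypothetical names.

```python
import base64
import io
import logging
import traceback

logger = logging.getLogger("run_algo")


def data_url_to_bytes(data_url):
    """Drop the 'data:image/jpeg;base64,' header (if any) and decode the payload."""
    payload = data_url.split(",", 1)[-1]
    return base64.b64decode(payload)


def data_url_to_array(data_url):
    """Decode a data URL into an RGB NumPy array (needs numpy and Pillow)."""
    import numpy as np
    from PIL import Image

    image = Image.open(io.BytesIO(data_url_to_bytes(data_url)))
    return np.array(image.convert("RGB"))


def make_run_algo(model, on_result=print, decode=data_url_to_array):
    """Build the callback that JavaScript invokes with one Base64 frame."""

    def run_algo(data_url):
        try:
            image = decode(data_url)
            result = model.detect([image], verbose=0)[0]
            on_result(result)  # e.g. draw the masks or print timing stats
        except Exception:
            # JavaScript never sees Python exceptions, so log the traceback here.
            logger.error("run_algo failed:\n%s", traceback.format_exc())

    return run_algo
```

Passing on_result in keeps the display logic separate from the error handling, so the guard covers decoding, detection and drawing alike.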
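
The registration itself is short. A sketch, assuming output.register_callback from the google.colab package; start_detection is a hypothetical wrapper that takes the callback and the capture function as arguments so it stays self-contained.

```python
def start_detection(run_algo, take_photo):
    """Expose run_algo to JavaScript, then start the webcam loop (Colab only)."""
    # Lazy import: google.colab is only available inside a Colab runtime.
    from google.colab import output

    # The name must match the one the JavaScript passes to kernel.invokeFunction.
    output.register_callback("notebook.run_algo", run_algo)

    # Start capturing; every frame sent from the browser now triggers run_algo.
    return take_photo()
```

Registering before starting the stream matters: if the first frame arrives before the callback exists, the JavaScript call has nothing to invoke.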

Try it yourself

You are now ready to try Mask R-CNN on camera in Google Colab. The notebook will walk you step by step through the process.

(Optional) For Curious Ones

The process we used above converts the camera stream to images in the browser (JavaScript) and sends individual images to our Python code for object detection. Because every frame makes a separate round trip, this is not real-time. So, I spent hours trying to stream the WebRTC feed from JavaScript (peer A) to a Python server (peer B), without success. Perhaps my unfamiliarity with combining async / await with Python threads was the main hindrance. I was trying to use aiohttp as the Python server, with aiortc handling the WebRTC connection. The Python library aiortc makes it easy to set up Python as a WebRTC peer. Here is the link to the Colab notebook with my incomplete attempt at creating a WebRTC server.

Originally published at https://emadehsan.com on January 29, 2020.