Web Scraping using Selenium and YOLO to build Computer Vision datasets

This workflow takes the heavy lifting out of building a dataset

Borja Souto García
Towards Data Science


Photo by Nicolas Picard on Unsplash

If we ask several experts about the keys to a Computer Vision project (especially one aimed at a real application), probably the most repeated answer (and, in my opinion, the most important one) is the dataset. The problem appears when we also ask about the most arduous and laborious task, because the answer is usually the same: the dataset.

The task of composing a dataset can be summarized in three steps:

  1. Capture images.
  2. Annotate images: labeling each sample with one of a set of classes.
  3. Validation: checking if the labels are correct.

The first thing we would do is resort to state-of-the-art datasets that fit our task, but we don’t always find what we need. At that point we are facing a manual and painful job. This article shows how to avoid most of that manual work.

Using Selenium [1] (an open-source browser automation tool), Instagram (indirectly one of the largest image databases in the world), and YOLO [2] (one of the most widely used deep learning algorithms for object detection), we can generate a dataset automatically (the only step you can’t avoid is validation). As a simple example, we will generate a dataset with two classes: cat and dog.

We will design a bot with Selenium that accesses and navigates Instagram automatically. We will also use YOLO, a Convolutional Neural Network, to detect and sort the dogs and cats we need.

Environment Setup

The programming language we are going to use is Python 3.6, which allows us to use Selenium and YOLO easily.

We will need the following external dependencies:

  • GluonCV: a framework that provides implementations of state-of-the-art deep learning algorithms in computer vision. This toolkit offers a large number of pre-trained models; we will use one of them [3].
  • Pillow: the Python Imaging Library. We will use it to compose and save the downloaded images.
  • Selenium: an umbrella project for a range of tools and libraries that enable and support the automation of web browsers. We will use its Python bindings.
  • BeautifulSoup: a library for pulling data out of HTML. We will use it to parse the pages loaded by the bot.
  • Requests: an elegant and simple HTTP library for Python. We will use it to download the images.

You can install these dependencies with this command:

pip install mxnet gluoncv Pillow selenium beautifulsoup4 requests

In addition, you will need ChromeDriver, which provides the capabilities Selenium needs to drive Chrome. Copy it to /usr/local/bin, and voilà! You can now control your Chrome browser from Selenium!
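A quick sanity check that Selenium can actually drive Chrome (assuming chromedriver is now on your PATH) might look like this:

from selenium import webdriver

driver = webdriver.Chrome()               # launches a Chrome window controlled by Selenium
driver.get('https://www.instagram.com/')  # navigate to any page
print(driver.title)                       # print the page title to confirm it loaded
driver.quit()                             # close the browser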

Object detection module

As we have already mentioned, we will use YoloV3 as the detection module (by the way, YoloV4 was released about a month ago). With it, we will try to find and differentiate cats and dogs.

YoloV3 module using GluonCV and MXNet

In short, YOLO (You Only Look Once) is a single CNN that produces multiple predictions (bounding boxes), each with an associated probability per class. In addition, NMS (Non-Maximum Suppression) is used to merge overlapping detections into a single one. The result is a fast network that works in real time.
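The detection module itself can be very small. Below is a minimal sketch using GluonCV’s model zoo; the class name, the 0.6 confidence threshold, and the returned set of labels are assumptions of this sketch, not the exact code of the original module:

from gluoncv import model_zoo, data

class PetDetector:
    def __init__(self, threshold=0.6):
        # YoloV3 with a Darknet-53 backbone, pre-trained on COCO (80 classes).
        self.net = model_zoo.get_model('yolo3_darknet53_coco', pretrained=True)
        self.threshold = threshold

    def detect(self, image_path):
        # Resize and normalize the image the way the YOLO preset expects.
        x, _ = data.transforms.presets.yolo.load_test(image_path, short=512)
        class_ids, scores, _ = self.net(x)
        labels = set()
        for class_id, score in zip(class_ids[0], scores[0]):
            if float(score.asscalar()) < self.threshold:
                continue  # skip low-confidence (and padding) detections
            labels.add(self.net.classes[int(class_id.asscalar())])
        # Return which of the classes of interest were found, if any.
        return labels & {'cat', 'dog'}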

Dog and cat detections using YoloV3 module. Dog: base image by Joe Caione on Unsplash. Cat: base image by Nathan Riley on Unsplash

Selenium BOT module

We will use Selenium to program a bot that logs in to Instagram, navigates it automatically, and downloads images with the help of the detection module.

As you can see in the __call__() method, we can divide the code into 4 main steps (a simplified sketch follows the list):

  1. Open Instagram home page.
  2. Login on Instagram with your username and password.
  3. Enter a hashtag to restrict the search, in this case #PETS, because we want dogs and cats in this example.
  4. And the most important step: scroll and download images.
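A hedged sketch of what that __call__() flow might look like; the Instagram URLs are real, but the login form selectors, wait times, and default arguments are assumptions and may need adjusting as the site changes:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

class InstagramBot:
    def __init__(self, username, password, detector):
        self.username = username
        self.password = password
        self.detector = detector          # the YoloV3-based detector sketched above
        self.driver = webdriver.Chrome()  # ChromeDriver must be reachable, e.g. in /usr/local/bin

    def __call__(self, hashtag='pets', max_scrolls=50):
        # 1. Open the Instagram home page.
        self.driver.get('https://www.instagram.com/')
        time.sleep(3)
        # 2. Log in (field names assumed from the current login form).
        self.driver.find_element(By.NAME, 'username').send_keys(self.username)
        self.driver.find_element(By.NAME, 'password').send_keys(self.password)
        self.driver.find_element(By.XPATH, "//button[@type='submit']").click()
        time.sleep(5)
        # 3. Restrict the search to a hashtag, e.g. #PETS.
        self.driver.get(f'https://www.instagram.com/explore/tags/{hashtag}/')
        time.sleep(3)
        # 4. Scroll and download images (sketched separately below).
        scroll_and_download(self.driver, self.detector, max_scrolls)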

In this last step, we scroll and parse the HTML to obtain the URL of each image, download it, and compose it with Pillow. We then check whether the image contains a dog, a cat, or neither, and either save it in the folder corresponding to its class or discard it.
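Here is a hedged sketch of that last step. The exact HTML Instagram serves changes over time, so the img-tag parsing, the output folder layout, and the file naming below are assumptions:

import os
import time
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from PIL import Image

def scroll_and_download(driver, detector, max_scrolls=50, out_dir='dataset'):
    seen = set()
    os.makedirs(out_dir, exist_ok=True)
    for _ in range(max_scrolls):
        # Scroll to the bottom so Instagram lazily loads more thumbnails.
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(2)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for img_tag in soup.find_all('img'):
            url = img_tag.get('src')
            # Keep only new, absolute http(s) URLs.
            if not url or not url.startswith('http') or url in seen:
                continue
            seen.add(url)
            # Download the image and compose it with Pillow.
            image = Image.open(BytesIO(requests.get(url).content)).convert('RGB')
            tmp_path = os.path.join(out_dir, 'tmp.jpg')
            image.save(tmp_path)
            labels = detector.detect(tmp_path)   # e.g. {'dog'}, {'cat'}, both, or empty
            for label in labels:                 # save once per detected class
                class_dir = os.path.join(out_dir, label)
                os.makedirs(class_dir, exist_ok=True)
                image.save(os.path.join(class_dir, f'{len(seen)}.jpg'))
            os.remove(tmp_path)                  # images with no pet are simply discarded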

Finally, you only have to worry about validating the sample. Enjoy! 😉

If you want to launch this code easily, you can find it in this repository. Just follow the steps indicated in the README.

In the following GIF, you can see how the dataset is generated automatically: the bot logs in to Instagram with a username and password, enters the #PETS hashtag, and scrolls while downloading only the images of dogs (bottom folder) and cats (top folder).

Running program

Note that this is a simple example; of course, there are many open datasets of dogs and cats. We can add as much complexity as we want to our generator: this is a proof of concept, and in practice our tasks will probably not be so simple, nor the data we need so basic.

Thanks for reading!
