Classifying Pet-Safe Plants with fast.ai

Building an Image Database

Using BeautifulSoup, Selenium and FastAI

Kenichi Nakanishi
Towards Data Science
10 min read · Oct 27, 2020


Photo by Milada Vigerova on Unsplash

Diving into Deep Learning, there is a huge number of high-quality resources out there, but one thing everybody agrees on is that having meaningful projects to work on is crucial to really learning the ropes.

Having a fiancée that loves buying plants, and a cat that loves nibbling on them — I figured what could be better than putting together a classifier that will tell me if a plant is safe or not!

One thing to note is that all of the work done here was performed on Google Colab, so some differences in setup and imports will be necessary if you want to do the same thing on your local machine. The notebook used can be found on my GitHub.

Step 1 — Getting Data

Unfortunately, I couldn’t find a premade image dataset out there that was appropriate for what I wanted to do on Kaggle, or using Google’s Dataset Search. So, what else to do but build my own!

I decided on using ASPCA’s plant toxicity list for cats and dogs, as it was a site I’ve ended up using more than a few times while in a plant nursery. This gave us a nice core to work from. To scrape this text data from the website, we can turn to BeautifulSoup, a Python library useful for pulling data out of HTML and XML files.

However, when looking at their website, the data isn’t presented as an easily accessible HTML table; instead, it is stored as rows in a panel. Luckily, BeautifulSoup gives us an easy way to search through the parse tree to find the data we want. For example:
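A minimal, self-contained sketch of this kind of parse-tree search — note that the tag and class names here (`span`, `datatable-td`) are illustrative stand-ins, not the ASPCA site’s actual markup:

```python
from bs4 import BeautifulSoup

# Stand-in HTML mimicking a panel that stores each plant as a row of text,
# rather than as a proper <table>.
html = """
<div class="panel">
  <span class="datatable-td">Aloe (Barbados Aloe) | Scientific Names: Aloe vera | Family: Liliaceae</span>
  <span class="datatable-td">Amaryllis (Belladonna Lily) | Scientific Names: Amaryllis spp. | Family: Amaryllidaceae</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all searches the parse tree by tag name and CSS class
rows = [span.get_text(strip=True) for span in soup.find_all("span", class_="datatable-td")]
print(rows)
```

With the real page, the same `find_all` pattern applies — only the tag and class names change.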

After collecting our raw data together, we need to separate it out into columns with a bit of creative splitting:
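The splitting could look something like this — the raw-row format below is a simplified stand-in for the real scraped strings, and the column names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical raw strings in the shape of the scraped ASPCA rows; the exact
# delimiters on the real page may differ.
raw = [
    "Aloe (Barbados Aloe) | Scientific Names: Aloe vera | Family: Liliaceae",
    "Amaryllis (Belladonna Lily) | Scientific Names: Amaryllis spp. | Family: Amaryllidaceae",
]

df = pd.DataFrame({"raw": raw})
# Split on the pipe separators first, then peel the alternative names
# out of the parentheses with a regex extract.
parts = df["raw"].str.split("|", expand=True)
df["Name"] = parts[0].str.extract(r"^([^(]+)")[0].str.strip()
df["Alternative Names"] = parts[0].str.extract(r"\(([^)]+)\)")[0]
df["Scientific Name"] = parts[1].str.replace("Scientific Names:", "", regex=False).str.strip()
df["Family"] = parts[2].str.replace("Family:", "", regex=False).str.strip()
print(df[["Name", "Alternative Names", "Scientific Name", "Family"]])
```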

We can then repeat this process for the list specific to dogs, merge the two dataframes, and clean up the NaNs that the merge introduces for entries appearing on only one list:
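As a sketch of the merge-and-clean step, using toy frames — the column names, and the choice to fill a missing toxicity flag with False (absent from a list means not flagged as toxic), are assumptions for illustration:

```python
import pandas as pd

# Toy cat/dog toxicity frames; the real frames come from scraping both ASPCA lists.
cats = pd.DataFrame({"Scientific Name": ["Aloe vera", "Nephrolepis exaltata"],
                     "Toxic to Cats": [True, False]})
dogs = pd.DataFrame({"Scientific Name": ["Aloe vera", "Chlorophytum comosum"],
                     "Toxic to Dogs": [True, False]})

# An outer merge keeps plants that appear on only one of the two lists...
merged = cats.merge(dogs, on="Scientific Name", how="outer")
# ...then we fill the resulting NaNs (assumption: absence from a list means
# the plant was not flagged toxic for that animal).
merged[["Toxic to Cats", "Toxic to Dogs"]] = merged[["Toxic to Cats", "Toxic to Dogs"]].fillna(False)
print(merged)
```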

Photo by Sereja Ris on Unsplash

Step 2 — Shallow Cleaning

Next, we can begin shallow cleaning, which involves looking over the dataset, deciding which key features we want to use, and standardizing their format.

The (dirty) dataframe. Notice the inconsistent use of spp./sp. and the missing Family.

We currently have Name, Alternative Names, Scientific Name, and Family as well as our toxicity columns, all from scraping the ASPCA website with BeautifulSoup. As we will be basing our image collection on a Google image search, we decided that searching on the exact scientific name of each plant would be best for getting images that were as specific as possible. Names such as “Pearly Dots”, “Elephant Ears”, “Fluffy Ruffles” and “Pink Pearl” will quickly return results that aren’t the plants we’re looking for.

Looking through that series, we notice there are slight differences in capitalization, in the use of sp./spp./species to denote a species, and in cv./var. for a cultivar. Based on these observations, we write a few quick functions to apply to the series to standardize the data for further cleaning.
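A sketch of what one such standardization function could look like — the exact substitutions in the article’s notebook may differ, and lumping var. in with cv. follows the observation above:

```python
import re

def standardize_name(name: str) -> str:
    """Normalize a scraped scientific name: unify species/cultivar
    abbreviations and fix capitalization. Illustrative, not the notebook's
    exact implementation."""
    name = name.strip()
    # Unify the various "species" abbreviations to a single form
    name = re.sub(r"\b(sp|spp|species)\b\.?", "spp.", name, flags=re.IGNORECASE)
    # Unify cultivar/variety markers
    name = re.sub(r"\b(cv|var)\b\.?", "cv.", name, flags=re.IGNORECASE)
    # Genus capitalized, specific epithet lowercase
    parts = name.split()
    if parts:
        parts[0] = parts[0].capitalize()
        parts[1:] = [p if p in ("spp.", "cv.") else p.lower() for p in parts[1:]]
    return " ".join(parts)

print(standardize_name("aloe Vera sp"))  # -> "Aloe vera spp."
```

Applying this across the series (`df["Scientific Name"].apply(standardize_name)`) gives a consistent base for the deeper cleaning that follows.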

Step 3 — Deep Cleaning via Cross-referencing

Looking closely at our scientific names (and having iterated through the data a few times), we find that a lot of the names are outdated synonyms for more accepted species, or are misspelled. This will cause issues in both image collection and later on, when training a model to identify identical images that possess different labels.

One Google search later, we find the World Flora Online database, an open-access, web-based compendium of the world’s plant species. It lists both synonyms and accepted species names and is regularly updated by ‘Taxonomic Expert Networks’ — perfect for cross-referencing our unreliable scientific names. The WFO provides its data as a .txt file, which we can read in and compare against the data scraped from the ASPCA plant toxicity lists.

Data from the WFO — regularly updated by ‘Taxonomic Expert Networks’.

As a first step, we will do a left merge onto our data from the ASPCA, keeping all our classes and adding on any data that matches the exact scientific names we currently have. Our goal is to update every plant in our database to its most up-to-date accepted scientific name, so we do a quick sort by taxonomic status (taxonomicStatus), and drop any duplicates, keeping the first entry (which will be the Accepted one, if it exists).
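This merge–sort–dedupe step can be sketched with toy frames — the column names follow the WFO dump, but the data itself is made up for illustration:

```python
import pandas as pd

# A name can appear several times in WFO (once per taxonomic status), so after
# a left merge we sort so that "Accepted" comes first alphabetically and keep
# only the first row per plant.
aspca = pd.DataFrame({"scientificName": ["Aloe vera", "Hedera helix"]})
wfo = pd.DataFrame({
    "scientificName": ["Aloe vera", "Aloe vera", "Hedera helix"],
    "taxonomicStatus": ["Synonym", "Accepted", "Accepted"],
    "family": ["Asphodelaceae", "Asphodelaceae", "Araliaceae"],
})

merged = aspca.merge(wfo, on="scientificName", how="left")
merged = (merged.sort_values("taxonomicStatus")   # "Accepted" sorts before "Synonym"
                .drop_duplicates("scientificName", keep="first")
                .sort_index())
print(merged)
```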

Step 3.1 — Fix Typographical Mistakes with String Matching

Many scientific names refer to the same species but are off by a few letters due to typos in the ASPCA database. Let’s use the SequenceMatcher from difflib to quantify string distances (using a form of Gestalt Pattern Matching) and spot these errors by comparing each entry that is not accepted against entries in the WFO database. We can sort the dataframe and compare only against the scientific names that begin with the same letter to save time. If a name is similar enough, we hold on to it and ultimately return the closest match. Here we’ve set the threshold to 0.9 to avoid any incorrect matches.
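A sketch of the matching function — the first-letter pre-filter described above is omitted here for brevity, and the candidate list is a toy stand-in for the WFO names:

```python
from difflib import SequenceMatcher

def closest_match(name, candidates, threshold=0.9):
    """Return the candidate most similar to `name` (Gestalt Pattern Matching
    ratio), or None if nothing clears the threshold."""
    best, best_score = None, threshold
    for cand in candidates:
        score = SequenceMatcher(None, name, cand).ratio()
        if score >= best_score:
            best, best_score = cand, score
    return best

wfo_names = ["Lavandula angustifolia", "Nephrolepis exaltata"]
# The typo'd ASPCA entry resolves to the correctly spelled WFO name
print(closest_match("Lavendula angustifolia", wfo_names))
```

With a 0.9 threshold, a one-letter typo in a 20-odd-character name still clears the bar, while genuinely different names fall well short.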

We also define a function to fix the problematic entries in our data, which will update their scientific name, family, genus and taxonomic status to the (correct) corresponding entry in the WFO database.
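A hypothetical `fix_name` along these lines — the real function lives in the linked notebook, so the signature and column names here are assumptions:

```python
import pandas as pd

# Overwrite one plant's scientific name, family, genus, and taxonomic status
# with the values from the matching WFO row.
def fix_name(df, wrong_name, wfo_row):
    cols = ["scientificName", "family", "genus", "taxonomicStatus"]
    df.loc[df["scientificName"] == wrong_name, cols] = [wfo_row[c] for c in cols]
    return df

df = pd.DataFrame({"scientificName": ["Lavendula angustifolia"],
                   "family": ["?"], "genus": ["?"], "taxonomicStatus": ["Unknown"]})
correct = {"scientificName": "Lavandula angustifolia", "family": "Lamiaceae",
           "genus": "Lavandula", "taxonomicStatus": "Accepted"}
df = fix_name(df, "Lavendula angustifolia", correct)
print(df)
```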

Now, we can loop through our data, searching for name matches and correcting their corresponding dataframe entry on the spot.

This process helps us catch mistakes that would otherwise require intensive checking and a high level of domain knowledge:

Even knowing there was a mistake and looking at the names side by side it can be hard to spot the typos that were in the database.

Step 3.2 — Manual Cleanup of Unidentified Species

Unfortunately, many of these unidentified species don’t have an entry in the database that is sufficiently close for me to feel comfortable with automatic fixing. Hence, we do some manual fixes for the remaining unknowns. Thankfully, the above code has reduced the number of samples that need manual attention to only 50 or so entries, and we can re-use our fix_name function from before to fix these entries, based on the correct ones we find on Google.

Photo by Catherine Heath on Unsplash

Step 3.3 — Match Synonymous Scientific Names

Now that the scientific names have all been corrected, we still need to standardize them, as scientific names can change over time due to updated research (leading to the Synonym label in the Taxonomic Status column). If a scientific name is a synonym for an accepted one, we’d like to use the accepted one for our future Google Image search.

Multiple labels (synonyms under taxonomicStatus) for the same class (scientificName) would be a very bad thing later on.

Thankfully, the WFO database contains an acceptedNameUsageID field which points at the accepted name for a given synonymous scientific name. We can use this to look up the accepted scientific name and pass it into the fix_name function.
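The lookup can be sketched like this — toy data, though Sansevieria trifasciata resolving to Dracaena trifasciata is a real example of a synonym pointing at an accepted name:

```python
import pandas as pd

# Toy WFO slice: each row has a taxonID, and synonym rows point at the accepted
# row through acceptedNameUsageID. Resolving a synonym is then a simple lookup.
wfo = pd.DataFrame({
    "taxonID": ["wfo-001", "wfo-002"],
    "scientificName": ["Sansevieria trifasciata", "Dracaena trifasciata"],
    "taxonomicStatus": ["Synonym", "Accepted"],
    "acceptedNameUsageID": ["wfo-002", None],
}).set_index("taxonID")

def accepted_name(name):
    row = wfo[wfo["scientificName"] == name].iloc[0]
    if row["taxonomicStatus"] == "Synonym":
        # Follow the pointer to the accepted row
        return wfo.loc[row["acceptedNameUsageID"], "scientificName"]
    return name

print(accepted_name("Sansevieria trifasciata"))
```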

Step 3.4 — Finish Off

Now, we’ve corrected typos (automatically and manually) and matched up the remaining synonyms with their most up-to-date accepted names. All that’s left is to clean up the dataframe for image downloading.

Whew! This process took quite a few iterations to get the approach right. However, making sure we have clean data to work from before we build our image database is crucial before spending time training a model.

A few interesting takeaways from the final pet plant toxicity dataframe:

  • 33 out of 110 plant families are neither entirely toxic nor entirely non-toxic.
  • 7 out of 350 plant genera are neither entirely toxic nor entirely non-toxic.
  • Only two types of plants show species-specific toxicity: lilies for cats and walnuts for dogs!

Step 4 — Downloading Images

The first step in downloading images is obtaining the URLs of each image we would like to grab. To do this, we’re adapting a Selenium-based method from a post by Fabian Bosler. Selenium is a portable framework for testing web applications, and nearly all interactions with a website can be simulated. The Selenium webdriver acts as our virtual browser and can be controlled through Python commands. Here, we use a script that searches Google Images for a given query and grabs only the thumbnail URLs, as we are going to be collecting a lot of images. A catch is that many of Google’s image thumbnails are stored as base64-encoded images. We’d like to grab these too so we don’t miss out on any images with high relevance, as the further along we go in the search results, the worse the images become for training purposes.
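The base64 catch can be handled with a small stdlib helper that tells data URIs apart from plain URLs — a sketch under the assumption that encoded thumbnails arrive as `data:image/...;base64,...` strings (the payload below is a made-up stub, not a real image):

```python
import base64

def thumbnail_bytes(src):
    """Decode a base64 data URI into raw image bytes; return None for a
    plain http(s) URL, which the normal downloader handles instead."""
    if src.startswith("data:image"):
        _, encoded = src.split(",", 1)
        return base64.b64decode(encoded)
    return None

# A minimal made-up payload standing in for a real JPEG thumbnail
data_uri = "data:image/jpeg;base64," + base64.b64encode(b"\xff\xd8\xff").decode()
print(thumbnail_bytes(data_uri))
print(thumbnail_bytes("https://example.com/plant.jpg"))
```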

Great! Now we have a way to scrape Google Images for pictures! To download our images, we’re going to leverage a function from fast.ai v2, download_images, which… downloads images. However, we’re going to dive into the source code and upgrade it a little to hash the images as they come in, and ignore/delete any duplicates so we end up with consistent sets of unique images. We will also allow it to decode and download base64-encoded .jpg and .png images, which is how Google Images stores many of its thumbnails.
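The deduplication idea boils down to something like this — a simplified, in-memory sketch rather than the actual modified download_images:

```python
import hashlib

def keep_unique(images):
    """Hash each image's bytes as it arrives and skip any payload we've
    already seen, so each class folder ends up with unique images only."""
    seen, unique = set(), []
    for data in images:
        digest = hashlib.md5(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(data)
    return unique

downloads = [b"img-A", b"img-B", b"img-A"]  # the third payload is a duplicate
print(len(keep_unique(downloads)))  # 2
```

In the real version the hashing happens as each file is written to disk, and duplicates are deleted rather than never stored, but the logic is the same.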

Now, we can loop over each of our scientific plant names to collect URLs for each of them, then download those images while validating that each one is unique. Each set of images is downloaded into its own folder in my linked Drive on Colab. One thing to note is that the number of URLs to grab needs to be significantly larger than the number of images you ultimately want, due to the large number of duplicate images present on Google Images.

After downloading, we take steps to ensure that each folder contains the correct number of unique images. See the linked Github repo for more details and code.

So, at this stage we’ve put together a dataframe with useful label information as well as downloaded unique images for each of our classes. Those images are neatly separated into their own folders and all placed directly in our Google Drive. It is important to note that if you want to use those images for training a CNN, you will get massive speedups if you bring those images into the local Colab environment before using them, but that will be discussed more in the next post.

Final Thoughts

Building an image database from scratch for an image classification project is straightforward for simple toy examples (see fast.ai v2’s book for a good example, a brown/black/teddy bear classifier). For this project, I wanted to take the same approach but apply it to a much larger set of classes. The process can really be broken down into a few steps:

  1. Get a list of classes
    Scraping tabular or text data from a web page is very straightforward thanks to BeautifulSoup, and typically only requires a little bit more processing either through regular expressions or built-in python methods.
    Cleaning and validating the accuracy of the data you download is the biggest challenge in this step. When we have 10 classes and domain knowledge, it’s easy to spot errors and fix them before proceeding. When we have 500 classes and don’t know which of ‘Lavendula angustifolia’ or ‘Lavandula angustifolia’ is correct, things get much harder. A separate source of data that we can validate against is crucial. In this case we trust the toxicity information in the ASPCA data, but not the scientific names it supplies, which we corrected using the WFO database and its up-to-date taxonomic information.
  2. Get a list of image URLs for each class
    Here, Selenium is very flexible. I recommend looking at this post by Fabian Bosler for more ideas on how a Selenium webdriver can be used. Anything you could do manually, the webdriver can emulate. We can perform searches, find the thumbnails and download those, or even click through to larger images and download those if we wanted higher-resolution images.
  3. Download each image into a labelled folder
    The fastai download_images function works well, but a major stumbling point is the downloading of duplicate images. If you want more than a handful (10–15) of images and download every result of a Google Image search, you very quickly end up with a large proportion of duplicates. Additionally, the function wasn’t able to handle the base64-encoded images that many thumbnails are stored as. Thankfully, fastai provides their source code, which could be modified to interpret encoded images as well as download http links, hash each image as it comes in, and only keep unique images.

Photo by Kevin on Unsplash

What’s Next?

I’ll be taking a look at how the number of images we use makes a difference, as well as putting together a filter to take out some images with artificial (text, pictures etc.) components we may have inadvertently grabbed through this automated download process. Then, onto building and training a CNN image classification model using the fast.ai architecture!
