Classifying Pet-Safe Plants with fast.ai

In Part 1: Building a Database, we scraped the web for information on plants and how toxic they are to pets, cross-referenced the fields against a second database, and finally downloaded unique images for each class through Google Images. In this part, we will train baseline neural networks (using the new fast.ai framework) to identify the species of plant from a picture. We’ll then assess how good the dataset we’ve put together is for training a neural network, and look for ways to improve it.
The main goals herein are to compare the effects of changing the number of images per class, and to see how we can compare each training run fairly by controlling randomness.
I’ve found fast.ai to be an extremely useful framework (sitting atop the PyTorch libraries) for diving right into machine learning. An analogy Jeremy Howard (founding researcher at fast.ai) has used compares it to learning how to play football: do we want to study the precise physics and mechanics of kicking a ball, or get amongst it and learn as we go? The latter has been much more engaging, and we can pick up important bits of theory as we go thanks to the vast resources available online. One such example is fast.ai’s own book, which gives a very good overview of the process and contains a particularly interesting chapter on data ethics.
Table of Contents

1. A Training Baseline
   1.1 – Imports and Seeding Randomness
   1.2 – Loading Data into Colabs
   1.3 – DataBlock and DataLoader(s)
   1.4 – Stratified Splitting
   1.5 – Creating the DataBlock and DataLoader
   1.6 – Creating the Learner
2. Training the Model(s)
   2.1 – How do we choose the Learning Rates?
   2.2 – An Example Training Run
3. Comparing Training Runs
   3.1 – Why aren’t more images better?
1. A Training Baseline
1.1 – Imports and Seeding Randomness
fast.ai is being rapidly updated, which calls for installing specific package versions for reproducibility. We’re going to simply import everything we need (including things we’ll use later) into the global namespace for ease of use.
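As a rough sketch of the setup cell (the pinned version shown and the exact import list are assumptions, not the original post’s), this looks something like:

```python
# Pin a specific fastai version for reproducibility; the version shown is
# illustrative, not necessarily the one used when this post was written.
# In a Colab cell: !pip install fastai==2.2.5

# Import everything into the global namespace, as is common in fastai notebooks.
from fastai.vision.all import *
import pandas as pd
from sklearn.model_selection import StratifiedKFold
```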
In order to compare training runs, we need to control the sources of randomness present in the system (augmentations, splitting, etc.). While there is a lot of discussion on this topic, I’ve found that for Colab, calling the following function just before creating the DataBlock allows for reproducible results even between kernel restarts. You’ll also need to set num_workers = 0 (or 1) in your DataLoader, but it is 0 by default and we will not be changing it herein.
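A commonly used seeding function looks roughly like the sketch below; the exact body used here may differ, but the idea is to seed every random number generator in play (Python, NumPy, PyTorch CPU and GPU) and force cuDNN into deterministic mode.

```python
import random
import numpy as np
import torch

def random_seed(seed_value, use_cuda=True):
    "Seed every source of randomness we rely on (Python, NumPy, PyTorch)."
    random.seed(seed_value)             # Python's built-in RNG
    np.random.seed(seed_value)          # NumPy (used by fastai and sklearn)
    torch.manual_seed(seed_value)       # PyTorch CPU RNG
    if use_cuda:
        torch.cuda.manual_seed_all(seed_value)      # all GPU RNGs
        torch.backends.cudnn.deterministic = True   # force deterministic kernels
        torch.backends.cudnn.benchmark = False      # disable non-deterministic autotuning

# Call this just before creating the DataBlock / DataLoaders
random_seed(42)
```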
Note that results will change if you call any functions that consume the seeded randomness (e.g. learn.dls.show_batch()) or if you change the GPU used (e.g. Tesla P100 vs V100 on Colab). This can mean you need to factory reset the runtime and reconnect until the same GPU is provided.
1.2 – Loading Data into Colabs
Right now, the data is saved in individual class-labelled folders in Google Drive, with 150 images (.jpgs) per class. Colab can directly link to your Google Drive, but simply pointing your learner at the drive and proceeding with training will significantly slow down the process due to the constant need to transfer images for each batch of training.
To get around this, we can recursively copy the folder containing each class subfolder directly into the current kernel (however, this can be relatively slow if you have a large number of small files). Recursive means that cp copies the contents of directories, and if a directory has subdirectories, they are copied too.

If this method is too slow, we can first download the files, zip them up and upload that zipped file to Google Drive. Then, every time we need the data, we can simply download that single file and unzip it in the kernel.
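A minimal sketch of both options, assuming hypothetical paths (plant_images and plant_images.zip in the root of My Drive), could look like this:

```python
from google.colab import drive
import shutil, zipfile

drive.mount('/content/drive')

# Option 1: recursively copy the folder of class subfolders into the local kernel
shutil.copytree('/content/drive/MyDrive/plant_images', '/content/plant_images')

# Option 2: if that is too slow, keep a single zipped archive on Drive and unzip it here
# with zipfile.ZipFile('/content/drive/MyDrive/plant_images.zip') as z:
#     z.extractall('/content')
```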
Either way, the work we put in here to ensure our images are present on the Colabs kernel will save a lot of time during training.
1.3 – DataBlock and DataLoader(s)
Now, our data is present directly in the kernel and can be easily accessed during training. To use the data, fast.ai has developed a flexible system called the DataBlock API. At a high level, the DataBlock simply serves as a list of instructions for building batches and our DataLoaders. This is discussed in more detail in Chapter 5 of the fastai book.
Our DataBlock will be built from the following pieces (a code sketch follows the list):

- blocks = (ImageBlock, CategoryBlock): blocks specifies the independent and dependent variable types using a tuple of built-in blocks. In this case, we are passing in images and looking to get out categories.
- splitter = stratifiedsplitter: splitter defines what function to use to split the data into a train set and a validation set. stratifiedsplitter is a simple custom function that will split based on a column in a DataFrame (as we will see later), but any type of split can be defined, from random splits to splits based on folder locations or names.
- get_x = get_x: get_x defines what function to use to get the list of images in our dataset.
- get_y = get_y: get_y defines what function to use to create the category labels for our dataset.
- item_tfms = item_tfms: item_tfms are snippets of code that run on each individual item. fastai includes many predefined transforms, and this step is typically used to standardize the size of each image.
- batch_tfms = batch_tfms: batch_tfms are applied to a whole batch as a single combined operation on the GPU. This preserves image definition and reduces the number of artifacts compared to performing the operations individually and interpolating multiple times. This is typically where image augmentation steps such as resizing and rotations are defined.
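Putting those pieces together, the DataBlock looks roughly like the sketch below. The helpers stratifiedsplitter, get_x and get_y are defined in the following sections; the specific transform settings shown (a 460 px presize and a 224 px final size) are illustrative defaults rather than the exact values used here.

```python
dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),   # images in, category labels out
    splitter=stratifiedsplitter,          # custom split read from an is_valid column
    get_x=get_x,                          # returns an image path from a DataFrame row
    get_y=get_y,                          # returns the class label from a DataFrame row
    item_tfms=Resize(460),                # presize each item individually on the CPU
    batch_tfms=aug_transforms(size=224) + [Normalize.from_stats(*imagenet_stats)],
)
```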
1.4 – Stratified Splitting
The splitter defines which of the images will be in the training dataset and which in the validation dataset. There are many ways to approach this, but herein we prepare a function that looks at the path holding all the class folders and returns a DataFrame that pairs a class to each image.
Now that we have this DataFrame, we can choose exactly how we’d like to do our splitting. To enable future k-fold validation, let’s prepare a way to generate stratified folds, which preserve the percentage of samples for each class. We do this with the help of sklearn’s StratifiedKFold class, passing in the appropriate columns of our DataFrame as X and y.
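A sketch of these two helpers is shown below. The function names and column labels (Class, Path, is_valid) follow the text, but the bodies are reconstructed rather than copied from the original notebook, and img_path is a hypothetical local dataset folder.

```python
img_path = Path('/content/plant_images')   # hypothetical local dataset folder

def create_path_df(path):
    "Walk the class folders under path and pair each image with its class."
    records = [{'Class': p.parent.name, 'Path': p} for p in get_image_files(path)]
    return pd.DataFrame(records)

def stratified_split(df, n_splits=5, seed=42):
    "Mark one stratified fold as the validation set via an is_valid column."
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    _, valid_idx = next(skf.split(df['Path'], df['Class']))   # take the first fold
    df = df.copy()
    df['is_valid'] = False
    df.loc[valid_idx, 'is_valid'] = True
    return df

df_cnn = stratified_split(create_path_df(img_path))
```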
Great! Now we have a DataFrame (df_cnn in the above example) that contains the Class, Path and an is_valid label.
Additionally, as we’d like to compare how training goes between datasets with different numbers of images as fairly as possible, we don’t want to use a random train/validation split each time. If we did so, the differences in images used for training and validation would inherently change the model performance. To control for this, we first do a shuffled stratified split on all the images (see a below). Then, after we remove any images from the dataset, we do an inner join onto the previously defined split so that any remaining images stay in the same training or validation set as before (creating a pseudo-stratified split, see b below, as we can’t guarantee the exact same ratio of images remains in each set).
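In pandas terms, the pseudo-stratified split is just an inner join onto the master split. The snippet below is a sketch assuming a split_path folder holding the full image set and an img_path holding the (possibly reduced) dataset in use, with matching Path values; if the folders differ, joining on the file name works the same way.

```python
master_df = stratified_split(create_path_df(split_path))   # 'master' split over all images
subset_df = create_path_df(img_path)                       # the dataset actually in use
# Keep only images present in both, preserving their original train/valid assignment
df_cnn = subset_df.merge(master_df[['Path', 'is_valid']], on='Path', how='inner')
```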
Fixing all randomness is not always a good idea: seeing high variation between different runs can be a hint that something is wrong, and natural variation in scores can help you achieve even better results if you use cross-validation.

1.5 – Creating the DataBlock and DataLoader
As previously mentioned, the splitting is defined within the DataBlock for the DataLoader, and here we use a get_dataloader function to automate the whole process.

This function begins with the definition of get_x, get_y, splitter, item_tfms and batch_tfms. Here, get_x and get_y tell our DataBlock to look at the appropriate columns in a DataFrame to find the image paths and labels respectively. As discussed, splitter identifies the images as training or validation using the ‘is_valid’ column.
For our transforms, we follow a presizing strategy, where item_tfms resizes each image to a dimension significantly larger than the target training dimensions, and batch_tfms composes all of the common augmentation operations (including a resize to the final target size) into one combined operation for the GPU.
Note that the batch_tfms here use the base aug_transforms defined by fastai, which apply flip, rotate, zoom, warp and lighting transforms, then normalize using the ImageNet stats. We add a clause for the addition of a random erasing transform, which will be discussed later. Finally, each of these snippets is put to use in the definition of the DataBlock, as discussed previously.
We then check if a split_path to a set of images with which we should prepare a ‘master’ stratified split has been defined, and if so, apply an inner join of the images in our img_path to define our pseudo-stratified split. Otherwise, we generate a stratified split for the images in img_path. This process encompasses the previous two functions (create_path_df and stratified_split), taking the parameters defined previously.
The output of the function is a fastai DataLoaders object, pairing a training and a validation DataLoader. Each DataLoader represents a Python iterable over a dataset with extended functionality, supporting things such as automated batching, shuffling and multi-process data loading.
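A condensed sketch of such a get_dataloader function is given below. The parameter names and default values are assumptions reconstructed from the text, and ColReader/ColSplitter stand in for the get_x, get_y and stratifiedsplitter helpers described above.

```python
def get_dataloader(img_path, split_path=None, size=224, bs=64, presize=460,
                   random_erasing=False, seed=42):
    get_x = ColReader('Path')
    get_y = ColReader('Class')
    splitter = ColSplitter('is_valid')            # split on the is_valid column
    item_tfms = Resize(presize)                   # presize larger than the training size
    batch_tfms = aug_transforms(size=size) + [Normalize.from_stats(*imagenet_stats)]
    if random_erasing:
        batch_tfms += [RandomErasing(p=0.5)]      # optional extra augmentation

    # Build the split DataFrame: reuse a master split if one was provided
    if split_path is not None:
        master_df = stratified_split(create_path_df(split_path), seed=seed)
        df = create_path_df(img_path).merge(
            master_df[['Path', 'is_valid']], on='Path', how='inner')
    else:
        df = stratified_split(create_path_df(img_path), seed=seed)

    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock), splitter=splitter,
                       get_x=get_x, get_y=get_y,
                       item_tfms=item_tfms, batch_tfms=batch_tfms)
    return dblock.dataloaders(df, bs=bs)
```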
1.6 – Creating the Learner
Again, to make things more convenient for us down the line, we’re going to wrap everything we’ve written so far into a function that defines the parameters and feeds them through the fastai convenience function cnn_learner to finally get out our learner.

Here, we set a few parameters for our default learner. We’ll use 224×224 px images with a batch size of 64. A pretrained ResNet34 architecture will be used, a classic and reliable neural network. The parameters of the optimization function we’re using (Adam) are then defined in the body of the function, before calling the random_seed function to fix all randomness. We also set up a range of useful callbacks that save the results as a .csv and show the training and validation losses in a live plot during training. Finally, we create the learner using the fastai convenience function, which takes in all the separate items we prepared earlier, before adding an option to switch to mixed-precision training.
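A hedged sketch of create_simple_cnn_learner is shown below; the defaults follow the text (224 px images, batch size 64, ResNet34, Adam, CSV logging, a live loss plot and optional mixed precision), but the exact body and any extra parameters such as pct_images are assumptions and are omitted here.

```python
def create_simple_cnn_learner(img_path, split_path=None, size=224, bs=64,
                              arch=resnet34, seed=42, fp16=True):
    random_seed(seed)                                   # fix all randomness first
    dls = get_dataloader(img_path, split_path=split_path, size=size, bs=bs, seed=seed)
    cbs = [CSVLogger(), ShowGraphCallback()]            # save a .csv + live loss plot
    learn = cnn_learner(dls, arch,
                        opt_func=Adam,                  # fastai's Adam wrapper
                        metrics=[accuracy, top_k_accuracy],
                        cbs=cbs)
    return learn.to_fp16() if fp16 else learn           # optional mixed-precision training
```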
Phew! We’re ready for training. It was a lot of work to get to this point, and it isn’t necessary to set up things in this way. However, taking the effort to wrap things up into a single neat function will pay dividends later on as we can now easily alter a range of parameters (in a consistent way) including the number of images in the dataset and the architecture of the neural network. Doing so will make everything just that bit neater and avoid the need to copy and paste sections of code each time, reducing the probability of making mistakes.
2. Training the Model(s)
As we can see in the create_simple_cnn_learner function, we will start with a simple but robust CNN, ResNet34. Let’s first create a learner that will use only 1/3 of the dataset (50 out of 150 images for each class).
We can then take a look at the images using learn.dls.show_batch(max_n=9).

A transfer learning procedure will be used to fine-tune the network for our images, starting from the weights pretrained on ImageNet. The basic idea behind transfer learning is that the pretrained ResNet34 model created by the create_simple_cnn_learner function will already be reasonably good at identifying things that existed in the ImageNet dataset. As the images we’re using won’t be significantly different from the real-world images used in ImageNet, it makes sense that we don’t want to disturb the weights too much. The theory and decisions involved when using transfer learning can be more nuanced; see this excellent blog post for a more in-depth explanation.
Here, we will train our model with the weights in the initial layers frozen (only training the weights of the last fully connected layers), before unfreezing everything and doing a ‘fine-tune’ train on all the weights at relatively lower, discriminative learning rates. This means that the learning rates are staggered from small (in the early layers) to relatively larger (as we approach the final layers), in groups of layers defined by fastai. Intuitively, this has to do with the level of detail each layer looks at: early layers tend to pick up the broad strokes of the image, such as gradients, edges and corners, details whose weights won’t need to be re-trained to any significant degree, and vice versa for the later layers.
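A sketch of that procedure (here called baseline_fit, following the name used below) might look like this; the learning rate values are illustrative rather than the exact ones used for the runs in this post.

```python
def enumerate_params(learn):
    "Report how many parameters are currently frozen vs. trainable."
    frozen = sum(p.numel() for p in learn.model.parameters() if not p.requires_grad)
    trainable = sum(p.numel() for p in learn.model.parameters() if p.requires_grad)
    print(f'frozen: {frozen:,} | trainable: {trainable:,}')

def baseline_fit(learn, head_lr=3e-3, fine_lr=slice(1e-6, 1e-4), epochs=10):
    enumerate_params(learn)                       # only the head is trainable at first
    learn.fit_one_cycle(epochs, lr_max=head_lr)   # train the new fully connected layers
    learn.unfreeze()                              # make every layer trainable
    enumerate_params(learn)
    learn.fit_one_cycle(epochs, lr_max=fine_lr)   # small, staggered (discriminative) LRs
```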
The above code lays out the basic transfer learning procedure we will use herein to train our classifiers. Fixed learning rates and epochs will be used to compare training runs, with 10 epochs used for training the model head and a further 10 epochs for fine-tuning the whole network. Note that enumerate_params is a little function that tells us how many frozen and unfrozen parameters exist for a given learner when called.
2.1 – How do we choose the Learning Rates?
Many excellent blog posts exist that explain the importance of selecting appropriate learning rates for the problem at hand. As it isn’t the focus of this post, we won’t be going into too much detail. Here (following the advice of Leslie N. Smith, as many others do), we’ve used the built-in fastai function learn.lr_find() to plot the losses against the learning rates and pick a value a bit before the minimum, where the loss is still improving, to determine the appropriate learning rate(s) for each step of the transfer process.
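A typical (hypothetical) usage looks like this:

```python
# Plot loss against learning rate and pick a value a little before the minimum,
# where the loss is still clearly decreasing.
suggestion = learn.lr_find()
print(suggestion)
```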

In practice, we use the fit_one_cycle function, with the learning rates given representing the maximum learning rate in a one-cycle policy with cosine annealing, as can be seen below. Note that the x-axis below represents the number of batches passed through the learner.

This policy is a result of Leslie Smith’s work on hyper-parameters (learning rate, momentum and weight decay) combined with tweaks from fastai, offering fast results when training complex models. Intuitively, we can think of the policy as beginning with a lower learning rate to warm up the training. As training progresses, the learning rate increases and the momentum decreases, encouraging the optimizer to quickly investigate new areas of the loss function; this acts as a regularization method and typically lands in areas with flatter minima (which generalize better). In the final part of the 1-cycle policy, the decreasing learning rate allows the optimizer to settle into a steeper local minimum within that flatter area. See this excellent post by Nachiket Tanksdale for a more detailed explanation.
2.2 – An Example Training Run
After running baseline_fit(learn), a single training run will end up looking something like this.

We can see the usefulness of the ShowGraphCallback(), which can give us an indication of when overtraining is occurring. Typically we might look for divergence between the training and validation loss, and we should be particularly careful if the validation loss starts to increase.
3. Comparing Training Runs
Now that we have everything set up, we can start to do some interesting comparisons. Let’s start with comparing how our top 5 accuracy varies as we change the number of images (using our pct_images parameter while creating the learner) from our data collection in Part 1.

While the total training time scales linearly with the number of images, we see that the top 5 accuracy of the models experiences a drop-off when 150 images are used per class. Uh-oh. Typically we’d expect more images to give us a better result!
3.1 – Why aren’t more images better?
The problem has to do with the quality of images we’ve downloaded to create our database. Recall that each folder of images was downloaded off Google Images based on the scientific name of each plant. Let’s take a look at an early search result vs. a later one from a random class, say Peperomia peltifolia.

Ah ha! Many more than 150 search results have to be looked at before we can find 150 unique images per class (430+ for just this one class). As we dive deeper and deeper into the search results, they become less and less relevant. This results in images like drawings, graphs and fact sheets, which are all poor examples for teaching our model to do what we want it to do (classify a plant based on a natural photo). Indeed, our comparison of training results suggests that trying to use more than 100 images per class starts to harm our model due to the inclusion of many poor training examples.
We’d like to do some cleaning up of these images without having to manually examine each of the files in the 500+ folders. Join us in Part 3: Targeting and Removing Bad Training Data, as we look at a few approaches for doing exactly that.