
Kaggle: Man vs Machine

We train an AutoML image classification model for Kaggle's latest competition. See how it ranks against human Data Scientists.

Image licensed to author

Kaggle recently (end Nov 2020) released a new Data Science competition, centered around identifying diseases on the Cassava plant – a root vegetable widely farmed in Africa.

"As the second-largest provider of carbohydrates in Africa, cassava is a key food security crop grown by smallholder farmers because it can withstand harsh conditions. At least 80% of household farms in Sub-Saharan Africa grow this starchy root, but viral diseases are major sources of poor yields. With the help of data science, it may be possible to identify common diseases so they can be treated." From Kaggle.com Cassava Leaf Desease Classification

The challenge – train a multi-label Image Classification model to classify images of the Cassava plant into one of five labels:

  • Labels 0,1,2,3 represent four common Cassava diseases
  • Label 4 indicates a healthy plant
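
For reference, the competition ships this mapping as label_num_to_disease_map.json in the Kaggle data. It is reproduced here from memory as a Python dict, so double-check it against the file itself:

# Label mapping from the competition's label_num_to_disease_map.json
# (reproduced from memory - verify against the file in the Kaggle data).
label_map = {
    0: "Cassava Bacterial Blight (CBB)",
    1: "Cassava Brown Streak Disease (CBSD)",
    2: "Cassava Green Mottle (CGM)",
    3: "Cassava Mosaic Disease (CMD)",
    4: "Healthy",
}
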
A healthy Cassava plant (Photo by malmanxx on Unsplash)

Google AutoML

AutoML is a no-code AI service on the Google Cloud AI Platform that uses AI to train AI. We have seen some impressive results with AutoML on many client projects, so we thought it would be interesting to see how it fared against some of the most talented human data scientists on the planet.

Google Cloud AI Platform

If you are new to Google’s Cloud AI Platform, we cover an overview of the services available in the article below.

Google Cloud AI Platform: Hyper-Accessible AI & Machine Learning

Preparing the Kaggle training data

Kaggle provided 21,000 labelled JPEG images of the Cassava plant to assist with model training. The first task was to upload these to Google Cloud Storage (GCS), so that AutoML could access the images.

To learn how to do this, see our previous article below. Once you have completed those steps, return here to see how we train the model.

Importing Kaggle Training Data into Google Cloud Storage

AutoML UI vs API Training

With a Kaggle competition like this, you can either train an AutoML Edge model directly from a Kaggle Notebook using the AutoML API (a sketch of that route follows the list below) or, as we chose, you can:

  1. Train an Edge model using the AutoML UI (an Edge model is a fully serialized model that we typically deploy to mobile and other devices).
  2. Export the Edge model as a TensorFlow Lite model.
  3. Import the TensorFlow Lite model into your Kaggle Notebook environment.
  4. Run predictions as usual using the Python TensorFlow libraries.
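
For completeness, here is a rough sketch of the API route using the google-cloud-aiplatform Python client. The project ID, region, bucket path and display names are placeholders, and the exact schema and model-type constants may vary with your client library version, so treat this as a starting point rather than a drop-in script:

# Sketch only: train an AutoML Edge image classification model via the
# AI Platform (Unified) Python client instead of the UI.
# Project, region, bucket and display names below are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="your-gcp-project", location="us-central1")

# Create an image dataset from the import CSV sitting in Cloud Storage
dataset = aiplatform.ImageDataset.create(
    display_name="cassava-leaf-disease",
    gcs_source="gs://your-bucket/upload_dir/import.csv",
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.multi_label_classification,
)

# Configure an AutoML Edge training job, optimised for accuracy
job = aiplatform.AutoMLImageTrainingJob(
    display_name="cassava-edge-high-accuracy",
    prediction_type="classification",
    multi_label=True,
    model_type="MOBILE_TF_HIGH_ACCURACY_1",  # Edge model, "Highest accuracy" option
)

# 8 node hours = 8000 milli node hours, the budget we use later in this article
model = job.run(
    dataset=dataset,
    budget_milli_node_hours=8000,
    disable_early_stopping=False,
)
print(model.resource_name)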

Training the AutoML Edge model

  1. In a browser, open the Google Cloud Console. Search for "AI Platform" and select AI Platform (Unified) from the results.
  2. Select Data sets from the menu on the left and click Create. On the screen that appears, enter a name for your data set. Then select the IMAGE tab and check the Image classification (multi-label) radio button.
Creating my new data set for training images

Finally, click CREATE.

  3. After a minute or so, you should see your dataset listed. Here is mine:
My new dataset
  4. On the screen above, click on your data set name and you will be taken to the screen below. Here, check Select import files from Cloud Storage and browse to a .csv file that contains a list of URIs to the training images in your Google Cloud bucket.

Here is an example row in the CSV we uploaded. Note that the last column (0 in this case) is the label.

gs://kaggle-cassave-234523/upload_dir/train_images/1000015157.jpg,0
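
If you need to generate this import CSV yourself, here is a minimal sketch assuming the competition's train.csv (with image_id and label columns) and a bucket path like the one above – adjust the names to match your own upload:

# Sketch: build the AutoML import CSV from Kaggle's train.csv.
# The bucket name and upload directory are placeholders - match them to
# wherever you uploaded the training images.
import pandas as pd

BUCKET_PREFIX = "gs://your-bucket/upload_dir/train_images"

train = pd.read_csv("train.csv")  # columns: image_id, label
train["uri"] = BUCKET_PREFIX + "/" + train["image_id"]

# AutoML expects one "<gcs_uri>,<label>" row per image, with no header
train[["uri", "label"]].to_csv("import.csv", index=False, header=False)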

Leave the Data split dropdown as Automatic.

Finally, click the CONTINUE button.

You will now see this screen. For this Kaggle exercise, importing the circa 20k images took around 20 minutes. You will receive an email once this is complete (a good excuse for a cup of tea).

  5. With your images imported into your data set, you should see a screen that looks like this. It’s a great place to get a feel for the images in each label set – use the handy filter option.

For example, in our Cassava competition we spent quite a bit of time analysing what a healthy plant (label 4) looked like, and noted that we thought a lot of these plants looked far from healthy!

Another useful feature: if you select the ANALYSE tab, it shows how your training images are distributed across labels. Note that in the Cassava data there is a clear imbalance, with label 3 (Cassava Mosaic Disease) dominating.

  6. With your image data imported, we can now train the model 🙂

Select Training from left sidebar and click CREATE.

In the form that appears:

Pick the dataset you created earlier from the dropdown.

Choose the default annotation set.

Annotation sets: an annotation set refers to a set of labels – in our case, this refers to the labels we uploaded alongside the Cassava plant images. If you want to, in AI Platform Unified you can actually go into a dataset, create additional label sets and label images yourself – a really handy feature (assuming your image set is small enough to make this feasible).

Check the AutoML Edge option.

Click CONTINUE.

  7. Enter a name for your trained model. We typically use the default. Expanding the Advanced Options allows you to define your training:validation:test split. Typically we start with the recommended 80:10:10.

Click CONTINUE.

  8. You are now presented with some options for optimising your model. As this is an Edge model, run on the device (e.g. a mobile phone), latency is a key factor for a typical Edge use-case. However, as we are trying to achieve the highest accuracy score for our Kaggle competition, we select Highest accuracy.

Click CONTINUE.

  9. In the final screen, you choose how long you want AutoML to spend training your model. You are charged based on the operation, use-case and training time. This is covered in detail in the official documentation below.

Pricing | AI Platform (Unified) | Google Cloud

This is the pricing table taken from the above, shown here for image data.

In our example, our operation is Training (on-device) and our task is Classification. Therefore, we are charged $4.95 per node hour.

A node hour is a measure of one hour of training performed on a single node in AutoML. For our training operation, AutoML fixes the number of nodes at 8.

We chose 8 node hours for our model, so this will take 8/8 = 1 hour to train

and cost $4.95 * 8 = $39.60.
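
The same arithmetic as a quick Python sanity check (the rate and node count are the figures quoted above – adjust them if Google's pricing changes):

# Quick cost / wall-clock estimate for an AutoML Edge image training run,
# using the rate and node count quoted above.
PRICE_PER_NODE_HOUR = 4.95   # USD, Edge image classification training
NODES = 8                    # AutoML trains Edge image models on 8 nodes
budget_node_hours = 8

wall_clock_hours = budget_node_hours / NODES
cost_usd = budget_node_hours * PRICE_PER_NODE_HOUR
print(f"~{wall_clock_hours:.0f} hr wall-clock, ${cost_usd:.2f}")  # ~1 hr, $39.60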

Make sure to Enable early stopping

Tip: Note that in this screen AutoML gives you an estimated completion time for model training. Also note that, should AutoML reach an optimal model in less node time, training will finish early and you will only be charged for the node hours used. Lastly, note that the minimum training time is 1 node hour.

Finally, click START TRAINING and AutoML will handle the rest. Exciting hey!

Enjoy cup of tea #2 while your model trains.

You will receive an email once your model is ready. Meanwhile, if you click on Training you will see your model listed, and note the status is Pending, indicating training is in progress.

Evaluating your AutoML model

Once your model has finished training comes the exciting part – seeing the level of accuracy AutoML managed to achieve.

To do this:

  1. Click Models on the left sidebar and click on your model name. You will be taken to the Model detail view. Here is mine:

The EVALUATE tab reveals your model accuracy and other metrics. We saw some promising results with our Kaggle Cassava model:

Precision 88.4% | Recall 83.3% | Average precision 0.916
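
If you want to reproduce these kinds of metrics outside the AutoML UI, here is a minimal sketch using scikit-learn with made-up arrays – our illustration of the standard calculations at a 0.5 confidence threshold, not AutoML's internal implementation:

# Sketch: precision / recall / average precision for a multi-label
# classifier with scikit-learn. y_true and y_score are placeholders -
# in practice they come from your held-out test split.
import numpy as np
from sklearn.metrics import precision_score, recall_score, average_precision_score

y_true = np.array([[0, 0, 0, 1, 0],               # one-hot ground truth, 5 labels
                   [0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 1]])
y_score = np.array([[0.1, 0.0, 0.2, 0.9, 0.1],    # model confidence per label
                    [0.0, 0.7, 0.1, 0.1, 0.0],
                    [0.2, 0.1, 0.0, 0.3, 0.6]])

y_pred = (y_score >= 0.5).astype(int)             # threshold at 0.5 confidence

print("Precision:", precision_score(y_true, y_pred, average="micro"))
print("Recall:   ", recall_score(y_true, y_pred, average="micro"))
print("Average precision:", average_precision_score(y_true, y_score, average="micro"))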

Clicking on the MODEL PROPERTIES tab gives a nice summary of our model.

Exporting your AutoML model to Tensorflow Lite

  1. Remaining in the Model view (see prior step), select the DEPLOY AND TEST tab. From here you have the option of exporting the model to a variety of different formats as well as deploying the model:

We want to use this in a Kaggle notebook via TensorFlow, so we choose TF Lite. This displays the following window. Here, we choose to export the model to a GCS bucket (we used the same bucket that contained our Kaggle training images).

Click EXPORT and then DONE.

Our model is now ready to upload to Kaggle 🙂
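
You will need the exported .tflite file on your local machine for the next section. You can grab it via the download link in the Cloud Storage browser (as we do below), or script the download – here is a minimal sketch with the google-cloud-storage client, where the project, bucket and blob path are placeholders:

# Sketch: download the exported TF Lite model from Cloud Storage.
# Project, bucket and blob path are placeholders - use the export path
# shown in the Cloud Storage browser for your own model.
from google.cloud import storage

client = storage.Client(project="your-gcp-project")
bucket = client.bucket("your-bucket")
blob = bucket.blob("model-export/icn/your-export-folder/model.tflite")  # hypothetical path
blob.download_to_filename("model.tflite")
print("Saved model.tflite")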

Ranking our Model on the Kaggle Leaderboard

So this is the really exciting part – we are going to see how our "machine made" model fares against human-made models in the Kaggle contest.

To do this, we essentially need to submit a Kaggle Notebook that fires the (unseen) test images at our exported TF Lite model, generates predictions, and outputs them for Kaggle to score.

Kaggle can then rank our machine-made model in the Kaggle leaderboard.

Excited? So were we!

Submitting the AutoML model to Kaggle

  1. First, we navigate to our GCS bucket that has our exported TF Lite model file. Save this locally to your machine (via the download link on the .tflite model in Cloud Storage).
  2. In Kaggle, click on <> Notebooks in the left sidebar and click the +New Notebook button.
  3. Select Python as the language and select Notebook (not script) and click the CREATE button.
  4. Give your notebook a name (we always prefix them with the competition name to make them easy to find later on).
  5. Next we upload our exported TensorFlow Lite model into this Notebook, ready for prediction. To do this, click on +Add data in the top right of the screen, and the following page is displayed:

Click Upload and navigate to the TF Lite model you saved locally. You should now see your model listed under Inputs in the right sidebar:

My exported AutoML model now in my Kaggle Notebook
My exported AutoML model now in my Kaggle Notebook
  6. Our final step is to run the predictions through the model.

Add the following code cell to your Kaggle Notebook. Note that the 'YOUR MODEL PATH' placeholder should be replaced with the path to your AutoML model (hover over your model as shown below to get this).

import numpy as np
import tensorflow as tf
from PIL import Image
import os
import pandas as pd
import matplotlib.pyplot as plt

# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path='YOUR MODEL PATH')
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
print(input_details)
output_details = interpreter.get_output_details()
print(output_details)

# List the (unseen) test images provided by the competition.
test_dir = '/kaggle/input/cassava-leaf-disease-classification/test_images'
test_images = os.listdir(test_dir)

df_results = pd.DataFrame(columns=('image_id', 'label'))
image_no = 0
for i in test_images:
    image = Image.open(f'{test_dir}/{i}')

    # resize image to the 224x224 input expected by the Edge model
    new_width = 224
    new_height = 224
    new_size = (new_width, new_height)
    image = image.resize(new_size, Image.ANTIALIAS)

    # convert to a uint8 array and add a batch dimension
    image_array = np.array(image)

    # show image in debug
    # plt.imshow(image_array)

    image_array = np.expand_dims(image_array, 0)

    # run the image through the TF Lite interpreter
    interpreter.set_tensor(input_details[0]['index'], image_array)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])

    # the predicted label is the index with the highest confidence
    output_array = np.array(output_data, dtype=np.float32)
    prediction = np.argmax(output_array)

    # add to dataframe
    df_results.loc[image_no] = [i, prediction]
    image_no = image_no + 1

    # print(output_array)
    # print(prediction)

df_results.to_csv('submission.csv', index=False)
!head submission.csv

Run this Notebook in Kaggle, and you should see the first rows of your CSV showing the image name and the predicted label.

Tip: Make sure you disable internet access – otherwise you won’t be able to submit this Notebook for scoring. Click on Settings in the right sidebar and untick Internet.

Finally click the Save as version button.

Submit to Kaggle

Navigate to the Kaggle competition page and click the Submit button. On the screen that’s displayed, choose the latest version of your Notebook and click the Submit button.

You will now see your model running predictions on the unseen test data:

My AutoML model, running predictions using Kaggle’s test images

How did our AutoML model do against the humans?

The big reveal 🙂

So, our model took about 20 (agonising) minutes to process Kaggle’s test images and for Kaggle to grade our model predictions.

And our official Kaggle grade?

87.3% accuracy

Where did that place our machine-made model against human competitors?

Well, as of 28 Nov 2020, our machine-made AutoML model ranked 559 out of 766 in the Kaggle leaderboard for the Cassava competition.


Conclusions

Our machine-made model scored an impressive 87.3% accuracy, achieved with zero data science knowledge and an investment of under $40 to train the model. The model was really easy to train; in fact, the hardest part of the exercise was coding the submission Notebook in Kaggle (hopefully the detailed walkthrough will encourage others to try it out).

Our model was ranked 559 out of 766. This may seem fairly low, but there are a few things to note:

  1. This is an Edge (TF Lite) model, streamlined for mobile deployment (the model file is only just over 5 MB). Therefore, it was quite a big ask to pitch it against non-edge models.

Note: We are going to attempt to repeat the exercise using a TensorFlow.js model, an alternative export format that’s available from AutoML. Currently, we are not sure how to run this from within a Python Kaggle Notebook – these models are designed to run in a web browser – but if we do manage it, we will post an update 🙂

  2. One of the top-performing contestants in the competition revealed their Notebook solution (which achieves 90% accuracy). Unsurprisingly, many of the contestants immediately adopted this Notebook as their starting point, so the leaderboard is heavily contested around the 90% zone.

If you want to give it a try, Google Cloud offers 15 free node hours that you can use to train two Edge models in AutoML (say, one for 8 node hours and a second for 7 node hours). This is part of their generous free tier (the billing manager will automatically discount this for you).


Next Steps

  1. You can follow the progress of how AutoML does in other Kaggle competitions by following my Kaggle profile.
  2. View the Cassava Kaggle competition used in this walkthrough.
  3. Learn more about Ancoris Data, Analytics & AI
