
Importing Kaggle Training Data into Google Cloud Storage

We show you how in easy-to-follow steps.

Image licensed to author

So, you have picked your Kaggle competition, and you want to start training your model and make yourself known on the Kaggle leaderboard.

If, like us, you use Google Cloud AI Platform for your Data Science workloads, one of the first steps in a Kaggle competition is to upload the Kaggle training data into Google Cloud Storage.

We will be using this recently announced (late Nov 2020) Kaggle competition as an example.

Cassava Leaf Disease Classification

This particular competition is an Image Classification problem with circa 20k training images (jpeg files) in Kaggle. Here’s a simple way to upload these into a Google Cloud Storage bucket.

Before you start

  1. We assume you already have a Google Cloud Project. If not, you can easily create one (for free) using the link below.

GCP Free Tier – Free Extended Trials and Always Free | Google Cloud

  2. Make sure you have permissions on your Google Cloud Project to create new storage buckets in Google Cloud Storage (GCS) and upload files. You can read the GCS documentation here if you are unsure.
  3. We assume you have registered with Kaggle, and have signed up for a competition.

Copying Data from Kaggle to GCS

  1. In a browser, navigate to kaggle.com
  2. Create a new Kaggle Notebook. To do this, click <> Notebooks on the left sidebar and click + New Notebook
  3. Give your notebook a name so you can easily find it again. I called mine "Cassava – Copy Kaggle Data to GCS"

Tip: I prefix my notebooks in Kaggle with the competition name to make them easy to find later on

  4. Add the following imports in a new Notebook cell (click the +Code button to add a new cell):
import os 
from google.cloud import storage

Run the cell by clicking into the cell, and then clicking the play button that appears on the left.

  5. Add a reference in your Notebook to the Cassava Kaggle training images. To do this, click Add data.

In the window that appears, select the Competition Data tab, find the Cassava competition, and click the Add button.

You should now see the Cassava data listed under Input in the sidebar:

  6. Grant Kaggle access to your Google Cloud Storage service. To do this, select Google Cloud Services from the Add-ons menu:

In the window that appears, tick Cloud Storage and then click the Link Account button.

  7. Add a new code cell and declare the following two functions; the first will create a storage bucket, and the second will upload all files found in the source folder path to a bucket.

Replace the GCP_PROJECT_ID with your Google Cloud project id.

Once added, run the cell and confirm no errors.

storage_client = storage.Client(project='GCP_PROJECT_ID')

def create_bucket(bucket_name):
    """Create a new GCS bucket with the given name."""
    bucket = storage_client.create_bucket(bucket_name)
    return bucket

def upload_files(bucket_name, source_folder):
    """Upload every file in source_folder to the named bucket."""
    bucket = storage_client.get_bucket(bucket_name)
    for filename in os.listdir(source_folder):
        blob = bucket.blob(filename)
        blob.upload_from_filename(os.path.join(source_folder, filename))
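One caveat with the upload loop above: os.listdir also returns subdirectory names, and passing a directory to upload_from_filename will raise an error. A small helper (hypothetical, not from the original article) that collects only regular files as (local path, blob name) pairs could look like this:

```python
import os

def list_upload_pairs(source_folder):
    """Return (local_path, blob_name) pairs for regular files in source_folder."""
    pairs = []
    for filename in sorted(os.listdir(source_folder)):
        path = os.path.join(source_folder, filename)
        if os.path.isfile(path):  # skip subdirectories
            pairs.append((path, filename))
    return pairs
```

You could then iterate over these pairs in upload_files instead of the raw os.listdir output.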
  8. Add a new cell to create your new GCS bucket. Replace BUCKET_NAME with a suitable, unique name. Run the cell.
bucket_name = 'BUCKET_NAME' 
create_bucket(bucket_name)

Tip: GCS bucket names must be globally unique. An easy way to ensure this is to add a suffix of your GCP project id (which is also unique).
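As a sketch of that tip, a small helper (hypothetical, not from the original article) that builds a bucket name from a base name plus the project id, while respecting GCS naming rules (lowercase letters, digits, and hyphens; at most 63 characters):

```python
def unique_bucket_name(base, project_id):
    """Build a globally unique, GCS-legal bucket name by suffixing the project id."""
    # Bucket names must be lowercase and no longer than 63 characters.
    name = f"{base}-{project_id}".lower()
    return name[:63]
```

For example, unique_bucket_name("cassava-data", "my-gcp-project") yields a name that is unique as long as your project id is.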

  9. In a browser, open the Google Cloud Console. Type "storage" in the search bar and click Storage. Confirm your new bucket appears in the list.
My new bucket in Google Cloud Storage
  10. Finally, upload the training resources from Kaggle into your new bucket. To do this, add a new cell with the following code:

local_data = '../input/cassava-leaf-disease-classification/train_images/'
upload_files(bucket_name, local_data)

Replace the local_data path with the location of your training data. To get this, click the Copy file path button in the data panel, and then append the folder that contains the data you want to copy (mine was in train_images).

Run that cell and your data will be copied! It can take a little while, depending on your file volumes and size. The Cassava data took about 15 minutes.

Conclusion

Hopefully, you found this walkthrough useful.

Since Google acquired Kaggle in 2017, we have seen tighter integration with Google Cloud; for example, you can now submit models trained in AutoML. Keep an eye out for a follow-up article, where I will do just that using Google Cloud AI Platform Unified and AutoML.

Next steps

  1. In Part 2, we will show how to train a model for this competition using Google Cloud AI Platform Unified and AutoML.
  2. View the Cassava Kaggle competition used in this walkthrough.
  3. Learn more about Ancoris Data, Analytics & AI
