Downloading Kaggle datasets directly into Google Colab

Instructions to use the Kaggle API to download and work with data entirely on Google’s virtual machine.

Anna
Towards Data Science


A data scientist leaving behind the limited world of her personal computer and embarking on a journey to the virtual machine. (Photo by The New York Public Library on Unsplash)

There are many perks to working in Google Colaboratory, or "Colab" for short, for data science projects. For those like me who are plugging away on an older laptop, the main highlight is free access to hardware acceleration (GPU/TPU) for larger projects.

For the Phase 4 Project of my data science boot camp, I set out to build an image classifier to aid in diagnosing pneumonia from chest X-rays. Excited to embark on my first computer vision assignment (cue laser eyes), I hurried over to Kaggle to download the dataset. My computer informed me it would take 4 hours to download the zip file of almost 6,000 JPEGs. Oh no. As I watched the "estimated time remaining" continue to increase, I began to fill with dread about what this boded for the upcoming model training times. As the greyscale people trapped in infomercials would say: "There's gotta be a better way!"

Enter Google Colab. Set up to run Python notebooks in a similar style to Jupyter, Colab is optimized for heavy processing needs by providing access to hosted GPUs/TPUs (see more here about runtimes). All you need to get started is a Google account. Your Colab notebooks will be saved in your Google Drive, but to access your files from Colab you will need to connect to (or "mount") your Drive in your notebook at the start of each session.

The virtual machine operates out of ‘/content’. This is where your Colab working data needs to be. Screenshot by author.

Colab's virtual machine operates out of the '/content' folder (shown in the screenshot above). Any data you wish to work with during your session should be stored in this folder.

To keep the servers available for the maximum number of users, there are usage limits at both the free and paid tiers:

For Free Users

The kernel will time out after 30 minutes of inactivity, meaning all variables will need to be reloaded.

The entire contents of the virtual machine ('/content') will be cleared after 12 hours. Thus, the data will need to be reimported at the start of each 12-hour stretch.

For PRO (Paid) Users

For about $10 a month, the kernel inactivity time-out increases to 90 minutes, and the contents of the virtual machine are cleared after 24 hours instead of 12.

Alright, I’ve got my driving gloves on and seatbelt buckled. I am ready to go full throttle. But how do I get my data into Colab when I can barely get it onto my computer? One solution is to download the data directly from Kaggle into Colab — bypassing my local computer entirely.

Since Colab clears everything in '/content' after 12/24 hours, I will download the dataset into my Google Drive for storage (and access from anywhere). I will also add some code to the top of my Colab project notebook to load it in from Drive at the start of each session. The most efficient way to do this is to keep the file on Drive in its compressed, zipped state. Even though Drive and the Colab virtual machine are both Google services, they do not operate on the same servers, so you will need to copy the zip file from Drive to the virtual machine ('/content') at the start of each new session and unzip it there. Otherwise, you would have to load in each individual file from Drive over the mounted connection, and didn't we come here for speed?

Once everything has been configured properly it is quite simple and quick to use. Below is the code to get it set up.

Setting up Kaggle API access

Download your API token from your Kaggle account page. Screenshot by author as demo.
  1. First, you must collect your Kaggle API access token. Navigate to your Kaggle profile 'Account' page and scroll to the 'API' section (just below your email preferences), where you will find the button to 'Create' your API token. Click the button to download your token as a JSON file (kaggle.json) containing your username and key.
  2. Create a folder for Kaggle in your Google Drive. Save a copy of the API token (keeping the file name kaggle.json) as a private file in this folder so you can access it easily from the virtual machine. If you prefer to do this without leaving Colab, see the sketch after step 4 below.
  3. Fire up a Colab notebook and establish a connection to your Drive. You will be prompted to log in to your Google account and copy a validation token each time you connect to Drive from Colab. After establishing the connection, you will see a new folder appear in the sidebar as '/gdrive' (or whatever you choose to name it).
# Mount Google Drive
from google.colab import drive
drive.mount('/gdrive')

4. Configure a 'Kaggle environment' using the os module. This stores the location of your API token as an environment variable (KAGGLE_CONFIG_DIR), so the Kaggle library knows where to find your credentials. When you run Kaggle terminal commands (in the next step), the virtual machine will be linked to your account through your API token. Pointing to a private directory in your Drive ensures that your token remains hidden from anyone you may choose to share notebooks with.

# Import os for navigation and environment set up
import os

# Check current location; '/content' is the Colab virtual machine
os.getcwd()

# Enable the Kaggle environment: point to the directory your Kaggle API JSON is stored in
os.environ['KAGGLE_CONFIG_DIR'] = '../gdrive/MyDrive/kaggle'
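If you would rather create the Kaggle folder and stage the token from step 2 without leaving Colab, the cell below is a minimal sketch. It assumes Drive is already mounted at '/gdrive' as above and that you upload the kaggle.json file from your computer through the Colab file picker.

# Optional: stage kaggle.json from inside Colab instead of through the Drive web interface
from google.colab import files

files.upload()  # select the kaggle.json downloaded from your Kaggle account page

# Create the private Drive folder if needed, copy the token there, and restrict its permissions
os.makedirs('/gdrive/MyDrive/kaggle', exist_ok=True)
!cp kaggle.json /gdrive/MyDrive/kaggle/
!chmod 600 /gdrive/MyDrive/kaggle/kaggle.json

# Sanity check: the token should now sit where KAGGLE_CONFIG_DIR points
os.path.exists('/gdrive/MyDrive/kaggle/kaggle.json')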

Downloading data from Kaggle into Google Drive

This can be done one time in a separate notebook. Once downloaded, the data can be stored on your Drive for as long as you need it.

  1. Install the Kaggle library to enable Kaggle terminal commands (such as downloading data or kernels, see official documentation).
!pip install kaggle
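As an optional sanity check (assuming KAGGLE_CONFIG_DIR has already been set as shown above), you can confirm the library is installed and able to authenticate; the search term here is just an example:

# Confirm the Kaggle CLI is installed
!kaggle --version

# A simple authenticated call: search public datasets matching a keyword
!kaggle datasets list -s pneumonia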

2. Go to the Kaggle page for the dataset (or competition) you wish to download. Copy the pre-formatted API command from that page (for example, this X-ray image set).

Copy the pre-formatted Kaggle API command by clicking the vertical ellipsis to the right of 'New Notebook'. Screenshot by author.

3. Navigate into the directory where you would like to store the data. Paste the API command. Keep the file zipped to save space on your Drive and to import to Colab more efficiently.

# Navigate into the Drive folder where you want to store your Kaggle data
os.chdir('../gdrive/MyDrive/kaggle')

# Paste and run the copied API command; the data will download to the current directory
!kaggle datasets download -d paultimothymooney/chest-xray-pneumonia

# Check the contents of the directory; you should see the .zip file for the dataset in your Drive
os.listdir()
Screenshot by author as demo.

Loading data from Drive into Colab

When you are ready to use the data in Colab, you will need to copy it from its long-term storage location on your Drive ('/gdrive/') to the temporary virtual machine ('/content') at the start of each session.

  1. In the project notebook where you will be using the data, mount a connection to Drive as shown above.
  2. Copy the compressed file to the virtual machine and unzip it to access the data. Do not remove the original zip from your Drive after copying; you will need to copy it again at the start of each session while working in Colab.
# Complete path to the storage location of the .zip file of data
zip_path = '../gdrive/MyDrive/kaggle/chest-xray-pneumonia.zip'

# Check the current directory (be sure you're in the directory where Colab operates: '/content')
os.getcwd()

# Copy the .zip file into the present directory
!cp '{zip_path}' .

# Unzip quietly
!unzip -q 'chest-xray-pneumonia.zip'

# View the unzipped contents in the virtual machine
os.listdir()

You will be able to use this unzipped data with the Colab accelerated GPU/TPU for up to 12 hours (if free access) or 24 hours (if PRO). You can reload it from Drive as many times as needed.
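From here the images can be fed straight into your modeling library. Below is a minimal sketch using tf.keras; the 'chest_xray/train' path reflects how this particular archive happens to unzip, so check the os.listdir() output and adjust the path, image size, and batch size to your own dataset and model.

# Minimal sketch: point tf.keras at the unzipped images on the virtual machine
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    '/content/chest_xray/train',   # folder of class subdirectories from the unzipped archive
    image_size=(224, 224),         # resize on load; pick a size that matches your model
    batch_size=32,
)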

Conclusion

I was able to train TensorFlow CNN models for image classification in minutes using Google Colab. Downloading the data directly into Google Drive allowed me to spare memory on my machine and bypass a treacherous download time. Once stored on Drive, it was easy to import the data into Colab. All the time saved could be spent trying out additional model architectures or browsing the internet for pygmy versions of your favorite animals. Goodbye, tedious fit times of the past; hello, accelerated machines!

Thank you for reading my post! View my boot camp journey and my complete computer vision analysis on my GitHub.
