Downloading OpenImages Dataset in Google Drive using Colab

Downloading your Custom Dataset

Published in

Towards Data Science

4 min read · Jul 24, 2020

https://storage.googleapis.com/openimages/web/index.html

Want to train your Computer Vision model on a custom dataset but don't want to scrape the web for the images? Try out OpenImages, an open-source dataset of ~9 million varied images with 600 object categories and rich annotations, provided by Google. The dataset contains image-level label annotations, object bounding boxes, object segmentations, visual relationships, localized narratives, and more.

You can download only the categories you are interested in instead of downloading the whole dataset. In this article, we will download some selected classes/categories using Google Colab and save the data in Google Drive. You can also use a Jupyter notebook on your local runtime.

For this article, you need a Google account and some free space in Google Drive to store the data.

1. Mount your Google Drive in Colab.
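The mounting step can be sketched as follows. This is a minimal sketch, assuming the usual Colab mount point /content/drive; the helper name mount_drive is my own and not part of any library. Outside Colab the import fails, so the helper just returns None.

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return the mount point or None."""
    try:
        from google.colab import drive  # only importable inside a Colab runtime
    except ImportError:
        print("Not running in Colab; skipping the Drive mount.")
        return None
    drive.mount(mount_point)  # asks for authorization the first time
    return mount_point

# mount_point is an assumption: /content/drive is the conventional location in Colab
mounted_at = mount_drive()
```

Once mounted, your Drive files typically appear under the mount point, so you can change into a Drive folder before creating the data directory in the next step.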

2. Create a folder where your data should be stored. Here, I have created a directory named OpenImages.

!mkdir OpenImages

3. Change the working directory to the folder you have just created. Confirm your working directory using the “!pwd” command.

%cd OpenImages

4. Run the commands below to download the annotation files for train, test, and validation, along with the class-description file.

# Download required meta-files
!wget https://storage.googleapis.com/openimages/2018_04/class-descriptions-boxable.csv
 
!wget https://storage.googleapis.com/openimages/2018_04/train/train-annotations-bbox.csv
 
!wget https://storage.googleapis.com/openimages/2018_04/validation/validation-annotations-bbox.csv
 
!wget https://storage.googleapis.com/openimages/2018_04/test/test-annotations-bbox.csv
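In the class file downloaded above, each row is a label ID followed by a readable class name. A minimal sketch of looking up the ID for a class name is shown here. It uses a small in-memory sample instead of the real file, and the two IDs in it are placeholders I made up for illustration, not real Open Images label IDs. The helper names are hypothetical as well.

```python
import csv
import io

# Stand-in for class-descriptions-boxable.csv: rows of "label ID,class name".
# The IDs below are made-up placeholders, not real Open Images label IDs.
sample_csv = "/m/aaa111,Sink\n/m/bbb222,Ice cream\n"

def load_class_ids(csv_file):
    """Return a {class name: label ID} dict built from the rows of the class file."""
    return {row[1]: row[0] for row in csv.reader(csv_file) if len(row) >= 2}

def class_id(name_to_id, cli_name):
    """Translate a class name written with underscores into its label ID."""
    return name_to_id[cli_name.replace("_", " ")]

name_to_id = load_class_ids(io.StringIO(sample_csv))
print(class_id(name_to_id, "Ice_cream"))  # prints /m/bbb222
```

With the real file, you would pass open("class-descriptions-boxable.csv", newline="") instead of the io.StringIO sample.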

5. Save downloadOI.py in your working directory, either from GitHub (https://github.com/spmallick/learnopencv/blob/master/downloadOpenImages/downloadOI.py) or by copying the code from that link and creating the file directly, as shown below. (Thanks to Satya Mallick/learnopencv for this code repository. Do check it out if you are interested in Computer Vision and OpenCV.)

%%writefile downloadOI.py
# The %%writefile cell magic above must be the first line of the cell; it writes the rest of the cell to downloadOI.py.
# Author : Sunita Nayak, Big Vision LLC

#### Usage example: python3 downloadOI.py --classes 'Ice_cream,Cookie' --mode train

import argparse
import csv
import subprocess
import os
from tqdm import tqdm
import multiprocessing
from multiprocessing import Pool as thread_pool

cpu_count = multiprocessing.cpu_count()

parser = argparse.ArgumentParser(description='Download Class specific images from OpenImagesV4')
parser.add_argument("--mode", help="Dataset category - train, validation or test", required=True)
parser.add_argument("--classes", help="Names of object classes to be downloaded", required=True)
parser.add_argument("--nthreads", help="Number of threads to use", required=False, type=int, default=cpu_count*2)
parser.add_argument("--occluded", help="Include occluded images", required=False, type=int, default=1)
parser.add_argument("--truncated", help="Include truncated images", required=False, type=int, default=1)
parser.add_argument("--groupOf", help="Include groupOf images", required=False, type=int, default=1)
parser.add_argument("--depiction", help="Include depiction images", required=False, type=int, default=1)
parser.add_argument("--inside", help="Include inside images", required=False, type=int, default=1)

args = parser.parse_args()

run_mode = args.mode

threads = args.nthreads

classes = []
for class_name in args.classes.split(','):
    classes.append(class_name)

with open('./class-descriptions-boxable.csv', mode='r') as infile:
    reader = csv.reader(infile)
    dict_list = {rows[1]:rows[0] for rows in reader}

subprocess.run(['rm', '-rf', run_mode])
subprocess.run([ 'mkdir', run_mode])

pool = thread_pool(threads)
commands = []
cnt = 0

for ind in range(0, len(classes)):
    
    class_name = classes[ind]
    print("Class "+str(ind) + " : " + class_name)
    
    subprocess.run([ 'mkdir', run_mode+'/'+class_name])

    command = "grep "+dict_list[class_name.replace('_', ' ')] + " ./" + run_mode + "-annotations-bbox.csv"
    class_annotations = subprocess.run(command.split(), stdout=subprocess.PIPE).stdout.decode('utf-8')
    class_annotations = class_annotations.splitlines()

    for line in class_annotations:

        line_parts = line.split(',')
        
        #IsOccluded,IsTruncated,IsGroupOf,IsDepiction,IsInside
        if (args.occluded==0 and int(line_parts[8])>0):
            print("Skipped %s" % line_parts[0])
            continue
        if (args.truncated==0 and int(line_parts[9])>0):
            print("Skipped %s" % line_parts[0])
            continue
        if (args.groupOf==0 and int(line_parts[10])>0):
            print("Skipped %s" % line_parts[0])
            continue
        if (args.depiction==0 and int(line_parts[11])>0):
            print("Skipped %s" % line_parts[0])
            continue
        if (args.inside==0 and int(line_parts[12])>0):
            print("Skipped %s" % line_parts[0])
            continue

        cnt = cnt + 1

        command = 'aws s3 --no-sign-request --only-show-errors cp s3://open-images-dataset/'+run_mode+'/'+line_parts[0]+'.jpg '+ run_mode+'/'+class_name+'/'+line_parts[0]+'.jpg'
        commands.append(command)
        
        with open('%s/%s/%s.txt'%(run_mode,class_name,line_parts[0]),'a') as f:
            f.write(','.join([class_name, line_parts[4], line_parts[5], line_parts[6], line_parts[7]])+'\n')

print("Annotation Count : "+str(cnt))
commands = list(set(commands))
print("Number of images to be downloaded : "+str(len(commands)))

list(tqdm(pool.imap(os.system, commands), total = len(commands) ))

pool.close()
pool.join()

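6. The OpenImages directory should now contain downloadOI.py, class-descriptions-boxable.csv, and the three annotation files (train-annotations-bbox.csv, validation-annotations-bbox.csv, and test-annotations-bbox.csv).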
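A quick way to confirm this before running the script is to check for each file by name. The sketch below assumes the file names from steps 4 and 5; the helper name missing_files is hypothetical.

```python
import os

# File names taken from steps 4 and 5 of this article.
EXPECTED_FILES = [
    "downloadOI.py",
    "class-descriptions-boxable.csv",
    "train-annotations-bbox.csv",
    "validation-annotations-bbox.csv",
    "test-annotations-bbox.csv",
]

def missing_files(directory="."):
    """Return the expected files that are not present in the given directory."""
    return [
        name for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(directory, name))
    ]

# An empty list means everything is in place in the current working directory.
print(missing_files())
```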

7. Install the AWS command-line tool, which the script uses to fetch the images. Without it, the download fails with the “sh: 1: aws: not found” error.

!pip install awscli

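8. Download the data for the classes you want. Here I am downloading two classes, Sink and Toilet. You can pass any number of comma-separated classes; all the class names are listed in “class-descriptions-boxable.csv”. For a class name that contains a space, replace the space with an underscore (for example, Ice_cream).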

# Download Sink and Toilet images for test
!python3 downloadOI.py --classes "Sink,Toilet" --mode test
# Download Sink and Toilet images for train
!python3 downloadOI.py --classes "Sink,Toilet" --mode train
# Download Sink and Toilet images for validation
!python3 downloadOI.py --classes "Sink,Toilet" --mode validation

There are three mode options to choose from:

a) --mode train: To download the training data

b) --mode test: To download the test data

c) --mode validation: To download the validation data

The downloaded data is organized as follows.

Each category folder holds the image files, and each image has a text file of the same name containing its annotations.

Filepath: OpenImages/train/Sink/01681d52ad599ab4.jpg
Image File: 01681d52ad599ab4.jpg
TextFile: 01681d52ad599ab4.txt
TextFile Content : Sink,0.249529,0.420054,0.659844,0.682363

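The .txt file holds the bounding-box coordinates, one line per object, so it has multiple entries when the image contains multiple objects of that class. The script copies the values from the annotation CSV in its column order: XMin, XMax, YMin, YMax, each normalized to the range 0 to 1.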
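To use these annotations for drawing or training, you usually need pixel coordinates. The sketch below converts one annotation line, assuming the four numbers are XMin, XMax, YMin, YMax (the column order of the Open Images bounding-box CSV that the script copies from) and that they are normalized to the image size. The helper name and the 1024x768 image size are made up for illustration.

```python
def parse_annotation_line(line, image_width, image_height):
    """Convert one 'class,XMin,XMax,YMin,YMax' line into a pixel-space box."""
    name, x_min, x_max, y_min, y_max = line.strip().split(",")
    return {
        "class": name,
        "left": round(float(x_min) * image_width),
        "right": round(float(x_max) * image_width),
        "top": round(float(y_min) * image_height),
        "bottom": round(float(y_max) * image_height),
    }

# Example line from this article; the image size is a hypothetical 1024x768.
box = parse_annotation_line("Sink,0.249529,0.420054,0.659844,0.682363", 1024, 768)
print(box)
```

In practice you would read the real width and height from the image file and call the helper once per line of the .txt file.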

Note: Every time you run this command, the script deletes the existing folder for the chosen mode (train, test, or validation) before downloading. If you want to keep previously downloaded data, run the command from a different folder or move the old data elsewhere first.
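One way to protect earlier downloads is to rename the existing mode folder before running the script again. This is only a sketch of that idea; the helper name and the timestamped backup naming are my own choices, and the demonstration runs in a temporary directory so no real data is touched.

```python
import os
import shutil
import tempfile
import time

def back_up_mode_folder(mode, base_dir="."):
    """Rename base_dir/<mode> to <mode>_backup_<timestamp>; return the new path or None."""
    source = os.path.join(base_dir, mode)
    if not os.path.isdir(source):
        return None  # nothing has been downloaded for this mode yet
    stamp = time.strftime("%Y%m%d_%H%M%S")
    target = os.path.join(base_dir, "%s_backup_%s" % (mode, stamp))
    shutil.move(source, target)
    return target

# Demonstration with a throwaway directory that mimics train/Sink.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "train", "Sink"))
backup_path = back_up_mode_folder("train", base_dir=demo_dir)
print(backup_path)
```

In the notebook, you would call back_up_mode_folder("train") from the OpenImages directory before running the download command for the train mode again.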

That’s all. Play around with it and download the custom data to train your custom computer vision model.

GitHub link to the code: https://github.com/mringupt/Colab/blob/master/Downloading_OpenImages_Custom_Dataset.ipynb

Written by MRINALI GUPTA