
Version Control your Large Datasets using Google Drive

Making reproducible datasets possible

Photo by David Pupaza on Unsplash

MLOps has recently gained the limelight in the Machine Learning community. With so many experiments, tracking, managing, and orchestrating them alongside other components has become an important topic of discussion. In particular, datasets keep changing as you run your experiments: you may be scaling features, creating new ones, eliminating some, or just taking a sample of the entire dataset. A naming convention like data_feat1.csv, data_scaling.csv, data_v2.csv, etc. is fine for the first few experiments but becomes painful as you scale up.

So I thought we could version control datasets using Git. But GitHub does not allow files larger than 100 MB to be pushed or version controlled. For example, here I am taking the very famous MNIST dataset and trying to version control it.

Image from author
$ git add *
$ git commit -m "Adding files"
$ git push -u origin master

As you can see from the screenshot above, GitHub does not accept pushes containing files larger than 100 MB. I came across two options to try:

  1. Using Git Large File Storage (suggested in the terminal output)
  2. Using DVC for version control of the dataset.
Image created by Author

I did not go with Git LFS (Large File Storage) because if you are also using DVC to track your models, metrics, etc., it is easier to keep everything under one system. DVC provides several options for storing your datasets, such as Amazon S3 and Azure Blob Storage, but my favorite is Google Drive. DVC has designed the system so that the entire process integrates seamlessly once you start using it. The following steps show how to do it:

Step 1: Initialize the repository:

Initialize git and DVC in your local directory/location where you have the dataset currently. Go to the terminal and type the following:

$ git init
$ dvc init

Note: dvc init needs to be run in the exact location where the .git directory is located; otherwise DVC won't be able to find the Git repository. Here we will version the training data, which is approximately 105 MB.

Step 2: Stage the files to be committed:

$ dvc add *
$ ls -alh

You will see a new file created called train.csv.dvc. Commit the new files to Git using the following:

$ git add .gitignore train.csv.dvc
$ git commit -m "Adding training data"

If you open this .dvc file, you will see an md5 hash and a path: it is a metafile linked to the data, and it is what Git uses to track the dataset.
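For reference, the metafile contents look roughly like this (the hash and size below are illustrative placeholders, not the real values for this dataset):

```yaml
outs:
- md5: 1a79a4d60de6718e8e5b326e338ae533
  size: 110000000
  path: train.csv
```

This tiny text file is all that Git needs to commit; the 105 MB train.csv itself stays out of Git history.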

At this point the commit exists only in your local Git repo. The data files themselves have not yet been pushed or stored anywhere (Google Drive, in our case).

Step 3: Push files to Google Drive:

The next step is to define the storage location where you want the files to be stored. You can use the following command to do that:

$ dvc remote add -d storage gdrive://auth_code_folder_location

Then commit the configuration change:

$ git commit .dvc/config -m "Config remote storage"
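After these two commands, the Git-tracked .dvc/config file contains roughly the following (the url is whatever location you passed to dvc remote add):

```ini
[core]
    remote = storage
['remote "storage"']
    url = gdrive://auth_code_folder_location
```

Because this config is committed, anyone cloning the repo gets the remote definition for free.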

You can push the data to the storage location using the following:

$ dvc push
Image by author

You may get an error like the one above if you don't have the pydrive2 package installed:

$ pip install pydrive2

Click on the link and authenticate, granting DVC access to push files to your Google Drive. Once you allow it, you will get a verification code that needs to be entered into the terminal, something like below:

Once you enter the code, you will see the files being pushed to Google Drive.

Image created by author

Step 4: Push your meta information to Github:

Now all the tracking information is version-controlled in your local Git repo. You can push it to your GitHub repo with the following commands in your terminal:

$ git remote add origin <your repo>
$ git push -f origin master

You should see all the changes now on your GitHub repo:


Now, let's say you want to access the data from another location or computer. You need to have DVC installed there, and you can follow the steps below to download the data locally.

First, to check whether the data exists, just enter the following:

$ dvc list <your repo>
Image by the author

But before downloading the data, you need to clone the repository:

$ git clone <your repo>

After the repo is cloned, run dvc pull. It will prompt with a URL that needs to be authenticated, as done before. See the screenshot below:

Finally, you will see the files downloaded to your local machine. Voila!

Thank you for taking the time to read this article. I hope you enjoyed the content and found it helpful. Please feel free to comment with any suggestions or thoughts.


Follow me on Twitter or LinkedIn. You may also reach out to me via [email protected]


Please sign up for a Medium membership if you want to enjoy reading articles without any limits. Medium will share a portion of the fee with me for any members who sign up using the above link. Thanks!

