Level Up with Google’s Public Datasets

Connect Jupyter Notebooks to terabytes of data

Photo by Vasily Koloda on Unsplash

I remember my first data science project: analyzing a 6.5MB file on ten thousand movies. It was magical to find patterns in the data with a few keystrokes, and soon I was on the hunt for other datasets to analyze. Going through Kaggle and eventually generating unique data through web crawling, I found that most downloadable datasets were small (<1GB) and that collecting and working with large datasets required a different type of skill: data engineering.

Most aspiring data scientists only know how to analyze clean, comma-separated files that fit in RAM (usually <8GB). In practice, data science is much messier: you deal with terabytes of data in different formats coming from different data streams. So how does an aspiring data scientist gain exposure to big data?

To get you started with big data, we will explore:

  1. Diving into Big Data with Google Cloud Public Datasets
  2. What is BigQuery?
  3. How to Connect Your Jupyter Notebook to BigQuery

Diving into Big Data with Google Cloud Public Datasets

It turns out that Google hosts a repository of most publicly available data, such as [air quality](https://console.cloud.google.com/marketplace/product/openaq/real-time-air-quality?filter=solution-type:dataset&q=air%20quality), the US census, Reddit, and flight data. If there’s a public dataset, Google probably has it hosted and available for you to query. What’s more, new datasets (e.g. Covid-19) are constantly being added and updated.

With the data hosted on BigQuery, you can easily query a dataset to see if it’s interesting. For example, I can query and answer the question: "How many political ad campaigns on Google targeted women in Florida?"
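For instance, here is a rough sketch of what that query could look like. The `google_political_ads` public dataset is real, but the exact column names and filter values below are assumptions for illustration; check the table’s schema in the BigQuery console before running it.

```python
# Sketch of the question above as a BigQuery SQL string. The
# google_political_ads dataset is public, but the column names and filter
# values here are assumptions -- verify them against the table schema in
# the BigQuery console. You can paste the SQL into the console directly,
# or run it with the Python client we set up later in this post.
query = """
    SELECT COUNT(DISTINCT ad_id) AS num_ads
    FROM `bigquery-public-data.google_political_ads.creative_stats`
    WHERE gender_targeting LIKE '%Female%'
      AND geo_targeting_included LIKE '%Florida%'
"""
```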

Explore other datasets available on Google Public Datasets!

What is BigQuery?

BigQuery is Google’s serverless data warehouse. Getting at the data is easy, since the first 1 terabyte of queries per month is free. As one Spotify engineer puts it:

> What I love about BigQuery is how fast it is.

Combining that speed with a repository of more than 100 public datasets, we can quickly explore a dataset to decide whether it’s worth a deeper analysis.

Since the datasets are hosted in the same data warehouse, we can join two public datasets for a unique analysis. For example, we can predict demand for NYC taxis based on the weather forecast, as sketched below. Google even wrote a nice blog post about it: How to forecast demand with Google BigQuery, public datasets and TensorFlow.
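As a rough sketch, a join between the NYC taxi trips and NOAA weather datasets could look like the following. The table names, column names, and the weather station ID are assumptions, so verify them in the BigQuery console before running.

```python
# Illustrative sketch: line up daily NYC yellow-cab trip counts with NOAA
# GSOD weather observations for 2016. Table/column names and the station
# ID (725030 is assumed to be LaGuardia) should be verified in the console.
query = """
    WITH trips AS (
        SELECT DATE(pickup_datetime) AS day, COUNT(*) AS num_trips
        FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2016`
        GROUP BY day
    ),
    weather AS (
        SELECT DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS day,
               temp, prcp
        FROM `bigquery-public-data.noaa_gsod.gsod2016`
        WHERE stn = '725030'
    )
    SELECT t.day, t.num_trips, w.temp, w.prcp
    FROM trips t
    JOIN weather w USING (day)
    ORDER BY t.day
"""
```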

How to Connect Your Jupyter Notebook to BigQuery

Now that we’ve found data that we want to analyze on BigQuery, let’s bring it to our local Jupyter Notebook environment. We can take advantage of BigQuery’s API to load the data into a dataframe.

The first step is to install the BigQuery client library, which can be done using pip or conda.

pip install --upgrade google-cloud-bigquery

or

conda install -c conda-forge google-cloud-bigquery
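If you want a quick sanity check that the installation worked, import the library and print its version:

```python
# Confirm the BigQuery client library is importable.
from google.cloud import bigquery

print(bigquery.__version__)
```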

Next, we’ll need to create a service account key for our Jupyter Notebook to use when authenticating to Google Cloud.

  1. In the Cloud Console, go to the Create service account key page.
  2. From the Service account list, select New service account.
  3. In the Service account name field, enter a name (e.g. pythonBigQuery).
  4. From the Role list, select Project > Owner. You can view and change this field later in the Cloud Console. If you are developing a production app, specify more granular permissions.
  5. Click Create. A JSON file containing your key will download to your computer.

With the JSON key downloaded, we can authenticate our Jupyter Notebook to connect to BigQuery. We import the os library and set our Google credentials to the downloaded key. As shown below, I can now query and retrieve a BigQuery table as a dataframe, with column names and data types inherited from the table!
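Here is a minimal sketch of that notebook code; the key path is a placeholder for wherever your JSON file was saved.

```python
import os

from google.cloud import bigquery

# Point the client library at the service account key we just downloaded.
# The path below is a placeholder for wherever your JSON key was saved.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/pythonBigQuery-key.json"

client = bigquery.Client()

# Query a public table into a pandas DataFrame. Column names and dtypes are
# inherited from the BigQuery table schema (to_dataframe() requires pandas).
query = """
    SELECT *
    FROM `bigquery-public-data.baseball.schedules`
    LIMIT 1000
"""
df = client.query(query).to_dataframe()
df.head()
```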

df.head() for bigquery-public-data.baseball.schedules

Google’s official documentation: BigQuery API Client Libraries

Conclusion

Google Public Datasets is a good first stop for exploring big data. By taking advantage of cloud computing and warehousing, we can explore terabytes of data within seconds. And with natively supported APIs, we can easily integrate data on BigQuery into our Jupyter Notebooks!

