
Google Trends as Machine Learning Features with BigQuery

Written with Co-Author: Daniel Shin, MSCS at Stanford

Photo by Firmbee.com on Unsplash

Sometimes, as engineers and scientists, we think of data only as bytes in RAM, matrices on GPUs, and numeric features that go into our predictive black box. We forget that they represent changes in real-world patterns.

For example, when real-world events and trends arise, we tend to turn to Google first to find related information (e.g., where to go for a hike, or what term X means) – which makes Google Search Trends a very good source of data for interpreting and understanding what is going on live around us.

This is why we decided to study the interplay between Google Search trends and other temporal data: could search trends be used as features for a temporal machine learning model, and what insights can we draw from them?

⚠️ A word of caution

In this project, we looked at how Google Trends data could be used as features for time series or regression models. We chose crime rate as our point of interest (response variable) because it is one of the large temporal datasets on Google Cloud Platform, with millions of rows, which fits our purpose of using BigQuery.

However, we note that whenever we work with a dataset dealing with sensitive topics, it's important to be impartial and not make assumptions with pre-emptive bias regarding protected traits (e.g., age, sex, and race). We included a section at the end that we encourage you not to skip: A discussion on Fairness in AI ⚖️️.

Additionally, for this project, we focused primarily on narcotics-related crimes (e.g., drug trafficking) for our ML model. Even so, we think it's best not to deploy similar models in the real world without guardrails and further research on their social ramifications. This study is purely for exploratory and educational purposes, and should be treated as such.

⚙️ Setting Up: Using BigQuery with Google Colab

BigQuery is a serverless data warehouse with many public datasets available. It can be a cool place to try out machine learning on massive datasets, many of which are regularly updated under bigquery-public-data. In this section, we'll go through how to set up BigQuery.

We first authenticate ourselves in Google Colab with the same account that holds our BigQuery credits. Everyone gets around 1 TB worth of queries per month, which is enough for our purposes (each of our optimized queries requires about ~200 MB). Students may also get some educational credits.

# Authenticate yourself into Google Colab
from google.colab import auth
auth.authenticate_user()
project_id = "YOUR_PROJECT_ID_IN_BIGQUERY"
# create a BigQuery client with the same account as Colab
from google.cloud import bigquery
client = bigquery.Client(project=project_id)

Then, we can convert an IPython notebook cell into a SQL cell as follows, which stores the output as a pd.DataFrame in variable_name.

%%bigquery variable_name --project $project_id

SELECT *
FROM `bigquery-public-data.chicago_crime.crime`
LIMIT 5

📚 Co-occurrence: Google Trends feature suggestions using WordNet

Our initial thought was to use the google_trends.top_rising_terms dataset and do a LEFT JOIN on date-shifted columns. However, we ran into the problem of the dataset being dominated by current events and names unrelated to our response variable.

SQL statement for selecting Chicago search trends, updated on Dec 3rd, 2022, by Author
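A sketch of such a query is below; the column names (term, week, rank, percent_gain, dma_name) assume the published schema of the google_trends public dataset and may need adjusting.

%%bigquery chicago_rising_terms --project $project_id

SELECT term, week, rank, percent_gain
FROM `bigquery-public-data.google_trends.top_rising_terms`
WHERE dma_name LIKE 'Chicago%'
ORDER BY week DESC, rank ASC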

Then, we tried fetching terms using the unofficial PyTrends API for Google Trends. However, we were quickly hit with rate limits, so we had to be more selective with our data. Our only option was to download the data by manually querying each word on the Google Trends site and combining the word datasets with a script. So, we focused on obtaining the correct word features.

One of the intuitions we had was that terms describing substances would have regional variations and different names. For this problem, we used WordNet to generate a set of words that have "high co-occurrence" with some terms for common street drugs.

But Why Use NLTK?

The inspiration for our usage of WordNet comes from "co-occurrence"-based recommendation models (e.g., for YouTube videos or Amazon catalogs), which often run batched co-occurrence computations to rank related items. This is similar to what masked language models (e.g., BERT) seek to represent.

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet

all_words = set()

def find_word_features(word):
    # Collect all lemma names (synonyms) across the word's WordNet synsets
    words = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            words.add(lemma.name())
    print(words)
    all_words.update(words)
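
For example, calling it on one of our seed terms pulls in WordNet's slang lemmas for that word (output abbreviated; exact lemmas depend on the WordNet version):

find_word_features("ecstasy")
# {'ecstasy', 'XTC', 'go', 'disco_biscuit', 'hug_drug', ...}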

We then downloaded and combined the trend data with some simple scripting. After that, we fitted a number of linear regression models, extracted each model's R-squared value, and ranked the words by their correlation with the crime rate one month later (see later for why).
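
A minimal sketch of this ranking step, assuming a joined DataFrame with one column of monthly trend scores per word and a crime_count column shifted by one month (all names here are ours, not from the original scripts):

import pandas as pd
from sklearn.linear_model import LinearRegression

def rank_terms_by_r2(df: pd.DataFrame, terms: list, target: str = "crime_count"):
    scores = {}
    for term in terms:
        X = df[[term]]                      # one word's trend score
        y = df[target]                      # crime count one month later
        model = LinearRegression().fit(X, y)
        scores[term] = model.score(X, y)    # R-squared of the fit
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)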

Top words with high correlation with crime-rate after 1 month, by Author

We were very surprised to find that some of these words had a really high R² value, especially the word "go." The word was generated from its co-occurrence with "ecstasy," but in the end, we assumed it simply correlated with phrases like "where to GO to get …" in Google searches.

🥞 Joining Training Datasets

Joining the datasets requires some degree of data manipulation, whose technical details we won't get into. But we note that we applied both a 1-month and a 2-month shift to the aggregated crime count, since we believe that for a temporal correlation to exist, there has to be a time lag between searching and our response variable.

Time lag can be a hyperparameter that we tune later.

%%bigquery narcotics_year_month_two_months_lag --project $project_id

WITH shifted_date_tbl AS (
  SELECT
    *,
    CAST(DATE_SUB(CAST(date AS DATETIME), INTERVAL 2 MONTH) AS TIMESTAMP) AS shifted_date
  FROM `bigquery-public-data.chicago_crime.crime`
),
date_count AS (
  SELECT
    CONCAT(
      CAST(EXTRACT(YEAR FROM shifted_date) AS STRING), '-',
      CAST(EXTRACT(MONTH FROM shifted_date) AS STRING)
    ) AS year_month,
    COUNT(*) AS crime_count
  FROM shifted_date_tbl
  WHERE primary_type = 'NARCOTICS'
  GROUP BY year_month
  ORDER BY year_month DESC
)

SELECT *
FROM `your-project-id.crime_search_trends.chicago_narcotics` AS chicago_narcotics,
     date_count
WHERE year_month = Month

We don't want to do any feature engineering yet at this point, because we want to visualize the correlations before deciding how to optimize each word's distribution.

For example, perhaps we want to take a log or a shift of the data, or cast a feature's distribution into quantiles. But we would first have to visualize the distribution.

📈 Correlation Analysis for Feature Selection

We analyzed somewhere between 30 and 50 words, generated both by WordNet and by our own intuition. Many of them follow a positive trend, such as "jet," but we were surprised to find some with a negative correlation, such as "pot."
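
A minimal sketch of how scatter plots like the ones below can be produced, assuming the joined DataFrame df from the previous step with one column per trend term (names are ours):

import numpy as np
import matplotlib.pyplot as plt

x, y = df["jet"], df["crime_count"]
slope, intercept = np.polyfit(x, y, 1)    # least-squares fit line
plt.scatter(x, y, alpha=0.6)
plt.plot(x, slope * x + intercept, color="red")
plt.xlabel('"jet" trend score')
plt.ylabel("narcotics crime count")
plt.show()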

"Jet" trend score Vs. Narcotics crime count, by Author
"Jet" trend score Vs. Narcotics crime count, by Author
"Pot" trend score Vs. Narcotics crime count, by Author
"Pot" trend score Vs. Narcotics crime count, by Author

We were very curious as to why this was the case. Our original thought was that it had to do with the decriminalization of marijuana in our region of interest (Chicago, where it happened in 2020), but it turns out this was not the case, even when we only correlated data from before 2015.

"pot" trend Vs. Narcotics crime count before 2015, by Author
"pot" trend Vs. Narcotics crime count before 2015, by Author

Only after plotting the trends for "positively correlated words" side by side with "negatively correlated words" such as "pot" did we see the pattern. See the graph below.

Downward trend for narcotics crime and other terms with uptick of "pot" search trend, by Author

A plausible explanation is simply that, over time, narcotics-related crimes and arrests have been steadily decreasing, while search trends and popular culture reflect an increasingly positive attitude toward marijuana – hence the negative correlation.

This means that, yes, the word would be a great feature for predicting crime because of its negative correlation, but not for the reasons we initially thought. This goes to show that our initial interpretations of, and schemas around, features may be skewed and perhaps even wrong.

🌐 Geographical Analysis

We also did some geographical analysis, using heat maps of search term locations and the locations where crimes were committed.

%%bigquery crime_time_data --project $project_id

SELECT longitude, latitude
FROM `bigquery-public-data.chicago_crime.crime`
WHERE primary_type = "NARCOTICS"
  AND longitude IS NOT NULL
  AND latitude IS NOT NULL
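
A minimal sketch of how a density map can be rendered from the resulting crime_time_data DataFrame (the plotting choices here are ours):

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 8))
plt.hexbin(
    crime_time_data["longitude"], crime_time_data["latitude"],
    gridsize=80, cmap="inferno", bins="log",  # log scale tames dense hotspots
)
plt.colorbar(label="narcotics incidents (log scale)")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.title("Chicago narcotics crime density")
plt.show()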

Unfortunately, there was no way to export this data from Google Trends, so we could not extrapolate and map the search data for aggregation. Otherwise, we could have mapped the location of each crime to its county or city and done a left join on the city names.

Chicago narcotics crime density map Vs. Search term density map, by Author
San Francisco narcotics crime density map, by Author

🚀 Training a predictive model

We won't go too deeply into how to optimize and select the best features for a model, since there are many guides on Medium that do that already.

Instead, in this section, we'll quickly go over the two families of models we considered: time series models and regression models.

Temporal/Time Series models are similar to what we have here, where we use historical data to predict future trends. Examples include the Facebook Prophet model and the ARIMA/SARIMA models. We think these are a good fit for problems dealing with temporal data in general, because they can take seasonality and recurring patterns into account.

You can train them directly in Colab after doing some SQL transforms, then make a single call on Google Cloud Platform using a CREATE MODEL statement.
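
A hedged sketch of such a statement, using BigQuery ML's ARIMA_PLUS model type; the source table and column names (narcotics_monthly, month_start, crime_count) are hypothetical:

%%bigquery --project $project_id

CREATE OR REPLACE MODEL `crime_search_trends.narcotics_arima`
OPTIONS(
  MODEL_TYPE = 'ARIMA_PLUS',
  TIME_SERIES_TIMESTAMP_COL = 'month_start',
  TIME_SERIES_DATA_COL = 'crime_count'
) AS
SELECT month_start, crime_count
FROM `your-project-id.crime_search_trends.narcotics_monthly`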

Training & Testing set split for Time Series models, by Author

Regression models are similar to what we did with time-shifting the dataset. We train our model to predict what is going to happen T time steps in the future, given the current state. This type of model tends to overfit and ignore trends, but if we have a massive amount of data arriving over short periods of time, we can always retrain the model on data from the last few weeks, days, or even hours.

We can train a logistic regression model using this command in BigQuery:

%%bigquery --project $project_id

CREATE OR REPLACE MODEL `crime_search_trends.narcotics_log_regression_3_terms_trend_quantile`
OPTIONS(MODEL_TYPE='LOGISTIC_REG')
AS
SELECT
  block AS feature1,
  # iucr AS feature1,
  arrest AS feature2,
  domestic AS feature3,
  # beat AS feature3,
  district AS feature4,
  ward AS feature5,
  community_area AS feature6,
  x_coordinate AS feature7,
  y_coordinate AS feature8,
  # drug terms
  NTILE(5) OVER(ORDER BY smoke) AS feature9,
  NTILE(5) OVER(ORDER BY weed) AS feature10,
  NTILE(5) OVER(ORDER BY go) AS feature11,
  smoke AS feature12,
  weed AS feature13,
  go AS feature14,
  CAST(primary_type = 'NARCOTICS' AS INT) AS label
FROM (
  SELECT *
  FROM `your_project_id.crime_search_trends.chicago_cooccur_processed` AS chicago_narcotics,
       (SELECT
          FORMAT_DATE('%Y-%m', date) AS year_month,
          FORMAT_DATE('%Y-%m-%d', date) AS year_month_date,
          *
        FROM `bigquery-public-data.chicago_crime.crime`
        WHERE date IS NOT NULL) AS crime_with_date
  WHERE year_month = Month
)
WHERE
  date BETWEEN '2004-01-01 00:00:01' AND '2018-01-01 23:59:59'

One of our intuitions during feature exploration was that perhaps it matters much less how "trending" a word is; we only care whether the feature is rising, or whether it is "above average" relative to its baseline search rate.

So we wanted to use quantiling as a feature engineering function – it can be done in BigQuery with NTILE(N) OVER (ORDER BY feature_name) AS feature_name in the SELECT statement.
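
For instance, to bucket one trend score into quintiles on its own (reusing the trends table from earlier):

%%bigquery --project $project_id

SELECT
  Month,
  smoke,
  NTILE(5) OVER (ORDER BY smoke) AS smoke_quintile  # 1 = lowest scores, 5 = highest
FROM `your-project-id.crime_search_trends.chicago_narcotics`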

Popularity trend varies across different words

You can also get the training info that typically gets printed out during training in a notebook by using the ML.TRAINING_INFO function. It is useful for hyperparameter tuning.

%%bigquery --project $project_id

SELECT *
FROM ML.TRAINING_INFO(MODEL `crime_search_trends.narcotics_log_regression_3_terms_trend_quantile`)

Based on our investigation, quantiling decreases the F1 score but increases accuracy, precision, or recall. We think this could be because many of our terms are highly correlated in their semantic meaning, which reduces the "overall performance" of the machine learning model, since even a small change in the dataset distribution can greatly impact performance.
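
These metrics can be pulled the same way as the training info, via BigQuery ML's ML.EVALUATE function (a sketch, reusing the model above):

%%bigquery --project $project_id

SELECT *
FROM ML.EVALUATE(MODEL `crime_search_trends.narcotics_log_regression_3_terms_trend_quantile`)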

A common way to fix this issue is to run Ordinary Least Squares (OLS) models on the dataset and iteratively identify the redundant features with high multicollinearity (i.e., the feature(s) the OLS model assigns very little weight in a given iteration), until a subset of features we think is good remains. Then feed that subset to your other model of choice with higher complexity (e.g., logistic regression, an MLP, or other neural models).
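
A minimal sketch of that pruning loop, under the assumption that features are standardized so their weights are comparable (function and column names are ours):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def prune_features(df: pd.DataFrame, features: list, target: str, keep: int = 5):
    features = list(features)
    while len(features) > keep:
        X = StandardScaler().fit_transform(df[features])  # comparable weights
        coefs = LinearRegression().fit(X, df[target]).coef_
        weakest = features[int(np.argmin(np.abs(coefs)))]
        features.remove(weakest)  # near-zero weight hints at redundancy
    return features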

⚖️ A Discussion on Fairness in AI

Let’s imagine a more cynical example of what we did in this study.

Suppose we are building a time-delayed prediction algorithm for crime prediction, and suppose we are an (unintentionally) under-informed data science team trying to build a model with a short deadline. We build it fast, it has great metrics, so we throw it over the wall to the DevOps team.

Since we don't know which words are useful, suppose we built an unsupervised machine learning model that finds "highly predictive" words in Google Trends and uses them as features. Over time, the model picks up on words that are disproportionately used by some marginalized groups, and now the model is biased in the worst way possible.

In the case of crime prediction, this model would be highly problematic and have detrimental effects on real people if used for the wrong purposes.

We have also shown in this study that our opinions on what may be "good features," and even our intuitions about why a feature may be good, can be flawed or simply confirmation bias, as with the negative correlation of the "pot" and "weed" keywords.

This is why it is important to have clear communication about "ethical AI" between data science teams. Others may not be well informed about how AI works behind the scenes, and it's therefore our responsibility as engineers and data scientists to ensure that our models are not biased and harmful.

