How to do text similarity search and document clustering in BigQuery

Use document embeddings in BigQuery for document similarity and clustering tasks

Lak Lakshmanan
Towards Data Science


BigQuery offers the ability to load a TensorFlow SavedModel and carry out predictions. This capability is a great way to add text-based similarity and clustering on top of your data warehouse.

Follow along by copy-pasting queries from my notebook in GitHub. You can try out the queries in the BigQuery console or in an AI Platform Jupyter notebook.

Text embeddings are useful for document similarity and clustering tasks. Image by kerttu from Pixabay

Storm reports data

As an example, I’ll use a dataset consisting of wind reports phoned into National Weather Service offices by “storm spotters”. This is a public dataset in BigQuery and it can be queried as follows:

SELECT 
EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
ST_GeogPoint(longitude, latitude) AS location,
comments
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
LIMIT 10

The result looks like this:

Let’s say that we want to build a SQL query to search for comments that look like “power line down on a home”.

Steps:

  • Load a machine learning model that creates an embedding (essentially a compact numerical representation) of some text.
  • Use the model to generate the embedding of our search term.
  • Use the model to generate the embedding of every comment in the wind reports table.
  • Look for rows where the two embeddings are close to each other.

Loading a text embedding model into BigQuery

TensorFlow Hub has a number of text embedding models. For best results, use a model that was trained on data similar to your dataset and that has enough dimensions to capture the nuances of your text.

For this demonstration, I’ll use the Swivel embedding which was trained on Google News and has 20 dimensions (i.e., it is pretty coarse). This is sufficient for what we need to do.

The Swivel embedding layer is already available in TensorFlow SavedModel format, so we simply need to download it, extract it from the tarred, gzipped file, and upload it to Google Cloud Storage:

FILE=swivel.tar.gz
mkdir -p tmp
# Download the compressed SavedModel from TensorFlow Hub.
wget --quiet -O tmp/${FILE} 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1?tf-hub-format=compressed'
cd tmp
tar xvfz ${FILE}
cd ..
mv tmp swivel
# BUCKET should be set to the name of your GCS bucket.
gsutil -m cp -R swivel gs://${BUCKET}/swivel
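
It is worth verifying that the upload worked before proceeding; a SavedModel directory should contain a saved_model.pb file and a variables/ folder (a quick check, assuming the same BUCKET variable):

# List the uploaded files; expect saved_model.pb and variables/.
gsutil ls -r gs://${BUCKET}/swivel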

Once the model files are on GCS, we can load the model into BigQuery as an ML model (replace BUCKET with the name of your bucket):

CREATE OR REPLACE MODEL advdata.swivel_text_embed
OPTIONS(model_type='tensorflow', model_path='gs://BUCKET/swivel/*')

Trying out the embedding model in BigQuery

To try out the model in BigQuery, we need to know its input and output schema. These would be the names of the Keras layers when it was exported. We can get them by going to the BigQuery console and viewing the “Schema” tab of the model:
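
If you prefer the command line to the console, the bq tool reports the same input and output columns (this assumes the bq CLI is installed and configured for your project):

# Show model metadata, including the input and output feature columns.
bq show --model advdata.swivel_text_embed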

Let’s try this model out by getting the embedding for a famous August speech, passing the input text in a column named sentences and knowing that we will get an output column named output_0:

SELECT output_0 FROM
ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT "Long years ago, we made a tryst with destiny; and now the time comes when we shall redeem our pledge, not wholly or in full measure, but very substantially." AS sentences))

The result has 20 numbers as expected, the first few of which are shown below:

Document similarity search

Define a function to compute the Euclidean distance between a pair of embeddings. The helper td returns the squared difference between the two embeddings along a single dimension; term_distance sums those squared differences over all 20 dimensions and takes the square root:

CREATE TEMPORARY FUNCTION td(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>, idx INT64) AS (
(a[OFFSET(idx)] - b[OFFSET(idx)]) * (a[OFFSET(idx)] - b[OFFSET(idx)])
);
CREATE TEMPORARY FUNCTION term_distance(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>) AS ((
SELECT SQRT(SUM( td(a, b, idx))) FROM UNNEST(GENERATE_ARRAY(0, 19)) idx
));
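
Because these are temporary functions, they must be defined in the same query script as any statement that uses them. As a quick sanity check (an illustrative query, not from the original article), the distance between an embedding and itself should be zero:

-- The distance between an embedding and itself should be 0.
SELECT term_distance(output_0, output_0) AS self_distance
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,
(SELECT "power line down on a home" AS sentences))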

Then, compute the embedding for our search term:

WITH search_term AS (
SELECT output_0 AS term_embedding FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(SELECT "power line down on a home" AS sentences))
)

and compute the distance between each comment’s embedding and the term_embedding of the search term (the WITH clause above and this SELECT form a single query):

SELECT
term_distance(term_embedding, output_0) AS termdist,
comments
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT comments, LOWER(comments) AS sentences
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
)), search_term
ORDER BY termdist ASC
LIMIT 10

The result is:

Remember that we searched for “power line down on a home”. Note that the top two results are “power line down on house”: the text embedding has been helpful in recognizing that home and house are similar in this context. The next set of top matches all mention power lines, the most distinctive pair of words in our search term.

Document clustering

Document clustering involves using the embeddings as an input to a clustering algorithm such as K-Means. We can do this in BigQuery itself, and to make things a bit more interesting, we’ll use the location and day-of-year as additional inputs to the clustering algorithm.

CREATE OR REPLACE MODEL advdata.storm_reports_clustering
OPTIONS(model_type='kmeans', NUM_CLUSTERS=10) AS
SELECT
arr_to_input_20(output_0) AS comments_embed,
EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
))

The embedding (output_0) is an array, but BigQuery ML currently wants named inputs. The workaround is to convert the array to a struct with a temporary function, defined in the same query script ahead of the CREATE MODEL statement:

CREATE TEMPORARY FUNCTION arr_to_input_20(arr ARRAY<FLOAT64>)
RETURNS
STRUCT<p1 FLOAT64, p2 FLOAT64, p3 FLOAT64, p4 FLOAT64,
p5 FLOAT64, p6 FLOAT64, p7 FLOAT64, p8 FLOAT64,
p9 FLOAT64, p10 FLOAT64, p11 FLOAT64, p12 FLOAT64,
p13 FLOAT64, p14 FLOAT64, p15 FLOAT64, p16 FLOAT64,
p17 FLOAT64, p18 FLOAT64, p19 FLOAT64, p20 FLOAT64>
AS (
STRUCT(
arr[OFFSET(0)]
, arr[OFFSET(1)]
, arr[OFFSET(2)]
, arr[OFFSET(3)]
, arr[OFFSET(4)]
, arr[OFFSET(5)]
, arr[OFFSET(6)]
, arr[OFFSET(7)]
, arr[OFFSET(8)]
, arr[OFFSET(9)]
, arr[OFFSET(10)]
, arr[OFFSET(11)]
, arr[OFFSET(12)]
, arr[OFFSET(13)]
, arr[OFFSET(14)]
, arr[OFFSET(15)]
, arr[OFFSET(16)]
, arr[OFFSET(17)]
, arr[OFFSET(18)]
, arr[OFFSET(19)]
));
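
A quick way to spot-check the conversion (again an illustrative query, using a made-up storm-report sentence; temporary functions are scoped to a single query script) is to look at a couple of the struct fields:

SELECT embed.p1, embed.p20
FROM (
SELECT arr_to_input_20(output_0) AS embed
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,
(SELECT "trees down across the county" AS sentences))
)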

The resulting ten clusters can be visualized in the BigQuery console:
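
You can also inspect the centroids directly in SQL with BigQuery’s ML.CENTROIDS table function (a quick sketch; it returns one row per centroid and input feature):

SELECT centroid_id, feature, numerical_value
FROM ML.CENTROIDS(MODEL advdata.storm_reports_clustering)
ORDER BY centroid_id, feature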

What do the comments in cluster #1 look like? The query is:

SELECT sentences 
FROM ML.PREDICT(MODEL `ai-analytics-solutions.advdata.storm_reports_clustering`,
(
SELECT
sentences,
arr_to_input_20(output_0) AS comments_embed,
EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
))))
WHERE centroid_id = 1

The result shows that these are mostly short, uninformative comments:

How about cluster #3? Most of these reports seem to have something to do with verification by radar!
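
To get a quick sense of how the reports distribute across all ten clusters, a simple aggregation over the same prediction query works (a sketch reusing the nested query above; remember to define arr_to_input_20 in the same script):

SELECT centroid_id, COUNT(*) AS num_reports
FROM ML.PREDICT(MODEL `ai-analytics-solutions.advdata.storm_reports_clustering`,
(
SELECT
sentences,
arr_to_input_20(output_0) AS comments_embed,
EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
))))
GROUP BY centroid_id
ORDER BY centroid_id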

Enjoy!

Links

TensorFlow Hub has several text embedding models. You don’t have to use Swivel, although Swivel is a good general-purpose choice.

Full queries are in my notebook on GitHub. You can try out the queries in the BigQuery console or in an AI Platform Jupyter notebook.
