Photo credit: Phil Goerdt

Whisked Away by BigQuery ML

Predicting Whiskey Preferences in GCP

Phil Goerdt
Towards Data Science
10 min read · Jan 10, 2019


“I’m looking for…

…anything that is peaty, smoky like a campfire on the beach, medium to high alcohol, dried fruits, citrus, and has a subtle hint of terra firma.”

It’s easy for me to walk into my local liquor store and talk tasting notes (and joke) with the whiskey guy or gal on staff, and he or she will know what I’m looking for and make a recommendation. But can a machine learning model do the same on a minimal data set and with limited inputs? That’s the question at hand.

(For those of you joining in now, thanks for coming along on this part of the story! I recommend you take a look at the other blogs in this series, but if you don’t, the summary is I have a bunch of whiskey data with my tasting preferences. Let’s see what we can learn from it.)

BigQuery ML in a nutshell

For this exercise I’ll be using Google Cloud Platform’s BigQuery ML engine. If you hadn’t guessed, this is a tool that lets you create ML models and run them right in BigQuery.

BQML is an incredible advantage for anyone who has tons of data in BigQuery (or anywhere that can easily transfer data to BQ), because the data can stay put and there are no data frames to define. Not only that, we can manage and run our models in BigQuery as well; this means no more ML-specific compute virtual (or, I suppose, physical) machines to provision. And finally, we can create these models using specialized SQL commands available in BQ. This isn’t like other SQL plugins that require tons of user-defined functions, classes and other fun things that keep us from developing real content; it works right out of the box. How cool is that?!

And you chose that why…?

Why should we care about these features? Shouldn’t I try using “a real ML framework” like TensorFlow? Well, yes… maybe. I could have done this exercise in TensorFlow. But there are a few reasons I didn’t go down that path.

For one, I didn’t want to write another “how to do your first whatever with TensorFlow” blog. That’s been blogged to death already. Also, someone like Sara Robinson is going to do a much better job of producing cool, avant-garde TensorFlow content than I ever will. BigQuery ML, here we come!

Another reason is that in previous blogs I have talked about my love of simplicity when designing systems that work together. I like things to have as few moving parts and failure points as possible, and to be as automated as possible from the outset. BigQuery ML is simple, it’s cheap for me to experiment with, and it fits my current tech stack. Using it was a no-brainer for me.

Finally, my requirements didn’t call for anything with high horsepower. My training set is just over 100 rows (all the whiskies I tried in 2018), and I’m doing pretty simple regressions for prediction. Because of that, I think the risk of overfitting with a more robust, highly engineered solution was high. (Comment and discuss below, if you’d like. If there is a lot of response to this, maybe I’ll do some TensorFlow models and compare them to BQML in a future post.)

Regardless of my reasons for choosing BigQueryML, let’s get down to the business of using it.

Big queries, bigger ML?

As I said above, using BigQueryML is pretty dang easy, and mostly this is due to the feature set of the product. However, just because it’s easy to use doesn’t mean that it’s without nuance. Let’s create an example model based on my data, evaluate it, and have a little chat about it.

I’ll have what he’s having

In some earlier blogs in this series, I provided a quick rundown of what the whiskey rating data looked like. Here’s a quick reminder:

Mmm… data tartare.

Above are the raw, unweighted scores for two whiskies, along with a few dimensional attributes. Think that’s enough to predict against? The jury is still out on that one. If we think about how most whiskey drinkers typically start the conversation about what they drink, it is usually by naming a style, such as “I’m a bourbon guy” or “I only drink single malts”. Sometimes maturity comes into play, and oftentimes price dictates what we buy or can afford. These are usually good indicators by themselves. But will the model be able to learn beyond that? Do we have enough to detect nuance?

I guess we’ll have to see what the model thinks.

Modeling like a pro

Now that we’re all up to speed on the training data set, let’s take a look at how to actually model stuff in BigQueryML.

If you are familiar with SQL, you may have guessed creating a model is similar to any DDL statement. Using the create model statement, we can create some models that run and train right in BigQuery. The GCP documentation does an excellent job of reviewing all of the available parameters, so check there if you want to read up on things I’m not using in this example.

--hi! i'm a (very) simple model for predicting a total score!
create or replace model `sample_schema.predict_total_test`
options
( model_type='linear_reg', /*this is the model type*/
l1_reg=0, /*how much L1 regularization should we use?*/
max_iterations=5, /*how many times should we train?*/
input_label_cols=['label'], /*this is the column to predict*/
data_split_method='seq', /*how do we want to split the data?*/
data_split_eval_fraction=0.3, /*how much should we split?*/
data_split_col='date_sampled' /*which column should we split on?*/
) as
/*this is a query to collect data to use in the model.
all of these columns must be present later when we run the model.*/
select
f.rating_name as full_name
, f.distillery as distillery
, f.date_sampled as date_sampled
, f.total_score as label
from
`sample_schema.w_ratings_f` f
where 1 = 1
;

It seems simple enough, but let me break down what is going on in there.

First, we have the create or replace model statement followed by the model name. Next comes the options clause, which allows us to parameterize our newly minted model. As I said before, there are lots of fun levers to pull and knobs to twist in here, and you can read more about those in the documentation. This example has just a couple of options in it to whet your appetite. Let’s take a look at what they are.

  • model_type: This is the type of model I’m creating. I’m using the recommended linear regression model, which is used for forecasting and prediction.
  • l1_reg: This controls L1 regularization, which penalizes input weights and is important for reducing noise and dropping irrelevant features in sparse, high-dimensional datasets. Since this model is so simple, a 0 here is fine, although if we give the model higher dimensionality we may want to tweak this parameter. (When would this become relevant? Let’s pretend I add tasting notes to the data set, and that only some of the samples have them. Because the notes would be sparse (not present on every row) and varied (free text that differs from whiskey to whiskey), L1 regularization would help push the weights of uninformative note features down to 0.)
  • max_iterations: This is how many steps the model will take to train. I’m paranoid about overfitting since this is such a small dataset, which is why I have lowered the max to 5 from the default of 20. (If you want to see what training actually did, see the sketch after this list.)
  • input_label_cols: This identifies the column(s) that are going to be predicted. Simply put, features are inputs and labels are outputs. Here I am trying to predict the total score.
  • data_split_method: How should we split up the data for training? The options include random and seq.
  • data_split_eval_fraction: This controls how much of the data is split off for evaluation. The default is 0.2, but since this data set is so small, I’d prefer a little more. Hence the choice of 0.3.
  • data_split_col: This identifies the column on which to split the data. It is not included in the feature set.
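
On the max_iterations point above, BigQuery ML lets you inspect the iteration history of a trained model with ml.training_info. Here is a minimal sketch against the example model, with the column list trimmed to the ones I find useful:

--peek at how training progressed, iteration by iteration
select
  iteration
  , loss /*training loss at this iteration*/
  , eval_loss /*loss on the held-out evaluation split*/
  , duration_ms /*how long the iteration took*/
from
  ml.training_info(model `sample_schema.predict_total_test`)
order by
  iteration
;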

Moving further down the SQL statement, you’ll notice that there is a select statement that serves the data to the model. This is pretty self-explanatory, and not having to define data frames, features and all of the other fun stuff that usually comes along with ML models makes BigQuery ML very appealing for someone in my situation.

Once the model is defined, I can run the above statement and what happens next is pretty cool. BigQuery ingests the model logic, queries the data and then trains and runs the model right there. Seriously, how cool is that? The eagle-eyed (with billing admin permissions) may notice a few additional line items for GCE, because you will be billed for running these ML statements. However, you aren’t responsible for spinning these resources up or down when running them.

Now that I’ve run the model, I should evaluate it. The reason for doing that is to better understand the accuracy of the model. While we can measure exactly how accurate the model is (because, you know, math), model building and tuning are often a bit more art than science. Let’s walk through a few examples of why this is the case.

Exam time

I can check out a few important indicators of my model’s accuracy by running the following command in BigQuery:

select *
from ml.evaluate(model `sample_schema.predict_total_test`);

This will give you an output that looks something like this:

Uh… thanks?

The stats above quantify the amount of error in the model; that is, how far the model is “off” from the correct score. Since we are trying to predict the correct score each time (and we are), we want the error to be as close to zero as possible.
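
For a linear regression model, ml.evaluate reports error metrics such as the mean absolute error, mean squared error and r2 score. If you want to see where that error actually comes from, you can also run the training data back through ml.predict and compare predicted and actual scores row by row. Here is a minimal sketch reusing the example model; the output column is named predicted_label because the label column was aliased as label:

--compare predicted and actual total scores for each whiskey
select
  full_name
  , label as actual_score
  , predicted_label as predicted_score
  , abs(label - predicted_label) as absolute_error
from
  ml.predict(model `sample_schema.predict_total_test`,
    (
      select
        f.rating_name as full_name
        , f.distillery as distillery
        , f.date_sampled as date_sampled
        , f.total_score as label
      from
        `sample_schema.w_ratings_f` f
    ))
order by
  absolute_error desc
;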

Who remembers “line of best fit” from stats class?

Let’s take a quick look at this figure from Google’s ML crash course. This figure shows how a model tries to fit the data set, and clearly the one on the right is doing a better job of it. Looking at the statistical errors for our model, it looks like we could do better… right?

In theory, yes, I definitely could. What I did next was run the same model multiple times with different feature sets to determine what had an impact on the model’s error rates. You can see below that adding and removing features changed the error rates quite a bit, and none of them were as low as the base model’s.

Wow. Such stats.
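
For reference, each of those experiments is just the base model re-created with a different select list. A hedged sketch of what one variant might look like; the style and price columns are hypothetical stand-ins for whichever dimensional attributes your ratings table actually carries:

--a variant of the base model with two extra (hypothetical) features
create or replace model `sample_schema.predict_total_test_v2`
options
( model_type='linear_reg',
  l1_reg=0,
  max_iterations=5,
  input_label_cols=['label'],
  data_split_method='seq',
  data_split_eval_fraction=0.3,
  data_split_col='date_sampled'
) as
select
  f.rating_name as full_name
  , f.distillery as distillery
  , f.style as style /*hypothetical column: bourbon, single malt, etc.*/
  , f.price as price /*hypothetical column: bottle price*/
  , f.date_sampled as date_sampled
  , f.total_score as label
from
  `sample_schema.w_ratings_f` f
;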

Does that mean that we should revert to the base? In my opinion, no, and here’s why. The data set is just too small. If we look at the base model, there are two features: the whiskey’s full name and the distillery. If I were to use only those for predicting whether or not I liked a whiskey, I would probably have poor results over time. The reason the error rates go up when I add more features is that there is not enough data, and the features are not rich enough, for the model to find a pattern.

This means that even though one model is mathematically “the most correct”, that does not mean it will scale or be able to learn from new data it is exposed to. And what’s the point of that?

Back to the drawing board?

I’ve made several other models since that first one, and have experimented with them to produce much better results. Part of this process was getting more data, specifically introducing tasting notes to the model. Tuning the models’ parameters also helped. So what did the final “work in progress” turn out like?

Not bad… right?

I think the results above are OK. Obviously I still need to do some tweaking to the models, and I also think that a small data set is part of the problem. (If you want to check out all of the recommendations, feel free to head over to Damn Dram.)
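
For what it’s worth, generating those recommendations boils down to scoring whiskies I haven’t rated yet with ml.predict and sorting by the predicted score. Here is a minimal sketch, assuming a hypothetical sample_schema.w_candidates_d table of untried bottles with the same feature columns, and using the base model name as a placeholder for whichever model you end up keeping:

--rank untried whiskies by the score the model thinks I'd give them
select
  full_name
  , distillery
  , predicted_label as predicted_score
from
  ml.predict(model `sample_schema.predict_total_test`,
    (
      select
        c.full_name as full_name
        , c.distillery as distillery
      from
        `sample_schema.w_candidates_d` c /*hypothetical table of untried bottles*/
    ))
order by
  predicted_score desc
limit 10
;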

But let’s not forget the star of the show here, which is BigQuery ML. It was so incredibly easy to build, manage and test these models, and it integrated seamlessly into what I was already building. And I can’t see anything wrong with that.

Phil Goerdt is an independent consultant specializing in cloud architecture and data engineering.
