Choosing between TensorFlow/Keras, BigQuery ML, and AutoML Natural Language for text classification

Comparing text classification done three ways on Google Cloud Platform

Lak Lakshmanan
Towards Data Science


Google Cloud Platform offers you three¹ ways to carry out machine learning:

  • Keras with a TensorFlow backend to build custom, deep learning models that are trained on Cloud ML Engine
  • BigQuery ML to build custom ML models on structured data using just SQL
  • Auto ML to train state-of-the-art deep learning models on your data without writing any code

Choose between them based on your skill set, how important additional accuracy is, and how much time/effort you are willing to devote to the problem. Use BigQuery ML for quick problem formulation, experimentation, and easy, low-cost machine learning. Once you identify a viable ML problem using BQML, use Auto ML for code-free, state-of-the-art models. Hand-roll your own custom models only for problems where you have lots of data and enough time/effort to devote.

Choosing the ML method that is right for you depends on how much time and effort you are willing to put in, what kind of accuracy you need, and what your skillset is.

In this article, I will compare the three approaches on a text classification problem so that you can see why I’m recommending what I am recommending.

1. CNN + Embedding + Dropout in Keras

I explain the problem and the deep learning solution in detail elsewhere, so this section will be very brief.

The task: given the title of an article, identify where it was published. The training dataset comes from articles posted on Hacker News (there’s a public dataset of these in BigQuery). For example, here are some of the titles whose source is GitHub:

Training dataset

Here is the code to create a Keras model that uses a word embedding layer, convolutional layers, and dropout:

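import tensorflow as tf
from tensorflow.keras import models
from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                     GlobalAveragePooling1D, MaxPooling1D)

# word_index, TOP_K, MAX_SEQUENCE_LENGTH, CLASSES, the hyperparameters,
# model_dir, and config are defined elsewhere in the trainer.
model = models.Sequential()
# Vocabulary size (+1, assuming index 0 is reserved for padding), capped at TOP_K.
num_features = min(len(word_index) + 1, TOP_K)
model.add(Embedding(input_dim=num_features,
                    output_dim=embedding_dim,
                    input_length=MAX_SEQUENCE_LENGTH))
model.add(Dropout(rate=dropout_rate))
model.add(Conv1D(filters=filters,
                 kernel_size=kernel_size,
                 activation='relu',
                 bias_initializer='random_uniform',
                 padding='same'))
model.add(MaxPooling1D(pool_size=pool_size))
model.add(Conv1D(filters=filters * 2,
                 kernel_size=kernel_size,
                 activation='relu',
                 bias_initializer='random_uniform',
                 padding='same'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(rate=dropout_rate))
# One output probability per publication source.
model.add(Dense(len(CLASSES), activation='softmax'))
# Compile model with learning parameters.
optimizer = tf.keras.optimizers.Adam(lr=learning_rate)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['acc'])
# Wrap the Keras model as an Estimator for training.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model, model_dir=model_dir, config=config)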
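
The embedding layer expects each title as a fixed-length sequence of integer word ids. As a rough sketch of that preprocessing step (this is not the trainer's actual code), the plain-Python version below builds a vocabulary and pads each title. The names word_index, TOP_K, and MAX_SEQUENCE_LENGTH mirror the model code; the helper functions, the constant values, the sample titles, and the choice to pad at the end with 0 are all assumptions made for illustration.

```python
from collections import Counter

TOP_K = 20000             # assumed cap on the vocabulary size
MAX_SEQUENCE_LENGTH = 8   # assumed; real titles would need a larger value


def build_word_index(titles, top_k=TOP_K):
    """Map the most frequent words to ids starting at 1 (0 is kept for padding)."""
    counts = Counter(word for title in titles for word in title.lower().split())
    most_common = counts.most_common(top_k - 1)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}


def title_to_sequence(title, word_index, max_len=MAX_SEQUENCE_LENGTH):
    """Turn a title into exactly max_len word ids, dropping unknown words."""
    ids = [word_index[w] for w in title.lower().split() if w in word_index]
    ids = ids[:max_len]                       # truncate long titles
    return ids + [0] * (max_len - len(ids))   # pad short titles with 0


# Hypothetical titles, for illustration only.
titles = ["show hn a tiny python web server", "a python library for sql"]
word_index = build_word_index(titles)
num_features = min(len(word_index) + 1, TOP_K)   # same formula as in the model
sequence = title_to_sequence("a python web framework", word_index)
```

Here "framework" is not in the vocabulary, so it is simply dropped and the remaining three ids are followed by zeros.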

The model is then trained on Cloud ML Engine as shown in this Jupyter notebook:

gcloud ml-engine jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=${PWD}/txtclsmodel/trainer \
  --job-dir=$OUTDIR \
  --scale-tier=BASIC_GPU \
  --runtime-version=$TFVERSION \
  -- \
  --output_dir=$OUTDIR \
  --train_data_path=gs://${BUCKET}/txtcls/train.tsv \
  --eval_data_path=gs://${BUCKET}/txtcls/eval.tsv \
  --num_epochs=5

It took me a couple of days to develop the original TensorFlow model, my colleague vijaykr a day to modify it to use Keras, and maybe a day to train it and troubleshoot it.

We got about 80% accuracy. To do better, we’d probably need a lot more data (92k examples is insufficient to gain the benefits of using a custom deep learning model) and perhaps incorporate more preprocessing (such as removing stop words, stemming words, using a reusable embedding, etc.).

2. BigQuery ML for text classification

Convolutional neural networks, embeddings, etc. are not (yet, anyway) an option in BigQuery ML, so I dropped down to using a linear model on a bag-of-words. The point of BigQuery ML is to provide a quick, convenient way to build ML models on structured and semi-structured data.

I split the titles word by word and trained a logistic regression model (i.e., a linear classifier) on the first 5 words of the title (using more words doesn’t help all that much):

#standardsql
CREATE OR REPLACE MODEL advdata.txtclass
OPTIONS(model_type='logistic_reg', input_label_cols=['source'])
AS
WITH extracted AS (
  -- Replace special characters with spaces, lowercase, and collapse runs of spaces.
  SELECT source, REGEXP_REPLACE(LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')), r' +', ' ') AS title FROM
    (SELECT
       ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
       title
     FROM
       `bigquery-public-data.hacker_news.stories`
     WHERE
       REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
       AND LENGTH(title) > 10
    )
)
, ds AS (
  -- Pad with 'NULL' so that titles shorter than five words still yield five entries.
  SELECT ARRAY_CONCAT(SPLIT(title, " "), ['NULL', 'NULL', 'NULL', 'NULL', 'NULL']) AS words, source FROM extracted
  WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
)
SELECT
  source,
  words[OFFSET(0)] AS word1,
  words[OFFSET(1)] AS word2,
  words[OFFSET(2)] AS word3,
  words[OFFSET(3)] AS word4,
  words[OFFSET(4)] AS word5
FROM ds
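
If you want to build the same five-word features outside BigQuery (say, in a web application), the steps in the query can be mirrored in a few lines of Python. This is only a sketch of my reading of the SQL above: the function name title_to_features is made up, and it assumes the intent of the inner REGEXP_REPLACE calls is to replace special characters with spaces and then collapse runs of spaces.

```python
import re


def title_to_features(title, num_words=5):
    """Return word1..word5 for a title, padded with 'NULL' like the SQL query."""
    # Replace anything outside the allowed character set with a space, then lowercase.
    cleaned = re.sub(r"[^a-zA-Z0-9 $.-]", " ", title).lower()
    # Collapse runs of spaces so that splitting does not yield empty words mid-title.
    cleaned = re.sub(r" +", " ", cleaned)
    # Like the SQL, this does not trim, so a leading space yields an empty first word.
    words = cleaned.split(" ") + ["NULL"] * num_words
    return {"word{}".format(i + 1): words[i] for i in range(num_words)}


features = title_to_features("Government shutdown leaves workers reeling")
short = title_to_features("Show HN: Foo")
```

The second call shows the padding: a three-word title gets 'NULL' for word4 and word5.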

This was fast. The SQL query above is the full enchilada. There is nothing more to it. The model training itself took only a few minutes. I got 78% accuracy which compares quite favorably to the 80% I got with the custom Keras CNN model.

Once the model is trained, batch predictions using BigQuery are easy:

SELECT * FROM ML.PREDICT(MODEL advdata.txtclass, (
  SELECT 'government' AS word1, 'shutdown' AS word2, 'leaves' AS word3, 'workers' AS word4, 'reeling' AS word5)
)

BigQuery ML identifies the New York Times as the most likely source of an article that starts with the words “Government shutdown leaves workers reeling”.

Online predictions using BigQuery can be accomplished by exporting the weights into a web application.
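
To make the idea of exporting weights concrete, here is a minimal sketch of what the web application side could look like. It assumes the model is a multinomial logistic regression over one-hot encoded words, so that a prediction is a softmax over per-class sums of weights, and that a word never seen in training contributes nothing. The layout of the INTERCEPTS and WORD_WEIGHTS dictionaries is hypothetical, and every number in them is invented for illustration; none of them come from the trained model.

```python
import math

# Invented weights for illustration only; real values would be exported from the model.
INTERCEPTS = {"github": -0.2, "nytimes": 0.1, "techcrunch": 0.0}
WORD_WEIGHTS = {
    "github": {("word1", "show"): 1.1, ("word2", "hn"): 0.9},
    "nytimes": {("word1", "government"): 1.4, ("word2", "shutdown"): 0.8},
    "techcrunch": {("word1", "startup"): 1.2},
}


def predict(features):
    """Return a probability per source for a dict of word1..word5 features."""
    scores = {}
    for label, intercept in INTERCEPTS.items():
        score = intercept
        for column, word in features.items():
            # Unknown (column, word) pairs are assumed to have zero weight.
            score += WORD_WEIGHTS[label].get((column, word), 0.0)
        scores[label] = score
    # Softmax, shifted by the largest score for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


probs = predict({"word1": "government", "word2": "shutdown", "word3": "leaves",
                 "word4": "workers", "word5": "reeling"})
best = max(probs, key=probs.get)
```

Because the model is linear, serving it this way needs nothing more than a lookup table and a few additions per request.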

3. AutoML

The third option I tried is the code-free option that, nevertheless, uses state-of-the-art models and techniques underneath. Because this is a text classification problem, the Auto ML approach to use is Auto ML Natural Language.

3a. Launch AutoML Natural Language

The first step is to launch Auto ML Natural Language from the GCP web console:

Launch AutoML Natural Language from the GCP web console

Follow the prompts and a bucket will be created to hold the dataset that you will use to train the model.

3b. Create CSV file and have it available on Google Cloud Storage

Where BigQuery ML requires you to know SQL, AutoML just requires that you create a dataset in one of the formats the tool understands. The tool understands CSV files arranged as follows:

text, label

The text itself can either be a URL to a file containing the actual text (this is useful if you have multi-line text, such as reviews or entire documents) or it can be the plain text item itself. If you are providing the text item string directly, you need to put it in quotes.
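
If you are producing such a file from your own code rather than from BigQuery, the format is simple enough to write by hand. The sketch below quotes the text field and leaves the label bare; the function names and sample rows are made up, and doubling any embedded quote characters is the usual CSV convention rather than something I have verified against the tool. Note that it writes no header line.

```python
import io


def automl_csv_line(text, label):
    """Format one training example as: "text",label"""
    escaped = text.replace('"', '""')   # usual CSV escaping for embedded quotes
    return '"{}",{}'.format(escaped, label)


def write_automl_csv(rows, out):
    """Write (text, label) pairs to a file-like object, one per line, no header."""
    for text, label in rows:
        out.write(automl_csv_line(text, label) + "\n")


# Hypothetical rows, for illustration only.
rows = [
    ("a tiny python web server", "github"),
    ("government shutdown leaves workers reeling", "nytimes"),
]
buffer = io.StringIO()
write_automl_csv(rows, buffer)
csv_text = buffer.getvalue()
```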

So, our first step is to export a CSV file from BigQuery in the right format. This was my query²:

WITH extracted AS (
  SELECT STRING_AGG(source, ',') AS source, title FROM
    (SELECT DISTINCT source, TRIM(LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' '))) AS title FROM
      (SELECT
         ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
         title
       FROM
         `bigquery-public-data.hacker_news.stories`
       WHERE
         REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
         AND LENGTH(title) > 10
      )
    )
  GROUP BY title
)
SELECT title, source FROM extracted
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')

This yields the following dataset:

Dataset for Auto ML

Note that I have stripped out punctuation and special characters. Whitespace has been trimmed, and SELECT DISTINCT, together with the GROUP BY on title, is used to discard duplicates and articles that appear in multiple classes (AutoML will warn you about duplicates, and can deal with items that carry multiple labels, but removing them is cleaner).
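
The deduplication logic is easier to see outside SQL. As a sketch, the Python below does the same thing to a list of (source, title) pairs: clean each title, collect the set of sources it appears under, and keep it only if that set contains exactly one of the three classes. The function names and sample pairs are made up for illustration, and the length filter from the query is left out.

```python
import re
from collections import defaultdict

CLASSES = ("github", "nytimes", "techcrunch")


def clean_title(title):
    """Replace special characters with spaces, lowercase, and trim."""
    return re.sub(r"[^a-zA-Z0-9 $.-]", " ", title).lower().strip()


def single_source_titles(pairs, classes=CLASSES):
    """Keep each cleaned title once, and only if it has exactly one source."""
    sources_by_title = defaultdict(set)
    for source, title in pairs:
        sources_by_title[clean_title(title)].add(source)
    kept = []
    for title, sources in sources_by_title.items():
        if len(sources) == 1:
            (source,) = sources
            if source in classes:
                kept.append((title, source))
    return kept


# Hypothetical rows: one duplicate, one cross-posted title, one out-of-scope source.
pairs = [
    ("github", "A tiny Python web server"),
    ("github", "A tiny Python web server"),
    ("nytimes", "Government shutdown leaves workers reeling"),
    ("techcrunch", "Government shutdown leaves workers reeling"),
    ("medium", "Why I left my job"),
]
dataset = single_source_titles(pairs)
```

Only the GitHub title survives: the duplicate collapses into one row, the cross-posted headline is dropped, and the last row is not from one of the three sources.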

I saved the result of the query as a table using the BigQuery UI:

Save the query results as a table

and then exported the table to a CSV file:

Export the table data to the Auto ML bucket

3c. Create Auto ML dataset

The next step is to use the Auto ML UI to create a dataset from the CSV file on Cloud Storage:

Create a dataset from the CSV file on Cloud Storage

The dataset takes about 20 minutes to ingest. At the end, we get a screen full of text items:

The dataset after loading

The current Auto ML limit is 100k rows, so our 92k dataset is definitely pushing some boundaries. A smaller dataset will get ingested faster.

Why do we have a label called “source” with only one example? The CSV file had a header line (title, source), and that too has been ingested! Fortunately, AutoML allows us to edit the text items in the GUI itself. So, I deleted the extra label and its corresponding text.

3d. Train

Training is as easy as clicking on a button.

Auto ML then proceeds to try various embeddings and various architectures, and does hyperparameter tuning to come up with a good solution to the problem.

Training took 5 hours.

3e. Evaluation

Once the model is trained, we get a bunch of evaluation statistics: precision, recall, AUC, etc. But we also get the actual confusion matrix, from which we can compute anything else we want.
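
As an example of computing things from a confusion matrix, the sketch below derives overall accuracy and per-class precision and recall. It assumes rows are true labels and columns are predicted labels. The counts are invented for illustration and are not the numbers from my AutoML model.

```python
CLASSES = ["github", "nytimes", "techcrunch"]

# Invented counts: CONFUSION[i][j] = examples of class i predicted as class j.
CONFUSION = [
    [90, 3, 7],
    [4, 80, 16],
    [6, 14, 80],
]


def accuracy(matrix):
    """Fraction of all examples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision_recall(matrix, k):
    """Precision and recall for class index k."""
    true_positives = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)   # column sum
    actually_k = sum(matrix[k])                      # row sum
    return true_positives / predicted_as_k, true_positives / actually_k


overall = accuracy(CONFUSION)
per_class = {name: precision_recall(CONFUSION, k) for k, name in enumerate(CLASSES)}
```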

The overall accuracy is about 86%, higher even than our custom Keras CNN model. Why? Because Auto ML is able to take advantage of transfer learning from models built on Google datasets on language use, i.e., it includes data that we did not have available to our Keras model. Also, because of the availability of all that data to transfer learn from, the model architecture can be more complex (read: deeper).

3f. Prediction

The trained AutoML model is already deployed and available for prediction. We can send it a request and get back the predicted source of the article:

Predictions from Auto ML

Notice that the model is much more confident than the BQML one (although both gave the same correct answer), a confidence driven by the fact that this Auto ML model was trained on more data and is built specifically for text classification problems.

I tried another article title from today’s headlines and the model nailed it as being from TechCrunch:

Correctly identifies the title as being from a TechCrunch article.

Summary

While this article is primarily about text classification, the general conclusions and advice carry over to most ML problems:

  • Use BigQuery ML for easy, low-cost machine learning and quick experimentation to see if ML is viable on your data. Sometimes, the accuracy you get with BQML is sufficient, and you will simply stop here.
  • Once you identify a viable ML problem using BQML, use Auto ML for code-free, state-of-the-art models. Text classification, for example, is a very specialized field with high-dimensional inputs. So, you can do better with a customized solution (i.e., Auto ML Natural Language) than with a structured data approach that just uses bag-of-words.
  • Hand-roll your own custom models only for problems where you have lots of data and enough time/effort to devote. Use AutoML as a benchmark. If you cannot beat Auto ML after some reasonable effort, stop wasting time. Just go with Auto ML.

¹ There are a few other ways to do machine learning on GCP. You can do xgboost or scikit-learn in ML Engine. The Deep Learning VM supports PyTorch. Spark ML works well on Cloud Dataproc. And of course, you can use Google Compute Engine or Google Kubernetes Engine and install any ML framework you want. But in this article, I’ll focus on these three.

² Thanks to Greg Mikels for improving my original AutoML query to remove duplicates and cross-posted articles.
