How to do text classification with CNNs, TensorFlow and word embedding

Lak Lakshmanan
Towards Data Science
9 min read · Jul 6, 2017


Suppose I gave you the title of an article “Amazing Flat version of Twitter Bootstrap” and asked you which publication that article appeared in: the New York Times, TechCrunch, or GitHub. What would be your guess? How about an article titled “Supreme Court to Hear Major Case on Partisan Districts”?

Did you guess GitHub and New York Times? Why? Words like Twitter and Major are likely to occur in any of the publications, but word sequences like Twitter Bootstrap and Supreme Court are more likely in GitHub and the New York Times respectively. Can we train a neural network to learn this?

Note: Estimators have now moved into core TensorFlow. Updated code that uses tf.estimator instead of tf.contrib.learn.estimator is now on GitHub — use the updated code as a starting point.

Creating dataset

Machine learning means learning from examples. To learn which publication is the likely source of an article given its title, we need lots of examples of article titles along with their sources. Although it suffers from severe selection bias (since only articles of interest to the nerdy membership of HN are included), the BigQuery public dataset of Hacker News articles is a reasonable source of this information.

query="""
SELECT source, REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ') AS title FROM
  (SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    title
  FROM
    `bigquery-public-data.hacker_news.stories`
  WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
  )
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""
traindf = bq.Query(query + " AND MOD(ABS(FARM_FINGERPRINT(title)),4) > 0").execute().result().to_dataframe()
evaldf = bq.Query(query + " AND MOD(ABS(FARM_FINGERPRINT(title)),4) = 0").execute().result().to_dataframe()

Essentially, I pull the URL and the title from the Hacker News stories dataset in BigQuery and separate them into training and evaluation datasets (see the Datalab notebook for the complete code). The possible labels are github, nytimes, or techcrunch. Here’s what the resulting dataset looks like:

Training dataset

I wrote the two Pandas dataframes out to CSV files (a total of 72,000 training examples approximately equally distributed between nytimes, github, and techcrunch).
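
For reference, here is a minimal sketch of that step (not the exact notebook code); it assumes the traindf and evaldf dataframes created above, and the file names are just illustrative:

# illustrative only: write each dataframe out as a two-column CSV of source,title
traindf.to_csv('train.csv', header=False, index=False, encoding='utf-8')
evaldf.to_csv('eval.csv', header=False, index=False, encoding='utf-8')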

Creating a vocabulary

My training dataset consists of the label (“source”) and a single input column (“title”). However, the title is not numeric and neural networks need numeric inputs. So, we need to convert the text input column to be numeric. How?

The simplest approach would be to one-hot encode the titles. Assuming that there are 72,000 unique titles in the dataset, we will end up with 72,000 columns. If we then train a neural network on this, the neural network will essentially have to memorize the titles — no further generalization is possible.

In order for the network to generalize, we need to convert the titles into numbers in such a way that similar titles end up with similar numbers. One way is to find the individual words in the title and map the words to unique numbers. Then, titles with words in common will have similar numbers for that part of the sequence. The set of unique words in the training dataset is called the vocabulary.
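
To make that concrete, here is a toy illustration in plain Python (not part of the actual pipeline; the words and indexes are made up):

# toy example: a hand-built vocabulary, with 0 reserved for words we have not seen
vocab = {'supreme': 1, 'court': 2, 'to': 3, 'hear': 4, 'case': 5}
title = 'supreme court case'
print([vocab.get(word, 0) for word in title.split()])  # [1, 2, 5]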

Assume that we have four titles:

lines = ['Some title',
         'A longer title',
         'An even longer title',
         'This is longer than doc length']

Because the titles vary in length, I will pad out short titles with a dummy word and truncate very long titles. That way, I deal only with titles that all have the same length.

I can create the vocabulary using the following code (this is not ideal, since the vocabulary processor stores everything in memory; for larger datasets and more sophisticated preprocessing such as incorporating stop words and case-insensitivity, tf.transform is a better solution — that’s a topic for another blog post):

import tensorflow as tf
from tensorflow.contrib import lookup
from tensorflow.python.platform import gfile

MAX_DOCUMENT_LENGTH = 5
PADWORD = 'ZYXW'

# create vocabulary
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
vocab_processor.fit(lines)
with gfile.Open('vocab.tsv', 'wb') as f:
  f.write("{}\n".format(PADWORD))
  for word, index in vocab_processor.vocabulary_._mapping.iteritems():
    f.write("{}\n".format(word))
N_WORDS = len(vocab_processor.vocabulary_)

In the code above, I pad short titles with a PADWORD that I expect will never occur in actual text. The titles will be padded or truncated to a length of 5 words. I pass in the training dataset (“lines” in the sample above) and then write out the resulting vocabulary. The vocabulary turns out to be:

ZYXW
A
even
longer
title
This
doc
is
Some
An
length
than
<UNK>

Note that I added the padword and that the vocabulary processor found all the unique words in the set of lines. Finally, words that are encountered during evaluation/prediction that were not in the training dataset will be replaced by <UNK>, so that is also part of the vocabulary.

Given the vocabulary above, we can convert any title to a set of numbers:

table = lookup.index_table_from_file(
  vocabulary_file='vocab.tsv', num_oov_buckets=1, vocab_size=None, default_value=-1)
numbers = table.lookup(tf.constant('Some title'.split()))
with tf.Session() as sess:
  tf.tables_initializer().run()
  print "{} --> {}".format(lines[0], numbers.eval())

The code above will look up the words ‘Some’ and ‘title’ and return the indexes [8, 4] based on the vocabulary. Of course, in the actual training/prediction graph, we will need to make sure to pad/truncate as well. Let’s see how to do that next.

Word processing

First, we start with the lines (each line is a title) and split the titles into words:

# string operations
titles = tf.constant(lines)
words = tf.string_split(titles)

This results in:

titles= ['Some title' 'A longer title' 'An even longer title'
'This is longer than doc length']
words= SparseTensorValue(indices=array([[0, 0],
[0, 1],
[1, 0],
[1, 1],
[1, 2],
[2, 0],
[2, 1],
[2, 2],
[2, 3],
[3, 0],
[3, 1],
[3, 2],
[3, 3],
[3, 4],
[3, 5]]), values=array(['Some', 'title', 'A', 'longer', 'title', 'An', 'even', 'longer',
'title', 'This', 'is', 'longer', 'than', 'doc', 'length'], dtype=object), dense_shape=array([4, 6]))

TensorFlow’s string_split() function ends up creating a SparseTensor. Talk about an overly helpful API. I don’t want that automatically created mapping though, so I will convert the sparse tensor to a dense one and then lookup the index from my own vocabulary:

# string operations
titles = tf.constant(lines)
words = tf.string_split(titles)
densewords = tf.sparse_tensor_to_dense(words, default_value=PADWORD)
numbers = table.lookup(densewords)

Now, the densewords and numbers are as expected (note the padding with the PADWORD):

dense= [['Some' 'title' 'ZYXW' 'ZYXW' 'ZYXW' 'ZYXW']
['A' 'longer' 'title' 'ZYXW' 'ZYXW' 'ZYXW']
['An' 'even' 'longer' 'title' 'ZYXW' 'ZYXW']
['This' 'is' 'longer' 'than' 'doc' 'length']]
numbers= [[ 8 4 0 0 0 0]
[ 1 3 4 0 0 0]
[ 9 2 3 4 0 0]
[ 5 7 3 11 6 10]]

Note also that the numbers matrix has the width of the longest title in the batch. Because this width will change with each batch that is processed, it is not ideal. For consistency, let’s pad it out to MAX_DOCUMENT_LENGTH and then truncate it:

padding = tf.constant([[0,0],[0,MAX_DOCUMENT_LENGTH]])
padded = tf.pad(numbers, padding)
sliced = tf.slice(padded, [0,0], [-1, MAX_DOCUMENT_LENGTH])

This creates a batchsize x 5 matrix where shorter titles are padded with zero:

padding= [[0 0]
[0 5]]
padded= [[ 8 4 0 0 0 0 0 0 0 0 0]
[ 1 3 4 0 0 0 0 0 0 0 0]
[ 9 2 3 4 0 0 0 0 0 0 0]
[ 5 7 3 11 6 10 0 0 0 0 0]]
sliced= [[ 8 4 0 0 0]
[ 1 3 4 0 0]
[ 9 2 3 4 0]
[ 5 7 3 11 6]]

I used a MAX_DOCUMENT_LENGTH of 5 in the examples above so that I could show you what is happening. In the real dataset, titles are longer than 5 words. So, in the real code I’ll use:

MAX_DOCUMENT_LENGTH = 20

The shape of the sliced matrix will be batchsize x MAX_DOCUMENT_LENGTH, i.e. batchsize x 20.

Embedding

Now that our words have been replaced by numbers, we could simply do one-hot encoding but that would result in an extremely wide input — there are thousands of unique words in the titles dataset. A better approach is to reduce the dimensionality of the input — this is done through an embedding layer (see full code here):

EMBEDDING_SIZE = 10
embeds = tf.contrib.layers.embed_sequence(sliced,
vocab_size=N_WORDS, embed_dim=EMBEDDING_SIZE)

Once we have the embedding, we now have a representation for each word in the title. The result of embedding is a batchsize x MAX_DOCUMENT_LENGTH x EMBEDDING_SIZE tensor because a title consists of MAX_DOCUMENT_LENGTH words, and each word is now represented by EMBEDDING_SIZE numbers. (Get into the habit of figuring out tensor shapes at each step of your TensorFlow code — this will help you understand what the code is doing, and what the dimensions mean).
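
A quick way to verify this while developing (purely a debugging sketch, assuming the graph built above) is to print the static shapes; the ? is the batch size, which is unknown until runtime:

print(sliced.shape)  # (?, 20)
print(embeds.shape)  # (?, 20, 10), i.e. (batchsize, MAX_DOCUMENT_LENGTH, EMBEDDING_SIZE)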

We could, if we wanted, simply wire the embedded words through a deep neural network, train it, and go our merry way. But just using words by themselves does not take advantage of the fact that word sequences have specific meanings. After all, “supreme” could appear in a number of contexts, but “supreme court” has a much more specific connotation. How do we learn word sequences?

Convolution

One way to learn sequences is to embed not just unique words, but also digrams (word pairs), trigrams (word triplets), etc. However, with a relatively small dataset, this starts becoming akin to one-hot encoding each unique word in the dataset.

A better approach is to add a convolutional layer. Convolution is simply a way of applying a moving window to your input data and letting the neural network learn the weights to apply to adjacent words. Although more common when working with image data, it is a handy way to help any neural network learn about the correlations between nearby inputs:

WINDOW_SIZE = EMBEDDING_SIZE
STRIDE = int(WINDOW_SIZE/2)
conv = tf.contrib.layers.conv2d(embeds, 1, WINDOW_SIZE,
stride=STRIDE, padding='SAME') # (?, 4, 1)
conv = tf.nn.relu(conv) # (?, 4, 1)
words = tf.squeeze(conv, [2]) # (?, 4)

Recall that the result of embedding is a 20 x 10 tensor (let’s disregard the batchsize for now; all operations here are carried out on a single title at a time). I am now applying a weighted average in a 10x10 window to the embedded representation of the title, moving the window by 5 words (STRIDE=5), and applying it again. So, I will have 4 such convolution results. I then apply a non-linear transform (relu) to the convolution results.
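
As a sanity check on that arithmetic: with SAME padding, the number of convolution outputs is the input length divided by the stride, rounded up (this snippet is just an illustration, not part of the model code):

import math
print(int(math.ceil(20 / 5.0)))  # MAX_DOCUMENT_LENGTH=20, STRIDE=5 --> 4 windows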

I have four results now. I can simply wire them through a dense layer to the output layer:

n_classes = len(TARGETS)     
logits = tf.contrib.layers.fully_connected(words, n_classes,
activation_fn=None)

If you are used to image models, you might be surprised that I used a convolutional layer, but no maxpool layer. The reason to use a maxpool layer is to add spatial invariance to the network — intuitively speaking, you want to find a cat regardless of where in the image the cat is. However, the spatial location within the title is quite important. It is quite possible that New York Times articles’ titles tend to start with different words than GitHub ones. Hence, I didn’t use a maxpool layer for this task.

Given the logits, we can figure out the source by essentially doing TARGETS[argmax(logits)]. In TensorFlow, this is done using tf.gather:

predictions_dict = {
  'source': tf.gather(TARGETS, tf.argmax(logits, 1)),
  'class': tf.argmax(logits, 1),
  'prob': tf.nn.softmax(logits)
}

Just to be complete, I also send along the actual class index and probabilities of each class.

Training and deploying

With the code all written (see full code here), I can train it on Cloud ML Engine:

OUTDIR=gs://${BUCKET}/txtcls1/trained_model
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gsutil cp txtcls1/trainer/*.py $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=$(pwd)/txtcls1/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=BASIC --runtime-version=1.2 \
  -- \
  --bucket=${BUCKET} \
  --output_dir=${OUTDIR} \
  --train_steps=36000

The dataset is quite small, so training finished in less than five minutes and I got an accuracy on the evaluation dataset of 73%.

I can then deploy the model as a microservice to Cloud ML Engine:

MODEL_NAME="txtcls"
MODEL_VERSION="v1"
MODEL_LOCATION=$(gsutil ls \
gs://${BUCKET}/txtcls1/trained_model/export/Servo/ | tail -1)
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model \
${MODEL_NAME} --origin ${MODEL_LOCATION}

Prediction

To get the model to predict, we can send it a JSON request:

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1beta1', credentials=credentials,
discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1beta1_discovery.json')

request_data = {'instances':
  [
    {
      'title': 'Supreme Court to Hear Major Case on Partisan Districts'
    },
    {
      'title': 'Furan -- build and push Docker images from GitHub to target'
    },
    {
      'title': 'Time Warner will spend $100M on Snapchat original shows and ads'
    },
  ]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'txtcls', 'v1')
response = api.projects().predict(body=request_data, name=parent).execute()
print "response={0}".format(response)

This results in a JSON response:

response={u'predictions': [{u'source': u'nytimes', u'prob': [0.7775614857673645, 5.86951500736177e-05, 0.22237983345985413], u'class': 0}, {u'source': u'github', u'prob': [0.1087314561009407, 0.8909648656845093, 0.0003036781563423574], u'class': 1}, {u'source': u'techcrunch', u'prob': [0.0021869686897844076, 1.563105769264439e-07, 0.9978128671646118], u'class': 2}]}

The trained model predicts that the Supreme Court article is 78% likely to come from the New York Times. According to the service, the Docker article is 89% likely to be from GitHub, and the Time Warner one is nearly 100% likely to be from TechCrunch. That’s 3/3.
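
If you want something more readable than the raw JSON, a small post-processing sketch (assuming the response dictionary shown above) is:

# print the predicted source and its probability for each instance
for pred in response['predictions']:
    print('{}: {:.0%}'.format(pred['source'], max(pred['prob'])))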

Resources: All the code is on GitHub here: https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/blogs/textclassification
