
Word Embeddings and Document Vectors: Part 2. Order Reduction


Word-embeddings yield a linear transformation of n-long sparse document vectors (n being the size of the vocabulary making up the text corpus) to p-long dense vectors, with p << n, thus achieving a reduction in order.

In the previous post, Word Embeddings and Document Vectors: Part 1. Similarity, we laid the groundwork for using bag-of-words based document vectors in conjunction with word embeddings (pre-trained or custom-trained) for computing document similarity, as a precursor to classification. It seemed that document+word vectors were better at picking up on similarities (or the lack thereof) in the toy documents we looked at. We want to carry through with it and apply the approach to actual document repositories to see how the document+word vectors do for classification. This post focuses on the approach, the mechanics, and the code snippets to get there. The results will be covered in the next post in this series.

The outline for the article is as follows. The code for the full implementation can be downloaded from github.

  1. Pick a document repository (say the 20-newsgroups data set available via SciKit for multiclass classification, or the large movie review data set from Stanford for binary sentiment classification)
  2. Tokenize (stopped/stemmed) the corpus and obtain word-vectors (pre-trained and custom-trained) for the tokens. This needs to be done just once, and we use them in all the classification tests. We store them in Elasticsearch for easy and quick retrieval as needed. We will consider the Word2Vec/SGNS and FastText algorithms for the word-vectors. The Gensim API is used for generating the custom vectors and for processing the pre-trained ones.
  3. Build a SciKit pipeline that executes the following sequence of operations shown in Figure 1.
  • Get document tokens (stopped or stemmed) from an Elasticsearch index. Vectorize them (with CountVectorizer or TfidfVectorizer from SciKit) to get the high-order document-word matrix.
  • Embed word-vectors (Word2Vec, FastText, pre-trained or custom) fetched from an Elasticsearch index for each token. This results in a reduced order document-word matrix.
  • Run the SciKit-supplied classifiers Multinomial Naive Bayes, Linear Support Vectors, and Neural Nets for training and prediction. All classifiers employ SciKit's default values, except that the number of hidden layers and neurons has to be specified for the neural nets.
Figure 1. A schematic of the process pipeline

The code snippets shown in this post are what they are – snippets, snipped from the full implementation, and edited for brevity to focus on a few things. The github repo is the reference. We will briefly detail the tokenization and word-vector generation steps above before getting to the full process pipeline.

1. Tokenization

While the document vectorizers in SciKit can tokenize the raw text in a document, we would like to potentially control it with custom stop words, stemming and such. Here is a snippet of code that tokenizes the 20-news corpus, saving the tokens to an Elasticsearch index for future retrieval.

Code Listing 1: Tokenizing the 20-news corpus and indexing to Elasticsearch

In line 10 of the listing, we remove all punctuation, drop tokens that do not start with a letter, and drop those that are too long (> 14 characters) or too short (< 2 characters). Tokens are lowercased, stopwords are removed (line 14), and stemming is applied (line 18). In line 36 we remove the headers, footers, etc. from each post, as those would be a dead giveaway as to which newsgroup the article belongs to. Basically, we are making the classification harder.
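The full listing is in the github repo. For orientation, here is a condensed sketch of the same steps, assuming NLTK for stopwords and stemming and the official elasticsearch Python client; the index name, field names, and the exact regular expression are illustrative choices, not necessarily those of the repo.

```python
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords              # requires nltk.download('stopwords') once
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from elasticsearch import Elasticsearch, helpers

# tokens must start with a letter, contain only letters/digits, and be 2-14 characters long
tokenizer = RegexpTokenizer(r'\b[a-zA-Z][a-zA-Z0-9]{1,13}\b')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# strip the give-away metadata from each post before tokenizing
twenty_news = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

es = Elasticsearch('http://localhost:9200')
actions = []
for i, (text, label) in enumerate(zip(twenty_news.data, twenty_news.target)):
    tokens = [w.lower() for w in tokenizer.tokenize(text)]
    stopped = [w for w in tokens if w not in stop_words]
    stemmed = [stemmer.stem(w) for w in stopped]
    actions.append({'_index': 'twenty-news', '_id': i,
                    '_source': {'stopped': stopped, 'stemmed': stemmed, 'label': int(label)}})
helpers.bulk(es, actions)                      # index for later retrieval by the pipeline
```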

2. Word-Vectors

The following code snippet processes the published FastText word-vectors into an Elasticsearch index.

Code Listing 2: Processing pre-trained word-vectors with Gensim and indexing into Elasticsearch

In line 22 above we read the pre-trained vectors, and line 23 indexes them into Elasticsearch. We can also generate custom word-vectors from any text corpus at hand; Gensim provides a handy API for that as well.

Code Listing 3: Generating custom word-vectors with Gensim

In lines 35 and 41 the models are trained with the tokens (stopped or stemmed) obtained from the corpus index we created in Section 1. The chosen length for the vectors is 300. The min_count in line 30 refers to the minimum number of times a token has to occur in the corpus for that token to be considered.
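Both operations boil down to a few Gensim and Elasticsearch calls. The sketch below illustrates them with Gensim 4.x argument names (older versions use size instead of vector_size) and Word2Vec/SGNS for the custom training (gensim.models.FastText takes the same arguments); the file name, index names, and min_count value are illustrative assumptions rather than the repo's exact choices.

```python
from gensim.models import KeyedVectors, Word2Vec
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch('http://localhost:9200')

# 1. Published FastText vectors, distributed in word2vec text format
pretrained = KeyedVectors.load_word2vec_format('crawl-300d-2M.vec')
helpers.bulk(es, ({'_index': 'fasttext-pretrained',
                   '_source': {'word': word, 'vector': pretrained[word].tolist()}}
                  for word in pretrained.index_to_key))

# 2. Custom vectors trained on the stopped (or stemmed) tokens indexed in Section 1
corpus_tokens = [hit['_source']['stopped'] for hit in helpers.scan(es, index='twenty-news')]
model = Word2Vec(sentences=corpus_tokens, vector_size=300, min_count=2, sg=1, workers=4)
helpers.bulk(es, ({'_index': 'word2vec-custom',
                   '_source': {'word': word, 'vector': model.wv[word].tolist()}}
                  for word in model.wv.index_to_key))
```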

3. Process Pipeline

We vectorize the documents in the repo, transform and reduce the order of the model if word embeddings are to be employed, and apply a classifier for fitting and prediction, as shown in Figure 1 earlier. Let us look at each of these steps in turn.

3.1 Vectorize

We said earlier that we could use SciKit’s count/tf-idf vectorizers. They yield a document-term matrix X for sure, but our word-embedding step in the pipeline needs the vocabulary/words obtained by that vectorizer. So we write a custom wrapper class around SciKit’s vectorizer and augment the transform response with the vocabulary.

Code Listing 4: A wrapper to SciKit's vectorizers to augment the response with corpus vocabulary

The wrapper is initialized in line 1 with the actual SciKit vectorizer, along with min_df (the minimum number of documents a token has to appear in for it to be kept in the vocabulary) set to 2. Line 8 uses the fit procedure of the chosen vectorizer, and the transform method in line 12 issues a response with both X and the derived vocabulary V that the second step needs.
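Here is a minimal sketch of such a wrapper, assuming the transform response is packaged as a dictionary; the class name and response keys are illustrative, and the github listing is the reference.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

class VectorizerWrapper(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer):
        self.vectorizer = vectorizer              # a CountVectorizer or TfidfVectorizer

    def fit(self, docs, y=None):
        self.vectorizer.fit(docs)                 # learn the corpus vocabulary
        return self

    def transform(self, docs):
        X = self.vectorizer.transform(docs)       # sparse m x n document-word matrix
        V = self.vectorizer.vocabulary_           # {word: column index in X}
        return {'X': X, 'V': V}                   # the embedding step needs both

# e.g. VectorizerWrapper(CountVectorizer(min_df=2)) or VectorizerWrapper(TfidfVectorizer(min_df=2));
# if the documents are already token lists, pass analyzer=lambda tokens: tokens as well
```

Because the wrapper exposes fit and transform, it drops straight into a SciKit pipeline as the first step.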

3.2 Embed Words

We have m documents and n unique words among them. The core part of the work here is the following.

  1. Obtain the p-dimensional word-vector for each of these n words from the index we prepared in Section 2.
  2. Prepare an n x p word-vector matrix W where each row corresponds to a word in the sorted vocabulary
  3. Convert the m x n original sparse document-word matrix X to an m x p dense matrix Z by a simple matrix multiplication, Z = X · W. We have gone over this in the previous post. But note that SciKit works with documents as row-vectors, so the W here is the transpose of the one in Equation 1 of that post. Nothing complicated.

p is of course the length of the word-vector, i.e. the dimension of the fake p-word-space that the original one-hot, n-dimensional word vectors are projected into. We should be careful with the matrix multiplication, however, as X comes in from the vectorizer as a compressed sparse row (CSR) matrix while our W is a normal dense matrix. A bit of index jugglery will do it. Here is a code snippet around this step in the pipeline.

Code Listing 5: Building the reduced order dense matrix Z

Line 1 initializes the transformer with a wordvector object (check github for the code) that has the methods to fetch vectors from the index. Line 15 gets a sorted word list from the vocabulary passed in from the vectorizer step. The columns of the CSR matrix X follow that same word order, so we need to build W with its rows in that order as well. This is done in line 16, and finally the sparse matrix multiplication in line 17 yields the reduced-order matrix Z that we are after.
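For illustration, here is a minimal sketch of the same idea, with the Elasticsearch-backed wordvector lookup replaced by a plain word-to-vector dictionary for brevity; the class and attribute names are assumptions, not the repo's.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class EmbeddingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, word_vectors, p=300):
        self.word_vectors = word_vectors          # {word: p-long numpy array}
        self.p = p

    def fit(self, response, y=None):
        return self

    def transform(self, response):
        X, V = response['X'], response['V']       # sparse X and vocabulary from the wrapper
        words = sorted(V, key=V.get)              # words in the column order used by X
        W = np.zeros((len(words), self.p))        # the n x p word-vector matrix
        for row, word in enumerate(words):
            vector = self.word_vectors.get(word)
            if vector is not None:                # words without a vector keep a zero row
                W[row] = vector
        return X.dot(W)                           # (m x n) . (n x p) => dense m x p matrix Z
```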

3.3 Classify

This is easy. The classifier gets the m x p matrix Z, where each row is a document. It also gets the m x 1 vector of labels when fitting the model. We will evaluate three classifiers – naive bayes, support vector machines, and neural nets. We run them without tweaking any of the default SciKit parameters. In the case of neural nets we try a few different numbers of hidden layers (1, 2, or 3) and neurons per layer (50, 100, and 200), as there are no good defaults for those.

Code Listing 6: Preparing a list of classifiers for the pipeline

The method getNeuralNet in line 1 generates the tuple we need for initializing the neural net with the hidden layers and neurons. We prepare a suite of classifiers that are applied against the various combinations of vectorizers and transformers.
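As a sketch of what preparing that suite could look like: the helper below plays the same role as getNeuralNet, but its name, signature, and the dictionary of classifiers are assumptions for illustration.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

def get_neural_net(n_layers, n_neurons):
    # hidden_layer_sizes=(100, 100) means 2 hidden layers with 100 neurons each
    return MLPClassifier(hidden_layer_sizes=(n_neurons,) * n_layers)

# naive bayes and linear SVM with SciKit defaults, plus the neural net variations
classifiers = {'nb': MultinomialNB(), 'linearsvc': LinearSVC()}
for n_layers in (1, 2, 3):
    for n_neurons in (50, 100, 200):
        classifiers['mlp_{}x{}'.format(n_layers, n_neurons)] = get_neural_net(n_layers, n_neurons)
```

Each classifier is then paired with a vectorizer and, optionally, an embedding transformer to form one pipeline run.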

4. Next Steps

In the earlier post we studied document similarity with word embeddings. In this post we have laid out the machinery to apply those concepts to document repositories and obtain a reduced-order document-word matrix. In the next post we will run the simulations and study the impact of different tokenization schemes, word-vector algorithms, and pre-trained vs custom word-vectors on the quality and performance of the different classifiers.


A modified version of this article was originally published at xplordat.com on September 27, 2018.

