In my previous article, I discussed the first step of conducting sentiment analysis: preprocessing the text data. That process includes tokenization, removing stopwords, and lemmatization. In this article, I will discuss the next step, transforming the "cleaned" text data into a sparse matrix. Specifically, I will walk through the use of different vectorizers with simple examples.
Before we get more technical, I want to introduce two terms that are widely used in text analysis. The collection of text data we want to analyze is called a corpus. A corpus contains several observations, such as news articles or customer reviews, and each of these observations is called a document. I will use these two terms from now on.
The transformation step builds a bridge between the information carried in the text data and the machine learning model. For sentiment analysis, to make sentiment predictions on each document, the model needs to learn the sentiment contribution of each unique word in the document and how many times each word appears there. For example, if we conduct sentiment analysis on customer reviews of a product, after training, the model is likely to pick up words like "bad" and "unsatisfied" from negative reviews, and words like "awesome" and "great" from positive reviews.
As with any supervised machine learning problem, to train the model we need to specify features and target values. Sentiment analysis is a classification problem, and in most cases a binary classification problem, with target values defined as positive and negative. The features fed to the model are the transformed text data from a vectorizer, and they are constructed differently by different vectorizers. Scikit-learn provides three vectorizers: CountVectorizer, TfidfVectorizer, and HashingVectorizer. Let’s discuss the CountVectorizer first.
CountVectorizer
The CountVectorizer uses the bag-of-words approach, which ignores text structure and only extracts information from word counts. It transforms each document into a vector whose entries are the occurrence counts of each unique word in that document. If the corpus contains m documents with n unique words across all of them, the CountVectorizer transforms the text data into an m*n sparse matrix. Here is an example showing the use of the CountVectorizer:
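The original snippet is not reproduced here, so below is a minimal sketch of the same idea. The two documents are hypothetical, chosen so the corpus has the six unique words and the shared word "day" discussed below (get_feature_names_out requires scikit-learn 1.0 or later):

```python
from sklearn.feature_extraction.text import CountVectorizer

# two hypothetical documents that make up the corpus
docs = ["great sunny day", "enjoy every single day"]

vectorizer = CountVectorizer()
vectorizer.fit(docs)             # learn the vocabulary of the corpus
X = vectorizer.transform(docs)   # count word occurrences per document

print(vectorizer.get_feature_names_out())
# ['day' 'enjoy' 'every' 'great' 'single' 'sunny']
print(X.toarray())
# [[1 0 0 1 0 1]
#  [1 1 1 0 1 0]]
```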

The CountVectorizer takes a list of documents and produces a sparse matrix in two steps: fit and transform. During the fitting process, the vectorizer reads in the list of documents, counts the number of unique words in the corpus, and assigns an index to each word. In the example above, there are six unique words across the two documents, and each of them is assigned an index in alphabetical order. Note that you can specify stopwords here to exclude useless words, using either the default list or a customized one. If you have already preprocessed the text data, you can skip this step.
The next step is to transform the fitted data. The CountVectorizer counts the occurrences of each unique word in each document. Here I have two documents and six unique words, so we get the 2*6 matrix shown above. To better understand the elements of the matrix, here is a labeled view:
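The original graph is not reproduced here; a rough equivalent, assuming the same two hypothetical documents as above, is to label the matrix with pandas:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great sunny day", "enjoy every single day"]   # same hypothetical documents
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# rows are documents, columns are the unique words in alphabetical order
print(pd.DataFrame(X.toarray(),
                   columns=vectorizer.get_feature_names_out(),
                   index=["document 0", "document 1"]))
#             day  enjoy  every  great  single  sunny
# document 0    1      0      0      1      0      1
# document 1    1      1      1      0      1      0
```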

Each row corresponds to a document, and each column follows the index assigned to a unique word during the fitting process. For example, the word "day" shows up in both documents, so the first column reads (1, 1). If a word does not show up in a document, the entry for that word in that document's row is 0. As the number of documents increases, the matrix becomes sparse, since more and more of its entries are 0.
TfidfVectorizer
Another widely used vectorizer is the TfidfVectorizer. TFIDF is short for term frequency-inverse document frequency. Besides the word counts within each document, TFIDF also accounts for how often each word occurs in the other documents. Specifically, the TFIDF value is calculated as:

TFIDF_i,wj = ( t_i,wj / |d_i| ) × log( N / N_wj )

where t_i,wj is the number of times word wj appears in document i, |d_i| is the total number of words in document i, N is the total number of documents in the corpus, and N_wj is the number of documents that contain wj. Examining the equation, the first term calculates the term frequency and the second term calculates the inverse document frequency. The first term evaluates how many times the word wj appears in document i, normalized by the length of document i. A higher term frequency leads to a higher TFIDF value, reflecting the fact that wj plays an important role in document i by appearing many times. However, the effect of wj is weakened if wj also appears in many documents besides i, which means it is a common word for this topic. This effect is captured by the second term, the total number of documents divided by the number of documents in which wj appears. Combining the two effects, a word wj with a high TFIDF value in document i appears many times in document i and in only a few other documents.
Using the TfidfVectorizer with the previous example, here is the difference:
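Again, the original snippet is not reproduced, so here is a minimal sketch with the same hypothetical documents. Note that scikit-learn's TfidfVectorizer uses a smoothed version of the idf term and L2-normalizes each row, so the numbers differ slightly from the plain formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["great sunny day", "enjoy every single day"]   # same hypothetical documents

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
# ['day' 'enjoy' 'every' 'great' 'single' 'sunny']
print(X_tfidf.toarray().round(2))
# [[0.45 0.   0.   0.63 0.   0.63]
#  [0.38 0.53 0.53 0.   0.53 0.  ]]
# "day" gets a lower weight than the other words because it appears in both documents
```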

We can see that the values of the elements are smaller, since they are now weighted and normalized rather than raw counts, but the shape of the matrix is still the same.
HashingVectorizer
Another commonly used vectorizer is the HashingVectorizer. It is usually used when dealing with large datasets. By applying feature hashing, the HashingVectorizer is memory efficient and scales well to large corpora, although the hashed features can no longer be mapped back to the original words. I won't go into too much detail in this article, but you can look up more information here.
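For completeness, here is a minimal sketch of its usage with the same hypothetical documents; the n_features value is just an illustrative choice:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["great sunny day", "enjoy every single day"]   # same hypothetical documents

# n_features fixes the output dimension up front; no vocabulary is stored in memory
hasher = HashingVectorizer(n_features=2**10)
X_hashed = hasher.transform(docs)   # stateless, so no fit step is required
print(X_hashed.shape)               # (2, 1024)
```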
Additional function inputs
Besides specifying and customizing the stopwords, we can pass a customized tokenizer function to the vectorizers. As discussed in my previous article, using a customized tokenizer here slows down the vectorizing process.
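As an illustration, here is a hypothetical NLTK-based lemmatizing tokenizer passed to the CountVectorizer; the exact tokenizer from my previous article may differ:

```python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

lemmatizer = WordNetLemmatizer()

def custom_tokenizer(text):
    # tokenize, then lemmatize each token (requires the NLTK 'punkt' and 'wordnet' data)
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text.lower())]

# the callable replaces the vectorizer's built-in tokenization
vectorizer = CountVectorizer(tokenizer=custom_tokenizer)
```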
In the previous examples, we constructed the sparse matrix with only single words. We can increase the number of features by including bigrams, which is specified through the ngram_range argument. Here is an example:
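The original snippet is not reproduced here; a sketch with the same hypothetical documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great sunny day", "enjoy every single day"]   # same hypothetical documents

# (1, 2) keeps single words and adds bigrams as additional features
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_bigram = bigram_vectorizer.fit_transform(docs)

print(bigram_vectorizer.get_feature_names_out())
# ['day' 'enjoy' 'enjoy every' 'every' 'every single' 'great' 'great sunny'
#  'single' 'single day' 'sunny' 'sunny day']
print(X_bigram.shape)   # (2, 11) -- six unigrams plus five bigrams
```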

By including bigrams, the number of features increases from six to eleven. When the documents contain phrases like "not bad", including bigrams can improve model performance, because the bigram captures a sentiment that the individual words miss.
You can also specify min_df and max_df in the vectorizer function. min_df sets how many documents a word must appear in before it is kept as a feature, filtering out words that are too rare across the corpus. max_df sets an upper limit on how many documents a word may appear in, ignoring words that are too common, like stopwords. Tailoring these inputs to different scenarios should improve model performance.
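For example, a hypothetical configuration (the thresholds are arbitrary and should be tuned for your corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

# keep words that appear in at least 5 documents (min_df) and
# drop words that appear in more than 90% of the documents (max_df)
vectorizer = CountVectorizer(min_df=5, max_df=0.9)
```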
It is useful to know the customization options of the vectorizers. For more choices, you can visit the scikit-learn documentation for each vectorizer. To get the best model performance, we can use GridSearchCV to tune the hyperparameters of these transformers. In my next article, I will discuss more details as I apply TFIDF in my project and construct the estimators.
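As a preview, here is a minimal sketch of how that tuning could be wired up; the LogisticRegression classifier and the parameter values are only placeholders, not the estimator I will use in the project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# tune the vectorizer's hyperparameters together with the classifier
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2, 5],
    "tfidf__max_df": [0.9, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
# search.fit(documents, labels)   # documents and labels stand in for your corpus and targets
```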
Thank you for reading! Here is the list of all my blog posts. Check them out if you are interested!