
Working with text data is one of the most exciting parts of Data Science projects. Twenty years ago, it seemed unlikely that processing and storing text data would become routine for so many organizations, or that so many data pipelines would revolve around this type of data.
Oddly enough, storing text data for feature engineering or data science algorithms doesn't come as naturally as one might think. For starters, text lives in a computer as a binary representation – a sentence or a document is, underneath, just a bunch of characters mapped to bytes.
At first glance, this makes it hard to use text data in data pipelines, modelling or decision making. Luckily, there are techniques to represent text as mathematical arrays that can be fed into algorithms, turning our text into the holy grail of most analytics: tabular data.
A question unfolds – how should I represent my text as tabular data? Is this even possible? Luckily, there are plenty of techniques we can use to build this representation – let's explore three of them next.

Binary Vectorizer
The first technique that we are going to talk about is a really simple one that is still widely used throughout Natural Language Processing pipelines – the binary vectorizer.
Let’s imagine the following two sentences:
‘I went to the grocery store’
‘I went to the movie theater’
If we want to represent both of these sentences in tabular or array format, we can start by extracting the distinct words of our corpus (corpus is the name we usually give to the collection of texts). Let's do that using Python code (which I'll use throughout the article):
sentence_1 = 'I went to the grocery store'
sentence_2 = 'I went to the movie theater'
vocab = set(
    sentence_1.split(' ') + sentence_2.split(' ')
)
Our vocab object now contains the distinct words of the corpus:
- I, went, to, the, grocery, store, movie, theater
If we order our vocab (vocab is a set, so we turn it into a sorted list – sorting case-insensitively so that 'I' lines up with the lowercase words):
vocab = sorted(vocab, key=str.lower)
We get as a result a list with the following elements:
grocery, I, movie, store, the, theater, to, went
Let's proceed to create an array of zeros with the number of elements of our vocab, where each word w in position j of our list will be mapped to the position j of our array:
import numpy as np
array_words = np.zeros(len(vocab))
Our example array will contain the following elements:
[0, 0, 0, 0, 0, 0, 0, 0]
We can map our sentences to this array by setting each element j to 1 when the word w is present in the sentence – let's start with our first sentence 'I went to the grocery store' – and update our array accordingly:
[1, 1, 0, 1, 1, 0, 1, 1]
Visualizing our vocab list and array at the same time will make this more explicit:
grocery, I, movie, store, the, theater, to, went
[1, 1, 0, 1, 1, 0, 1, 1]
Notice that only the words that are not present in our sentence are set to 0 – this is a really simple way to map sentences into numerical arrays. Let's check the array produced with the same logic for the second sentence:
grocery, I, movie, store, the, theater, to, went
[0, 1, 1, 0, 1, 1, 1, 1]
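If you want to reproduce this mapping programmatically, here is a minimal sketch (binary_vector is just an illustrative helper, not part of any library), reusing the sorted vocab list and the sentences from before:
# Manual binary vectorization: 1 if the word is in the sentence, 0 otherwise
def binary_vector(sentence, vocab):
    words = set(sentence.split(' '))
    return [1 if word in words else 0 for word in vocab]

print(binary_vector(sentence_1, vocab))  # [1, 1, 0, 1, 1, 0, 1, 1]
print(binary_vector(sentence_2, vocab))  # [0, 1, 1, 0, 1, 1, 1, 1]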
Creating both of these arrays in numpy:
array_words = np.array([
[1,1,0,1,1,0,1,1],
[0,1,1,0,1,1,1,1]
])
We now have a simple mathematical representation of our sentences – luckily we don't have to do this by hand for every sentence in a corpus, as scikit-learn has an excellent implementation of this in the feature_extraction.text module, in a class called CountVectorizer.
Imagining we have our sentences as a list:
sentence_list = ['I went to the grocery store',
'I went to the movie theater']
We can define the CountVectorizer object with binary set to True (spoiler alert, this is the parameter that draws the line between a pure CountVectorizer and a BinaryVectorizer!) and tokenizer equal to str.split – don't fret too much about this last option; it is just a way to mimic the same array we built before (without it, "I" would be removed from the output, as single-character tokens are dropped by the default tokenizer in the scikit-learn implementation of the Vectorizer):
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(tokenizer=str.split, binary=True)
And then apply fit_transform on our list (we could also pass a dataframe column with several sentences, for example), which will produce an array just like the one we built manually before:
cvec.fit_transform(sentence_list).todense()
Notice the todense() method called after fit_transform. We do this because fit_transform returns the resulting object in sparse matrix format to save space – we will discuss this further below.
Let’s look at our array produced by the instruction above:
[[1, 1, 0, 1, 1, 0, 1, 1],
[0, 1, 1, 0, 1, 1, 1, 1]]
Looks familiar? It is the same array we produced manually before! You can now generalize this approach to any collection of sentences or documents you have.
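To confirm which column corresponds to which word, you can inspect the fitted vocabulary (in older scikit-learn versions the method is called get_feature_names instead of get_feature_names_out):
# Column order of the array produced by the vectorizer
print(cvec.get_feature_names_out())
# ['grocery' 'i' 'movie' 'store' 'the' 'theater' 'to' 'went']
Note that 'I' shows up as 'i' because CountVectorizer lowercases text by default.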
Now let's pose a question using the following sentence:
‘I went to the grocery store and then went to the bike store’
The words 'went', 'to', 'the' and 'store' appear twice in our sentence. In the binary vectorizer approach, the array only flags the presence (1) or absence (0) of the word in the sentence. For an NLP application it might make sense to have the real count of the words – let's see how we can do this with a simple change.
Count Vectorizer
The count vectorizer is a really similar approach to the one above. Instead of flagging the presence of a word with 1's and 0's, we count the occurrences of words in the sentence.
Picking up from the example above, we need to add some words to our vocab, as we have new words in our third sentence that were not present in our first two sentences.
Recall our first vocab object:
grocery, I, movie, store, the, theater, to, went
Let’s add the words ‘and’, ‘then’ and ‘bike’:
and, bike, grocery, I, movie, store, the, theater, then, to, went
This will increase the size of our array! Let’s map the sentence ‘I went to the grocery store and then went to the bike store’ to the new array that originates from the vocab, keeping the Binary format:
and, bike, grocery, I, movie, store, the, theater, then, to, went
[1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
Now if instead of having a binary vectorizer we have a count vectorizer that counts the words we have the following:
and, bike, grocery, I, movie, store, the, theater, then, to, went
[1, 1, 1, 1, 0, 2, 2, 0, 1, 2, 2]
The difference is that our array now has the value 2 for each word that appears twice in our sentence. In some models, building the array of features in this way may lead to better results.
Here, a higher value will be shown in this sentence's array for the words 'store', 'the', 'to' and 'went' – whether this is good for your NLP application really depends on how you want your array to convey the information from the corpus and what type of model you are building.
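To make the counting explicit, here is a small manual sketch (count_vector is just an illustrative helper), using the extended vocab and collections.Counter:
from collections import Counter

vocab = ['and', 'bike', 'grocery', 'I', 'movie', 'store',
         'the', 'theater', 'then', 'to', 'went']

def count_vector(sentence, vocab):
    counts = Counter(sentence.split(' '))
    return [counts[word] for word in vocab]

print(count_vector('I went to the grocery store and then went to the bike store', vocab))
# [1, 1, 1, 1, 0, 2, 2, 0, 1, 2, 2]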
The implementation in scikit-learn is super similar to what we have done before:
cvec_pure = CountVectorizer(tokenizer=str.split, binary=False)
Binary, in this case, is set to False and will produce a more "pure" count vectorizer. binary=False is actually the default argument of the CountVectorizer object if you don't declare the argument when you call the function.
Updating our sentence list:
sentence_list = ['I went to the grocery store',
'I went to the movie theater',
'I went to the grocery store and then went to the bike store']
And applying our new count vectorizer to our sentences:
cvec_pure.fit_transform(sentence_list).todense()
Here is our resulting array for the three sentences:
[[0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1],
[0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1],
[1, 1, 1, 1, 0, 2, 2, 0, 1, 2, 2]]
As you’ve noticed the size of our array (the number of words w in our vocab) can scale up pretty fast. This may lead to some problems but we will address a tweak we can use at the end of the post.
TF-IDF
The approaches we've seen so far never take the corpus into account as a whole – we always look at sentences independently and assume that each text is unrelated to the other sentences or documents in the corpus.
A common way to take the corpus as a whole into account when producing feature arrays is the Term Frequency-Inverse Document Frequency matrix, commonly called TF-IDF.
The formula for TF-IDF seems daunting, but it's actually really simple – we'll use the simplest version of the formula (there are other versions out there, such as the smoothed one; details: https://stats.stackexchange.com/questions/166812/why-add-one-in-inverse-document-frequency):

tfidf(i, j) = (number of occurrences of word i in document j) × log(N / number of documents where word i occurs)

where N is the total number of documents (or sentences) in our corpus and log is the natural logarithm.
Notice that there are several terms in this equation – let's start with the first one, the number of occurrences of i in j: here we want to count how many times a specific word appears in a specific text. Returning to our three-sentence example:
sentence_list = ['I went to the grocery store',
'I went to the movie theater',
'I went to the grocery store and then went to the bike store']
Let's obtain our TF-IDF score for the word store (we will call it i) in the third sentence (we will call it j). How many times does the word i occur in the text j?
The answer is 2! The word store appears twice in this sentence. We can update our formula as we know the first term:

tfidf(store, j) = 2 × log(N / number of documents where store occurs)
Now let's compute the right-hand side of the equation – we need the number of documents in which the word i occurs. In this case, the word store appears in two of the documents in our corpus – and we can also compute N, which is the number of documents/sentences we have, 3 in this case:

tfidf(store, j) = 2 × log(3 / 2) ≈ 0.81
The value returned by the formula is approximately 0.81 – that is the TF-IDF score for the word store in the third sentence. This value would substitute the 1 in our binary vectorizer, or the 2 in our count vectorizer, at that position in the array.
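If you want to double-check the arithmetic, the score is easy to reproduce in Python (using the natural logarithm, as in our simplified formula):
import math

tf = 2   # 'store' appears twice in the third sentence
df = 2   # 'store' appears in two documents of the corpus
N = 3    # total number of documents/sentences

print(round(tf * math.log(N / df), 2))  # 0.81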
How would a word weigh more for a specific sentence? Two hypotheses:
- a) Either the word i appears more frequently in the document j.
- b) Or the word is rarer in the whole corpus.
Simulating both scenarios, starting with scenario a) – if the word store appeared 4 times in our text, our score would be higher:

tfidf(store, j) = 4 × log(3 / 2) ≈ 1.62
With scenario b), you can also raise the TF-IDF score for a specific word in a text if the word is rarer in the corpus – let's imagine that the word store only appeared in 1 of our sentences, keeping the value of 2 for the first term:

tfidf(store, j) = 2 × log(3 / 1) ≈ 2.20
As the word gets rarer in the corpus, the TF-IDF score gets higher for that word and sentence specifically. This is a relevant difference from the approaches we have seen before that didn’t take into account the distribution of words in the full corpus.
Of course, as we move into the realm of distributions, we have some downsides – if your population drifts a lot after the deployment of your NLP application (meaning the expected occurrence of certain words in the corpus changes), this can have a significant impact on your application.
Again, we don’t need to do all the math by ourselves! There is a cool implementation in scikit-learn that we can use:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf = TfidfVectorizer(tokenizer=str.split)
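As before, we fit the vectorizer on our sentence list and convert the sparse result to a dense array:
tf_idf_array = tf_idf.fit_transform(sentence_list).todense()
print(tf_idf.get_feature_names_out())
print(tf_idf_array[2])   # TF-IDF values for the third sentence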
Looking at our TF-IDF array for the third sentence (the one we computed above) and the corresponding vocab:
['and', 'bike', 'grocery', 'i', 'movie', 'store', 'the', 'theater', 'then', 'to', 'went']
[0.31, 0.31, 0.236, 0.18, 0, 0.471, 0.366, 0, 0.31, 0.366, 0.366]
Notice how store has the highest TF-IDF score in the sentence – this is because of two things:
- store is repeated in that sentence, so the left-hand side of the equation has a higher value.
- Of the repeated words (store, went, the and to), store is the rarest in the whole corpus.
You may also notice that the values in the scikit-learn implementation range between 0 and 1 and are not exactly the values we got with our simplified version of the formula. This is because the scikit-learn implementation performs normalization and smoothing by default (you can check the norm and smooth_idf parameters of the function).
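If you want to get closer to the raw numbers, you can turn both off – keep in mind that even then scikit-learn adds 1 to the IDF term, so the values still won't match our simplified formula exactly:
# Disable normalization and IDF smoothing
tf_idf_raw = TfidfVectorizer(tokenizer=str.split, norm=None, smooth_idf=False)
print(tf_idf_raw.fit_transform(sentence_list).todense())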
Dimensionality
In all of these approaches, and as we saw when we added a new sentence, dimensionality grows pretty fast. Although these vectorizers save the data in sparse format, sparing us some trouble with memory errors (particularly if we are working on our own laptops), this high dimensionality (a high number of columns in our arrays) may be problematic for a lot of NLP applications.
For all the approaches above, the scikit-learn implementations have two arguments that will help you deal with high dimensionality:
- min_df: Receives an integer that acts as a threshold for the minimum number of documents (or the minimum fraction, if you pass a float) a word must be present in to be kept as an array column.
- max_features: Receives an integer that sets the maximum number of columns your arrays are allowed to have.
Both of these approaches make you lose information, and whether to use them in your NLP pipeline depends, as always, on your application.
Let’s see an example for our Count Vectorizer with min_df set to 2:
cvec_limit = CountVectorizer(tokenizer=str.split, binary=False, min_df=2)
The returned array, instead of having the size of the full vocab, will only contain the words that appear in at least two sentences of the following corpus:
sentence_list = ['I went to the grocery store',
'I went to the movie theater',
'I went to the grocery store and then went to the bike store']
If you check the feature names, you get an array with only 6 columns, corresponding to the words:
['grocery', 'i', 'store', 'the', 'to', 'went']
And these words are exactly the ones that are present in at least two of our three sentences!
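You can verify this yourself by fitting the restricted vectorizer and inspecting the columns it keeps:
cvec_limit.fit_transform(sentence_list)
print(cvec_limit.get_feature_names_out())
# ['grocery' 'i' 'store' 'the' 'to' 'went']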
Try the same experiment with the max_features parameter to check whether you understood the intuition behind it!
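As a starting point for that experiment, a sketch could look like this (max_features keeps only the most frequent words across the whole corpus):
# Keep only the 5 most frequent words across the corpus
cvec_top = CountVectorizer(tokenizer=str.split, binary=False, max_features=5)
cvec_top.fit_transform(sentence_list)
print(cvec_top.get_feature_names_out())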
Conclusion
First and most important, here’s a small gist you can use for your projects:
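A minimal version, putting the three vectorizers we covered side by side (and assuming the sentence list from the examples above), could look like this:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentence_list = ['I went to the grocery store',
                 'I went to the movie theater',
                 'I went to the grocery store and then went to the bike store']

# Binary vectorizer: flags the presence/absence of each word
binary_vec = CountVectorizer(tokenizer=str.split, binary=True)

# Count vectorizer: counts the occurrences of each word
count_vec = CountVectorizer(tokenizer=str.split, binary=False)

# TF-IDF vectorizer: weighs counts by how rare each word is in the corpus
tfidf_vec = TfidfVectorizer(tokenizer=str.split)

for name, vectorizer in [('binary', binary_vec), ('count', count_vec), ('tf-idf', tfidf_vec)]:
    print(name)
    print(vectorizer.fit_transform(sentence_list).todense())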
There are more things you can do to avoid the dimensionality problem and to preprocess your text.
For example, common preprocessing techniques consist of stemming/lemmatizing your sentences or removing stop words – but remember – every time you preprocess your text, you are losing information! Always take that into account when computing your features and always scope your text data to the end goal itself (a classification model, computing word vectors, using a recurrent neural network, etc.).
Other than the techniques I've shown you, more research is being done – for example, turning Word Vectors (https://en.wikipedia.org/wiki/Word_embedding) into Sentence or Document vectors – but we will save that for another post!
Thank you for taking the time to read this post! Feel free to add me on LinkedIn (https://www.linkedin.com/in/ivobernardo/) and check my company's website (https://daredata.engineering/home).
If you are interested in getting training on Analytics you can also visit my page on Udemy (https://www.udemy.com/user/ivo-bernardo/)
_This example is taken from my NLP course for absolute beginners available on the Udemy platform — the course is suitable for beginners and people that want to learn the fundamentals of Natural Language Processing. The course also contains more than 50 coding exercises that enable you to practice as you learn new concepts._