The necessity of being data-driven: the Natural Language Processing case

Algorithms can do wonders but only if you remember to feed them

Adriana Cano
Towards Data Science


Natural Language Processing (NLP) is quite intuitive to a human being, but that’s not the case for a computer. The philosophical question as old as time, “Does the computer understand as we do?”, becomes very concrete when it comes to NLP. For this specific case we can ask: does the computer understand the relation between “burger” and the actual delicious dish you have tasted and held in your hand? Or we could take it a step further: can the computer tell the difference between “burger” and “pizza”, and does it understand the similarities between the two? To actually answer that, we need to take some steps back and consider what we are trying to tell the computer and how we are telling it (can we call it an “it” after all?).

Natural language, English in this case, has associated the word “burger” with that dish, and this word is what we give as input to the computer. This language, though, is natural to us but not to the computer. The general idea is that each character of the word is transformed into its ASCII code and then into a combination of 1s and 0s. One might say that now the computer has its own version of burger, just like any translation from one natural language to another. This would be enough, were it not for the fact that little can be done with such a representation of a word if we want to do more complex things like build expressions or phrases, or even worse: an entire document.
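
For instance, a minimal Python sketch of this “translation” for the word burger:

```python
word = "burger"

# Each character is mapped to its ASCII/Unicode code point, and each code
# point to a string of bits: this is all the computer "sees" of the word.
print([ord(c) for c in word])                 # [98, 117, 114, 103, 101, 114]
print([format(ord(c), "08b") for c in word])  # ['01100010', '01110101', ...]
```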

In a recent project I was assigned to, which, put in simple words, was a problem of ranking a collection of documents based on their similarity to a single document, I had my first ever experience with a real-life problem that applied NLP techniques. It is a thrilling subbranch of machine learning, if not the most thrilling one, but it comes with a lot of challenges, and the main one, the quantity (and quality) of the data being used, is not stressed enough, shadowed as it is by the thrill of doing machine learning. With the project coming to a (satisfying) end I reached a couple of conclusions that I deem worth sharing.

The usual drill of pre-processing:

If you have ever read an article where the process of text analysis does not start with pre-processing, please do send it my way, because I have yet to find one. The reason is that it is fundamental and one of the most important steps of the whole process: cleaning the text and making it as computer-understandable as possible is key to achieving considerably good results in this field.

Two documents for the input
  • To lowercase: the simple task of transforming all your text to lowercase to make the rest of the analysis easier
  • Punctuation and weird character removal: both are considered uninformative and are thus removed from the text we aim to analyze
  • Tokenization: the simple process of dividing your text into the list of words or sentences it is composed of
  • Stop word removal: in every language there is a (sort of) pre-defined set of words that are considered irrelevant for any kind of text analysis, e.g. ‘myself’, ‘they’, ‘the’ etc. NLTK has a very generic yet useful list of stopwords for different languages.

Depending on the task at hand, you can add your own set of keywords that are irrelevant for your analysis and thereby remove some noise from your data, which is likely to lead to better results. This also proved to be one of the points that took a great deal of our time: it is a human task, since only by reading and reasoning about the type of text you are considering can you evaluate which words do not add any relevant information for the machine to understand. Do spend some time looking for these keywords; more often than not it leads to better performance.
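
A minimal sketch of these steps with NLTK (assuming the punkt and stopwords resources have been downloaded; the extra_stopwords argument stands in for the custom, task-specific keywords mentioned above):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads: nltk.download("punkt"); nltk.download("stopwords")

def preprocess(text, extra_stopwords=None):
    """Lowercase, strip punctuation, tokenize and remove stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words("english")) | set(extra_stopwords or [])
    return [token for token in tokens if token not in stop_words]

print(preprocess("The burger, they say, was delicious!", extra_stopwords={"say"}))
# ['burger', 'delicious']
```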

NER visualization

  • NER (Named Entity Recognition): the sub-task of locating and classifying named entities (e.g. locations, people’s names, organizations, quantities etc.). Libraries that offer this include NLTK and SpaCy.

  • Stemming: the process of reducing inflected or derived words to their root, also known as the stem. E.g. fish, fishes and fisher are all reduced to the stem fish. Libraries that offer this include NLTK and Gensim.
  • Lemmatization: the process of extracting the lemma of a word through morphological analysis, where the lemma is the base or dictionary form of the word. Libraries that offer this include NLTK and SpaCy (a short sketch of NER, stemming and lemmatization follows this list).
  • Word embedding: this is the trickiest and most crucial part of it all, but what does this fancy term mean? It’s pretty simple: the representation of a word in numerical form. There are quite a few ways to go about it.
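
As promised above, a short sketch of stemming, lemmatization and NER, assuming NLTK’s wordnet resource and spaCy’s small English model (en_core_web_sm) are installed:

```python
import spacy
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Stemming: crude suffix stripping, no dictionary involved.
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["fish", "fishes", "fishing"]])  # ['fish', 'fish', 'fish']

# Lemmatization: morphological analysis down to the dictionary form
# (requires nltk.download("wordnet") once).
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("mice"))          # 'mouse'
print(lemmatizer.lemmatize("running", "v"))  # 'run'

# spaCy gives lemmas and named entities in one pass
# (model installed with: python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened an office in London")
print([(ent.text, ent.label_) for ent in doc.ents])  # e.g. ('Apple', 'ORG'), ('London', 'GPE')
print([token.lemma_ for token in doc])
```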

Word embedding: Bag-of-words

The ever classic example of most NLP academic slides

The BOW representation is the simple idea of representing a dictionary of D words in a D-dimensional space, where each word gets its own dimension and only the multiplicities are kept track of. The English language alone counts on the order of 170,000 words in current use, new ones are generated quite frequently, and this can only mean one thing: an ever-growing word space. This kind of representation also has another flaw: consider the case of a dictionary of 4 words [cat, lady, tickle, queen]. In a BOW representation the words would be encoded as [1000, 0100, 0010, 0001], which seems about right other than the fact that it does not incorporate any information about the semantics of the words. It is impossible to understand from such a representation that the words lady and queen are closer than cat and tickle, for example. It makes for a very limited representation if we intend to build more complex operations on top of these words. This word embedding approach, however, doesn’t require training and therefore doesn’t require a corpus either.

Its strength lies in its simplicity.
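
As a minimal illustration, scikit-learn’s CountVectorizer builds exactly this kind of count-based representation (the two-sentence corpus is made up for the example):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the lady tickled the cat",
    "the queen tickled the lady",
]

# One dimension per vocabulary word, keeping only the multiplicities:
# no information about meaning is encoded.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['cat' 'lady' 'queen' 'the' 'tickled']
print(bow.toarray())
# [[1 1 0 2 1]
#  [0 1 1 2 1]]
```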

Word embedding: TF-IDF

What can be considered the next step after BOW is the TF-IDF representation of words in a text corpus.

Its main goal is to quantify the importance of a term (t) for a document (d) in a collection of documents (D).

Term frequency (TF): the number of occurrences of a term t in the document d.


Inverse document frequency (IDF): translated into plain English, it is the inverse of the frequency with which a term appears across the whole collection of documents available for training, so that terms appearing in almost every document receive a low weight.

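In the standard textbook formulation (sklearn’s implementation adds smoothing and normalization on top of this, so take it as the canonical definition rather than the library’s exact formula):

$$\mathrm{tf}(t,d) = \text{number of occurrences of } t \text{ in } d$$

$$\mathrm{idf}(t,D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}, \qquad \text{tf-idf}(t,d,D) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)$$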

The well-known library sklearn has an implementation of this kind of word embedding. As is the case with almost all the other implementations, it is very easy to use once you understand what the parameters you give as input are. The interesting aspect is that you can define your own tokenization function and, circling back to the starting point, customize the pre-processing step as you wish. Once you have defined the tokenizer (you can also use the default one) it takes just two lines of code to obtain the numerical representation of your data.

In this case we are using just three parameters of the TfidfVectorizer: our customized tokenizer, normalization of the term frequencies, and max_features, an upper bound on the number of features, which in other words is the dimension of our representation.
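
A sketch of what those two lines look like, reusing the preprocess tokenizer from the pre-processing section and a made-up pair of training documents standing in for the project’s collection:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the real training documents.
train_documents = [
    "feature extraction turns raw text into numerical features for a model",
    "he wrote a long, looping passage about love and what a person endures",
]

# Custom tokenizer, L2-normalized frequencies, at most 10 features kept.
vectorizer = TfidfVectorizer(tokenizer=preprocess, norm="l2", max_features=10)
tfidf_matrix = vectorizer.fit_transform(train_documents)

print(vectorizer.get_feature_names_out())  # the retained terms
print(tfidf_matrix.shape)                  # (number of documents, number of features)
```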

What we get as output is a matrix whose number of rows equals the number of documents in the train set and whose number of columns equals the number of features we keep (max_features = 10), where by features we mean the terms in the documents with the highest TF-IDF score.

Top 10 features of the training documents

The features extracted are the terms with the highest score, meaning that these terms are the most relevant ones for representing the documents we provided to the model. Once you have this numerical representation of the documents, where each document is expressed through these 10 features, with position i of the vector holding the TF-IDF weight of the feature in position i for that document, you can do just about anything with it; in my case, compute the similarity between documents.

Once we have our TF-IDF model we can vectorize any other document with it.
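
For instance, with a made-up dummy document (illustrative, not the one used in the original project):

```python
# transform() reuses the vocabulary and IDF weights learned during fit.
dummy_document = ["I love feature extraction more than the average person does"]
dummy_vector = vectorizer.transform(dummy_document)
print(dummy_vector.toarray())
```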

Vectorial representation of the dummy document

To be precise, the representation of the dummy document is done through 4 keywords, feature, extraction, love and person, with their respective weights.

In just two lines you can compute the cosine similarity between the vectorized documents.
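
With sklearn, continuing from the vectorizer above, it could look like this:

```python
from sklearn.metrics.pairwise import cosine_similarity

# One similarity score per training document; the highest score wins the ranking.
similarities = cosine_similarity(dummy_vector, tfidf_matrix)
print(similarities)
```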

In this case, the dummy document is more similar to the passage from the DFW masterpiece than to the feature extraction explanation, which sounds fairly good to me.

We have taken a step beyond BOW: we now weight the occurrences of the terms by how informative they are, and we can also control the dimension of the representation space through the max_features parameter. We are, however, not yet able to associate a meaning to the terms we are teaching the computer. Nonetheless, it is the most used approach in text mining tasks: over 83% of text-based recommender systems in digital libraries use TF-IDF instead of overly fancy approaches [1].

Word embedding: Word2Vec

As the self-explanatory name indicates, this is an approach that transforms words into vectors. The fundamental difference with BOW, as well as with TF-IDF, lies in the fact that here we take into consideration the meaning of the words when associating a vectorial representation to them. In simple terms, this word embedding approach is a shallow neural network of just two layers that aims to learn the linguistic context of words. In this case the two words lady and queen would have similar representations. This kind of approach, however, requires a large corpus, which becomes an issue when you have a specific problem you are trying to apply it to and little data to train it with. It comes in two flavours, depending on the architecture being used:

Example with a window of 2
  1. CBOW (Continuous Bag of Words): the prediction of a word given those that surround it (the window). Based on the set of words surrounding the desired word, it tries to predict that word w(t). The prediction is based on a learning process over the corpus we provide to the model, therefore the more cases the model sees, the better it understands the relations between words and the better it is able to predict the possible w(t).
  2. Skip-gram: the prediction of the words surrounding a given word w(t). In this case we do the opposite: given a word, we predict the words that might surround it. Following the same line of reasoning, here too the larger the corpus we feed to the model, the better this prediction will be.

If you want to read something more on these last two models here is an interesting article.
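
A minimal Gensim sketch (assuming Gensim 4.x; the toy corpus is of course far too small to learn anything meaningful, which is precisely the point made below):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "lady", "greeted", "the", "queen"],
    ["the", "cat", "likes", "a", "good", "tickle"],
]

# sg=0 trains the CBOW architecture, sg=1 the skip-gram one;
# window is the number of context words on each side of the target word.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["queen"].shape)               # (50,) dense vector for 'queen'
print(model.wv.similarity("lady", "queen"))  # cosine similarity of the two vectors
```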

As the magical words “neural network” appear in the one-phrase explanation of this approach, one cannot be shocked that what follows is the necessity of a large, large quantity of data. Even for a human being, to learn a concept well and memorize trends and patterns, repetition time after time is the key: the same can be said about these kinds of models.

Now what?

At this point you have very powerful NLP tools at hand, but what do we do with them? If you’re a curious cat you’ll start with little code snippets where you test out a library or two and be left in wonder at what we humans have achieved so far in teaching the machine our language. Most often, it is not until you have a specific problem at hand that you start facing reality and how limited most of these approaches are. To obtain the ranking I needed for my project I first had to calculate the similarities between the documents, and to calculate the similarities I needed to have these documents in numerical form. It is not a question of just obtaining a numerical representation of a document, but of obtaining a representation that serves the goal, which in my case was the best possible ranking.

The most limiting aspect is the available amount of data

As in almost every application of any machine learning algorithm, a huge amount of data is required and this case is no exception; if anything, it is more critical than in most cases. The fundamental reason behind the necessity of this large amount of data is the need for generalization: an ML model is considered a good one if it is able to generalize and to act correctly on new, unseen test data. To reach a good level of generalization the model first needs to see as much data as possible, so that it doesn’t just focus on a few cases and shape itself around them (Hello overfitting, is that you?).

The most challenging issue is quite often the one faced in the real world: very little data at hand. With fewer than 100 samples to be ranked (which, once the solution was put to use, would reach at least 1,000) I stood in front of a board of possible solutions. The first move, a very ambitious (and, looking back, fairly stupid) one, was to go big: start with something like Doc2Vec, the equivalent of Word2Vec for documents. I knew from the get-go that it was an impossible approach: training a neural network on only 100 points or fewer and expecting it to learn well enough to represent a document as a vector based on those few cases is unrealistic. The results, you may wonder? I was able to generate the most unstable model I have ever had to work with. Not only was it not good at learning the representations I was looking for, it also wasn’t able to produce a stable model: removing or adding a single document would lead to different results.
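
For context, training a Doc2Vec model in Gensim looks roughly like this (a sketch, not the project’s actual code); with a corpus of only a hundred documents the learned vectors are essentially noise and change from one run to the next:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical tiny corpus: each document becomes a TaggedDocument with an id.
raw_documents = ["the lady greeted the queen", "the cat enjoys a good tickle"]
corpus = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(raw_documents)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# infer_vector embeds an unseen document; on a tiny training set the output
# is unstable and shifts noticeably between runs and re-trainings.
print(model.infer_vector("the queen and the lady".split()))
```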

Upon realizing this I went back to the board of possible solutions and started considering the ones discarded at the beginning, mainly because of their simplicity, namely TF-IDF. It worked way better than all the rest I had tested: shocking, isn’t it? Who would have thought that the widely used approach would do just fine for my problem as well? One might wonder what “worked way better” means: for the problem in question we didn’t use a traditional metric, so “an accuracy of 99%” is not what we are talking about here. The evaluation of the solution is human-based for now, in the sense that the results are presented to human experts of the field and they express how satisfied they are with the ranking of the documents. Of course there is plenty of room for improvement, the definition of a metric for starters, so that the evaluation of the solution can be further automatized; however, in corporate terms the efficiency of such a solution is measured by the cost reduction of the process being automatized, and indeed, the solution presented achieved that.

Moral of the story

The best thing to do when things go wrong, or simply at the end of the journey, is to find the moral of the story and learn from your mistakes, and this is what I learnt from mine:

The power of the data is not to be underestimated

A data science project without data, where by data I mean a lot of data and by a lot I mean thousands of samples, is like a fish without water. You can survive for a couple of seconds and experiment with this new exciting life, but just like a fish can survive only a few minutes without water, so will your project. The point of having very little data for any machine learning algorithm is that you can try, but you also have to know that the chance of failing and obtaining results of no value is high, very, very high. Nonetheless, giving it a try is always okay as long as you don’t spend days on it; in that case it becomes just a waste of time.

The power of simple approaches is not to be underestimated either

It is, however, very common that you have to deal with very little data, and that means you have to find a way to handle these kinds of cases. Unless you have plenty of data, or a pre-trained model on a huge corpus that works for you, there is one rule of thumb: go for the simple approaches first. While it might be hip to use buzzwords, and while it might be fun to play with very complex models, it is counterproductive when the results obtained are not satisfactory and you’re just throwing away time. Going for a simple approach like the TF-IDF vectorizer might sound very basic and might not capture that much attention, but more often than not it does the work, and in the end that’s what matters.
