Data Science Sprints

Building and Deploying a Data Science Project in Two Weeks

Learn the basics of Natural Language Processing, Flask and ML Model Deployment by building something fun!

Harsh Rana
Towards Data Science
7 min read · Dec 9, 2019


End product: a sentiment analysis web application with built-in ongoing learning capacity

I’m a kinesthetic learner; I learn by doing, building and breaking things. In this article, I’ll share how I built a basic sentiment analysis machine learning model and web application, and deployed it in under two weeks, to learn three new technologies: natural language processing (NLP), Flask and incremental/ongoing learning.

Learning Outcomes

My primary goals with this undertaking were the following:

  1. Learn NLP basics and train a sentiment analysis classifier
  2. Learn how to build a web application using Flask
  3. Learn ML-based web application deployment

I wanted to do all this in a rapid building/learning mode with a focus on functionality, not on aesthetics. Some of the other topics I ended up exploring and learning along the way include model optimization, incremental/ongoing learning and jQuery. All Jupyter notebooks and code used in this article can be found on my GitHub.

The Dataset

For this project, I needed a dataset which contained text data, but was already labelled and structured. This would enable me to focus on NLP-based preprocessing while skimming over other forms of data preparation. After looking at a few different options, I ended up choosing the Yelp reviews dataset because it was pre-labelled (text review and 1–5 star ratings), but the actual text inside the reviews was messy and inconsistent. You can see an example record from the raw JSON dataset below:

Example record from raw JSON data

For my model, I was only interested in the text review and the 1–5 stars (maybe in the future I could try doing something fun with the other attributes). Additionally, to further simplify my training process and stick to the rapid development mindset, I decided to convert the 1–5 star ratings into a binary label: positive (4 or 5 stars) or negative (1 or 2 stars) sentiment associated with the textual review. I dropped the reviews with a 3-star rating because their classification would be difficult to validate. A few lines of code later, I had my data inside a pandas dataframe, shown below:

Pandas dataframe with just the text review and positive/negative sentiment
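The loading-and-relabelling step can be sketched as follows. This is a minimal sketch: the inline records, the `"p"`/`"n"` label values and the column names are illustrative stand-ins for the real Yelp JSON-lines file, which has one review object per line with many more attributes.

```python
import io
import pandas as pd

# Two-and-a-bit illustrative records in the Yelp reviews JSON-lines format
# (stand-in for the real file, which has one JSON object per line)
raw = io.StringIO(
    '{"stars": 5.0, "text": "Great food and friendly staff!"}\n'
    '{"stars": 3.0, "text": "It was okay, nothing special."}\n'
    '{"stars": 1.0, "text": "Terrible service, would not return."}\n'
)

# Keep only the review text and the star rating
df = pd.read_json(raw, lines=True)[["text", "stars"]]

# Drop 3-star reviews, then map 4-5 stars to positive and 1-2 to negative
df = df[df["stars"] != 3]
df["sentiment"] = df["stars"].map(lambda s: "p" if s >= 4 else "n")
df = df.drop(columns="stars")
```

Reading the real dataset would swap the `StringIO` buffer for the path to the downloaded Yelp reviews file.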

There are many potential issues with the text. Some characters such as * and \n serve no real purpose in helping understand the text, while others, such as $ and !, may bring insight into how the reviewer was feeling. Additionally, there are many words and slang terms which a human could understand ($8Gs = $8,000), but which would make little to no sense to a machine. All these and many more scenarios will be tackled in the next section.

Natural Language Processing Pipeline

To prepare the data for classification, we first have to design an NLP pipeline to prepare the text data. There are several steps involved in an NLP pipeline, some of which can be seen below:

Sample NLP Pipeline

In our case, steps such as part-of-speech tagging and named entity recognition would not be very beneficial, as the task at hand is a simple classification problem. So we’ll focus on techniques which offer the most bang for our buck: tokenization, stop-word removal and n-gram utilization (discussed below).

For the first iteration of my NLP model, I decided to implement a simple bag-of-words model using sklearn’s CountVectorizer() and train it on roughly 100,000 rows. I utilized sklearn’s train_test_split() method to break up my data into training and testing blocks. After trying a few different classifiers such as logistic regression, multinomial Naive Bayes and a linear support vector machine, I found that the logistic regression model worked best for this data. Here are the confusion matrices of the respective classifiers:

Confusion Matrices for the various classifiers
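The first-iteration setup looks roughly like this. It's a sketch on a tiny toy corpus (the real run used ~100,000 Yelp reviews), and the variable names and split parameters are illustrative, not the exact settings from my notebooks.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny stand-in corpus; the real run used ~100,000 labelled Yelp reviews
texts = ["loved it", "amazing food", "great place", "really enjoyed this",
         "terrible service", "awful food", "hated it", "very disappointing"]
labels = ["p", "p", "p", "p", "n", "n", "n", "n"]

# Bag-of-words: each review becomes a sparse vector of word counts
cv = CountVectorizer()
X = cv.fit_transform(texts)

# Hold out a test block to evaluate the classifier on unseen reviews
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels)

clf = LogisticRegression()
clf.fit(X_train, y_train)
```

Swapping `LogisticRegression` for `MultinomialNB` or a linear SVM is a one-line change, which is what made comparing the three classifiers cheap.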

Thinking further ahead, I found that the logistic regression model would be unable to utilize incremental learning later in the project, while the linear SVM (powered by sklearn’s SGDClassifier()) would. The accuracy trade-off was small, with the logistic regression having an average F1-score of 93% and the SVM an average F1-score of 92%. Thus, I decided to proceed with the linear SVM so I could utilize incremental learning later in the project.

Additionally, I chose to replace CountVectorizer() with HashingVectorizer(), which uses a technique called feature hashing to decrease the size of the input processing model by over 99%. The only major trade-off is losing the ability to look at explicit words in the model (hashing converts all words to numbers), which wasn’t a big concern for me. You can read more about this model here.
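The swap looks roughly like this. It's a sketch: the `n_features` value and `SGDClassifier` parameters are illustrative defaults, not necessarily the exact settings I used.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer maps each word to a column index with a hash function,
# so no vocabulary needs to be stored (at the cost of interpretability)
cv = HashingVectorizer(n_features=2**18, alternate_sign=False)

texts = ["great food", "awful service"]
labels = ["p", "n"]

# The vectorizer is stateless, so transform() needs no prior fit()
X = cv.transform(texts)

# With hinge loss, SGDClassifier is a linear SVM trained via SGD --
# the estimator that later enables partial_fit() for incremental learning
clf = SGDClassifier(loss="hinge", random_state=42)
clf.fit(X, labels)
```

Because the vectorizer holds no vocabulary, only the hashed feature matrix and the classifier weights need to be stored, which is where the size reduction comes from.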

Model Optimization

I did a few things to optimize my model. As you can see from the confusion matrix above, my training and testing data both had fewer negative reviews than positive ones. This resulted in my F1-score for the negative label being 82% and for the positive label being 94%. To mitigate this disparity, I had to balance my dataset by making sure that the number of records corresponding to the two labels was comparable. After balancing my dataset, I proceeded to omit stop words like “and”, “the” and “a” from the bag-of-words model.
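These two optimizations can be sketched as follows. The dataframe here is a hypothetical imbalanced stand-in, and downsampling the majority class is one reasonable way to balance; my notebooks may differ in the details.

```python
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical imbalanced data: 6 positive reviews vs. 3 negative
df = pd.DataFrame({
    "text": ["great place"] * 6 + ["awful place"] * 3,
    "sentiment": ["p"] * 6 + ["n"] * 3,
})

# Downsample each label to the size of the smallest class
n = df["sentiment"].value_counts().min()
balanced = df.groupby("sentiment").sample(n=n, random_state=42)

# Drop common English stop words ("and", "the", "a", ...) during vectorization
cv = HashingVectorizer(stop_words="english", alternate_sign=False)
```

sklearn's built-in `stop_words="english"` list is a convenient default; a hand-curated list is another option if domain words need protecting.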

After these two steps my model’s F1-score dropped from 92% to 91%, but the previous disparity between the two labels was smaller. The F1-scores for my negative and positive reviews were 91% and 90% respectively, so overall the classifier was performing better. Next, I decided to utilize n-grams to further optimize my model.

An n-gram simply refers to a contiguous group of n words which co-occur. For example:

In the sentence "I love data science", the 1-grams would be ["I", "love", "data", "science"]. The 2-grams would be ["I love", "love data", "data science"] and so on.

You can see from the example above why utilizing n-grams would be helpful. The 2-gram “data science” appearing in a sentence provides more information than the 1-grams “data” and “science” alone. Now, back to the optimization: I utilized 1- and 2-grams to further strengthen my model. The confusion matrices before and after these optimizations can be seen below:

Before and after the ngram optimization

As you can see, this was a major improvement as both positive and negative incorrect guesses dropped. The model’s final F1-score moved up to 92%, and the disparity between the two classes was negligible. At this point, the model was ready. Next up, incremental/ongoing learning.

Incremental Learning

As you’ve already seen, our sentiment classification model is far from perfect. This is where the beauty of incremental learning comes in.

Incremental learning is a machine learning paradigm where the learning process takes place whenever new example(s) emerge and adjusts what has been learned according to the new example(s).¹

If you think about how human beings learn language, you’ll quickly see that they’re learning incrementally. Every time a parent corrects a child, a teacher corrects a student or two friends correct each other, a form of incremental learning takes place. In this section, we’ll build a similar system for our NLP model so users can “correct” it when it makes a classification error.

Remember how we chose SGDClassifier() for this functionality earlier? It has a built-in partial_fit() method which lets us implement incremental learning in a few lines of code:

# Incremental training
# example x
X_instance = cv.transform(["I like this place, but not much"])
# user-determined label
y_instance = ['n']
# max iterations of training until the classifier relearns
max_iter = 100
# partially fit the classifier on the new instance, stopping as soon
# as it predicts the correct label (or max_iter is reached)
for i in range(max_iter):
    clf.partial_fit(X_instance, y_instance)
    if clf.predict(X_instance)[0] == y_instance[0]:
        break

Using this, we can train our model on new data whenever a mistake is made while still maintaining the initial training. An example can be seen below:

Example of incremental learning

Wrap-up and Part 2

So far we’ve achieved quite a bit: we’ve created a basic NLP classifier which can read a piece of text and classify the sentiment as either positive or negative. We’ve optimized our model using NLP techniques such as stop-word removal and n-grams to achieve a baseline F1-score of 92%. And lastly, we’ve written code to incrementally train our classifier whenever it makes a mistake, so the learning never stops!

In the next part, we’ll build on top of everything we’ve covered so far and build a basic web application using Flask. In particular, we’ll build APIs to classify user-entered text and incrementally teach/correct our model. This will enable us to interact with our model as a user and further learn about machine learning model deployment!

I hope you enjoyed reading this article just as much as I enjoyed putting it together. If you have any questions, feedback or comments, feel free to get in touch with me through LinkedIn, or my website.

[1]: Geng, X., Smith-Miles, K. (2009). Incremental Learning. In: Li, S.Z., Jain, A. (eds) Encyclopedia of Biometrics. Springer, Boston, MA.

