Another Twitter sentiment analysis with Python — Part 9 (Neural Networks with Tfidf vectors using Keras)

Published in Towards Data Science · 8 min read · Jan 31, 2018

This is the 9th part of my ongoing Twitter sentiment analysis project. You can find the previous posts from the links below.

In the previous post, I took a detour to implement dimensionality reduction before trying neural network modelling. In this post, I will first implement a neural network with the Tfidf vectors of 100,000 features, including up to trigrams.

*In addition to the short code blocks I attach, you can find the link to the whole Jupyter Notebook at the end of this post.

Artificial Neural Network

My first idea was that if logistic regression is the best performing classifier, then this idea can be extended to neural networks. In terms of its structure, logistic regression can be thought of as a neural network with no hidden layer and just one output node. You can see this relationship more clearly from the pictures below.
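To make the relationship concrete, here is a minimal NumPy sketch of logistic regression written as a "network" with one output node and no hidden layer. The toy inputs and the zero weights are made up for illustration; they are not taken from the project data.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def logistic_regression_forward(x, w, b):
    """Single output node: weighted sum of the inputs followed by a sigmoid."""
    return sigmoid(x @ w + b)


# Hypothetical toy data: 3 samples with 4 features each.
x = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 0.0, 0.2],
              [0.0, 0.0, 0.0, 0.0]])
w = np.zeros(4)  # one weight per input feature
b = 0.0          # bias of the single output node

# With all-zero weights every sample gets a probability of 0.5.
probs = logistic_regression_forward(x, w, b)
```

A neural network with a hidden layer simply inserts another set of weighted sums and activations between `x` and this output node.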

I will not go through the details of how neural networks work, but if you want to know more, you can take a look at the post I wrote previously on implementing a neural network from scratch with Python. For this post, I won’t implement it from scratch but will use a library called Keras. Keras is more of a wrapper, which can be run on top of other libraries such as Theano or TensorFlow. It is one of the easiest libraries to use, with an intuitive syntax, yet it is powerful. If you are a newbie to neural network modelling like me, I think Keras is a good place to start.

ANN with Tfidf vectorizer

The best performing Tfidf vectors I got had 100,000 features including up to trigrams, used with logistic regression. Validation accuracy was 82.91%, while training set accuracy was 84.19%. I want to see whether a neural network can boost the performance of my existing Tfidf vectors.

I will start by loading the required dependencies. In order to run Keras with the TensorFlow backend, you need to install both TensorFlow and Keras.

The NN model below has 100,000 nodes in the input layer, then 64 nodes in a hidden layer with the ReLU activation function applied, and finally one output node with the sigmoid activation function applied. There are different optimisation techniques for neural networks, and different loss functions you can define with the model. The model below uses the Adam optimiser and binary cross-entropy loss.
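A minimal sketch of such a model could look like the following. It assumes the Keras API bundled with TensorFlow (`tensorflow.keras`), and the helper name `build_ann` is made up for this example; the notebook's own code may differ in its imports and layer arguments.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_ann(input_dim=100000, hidden_units=64):
    """Hypothetical helper: one ReLU hidden layer and one sigmoid output node."""
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),                # one input node per Tfidf feature
        layers.Dense(hidden_units, activation="relu"),  # hidden layer
        layers.Dense(1, activation="sigmoid"),          # probability of the positive class
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_ann()
model.summary()  # prints layer shapes and parameter counts
```

Almost all of the parameters sit between the input and the hidden layer (100,000 × 64 weights plus 64 biases), which is why training on 100,000 sparse features is slow.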

Adam is an optimisation algorithm for updating the parameters and minimising the cost of the neural network, and it has proven to be very effective. It combines two methods of optimisation: RMSProp and Momentum. Again, I will focus on sharing the results I got from my implementation, but if you want to understand properly how Adam works, I strongly recommend the “deeplearning.ai” course by Andrew Ng. He explains the complex concepts of neural networks in a very intuitive way. If you want more in-depth material on the topic, you can also take a look at the research paper “Adam: A Method for Stochastic Optimization” by Kingma & Ba (2014).
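For readers who prefer code to prose, here is a rough NumPy sketch of a single Adam update as described in the Kingma & Ba paper. The function name `adam_step` and the toy objective f(w) = w² are assumptions made for illustration only; Keras handles all of this internally.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """One Adam update. `t` is the 1-based step count used for bias correction."""
    m = beta_1 * m + (1 - beta_1) * grad       # Momentum part: running mean of gradients
    v = beta_2 * v + (1 - beta_2) * grad ** 2  # RMSProp part: running mean of squared gradients
    m_hat = m / (1 - beta_1 ** t)              # correct the bias from starting at zero
    v_hat = v / (1 - beta_2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return param, m, v


# Toy problem (hypothetical): minimise f(w) = w**2, whose gradient is 2 * w.
w = np.array([1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
# After these updates w has moved from 1.0 to a value close to the minimum at 0.
```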

Before I feed the data and train the model, I need to deal with one more thing. A Keras NN model cannot handle a sparse matrix directly. The data has to be a dense array or matrix, but the Tfidf vectors of the whole training data (1.5 million entries) won’t fit into my RAM once transformed to a dense array. So I had to define a function that returns an iterable generator object, which can then be fed to the NN model in batches. Note that the output should be a generator object rather than directly returned arrays; this can be achieved by using “yield” instead of “return”.
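The idea can be sketched as follows. This is an illustrative version rather than the exact function from the notebook: the name `batch_generator`, the batch size, and the toy data are all assumptions. It expects a SciPy CSR matrix so that slicing rows is cheap, and only the current batch is ever converted to a dense array.

```python
import numpy as np
from scipy import sparse


def batch_generator(X, y, batch_size):
    """Yield (dense_features, labels) batches from a sparse CSR matrix, forever."""
    n_samples = X.shape[0]
    n_batches = int(np.ceil(n_samples / batch_size))
    while True:  # loop endlessly; the consumer decides how many batches form an epoch
        for i in range(n_batches):
            start = i * batch_size
            end = min(start + batch_size, n_samples)
            # Only this slice is densified, so memory use stays small.
            yield X[start:end].toarray(), y[start:end]


# Hypothetical toy data: 10 samples with 6 sparse features.
X_toy = sparse.random(10, 6, density=0.3, format="csr", random_state=0)
y_toy = np.arange(10) % 2

x_batch, y_batch = next(batch_generator(X_toy, y_toy, batch_size=4))
```

Because the generator never stops on its own, Keras has to be told how many batches make up one epoch (the `steps_per_epoch` argument) when training from it.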

It looks like the model had the best validation accuracy after 2 epochs; after that, it fails to generalise, so validation accuracy slowly decreases while training accuracy increases. But if you remember the result I got from logistic regression (train accuracy: 84.19%, validation accuracy: 82.91%), you can see that the above neural network failed to outperform logistic regression in terms of validation accuracy.

Let’s see if normalising the inputs has any effect on the performance.

Then I redefined the model and refit it with the “x_train_tfidf_norm” I got from the above normaliser.

And the result comes out almost the same as without normalisation. It was at this point that I realised Tfidf is already normalised by the way it is calculated. TF (Term Frequency) in Tfidf is not absolute frequency but relative frequency, and multiplying the relative term frequency by IDF (Inverse Document Frequency) further normalises the value in a cross-document manner.
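One way to check this kind of claim is sketched below, under two assumptions that are not confirmed by the text above: that the vectors come from scikit-learn's `TfidfVectorizer` with default settings, and that the normaliser is a row-wise L2 normaliser such as scikit-learn's `Normalizer`. With its default `norm='l2'`, `TfidfVectorizer` already scales every document vector to unit length, so normalising again should change nothing. The three example sentences are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

# Hypothetical mini-corpus standing in for the tweets.
docs = ["i love this movie",
        "i hate this movie so much",
        "what a great day"]

# Default settings include norm='l2', so each row is scaled to unit length.
tfidf = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(docs)

# Euclidean length of every document vector.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(row_norms)

# Applying an L2 normaliser again should leave the matrix unchanged.
renormed = Normalizer(norm="l2").fit_transform(tfidf)
print(np.allclose(renormed.toarray(), tfidf.toarray()))
```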

Dropout

If the problem of the model is poor generalisation, then there is another thing I can add to the model. Even though the neural network is a very powerful model, sometimes overfitting to the training data can be a problem. Dropout is a technique that addresses this problem. If you are familiar with the concept of ensemble models in machine learning, dropout can be seen in the same vein. According to the research paper “Improving neural networks by preventing co-adaptation of feature detectors” by Hinton et al. (2012), “A good way to reduce the error on the test set is to average the predictions produced by a very large number of different networks. The standard way to do this is to train many separate networks and then to apply each of these networks to the test data, but this is computationally expensive during both training and testing. Random dropout makes it possible to train a huge number of different networks in a reasonable time.”

Dropout simulates training many different networks and averaging them, by randomly omitting hidden nodes with a certain probability throughout the training process. With Keras, this can be implemented easily by adding just one line to your model architecture. Let’s see how the model performance changes with a 20% dropout rate. (*I will gather all the results and present them in a table at the end.)
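In Keras the one line in question is a `Dropout` layer placed after the hidden layer. To show what such a layer does to the hidden activations, here is a small NumPy sketch of the commonly used "inverted dropout" variant; the function name `dropout_forward` and the all-ones activations are assumptions made for illustration.

```python
import numpy as np


def dropout_forward(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training."""
    if not training or rate == 0.0:
        return activations  # at prediction time nothing is dropped
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where a node is kept
    # Scale the kept nodes up so the expected activation stays the same.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((1000, 64))  # hypothetical hidden-layer output for 1000 samples
dropped = dropout_forward(hidden, rate=0.2, rng=rng)

print("fraction of nodes dropped:", np.mean(dropped == 0))
```

Each training batch sees a different random subset of hidden nodes, which is what makes dropout behave like an average over many thinner networks.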

Through 5 epochs, the train set accuracy didn’t get as high as with the model without dropout, but validation accuracy didn’t drop as low as with the previous model. Even though dropout added some generalisation to the model, the validation accuracy is still underperforming compared to the logistic regression result.

Shuffling

There is another method I can try to prevent overfitting. When the data is presented in the same order for every epoch, there’s a possibility that the model learns parameters that also fit the noise of the training data, which eventually leads to overfitting. This can be improved by shuffling the order of the data we feed the model. Below I added shuffling to the batch generator function, tried it with the same model structure, and compared the results.
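A shuffled version of a batch generator might look like the sketch below. As before, this is an illustration under assumptions rather than the notebook's own function: the name `shuffled_batch_generator`, the `seed` parameter, and the toy data are made up. The key point is that a fresh random row order is drawn at the start of every pass over the data, and features and labels are indexed with the same order so they stay aligned.

```python
import numpy as np
from scipy import sparse


def shuffled_batch_generator(X, y, batch_size, seed=None):
    """Yield dense (features, labels) batches in a new random order each epoch."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    while True:
        order = rng.permutation(n_samples)  # reshuffle before every pass
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            # The same index array is used for X and y, keeping rows and labels paired.
            yield X[idx].toarray(), y[idx]


# Hypothetical toy data: row i is filled with the value i and has label i.
X_toy = sparse.csr_matrix(np.repeat(np.arange(10.0)[:, None], 3, axis=1))
y_toy = np.arange(10)

gen = shuffled_batch_generator(X_toy, y_toy, batch_size=4, seed=42)
x_batch, y_batch = next(gen)
```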

The same model with non-shuffled training data had a training accuracy of 87.36% and a validation accuracy of 79.78%. With shuffling, training accuracy decreased to 84.80%, but the validation accuracy after 5 epochs increased to 82.61%. It seems like shuffling did improve the model’s performance on the validation set. Another thing I noticed is that, with or without shuffling and with or without dropout, validation accuracy tends to peak after 2 epochs and gradually decrease afterwards.

I also tried the same model with 20% dropout on shuffled data, this time for only 2 epochs. I will share the result at the end.

Learning Rate

In the “deeplearning.ai” course, Andrew Ng states that the first thing he would try to improve a neural network model is tweaking the learning rate. I decided to follow his advice and try different learning rates with the model. Please note that, except for the learning rate, the parameters ‘beta_1’, ‘beta_2’, and ‘epsilon’ are set to the default values presented in the original paper “Adam: A Method for Stochastic Optimization” by Kingma & Ba (2014).

I tried four different learning rates (0.0005, 0.005, 0.01, 0.1), and none of them outperformed the default learning rate of 0.001.
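As a rough sketch of how such an experiment can be set up, the snippet below builds one model per learning rate with an explicitly configured Adam optimiser. It assumes the `tensorflow.keras` API; the helper name `build_ann_with_lr` is made up, and the small `input_dim` is used only to keep the example light (the post's model has 100,000 inputs). The values 0.9, 0.999 and 1e-8 are the defaults suggested in the Adam paper.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_ann_with_lr(learning_rate, input_dim=100000, hidden_units=64):
    """Hypothetical helper: same architecture, custom Adam learning rate."""
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(hidden_units, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    optimizer = keras.optimizers.Adam(
        learning_rate=learning_rate,
        beta_1=0.9,     # decay rate of the first-moment (Momentum) estimate
        beta_2=0.999,   # decay rate of the second-moment (RMSProp) estimate
        epsilon=1e-8,   # small constant for numerical stability
    )
    model.compile(optimizer=optimizer,
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


# One freshly initialised model per learning rate (0.001 is the default).
learning_rates = [0.0005, 0.001, 0.005, 0.01, 0.1]
models = {lr: build_ann_with_lr(lr, input_dim=1000) for lr in learning_rates}
```

Building a new model for each rate matters: reusing an already trained model would carry its learned weights over into the next run and blur the comparison.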

Increasing Number of Nodes

Maybe I can try to increase the number of hidden nodes and see how it affects the performance. The model below has 128 nodes in the hidden layer.

With 128 hidden nodes, validation accuracy got close to the performance of logistic regression. I could experiment further by increasing the number of hidden layers, but the above 2 epochs took 5 hours to run. Considering that logistic regression took less than a minute to fit, this doesn’t look like an efficient approach, even if the neural network can be improved further.

Below is a table with all the results I got from trying different models above. Please note that I have compared performance at 2 epochs since some of the models only ran for 2 epochs.

Except for ANN_8 (with the learning rate of 0.1), the model performance varies only in the decimal places, and the best model is ANN_9 (with one hidden layer of 128 nodes) at 82.84% validation accuracy.

As a result, in this particular case, neural network models failed to outperform logistic regression. This might be due to the high dimensionality and sparse characteristics of the textual data. I have also found a research paper that compared model performance on high-dimensional data. According to “An Empirical Evaluation of Supervised Learning in High Dimensions” by Caruana et al. (2008), logistic regression performed as well as neural networks, and in some cases outperformed them.

Through all the trials above I learned some valuable lessons. Implementing and tuning neural networks is a highly iterative process that involves a lot of trial and error. Even though a neural network is a more complex version of logistic regression, it doesn’t always outperform logistic regression, and sometimes, with high-dimensional sparse data, logistic regression can deliver good performance with much less computation time than a neural network.

In the next post, I will implement a neural network with the Doc2Vec vectors I got from the previous post. Hopefully, with dense vectors such as Doc2Vec, the neural network will show some boost. Fingers crossed.

As always, thank you for reading. You can find the whole Jupyter Notebook from the link below.

https://github.com/tthustla/twitter_sentiment_analysis_part9/blob/master/Capstone_part4-Copy7.ipynb