
Choosing Neural Networks over N-Gram Models for Natural Language Processing

Today we will look at the strengths of using Recurrent Neural Networks, Gated Recurrent Units, and LSTMs over N-Gram Models with an example…

Photo by Josh Riemer on Unsplash

Why Neural Networks? Traditional learning models require lots of space and memory!

Traditional learning models transform text from one language to another. To produce an accurate translation, you compute the probability of a sentence using an N-gram language model. One limitation of this approach is capturing dependencies between words that are far apart, which requires a large corpus and therefore more storage and RAM. Recurrent Neural Networks (RNNs) and Gated Recurrent Units (GRUs) are much more efficient for machine translation than N-grams because they incorporate past information when predicting the next word or the sentiment of a corpus of text. While neural networks may be a more suitable option for your analysis, traditional learning models are still effective and will at least offer a starting point for your work.
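
To make the memory issue concrete, here is a minimal sketch of a bigram (N=2) language model. The toy corpus and the simple maximum-likelihood estimate are just placeholders; in a real model, the count tables grow quickly with vocabulary size and N, which is exactly the space problem described above.

from collections import defaultdict

# Toy corpus; a real N-gram model needs a far larger one,
# and its count tables grow with vocabulary size and N.
corpus = [
    "the market closed higher today",
    "the market closed lower today",
]

unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for w in tokens:
        unigram_counts[w] += 1
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram_counts[(w1, w2)] += 1

def bigram_prob(w2, w1):
    """P(w2 | w1) using a simple maximum-likelihood estimate."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0

# P("closed" | "market") = 2/2 = 1.0 in this tiny corpus
print(bigram_prob("closed", "market"))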

Vanilla Recurrent Neural Networks (RNN)

RNNs can capture dependencies that traditional language models miss. An RNN propagates information from the beginning of a sentence all the way to its end.

Basic RNN Structure (Image from Author)

As shown above, the most recent information has a higher impact at a given step of the RNN than older information. The parameters updated during training of an RNN are Wx, Wh, and Wyh.

Advantages

  • Propagate information between sequences
  • Computations share the same parameters

The Math

The first calculation in an RNN is the hidden state, which passes a matrix multiplication through an activation function.

Hidden State Calculation (Image from Author)
Order of Computation in an RNN (Image from Author)

After the hidden state for the next time step is calculated, the state is then multiplied by another parameter matrix Wyh and passed through an activation function, leading to a prediction for that time step.

Prediction Calculation (Image from Author)
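
Putting the two equations together, here is a minimal NumPy sketch of one forward step. The weight names (Wx, Wh, Wyh), the tanh hidden activation, the softmax output, and the dimensions are illustrative assumptions, not values from the article.

import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, Wyh, bh, by):
    """One RNN time step: compute the new hidden state and a prediction."""
    # Hidden state: combine the previous state and the current input,
    # then pass the result through a tanh activation.
    h_t = np.tanh(Wh @ h_prev + Wx @ x_t + bh)
    # Prediction: project the hidden state with Wyh and apply softmax.
    logits = Wyh @ h_t + by
    y_t = np.exp(logits) / np.sum(np.exp(logits))
    return h_t, y_t

# Illustrative sizes: 8-dimensional inputs, 16-dimensional hidden state, 4 classes
rng = np.random.default_rng(0)
Wx, Wh, Wyh = rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), rng.normal(size=(4, 16))
bh, by = np.zeros(16), np.zeros(4)

h = np.zeros(16)
for x in rng.normal(size=(5, 8)):   # a sequence of 5 time steps
    h, y = rnn_step(x, h, Wx, Wh, Wyh, bh, by)
print(y)  # prediction at the final time step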

Cost Function

RNNs can utilize the cross-entropy loss for training.

Cross-Entropy Loss (Image from Author)

The cross-entropy loss for an RNN has a slight variation in its calculation.

RNN Cross-Entropy Loss (Image from Author)
  • K → number of classes
  • T → number of time steps
  • The loss measures the difference between the actual and predicted y values at each time step.
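
As a rough illustration, here is how the time-averaged cross-entropy could be computed in NumPy, assuming one-hot targets and predicted probabilities of shape (T, K).

import numpy as np

def rnn_cross_entropy(y_true, y_pred):
    """Cross-entropy summed over K classes, averaged over T time steps.

    y_true: one-hot targets, shape (T, K)
    y_pred: predicted probabilities, shape (T, K)
    """
    T = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred + 1e-12)) / T

# Three time steps, two classes
y_true = np.array([[1, 0], [0, 1], [1, 0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
print(rnn_cross_entropy(y_true, y_pred))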

Gated Recurrent Units

GRUs differ from RNNs in that they learn to keep relevant information (parts of speech, for example) over time. Long-term memory is not lost if it is relevant to the problem being solved. This matters because in a vanilla RNN, long-term information begins to vanish over time, leading to vanishing gradients. In essence, additional gates are added to the RNN architecture: a relevance gate and an update gate. The update gate determines how much of the previous information should be kept or updated. GRUs are much more mathematically involved, so one downside to their use is higher memory and computational cost.

The Math

As stated, GRUs are much more math-intensive than Vanilla RNNs.

Gated Recurrent Unit (Image from Author)

The two sigmoid activation units are the relevance and update gates.

Relevance and Update Equations (Image from Author)

The weight parameters for these gates will change over time to determine what information should be kept, updated, and passed from the unit.

Hidden State Candidate (Image from Author)

The hidden state candidate takes information from the relevance gate and hidden state for its calculations, and is ultimately passed through a tanh activation function.

Hidden State (Image from Author)

The hidden state for a GRU is updated based on the information passed by the update gate.

Prediction (Image from Author)

The final pass through the last activation function is a prediction made by the GRU.
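
Here is a minimal NumPy sketch of a single GRU step following the gates above. The weight shapes and names are illustrative, and what this article calls the relevance gate is often labeled the reset gate elsewhere.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wu, Wh, br, bu, bh):
    """One GRU time step with relevance (reset) and update gates."""
    xh = np.concatenate([h_prev, x_t])          # [h_{t-1}; x_t]
    r = sigmoid(Wr @ xh + br)                   # relevance (reset) gate
    u = sigmoid(Wu @ xh + bu)                   # update gate
    # Hidden state candidate: the relevance gate scales the previous state.
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)
    # New hidden state: blend the previous state and the candidate via the update gate.
    h_t = (1 - u) * h_prev + u * h_cand
    return h_t

# Illustrative sizes: 8-dimensional input, 16-dimensional hidden state
rng = np.random.default_rng(1)
Wr, Wu, Wh = [rng.normal(size=(16, 24)) for _ in range(3)]
br, bu, bh = np.zeros(16), np.zeros(16), np.zeros(16)
h = np.zeros(16)
for x in rng.normal(size=(5, 8)):
    h = gru_step(x, h, Wr, Wu, Wh, br, bu, bh)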

Bidirectional RNNs

Imagine that the future flows to the present; this is essentially what a bidirectional RNN does. Information flows in both directions, and the two directions are independent of each other. A bidirectional RNN is an acyclic graph, so the computations in one direction do not depend on the computations in the other.
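
In Keras, this is typically done by wrapping a recurrent layer in a Bidirectional wrapper. A minimal sketch is shown below; the layer sizes are placeholders, not values from this article's example.

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

# One copy of the recurrent layer reads the sequence forwards and an
# independent copy reads it backwards; their outputs are concatenated.
model = Sequential([
    Embedding(input_dim=10000, output_dim=100),
    Bidirectional(LSTM(units=100)),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])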

LSTM Cell (Image from Author)

One architecture commonly used in bidirectional RNNs is the Long Short-Term Memory (LSTM) network. The biggest change in the LSTM cell is that the state is split into two vectors: one represents long-term memory while the other represents short-term memory. As shown by the cell above, there are many inputs and intermediate calculations in the LSTM cell: h(t) carries the cell's short-term information, while c(t) carries its long-term information.

The long-term information first passes through a forget gate, which drops some memory, and then gains new information through the additive gate. After the new memories are added and the result passes through the tanh activation function, the short-term memories, h(t), are created.

LSTM Equations (Image from Author)

Shown above are the different calculations made by the LSTM cell during training. The main takeaway is that the LSTM will update the different weight matrices W to identify the relevant long-term and short-term memory for your given problem and data.
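
For intuition, here is a minimal NumPy sketch of a single LSTM step with forget, input, and output gates. The weight shapes and names are illustrative, not the exact notation used in the equations above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM time step: update long-term (c) and short-term (h) state."""
    xh = np.concatenate([h_prev, x_t])      # [h_{t-1}; x_t]
    f = sigmoid(Wf @ xh + bf)               # forget gate: what long-term memory to drop
    i = sigmoid(Wi @ xh + bi)               # input gate: what new information to add
    c_cand = np.tanh(Wc @ xh + bc)          # candidate memory
    c_t = f * c_prev + i * c_cand           # additive update of long-term memory
    o = sigmoid(Wo @ xh + bo)               # output gate
    h_t = o * np.tanh(c_t)                  # short-term memory / cell output
    return h_t, c_t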

Example

I wanted to provide a quick coding example using the Financial Sentiment Analysis dataset from Kaggle. I trained an LSTM network to identify whether the sentiment of financial queries was positive or negative. Today's code will help you recreate and train the model, but be advised that training an LSTM model can take a long time due to the complexity and number of calculations the model makes.

First, let’s import the libraries and data.

from keras.layers import Input, Embedding, GlobalMaxPooling1D, Dense, Dropout, LSTM, SpatialDropout1D
from keras.models import Model, Sequential
from keras.preprocessing.text import Tokenizer
from keras.optimizers import Adam
from keras.losses import BinaryCrossentropy
from keras.utils import pad_sequences
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import re

df = pd.read_csv('finance_data.csv')
df.head()
Dataset Examples (Image from Author)

As you can see, the dataset is a collection of entries discussing different areas of the financial sector. In the "Sentiment" column, we have "positive", "negative", and "neutral" labels, which we will map to 0 or 1 (negative becomes 0; neutral and positive become 1). Additionally, we want to preprocess our sentences and turn them into NumPy arrays.

df['Sentence'] = df['Sentence'].apply(lambda x: x.lower())
df['Sentence'] = df['Sentence'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))
mapping = {'negative': 0,
           'neutral': 1,
           'positive': 1}
df['Sentiment'] = df['Sentiment'].map(mapping)
X = df['Sentence'].to_numpy()  # turn the sentences into NumPy arrays
y = df['Sentiment'].values  # target sentiment values

Next, we will create the training and test sets using an 80/20 split.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Next, we want to tokenize our sentences. Tokenization is the operation of breaking each sentence down to its individual words, where each word is a token.

vocab_size = 10000
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train) #Fit the tokenizer on the training data
#We will then fit onto each of the sequences
X_train = tokenizer.texts_to_sequences(X_train)
print(X_train[0])
[2, 100, 3, 2, 138, 12, 326, 9, 259, 29]

After we have transformed our sequences, we will pad them so they are all the same shape.

seq_length = 100
X_train = pad_sequences(X_train, maxlen=seq_length, padding='post', truncating='post')
#Now do the same for the test data 
X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=seq_length, padding='post', truncating='post')

Let’s create our classification model. The model first uses an embedding layer, which sets the size of the embeddings for the sequence vectors (in this case, the FinTech summaries). For simplicity, I used one LSTM layer, but two or more could also work (further hyperparameter tuning as well as k-fold cross-validation is recommended here). Finally, a sigmoid activation function is used to predict the output as 0 or 1.

embed_dim = 100
batch_size = 32
model = Sequential()
model.add(Embedding(vocab_size, embed_dim, input_length=seq_length))
model.add(LSTM(units=100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

The last step for us is to train and execute the model.

history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_test, y_test))

Training this model will take a long time, but after training, you will have a model that can predict the sentiment of any FinTech-related summary/review/corpus.
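
Assuming the tokenizer and model trained above, scoring a new piece of text might look like this (the example sentence is made up):

new_text = ["the company reported strong quarterly earnings"]  # made-up example
seq = tokenizer.texts_to_sequences(new_text)
seq = pad_sequences(seq, maxlen=seq_length, padding='post', truncating='post')
prob = model.predict(seq)[0][0]
print("positive" if prob >= 0.5 else "negative", prob)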

Conclusion

While N-gram models are a great tool for predicting the next word in a sentence, neural networks are a much more powerful tool since they can preserve long-term information, and they should not be overlooked when conducting an NLP analysis. Not only can neural networks handle the same next-word predictions as N-grams, but today’s example also showed how simple it is to create an LSTM network in Python and apply it to your data!

If you enjoyed today’s reading, PLEASE give me a follow and let me know if there is another topic you would like me to explore (This really helps me out more than you can imagine)! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!

Sources

-GeÌron, Aureìlien. Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. 2nd ed., O’Reilly, 2019.

Financial Sentiment Analysis

-This dataset is CC0 (public domain) and is allowed for public use.
