
Gradient Descent Based Word Vectorization for Natural Language Processing

The simpler alternative to Google's Word2Vec

Photo by Raphael Schaller on Unsplash

Sentiment analysis is an essential tool for many different tasks, from predicting stock market movement based on tweets to inferring a customer’s intent in order to generate automatic responses. Google’s Word2Vec works well, but it has one large problem.

It requires a massive dataset. When Google trained the Word2Vec network, it used thousands of documents that it had special access to. Finding, normalizing, and then using enough data of that quality would be a nightmare, making it impractical to use Word2Vec in my own projects.

After thinking for a while, I formulated a technique to convert words into vectors, using a completely different concept from the word-grouping method that Google used.

Concept:

Let’s work back from our end goal: converting a word into a vector. Getting a vector as the direct output of the program is difficult, because it would mean training the system on two variables of equal weightage (the two components of the vector). So we start by making the final output a single value. This value can still be converted into a vector: the first component is -1 or 1 (representing positive or negative sentiment) and the second is the value’s magnitude (representing the strength of the sentiment).

If we generate a value for each word, we can use gradient descent to change this value, so as to calculate the correct sentiment every time.

How is a propagation executed? Simple! Multiply together the values of every word in a tweet, then pass the product through a sigmoid. This yields a value between 0 and 1, with 0 being negative and 1 being positive.
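As a minimal sketch of that forward pass (the per-word values here are made-up placeholders, not trained ones):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical per-word values; in the real program these are learned.
word_values = {'stock': 0.8, 'great': 2.5, 'crash': -3.0}

def forward(tweet_words):
    # Multiply the values of every word in the tweet...
    product = np.prod([word_values[w] for w in tweet_words])
    # ...then squash the product into (0, 1) with a sigmoid.
    return sigmoid(product)

print(forward(['stock', 'great']))  # close to 1 -> positive
print(forward(['stock', 'crash']))  # close to 0 -> negative
```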

The Code:

Step 1| Prerequisites:

import os
from pandas import read_csv
import string
import numpy as np

These libraries are necessary for the program to work.

Step 2| Access Dataset:

os.chdir(r'XXXXXX')
csv = read_csv('stock_data.csv')
csv

Change the XXXXXX to the directory in which the dataset is stored. You can get the stock sentiment dataset from this link.

Step 3| Prepare Dataset:

X = csv['Text'].values
y = csv['Sentiment'].values
np.unique(y)
X[5]

Extracting the X and y values is simple, as the dataset already stores them in this form.

Step 4| Clean Dataset:

# Map the -1 labels to 0 so the targets match the sigmoid's (0, 1) range.
for i in range(len(y)):
    if y[i] != 1:
        y[i] = 0
new_X = []
for tweet in X:
    try:
        words = tweet.split()
        # Drop tickers (all-caps), links, hashtags and non-alphabetic tokens.
        words = [word for word in words
                 if not (word.isupper() or 'https' in word
                         or word[0] == '#' or not word.isalpha())]
        new_X.append([word.lower() for word in words])
    except AttributeError:
        pass  # skip rows whose text is missing (e.g. NaN)
X = new_X
flatten = lambda t: [item for sublist in t for item in sublist]
all_words = flatten(X)
unique = list(np.unique(all_words))  # np.unique already returns a sorted array
vectors = np.random.randn(len(unique), 1)

I have removed links, hashtags, and company names (ticker symbols) from the dataset, to prevent the model from picking up sentiment only from how well a particular company is doing at the moment.

I also need to generate a list of unique words, so that the vectors can be assigned by index.
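The index-based lookup works like this (toy vocabulary and random values, purely for illustration):

```python
import numpy as np

np.random.seed(0)  # for a reproducible sketch
unique = sorted(['market', 'up', 'down'])  # toy vocabulary
vectors = np.random.randn(len(unique), 1)  # one value per word

# A word's value is found through its position in the sorted vocabulary.
index = unique.index('market')
print(index, vectors[index])
```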

Step 5| Vectorization, Propagation and Training:

def sigmoid(x):
    return 1/(1+np.exp(-x))
def sigmoid_p(x):
    return sigmoid(x)*(1 - sigmoid(x))

def predict_sentiment(tweet):
    sentiment = 1
    for word in tweet:
        index = unique.index(word)
        sentiment *= vectors[index]
    return sigmoid(sentiment)
def adjust_vectors(pred_sentiment,true_sentiment,tweet):
    # loss = (true - pred)^2, so dloss/dpred = -2 * (true - pred)
    dloss_dpred = -2*(true_sentiment-pred_sentiment)
    vectors_iq = []
    vectors_index = []
    for word in tweet:
        index = unique.index(word)
        vectors_iq.append(vectors[index])
        vectors_index.append(index)
    product = np.prod(vectors_iq)
    # d(product)/d(vec_i) = product / vec_i, chained through the sigmoid
    for i in range(len(vectors_index)):
        dloss_dvec = dloss_dpred * sigmoid_p(product) * product / vectors_iq[i]
        vectors[vectors_index[i]] -= dloss_dvec * 0.1
    return vectors
for epoch in range(100):
    print('EPOCH',str(epoch+1))
    for i in range(len(X)):
        pred_sentiment = predict_sentiment(X[i])
        vectors = adjust_vectors(pred_sentiment,y[i],X[i])

Basically, by calculating the gradient of the loss with respect to each word’s value, given the other words in the tweet, I can adjust the vectors in the right direction, improving accuracy when predicting the sentiment of a tweet.
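To see why the update rule divides the product by each word’s value, here is a quick finite-difference check of that gradient (toy values, nothing from the trained model):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_p(x):
    return sigmoid(x) * (1 - sigmoid(x))

vals = np.array([0.5, -1.2, 2.0])  # made-up word values
product = np.prod(vals)

# Analytic derivative of sigmoid(prod(vals)) w.r.t. vals[0]:
# d(product)/d(vals[0]) = product / vals[0], chained through the sigmoid.
analytic = sigmoid_p(product) * product / vals[0]

# Numerical check of the same derivative.
eps = 1e-6
shifted = vals.copy()
shifted[0] += eps
numeric = (sigmoid(np.prod(shifted)) - sigmoid(product)) / eps

print(analytic, numeric)  # the two values agree closely
```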

Step 6| Observing Vectors:

import random
from matplotlib import pyplot as plt
num = 5
for i in range(num):
    random_num = random.randint(0,len(vectors)-1)
    vec = vectors[random_num]
    # y axis: positive/negative sign; x axis: severity (magnitude)
    if vec < 0:
        vec_y = -1
    else:
        vec_y = 1
    vec_X = vec/vec_y
    word = unique[random_num]
    plt.plot(vec_X,vec_y,'o')
    plt.annotate(word,(vec_X,vec_y))
plt.show()

This program lets us see the severity and sentiment of the vectors, to observe the different conclusions the program is drawing. Playing around with it, observing the results, and spotting obvious faults, then adjusting how the dataset is cleaned and normalized, will improve results.

Conclusion:

If you are still not convinced that vectors are a good way to evaluate words, consider this property: vectors have a magnitude, which can be calculated using the Pythagorean theorem. All the vectors that we have looked at are relative to the origin.

If we consider the X axis to represent the severity of the sentiment, and the y axis the positivity/negativity, the origin is completely neutral. Calculating the magnitude of a vector therefore tells us how far the opinion deviates from the origin, that is, how extreme the opinion is.

Here is the function that calculates this:

def calculate_magnitude(vec):
    if vec < 0:
        vec_y = -1
    else:
        vec_y = 1
    vec_X = vec/vec_y
    sum_value = vec_X**2 + vec_y**2
    return np.sqrt(sum_value)
calculate_magnitude(vectors[100])

My links:

If you want to see more of my content, click this link.

