Using Word Embeddings to Identify Company Names and Stock Tickers

Brian Ward
Towards Data Science
8 min read · Jul 13, 2021


Introduction

Project Goal: use word embeddings to identify company names and stock tickers in natural text.

Assumption: stock tickers and company names are used in similar contexts in natural text such as a Reddit post or a tweet.

Under this assumption, word embeddings should be a good fit for identifying these target words, since word embeddings are trained on the contexts in which words appear.

Plan:

  1. Train a Word2Vec model on stock market related Reddit posts.
  2. Create a representative vector that stands in for our target words (more below).
  3. Use representative vector to identify target words from Reddit posts.

In this post, I will skip describing what word embeddings are and how the Word2Vec algorithm works. I have written a much more detailed paper on the same project, which can be found here; in it I explain what word embeddings are, how the Word2Vec algorithm works, and how sentiment analysis can be done via Naive Bayes. In this post, I will just present the code in an abridged format.

The full repo can be found here.

Data Import

Two different data sources were used. Both are collections of Reddit posts from the subreddit r/wallstreetbets that I found on Kaggle.

In this first step, we import the data and extract every sentence. From one dataset we pull the Reddit post titles, and from the other we pull the body text and split it on ‘.’ to separate sentences. Note: Gensim’s Word2Vec models are trained on sentences.
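A minimal sketch of this import step is shown below; the file and column names ("title", "body") are placeholders for whatever the two Kaggle datasets actually use:

    import pandas as pd

    # Placeholder file and column names -- substitute whatever the two Kaggle datasets use.
    titles_df = pd.read_csv("wsb_titles.csv")
    bodies_df = pd.read_csv("wsb_bodies.csv")

    # Dataset 1: each post title is treated as a single sentence.
    sentences = titles_df["title"].dropna().tolist()

    # Dataset 2: split each post body on '.' to separate sentences.
    for body in bodies_df["body"].dropna():
        sentences.extend(s.strip() for s in body.split(".") if s.strip())

    print(f"collected {len(sentences)} raw sentences")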

String Processing

The next step is to pre-process the text before we use it to train the model. We will do the following (a sketch of these steps follows the list):

  • lowercase
  • remove punctuation, numbers, and emojis
  • clean whitespace
  • tokenize
  • find bigrams
  • find trigrams
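Here is one way these pre-processing steps could look using Gensim's Phrases/Phraser utilities; the min_count and threshold values are illustrative rather than the exact settings used in the project:

    import re
    import string
    from gensim.models.phrases import Phrases, Phraser

    def clean_sentence(text):
        """Lowercase, drop punctuation/numbers/emojis, normalize whitespace, tokenize."""
        text = text.lower()
        text = text.encode("ascii", "ignore").decode()                   # emojis / non-ASCII
        text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)   # punctuation
        text = re.sub(r"\d+", " ", text)                                 # numbers
        return re.sub(r"\s+", " ", text).strip().split()                 # whitespace + tokenize

    tokenized = [clean_sentence(s) for s in sentences]
    tokenized = [t for t in tokenized if t]   # drop sentences that became empty

    # Detect bigrams, then trigrams, so names like "home depot" become "home_depot".
    bigram = Phraser(Phrases(tokenized, min_count=25, threshold=10))
    trigram = Phraser(Phrases(bigram[tokenized], min_count=25, threshold=10))
    processed_sentences = [trigram[bigram[t]] for t in tokenized]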

Now let’s take a look at one of our processed sentences:

Excellent, we now have roughly 1.2 million sentences to train our model.

Model Training

Now we are ready to train our Word2Vec model. Text8 is a data dump from Wikipedia that is great for normalizing the word vectors; I would encourage you to compare a model with and without this extra training data to see how much it helps. More information on text8 can be found here.
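A minimal training sketch, assuming we simply concatenate the processed Reddit sentences with the text8 corpus loaded through gensim.downloader; the hyperparameters shown are just one configuration from the parameter analysis below:

    import gensim.downloader as api
    from gensim.models import Word2Vec

    # Append the text8 Wikipedia dump to the Reddit sentences before training.
    corpus = processed_sentences + list(api.load("text8"))

    model = Word2Vec(
        sentences=corpus,
        vector_size=100,   # dimensionality of the embeddings
        window=1,          # context words on each side of the target word
        min_count=25,      # ignore words that appear fewer than 25 times
        workers=4,
    )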

Now we can use Gensim’s most_similar() function to look at the most similar words in the vocabulary to a target word such as ‘gme’.
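With Gensim 4.x the lookup is a one-liner:

    # Ten nearest neighbours of 'gme' in the learned embedding space.
    for word, score in model.wv.most_similar("gme", topn=10):
        print(f"{word:20s} {score:.3f}")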

Perfect, so we can see that at least the top 10 most similar words are also companies or stock tickers. This gives us hope that our assumption was correct.

Creating a Representative Vector

We need a representative vector that we can use to compare to other words in the vocabulary. We can use this vector and some similarity threshold to identify target words in the posts.

Unfortunately, Gensim doesn’t have a method for creating an averaged vector from other vectors, so I pulled some of the code out of Gensim’s most_similar() function to get the functionality I was looking for. I also created a cosine similarity method, which we will use to compare any two vectors. Cosine similarity is a way to compare two non-zero vectors (two identical vectors have a cosine similarity of 1.0).
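A sketch of those two helpers; mean_vector mimics what most_similar() does internally by averaging unit-normalized word vectors:

    import numpy as np

    def cosine_similarity(v1, v2):
        """Cosine similarity of two non-zero vectors; identical directions give 1.0."""
        return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

    def mean_vector(model, words):
        """Average the unit-normalized vectors of the given words, skipping any word
        that is not in the model's vocabulary."""
        vectors = [model.wv.get_vector(w, norm=True) for w in words if w in model.wv]
        return np.mean(vectors, axis=0)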

Awesome, now we can use it to create our own vector from some hand-picked target words.
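For example, seeding the representative vector with a handful of well-known tickers (the seed list here is purely illustrative):

    # Purely illustrative seed words -- any well-known tickers/companies in the vocabulary work.
    seed_words = ["gme", "amc", "tsla", "aapl", "pltr"]
    rep_vector = mean_vector(model, seed_words)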

Now let’s go ahead and use the vector that we created and take a look at the most similar vectors in our model’s vocabulary. Note: the numbers are the cosine similarity of the term and the representative vector.
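Ranking the vocabulary against the representative vector could look like this:

    # Score every word in the vocabulary against the representative vector.
    scored = sorted(
        ((w, cosine_similarity(rep_vector, model.wv.get_vector(w, norm=True)))
         for w in model.wv.index_to_key),
        key=lambda pair: pair[1],
        reverse=True,
    )
    for word, score in scored[:20]:
        print(f"{word:20s} {score:.3f}")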

I’m pretty happy with the results here. If you’re following along, explore the full top 500; almost all of them appear to be stock tickers or company names. We can also take a look at the distribution of the similarity between our vector and each word in the vocabulary.
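The distribution plot can be reproduced with a few lines of matplotlib, reusing the scored list from above:

    import matplotlib.pyplot as plt

    similarities = [score for _, score in scored]
    plt.hist(similarities, bins=100)
    plt.axvline(x=0.55, color="red", linestyle="--", label="threshold = 0.55")
    plt.xlabel("cosine similarity to representative vector")
    plt.ylabel("number of words")
    plt.legend()
    plt.show()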

I added a vertical line at 0.55, which is the similarity threshold that I ended up choosing; any word to the right of that line will be identified as a target. Based on this figure alone, it looks like the threshold should probably be greater than 0.55, but this choice was the result of the parameter analysis below and would hopefully increase with a more robust ground truth. More on this below.

Model Testing

Now we need some way to test our model. I randomly selected 1000 Reddit posts then hand-extracted any company names or stock tickers. Here’s what it looks like:

Now let’s go ahead and test it. There are a few different ways to score this; I chose a simple method of averaging a missed score and an over score.
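A minimal sketch of such a score, assuming both the prediction and the ground truth are sets of terms; the exact weighting used in the project may differ:

    def score_extraction(predicted, expected):
        """Average a 'missed' score (share of expected terms that were found) and an
        'over' score (share of predicted terms that were actually expected)."""
        predicted, expected = set(predicted), set(expected)
        if not expected:
            return 1.0 if not predicted else 0.0
        hits = len(predicted & expected)
        missed_score = hits / len(expected)
        over_score = hits / len(predicted) if predicted else 0.0
        return (missed_score + over_score) / 2

    # e.g. the model found 'gme' and 'nok', but the post was labelled 'gme' and 'amc'
    print(score_extraction({"gme", "nok"}, {"gme", "amc"}))   # 0.5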

Great, let’s put it to work.

Parameter Analysis

Gensim's Word2Vec model has a few parameters that I was interested in testing systematically. I chose to focus on 4 different parameters:

  1. Similarity threshold: [0.5, 0.55, 0.6, 0.65]
  2. N-gram: [single, bigrams, trigrams]
  3. Window size: [1, 2, 3, 4, 5]
  4. Vector size: [100, 200, 300, 400]

Note: more on these parameters below.

We can iterate through all of these parameters and test against our ground truth.
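The sweep itself is just a nested loop; train_model() and evaluate() below are hypothetical helpers wrapping the training and ground-truth scoring steps shown elsewhere in this post:

    from itertools import product

    ngram_settings = ["single", "bigrams", "trigrams"]
    window_sizes = [1, 2, 3, 4, 5]
    vector_sizes = [100, 200, 300, 400]
    thresholds = [0.5, 0.55, 0.6, 0.65]

    results = []
    # 3 * 5 * 4 = 60 model trainings; each trained model is scored at all four thresholds.
    for ngram, window, vec_size in product(ngram_settings, window_sizes, vector_sizes):
        model = train_model(ngram=ngram, window=window, vector_size=vec_size)
        for threshold in thresholds:
            results.append((ngram, window, vec_size, threshold, evaluate(model, threshold)))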

This obviously takes a long time, as we end up having to train the model 60 times (3 n-gram settings × 5 window sizes × 4 vector sizes; the four thresholds reuse each trained model). I just let it run overnight. Let’s take a look at the results.

Now we can consider the parameters individually to see if we find any trends.

Similarity Threshold

The similarity threshold is the threshold we use to determine whether or not a word is extracted:

if cosine_similarity(word_vector, rep_vector) >= similarity_threshold:
    extract(word)
else:
    continue

After some trial and error, I chose to consider four different thresholds for these tests. I also chose to include the graphs for each of these four thresholds as I compare the other parameters, since the parameters might have different effects at different thresholds.

N-gram

Although n-gram detection is not a parameter of the Word2Vec model, it is an important step in pre-processing the data fed into the model and can have a large effect on the task we are trying to achieve.

We see that the trends are not the same for each threshold. I chose to continue with bigrams as it is both the middle ground and it allows us to keep two-word company names together (e.g. ‘home_depot’).

Window size

Window size is a parameter of the Word2Vec model and specifies the number of words to the left and right of the target word that are considered context words (more detail can be found in the project description). This can have a large effect on the embeddings, especially with less structured text like Reddit posts.

Vector Size

Vector size is another parameter of the Word2Vec model: it is the length, or dimensionality, of the word embeddings we are creating (more detail can be found in the project description). Choosing the right vector size really needs to be done by testing; we will consider four different options.

After determining the best parameters [n-gram=bigrams, window=1, vector-size=100, threshold=0.55], we were able to achieve an overall accuracy score of 89% on our ground truth. I was pretty happy with that result given this simple parameter analysis. Now we can go ahead and use the model to extract companies.

Extracting Companies and Stock Tickers

The extraction function is pretty simple: we basically just need to compare each word to our representative vector. The reason it is wrapped in a try statement is that the model’s vocabulary won’t contain every possible term (the Word2Vec model was trained with min_count=25).
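A sketch of that extraction function, reusing the cosine_similarity helper and representative vector from earlier; the default threshold of 0.6 matches the value used in the next paragraph:

    def extract_targets(tokens, threshold=0.6):
        """Return every token whose embedding is close enough to the representative vector."""
        found = []
        for word in tokens:
            try:
                vec = model.wv.get_vector(word, norm=True)
            except KeyError:
                # the word was never seen, or fell below min_count=25, so it has no embedding
                continue
            if cosine_similarity(rep_vector, vec) >= threshold:
                found.append(word)
        return found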

When applied to our original dataset of Reddit posts with a threshold of 0.6, this method was able to extract companies from nearly 300k posts. At this same threshold, 800 unique companies/tickers were extracted. Here’s a subset of some of the extracted companies/tickers and their counts.
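A simple way to produce counts like these is with collections.Counter; counting each term at most once per sentence is an assumption on my part rather than necessarily how the original counts were produced:

    from collections import Counter

    ticker_counts = Counter()
    for tokens in processed_sentences:
        ticker_counts.update(set(extract_targets(tokens, threshold=0.6)))

    print(ticker_counts.most_common(20))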

Next Steps

The first thing I would like to note is that the principal factor in this project, the representative vector, was somewhat glossed over. That’s because, in the time I had for this project, I didn’t think I could implement a systematic way of creating this vector. I think there is A LOT that could be done to make this algorithm much more accurate. The first step would be to create a much larger ground truth, ideally pulling from Twitter and other forms of social media as well. With a more robust ground truth to test against, we could probably find some interesting ways to learn a better representative vector. A couple of ideas:

  • Simple method: get a subset of the company names/tickers and systematically test different combinations against the ground truth.
  • More advanced method: try to implement some sort of gradient descent that can learn a vector that will perform better against our, now awesome, ground truth.

I also think a lot more could be done with the processing of the strings. In this implementation, I chose to remove numbers and emojis, which I think throws out a lot of rich information, especially the emojis. I’m sure our target words have a contextual relationship to both numbers and emojis. If I get around to exploring any of these ideas, I will certainly update this write-up.

Wrap Up

Please let me know if you have any suggestions, questions, or comments; I am a student and am still learning these tools myself. In the full project, I also use Naive Bayes to determine sentiment for the different companies/tickers. You can read more on that in the full write-up here, and the full repo is here. Thanks for reading.
