
Machine Learning and Veganism

Analyzing the Vegan conversation on Twitter using NLP and Topic Modeling

Photo by Matthias Zomer on Pexels.com

Intro 📄

This blog post corresponds to my third individual project at Metis. The project is an exploration of text data using Natural Language Processing (NLP) and unsupervised learning. Specifically, I used Tweets containing the keyword "vegan" from 2016–2020 to understand how the vegan conversation has evolved over the past five years. In this blog post, I will discuss the following topics:

  • Scraping Tweets with the Twitter API and snscrape
  • NLP: NLTK, Count Vectorizing, Tokenizing, and Stemming
  • Topic Modeling: LDA and NMF
  • Visualizations: t-SNE and Seaborn

Background 🔍

I’ve been on a plant-based diet for over a year, and while homing in on an NLP project, I thought it would be fun to look into the vegan conversation on Twitter. People primarily go vegan for three reasons: their health, the environment, and the animals. So, I was hoping to see these categories pop out after performing topic modeling. Let’s get into it…

Collecting Tweets: Twitter API vs. snscrape 🐦

Initially, I was collecting Tweets with the Twitter API. It was pretty fun to explore the streaming functionality within the API (collecting Tweets based on a specific keyword as they are tweeted). However, I needed archived Tweets for my project, and this is where the Twitter API fell flat. With the most basic, free access, I was only able to collect information within the past week, and there was a limit on the number of Tweets I could collect.

For my analysis, I wanted a lot more information, so snscrape came to the rescue. There’s no need for an access token or API key, and it’s easy to access public Tweets from any time period.

After installing snscrape, it was just a matter of manipulating one line of code to collect what I needed. This is an example of what I used:

snscrape --jsonl --max-results 8000 twitter-search "#vegan since:2019-06-01 until:2019-08-01" > Data/t2w2019.txt
  • snscrape – calls the package
  • --jsonl – returns the output as JSONL (includes extra information like location, retweet status, retweet counts, etc.)
  • --max-results 8000 – specifies how many Tweets to return
  • twitter-search – specifies how to search for Tweets (use twitter-hashtag to search hashtags instead)
  • "#vegan since:2019-06-01 until:2019-08-01" – the search query itself, so I was looking for the keyword "vegan" between these specific dates
  • > Data/t2w2019.txt – the path after the angle bracket specifies where to save the output

I modified this query to collect Tweets from different years, then followed up with a function I defined, txt_to_df(), to establish a pipeline that converts these files into a DataFrame with only the specific information I cared about (a rough sketch is below). You can dive into the code here.
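
For reference, here is a rough, hypothetical sketch of what such a txt_to_df() helper could look like. The exact field names depend on the snscrape version, and the real function in the repo may keep different columns:

import json
import pandas as pd

def txt_to_df(path):
    # Read a snscrape JSONL dump and keep only the fields of interest
    records = []
    with open(path) as f:
        for line in f:
            tweet = json.loads(line)
            records.append({
                "date": tweet["date"],
                "content": tweet["content"],            # tweet text ("rawContent" in newer snscrape versions)
                "username": tweet["user"]["username"],
                "retweets": tweet["retweetCount"],
            })
    return pd.DataFrame(records)

df_2019 = txt_to_df("Data/t2w2019.txt")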

Natural Language Processing (NLP) 📚

The NLP process involved a lot of tweaking, but there were some basic concepts that really helped me dive into the project. I’ll uncover some tips as I talk about my process and what I learned.

First, there are a few different libraries that make NLP possible: NLTK, spaCy, Gensim, and a few others. For this project, I utilized NLTK, so if this specific library is of interest to you, keep reading. Also, my first tip is to create an NLP pipeline class to save time when refining the model. Here is the one I used in my project with the help of the Metis instructors.

To start, you need a corpus, which is just a collection of documents (the term for your individual texts; my documents were Tweets). The NLTK library has a lot of capabilities, with tokenization being one of its most important functionalities for my purposes. Tokenization splits a document into tokens (these could be words, sentences, bigrams, etc.), which are then fed into a vectorizer that creates a vector for each document containing information about its tokens. The simplest vectorizer is the Count Vectorizer, which turns a document into a vector of the counts of each token within that document. The result is a sparse matrix, a matrix containing mostly zeros. If you’ve developed an NLP project before, you probably know all of these terms, but when I started they were foreign to me, and having a basic description really helped.
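
To make those terms concrete, here is a minimal, self-contained sketch (not the project’s code) that tokenizes two toy Tweets and count-vectorizes them into a sparse document-term matrix:

from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Trying a new vegan burger today #vegan",
        "Vegan meal prep for the week #vegan #mealprep"]

# Count Vectorizer with a Tweet-aware tokenizer: one vector of token counts per document
cv = CountVectorizer(tokenizer=TweetTokenizer().tokenize)
doc_term = cv.fit_transform(docs)       # sparse matrix, mostly zeros

print(cv.get_feature_names_out())       # the vocabulary (tokens)
print(doc_term.toarray())               # token counts per document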

Let’s move on to the pipeline. A pipeline is nice because there are various tokenizers and stemmers to process documents, and customizing a cleaning function based on the documents is important as well. Quick side note: a stemmer helps collapse similar words into a single token (e.g. running, runs, and run would all be reduced to their stem: run).
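
As a quick illustration of stemming with NLTK’s SnowballStemmer (the stemmer I ended up using):

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
print([stemmer.stem(w) for w in ["running", "runs", "run"]])
# ['run', 'run', 'run']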

It is important to note that documents and desired output can vary from project to project, so no particular tokenizer, stemmer, or cleaning function is perfect for everything. However, I found the Tweet Tokenizer about five days into my project and it helped out a lot.

So, what ended up working for me? My goal was to create topics amongst the Tweets, so my vectorizer was a TF-IDF vectorizer (term frequency-inverse document frequency). TF-IDF ranks tokens that occur in many documents lower, removing importance from universal terms. The Tweet Tokenizer was also a big help, since it understands the structure of Tweets and tokenizes them more sensibly. Lastly, it was important to customize the stop words, which are words like ‘the’ and ‘or’ that carry no meaning on their own. I added ‘vegan’ to them, since that term is redundant when all of the Tweets are about veganism, and some additional exploration led me to add more words, which helped the topic modeling.

Here is a snip of my code that utilizes the NLPPipe class that I created:

nlp = NLPPipe(
    vectorizer=TfidfVectorizer(stop_words=stopwords, max_df=0.80, min_df=10),
    tokenizer=TweetTokenizer().tokenize,
    stemmer=SnowballStemmer("english"),
    cleaning_function=tweet_clean1,
)

Another tip: within the vectorizer, you can use max_df and min_df to have it ignore tokens that appear in more documents than max_df allows or in fewer documents than min_df requires.
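
For reference, here is a hypothetical, stripped-down sketch of what an NLPPipe class like this can look like. The real class in the repo has more to it, and the cleaning function signature (one cleaned string per raw document) is an assumption on my part:

class NLPPipe:
    def __init__(self, vectorizer, tokenizer, stemmer, cleaning_function):
        self.vectorizer = vectorizer
        self.tokenizer = tokenizer
        self.stemmer = stemmer
        self.cleaning_function = cleaning_function

    def fit_transform(self, docs):
        # Clean, tokenize, and stem each raw document, then fit the vectorizer
        # and return the document-term matrix
        cleaned = [self.cleaning_function(doc, self.tokenizer, self.stemmer) for doc in docs]
        return self.vectorizer.fit_transform(cleaned)

    def transform(self, docs):
        cleaned = [self.cleaning_function(doc, self.tokenizer, self.stemmer) for doc in docs]
        return self.vectorizer.transform(cleaned)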

Topic Modeling 🐖

Topic modeling was a lot of fun within this project. I compared LDA and NMF but ended up sticking with NMF for my final topics. LDA (Latent Dirichlet Allocation), on a very basic level, assigns topics to words and then returns the probability of a document belonging to a specific topic based on the words within that document. There is a very nice package called pyLDAvis that makes visualizing LDA topics extremely simple. However, LDA is known to work better on longer documents, so I opted for NMF.
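
If you do want to try the LDA route, a minimal sketch with scikit-learn and pyLDAvis looks roughly like this. The toy corpus is just for illustration, and the pyLDAvis module name varies across versions:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.sklearn   # renamed pyLDAvis.lda_model in newer releases

docs = ["vegan burger and fries for lunch",
        "animal rights march this weekend",
        "new plant based burger at the grocery store",
        "go vegan for the animals"]

cv = CountVectorizer(stop_words="english")
dtm = cv.fit_transform(docs)                        # LDA works on raw counts, not TF-IDF

lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(dtm)

panel = pyLDAvis.sklearn.prepare(lda, dtm, cv)      # interactive topic visualization
pyLDAvis.save_html(panel, "lda_topics.html")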

NMF (non-negative matrix factorization) uses linear algebra and matrix multiplication to reduce the features of the document-term matrix and create topics. The image below outlines how SVD (singular value decomposition) works, a closely related factorization that illustrates the same idea; NMF adds the constraint that all of the factor entries are non-negative.

Image by Author

When applying this to the project, I was able to specify how many topics I wanted to create, and NMF would provide a document-topic matrix of weights indicating how strongly each document is associated with each topic. This is where the human element is very important: one must look at the words within each topic to decide what that topic is about.
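
Here is roughly what that step looks like with scikit-learn’s NMF, assuming doc_term is the TF-IDF matrix from the pipeline above (e.g. doc_term = nlp.fit_transform(tweets)); the variable names here are mine, not the repo’s:

import numpy as np
from sklearn.decomposition import NMF

nmf = NMF(n_components=23, random_state=42)
doc_topic = nmf.fit_transform(doc_term)             # W: documents x topics (weights)
topic_term = nmf.components_                        # H: topics x terms

# Inspect the highest-weighted terms in each topic to name it (the human element)
terms = np.array(nlp.vectorizer.get_feature_names_out())
for i, topic in enumerate(topic_term):
    top_terms = terms[topic.argsort()[::-1][:10]]
    print(f"Topic {i}: {', '.join(top_terms)}")

labels = doc_topic.argmax(axis=1)                   # hard topic assignment per Tweet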

After looking at my topics, I realized I had to fine-tune my pipeline. I added stop words that appeared across multiple topics without adding meaning, I changed the number of NMF topics to see how they shifted as I added more, and I even changed my cleaning function to remove mentions completely so I wouldn’t have Twitter usernames in my topics. I also used my domain knowledge to recognize that the words ‘impossible’ and ‘beyond’ refer to vegan food companies and not the everyday English meanings of those words.

In the end, 23 topics came out that made sense, which I further bucketed into broader categories.

Image by Author

Visualizations and Results 📈

With all of my documents categorized into topics, it was time to draw some conclusions. For topic modeling, it is very common to use t-SNE to visualize the results. t-SNE is an algorithm that makes it easy to visualize multidimensional data in two dimensions. It isn’t dimensionality reduction in the usual sense, because the algorithm tries to emphasize separations in the data rather than preserve its overall structure, but it’s a great way to create basic visuals for higher-dimensional data.

The t-SNE step was the most computationally expensive part of the project, but it provided an x and y coordinate for each document to allow for easy plots. Below is the progression of my corpus over time, visualized with t-SNE and plotted with seaborn. Take a look at the bottom right corner and you will see the ‘Fake’ Foods and Animal Activism topics consuming the conversation in 2020.

Image by Author
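
For anyone who wants to reproduce this kind of plot, a minimal sketch (using the doc_topic matrix and labels from the NMF step above) looks something like:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

coords = TSNE(n_components=2, random_state=42).fit_transform(doc_topic)

plot_df = pd.DataFrame({"x": coords[:, 0], "y": coords[:, 1], "topic": labels})
sns.scatterplot(data=plot_df, x="x", y="y", hue="topic", s=10, linewidth=0)
plt.title("Tweets in t-SNE space, colored by topic")
plt.show()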

Using my data, I also calculated the percentage of documents in each topic over time, which emphasized the point from the previous GIF through a line chart.

Image by Author
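
The calculation itself is a simple group-by, assuming a DataFrame with one row per Tweet and ‘year’ and ‘topic’ columns (again, the names here are illustrative):

import seaborn as sns

# Share of each year's Tweets that falls into each topic
topic_share = (df.groupby("year")["topic"]
                 .value_counts(normalize=True)
                 .rename("share")
                 .reset_index())

sns.lineplot(data=topic_share, x="year", y="share", hue="topic")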

I thought it was interesting how vegans started talking more about animal activism along with ‘fake’ foods, although the ‘fake’ food trend itself wasn’t too surprising on its own.

Conclusion ✏️

Out of all of my projects so far, this was definitely the most exploratory. There was no R² value to beat or recall metric to increase; it was simply a project about learning from text. It’s interesting that no ‘environmental’ topic emerged during the analysis, especially since the world is so concerned about the environment, but I guess the Twitter vegans are more concerned about the animals and the food they’re eating.

Anyway, if you are learning about NLP or starting a project of your own, I hope this blog helped.

Some key takeaways:

  • Python’s open-source community has a solution for you (I wish I had known about the Tweet Tokenizer earlier)
  • Create an NLP pipeline class (it will save time in the long run)
  • Don’t be afraid to waste time during an exploratory project; that’s part of the fun

Check out the GitHub repository for all of the code behind the project. I try to make my notebooks easy to follow, but I’ll admit these notebooks aren’t as clean as in my other projects.


As I’m writing this, I only have one week left at the Metis data science bootcamp. It’s been a whirlwind of information, and I’m hoping to expand upon that foundation and keep you all informed along the way. Feel free to check out the blog post on my previous project, which is about a Kaggle competition and credit card fraud. Also, come back for more updates as I continue to traverse the world of data science.

Feel free to reach out: LinkedIn | Twitter

