Lessons learned from a data science project on meme popularity

Take-home messages from mentoring a remote data science team analyzing COVID-19 memes. Dank or Not?

Nóra Tenk
Towards Data Science


Dank or Not? A good question in the world of memes: viral internet content, full of humor reflecting current trends, that spreads across the online world. A group of talented students at AIT Budapest chose this topic as the core of their data science project just before the pandemic outbreak.

I joined them as a mentor to help with the data science questions, and found myself in an exciting research project that grew from a university assignment into a research article, published just a few days ago¹.

As you can imagine, we learned a lot from the project. So I would like to share some general, data science-related takeaways that can come in handy for anyone interested in data science or working on similar projects. Let’s get started!

Image by author based on Unsplash.

Good data — good project

In the past few years, memes have become a social phenomenon, especially during the COVID-19 pandemic, when people interact online more than ever. Memes can draw attention to cultural and political themes and can express what the public is noticing most. They have become an interesting research topic, not just from a data science perspective but also for network science and computational social science. With the group, we aimed to investigate the relationship between a meme’s popularity and its content using modern machine learning techniques.

First, we needed quality data, and we had to understand how to utilize our features to answer the targeted questions.

My students decided to investigate memes from Reddit, the largest social news and entertainment site². Although some datasets are already available for data scientists to explore the world of memes, for example on Kaggle³, I recommended that we scrape Reddit ourselves, both to get all the metadata we wanted and to build a larger dataset. We scraped our data from March 17th to March 23rd, 2020, right at the beginning of the global coronavirus outbreak. After the first data cleaning steps, we had 80,362 records of image-based memes with a rich set of metadata, such as the number of upvotes, the subreddit in which the meme was posted, the title, and so on.
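To make this concrete, here is a minimal scraping sketch using the PRAW library. This is an assumption for illustration: the post does not specify our exact tooling, and the credentials, subreddit list, and collected fields below are placeholders.

```python
# Sketch of a Reddit scraper built on PRAW (https://praw.readthedocs.io/).
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",        # register an app at reddit.com/prefs/apps
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="meme-popularity-research",
)

records = []
for name in ["memes", "dankmemes"]:    # illustrative meme subreddits
    subreddit = reddit.subreddit(name)
    for post in subreddit.new(limit=1000):
        records.append({
            "id": post.id,
            "title": post.title,
            "upvotes": post.score,
            "subreddit": name,
            "subscribers": subreddit.subscribers,
            "url": post.url,           # link to the meme image itself
            "created_utc": post.created_utc,
        })
```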

Gathering a good amount of interesting data to analyze was, I think, the first step in growing this project from university group work into a serious research project. The next step on the way to success is business understanding.

Understand your data — the key to every data science project

Having gathered a large number of image-based memes, the next step was to clarify which features express a meme’s popularity.

We had the number of upvotes for every meme, but we could not simply use it as our target variable. Why?

Our image-with-text memes came from the largest meme subreddits, communities devoted to creating and sharing memes. Subreddits with more subscribers tend to produce posts with more upvotes, so upvotes and subscriber counts are positively correlated. To eliminate this network effect, we normalized the number of upvotes by dividing by the number of subscribers of the subreddit where the meme was posted.
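In pandas, the normalization is a one-liner; the toy numbers below are made up just to show the shape of the data:

```python
import pandas as pd

# Toy stand-in for the scraped metadata (column names are illustrative).
df = pd.DataFrame({
    "upvotes":     [54_000, 1_200, 310],
    "subscribers": [14_000_000, 2_000_000, 450_000],
})

# Normalize upvotes by the size of the community the meme was posted in.
df["normalized_upvotes"] = df["upvotes"] / df["subscribers"]
```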

Now that we had our target variable, the next lesson was recognizing that the viral nature of memes on Reddit makes this data well suited to a binary classification task. Just look at the distribution of normalized upvotes.

The distribution of normalized upvotes. Source: Dank or Not? Analyzing and Predicting the Popularity of Memes on Reddit¹

The distribution of normalized upvotes shows that most memes received few upvotes while only a few received many. That gave us the intuition to formulate a binary classification label: dank or not. Measured by normalized upvotes, viral (dank) memes typically differ from non-viral memes by about two orders of magnitude, so we labeled the top 5% of memes as dank and the rest as non-dank.
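Continuing the toy DataFrame from above, the labeling boils down to a quantile threshold:

```python
# Memes in the top 5% of normalized upvotes are dank (1), the rest are not (0).
threshold = df["normalized_upvotes"].quantile(0.95)
df["dank"] = (df["normalized_upvotes"] >= threshold).astype(int)
```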

With that, our data science task was set: use supervised learning algorithms to classify a meme’s popularity based on different feature sets.

Try different models

We dove into a world of diverse techniques: Optical Character Recognition (OCR) to extract text from the images, low-level image features computed with OpenCV in Python, and even higher-level features from Keras’ VGG-16 neural network. It was exciting to see what the text and image content of the most popular memes looked like right at the pandemic outbreak. We even generated a word cloud from all the text in our scraped memes.

Word cloud generated from all text in our scraped memes. Source: Dank or Not? Analyzing and Predicting the Popularity of Memes on Reddit¹
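As an illustration of the feature extraction described above, here is a rough sketch; pytesseract stands in for whichever OCR tool one prefers, and the image statistics are examples rather than our exact feature set:

```python
import cv2
import pytesseract

def extract_features(path):
    """Return OCR text and a few low-level image statistics for one meme."""
    image = cv2.imread(path)                        # BGR pixel array
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return {
        "text": pytesseract.image_to_string(gray),  # the meme's caption text
        "brightness": float(gray.mean()),           # average pixel intensity
        "contrast": float(gray.std()),              # spread of intensities
        "height": image.shape[0],
        "width": image.shape[1],
    }
```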

We investigated whether image or text features have more predictive power for identifying viral memes, using Random Forest and Gradient Boosting models. We also tried transfer learning with a convolutional neural network to predict a meme’s popularity from raw image data.

The next important takeaway, in my view, is that we did not try just one machine learning model; we experimented with several, each on different feature sets. This way, we understood the data better and could draw more general conclusions.
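A sketch of this kind of comparison could look as follows; the feature matrices here are random stand-ins, so real extracted features would replace them:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 1000
y = (rng.random(n) < 0.05).astype(int)         # ~5% dank labels (synthetic)
feature_sets = {
    "image": rng.normal(size=(n, 20)),         # stand-in for image features
    "text": rng.normal(size=(n, 50)),          # stand-in for text features
}
models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}

# Cross-validated AUC for every model / feature-set combination.
for feat_name, X in feature_sets.items():
    for model_name, model in models.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{model_name} on {feat_name} features: mean AUC = {auc:.3f}")
```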

Accuracy vs AUC

It’s always important in any data science project to identify which metrics to use to evaluate the performance of your model. We learned that accuracy is not always what we’re looking for. Why?

Imagine our imbalanced dataset and binary classification task, where 5% of the data is dank (labeled 1) and the rest is not dank (labeled 0). Now envision a dummy model that predicts label 0 for every instance. This model would achieve 95% accuracy! That’s not really what we want, is it?

The AUC score, also known as the area under the ROC curve, measures the performance of binary classifiers. In practice its value falls between 0.5 and 1, where 0.5 corresponds to a random classifier and 1 to a perfect model. For us, it is a very good metric because it also works with imbalanced datasets. Furthermore, it has proven to be a better measure than accuracy⁴.
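A tiny scikit-learn demonstration makes the contrast concrete:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

y = np.array([1] * 50 + [0] * 950)   # 5% dank, 95% not dank
X = np.zeros((1000, 1))              # the features don't matter for this demo

# A dummy model that always predicts the majority class (not dank).
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
print(accuracy_score(y, dummy.predict(X)))             # 0.95 -- looks great
print(roc_auc_score(y, dummy.predict_proba(X)[:, 1]))  # 0.5  -- pure chance
```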

Sampling might be tricky

Handling imbalanced datasets is not easy, especially since Random Forest and Gradient Boosting ensemble classifiers do not perform that well on unbalanced data in their original form⁵ ⁶. We can use models designed for this situation, as we did with the Balanced Random Forest from the imblearn Python package⁷, or we can experiment with sampling techniques.
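A minimal sketch of the balanced model (with random stand-in training data):

```python
import numpy as np
from imblearn.ensemble import BalancedRandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))            # stand-in features
y_train = (rng.random(1000) < 0.05).astype(int)  # ~5% positive labels

# Each tree sees a bootstrap sample in which the majority class is
# undersampled to match the minority class.
clf = BalancedRandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
```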

When we experimented with upsampling, we ran into a common mistake: you should never upsample your dataset before splitting it into train and test sets.

Doing so produced suspiciously “perfect” metrics, and it was hard to figure out what was wrong. When you upsample before splitting, copies of the same instance can fall into both the train and the test sets. The model then scores deceptively well on the test set, because it has already seen some of its instances during training.

It’s also not recommended to mess around with the test set⁸. If we modify its distribution, it won’t reflect our population.

What we learned is that it’s better to apply a stratified split first, to set aside the test set. After that, we can apply any sampling technique to the training set.
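In code, the safe order looks like this, using scikit-learn for the stratified split and imblearn’s RandomOverSampler as one example of upsampling (again with random stand-in data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))               # stand-in features
y = (rng.random(1000) < 0.05).astype(int)     # ~5% positive (dank) labels

# Step 1: stratified split -- the test set keeps the natural class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: upsample the training set only, so no duplicated instance can
# leak across the train/test boundary.
X_train_up, y_train_up = RandomOverSampler(random_state=42).fit_resample(
    X_train, y_train
)
```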

Summary

It was a great experience and challenge to be part of this data science project, which targeted predicting the popularity of memes with different machine learning techniques. Above, I highlighted some general takeaways about data understanding, trying different models, scoring metrics, and sampling, hoping you find them useful too. If you’re interested in the details of the project, I recommend reading our article¹.

Acknowledgements

I’m extremely proud of my group members, Kate Barnes, Tiernon Riesenmy, Minh Duc Trinh, Eli Lleshi, and our group leader, Professor Roland Molontay. You all did an amazing job!

References

[1] Barnes, K., Riesenmy, T., Trinh, M. D. et al. Dank or not? Analyzing and predicting the popularity of memes on Reddit. (2021) Applied Network Science, 6, 21.

[2] https://www.reddit.com/

[3] Goswami, S. Reddit Memes Dataset. (2018) Kaggle. https://www.kaggle.com/sayangoswami/reddit-memes-dataset

[4] Huang, J., & Ling, C. X. Using AUC and accuracy in evaluating learning algorithms. (2005) IEEE Transactions on Knowledge and Data Engineering, 17(3), 299–310. doi:10.1109/tkde.2005.50

[5] Brownlee, J. Bagging and Random Forest for Imbalanced Classification. (2020) Machine Learning Mastery.

[6] Liu, S., Wang, Y., Zhang, J., Chen, C., & Xiang, Y. Addressing the class imbalance problem in Twitter spam detection using ensemble learning. (2017) Computers & Security, 69, 35.

[7] Lemaître, G. et al. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. (2017) Journal of Machine Learning Research, 18(1), 559–563.

[8] Why did sampling boost the performance of my model? (2019) Stack Exchange.


Data Scientist working on industry projects with Python. She teaches courses and loves mentoring good data projects. https://www.linkedin.com/in/noratenk/