Diving into my Tinder Dat(e)a

An exploration of three years of dating app messages with NLP

Photo by Alexander Sinn on Unsplash

Introduction

Valentine’s Day is around the corner, and many of us have romance on the mind. I’ve avoided dating apps recently in the interest of public health, but as I was reflecting on which dataset to dive into next, it occurred to me that Tinder could hook me up (pun intended) with years’ worth of my past personal data. If you’re curious, you can request yours, too, through Tinder’s Download My Data tool.

Not long after submitting my request, I received an e-mail granting access to a zip file with the following contents:

The ‘data.json’ file contained data on purchases and subscriptions, app opens by date, my profile contents, messages I sent, and more. I was most interested in applying natural language processing tools to the analysis of my message data, and that will be the focus of this article.

Structure of the Data

With their many nested dictionaries and lists, JSON files can be tricky to retrieve data from. I read the data into a dictionary with json.load() and assigned the messages to ‘message_data,’ which was a list of dictionaries corresponding to unique matches. Each dictionary contained an anonymized Match ID and a list of all messages sent to the match. Within that list, each message took the form of yet another dictionary, with ‘to,’ ‘from,’ ‘message’, and ‘sent_date’ keys.
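A minimal sketch of that structure, using a miniature inline stand-in for the export (the top-level key name and sample contents here are assumptions for illustration, not my actual data; the real file is read with json.load() on the open file object):

```python
import json

# A miniature stand-in for data.json; key names are assumed here
raw = """
{
  "Messages": [
    {"match_id": "Match 1",
     "messages": [
       {"to": "Match 1", "from": "You",
        "message": "Hi there!", "sent_date": "2019-02-14"}
     ]}
  ]
}
"""

data = json.loads(raw)           # json.load(f) for an actual file object
message_data = data["Messages"]  # one dictionary per unique match

print(message_data[0]["messages"][0]["message"])  # -> Hi there!
```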

Below is an example of a list of messages sent to a single match. While I’d love to share juicy details about this exchange, I must confess that I have no recollection of what I was attempting to say, why I was trying to say it in French, or to whom ‘Match 194’ refers:

Since I was interested in analyzing data from the messages themselves, I created a list of message strings with the following code:

The first block creates a list of all message lists whose length is greater than zero (i.e., the data associated with matches I messaged at least once). The second block indexes each message from each list and appends it to a final ‘messages’ list. I was left with a list of 1,013 message strings.
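Those two blocks can be sketched as follows (the sample `message_data` here is hypothetical; the real list comes from the JSON file):

```python
# Hypothetical sample mirroring the structure described above
message_data = [
    {"match_id": "Match 1", "messages": [
        {"to": "Match 1", "from": "You", "message": "Hey there", "sent_date": "…"},
        {"to": "Match 1", "from": "You", "message": "Free this weekend?", "sent_date": "…"},
    ]},
    {"match_id": "Match 2", "messages": []},  # a match I never messaged
]

# Block 1: keep only the message lists with length greater than zero
nonempty_lists = [d["messages"] for d in message_data if len(d["messages"]) > 0]

# Block 2: index each message and append its text to a flat 'messages' list
messages = []
for message_list in nonempty_lists:
    for message in message_list:
        messages.append(message["message"])

print(len(messages))  # -> 2
```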

Cleaning the Data

To clean the text, I started by creating a list of stopwords – commonly used and uninteresting words like ‘the’ and ‘in’ – using the stopwords corpus from Natural Language Toolkit (NLTK). You’ll notice in the above message example that the data contains HTML code for certain types of punctuation, such as apostrophes and colons. To avoid the interpretation of this code as words in the text, I appended it to the list of stopwords, along with text like ‘gif’ and ‘http.’ I converted all stopwords to lowercase, and used the following function to convert the list of messages to a list of words:

The first block joins the messages together, then substitutes a space for all non-letter characters. The second block reduces words to their ‘lemma’ (dictionary form) and ‘tokenizes’ the text by converting it into a list of words. The third block iterates through the list and appends words to ‘clean_words_list’ if they don’t appear in the list of stopwords.
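Here is a simplified, runnable sketch of that function. To keep it self-contained, NLTK’s stopwords corpus, WordNetLemmatizer, and word_tokenize are stood in for by a hand-rolled stopword set and a plain lowercase split, so lemmatization is omitted, but the three blocks have the same shape:

```python
import re

# Stand-in stopword set: common words plus HTML-entity fragments
# like 'quot' and 'amp', and noise tokens like 'gif' and 'http'
stop_words = {"the", "in", "a", "quot", "amp", "gif", "http"}

def clean_messages(messages, stop_words):
    # Block 1: join the messages, then replace non-letters with spaces
    text = re.sub(r"[^a-zA-Z]", " ", " ".join(messages))

    # Block 2: tokenize into a word list (NLTK would also lemmatize here)
    tokens = text.lower().split()

    # Block 3: keep only words that are not in the stopword list
    clean_words_list = [word for word in tokens if word not in stop_words]
    return clean_words_list

words = clean_messages(["Shall we meet in Madrid?", "Free the weekend?"], stop_words)
print(words)  # -> ['shall', 'we', 'meet', 'madrid', 'free', 'weekend']
```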

Word Cloud

I created a word cloud with the code below to get a visual sense of the most frequent words in my message corpus:

The first block sets the font, background, mask and contour aesthetics. The second block generates the cloud, and the third block adjusts the figure’s size and settings. Here’s the word cloud that was rendered:

The cloud shows a number of the places I have lived – Budapest, Madrid, and Washington, D.C. – as well as plenty of words related to arranging date plans, like ‘free,’ ‘weekend,’ ‘tomorrow,’ and ‘meet.’ Ah, remember the good ol’ pre-plague days when we could casually travel and share a meal with folks we just met online?

You’ll also notice a few Spanish words sprinkled in the cloud. I tried my best to adapt to the local language while living in Spain, with comically inept conversations that were always prefaced with ‘no hablo mucho español.’

Bigrams Barplot

The Collocations module of NLTK allows you to find and score the frequency of bigrams, or pairs of words that appear together in a text. The following function takes in text string data, and returns lists of the top 40 most common bigrams and their frequency scores:
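A sketch of that function using NLTK’s collocations module (the toy word list is hypothetical; the real input is the cleaned word list, and I take the top 40 rather than the top 3 shown here):

```python
import nltk
from nltk.collocations import BigramCollocationFinder

def top_bigrams(words, n):
    # Score every adjacent word pair by its raw frequency in the text
    measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    scored = finder.score_ngrams(measures.raw_freq)  # sorted high to low
    return scored[:n]

words = ["bring", "dog", "bring", "dog", "meet", "tomorrow"]
for bigram, freq in top_bigrams(words, 3):
    print(bigram, round(freq, 3))
```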

I called the function on the cleaned message data and plotted the bigram-frequency pairings in a Plotly Express barplot:

Here again, you’ll see a lot of language related to arranging a meeting and/or moving the conversation off of Tinder. In the pre-pandemic days, I preferred to keep the back-and-forth on dating apps to a minimum, since conversing in person usually provides a better sense of chemistry with a match.

It’s no surprise to me that the bigram (‘bring’, ‘dog’) made it into the top 40. If I’m being honest, the promise of canine companionship has been a major selling point for my ongoing Tinder activity.

Message Sentiment

Finally, I calculated sentiment scores for each message with vaderSentiment, which recognizes four sentiment classes: negative, positive, neutral and compound (a measure of overall sentiment valence). The code below iterates through the list of messages, calculates their polarity scores, and appends the scores for each sentiment class to separate lists.

To visualize the overall distribution of sentiments in the messages, I calculated the sum of scores for each sentiment class and plotted them:
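A sketch of both steps, with `polarity_scores` stubbed out by hypothetical precomputed dictionaries so it runs without the vaderSentiment package (the real code calls `SentimentIntensityAnalyzer().polarity_scores(message)` inside the loop):

```python
# Hypothetical stand-ins for SentimentIntensityAnalyzer().polarity_scores(msg)
scores_per_message = [
    {"neg": 0.0, "neu": 0.8, "pos": 0.2, "compound": 0.4},
    {"neg": 0.1, "neu": 0.9, "pos": 0.0, "compound": -0.2},
]

# Append each class's score to its own list, one entry per message
neg, neu, pos, compound = [], [], [], []
for scores in scores_per_message:
    neg.append(scores["neg"])
    neu.append(scores["neu"])
    pos.append(scores["pos"])
    compound.append(scores["compound"])

# Sum per class to get the values behind the bar plot
sums = {"neg": sum(neg), "neu": sum(neu),
        "pos": sum(pos), "compound": sum(compound)}
print(round(sums["neu"], 2))  # -> 1.7
```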

The bar plot suggests that ‘neutral’ was by far the dominant sentiment of the messages. It should be noted that taking the sum of sentiment scores is a relatively simplistic approach that does not deal with the nuances of individual messages. A handful of messages with an extremely high ‘neutral’ score, for instance, could very well have contributed to the dominance of the class.

It makes sense, nonetheless, that neutrality would outweigh positivity or negativity here: in the early stages of talking to someone, I try to seem polite without getting ahead of myself with especially strong, positive language. The language of making plans – timing, location, and the like – is largely neutral, and seems to be widespread within my message corpus.

Conclusion

If you find yourself without plans this Valentine’s Day, you can spend it exploring your own Tinder data! You might discover interesting trends not only in your sent messages, but also in your usage of the app over time.

To see the full code for this analysis, head over to its GitHub repository.
