My friends gave me their Tinder data…

Jack Ballinger
Towards Data Science
12 min read · Jan 16, 2019


It was Wednesday 3rd October 2018, and I was sitting on the back row of the General Assembly Data Science course. My tutor had just mentioned that each student had to come up with two ideas for data science projects, one of which I’d have to present to the whole class at the end of the course. My mind went completely blank, an effect that being given such free rein to choose almost anything generally has on me. I spent the next couple of days intensively trying to think of a good/interesting project. I work for an investment manager, so my first thought was to go for something investment-related, but then I realised that I already spend 9+ hours at work every day, and I didn’t want my sacred free time to be taken up with work-related stuff too.

A few days later, I received the below message on one of my group WhatsApp chats:

This sparked an idea. What if I could use the data science and machine learning skills learned on the course to increase the likelihood of any particular Tinder conversation being a ‘success’? Thus, my project idea was formed. The next step? Tell my girlfriend…

A few Tinder facts, published by Tinder themselves:

  • the app has around 50m users, 10m of which use the app daily
  • since 2012, there have been over 20bn matches on Tinder
  • a total of 1.6bn swipes occur every day on the app
  • the average user spends 35 minutes PER DAY on the app
  • an estimated 1.5m dates occur PER WEEK due to the app

Problem 1: Getting data

But how would I get data to analyse? For obvious reasons, users’ Tinder conversations, match histories etc. are securely stored so that no one apart from the user can see them. After a bit of googling, I came across this article:

This led me to the realisation that Tinder have now been forced to build a service where you can request your own data from them, as part of data protection regulations. Cue the ‘download data’ button:

Once clicked, you have to wait 2–3 working days before Tinder send you a link from which to download the data file. I eagerly awaited this email, having been an avid Tinder user for about a year and a half prior to my current relationship. I had no idea how I’d feel, browsing back over such a large number of conversations that had eventually (or not so eventually) fizzled out.

After what felt like an age, the email came. The data was (thankfully) in JSON format, so a quick download and load into Python and bosh: access to my entire online dating history.

The Data

The data file is split into 7 different sections:

Of these, only two were really interesting/useful to me:

  • Messages
  • Usage

On further analysis, the “Usage” file contains data on “App Opens”, “Matches”, “Messages Received”, “Messages Sent”, “Swipes Right” and “Swipes Left”, and the “Messages” file contains all messages sent by the user, with time/date stamps and the ID of the person each message was sent to. As I’m sure you can imagine, this led to some rather interesting reading…

Problem 2: Getting more data

Right, I’ve got my own Tinder data, but for any results to avoid being completely statistically insignificant or heavily biased, I needed to get other people’s data too. But how to do this…

Cue a not-insignificant amount of begging.

Miraculously, I managed to persuade 8 of my friends to give me their data. They ranged from seasoned users to sporadic “use when bored” users, which, I felt, gave me a reasonable cross-section of user types. The biggest success? My girlfriend also gave me her data.

Another tricky thing was defining a ‘success’. I settled on a definition: either a number was obtained from the other party, or the two users went on a date. Then, through a combination of asking and analysing, I categorised each conversation as either a success or not.

Problem 3: Now what?

Right, I’ve got more data, but now what? The Data Science course focused on data science and machine learning in Python, so importing the data into Python (I used Anaconda/Jupyter notebooks) and cleaning it seemed like a logical next step. Speak to any data scientist, and they’ll tell you that cleaning data is a) the most tedious part of their job and b) the part of their job that takes up 80% of their time. Cleaning is dull, but it’s also critical to extracting meaningful results from the data.

I created a folder, into which I dropped all 9 data files, then wrote a little script to cycle through these, import them to the environment and add each JSON file to a dictionary, with the keys being each person’s name. I also split the “Usage” data and the message data into two separate dictionaries, so as to make it easier to conduct analysis on each dataset separately.
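That import script might look something like the following sketch. The folder path, function name and the “Usage”/“Messages” keys are my assumptions about the export layout, not the original code:

```python
import json
from pathlib import Path

def load_tinder_exports(data_dir):
    """Load every JSON export in data_dir into dicts keyed by person.

    The "Usage"/"Messages" keys follow the export layout described in
    this article; the function and folder names are my own inventions.
    """
    usage_data, message_data = {}, {}
    for path in sorted(Path(data_dir).glob("*.json")):
        name = path.stem  # person's name taken from the file name
        with open(path) as f:
            data = json.load(f)
        usage_data[name] = data.get("Usage", {})
        message_data[name] = data.get("Messages", [])
    return usage_data, message_data
```

Keeping usage and messages in separate dictionaries from the start makes it easy to analyse either dataset on its own later.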

Problem 4: Different email addresses lead to different datasets

When you sign up for Tinder, the vast majority of people use their Facebook account to log in, but more cautious people just use their email address. Alas, I had one of these people in my dataset, meaning I had two sets of files for them. This was a bit of a pain, but overall not too difficult to deal with.

Having imported the data into dictionaries, I then iterated through the JSON files and extracted each relevant data point into a pandas dataframe, looking something like this:

Usage Data with names removed
Message Data with names removed
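The flattening step, from nested JSON into a tidy pandas dataframe, could be sketched roughly like this. The field names (“to”, “sent_date”, “message”) are guesses at Tinder’s export schema rather than confirmed:

```python
import pandas as pd

def messages_to_frame(message_data):
    """Flatten per-person message exports into one tidy dataframe.

    The field names ("to", "sent_date", "message") are assumptions
    about Tinder's export schema; adjust them to the actual JSON.
    """
    rows = []
    for person, matches in message_data.items():
        for match in matches:
            for msg in match.get("messages", []):
                rows.append({
                    "person": person,
                    "to_id": msg.get("to"),
                    "sent_date": msg.get("sent_date"),
                    "message": msg.get("message"),
                })
    df = pd.DataFrame(rows)
    if not df.empty:
        # parse timestamps once here so later analysis can use .dt accessors
        df["sent_date"] = pd.to_datetime(df["sent_date"])
    return df
```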

Before anyone gets worried about including the id in the above dataframe, Tinder published this article, stating that it is impossible to lookup users unless you’re matched with them:

https://www.help.tinder.com/hc/en-us/articles/115003359366-Can-I-search-for-a-specific-person-on-Tinder-

Now that the data was in a nice format, I managed to produce a few high level summary statistics. The dataset contained:

  • 2 girls
  • 7 guys
  • 9 participants
  • 502 one message conversations
  • 1330 unique conversations
  • 6,344 matches
  • 6,750 messages received
  • 8,755 messages sent
  • 34,233 app opens
  • 94,027 right swipes
  • 403,149 left swipes

Great, I had a decent amount of data, but I hadn’t actually taken the time to think about what an end product would look like. In the end, I decided that an end product would be a list of recommendations on how to improve one’s chances of success with online dating.

And thus, with the data in a nice format, the exploration could begin!

The Exploration

I started off looking at the “Usage” data, one person at a time, purely out of nosiness. I did this by plotting a few charts, ranging from simple aggregated metric plots, such as the below:

to more involved, derived metric plots, such as the aptly-named ‘Loyalty Plot’, shown below:

The first chart is fairly self-explanatory, but the second may need some explaining. Essentially, each row/horizontal line represents a unique conversation, with the start of each line being the date of the first message sent within the conversation, and the end being the date of the last. The idea of this plot was to try to understand how people use the app in terms of messaging more than one person at once.

Whilst interesting, I didn’t really see any obvious trends or patterns that I could interrogate further, so I turned to the aggregate “Usage” data. I initially started looking at various metrics over time split out by user, to try to determine any high level trends:

but nothing immediately stood out.

I then decided to look deeper into the message data, which, as mentioned before, came with a handy time stamp. Having aggregated the count of messages up by day of week and hour of day, I realised that I had stumbled upon my first recommendation.

The First Recommendation:

9pm on a Sunday is the best time to ‘Tinder’, shown below as the time/date at which the largest volume of messages was sent within my sample.

Here, I have used the volume of messages sent as a proxy for number of users online at each time, so ‘Tindering’ at this time will ensure you have the largest audience.
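For anyone wanting to reproduce this, the aggregation is straightforward with pandas, assuming the messages have already been flattened into a dataframe with a datetime “sent_date” column (the function name here is made up):

```python
import pandas as pd

def busiest_slot(df):
    """Return the (weekday, hour) pair with the most sent messages.

    Assumes a dataframe with a datetime "sent_date" column; the
    function name is my own for this sketch.
    """
    day = df["sent_date"].dt.day_name().rename("day")
    hour = df["sent_date"].dt.hour.rename("hour")
    counts = df.groupby([day, hour]).size()
    return counts.idxmax()
```

On my sample this would surface Sunday at 9pm as the peak slot.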

I then started looking at length of message in terms of both words and letters, as well as number of messages per conversation. Initially, you can see below that there wasn’t much that jumped out… (here a ‘success’ is red)

But once you start digging, there are a few clear trends:

  • longer messages are more likely to generate a success (up to a point)
  • on average, a ‘success’ occurs 27 messages into a conversation, with a median of 21

These observations lead to my second and third recommendations.

The Second Recommendation:

Spend more time constructing your messages, and for the love of god don’t use text speak… generally longer words are better words. One caveat here is that the data contains links, which count as long words, so this may skew the results.

The Third Recommendation:

Don’t be too hasty when trying to get a number. ‘hey, ur fit, what’s ur number’ is probably the worst thing you can say in terms of your chances. Equally, don’t leave it too long. Anywhere between your 20th and 30th message is best.

Average message count of successful vs un-successful conversations

Having looked into length of word/message/conversation rather extensively, I then decided to look into sentiment. But I knew absolutely nothing about how to do that. During the course, we’d covered a bit of natural language processing (bag of words, one hot encoding, all the pre-processing required etc. along with various classification algorithms), but hadn’t touched on sentiment. I spent some time researching the topic, and discovered that the nltk sentiment.vader SentimentIntensityAnalyzer would be a pretty good shout.

This works by returning four scores for each piece of input text:

  • positive (the proportion of the text that is positive)
  • neutral (the proportion that is neutral)
  • negative (the proportion that is negative)
  • compound (a normalised combination of the three, between -1 and 1)

Luckily, it also deals with things such as word context, slang and even emojis. As I was looking at sentiment, no pre-processing was done (lower-casing, removal of punctuation etc.) in order to not remove any hidden context.

I started this analysis by feeding each whole conversation into the analyser, but quickly realised this didn’t really work: the conversation sentiment quickly tended to 1 after the first few messages, and I struggle to believe that a conversation of 100 messages was 100% positive the whole time.

I then split the conversations down into their constituent messages and fed them through one at a time, averaging the scores up to conversation level. This produced a much more realistic outcome in my opinion:

Split this data up by ‘Success’ or ‘No Success’, and I quickly saw a pattern emerging:

This teed up my fourth recommendation.

The Fourth Recommendation:

Be positive, but not too positive.

The average sentiment for a successful conversation was 0.31 vs 0.20 for a non-successful conversation. Having said that, being too positive is almost as bad as being too negative.

The final alley I explored was what effect various details about the first message had on the success of the conversation. Initial thoughts of things that could have an effect were:

  • length
  • whether a name was used
  • sentiment
  • presence of emojis
  • explicit content

As expected, the longer the first message, the greater the likelihood that the conversation will continue to a ‘Success’. In fact, you double your probability of success simply by not using a one-word opener, i.e. not just saying ‘hey’ or ‘hi’ or ‘daayyuumm’ (real example).

Somewhat more surprisingly, using a name in the first message had very little effect on the ‘Success Ratio’ (No. Successes/No. No Successes).

First message sentiment turned out to be about 0.09 higher for “Successful” conversations than “Unsuccessful” conversations, which wasn’t really a surprise… if you insult someone in a first message, they’re intuitively less likely to reply.

Analysing emojis was a task I hadn’t really thought about, and had the potential to be tricky. Luckily, a package called ‘emoji’ exists, which automatically picks up the presence of emojis within text. Unfortunately, and much to my dismay, it appears using an emoji in a first message increases one’s probability of obtaining a ‘Success’.

Now onto explicit content… another area with the potential to be quite tricky, as there are no built-in libraries (that I know of) that pick up the use of expletives. Luckily I stumbled upon this:

I can assure you, there are some absolute crackers contained within it.

I then checked which first messages contained a word from this list; 40 did. As is always the case with things like this, I found some interesting edge cases:

FYI this was a bloke talking about his rowing leggings…
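A whole-word check along these lines avoids exactly that kind of false positive. Note the word list below is a tiny placeholder, not the published list used in the analysis:

```python
import re

# Placeholder entries only; the analysis used a much larger
# published list of explicit words.
EXPLICIT_WORDS = {"damn", "hell"}

def is_explicit(message, wordlist=EXPLICIT_WORDS):
    """True if any whole word in the message appears in the wordlist.

    Matching whole words rather than substrings avoids flagging
    innocent words that merely contain an explicit word.
    """
    words = re.findall(r"[a-z']+", message.lower())
    return any(w in wordlist for w in words)
```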

Results? It turns out that none of the first messages containing explicit content led to a ‘Success’.

This led me to my fifth and final recommendation.

The Fifth Recommendation:

When sending a first message:

  • Be positive
  • 8 words is optimal
  • Use an emoji or two
  • Don’t be explicit

SO TO SUM UP

  1. Use Tinder at 9pm on a Sunday for maximum audience
  2. Spend time constructing messages and don’t use text speak
  3. Prepare to ask for a number or a date between the 20th and 30th message
  4. Be positive, but not too positive
  5. Send something other than ‘hey’ as a first message, aim for around 8 words, maybe use an emoji and don’t be explicit

A few pitfalls of the data:

  1. My dataset is a very, very small sample, rendering most insights useless
  2. The dataset is biased towards the type of people I know, as well as being biased towards men
  3. The dataset only contains one side of the conversation
  4. The message and usage stats don’t necessarily line up due to users uninstalling and reinstalling the app
  5. No NLP technique will be perfect due to sarcasm/variations in the way people speak

A few ideas for future work:

  • Gather more data
  • Do more to determine statistically significant results vs observations
  • Look into conversation analysis by topic — what type of messages make up the good and bad sentiment
  • Try to look into sarcasm
  • Investigate other apps (Bumble, Hinge etc.)
  • Some sort of classification analysis if more data was included, as we only had 70ish successes
  • Look more into gender splits if more data was included

A few interesting factoids from the data:

  • Most swipes by a single person in a single day: 8096
  • Guys are more likely to leave a long time (7ish days) before sending a second message
  • Asking a question in a first message actually decreases your chance of a success
  • Women swipe right on average 1% of the time, whereas men do ~50% of the time
  • Per app open, women swipe 3x as many times as men

Further reading:

  • A paper was published called ‘A First Look at User Activity on Tinder’, link here
  • There is a Tinder API, but unfortunately it is only for people using the app rather than giving access to a database of some kind. Anyhow, using it to test certain hypotheses could be interesting.
  • Tinderbox is a piece of software that can learn who you’re attracted to via dimensionality reduction. It also has a chatbot built in if you really want to automate the process…

Thanks for reading, any ideas for future work would be much appreciated!
