Using Machine Learning to find compatible partners with R

A simple question, ‘How do you find a compatible partner?’, is what pushed me to take on this project: finding a compatible partner for any person in a population. The motive behind this blog post is to explain my approach to the problem as clearly as possible.
You can find the project notebook here.
If I asked you to find a partner, what would be your next step? And what if I had asked you to find a compatible partner? Would that change things?
A simple word such as compatible can make things tough, because apparently humans are complex.
The Data
Since we couldn’t find any single dataset covering both personality and interests, we resorted to using three: the Big5 personality dataset, the Interests dataset (also known as the Young-People-Survey dataset) and the Baby-Names dataset.
Big5 personality dataset: we chose the Big5 dataset because it captures an individual’s personality through the Big5/OCEAN personality test, which asks each respondent 50 questions, 10 for each of Openness, Conscientiousness, Extraversion, Agreeableness & Neuroticism, rated on a scale of 1–5. You can read more about Big5 here.
Interests dataset: which covers the interests & hobbies of a person by asking them to rate 50 different areas of interest (such as art, reading, politics, sports etc.) on a scale of 1-5.
Baby-Names dataset: helps in assigning a real and unique name to each respondent.
The project is written in R (version 4.0.0) with the help of the dplyr and cluster packages.
Processing
Loading the Big5 dataset, which has 19k+ observations with 57 variables, including Race, Age, Gender and Country besides the personality questions.
Removing the respondents who did not answer a few of the questions, as well as respondents with implausible age values such as 412434, 223 and 999999999.
Taking a healthy sample of 5000 respondents, since we don’t want the laptop to go on a vacation when we compute Euclidean distances between thousands of observations for clustering 🙂
Loading the Baby-Names dataset and adding 5000 unique, real names, so that each observation is identified as a person rather than just a number.
Loading the Interests dataset, which has 50 variables, each of them an interest or a hobby.
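A minimal sketch of these loading and cleaning steps; the file names, column names, age cutoffs and random seed are illustrative assumptions, not taken from the project notebook:

```r
library(dplyr)

# Load the Big5 responses (file name is an assumption)
big5 <- read.csv("big5.csv", stringsAsFactors = FALSE)

# Drop respondents with unanswered questions or implausible ages
# (the exact age cutoffs here are assumptions)
big5 <- big5 %>%
  na.omit() %>%
  filter(age >= 13, age <= 90)

# Sample 5000 respondents so the distance computations stay manageable
set.seed(42)
big5 <- big5 %>% sample_n(5000)

# Give each respondent a unique real name from the Baby-Names dataset
names_df <- read.csv("baby-names.csv", stringsAsFactors = FALSE)
big5$name <- sample(unique(names_df$name), 5000)

# Load the Interests (Young-People-Survey) responses
interests <- read.csv("interests.csv", stringsAsFactors = FALSE)
```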

After loading all of the datasets, we combine them into one master dataframe and name it train, giving us 107 variables.
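A sketch of that combination step, assuming the interests responses have been aligned to the same 5000 sampled rows:

```r
# Column-bind personality, names and interests into one master dataframe
# (assumes both data frames have the same 5000 rows in matching order)
train <- cbind(big5, interests)

dim(train)   # inspect the combined dataframe: 5000 rows, 107 variables
```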

A few plots to see how our data is laid out in terms of Age and Gender:
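Something along these lines produces those plots in base R (the age and gender column names are assumptions):

```r
# Distribution of respondent ages
hist(train$age, breaks = 30, main = "Age distribution", xlab = "Age")

# Counts per gender
barplot(table(train$gender), main = "Gender distribution", xlab = "Gender")
```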


Principal Component Analysis
Remember the weak correlations we saw in the heatmap? Well, this is where Principal Component Analysis comes in: it combines the effect of several similar variables into a single Principal Component column, or PC.
For those who don’t know what Principal Component Analysis is: PCA is a dimensionality-reduction technique that builds entirely new variables, the Principal Components, as combinations of the original variables, each constructed to capture as much of the data’s variation as possible.
In simple terms, PCA will let us use just a few components that carry the most important, most varying information instead of using all 50 variables. You can learn more about PCA here.
Important: We run PCA on Interests variables and Big5 variables separately, since we don’t want to mix interests & personality.
After running the PCA on the Interest variables, we get 50 PCs. Now here is the fun part: we won’t be using all of them, and here’s why. The first PC is the strongest, i.e. it captures the largest share of the variation in our data; the second is weaker and captures less, and so on down to the 50th PC.
Our objective is to find the sweet spot between using 0 and 50 PCs, and we do that by plotting the variance explained by the PCs:
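A sketch of this step using base R’s prcomp; the positions of the interest columns inside train are an assumption:

```r
# PCA on the 50 interest variables (assumed to be the last 50 columns of train)
interest_cols <- (ncol(train) - 49):ncol(train)
pca_interests <- prcomp(train[, interest_cols], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC
var_explained <- pca_interests$sdev^2 / sum(pca_interests$sdev^2)

# Scree plot and cumulative variance to find the sweet spot
plot(var_explained, type = "b",
     xlab = "Principal Component", ylab = "Variance explained")
plot(cumsum(var_explained), type = "b",
     xlab = "Number of PCs", ylab = "Cumulative variance explained")
```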
The result? We just shrank the number of variables from 50 to 14, and these 14 PCs together explain 60% of the variation in the original Interest variables.
Similarly, we do PCA on Big5 variables:


Now that we have reduced the variables in Interests from 50 to 14, and in Big5 from 50 to 12, we combine them into a dataframe separate from train. We call it pcatrain.
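A sketch of how pcatrain might be assembled under the same assumptions (Big5 questions in the first 50 columns, and the identifying columns, whose names are assumptions, kept alongside the PC scores):

```r
# PCA on the 50 Big5 question columns (positions are an assumption)
pca_big5 <- prcomp(train[, 1:50], center = TRUE, scale. = TRUE)

# Keep the first 14 interest PCs and 12 Big5 PCs, with distinct names
interest_pcs <- as.data.frame(pca_interests$x[, 1:14])
names(interest_pcs) <- paste0("Int_PC", 1:14)

big5_pcs <- as.data.frame(pca_big5$x[, 1:12])
names(big5_pcs) <- paste0("Big5_PC", 1:12)

# pcatrain: the reduced data used for clustering, plus identifying columns
pcatrain <- cbind(train[, c("name", "age", "gender", "country")],
                  interest_pcs, big5_pcs)
```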
Clustering
As good practice, we first use Hierarchical Clustering to find a good value for k (the number of clusters).
Hierarchical Clustering
What is Hierarchical Clustering? Here is an example: think of a house party of 100 people. We start with every single person as their own cluster of one. The next step? We merge the two people/clusters standing closest to each other into one cluster, then merge the next two closest clusters, and so on, until we have gone from 100 clusters down to a single cluster. Hierarchical Clustering forms clusters on the basis of the distance between them, and we can watch that process unfold in a dendrogram.

After running Hierarchical Clustering we can see our own cluster dendrogram. Reading it from bottom to top, we watch the clusters converge: the more distant two clusters are from each other, the longer the vertical join before they merge.
Based on that distance, we draw the red line to cut the tree into a healthy group of 7 diverse clusters. The reason for 7 is that those clusters take the longest to converge, i.e. they are the most distant from one another.
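A sketch of the hierarchical step with base R’s dist and hclust; the complete-linkage method is an assumption:

```r
# Euclidean distances between people in PC space
pc_cols <- grep("_PC", names(pcatrain))
d <- dist(pcatrain[, pc_cols], method = "euclidean")

# Hierarchical clustering and its dendrogram
hc <- hclust(d, method = "complete")
plot(hc, labels = FALSE, hang = -1)

# Draw the cut that yields 7 clusters
rect.hclust(hc, k = 7, border = "red")
```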
K-Means Clustering

We use the Elbow Method with K-Means to confirm that roughly 7 clusters is a good choice. We won’t dive deep into it, but to summarize: we plot the total within-cluster sum of squares against the number of clusters, and the marginal improvement from adding another cluster flattens out at 6, so 6 clusters it is.
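A sketch of the elbow computation, plotting the total within-cluster sum of squares for a range of k:

```r
# Elbow method: total within-cluster sum of squares for k = 1..10
set.seed(42)
wss <- sapply(1:10, function(k) {
  kmeans(pcatrain[, pc_cols], centers = k, nstart = 20)$tot.withinss
})

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
# The marginal drop in WSS flattens noticeably around k = 6
```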
K-Means clustering with 6 clusters
We run K-Means clustering with k = 6, check the size of each cluster, and look at which clusters the first 10 people are assigned to. Finally we add this cluster variable to our pcatrain dataframe, which now has 33 variables.
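A sketch of those three steps:

```r
# Final K-Means with k = 6
set.seed(42)
km <- kmeans(pcatrain[, pc_cols], centers = 6, nstart = 20)

table(km$cluster)      # size of each cluster
head(km$cluster, 10)   # cluster assignments of the first 10 people

# Attach the assignments; pcatrain now carries a cluster column
pcatrain$cluster <- km$cluster
```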
Final steps
Now that we have assigned clusters, we can start finding close matches for any individual.
We select Penni as a random individual, for whom we will find matches from her cluster, i.e. cluster 2.
As the sketch below shows, we first take the people in Penni’s cluster, then keep only those who are in the same country as Penni, of the opposite gender, and in Penni’s age category.
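A sketch of that filter chain in dplyr; the ±5-year band standing in for Penni’s “age category” is an assumption:

```r
# Penni's own row (her cluster and country come from the data)
penni <- pcatrain %>% filter(name == "Penni")

# Candidates: same cluster and country, opposite gender, similar age
candidates <- pcatrain %>%
  filter(cluster == penni$cluster,
         country == penni$country,
         gender  != penni$gender,
         abs(age - penni$age) <= 5)   # age-band width is an assumption
```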
Okay, so now we have our filtered list of people. Is that it?
No. Remember the question we asked in the beginning?
‘How do you find a compatible partner?’
Even though we have found people with the same interests and in the same age group, we must still find the people whose personality is most similar to Penni’s.
This is where the Big5 personality variables come in handy.
Through Big5, we will be able to find people who have the same levels of Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism as Penni.
What we do here is take the difference between Penni’s response and each filtered person’s response on every personality variable, then add up the absolute differences across all variables; the smaller the total, the closer the match.
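A sketch of that ranking, using the Big5 PCs in pcatrain as the personality variables (an assumption; the notebook may use the raw Big5 scores instead):

```r
# Personality distance: sum of absolute differences between Penni
# and each candidate across the Big5 personality PCs
big5_pc_cols <- grep("Big5_PC", names(pcatrain), value = TRUE)

diffs <- sweep(as.matrix(candidates[, big5_pc_cols]), 2,
               unlist(penni[, big5_pc_cols]))   # subtract Penni's scores
candidates$personality_diff <- rowSums(abs(diffs))

# Smallest total difference = closest personality match
candidates %>%
  arrange(personality_diff) %>%
  select(name, personality_diff) %>%
  head()
```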
So now we know: if Penni is looking for a partner, she should first try to meet Brody.
A summary of what we did to find a compatible person for Penni:
- Clustered people on the basis of their interests.
- Found people who have similar interests and belong to the same age group as Penni.
- Ranked those filtered people on the basis of how closely their personality matches Penni’s personality.
Thank you for sticking till the end!