Community detection of survey responses based on Pearson correlation coefficient with Neo4j
Just a few days ago, a new version of the Neo4j graph algorithms plugin was released. With the new release come new algorithms, and the Pearson correlation algorithm is one of them.
To demonstrate how to use the Pearson correlation algorithm in Neo4j, we will use the data from the “Young People Survey” Kaggle dataset made available by Miroslav Sabo. It contains the results of 1,010 completed surveys, with questions ranging from music preferences and hobbies & interests to phobias.
The nice thing about using Pearson correlation in scoring scenarios is that it accounts for voters who are generally inclined to give higher or lower scores, since it compares each score to the voter’s own average score.
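The mean-centering at the heart of Pearson correlation can be sketched in a few lines of plain Python (a toy example, separate from the Neo4j workflow; the ratings are made up):

```python
# A minimal sketch of the Pearson correlation coefficient, showing how each
# score is compared to that rater's own average (mean-centering).
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length score vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Each score is centered on its rater's average before comparison.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    denom = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    return cov / denom

# A "harsh" rater and a "generous" rater with the same relative preferences
# still correlate perfectly:
harsh = [1, 2, 3]
generous = [3, 4, 5]
print(round(pearson(harsh, generous), 6))  # 1.0
```

Because only the deviations from each rater’s own mean matter, two raters with the same taste but different baselines still score as perfectly similar.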
Import
Download the dataset and copy it into the $Neo4j/import folder. Each row in the responses.csv file represents a single survey with 150 questions filled out. We store each survey in Neo4j as a single node.
LOAD CSV WITH HEADERS FROM "file:///responses.csv" as row
CREATE (p:Person)
SET p += row
Preprocessing
Most of the answers range from one to five, where five is defined as “Strongly agree” and one as “Strongly disagree”. They appear as strings in the CSV file, so we have to convert them to integers first.
MATCH (p:Person)
UNWIND keys(p) as key
WITH p,key where not key in ['Gender',
'Left - right handed',
'Lying','Alcohol',
'Education','Smoking',
'House - block of flats',
'Village - town','Punctuality',
'Internet usage']
CALL apoc.create.setProperty(p, key, toInteger(p[key])) YIELD node
RETURN distinct 'done'
Category properties
Some of the answers are categorical. An example is the alcohol question, where possible answers are “never”, “social drinker” and “drink a lot”.
As we would like to convert some of them to vectors, let’s examine all the possible answers they can take.
MATCH (p:Person)
UNWIND ['Gender',
'Left - right handed',
'Lying','Alcohol',
'Education','Smoking',
'House - block of flats',
'Village - town','Punctuality',
'Internet usage'] as property
RETURN property,collect(distinct(p[property])) as unique_values
Results
Let’s vectorize the gender, internet, and alcohol answers. We will scale them between one and five to match the range of the integer answers.
Gender encoding
MATCH (p:Person)
WITH p, CASE p['Gender'] WHEN 'female' THEN 1
WHEN 'male' THEN 5
ELSE 3
END as gender
SET p.Gender_vec = gender
Internet encoding
MATCH (p:Person)
WITH p, CASE p['Internet usage'] WHEN 'no time at all' THEN 1
WHEN 'less than an hour a day' THEN 2
WHEN 'few hours a day' THEN 4
WHEN 'most of the day' THEN 5
ELSE 3 END as internet
SET p.Internet_vec = internet
Alcohol encoding
MATCH (p:Person)
WITH p, CASE p['Alcohol'] WHEN 'never' THEN 1
WHEN 'social drinker' THEN 3
WHEN 'drink a lot' THEN 5
ELSE 3 END as alcohol
SET p.Alcohol_vec = alcohol
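The three CASE expressions above amount to simple lookup tables. A minimal Python sketch of the same idea, with unseen or missing answers falling back to the neutral midpoint 3 (mirroring the ELSE branches):

```python
# Categorical answers mapped onto the one-to-five scale, as plain dict lookups.
# The mapping mirrors the Alcohol CASE expression in the Cypher query above.
ALCOHOL = {"never": 1, "social drinker": 3, "drink a lot": 5}

def encode(answer, mapping, default=3):
    # Unknown or missing answers get the neutral midpoint.
    return mapping.get(answer, default)

print(encode("never", ALCOHOL))        # 1
print(encode("drink a lot", ALCOHOL))  # 5
print(encode(None, ALCOHOL))           # 3
```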
Dimensionality reduction
There are 150 answers in our dataset that we could use as features. This is a great opportunity to perform some basic dimensionality reduction of the features.
I came across an article about dimensionality reduction techniques written by Pulkit Sharma. It describes twelve dimensionality reduction techniques, and in this post, we will use the first two, which are the low variance filter and the high correlation filter.
Low variance filter
Consider a variable in our dataset where all the observations have the same value, say 1. If we use this variable, do you think it can improve the model we will build? The answer is no, because this variable will have zero variance.
We will use the standard deviation metric, which is just the square root of the variance.
MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage']) as all_keys
UNWIND all_keys as key
MATCH (p:Person)
RETURN key,avg(p[key]) as average,stdev(p[key]) as std
ORDER BY std ASC LIMIT 10
Results
We can observe that everybody likes to listen to music, watch movies and have fun with friends.
Due to the low variance, we will eliminate the following questions from our further analysis:
- “Personality”
- “Music”
- “Dreams”
- “Movies”
- “Fun with friends”
- “Comedy”
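The idea behind this filter can be sketched in plain Python. The feature names and answer vectors below are made up for illustration; the real standard deviations come from the Cypher query above.

```python
# A minimal sketch of the low variance filter: drop features whose standard
# deviation (the square root of the variance) falls below a threshold.
from statistics import stdev  # sample standard deviation, like Cypher's stdev()

answers = {
    "Music": [5, 5, 5, 5, 4],  # nearly everyone agrees -> low variance
    "Metal": [1, 5, 2, 4, 3],  # opinions differ -> high variance
}

threshold = 0.8
kept = [name for name, values in answers.items() if stdev(values) >= threshold]
print(kept)  # ['Metal']
```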
High correlation filter
High correlation between two variables means they have similar trends and are likely to carry similar information. This can bring down the performance of some models drastically (linear and logistic regression models, for instance).
We will use the Pearson correlation coefficient for this task. Pearson correlation adjusts for different location and scale of features, so any kind of linear scaling (normalization) is unnecessary.
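The filter itself can be sketched in Python: compute pairwise correlations and keep only one feature from each highly correlated pair. The feature names and values below are made up for illustration.

```python
# A minimal sketch of the high correlation filter: flag feature pairs whose
# absolute Pearson correlation exceeds a cutoff and keep only one of each pair.
from math import sqrt
from itertools import combinations

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

features = {
    "Medicine":  [1, 2, 3, 4, 5],
    "Chemistry": [1, 2, 3, 5, 4],  # nearly identical trend to Medicine
    "Metal":     [5, 1, 4, 2, 3],
}

cutoff = 0.8
dropped = set()
for f1, f2 in combinations(features, 2):
    if f1 in dropped or f2 in dropped:
        continue
    if abs(pearson(features[f1], features[f2])) > cutoff:
        dropped.add(f2)  # keep the first feature, drop the second

print(sorted(dropped))  # ['Chemistry']
```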
Let’s find the top 10 correlations for the gender feature.
MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Comedy']) as all_keys
MATCH (p1:Person)
UNWIND ['Gender_vec'] as key_1
UNWIND all_keys as key_2
WITH key_1,key_2, collect(coalesce(p1[key_1],0)) as vector_1,collect(coalesce(p1[key_2] ,0)) as vector_2
WHERE key_1 <> key_2
RETURN key_1,key_2, algo.similarity.pearson(vector_1, vector_2) as pearson
ORDER BY pearson DESC limit 10
Results
The feature most correlated with gender is weight, which makes sense. The list includes some other stereotypical gender differences, such as the preferences for cars, action, and PC.
Let’s now calculate the Pearson correlation between all the features.
MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Comedy']) as all_keys
MATCH (p1:Person)
UNWIND all_keys as key_1
UNWIND all_keys as key_2
WITH key_1,key_2,p1
WHERE key_1 > key_2
WITH key_1,key_2, collect(coalesce(p1[key_1],0)) as vector_1,collect(coalesce(p1[key_2],0)) as vector_2
RETURN key_1,key_2, algo.similarity.pearson(vector_1, vector_2) as pearson
ORDER BY pearson DESC limit 10
Results
The results show nothing surprising. The only correlation I found interesting was the one between snakes and rats.
Due to high correlation, we will exclude the following questions from further analysis:
- “Medicine”
- “Chemistry”
- “Shopping centres”
- “Physics”
- “Opera”
- “Animated”
Pearson similarity algorithm
Now that we have completed the preprocessing step, we will infer a similarity network between nodes based on the Pearson correlation of the features (answers) of nodes that we haven’t excluded.
In this step, all the features we use in the analysis need to be normalized to the same one-to-five range, because we will now fit all the features of a node into a single vector and calculate correlations between those vectors.
Min-max normalization
Three of the features are not yet on the one-to-five scale. These are:
- “Height”
- “Number of siblings”
- “Weight”
We will normalize the Height property to the one-to-five range and drop the other two.
MATCH (p:Person)
//get the max and min value
WITH max(p.`Height`) as max,min(p.`Height`) as min
MATCH (p1:Person)
//normalize to the one-to-five range
SET p1.Height_nor = 1 + 4.0 * (p1.`Height` - min) / (max - min)
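Min-max normalization onto the one-to-five range can be sketched in Python. The heights below are made-up example values in centimeters.

```python
# A minimal sketch of min-max normalization onto the one-to-five range used
# by the survey answers: the smallest value maps to 1, the largest to 5.
def min_max_to_range(values, lo=1.0, hi=5.0):
    vmin, vmax = min(values), max(values)
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]

heights = [150, 170, 190]
print(min_max_to_range(heights))  # [1.0, 3.0, 5.0]
```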
Similarity network
We grab all the remaining features and infer the similarity network. We always want to use the similarityCutoff parameter, and optionally the topK parameter, to avoid ending up with a complete graph where every node is connected to every other. Here we use similarityCutoff: 0.75 and topK: 5. Find more information in the documentation.
MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Comedy','Medicine','Chemistry','Shopping centres','Physics','Opera','Animated','Height','Weight','Number of siblings']) as all_keys
MATCH (p1:Person)
UNWIND all_keys as key
WITH {item:id(p1), weights: collect(coalesce(p1[key],3))} as personData
WITH collect(personData) as data
CALL algo.similarity.pearson(data, {similarityCutoff: 0.75,topK:5,write:true})
YIELD nodes, similarityPairs
RETURN nodes, similarityPairs
Results
- nodes: 1010
- similarityPairs: 4254
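The effect of the similarityCutoff and topK parameters can be sketched in Python: for each node, keep at most k neighbours, and only those at or above the cutoff. The node names and similarity scores below are made up; the real ones come from algo.similarity.pearson.

```python
# A rough sketch of similarityCutoff and topK pruning on a similarity network.
similarities = {
    "p1": {"p2": 0.91, "p3": 0.80, "p4": 0.40},
    "p2": {"p1": 0.91, "p3": 0.76, "p4": 0.10},
}

def prune(sims, cutoff=0.75, top_k=2):
    pruned = {}
    for node, neighbours in sims.items():
        # Keep only neighbours at or above the cutoff...
        above = [(other, s) for other, s in neighbours.items() if s >= cutoff]
        # ...then at most the top_k most similar of those.
        above.sort(key=lambda pair: pair[1], reverse=True)
        pruned[node] = above[:top_k]
    return pruned

print(prune(similarities))
# {'p1': [('p2', 0.91), ('p3', 0.8)], 'p2': [('p1', 0.91), ('p3', 0.76)]}
```

Without both parameters, a dataset of 1,010 nodes could yield up to half a million similarity pairs; pruning keeps the network sparse enough for community detection to be meaningful.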
Community detection
Now that we have inferred a similarity network in our graph, we will try to find communities of similar persons with the help of the Louvain algorithm.
CALL algo.louvain('Person','SIMILAR')
YIELD nodes,communityCount
Results
- nodes: 1010
- communityCount: 105
Apoc.nodes.group
For a quick overview of the community detection results in Neo4j Browser, we can use the apoc.nodes.group procedure. We define the labels we want to include and group by a certain property. In the config part, we define which aggregations we want to perform and have returned in the visualization. Find more in the documentation.
CALL apoc.nodes.group(['Person'],['community'],
[{`*`:'count', Age:['avg','std'],Alcohol_vec:['avg']}, {`*`:'count'} ])
YIELD nodes, relationships
UNWIND nodes as node
UNWIND relationships as rel
RETURN node, rel;
Results
Community preferences
To get to know our communities better, we will examine their average top and bottom 3 preferences.
MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Height','Number of siblings','Weight','Medicine', 'Chemistry', 'Shopping centres', 'Physics', 'Opera','Age','community','Comedy','Gender_vec','Internet_vec','Height_nor']) as all_keys
MATCH (p1:Person)
UNWIND all_keys as key
WITH p1.community as community,
count(*) as size,
SUM(CASE WHEN p1.Gender = 'male' THEN 1 ELSE 0 END) as males,
key,
avg(p1[key]) as average,
stdev(p1[key]) as std
ORDER BY average DESC
WITH community,
size,
toFloat(males) / size as male_percentage,
collect(key) as all_avg
ORDER BY size DESC limit 10
RETURN community,size,male_percentage,
all_avg[..3] as top_3,
all_avg[-3..] as bottom_3
Results
The results are quite interesting. Just looking at the male percentage, it is safe to say that the communities are split largely along gender lines.
The biggest community consists of 220 women, who strongly agree with “Compassion to animals”, “Romantic” and, interestingly, “Borrowed stuff”, but disagree with “Metal”, “Western” and “Writing”. The second biggest community, mostly male, agrees with “Cheating in school”, “Action” and “PC”. They also disagree with “Writing”. This makes sense, as the survey was filled out by students from Slovakia.
Gephi visualization
Let’s finish off with a nice visualization of our communities in Gephi. You need to have the streaming plugin enabled in Gephi; we can then export the graph from Neo4j using the APOC procedure apoc.gephi.add.
MATCH path = (:Person)-[:SIMILAR]->(:Person)
CALL apoc.gephi.add(null,'workspace1',path,'weight',['community']) YIELD nodes
RETURN distinct 'done'
After a bit of tweaking in Gephi, I came up with this visualization. As with the apoc.nodes.group visualization, we can observe that the biggest communities are quite interconnected.