Community detection of survey responses based on Pearson correlation coefficient with Neo4j

Tomaz Bratanic
Towards Data Science
7 min read · Feb 6, 2019


Just a few days ago, a new version of the Neo4j Graph Algorithms plugin was released. The new release brings new algorithms, and the Pearson correlation algorithm is one of them.

To demonstrate how to use the Pearson correlation algorithm in Neo4j, we will use the data from the “Young People Survey” Kaggle dataset made available by Miroslav Sabo. It contains the results of 1,010 filled-out surveys, with questions ranging from music preferences and hobbies & interests to phobias.

The nice thing about using Pearson correlation in scoring scenarios is that it accounts for voters who are generally inclined to give higher or lower scores, since it compares each score to that voter’s average score.
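As a quick illustration (a minimal sketch, assuming the Graph Algorithms plugin is already installed, since it ships the algo.similarity.pearson function used throughout this post), two respondents whose answers differ only by a constant offset come out perfectly correlated:

// Respondent B's scores are respondent A's scores shifted up by 2.
// Identical relative preferences yield a perfect correlation of 1.0.
RETURN algo.similarity.pearson([1, 2, 3, 4], [3, 4, 5, 6]) as pearson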

Import

Download the dataset and copy it into the $Neo4j/import folder. Each row in the responses.csv file represents a single survey with 150 questions filled out. We store each survey in Neo4j as a single node.

// create one Person node per survey row, copying all columns as properties
LOAD CSV WITH HEADERS FROM "file:///responses.csv" as row
CREATE (p:Person)
SET p += row

Preprocessing

Most of the answers range from one to five, where five means “Strongly agree” and one means “Strongly disagree”. They appear as strings in the CSV file, so we first have to convert them to integers.

MATCH (p:Person)
UNWIND keys(p) as key
// skip the categorical (string) answers
WITH p, key WHERE not key in ['Gender',
'Left - right handed',
'Lying','Alcohol',
'Education','Smoking',
'House - block of flats',
'Village - town','Punctuality',
'Internet usage']
CALL apoc.create.setProperty(p, key, toInteger(p[key])) YIELD node
RETURN distinct 'done'

Categorical properties

Some of the answers are categorical. An example is the alcohol question, where possible answers are “never”, “social drinker” and “drink a lot”.

As we would like to convert some of them to vectors, let’s examine all the possible answers they have.

MATCH (p:Person)
UNWIND ['Gender',
'Left - right handed',
'Lying','Alcohol',
'Education','Smoking',
'House - block of flats',
'Village - town','Punctuality',
'Internet usage'] as property
RETURN property,collect(distinct(p[property])) as unique_values

Results

Let’s vectorize the gender, internet, and alcohol answers. We will scale them between one and five to match the integer answer range.

Gender encoding

MATCH (p:Person)
WITH p, CASE p['Gender'] WHEN 'female' THEN 1
WHEN 'male' THEN 5
ELSE 3
END as gender
SET p.Gender_vec = gender

Internet encoding

// answers outside the listed values fall through to null; later queries coalesce them
MATCH (p:Person)
WITH p, CASE p['Internet usage'] WHEN 'no time at all' THEN 1
WHEN 'less than an hour a day' THEN 2
WHEN 'few hours a day' THEN 4
WHEN 'most of the day' THEN 5
END as internet
SET p.Internet_vec = internet

Alcohol encoding

MATCH (p:Person)
WITH p, CASE p['Alcohol'] WHEN 'never' THEN 1
WHEN 'social drinker' THEN 3
WHEN 'drink a lot' THEN 5
ELSE 3 END as alcohol
SET p.Alcohol_vec = alcohol

Dimensionality reduction

There are 150 answers in our dataset that we could use as features. This is a great opportunity to perform some basic dimensionality reduction.

I came across an article about dimensionality reduction techniques written by Pulkit Sharma. It describes twelve dimensionality reduction techniques, and in this post, we will use the first two, which are the low variance filter and the high correlation filter.

Low variance filter

Consider a variable in our dataset where all the observations have the same value, say 1. If we use this variable, do you think it can improve the model we will build? The answer is no, because this variable will have zero variance. (source)

We will use the standard deviation metric, which is just the square root of the variance.
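For reference, Cypher’s stdev() computes the sample standard deviation (its sibling stdevp() divides by n instead of n - 1):

\mathrm{stdev}(x) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}}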

// use a single node to enumerate the numeric answer keys
MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage']) as all_keys
UNWIND all_keys as key
MATCH (p:Person)
RETURN key, avg(p[key]) as average, stdev(p[key]) as std
ORDER BY std ASC LIMIT 10

Results

We can observe that everybody likes to listen to music, watch movies and have fun with friends.

Due to the low variance, we will eliminate the following questions from our further analysis:

  • “Personality”
  • “Music”
  • “Dreams”
  • “Movies”
  • “Fun with friends”
  • “Comedy”

High correlation filter

High correlation between two variables means they have similar trends and are likely to carry similar information. This can bring down the performance of some models drastically (linear and logistic regression models, for instance). (source)

We will use the Pearson correlation coefficient for this task. Pearson correlation adjusts for differences in the location and scale of features, so no linear scaling (normalization) is needed beforehand.
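For two features x and y with means x̄ and ȳ across n respondents, the coefficient is:

r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}

Subtracting the means removes differences in location, and the denominator normalizes away scale, which is why no prior rescaling is needed.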

Let’s find the top 10 correlations for the gender feature.

MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Comedy']) as all_keys
MATCH (p1:Person)
UNWIND ['Gender_vec'] as key_1
UNWIND all_keys as key_2
// one vector per feature; missing answers default to 0
WITH key_1, key_2, collect(coalesce(p1[key_1],0)) as vector_1, collect(coalesce(p1[key_2],0)) as vector_2
WHERE key_1 <> key_2
RETURN key_1, key_2, algo.similarity.pearson(vector_1, vector_2) as pearson
ORDER BY pearson DESC LIMIT 10

Results

The most correlated feature to gender is weight, which makes sense. The list includes some other stereotypical gender differences, like the preference for cars, action movies, and PCs.

Let’s now calculate the Pearson correlation between all the features.

MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Comedy']) as all_keys
MATCH (p1:Person)
UNWIND all_keys as key_1
UNWIND all_keys as key_2
// key_1 > key_2 keeps each unordered pair of features exactly once
WITH key_1, key_2, p1
WHERE key_1 > key_2
WITH key_1, key_2, collect(coalesce(p1[key_1],0)) as vector_1, collect(coalesce(p1[key_2],0)) as vector_2
RETURN key_1, key_2, algo.similarity.pearson(vector_1, vector_2) as pearson
ORDER BY pearson DESC LIMIT 10

Results

The results show nothing surprising. The only one I found interesting was the correlation between (fear of) snakes and rats.
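To spot-check a single pair directly (assuming the two phobia columns are named Snakes and Rats, as in the Kaggle dataset), we can reuse the same function on two collected vectors:

// Pearson correlation between the fear-of-snakes and fear-of-rats answers
MATCH (p:Person)
WITH collect(coalesce(p.Snakes, 0)) as snakes,
     collect(coalesce(p.Rats, 0)) as rats
RETURN algo.similarity.pearson(snakes, rats) as pearson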

Due to high correlation, we will exclude the following questions from further analysis:

  • “Medicine”
  • “Chemistry”
  • “Shopping centres”
  • “Physics”
  • “Opera”
  • “Animated”

Pearson similarity algorithm

Now that we have completed the preprocessing step, we will infer a similarity network between nodes based on the Pearson correlation of the features (answers) we haven’t excluded.

In this step, all the features we use in our analysis must be normalized between one and five, because we now fit all of a node’s features into a single vector and calculate correlations between those vectors.

Min-max normalization

Three of the features are not normalized between one and five. These are:

  • “Height”
  • “Number of siblings”
  • “Weight”

We normalize the height property between one and five; we won’t use the other two.
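For reference, min-max scaling of a value x into a target range [a, b] is:

x' = a + \frac{(x - \min)(b - a)}{\max - \min}

With a = 1 and b = 5, this yields the expression in the query below.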

MATCH (p:Person)
// get the min and max height values
WITH max(p.`Height`) as max, min(p.`Height`) as min
MATCH (p1:Person)
// min-max scale height into the [1,5] range
SET p1.Height_nor = 1 + 4.0 * (p1.`Height` - min) / (max - min)

Similarity network

We grab all the features and infer the similarity network. We always want to use the similarityCutoff parameter, and optionally the topK parameter, to avoid ending up with a complete graph, where all nodes are connected to each other. Here we use similarityCutoff: 0.75 and topK: 5. Find more information in the documentation.

MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Comedy','Medicine','Chemistry','Shopping centres','Physics','Opera','Animated','Height','Weight','Number of siblings']) as all_keys
MATCH (p1:Person)
UNWIND all_keys as key
// one vector per person; missing answers default to the neutral value 3
WITH {item:id(p1), weights: collect(coalesce(p1[key],3))} as personData
WITH collect(personData) as data
CALL algo.similarity.pearson(data, {similarityCutoff: 0.75, topK: 5, write: true})
YIELD nodes, similarityPairs
RETURN nodes, similarityPairs

Results

  • nodes: 1010
  • similarityPairs: 4254
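Because we ran the procedure with write: true, it created SIMILAR relationships that carry the correlation in a score property (the plugin’s default write property), so we can inspect the strongest pairs directly:

// ten most similar pairs written by the procedure
MATCH (p1:Person)-[s:SIMILAR]->(p2:Person)
RETURN id(p1) as person1, id(p2) as person2, s.score as score
ORDER BY score DESC LIMIT 10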

Community detection

Now that we have inferred a similarity network in our graph, we will try to find communities of similar persons with the help of the Louvain algorithm.

CALL algo.louvain('Person','SIMILAR')
YIELD nodes, communityCount
RETURN nodes, communityCount

Results

  • nodes: 1010
  • communityCount: 105
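Louvain writes each person’s community id to the community property by default, so a quick size breakdown is a single aggregation away:

// ten biggest communities by member count
MATCH (p:Person)
RETURN p.community as community, count(*) as size
ORDER BY size DESC LIMIT 10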

apoc.nodes.group

For a quick overview of the community detection results in Neo4j Browser, we can use apoc.nodes.group. We define the labels we want to include and group by a certain property. In the config part, we define which aggregations we want performed and returned in the visualization. Find more in the documentation.

CALL apoc.nodes.group(['Person'],['community'], 
[{`*`:'count', Age:['avg','std'],Alcohol_vec:['avg']}, {`*`:'count'} ])
YIELD nodes, relationships
UNWIND nodes as node
UNWIND relationships as rel
RETURN node, rel;

Results

Community preferences

To get to know our communities better, we will examine each community’s top and bottom three preferences by average answer.

MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Height','Number of siblings','Weight','Medicine','Chemistry','Shopping centres','Physics','Opera','Animated','Age','community','Comedy','Gender_vec','Internet_vec','Alcohol_vec','Height_nor']) as all_keys
MATCH (p1:Person)
UNWIND all_keys as key
// average answer per community and question, plus community size and gender split
WITH p1.community as community,
count(*) as size,
SUM(CASE WHEN p1.Gender = 'male' THEN 1 ELSE 0 END) as males,
key,
avg(p1[key]) as average,
stdev(p1[key]) as std
ORDER BY average DESC
WITH community,
size,
toFloat(males) / size as male_percentage,
collect(key) as all_avg
ORDER BY size DESC LIMIT 10
RETURN community, size, male_percentage,
all_avg[..3] as top_3,
all_avg[-3..] as bottom_3

Results

The results are quite interesting. Just looking at the male percentage, it is safe to say that the communities are split almost entirely along gender lines.

The biggest community consists of 220 ladies, who strongly agree with “Compassion to animals”, “Romantic” and, interestingly, “Borrowed stuff”, but disagree with “Metal”, “Western” and “Writing”. The second biggest community, mostly male, agrees with “Cheating in school”, “Action” and “PC”. They also disagree with “Writing”. This makes sense, as the survey was filled out by students from Slovakia.

Gephi visualization

Let’s finish off with a nice visualization of our communities in Gephi. You need to have the streaming plugin enabled in Gephi; then we can export the graph from Neo4j using the APOC procedure apoc.gephi.add.

MATCH path = (:Person)-[:SIMILAR]->(:Person)
CALL apoc.gephi.add(null,'workspace1',path,'weight',['community']) YIELD nodes
RETURN distinct 'done'

After a bit of tweaking in Gephi, I came up with this visualization. Similarly to the apoc.nodes.group visualization, we can observe that the biggest communities are quite densely connected to each other.

