A data science approach to personality models

Salva Rocher
Towards Data Science
Oct 1, 2019

1. Introduction

I have always been interested in how people differ in who they are and how they act, and this study lets us learn a little more about that. We will gain insight into how a person's characteristics relate to their personality traits and habits, and even be able to make predictions based on some demographics.

Most current personality tests draw upon the generally accepted five-factor model, also known as the Big Five or OCEAN, after the first letters of the five factors that comprise it.

[O] Openness to experience. (inventive/curious vs. consistent/cautious)

[C] Conscientiousness. (efficient/organized vs. easy-going/careless)

[E] Extroversion. (outgoing/energetic vs. solitary/reserved)

[A] Agreeableness. (friendly/compassionate vs. challenging/detached)

[N] Neuroticism. (sensitive/nervous vs. secure/confident)

This model is essentially a taxonomy of personality traits. It is generally accepted by experts in the field because it captures the most relevant personality differences among people. However, it has also been the object of debate and criticism.

It has been argued that there are limitations to the scope of the Big Five model as an explanatory or predictive theory. It has also been argued that measures of the Big Five account for only 56% of the normal personality trait sphere.

One common criticism is that the Big Five does not explain all of human personality. Some psychologists have dissented from the model precisely because they feel it neglects other domains of personality. For this reason, some have come to call the Big Five the “psychology of the stranger”: it refers to traits that are relatively easy to observe in a stranger, while aspects of personality that are more privately held or more context-dependent are excluded.

In light of these criticisms about scope, perhaps a new psychometric model could be developed to find relevant dimensions, or alternative latent factors, that explain differences in personality. This became the core of this article, as I explain in detail in the following sections.

2. Data

The dataset, which is not anchored in the OCEAN approach, is an interesting collection of answers to a wide variety of questions, ranging from music preferences to phobias by way of personal interests, personality traits, lifestyle and personal characteristics. It was collected via survey, and the respondents were 1,010 European people aged 15 to 30. There were a few missing values in the answers, but since the sample size is large enough we could afford to remove the affected rows completely. For the list of questions, please refer to the Appendix.

The sampling units are the individual responses, structured in rows, and the columns hold the variables. My analysis focuses on two groups of variables: basic characteristics on one side and personality traits on the other.

Basic characteristics

Gender. Categorical, Male, Female.

Age. Numerical variable, but transformed into an ordinal one following this criterion:

  • 1: 15–19 yo
  • 2: 20–24 yo
  • 3: 25–30 yo (spans one year more than the other groups, which made sense since it is the smallest group in terms of respondents)
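The recoding above can be sketched in a few lines. The article's own analysis was done in R, so the Python below is only an illustration; the function name `age_group` and the sample ages are made up for the example.

```python
def age_group(age: int) -> int:
    """Map an age in years to the ordinal group used in the article."""
    if 15 <= age <= 19:
        return 1
    if 20 <= age <= 24:
        return 2
    if 25 <= age <= 30:
        return 3
    # The survey only covers respondents aged 15 to 30.
    raise ValueError(f"age {age} is outside the surveyed 15-30 range")


# Made-up ages, one at each boundary of the three groups.
ages = [15, 19, 20, 24, 25, 30]
groups = [age_group(a) for a in ages]
print(groups)
```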

Education. Ordinal variable. Categories are the following:

  • 1: Primary school
  • 2: Secondary school
  • 3: College degree
  • 4: Master's degree and PhD (I merged the two since there were very few doctors among the respondents)

Place where respondents spent most of their childhood in. Categories:

  • Village
  • City

Personality traits

To be more precise, the list covers not only personality traits but also beliefs and opinions about life, which in my view can be traced back to personality traits.

It comprises 54 questions, all answered on a scale from 1 to 5, where 1 is strongly disagree and 5 is strongly agree. There were originally 57, but for simplicity 3 were removed since they were of a different nature from the rest.

Other data of interest

Apart from the personality data, there are also questions about interests, music preferences, phobias and aspects of life such as lifestyle.

These answers also range from 1 to 5.

3. Methodology

To move towards our objective of finding valuable insights into how personality traits relate to a person's characteristics, it is worth first spending some time on the list of personality questions. The tool for this is factor analysis, and the rationale behind it is two-fold.

First, we want to group the variables (questions) into similar dimensions without losing much information; in other words, we want dimension reduction. This comes in handy because it is much more convenient to work with a few variables than with many (remember that we are talking about 54 questions, so that makes 54 variables), especially with the prospect of using regression techniques down the road. In a nutshell, it is as if we were screening variables for subsequent analysis.

Second, it will allow us to discover latent structures which could perhaps form the basis of an alternative psychometric model for personality tests.

Since the variables were not created with any model in mind, and there is no reference to compare them to, we need to explore whether we can find the relevant underlying dimensions. The analysis is therefore an exploratory factor analysis.

Once the rationale is clear, how do we proceed?

First, it is good to get a sense of how the variables relate to one another. There are two options for this: one is to plot a correlation heatmap matrix, the other is to run a PCA on all the questions, project them onto the first two dimensions and observe how they relate to each other. Note that PCA is itself a way of reducing dimensions, but for the purpose here it was better to draw upon other methods. Both were computed, as we will see in the results section.
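The heatmap option rests on pairwise Pearson correlations between questions. As a rough illustration (in Python rather than the R used for the article, and with invented answers for three question labels), the matrix can be built like this:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict holding the correlation between every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented 1-5 answers from six hypothetical respondents.
# "loneliness" is constructed as the exact mirror image of "happiness".
answers = {
    "happiness": [5, 4, 4, 2, 1, 3],
    "loneliness": [1, 2, 2, 4, 5, 3],
    "energy": [4, 5, 3, 2, 2, 3],
}
corr = correlation_matrix(answers)
for name, row in corr.items():
    print(name, [round(v, 2) for v in row.values()])
```

A heatmap is then just this matrix drawn with color and size encoding the coefficient.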

In particular, the correlation visualization was ordered using hierarchical clustering based on Ward's minimum variance agglomeration method. At each node of the tree, Ward's criterion merges the two clusters whose union increases the total within-cluster dispersion the least, which keeps each newly combined cluster as compact as possible. The R package corrplot allows us to compute this.
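To make Ward's criterion concrete, here is a small Python sketch, not the corrplot implementation, that computes the merge cost for generic points. The clusters are invented, and the helper names are only for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    squared_distance = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_distance


def best_merge(clusters):
    """Return (cost, i, j) for the cheapest pair of clusters to merge."""
    pairs = [
        (ward_cost(clusters[i], clusters[j]), i, j)
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    ]
    return min(pairs)


# Three invented clusters in the plane.
clusters = [[[0, 0], [0, 1]], [[1, 0]], [[5, 5], [6, 5]]]
cost, i, j = best_merge(clusters)
print(f"merge clusters {i} and {j}, cost {cost:.3f}")
```

The cost returned by `ward_cost` equals the within-cluster sum of squares of the merged cluster minus that of its two parts, which is why picking the smallest cost keeps clusters compact.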

After this comes the factor extraction itself. The main point here is the rotation applied after extracting the factors, which makes the latent structures easier to find. The rotation method used was varimax. The factors were computed with the factanal function in R (part of the stats package), which estimates them by maximum likelihood.
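To give a feel for what a varimax rotation does, the Python sketch below handles the two-factor case only, using a brute-force search over the rotation angle. It is an illustration under simplifying assumptions, not what factanal runs internally, and the loadings are invented.

```python
import math


def rotate(loadings, theta):
    """Rotate a two-factor loading matrix by the angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[a * c + b * s, -a * s + b * c] for a, b in loadings]


def varimax_criterion(loadings):
    """Raw varimax criterion: sum over factors of the variance of squared loadings."""
    p = len(loadings)
    total = 0.0
    for j in range(2):
        squared = [row[j] ** 2 for row in loadings]
        mean = sum(squared) / p
        total += sum((v - mean) ** 2 for v in squared) / p
    return total


def varimax_two_factors(loadings, steps=900):
    """Grid search over [0, pi/2) for the angle that maximizes the criterion."""
    angles = [k * (math.pi / 2) / steps for k in range(steps)]
    theta = max(angles, key=lambda t: varimax_criterion(rotate(loadings, t)))
    return rotate(loadings, theta), theta


# Invented unrotated loadings: every variable loads on both factors.
unrotated = [[0.57, 0.57], [0.49, 0.49], [0.42, -0.42], [0.64, -0.64]]
rotated, theta = varimax_two_factors(unrotated)

for row in rotated:
    print([round(v, 2) for v in row])
```

After rotation each variable loads mainly on one factor, while the sum of squared loadings per variable (its communality) is unchanged. That is the sense in which rotation clarifies the structure without altering how much variance the factors explain.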

A question that arises at this point is how many factors we need to extract. In other words, where do we stop adding factors or dimensions?

Here a scree plot comes in handy. A scree plot depicts the eigenvalues of the correlation matrix of the variables (questions) against the factor number. The nScree function from the nFactors package in R helps us with this.
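The arithmetic behind reading a scree plot is simple enough to sketch. The Python below uses invented eigenvalues for six variables (the eigenvalues of a correlation matrix sum to the number of variables); the threshold of 1 is the common Kaiser rule, given here as one possible convention rather than the rule used in the article.

```python
def explained_share(eigenvalues):
    """Share of total variance per factor, plus the running cumulative share."""
    total = sum(eigenvalues)
    shares = [ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for share in shares:
        running += share
        cumulative.append(running)
    return shares, cumulative


def kaiser_count(eigenvalues, threshold=1.0):
    """Number of eigenvalues above the threshold (Kaiser rule when it is 1)."""
    return sum(1 for ev in eigenvalues if ev > threshold)


# Invented eigenvalues for a 6-variable correlation matrix (they sum to 6).
eigenvalues = [2.4, 1.5, 0.9, 0.6, 0.4, 0.2]
shares, cumulative = explained_share(eigenvalues)

# Drop between consecutive eigenvalues: the "elbow" is where drops flatten out.
drops = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]

print([round(s, 3) for s in shares])
print([round(c, 3) for c in cumulative])
print(kaiser_count(eigenvalues), [round(d, 2) for d in drops])
```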

Once we have come up with a small number of factors that explain most of the variance observed in the original variables, the idea is to relate them to people's characteristics in order to discriminate between, or classify, people. This is done through classification and regression trees (CART).

Classification trees develop easy-to-visualize decision rules for predicting a categorical variable, that is, for assigning a person to one class or another based on some relevant variables. R was also used to compute this, in particular the rpart.plot package, which plots trees fitted with rpart.
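At each node, a tree of this kind looks for the threshold on one variable that separates the classes best. The Python sketch below shows one such step using Gini impurity, a common splitting criterion. It is a simplified illustration, not the rpart algorithm, and the warmth scores and labels are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(rows, threshold):
    """Weighted Gini impurity after splitting rows at score < threshold."""
    left = [label for score, label in rows if score < threshold]
    right = [label for score, label in rows if score >= threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(rows):
    """Try midpoints between consecutive distinct scores, keep the purest split."""
    scores = sorted({score for score, _ in rows})
    candidates = [(a + b) / 2 for a, b in zip(scores, scores[1:])]
    return min(candidates, key=lambda t: split_impurity(rows, t))


# Invented (average warmth score, gender) pairs for nine hypothetical respondents.
rows = [
    (2.1, "male"), (2.5, "male"), (2.9, "male"), (3.8, "male"),
    (3.0, "female"), (3.3, "female"), (3.6, "female"),
    (4.0, "female"), (4.4, "female"),
]
threshold = best_threshold(rows)
print(round(threshold, 2), round(split_impurity(rows, threshold), 3))
```

A full tree repeats this search over every explanatory variable and then recurses into each branch.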

This is the methodology behind the backbone of the project. In addition, I wanted to get some other insights; in particular, I wanted to analyze healthy lifestyle. For that I resorted to correspondence analysis. Its main purpose is to depict visually in a map (usually a multidimensional space projected onto a plane) the relationships between answers and categories. Here the answers are responses to the statement “I live a very healthy lifestyle”, from 1 (strongly disagree) to 5 (strongly agree), and the categories are those of the variables gender, age (in groups) and education.

4. Results

Before starting, note that the original questions listed in the Appendix have also been numbered for convenience, since showing 54 variables together sometimes makes the graphs difficult to read.

PCA

First let's have a look at the PCA on the first two dimensions. For that we need to apply PCA to our dataset, as in the code below.

Thanks to Prof. Michael Greenacre for his customized PCA function.

After running the code, the result looks like this:

Here we get a first rough approximation of how the variables relate to each other in the first two dimensions. The variance explained by these two dimensions is only around 16%, which hints that we would be better off exploring and taking into account more dimensions.

Looking at dimension 2 of the PCA, for instance, we can see two groups of variables, examined in more detail later on, that are very negatively correlated (as represented by the opposite arrows). On one side are questions 52, 47 and 31, that is, interests or hobbies, energy levels and number of friends. On the other are questions 27 and 24, that is, changing the past (remorse, melancholy) and loneliness. The two sides evoke feelings of opposite sign.

Correlations heatmap matrix

Let's now look at the relationships between questions in more detail. Below is the correlation matrix.

The X and Y axes show the 54 questions. The ordering already gives some hints of what we will find down the road when computing the factor analysis. The size and color of each square in the matrix indicate the strength of Pearson's correlation coefficient between each pair.

Additionally, I have highlighted 9 rectangles, chosen at my discretion, which show groupings of similar traits. For instance, questions 2, 3 and 4 (2nd rectangle starting from the bottom right corner), “prioritizing workload”, “writing notes” and “workaholism”, share an underlying concept, which is diligence, the drive for hard work. (Please note that the sample units are young people from 15 to 30, so in most cases their equivalent of working is studying.) Also, funnily enough, this group correlates negatively with questions 25 and 51 (3rd rectangle starting from the top left corner), “cheating in school” and “difficulty getting up early in the morning”, which makes total sense.

In any case, the correlation matrix above is a very nice starting point, but what is going to tell us the main latent factors beneath the survey questions is the factor analysis.

So let's get right to it.

Factor analysis

First, we define the function that plots the dimensions we will be getting, along with its arguments.

Then, we call factanal to obtain the factors using the varimax rotation method.

Lastly, we call the function defined before on the variable that contains the factors.

The way to read the following factor graphs is to focus on the labels at the extremes of the X and Y axes, since they are the ones that characterize each dimension. In other words, they represent the variables with the highest loadings for each factor.
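Reading the extremes of an axis amounts to sorting the variables by their loading on that factor. A minimal Python sketch, with invented loadings attached to a few of the question labels mentioned in the article:

```python
def extreme_variables(loadings, k=2):
    """Return the k most negative and the k most positive loadings of a factor."""
    ordered = sorted(loadings.items(), key=lambda item: item[1])
    return ordered[:k], ordered[-k:]


# Invented loadings on a single factor; the numbers are not from the article.
factor_loadings = {
    "happiness in life": 0.71,
    "energy levels": 0.62,
    "writing notes": 0.05,
    "changing the past": -0.49,
    "loneliness": -0.58,
}
negative_end, positive_end = extreme_variables(factor_loadings)
print("negative end:", negative_end)
print("positive end:", positive_end)
```

Variables near zero, like the third one here, say little about the factor and can be ignored when naming it.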

Factor 1

Dimension of factor 1 can be read horizontally and the one for factor 2, vertically

Again, the meaning of each variable can be found in the Appendix.

F1. Positiveness

I have called this the Positiveness factor because its variables relate to positive feelings: they describe positive people who feel happy and energetic, have good dreams at night, and think positively about themselves and their personality. On the other side of the spectrum we find people who don't feel good because they feel lonely (very negatively correlated with both “happiness in life” and “energy levels”, as we saw in the correlation matrix), and who experience negative emotions, perhaps as a result of not being positive, such as remorse, a wish to change the past, and mood instability.

Factor 2

F2. Extroversion

The second factor reads as a measure of the extroversion/introversion personality dimension. Extroverts find it easy and stimulating to enter and adapt to new environments, enjoy meeting new people, have a large number of friends overall, have many interests and hobbies, and find it easier to be assertive. Assertiveness is not always easy to apply, since the other party might not take it well, so an introvert, who is more sensitive and generally has fewer friends, might want to steer clear of it. On the other hand, we see typical traits of introverts (introvert spring): they tend to feel uncomfortable when surrounded by a lot of people, so they tend to dislike or fear public speaking; they also tend to ruminate and think things over quite a bit, hence decision making and responding to a serious letter require time. Introverts also tend to be introspective, hence they worry about health relatively more than extroverts on average. This also makes sense: an introvert is more aware of what is going on inside them than an extrovert, so they are more likely to spot health irregularities and ruminate over them.

Factor 3

Dimension for factor 3 can be read vertically

F3. Diligence

The 3rd dimension is diligence. As we mentioned briefly when commenting on the correlation matrix, this factor represents people who are diligent about their duties, who can be considered disciplined and hardworking, and who are consequently also reliable. Concepts like writing notes and workaholism reflect this clearly. Others, like prioritizing workload and thinking ahead, might not be so obvious upfront, but doing them systematically is neither easy nor comfortable, so they show a certain amount of diligence and drive to fulfill duties well and on time. On the opposite side we have those who find it difficult to get up in the morning and who tend to cheat at school; in a nutshell, those who are in general lazier and less diligent.

Factor 4

Dimension for factor 4 can be read vertically

F4. Warmth

The 4th personality dimension is “warmth”, understood as a blend of good-heartedness, sensitivity and honesty. It reflects aspects of personality such as empathy towards others, suffering when seeing others suffer, being generous with presents when the occasion calls for it, crying when one is struggling, and enjoying children's company. The first two show empathy and the 3rd good-heartedness. The 4th shows someone who does not refrain from expressing emotions like crying, and so is honest about who they are and how they feel. The 5th also shows warmth: small children are playful and smile often, which somehow forces one to reciprocate, so someone who is rather cold or serious, or who doesn't want to be bothered with such things, will tend to enjoy children's company less.

On the other side of the spectrum, we have those who are two-faced, those who prefer big dangerous dogs to small ones, and those who fall easily for someone but lose interest just as fast. These might seem to be unconnected topics, but to me they are not, since they picture people who tend to build a wall around themselves so others don't really see who they are or how they feel. The 1st seems obvious, as they state it outright. The 2nd might match those who build a wall by wanting to come across as strong and dangerous. The 3rd reflects those who have great difficulty opening up, showing who they really are, and sympathizing and connecting with others at a deeper level, hence they lose interest quickly and change partners constantly.

Factor 5

Dimension for factor 5 can be read vertically

F5. Superficiality

I have called the 5th factor superficiality to represent those personalities who give predominant importance to image and status, which is often linked to materialism, egocentricity and shallowness. The 1st variable, it seems to me, puts the focus of personal relationships on how useful knowing a person might be, which comes across as self-interested. The 2nd is linked to the importance they give to their physical image, and the 3rd reflects those who brag about their achievements.

On the opposite side we have personality traits that move away from materialism and selfishness: the 1st question corresponds to those who hand in lost items that, although valuable, are not theirs, and the 2nd to those who value friends more than money.

So far, we are coming up with factors that, apart from making sense, also explain variance in the values of the variables. So the question that comes up at this point is: how many factors should we select?

The scree test above sheds some light on this. We can see that there are 17 eigenvalues above the mark of 0, which tells us that up to 17 factors explain at least some variance. 10 is considered to be the optimal number of factors; however, I tried it, and the additional factors explained little of the variance and did not make much sense. According to my findings and to a visual exploration of the eigenvalue curve, as the slope of the curve flattens, adding new factors does not seem to capture as much variance, so I decided to stop at 5 factors.

It seems fantastic to have been able to bring 54 questions down to 5 dimensions that pick up most of the variance of all the variables. These are F1: Positiveness, F2: Extroversion, F3: Diligence, F4: Warmth and F5: Superficiality.

But before moving on to classification trees, there is one gap to be bridged.

As stated previously, the idea was to simplify the variables, and we have done so. However, we know that the variables grouped into each factor have opposite signs, and if we don't transform them we are not truly reflecting the nature of the factor. For instance, “happiness in life” gives 5 to people who feel very happy about themselves and their lives, while “loneliness” gives 5 to the maximum intensity of loneliness. If we leave this as it is, a happy person could give 5 to “happiness in life” and 1 to “loneliness”, for an average of 3. A lonely, negative person could give 1 to “happiness in life” and 5 to “loneliness”, again for an average of 3. If two totally opposite people, as far as this personality trait is concerned, get the same average on the dimension, how can we expect the dimensions to classify people's characteristics properly?

It seems evident that a transformation has to be done. What I did was invert the rating of the “opposite” variables, so 1 becomes 5 after the transformation, 2 becomes 4, and so on. Applying this to all the factors obtained allows us to calculate averages and use them as explanatory variables in the classification model that comes next.
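The happy/lonely example above can be worked through in a few lines of Python. The function names are illustrative, and the two respondents are the hypothetical ones from the text.

```python
def invert(score, low=1, high=5):
    """Reverse a rating on a low..high scale: 1 <-> 5, 2 <-> 4, 3 stays 3."""
    return low + high - score


def factor_score(answers, reversed_items):
    """Average the answers of one factor after inverting the reversed items."""
    adjusted = [
        invert(value) if name in reversed_items else value
        for name, value in answers.items()
    ]
    return sum(adjusted) / len(adjusted)


# The two opposite respondents described in the text.
happy = {"happiness in life": 5, "loneliness": 1}
lonely = {"happiness in life": 1, "loneliness": 5}
reversed_items = {"loneliness"}

# Without the transformation both respondents get the same average.
raw_happy = sum(happy.values()) / len(happy)
raw_lonely = sum(lonely.values()) / len(lonely)

# With it, they land at opposite ends of the Positiveness scale.
score_happy = factor_score(happy, reversed_items)
score_lonely = factor_score(lonely, reversed_items)
print(raw_happy, raw_lonely, score_happy, score_lonely)
```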

Regression trees

What we are going to do here is classify people, predicting the probability that a person belongs to a given category of each characteristic variable, given their scores on the most discriminating dimensions at each level of the tree.

Age

Based on the groups defined in the data section, we can see from the top node of the tree that the largest group of respondents is aged 20 to 24. The 2nd line of the top node (and of each node) shows the expected probability of belonging to the 1st, 2nd and 3rd group respectively (equivalent to their frequencies).

We can see that scoring 3.2 or more in the diligence dimension is what most strongly predicts belonging to the 25 to 30 group (raising the probability from the observed 9% to 15%). Then, diligence below 3.2, positiveness below 3.1 and warmth above 4.2 is the combination that most favors classification as “middle-aged” (from the observed 48% up to 75%). Lastly, the combination that best predicts the youngest group (15–19 yo) is diligence below 3.2, positiveness above 3.2 and superficiality above or equal to 3.9.

So, in conclusion, the tree classifies the “oldest” as the most diligent among the respondents, the youngest as the most superficial, least positive and least warm, and the “middle-aged” as the most positive; however, if a respondent scores less than 3.1 in positiveness on average, according to this model they would be the warmest.

Childhood location

In short, the model predicts that the warmest and least superficial respondents spent most of their childhood in a village. The combination of an average warmth score above 4.2 and superficiality below 2.7 predicts a village childhood with a probability of 63%, which is pretty high given that village respondents account for only 29% of the observations. So according to the factors and the tree, city equals less warmth and more superficiality, which seems pretty accurate.

Gender

In a nutshell, more warmth (an average score of 3.4) predicts in favor of female, raising the probability to 78% from the 59% of female respondents. Again, the factor that discriminates the most, and seems most relevant when classifying people by gender, is warmth. Average scores below 2.7 predict being male with a probability of 80% (males made up 41% of the survey respondents).

Correspondence analysis

With the backbone of the project analyzed, let's shift our focus from personality traits to people's lifestyle.

In particular, the aim is to get some visual insights into how having a healthy lifestyle relates to characteristics such as gender, age and education.

For the statement “I live a very healthy lifestyle”, people were asked to answer strongly agree (5), agree (4), neutral (3), disagree (2) or strongly disagree (1).

Here are some cross-tabulations for age-gender and education.

“ma” and “fa” stand for male and female respectively, while 1, 2 and 3 correspond to what we are calling the youngest, the “middle-aged” and the “oldest”, as explained in the data section.

Gender-age

Correspondence analysis comes in handy for our purpose. Let's have a look by gender-age.

In the CA map above, the two dimensions explain 72.4% and 16.2% of the total inertia respectively, totaling 88.6% for the two-dimensional solution. This is a very high value, which means that the inertia associated with the remaining dimensions is not really relevant in explaining variance, so the 2 dimensions used are enough for interpreting category differences.
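The total inertia that these percentages refer to is the chi-square statistic of the cross-tabulation divided by the number of respondents. The Python sketch below computes it for two small invented tables (not the survey counts) and re-checks the sum of the two percentages quoted above. Decomposing the inertia into dimensions requires a singular value decomposition, which is left out here.

```python
def total_inertia(table):
    """Total inertia of a contingency table: chi-square statistic divided by n."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    return chi_square / n


# Invented tables: rows could be gender-age groups, columns answer levels.
independent = [[10, 20], [20, 40]]      # rows proportional: no association
fully_associated = [[10, 0], [0, 10]]   # each row uses a single column

print(total_inertia(independent), total_inertia(fully_associated))

# Percentages of inertia reported in the article for the first two dimensions.
two_dimensional = round(72.4 + 16.2, 1)
print(two_dimensional)
```

An inertia of 0 means the row categories answer identically, so there is nothing for the map to display. The larger the inertia, the more the categories differ.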

What can we infer from the map?

The 1st dimension is determined by the opposition between those who have an overall healthy lifestyle and those who position themselves at the extremes, that is, having either a very healthy lifestyle or an overall unhealthy one.

The 2nd dimension seems to separate respondents by the direction in which they deviate from an overall healthy lifestyle.

So, we can observe females on one side of the first dimension and males on the other. This means that young women are in general more concerned with leading a healthy lifestyle than young men, and their answers seem less dispersed and “polarized” than those of men. For males, we can see a clear distinction between the younger ones (15–24 yo) and the grown-ups (25–30 yo): the grown-ups follow a very healthy lifestyle, as opposed to younger people, and especially young males, who seem to follow quite an unhealthy lifestyle on average.

Education

This time 99.3% of the total inertia is explained by the 2 dimensions of the biplot, confirming that two dimensions are enough to explain the variance between the variables.

In particular, we can observe that people with lower levels of education tend to have a less healthy lifestyle on average than those with higher education.

In any case, let's take this with a grain of salt: since the respondents are aged 15 to 30, age is highly correlated with education.

5. Conclusions

We have applied factor analysis and obtained interesting personality dimensions that might not be captured by standard questionnaires following the OCEAN approach. From them, we have been able to see partially different personality factors that could be the basis for alternative psychometric models of personality. After using classification trees on the factors obtained, we have also been able to find interesting classifications for several basic categorical characteristics.

The main limitation is perhaps that the exploratory factor analysis that produced the 5 dimensions should be followed by a confirmatory factor analysis, in order to build models with factors that are known to be statistically significant. This would certainly be the extension I envisage for this kind of project.

6. References

https://www.kaggle.com/nowaxsky/visualization-and-prediction-of-shopping-habits/data

https://cran.r-project.org/web/packages/corrplot/corrplot.pdf

https://www.statisticssolutions.com/factor-analysis-2/

https://github.com/rsangole/PersonalityTraitFactorAnalysis/blob/master/FA_Traits.R

https://www.statmethods.net/advstats/factor.html

https://cran.r-project.org/web/packages/nFactors/nFactors.pdf

http://www.milbo.org/rpart-plot/prp.pdf

http://trevorstephens.com/kaggle-titanic-tutorial/r-part-3-decision-trees/

https://en.wikipedia.org/wiki/Big_Five_personality_traits

7. Appendix

List of statement/questions of personality traits
