Uncover the Latent Structure of Personality with lavaan in R

Reveal hidden relationships within the data with structural equation modelling (SEM)

Hannah Roos
Towards Data Science


Photo by Kévin et Laurianne Langlais on Unsplash

For a psychologist, one of the most challenging tasks is the prediction of people’s behaviour. Because we are a quite complex species, this requires knowledge about what makes a person tick — something that remains concealed. Statistically, we can infer such underlying relationships via structural equation modelling (SEM), a set of statistical methods that is especially useful if you are forced to model multiple cause-effect relationships indirectly instead of directly. While it is naturally possible to directly measure height and weight to characterize a physical shape, say the size of an elephant, we must rely on indicators (e.g., tasks) to assess the latent intelligence of a person. Intelligence, as it is commonly stated, is structured hierarchically and can be split into multiple dimensions: general intelligence (the g-factor) is the most abstract part of the structure, as it predicts more specific factors such as a person’s reasoning ability, processing speed, memory span and so on. In turn, these subfactors are approximated using specific tasks. The task performances are thus the indicators of their latent constructs and include an error term (e.g., an individual’s day-to-day performance). For example, this is how the Wechsler Adult Intelligence Scale — Fourth Edition (WAIS-IV) is modelled (Benson, Hulac & Kranzler, 2010):

Disclaimer: All images are created by the author unless stated otherwise.

Once our work goes beyond exploratory analyses, our shared knowledge must be used to confirm the validity of any newly developed test: for example, does the common structure of intelligence hold true across different IQ tests? It should, because we measure the same concept with different tests, so people’s performance on IQ test A should correlate with their performance on IQ test B. The hierarchical structure that we have formed a priori must be somewhat represented in the data, or something suspicious may be going on: either the design of our measure or the theoretical model needs a revision because it is inconsistent with the data.

If you are passionate about data science like I am, you have probably already heard of exploratory factor analysis (EFA), which serves to uncover hidden factors within our sample, a scenario in which we let the data speak for themselves. In contrast, SEM and especially confirmatory factor analysis (CFA) are characterized by a more deductive approach: to confirm a theoretical structure, you specify the number of underlying factors and the pattern of loadings in advance, often after running an EFA on a pilot sample. Usually, we can even run multiple models about the latent structure of a phenomenon under study to find out which of them are most strongly supported by the data. Technically, this is done via covariance matrices that capture how closely variables are related. You can think of a covariance as an unstandardized correlation: it indicates the degree to which variables are linearly related to each other in their original scales of measurement. To make them more interpretable when different scales are used in the dataset (e.g., age and intelligence), they can easily be converted into correlations, a standardization that will serve us later. But first, the hypothesized structure (e.g., the intelligence model) is used to derive a covariance matrix that we would theoretically expect — this is our estimated population covariance matrix. We can tell something about the fit of the model by comparing it to the covariance matrix drawn from our actual sample data (e.g., measured test scores). So, the interrelations that we would expect under a certain model get compared with the ones observed in the data — how consistent are they? The closer they are, the better the fit (Ullman, 2006).

The Big Five of personality

Most people may be relieved to hear that you are by far not only defined by your intelligence, but also by other attributes that describe you as a person: your personality, which shows in the way you behave, feel, and think. To compare people’s personalities with each other numerically, we need to somehow quantify them by means of items on a scale. This way, we can try to use the assessment to match people to the right jobs or to guide personal development. But how is something like “personality” structured in the first place? Given that there are so many nuances and facets that play into the patterns typical for a person, how can we even break it down to a short yet precise description?

Art Woman GIF By Sterossetti via GIPHY

Fortunately, we are not the first to ask these questions: within the last 20 years of psychological research, five dimensions have emerged as universal patterns of personality — often called the Big Five of personality (John, Naumann & Soto, 2008):

  1. Openness — is also called Intellect or Imagination; describes the degree to which a person is interested in (and engages with) new experiences and impressions. Open people are explorative, have a lot of fantasy and often search for intellectual inspiration.
  2. Conscientiousness — describes the degree to which a person controls their behaviour in the pursuit of goals. Conscientious people tend to be well-organized, detail-oriented and reliable.
  3. Emotional Stability — is the opposite pole of Neuroticism; describes the degree to which a person is vulnerable to their own emotions. Emotionally stable people are calm, seldom nervous or anxious, and able to maintain their composure in stressful situations.
  4. Agreeableness — describes the degree to which a person is focused on harmonious relationships. Agreeable people are friendly, empathetic, cooperative and able to minimize conflict.
  5. Extraversion — describes the degree to which a person is sociable, friendly, adventurous and active. Extraverted people easily affiliate with other people, love excitement and are rather talkative, energetic and optimistic.

These are based on the so-called lexical approach, the idea that person descriptions are naturally manifested in every language, for instance in words such as shy, productive, cheerful, manipulative and so on. By asking participants to rate a known person using these adjectives, researchers have collected many thousands of data points and split the personality space into distinct dimensions using exploratory factor analysis.

Even if the five factors are commonly accepted, there have been debates about the precise nature and structure of universal personality traits. For instance, are there higher order factors for personality dimensions just like there is a g-factor for intelligence? Are there really only five factors of personality or maybe more (like advocates of the HEXACO model argue)? If there are certain true personality dimensions, to which extent are they caused by the tendency to respond to questionnaires in a specific manner? And relatedly, do questionnaires actually reveal truths about personality in the first place?

As you may guess, there are a lot of questions with respect to the latent structure of personality that are still unanswered. The last one strikes me the most: can we really rely on self-reports to assess aspects of human personality? Questionnaires that are designed to measure traits that are valued positively in our society (e.g., extraversion) have been criticised for naturally evoking socially desirable answers. Thus, due to these design flaws, people tend to rate themselves more favourably than expected based on their actual behaviour. Even with the best intentions, people make themselves look better on paper than they are. This is even more extreme in so-called high-stakes situations (e.g., job interviews, questions on dating apps), so conditions under which there is much to win or lose. Okay, this would not be much of a problem if a) all people “faked” their responses to the same extent and b) it did not affect the prediction of their behaviour under different conditions. For example, if a thin-skinned job applicant who once pretended to be particularly emotionally robust was exposed to a tough work environment, it is a lose-lose situation for the applicant and for the employer. But unfortunately, this is exactly what happens in practice (Ziegler & Bühner, 2009), and smart people are particularly good at it (Geiger, Olderbak & Wilhelm, 2018). But even if people do not intentionally lie, the way the test is constructed makes it hard to accurately and objectively report how they behave in real life. Apart from social desirability, answers on personality questionnaires are confounded with a range of other biases such as transient mood states, ambiguity of the statements presented and implicit theories about personality (e.g., “Intellectual people are often introverts, so if I am a quick thinker I must be a calmer person, too.” or “Still waters run deep.”). In this case study, I will show you how to test hypotheses that touch upon these topics and demonstrate how you can model latent structures yourself with lavaan in R.

The dataset

To stay with the five-factor model introduced previously, we will work with a Big Five dataset found on kaggle. It contains more than a million (1,015,342) responses collected online by the Open-Source Psychometrics Project. This open-source platform provides free personality assessments for developmental purposes. According to the website, users consent in advance that their data will be collected and stored anonymously for scientific purposes.

Specifically, the test we will focus on is called the Big-Five Factor Markers, developed by Goldberg (1992). It includes 50 questions (10 per personality dimension) which participants rate themselves on, using a five-point Likert scale where 1 represents complete disagreement, 3 a neutral response and 5 full agreement. These are the questions:

EXT = Extraversion

  • EXT1 I am the life of the party.
  • EXT2 I don’t talk a lot.
  • EXT3 I feel comfortable around people.
  • EXT4 I keep in the background.
  • EXT5 I start conversations.
  • EXT6 I have little to say.
  • EXT7 I talk to a lot of different people at parties.
  • EXT8 I don’t like to draw attention to myself.
  • EXT9 I don’t mind being the center of attention.
  • EXT10 I am quiet around strangers.

EST = Emotional Stability

  • EST1 I get stressed out easily.
  • EST2 I am relaxed most of the time.
  • EST3 I worry about things.
  • EST4 I seldom feel blue.
  • EST5 I am easily disturbed.
  • EST6 I get upset easily.
  • EST7 I change my mood a lot.
  • EST8 I have frequent mood swings.
  • EST9 I get irritated easily.
  • EST10 I often feel blue.

AGR = Agreeableness

  • AGR1 I feel little concern for others.
  • AGR2 I am interested in people.
  • AGR3 I insult people.
  • AGR4 I sympathize with others’ feelings.
  • AGR5 I am not interested in other people’s problems.
  • AGR6 I have a soft heart.
  • AGR7 I am not really interested in others.
  • AGR8 I take time out for others.
  • AGR9 I feel others’ emotions.
  • AGR10 I make people feel at ease.

CSN = Conscientiousness

  • CSN1 I am always prepared.
  • CSN2 I leave my belongings around.
  • CSN3 I pay attention to details.
  • CSN4 I make a mess of things.
  • CSN5 I get chores done right away.
  • CSN6 I often forget to put things back in their proper place.
  • CSN7 I like order.
  • CSN8 I shirk my duties.
  • CSN9 I follow a schedule.
  • CSN10 I am exacting in my work.

OPN = Openness to new experiences

  • OPN1 I have a rich vocabulary.
  • OPN2 I have difficulty understanding abstract ideas.
  • OPN3 I have a vivid imagination.
  • OPN4 I am not interested in abstract ideas.
  • OPN5 I have excellent ideas.
  • OPN6 I do not have a good imagination.
  • OPN7 I am quick to understand things.
  • OPN8 I use difficult words.
  • OPN9 I spend time reflecting on things.
  • OPN10 I am full of ideas.

As a first step, let us read the data into our work environment and prepare it for subsequent analysis: we will remove incomplete cases and select the personality variables only. Because the full dataset is too large for most machines to handle comfortably, we reduce the sample to 10,000 observations, an arbitrary number that I have chosen to make it work on my own computer. Moreover, some of the items are negatively keyed: agreeing with “I get stressed out easily”, for example, does not speak for an emotionally stable person, quite the contrary. Therefore, we need to reverse-code them so that strong agreement decreases a person’s score according to the test’s scoring key: people who get stressed out easily end up with lower emotional stability scores. We can wrap all of these pre-processing steps in a convenient dplyr pipeline using the %>% operator, just as follows:
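Below is a minimal sketch of what this pipeline could look like; the file name, the set of reverse-keyed items (inferred from the wording of the statements above) and all object names are assumptions on my part and may need adjusting to your local copy of the data.

# minimal pre-processing sketch (assumes the tab-separated kaggle file "data-final.csv")
library(dplyr)

set.seed(1234)  # make the subsample reproducible

# items inferred to be negatively keyed from their wording
reverse_items <- c("EXT2", "EXT4", "EXT6", "EXT8", "EXT10",
                   "EST1", "EST3", "EST5", "EST6", "EST7", "EST8", "EST9", "EST10",
                   "AGR1", "AGR3", "AGR5", "AGR7",
                   "CSN2", "CSN4", "CSN6", "CSN8",
                   "OPN2", "OPN4", "OPN6")

big5 <- read.delim("data-final.csv") %>%
  select(matches("^(EXT|EST|AGR|CSN|OPN)\\d+$")) %>%   # keep the 50 personality items only
  filter(if_all(everything(), ~ .x %in% 1:5)) %>%      # drop incomplete or invalid responses
  slice_sample(n = 10000) %>%                          # arbitrary subsample of 10,000 cases
  mutate(across(all_of(reverse_items), ~ 6 - .x))      # reverse-code negatively keyed items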

What do the data actually look like now? Let us use the skimr package to get some descriptive statistics.

# get some descriptives
skimr::skim(big5)

The little histograms on the right reveal a lot about the so-called item difficulty of the personality statements — how easily is a strong agreement evoked by each of the statements? For example, we can see a very flat distribution for item EXT9 — “I don’t mind being the center of attention.” This means that a similar number of people (slightly) agree as (slightly) disagree, spreading responses across the whole rating scale. Relatedly, there does not seem to be an obviously “right” or “wrong” answer to this statement, which can make it harder for respondents to answer spontaneously. We cannot really draw any conclusions about the mechanisms responsible for the large variability in responses because we cannot test our assumptions experimentally, but we could hypothesize that it is attributable either to random guessing or to true differences among individuals. Item EST3 (“I worry about things.”) on the other hand is strongly skewed to the right — because the item is reverse-coded, the high density of responses in the lower range means that far more people agreed than disagreed, making worrying about things an attribute people commonly relate to. Based on the frequent agreement, this self-description seems to be the norm rather than the exception. People seem to be overall pretty concerned!

Model assumptions

Before we dive into the actual analysis, let us check whether we meet all required assumptions for SEM. Firstly, structural equation analyses only work with a large amount of data. Here is a general rule of thumb: we need more than 200 observations (but actually no fewer than 400, especially when the observed variables are not normally distributed) or 5–20 times the number of parameters to be estimated, whichever is larger (e.g., Kline, 2005, pp. 111, 178). To judge the complexity of our model, we can also count the pieces of information the data provide: a covariance matrix of p measured variables contains p(p + 1)/2 unique variances and covariances. Because we have 5 * 10 = 50 measured variables in total, this gives 50 * 51 / 2 = 1,275 data points (50 variances and 1,225 covariances). To get a reliable estimation of the model, there must be more data points than free parameters to be estimated, which is true in our case since 1,275 comfortably exceeds the 110 parameters our model estimates (as the lavaan output will confirm below). Statistically, this is a critical point that comes down to model identification, defined as a unique numerical solution for each parameter in the model. If you have fewer data points than parameters in the model, lavaan will give you an error because your model is under-identified and the parameters cannot be estimated. If there are exactly as many data points as parameters in the model, your model is just identified, so the parameters perfectly reproduce the sample covariance matrix and any test statistic is zero. Why? You can only test your hypothesis about the adequacy of the model if you are able to compare the covariance matrix from your sample to an estimated population covariance matrix — but if they are identical by construction, the test statistic becomes zero and cannot be interpreted. This is why you need an over-identified model, a situation in which you have more data points than parameters to be estimated. Apart from model identification, multivariate normality is another requirement for SEM. We can test it by running mardia() from the psych package:

# Mardia test of multivariate normality
psych::mardia(big5, plot = FALSE)

There are two estimates in the output: b1p for Mardia’s estimate of multivariate skew and b2p for Mardia’s estimate of multivariate kurtosis, both of which can indirectly indicate whether the variables are normally distributed. For multivariate normality, the p-values of both the skewness and kurtosis statistics should be greater than 0.05, which is not the case for our data. We can see that both p-values (denoted as probability) are zero, thus we need to reject multivariate normality. The plot argument would usually give you a nice plot that allows you to inspect any outliers right away, but because we have so many data points, R won’t be able to display the plot anyway. However, multivariate normality is especially tied to maximum likelihood estimation, as you can read here, so to overcome the problem we can use a more robust version of ML which is less affected by violations of normality.

Model specification

It is finally time to introduce lavaan! It is a fabulous R package for latent variable analysis developed by Yves Rosseel, Terrence D. Jorgensen and Nicholas Rockwood among many contributors, and now we will make use of it.

Let us test whether the classic structure assumed for the Big Five holds true for this sample. We assume five distinct dimensions that are allowed to correlate with each other and have indicators that load only on their respective dimension, that is, the 10 items per trait which have been introduced before. This is a graphical representation of the model:

To make it easier for you to read such a path diagram, here is a guide on SEM conventions: everything that we can actually measure, so the observed variables, indicators, or manifest variables, is represented by squares or rectangles — the tiny little boxes on the bottom. All factors must have two or more indicators, for mathematical reasons and for the reduction of measurement error. These factors are called latent variables, constructs, or unobserved variables. They are typically represented by circles or ovals. Lines indicate relations between variables — so if there is no line connecting two variables, this implies that no direct relationship has been hypothesized. Lines have either one or two arrows: a line with one arrow represents an assumed direct relationship between two variables. A line with an arrow at both ends indicates a covariance between the two variables with no implied direction of effect. Extraversion, Agreeableness, Emotional Stability, Conscientiousness and Openness are the latent variables in our example, which are expected to covary with each other. Each of the specific items (e.g., EXT1-EXT10) is directly predicted by its respective factor (e.g., Extraversion). We cannot directly observe what is going on inside a person’s mind while they answer a survey and how this relates to their personality, but there have been some theoretical considerations from the psychometric literature which help us to understand the logic: according to item response theory (IRT) models, we can estimate a person’s traits from their responses to test items. So theoretically, a very extraverted person would likely rate item EXTi accordingly high because the person’s standing on the latent trait predicts their responses. This is because the probability of strong agreement to, let’s say, “I feel comfortable around people” should increase linearly as the level of extraversion increases (Brown, 2017). Therefore, the best we can do is to take the responses as indicators of personality traits, even if we cannot measure them directly. It is obvious that we cannot do a perfect job here: note that there is a dashed two-headed arrow pointing towards each indicator in the path diagram. This denotes the error term, or residual variance, which describes the amount of variance that is not accounted for by the respective factor. From a more general perspective, the part of the model that relates the measured variables to the factors is called the measurement model, i.e., the way each item is assigned to a personality dimension. If we simply want to estimate this measurement model and test its compatibility with the data, we often run a confirmatory factor analysis, which is a certain type of SEM. The theoretical relationship among constructs, i.e., the covariance between personality dimensions, is called the structural model. To tell lavaan about the structure we expect in our model, we need to specify a certain syntax before we run our confirmatory factor analysis.

Because we have quite a few variables in our model and I find typing all of them very tedious, I have defined a smart function that facilitates the job for us: it takes a certain abbreviation (e.g., EXT for extraversion) to select all matching variable names from the dataset, collapses them into one string and adds spaces as well as ‘+’ signs in between. This way we have the part that defines the manifest variables of a latent variable already prepared and can simply paste it into the syntax.
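One possible implementation of such a helper might look like this (the function name make_indicators() is my own choice):

# collect all column names starting with a given prefix and join them with " + "
make_indicators <- function(data, prefix) {
  paste(grep(paste0("^", prefix), names(data), value = TRUE), collapse = " + ")
}

make_indicators(big5, "EXT")
# "EXT1 + EXT2 + EXT3 + EXT4 + EXT5 + EXT6 + EXT7 + EXT8 + EXT9 + EXT10"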

Generally, lavaan reads the model like this:
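In lavaan’s model syntax, a latent variable is defined with the =~ operator, which reads as “is measured by”. Using the helper from above, a sketch of the full measurement model could be built like this (the factor labels and object names are illustrative):

# build the five-factor measurement model: each trait "is measured by" (=~) its ten items
big5_model <- paste(
  paste("EXTRA =~", make_indicators(big5, "EXT")),
  paste("EMO   =~", make_indicators(big5, "EST")),
  paste("AGREE =~", make_indicators(big5, "AGR")),
  paste("CON   =~", make_indicators(big5, "CSN")),
  paste("OPEN  =~", make_indicators(big5, "OPN")),
  sep = "\n"
)
cat(big5_model)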

Fortunately, lavaan already assumes that all latent variables are correlated, so we do not need to specify this here in addition to the measurement model.

Now all we need to do is run the cfa() function on our model syntax. We set std.lv to TRUE to standardize our latent variables: this way all latent variances are fixed to unity and all factor loadings are freely estimated, which makes them easier to interpret. The goal of estimation is to minimize the difference between the unstructured covariance matrix (actual data) and the structured covariance matrix (model prediction). We use ‘MLM’ as our estimator, that is, maximum likelihood estimation with robust standard errors and a Satorra-Bentler scaled test statistic. For more information about the estimators available in lavaan, you can click on the link here.
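A sketch of the call, assuming the model string and object names defined above:

library(lavaan)

# confirmatory factor analysis with standardized latent variables and robust ML
big5_cfa <- cfa(big5_model,
                data      = big5,
                std.lv    = TRUE,   # fix all latent variances to 1
                estimator = "MLM")  # ML with robust SEs and Satorra-Bentler scaled test statistic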

Actually, maximum likelihood estimation is not recommended for ordinal data (e.g., a Likert scale from 1–5) because such data violate the assumption of normality, as the Mardia test has also suggested for our data. Therefore, an estimator without distributional assumptions, such as the diagonally weighted least squares (DWLS) method, would be more appropriate. It is related to the ADF estimator but is less computationally intense (Newsom, 2018). However, there are some arguments that speak against its use: firstly, we would not be able to compare our results to the literature later because most studies on the latent structure of the Big Five have traditionally used maximum likelihood estimation (e.g., see Ashton & Lee, 2009). Even if a comparison between different Big Five personality tests and research designs is certainly not perfect, a horribly huge mismatch between test statistics could indicate that something is wrong, and by using DWLS estimation we would not have such a reference. Secondly, although this is not the strongest argument, some authors argue that Likert-scaled data can be treated as continuous if lots of observations are available, and 10,000 cases can indeed be considered a large dataset that allows for robust versions of maximum likelihood estimation. Thirdly, Shi and Maydeu-Olivares (2019) have shown that the estimator itself has a huge impact on the fit indices even if the same (!) data are used, which can give you a biased impression of the model’s performance. The authors recommend either using different or less restrictive cut-off values for common indicators like the comparative fit index (CFI) or the root mean squared error of approximation (RMSEA), or directly using the standardized root mean square residual (SRMR), as it is most robust against the method used for estimation. Thus, we use a robust maximum likelihood estimator while keeping this knowledge about fit indices in mind.

Model evaluation

By running the summary() function on our model object, we get a detailed model output. We set standardized = TRUE to get more meaningful estimates of the latent variables and rsquare = TRUE to learn more about the degree to which our model can actually explain the participants’ responses. We set fit.measures = TRUE to get more fit statistics apart from the Chi-square estimate.
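Using the fitted object from above, the call could look like this:

summary(big5_cfa,
        fit.measures = TRUE,   # CFI, TLI, RMSEA, SRMR etc. in addition to chi-square
        standardized = TRUE,   # adds the Std.lv and Std.all columns
        rsquare      = TRUE)   # variance in each item explained by its factor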

Lavaan ended normally and did not throw any error messages. The output neatly summarizes that we have 10,000 observations and 110 model parameters and that maximum likelihood estimation was used. Now we can compare the test statistics between ‘our’ model and the baseline model. This is — you may have guessed it — a kind of hypothesis test. The baseline is a null model in which all covariances are fixed to zero, a scenario in which the items would be completely independent of each other. Why would we be interested in such a bad model? Because this way we can compare our model against a covariance matrix that could be expected if chance were operating alone. It is also the baseline against which your fitted model is compared in order to calculate relative indexes of model fit (e.g., CFI or TLI). For this purpose, the Chi-square statistic is used as an indicator of model fit; it represents the minimized difference between the structured (model-implied) and unstructured (observed) covariance matrix. The p-value in the output indicates whether the predicted model is compatible with the data observed; in other words, it tests whether the difference between the covariance matrix of the model vs. the data could have appeared just naturally by chance (null hypothesis). Unlike with the hypothesis tests you may be familiar with, keeping the null is what we actually want here. Therefore, larger probability values (p > .05) actually support our model because you do not have to reject the null hypothesis. Strictly interpreted, a small p-value thus indicates that our model does not fit the data because the sample covariance matrix and the estimated population covariance matrix differ significantly. HOWEVER, it is very unlikely to find a case in which both covariance matrices are perfectly congruent, so this is an unreasonable benchmark. Moreover, the chi-square estimate is very sensitive to sample size in that a lot of observations inflate the estimated deviance between the model and sample covariance matrices, since that value is multiplied by N-1. As the sample size (N) increases, so does the difference value. Interpreting this value alone does not make much sense, so usually other, more informative fit indices are used that build on the chi-square estimate. Let us quickly go through a few of them: we distinguish comparative from absolute fit indices. Comparative fit indices are relative indexes of model fit — they compare the fit of your model to a null model (the baseline model from above). For example, the comparative fit index (CFI) or the Tucker-Lewis index (TLI) indicate the degree to which your model is consistent with the data, where values above 0.95 traditionally indicate a good fit (Hu & Bentler, 1999); nevertheless, these fit indices are probably not comparable across disciplines. Absolute fit indices, on the other hand, compare the fit of your model to a perfectly fitting model, a scenario in which the model would not make any mistakes in reproducing the covariance matrix. The root mean squared error of approximation (RMSEA) is such an indicator, for which values below 0.06 indicate a “good fit”. But the RMSEA tends to overly punish small sample sizes (N < 150), which makes the SRMR a better candidate for absolute model fit, along with its robustness against model misspecifications (Shi & Maydeu-Olivares, 2019). In our example, the robust versions of the CFI and TLI amount to 0.75 and 0.74, respectively.
The TLI estimate from our model actually comes really close to the empirical average value of 0.73, a result from a meta-analysis by Chang, Connelly & Geeza (2012) in which the authors aggregated different studies on the model structure of the Big Five (e.g., correlated traits within a single method). The RMSEA probably lies somewhere between 0.063 and 0.064, which indicates an acceptable but not very good fit. In the section on latent variables, we can see the extent to which the items load on their expected latent variable. We focus on the Std.all column — it is standardized because both latent and observed variables have a variance of one. For example, the standardized loading of AGR4 on agreeableness is 0.76, so the response to “I sympathize with others’ feelings.” nicely indicates the degree to which a person aims for harmony in social interactions (not a surprise!). Our latent variables are not expected to be orthogonal, that is, they are allowed to covary. In the covariances section, the Std.all column can be interpreted as the estimated correlations between the latent variables given our model; here we have small to moderate values ranging from 0.02 (agreeableness — emotional stability) to 0.35 (extraversion — agreeableness). Note that in the Variances: section, the output is an estimate of the residual variance, the left-over variance that is not explained by the predictor(s). Larger values could therefore suggest that the items might be confounded with other influences. Take, for instance, the statement “I insult people.” (AGR3): a reversal of this response may not directly translate to agreeableness, but probably also measures the degree of aversion against others. Taking a look at the r-square values below, we see the estimated share of variance in each manifest variable explained by its respective latent variable. For example, latent emotional stability explains about 48%, so a bit less than half, of the variation in responses to EST1 (“I get stressed out easily.”).

By the way, if you want to quickly access the fit indices that matter the most to you, you can type the following:
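For example, with the fitted object from above:

# extract only the fit indices of interest
fitmeasures(big5_cfa, c("cfi", "tli", "rmsea", "srmr"))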

Note however that the indices are not 100% identical to the output from the summary function but come very close, even if we do take the standard (not robust) versions of the CFI and RMSEA. Honestly, I have no idea where this comes from. Do you? Please write in the comments below.

Model modification

Some people suggest looking at so-called modification indices to test whether beneficial changes to the model have been overlooked. If you call modificationindices(big5_cfa, sort = TRUE), for instance, you get a table with suggestions on how to modify your syntax (e.g., add regressions based on the sample covariance matrix) to improve model fit. By setting the sort argument to TRUE, you get the modifications with the largest impact on the chi-square estimate first. Nonetheless, if you are a fan of machine learning, this procedure should ring a bell: this sounds like overfitting your model to your data! Once you have made massive modifications to your model that were not planned a priori, your confirmatory analysis becomes an exploratory analysis. This is fine if you are aware of that change, report every detail of your decision-making process and make sure to include only the terms that make sense from a theoretical perspective. In our case, the output suggests defining emotional stability with a single indicator that is supposed to measure conscientiousness, namely EMO =~ CSN4, which is complete nonsense. A safer way is to test plausible models against each other, which is what we do now, building on our pinch of suspicion with respect to questionnaires. The focus on biases in personality questionnaires is not new, because applied researchers have long been worried that answers might be contaminated by shared variance that comes with the use of self-reports. The idea is the following: all latent variables (e.g., extraversion, agreeableness etc.) are confounded not only with random measurement error, which we expect due to the nature of data in general, but also with a systematic bias associated with the questionnaire method itself. Our substantive variables are probably contaminated by a common unmeasured factor that affects all variables to a similar extent. To model that common method variance (CMV) factor, we must borrow some knowledge from classical test theory: accordingly, a test score (e.g., the answer to a statement tapping into personality) can be understood as the sum of a person’s true score and a measurement error. Such an error is supposed to be due to random or unsystematic influences. But there are also cases in which measurement error can be something different: a non-random influence that can be attributed to a combination of the context (e.g., filling in an online survey) and the person. That so-called spurious measurement error (Schmidt, Le and Ilies, 2003) does not occur always, but under the same circumstances, and can therefore be seen as systematic. Because that error increases correlations among items regardless of the trait they are supposed to measure, it cannot be distinguished from true score variance. Therefore, CMV also increases correlations between traits because this is variance all of the items have in common. This is not what we want, because it makes it appear as if the latent variables were correlated in a meaningful way — but what if they actually aren’t?

We can understand common method variance as spurious measurement error which can be modelled as a latent variable using SEM (Podsakoff, MacKenzie, Lee and Podsakoff, 2003). Common method variance comes down to the way the data were measured and has nothing to do with what we attempt to measure. Therefore, we define CMV as a factor that underlies all items regardless of the personality dimension they are supposed to measure. Consequently, the CMV factor includes all manifest variables, and we need to make it orthogonal to our Big Five factors by including EXTRA + AGREE + EMO + OPEN + CON ~~ 0*CMV in our model syntax.
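A sketch of the extended model syntax, reusing the helper and model string from above (all object names are illustrative):

# method factor loading on all 50 items, forced to be orthogonal to the five traits
all_items <- paste(names(big5), collapse = " + ")

cmv_model <- paste(
  big5_model,
  paste("CMV =~", all_items),
  "EXTRA + AGREE + EMO + OPEN + CON ~~ 0*CMV",
  sep = "\n"
)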

Okay — maybe the model that includes that CMV-factor provides a better explanation of the data? Before diving into a one-to-one-comparison, we will take a look at the summary.
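Fitting and summarizing the CMV model works just like before (big5_cmv_cfa is an illustrative name):

big5_cmv_cfa <- cfa(cmv_model,
                    data      = big5,
                    std.lv    = TRUE,
                    estimator = "MLM")

summary(big5_cmv_cfa, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)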

Just as for our classical Big Five model, the Chi-square statistic suggests that both the baseline model and our user model do not fit the data, but as discussed previously, taken by itself this is not a good indicator of model fit due to its sensitivity to sample size. Our goodness-of-fit indicators propose a better model fit compared to our original model, as the CFI (0.79 vs. 0.75) and TLI (0.77 vs. 0.74) have increased. The badness-of-fit indicators are more mixed: the RMSEA (0.065 vs. 0.064) shows no improvement over the original model. Nevertheless, the SRMR (0.062 vs. 0.075) has decreased, and we should attach special value to this index because it has been shown to be most robust against model misspecifications (Shi & Maydeu-Olivares, 2019). Interestingly, the standardized factor loadings between the items and our personality variables have decreased to a moderate extent, suggesting that common method variance may account for a part of the participants’ responses. Because we are interested in the extent to which such a method factor distorts the relationships between personality dimensions, we are particularly interested in the covariances section. It appears that after including the CMV factor, the correlations among the personality variables have more or less disappeared, speaking for the assumption that they may have resulted from response biases (e.g., making oneself look like a better person overall on paper) in the first place. There is one exception: the correlation between emotional stability and conscientiousness dropped by a negligible amount (0.23 vs. 0.25), which suggests that the relationship between a person’s degree of neuroticism and their diligence is not as heavily influenced by a common method factor. Based on the r-square output, we can see that the estimated variance in item responses explained by our model somewhat improved for some items, but stayed the same for others. However, there are some limiting factors that affect the interpretation of a common method variance factor in this case: such interpretation problems are notorious in Multi-Trait-Multi-Method (MTMM) research (e.g., Chang, Connelly & Geeza, 2012), which is why more advanced research designs make use of different methods (e.g., external reports vs. self-reports) as control conditions to extract true personality differences, something that should be independent of the sources they come from. For example, a truly imaginative person should be rated as such equally by themselves and by others. We do not have such control conditions and therefore cannot say anything definitive about the nature of our CMV factor. For instance, some researchers argue that the variance captured by the CMV factor can be attributed to response biases, but alternatively it could reflect the presence of a higher order personality factor, similar to the g-factor of intelligence (something that would make you a “super” person; for a critical review see Ashton et al., 2009; Chang et al., 2012 or McCrae et al., 2008). So, the question whether the intercorrelations between the manifest variables are a sign of response biases or of an even bigger meta personality dimension remains unanswered and is a bit beyond the scope of this article.

Okay, this may feel a bit unsatisfying to you because a) there are theoretical limitations and b) going through the model output one by one does not provide a clear answer to the question of which model does the better job. This is why I have a nice extra for you…

Model comparison

If your models are nested (e.g., an original model A is just a subset of a new, larger model B), you can use a chi-square difference test to find out whether the difference in performance is statistically significant. But if we try, lavaan throws an error saying that “some models are based on a different set of observed variables”, suggesting that our original model is not nested within our CMV model, which makes the chi-square test uninterpretable. However, we can use the Akaike Information Criterion (AIC) for model comparison. It is a relative indicator of model fit for competing models fitted to the same dataset, having its roots in information theory: whenever we use a model, it is just a rough representation of processes that occur out in the wild, which is why a model can never be exact. The AIC estimates the relative amount of information lost due to the modelling itself — the smaller the loss, the higher the quality of our candidate compared to competing models. In particular, the AIC has a term that penalizes overfitting, because more complex models need to show some extra power above and beyond the advantage in prediction that comes simply from having more parameters to approximate the data: mathematically, the log likelihood of the complex model must be greater than the log likelihood of the simple model by at least the number of additional parameters for the AIC to decrease. So, the more complex model needs to try harder for a better fit. In other words, the AIC goes up by 2 for every additional parameter included in the model; for the AIC to go down again, the model needs to make up for this with an increase in log likelihood of at least 1 per added parameter (so that twice the log likelihood increases by at least 2). We can plug the desired fit index into the fitmeasures() function to compare our two models directly, as sketched below. The AIC is considerably smaller for the model that includes a common method variance factor than for the original model, suggesting that our CMV model results in less information loss even though it is the more complex model, and thus fits the data better.
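A sketch of the comparison with the objects defined above:

# a chi-square difference test would be anova(big5_cfa, big5_cmv_cfa),
# but it fails here with the error quoted above, so we compare AICs instead
fitmeasures(big5_cfa,     "aic")
fitmeasures(big5_cmv_cfa, "aic")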

A note on causality and limitations

Structural equation models are supposed to give you some hints about the covariances hidden in the data and are useful tools to confirm a strong theory. SEMs allow you to conduct a complex, multidimensional and fine-grained analysis of your data. But if we really wanted to know to what extent response biases distort personality measures, we would need to employ experimental designs as briefly discussed above (this has already been done, e.g., Ziegler & Bühner, 2009). In many SEM analyses, however, we do not have an experimental control condition to compare our CFA model to; in fact, we rely on the validity of our own theory. A correlation (or covariance) is neither an intuitively satisfying nor an explanatorily powerful estimate. Just as it is not enough to prove that a brain area is responsible for the perception of a certain category like houses or faces when it lights up in the scanner after their presentation, a good model fit alone does not prove that our theory is right. From a philosophical perspective (see Tarka, 2018), this is closely related to model equivalence: the fact that we can construct completely different models that nevertheless fit the data equally well means that a “good result” does not prove whether or not there is an even better model out there. This is why this might not be the end of our analysis yet. There is probably so much more to discover!

References

[1] N. Benson, D. M. Hulac, and J. H. Kranzler, Independent Examination of the Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV): What Does the WAIS-IV Measure? (2010), Psychol. Assess., vol. 22, no. 1, pp. 121–130.

[2] J. B. Ullman, Structural equation modeling: Reviewing the basics and moving forward (2006), J. Pers. Assess., vol. 87, no. 1, pp. 35–50.

[3] O. P. John, L. P. Naumann, and C. J. Soto, Paradigm shift to the integrative Big Five trait taxonomy: History, measurement, and conceptual issues (2008), Handb. Personal. Theory Res., pp. 114–158.

[4] M. Ziegler and M. Buehner, Modeling socially desirable responding and its effects (2009), Educ. Psychol. Meas., vol. 69, no. 4, pp. 548–565.

[5] M. Geiger, S. Olderbak, R. Sauter, and O. Wilhelm, The ‘g’ in faking: Doublethink the validity of personality self-report measures for applicant selection (2018), Front. Psychol., vol. 9, pp. 1–15.

[6] Open-Source Psychometrics Project (2019).

[7] L. R. Goldberg, Possible questionnaire format for administering the 50-item set of IPIP Big-Five factor markers (1992), Psychol. Assess, 4, 26–42.

[8] R. B. Kline, Principles and practice of structural equation modeling (2005), 2nd ed., New York: Guilford.

[9] A. Brown, Item response theory approaches to test scoring and evaluating the score accuracy (2017), Wiley Handb. Psychom. Test. A Multidiscip. Ref. Surv. Scale Test Dev., vol. 2–2, pp. 607–638.

[10] Newsom, Structural Equation Modeling (2018), Psy 523/623.

[11] M. C. Ashton and K. Lee, Higher order factors of personality: Do they exist? (2009), Pers. Soc. Psychol. Rev., vol. 13, no. 2, pp. 79–91.

[12] D. Shi and A. Maydeu-Olivares, The Effect of Estimation Methods on SEM Fit Indices (2020), Educ. Psychol. Meas., vol. 80, no. 3, pp. 421–445.

[13] L. T. Hu and P. M. Bentler, Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives (1999), Struct. Equ. Model., vol. 6, no. 1, pp. 1–55.

[14] L. Chang, B. S. Connelly, and A. A. Geeza, Separating method factors and higher order traits of the Big Five: A meta-analytic multitrait-multimethod approach (2012), J. Pers. Soc. Psychol., vol. 102, no. 2, pp. 408–426.

[15] F. L. Schmidt, H. Le, and R. Ilies, Beyond alpha: An empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual-differences constructs (2003), Psychol. Methods, vol. 8, no. 2, p. 206.

[16] P. M. Podsakoff, S. B. MacKenzie, J. Y. Lee, and N. P. Podsakoff, Common Method Biases in Behavioral Research: A Critical Review of the Literature and Recommended Remedies (2003), J. Appl. Psychol., vol. 88, no. 5, pp. 879–903.

[17] R. R. McCrae et al., Substance and Artifact in the Higher-Order Factors of the Big Five (2008), J. Pers. Soc. Psychol., vol. 95, no. 2, pp. 442–455.

[18] P. Tarka, An overview of structural equation modeling: its beginnings, historical development, usefulness and controversies in the social sciences (2018), Qual. Quant., vol. 52, no. 1, pp. 313–354.
