Factor Analysis on “Women Track Records” Data with R and Python

Not just dimensionality reduction, but rather find latent variables

Hands-on Tutorials

Photo by Nicolas Hoizey on Unsplash

Factor Analysis (FA) and Principal Component Analysis (PCA) are both described as dimensionality reduction techniques, but the main objective of Factor Analysis is not to reduce the dimensionality of the data. Factor Analysis is a useful approach for finding latent variables, which are not directly measured by any single variable but are instead inferred from other variables in the dataset. These latent variables are called factors, so factor analysis is a model for the measurement of latent variables. For example, if we find two latent variables in our model, it is called a two-factor model. The main assumption of FA is that such latent variables exist in our data.

Today, we perform Factor Analysis using the principal component method, which is very similar to Principal Component Analysis. The data we use here are the National Track Records for Women, covering 55 countries in seven different events. You can download the dataset here. The dataset looks as follows.

First 5 rows of women track records data

The description of the variables is as follows: X1, X2 and X3 correspond to the 100 m, 200 m and 400 m events, X4 to the 800 m event, and X5, X6 and X7 to the 1,500 m, 3,000 m and Marathon events; each value is the national record for that event.

Prerequisites

This article assumes that you have some knowledge of Principal Component Analysis. If not, please read my previous articles:

Perform FA on women track records data with R

Select the best number of factors for the dataset

Here, we use the principal() function from the psych package to apply FA to our dataset. The principal() function performs factor analysis with the principal component method.

The following code block performs FA on our data.
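A minimal sketch of that call is shown below, assuming the seven event columns are kept in a data frame named X; the file name and the position of the country column are also assumptions, not details taken from the original code.

```r
# Load the psych package, which provides principal()
library(psych)

# Read the data (assumed file name) and drop the country column (assumed first)
track <- read.csv("track_records.csv")
X <- track[, -1]   # keep only the seven numeric event variables

# FA by the principal component method: 4 factors, no rotation,
# covar = FALSE -> work with the correlation matrix (standardized variables)
fa <- principal(X, nfactors = 4, rotate = "none", covar = FALSE)
fa
```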

The rotation is set to none for now. In addition, we request 4 factors for our data; later, we will keep only the best number of factors, which will be less than 4. The covar argument is very important. By setting covar = FALSE, the FA procedure uses the correlation matrix instead of the covariance matrix. Factors derived from the correlation matrix are the same as those derived from the covariance matrix of the standardized (scaled) variables. Therefore, by setting covar = FALSE, the data are effectively centred and scaled before the analysis, and we do not need to do explicit feature scaling even if there are significant differences in scale between the features of our dataset.

Let’s get the eigenvalues by running fa$values.

Rounding these values to 3 decimal places gives:

Therefore,

  • The variance explained by the first factor alone is: (5.068/7) x 100% = 72.4%
  • The variance explained by the first two factors together is: ((5.068+0.602)/7) x 100% = 81%
  • The variance explained by the first three factors together is: ((5.068+0.602+0.444)/7) x 100% = 87.34%

and so on…
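These proportions can be computed directly from the eigenvalues. A small sketch, reusing the fa object fitted above:

```r
# Eigenvalues rounded to 3 decimal places
round(fa$values, 3)

# Proportion of total variance explained by each factor (7 variables in total),
# followed by the cumulative proportion
round(fa$values / 7, 4)
round(cumsum(fa$values) / 7, 4)
```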

Let’s create the scree plot, which is a visual representation of the eigenvalues.
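A base-R sketch of such a plot (the original plotting code may differ):

```r
# Scree plot: eigenvalues plotted in decreasing order
plot(fa$values, type = "b",
     xlab = "Factor number", ylab = "Eigenvalue", main = "Scree Plot")
abline(h = 1, lty = 2)   # reference line for Kaiser's rule (eigenvalue = 1)
```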

The bend clearly occurs at the 2nd eigenvalue. According to Kaiser’s rule, it is recommended to keep the components with eigenvalues greater than 1.0, and only the first of our eigenvalues is greater than 1.0. However, keeping only the first factor, corresponding to the eigenvalue of 5.068, explains just about 72.4% of the variance in the data. So, we keep the second one as well; the first two factors together explain 81% of the variance in the data.

So, we keep two factors for our data.

Find out whether a factor rotation is needed for the data

Let’s perform FA again with nfactors = 2 (previously, nfactors = 4). This is because we have decided to keep only two factors for our data. Then, we get the following factor loadings.

By looking at the output, we can see that the original factor loadings do not yield interpretable factors, because the loadings are not well spread across the two factors. Therefore, we try factor rotation, which is an orthogonal transformation of the original factors and is done purely to aid interpretation. By setting rotate = "varimax" in the principal() function, we perform FA again with varimax factor rotation. Then, we get the factor loadings again.
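A sketch of the two fits, without and with varimax rotation, again assuming the data frame X from earlier:

```r
# Unrotated two-factor solution
fa2_none <- principal(X, nfactors = 2, rotate = "none", covar = FALSE)
fa2_none$loadings

# Two-factor solution with varimax rotation, used for the rest of the analysis
fa2 <- principal(X, nfactors = 2, rotate = "varimax", covar = FALSE)
fa2$loadings
```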

Calculate the factor loadings, communalities and specific variances for the selected model

After running the fa2 model, you will get the following output.

The above output contains the rotated factor loadings, communalities and specific variances in one table. The RC1 and RC2 columns denote the loadings for the two factors that we selected. The h2 column contains the communality of each variable, and the u2 column its specific variance. Running fa2$loadings and fa2$communality will also give the rotated factor loadings and communalities separately.
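For reference, the individual pieces can be pulled out as follows; since the variables are standardized, each specific variance is simply 1 minus the corresponding communality:

```r
fa2                   # full table: loadings (RC1, RC2), h2 and u2
fa2$loadings          # rotated factor loadings only
fa2$communality       # communalities (h2)
1 - fa2$communality   # specific variances (u2)
```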

Interpret the communality values and specific variances

[Communalities]

  • X1: About 86% of the variability of X1 is explained by the two factors that we selected.
  • X2: About 87% of the variability of X2 is explained by the two factors that we selected.
  • X3: About 78% of the variability of X3 is explained by the two factors that we selected.
  • X4: About 85% of the variability of X4 is explained by the two factors that we selected.
  • X5: About 80% of the variability of X5 is explained by the two factors that we selected.
  • X6: About 74% of the variability of X6 is explained by the two factors that we selected.
  • X7: About 78% of the variability of X7 is explained by the two factors that we selected.

Clearly, a high proportion of the variance of every variable is explained by the two factors that we selected.

[Specific variances]

  • The effect of the specific factor on X1 is about 14%.
  • The effect of the specific factor on X2 is about 13%.
  • The effect of the specific factor on X3 is about 22%.
  • The effect of the specific factor on X4 is about 15%.
  • The effect of the specific factor on X5 is about 20%.
  • The effect of the specific factor on X6 is about 26%.
  • The effect of the specific factor on X7 is about 22%.

Identify the factors in the model

Let’s see the rotated factor loadings again.

It is clear that variables X1, X2 and X3 define factor 1 (high loadings on factor 1, relatively small loadings on factor 2), while variables X4, X5, X6 and X7 define factor 2 (high loadings on factor 2, relatively small loadings on factor 1). However, variable X4 shares aspects of the attributes represented by both factors, as it has approximately equal loadings on the two.

To give names to the two factors, let’s draw on the domain knowledge of the field and recall the meanings of the variables X1, X2, …, X7 given in the variable description above.

The dataset represents the National Track Records for Women across 55 countries in seven different events. Generally, in short-distance running (e.g. 100 m, 200 m, 400 m), athletes mainly rely on speed, whereas in long-distance running (e.g. 1,500 m, 3,000 m, Marathon) they mainly rely on tolerance or endurance. In our analysis, factor 1 represents the short-distance track records (since X1, X2 and X3 define factor 1) and factor 2 represents the long-distance track records (since X4, X5, X6 and X7 define factor 2). Therefore, we can give the two factors relevant names as follows.

  • Factor 1 → speed factor
  • Factor 2 → tolerance or endurance factor

Note also that athletes who compete in the 800 m event (represented by X4) must maintain a good balance between speed and endurance, which is reflected in X4 having approximately equal loadings on both factors.

Image by author

These are the actionable insights that can be obtained by performing Factor Analysis (FA) on "women track records" data.

Perform FA on women track records data with Python

In Python, FA can be performed using the factor_analyzer library, with the help of other libraries such as matplotlib, pandas, numpy and scikit-learn. Here, we obtain exactly the same results, but with a different approach: instead of using the correlation matrix, we use the covariance matrix and perform the feature scaling manually before running the FA. We then provide the standardized (scaled) data to the object created from the FactorAnalyzer() class.

To perform FA, we first create an object (called fa) from the FactorAnalyzer() class by specifying relevant values for its hyperparameters. Then, we call its fit() method on the scaled data to perform FA, and call various methods and attributes of the fa object to get all the information we need. The outputs are numpy arrays, so we can use several print() calls to format them nicely. Here, we also create the scree plot. Finally, we call the transform() method of the fa object to get the factor scores and store them in a CSV file and an Excel file for future use. The dimension of the new (reduced) dataset is 55 x 2; it has only two columns because we decided to keep only two factors, which together explain about 81% of the variability in the original data.

The following Python code block performs FA on our dataset.
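A minimal sketch of such a script; the file name, the position of the country column and the column labels RF1/RF2 are assumptions rather than details taken from the original code.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer

# Load the data (assumed file name) and drop the country column (assumed first)
df = pd.read_csv('track_records.csv')
X = df.drop(columns=df.columns[0])   # keep only the seven numeric event columns

# Standardize (centre and scale) the variables before the analysis
X_scaled = StandardScaler().fit_transform(X)

# Two factors, principal component method, varimax rotation;
# is_corr_matrix=False because we pass the scaled data, not a correlation matrix
fa = FactorAnalyzer(n_factors=2, rotation='varimax',
                    method='principal', is_corr_matrix=False)
fa.fit(X_scaled)

# Rotated factor loadings, communalities and specific variances
print(pd.DataFrame(fa.loadings_, index=X.columns, columns=['RF1', 'RF2']).round(3))
print('Communalities:', fa.get_communalities().round(3))
print('Specific variances:', fa.get_uniquenesses().round(3))
```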

The following code block creates the scree plot.
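A sketch with matplotlib, using the eigenvalues returned by the fitted fa object:

```python
import matplotlib.pyplot as plt

# get_eigenvalues() returns (original eigenvalues, common-factor eigenvalues)
eigenvalues, _ = fa.get_eigenvalues()

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, 'o-')
plt.axhline(y=1, linestyle='--')   # reference line for Kaiser's rule
plt.xlabel('Factor number')
plt.ylabel('Eigenvalue')
plt.title('Scree Plot')
plt.show()
```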

After running the following code block, the factor scores are stored in a CSV file (track_records_81_var.csv) and an Excel file (track_records_81_var.xlsx), which will be saved in the current working directory. The dimension of the new (reduced) data is 55 x 2, because we decided to keep only two factors, which together explain about 81% of the variability in the original data.
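A sketch of that step; writing the .xlsx file assumes an Excel engine such as openpyxl is installed:

```python
# Factor scores for the 55 countries (shape: 55 x 2)
scores = pd.DataFrame(fa.transform(X_scaled), columns=['RF1', 'RF2'])

scores.to_csv('track_records_81_var.csv', index=False)
scores.to_excel('track_records_81_var.xlsx', index=False)
```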

The following image shows the first 5 observations of the new dataset. RF1 stands for Rotated Factor 1 and RF2 stands for Rotated Factor 2. The shape of the dataset is 55 x 2.

First 5 observations of the reduced dataset

We can use the new (reduced) dataset for further analysis.

Summary

Factor Analysis is not just for dimensionality reduction; rather, its purpose is to find latent variables. Here, we have used two different programming languages to perform FA. Both languages provide high-level, built-in functions, although R’s outputs are more nicely formatted. We have obtained exactly the same results with two different approaches. First, we used the correlation matrix by setting covar = FALSE in the principal() function. As our second approach, we used the covariance matrix of the scaled data by setting is_corr_matrix = False when creating the FactorAnalyzer() object.

Selecting the best number of factors is subjective and depends on the data and its domain. Sometimes, we cannot decide the best number of factors by looking at the scree plot alone, and Kaiser’s rule is not a hard rule either; there is always some flexibility. In general, we should maintain a good balance (trade-off) between the number of factors and the amount of variability that the selected factors explain together.

Thanks for reading!

This tutorial was designed and created by Rukshan Pramoditha, the Author of Data Science 365 Blog.

Read my other articles at https://rukshanpramoditha.medium.com

Technologies used in this tutorial

  • Python & R (High-level programming languages)
  • pandas (Python data analysis and manipulation library)
  • matplotlib (Python data visualization library)
  • Scikit-learn (Python machine learning library)
  • Jupyter Notebook & RStudio (Integrated Development Environments)

Machine learning used in this tutorial

  • Principal Component Analysis (PCA)
  • Factor Analysis (FA)

Statistical concepts used in this tutorial

  • Correlation matrix
  • Variance-covariance matrix

Mathematical concepts used in this tutorial

  • Eigenvalues

2021–02–05

