Introduction to Biplot Analysis: Get Insights based on Indonesia Poverty Data

Photo by Dikaseva on Unsplash

Hands-on Tutorial

Biplot analysis is a graphical representation of multivariate data that plots the observations and the variables together in Cartesian coordinates.

BACKGROUND

According to Nurwati (2008), poverty is a problem that humans have always faced. It is as old as humanity itself, and its implications touch many aspects of human life. In other words, poverty is a global social problem: it draws attention worldwide and exists in every country, although its impact differs from place to place. Even so, poverty is often not recognized as a problem. For people who are classified as poor, it is a daily reality. However, it is hard to conclude that they are aware of the condition they live in.

The factors that cause poverty include low awareness of education, poor health, limited employment opportunities, and isolation.

Therefore, biplot analysis is used to identify the characteristics of the provinces in Indonesia based on population and poverty factors. It presents the results graphically, which makes them easier to understand and more attractive, informative, and communicative. Using the biplot, the relationship between population and poverty in each province can also be identified visually.

Objectives

The objectives of the study are:

  • to identify the relationship between the population and poverty variables in Indonesia, such as the population in each province, the poverty line, the percentage of the farmer population, the average poverty gap index (P1), and the average poverty severity index (P2)
  • to identify the relative position of the provinces in order to find those with similar characteristics
  • to obtain the characteristics of provinces on the variables of population and poverty in Indonesia
  • to find out the provinces with the lowest poverty level based on the population and poverty variables

Benefits

The benefits of the study are:

  • to identify the variables that have a strong relationship with the poverty level in Indonesia
  • to identify the characteristics of provinces based on the population and poverty variables so that the findings can be considered in public policy discussion
  • to evaluate the performance of local governments in carrying out regional development in order to alleviate poverty in Indonesia

Scopes

The study is limited to secondary data: the 2010 Indonesia poverty data and the 2010 village potential data. Both can be obtained from the Indonesia Central Bureau of Statistics. The documentation is available at https://github.com/audhiaprilliant/Multivariate-Analysis.

METHODOLOGY

Data

The data used are secondary data from the 2010 Indonesia poverty data and the 2010 village potential data. Both can be obtained from the Indonesia Central Bureau of Statistics. The data have the following variables:

Table 1 Data for research (Image by Author)

The procedure of data analysis

The steps carried out in the biplot analysis of the 2010 Indonesia poverty and village potential data are as follows:

  1. aggregate the Indonesia poverty data by province. The variables used are the percentage of the poor population, the P1 index, the P2 index, and the poverty line
  2. aggregate the village potential data by province. The variables used are the male population, the female population, the household population, and the percentage of the farmer population
  3. merge the Indonesia poverty and village potential data based on the province name
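The three steps above can be sketched in base R. This is only an illustration under stated assumptions: the two small data frames, their column names, and their values are hypothetical, and the choice of the mean for poverty indicators and the sum for population counts is an assumption, since the article does not state which aggregation function was used.

```r
# Hypothetical records below province level (illustrative values only)
poverty <- data.frame(
  province = c("Aceh", "Aceh", "Bali", "Bali"),
  poor_pct = c(20, 22, 5, 7),
  p1       = c(4.0, 4.4, 0.8, 1.0)
)
village <- data.frame(
  province = c("Aceh", "Aceh", "Bali", "Bali"),
  male     = c(100, 150, 200, 250),
  female   = c(110, 140, 210, 240)
)

# Step 1: aggregate the poverty indicators by province (assumed: mean)
poverty_by_prov <- aggregate(cbind(poor_pct, p1) ~ province,
                             data = poverty, FUN = mean)

# Step 2: aggregate the population counts by province (assumed: sum)
village_by_prov <- aggregate(cbind(male, female) ~ province,
                             data = village, FUN = sum)

# Step 3: merge the two tables on the province name
merged <- merge(poverty_by_prov, village_by_prov, by = "province")
print(merged)
```

By default `merge()` keeps only the provinces that appear in both tables, so a province whose name is spelled differently in the two sources would silently disappear from the result.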

RESULT AND DISCUSSION

Data pre-processing and Integration

The initial step in this study is data pre-processing. It is carried out as follows:

  • standardize the formatting of the province names in both the Indonesia poverty and village potential data (each word starts with a capital letter)
  • check the formatting of the province names in both data sets
  • correct any province names that are still formatted wrongly in either data set
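A minimal sketch of these pre-processing steps is shown here. The helper `to_title_case()` and the example names are hypothetical and are not taken from the author's repository; the idea is simply to bring both sets of names to the same form and then list the names that still do not match.

```r
# Hypothetical helper: trim, collapse spaces, and capitalise each word
to_title_case <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("\\s+", " ", x)
  gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE)
}

# Illustrative province names as they might appear in the two sources
poverty_names <- c("JAWA BARAT", "bali", " Aceh ")
village_names <- c("Jawa   barat", "BALI", "Nanggroe Aceh Darussalam")

poverty_clean <- to_title_case(poverty_names)
village_clean <- to_title_case(village_names)

# Names in the poverty data with no counterpart in the village data
unmatched <- setdiff(poverty_clean, village_clean)
print(unmatched)
```

Names that remain in `unmatched` after standardization (here a short name versus a long official name) need to be corrected by hand before merging. Note that this helper also lowercases abbreviations, so "DKI Jakarta" would become "Dki Jakarta" in both data sets.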

Biplot Analysis

Biplot analysis is a multivariate analysis that compresses the information in the data and displays it in Cartesian coordinates using Principal Component Analysis (PCA). To determine how much variance each component explains, it is necessary to calculate the eigenvalues. The eigenvalues are displayed in Table 2.

# Load the libraries
library(factoextra)
library(FactoMineR)
# Read the data
data_biplot = read.csv(file = 'Social Poverty Indonesia 2010.csv',
                       header = TRUE,
                       sep = ',')
# Inspect the column names
colnames(data_biplot)
# Drop the 10th column so that it is excluded from the PCA
data_biplot = data_biplot[,-10]
str(data_biplot)
# Use the province (1st column) as row names
data_biplot_no_province = data_biplot[,-1]
rownames(data_biplot_no_province) = data_biplot[,1]
data_biplot = data_biplot_no_province

The biplot analysis

# Biplot analysis
res_pca = PCA(data_biplot, 
              graph = FALSE
)
print(res_pca)
# Biplot graph
fviz_pca_biplot(res_pca,
                repel = TRUE,
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
)
# Calculate and print the eigenvalues (Table 2)
eig_val = get_eigenvalue(res_pca)
eig_val
# Scree plot
fviz_eig(res_pca,
         addlabels = TRUE,
         ylim = c(0, 50)
)
Table 2 Eigenvalue of PCA (Image by Author)

Based on Table 2, the 1st and 2nd components together explain 72.148% of the total variance. The biplot therefore represents the variance of the data reasonably well.

Figure 1 Biplot analysis and scree plot (Image by Author)

Based on the scree plot in Figure 1 (right), the percentage of explained variance does not drop steeply at the 2nd component, but it does drop steeply at the 3rd component. This indicates that adding the third component would still affect the explained variance. However, only the first and second components are used for the biplot. Together, these two components explain about 72.148% of the total variance.
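The percentages in an eigenvalue table follow directly from the eigenvalues: each eigenvalue is divided by their sum, and the cumulative percentage is the running total. The sketch below shows this with base R's `prcomp()` on the built-in `USArrests` data, which is used here only as a stand-in because the poverty data set is not bundled with R. The variable names are illustrative.

```r
# Stand-in data: built-in USArrests, NOT the poverty data from the article
pca <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues are the variances of the principal components
eigenvalues <- pca$sdev^2

# Share of the total variance explained by each component, in percent
variance_pct <- 100 * eigenvalues / sum(eigenvalues)

# Running total: variance explained by the first k components
cumulative_pct <- cumsum(variance_pct)

eig_table <- data.frame(
  eigenvalue     = eigenvalues,
  variance_pct   = variance_pct,
  cumulative_pct = cumulative_pct
)
print(round(eig_table, 3))
```

The second row of `cumulative_pct` plays the same role as the 72.148% reported above: it is the share of variance kept when only the first two components are plotted.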

Analysis of observations against the result of biplot analysis

After the biplot analysis is carried out, each row (observation) and each column (variable) is examined to evaluate the results of the biplot analysis. The coordinates of each province on the principal components are displayed in Table 3. Because the biplot only uses the 1st and 2nd components, we focus on the first two columns.

# Graph of observations
ind = get_pca_ind(res_pca)
# Coordinates of observations
ind$coord
Table 3 Coordinates of observation (Image by Author)

Because the 1st and 2nd components explain 72.148% of the total variance, 27.852% of the information is lost. As a consequence, some observations may not be represented properly by the biplot. The appropriate measure for checking this is the squared cosine. If an observation is adequately represented by the biplot, the sum of its squared cosines on the 1st and 2nd components is close to 1.
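To make the squared cosine concrete, the sketch below computes it by hand: for each observation, the squared coordinate on a component is divided by the squared distance of that observation from the origin in the full component space. It uses base R's `prcomp()` on the built-in `USArrests` data as a stand-in for the poverty data, and the cut-off of 0.7 is an illustrative assumption.

```r
# Stand-in data: built-in USArrests, NOT the poverty data from the article
pca <- prcomp(USArrests, scale. = TRUE)
scores <- pca$x  # coordinates of the observations on all components

# Squared cosine: squared coordinate divided by squared distance to origin
cos2 <- scores^2 / rowSums(scores^2)

# Quality of representation on the plane of components 1 and 2
quality_12 <- rowSums(cos2[, 1:2])

# Observations below the (assumed) cut-off are poorly represented
poorly_represented <- names(quality_12)[quality_12 < 0.7]
print(poorly_represented)
```

Across all components the squared cosines of one observation sum to 1, so `quality_12` can be read as the share of that observation's information that survives in the two-dimensional plot.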

# Squared Cosine of observations
ind$cos2
Table 4 Square cosine of observation (Image by Author)

Based on Table 4, several provinces are not well represented in the biplot: Banten, DKI Jakarta, Gorontalo, Central Kalimantan, Aceh, Riau, South Sulawesi, Southeast Sulawesi, and North Sulawesi. Their squared cosine is quite small, below 0.7. The squared cosine values are illustrated in Figure 2.

# Color by cos2 values: quality on the factor map
fviz_pca_ind(res_pca,
             col.ind = 'cos2',
             gradient.cols = c('#00AFBB', '#E7B800', '#FC4E07'),
             repel = TRUE # Avoid text overlapping
)
# Cos2 of individuals on 1st component and 2nd component
fviz_cos2(res_pca,
          choice = 'ind',
          axes = 1:2
)
Figure 2 Biplot analysis and square cosine of observation (Image by Author)

Analysis of poverty and village potential data indicators against the result of biplot analysis

The coordinates for each variable in the two components (PCA) are displayed in Table 5. Because the biplot analysis only uses the 1st component and 2nd component, we will keep the focus on the first two columns.

# Graph of variables
var = get_pca_var(res_pca)
# Coordinates of variables
var$coord
Table 5 Coordinates for the variable (Image by Author)

Because only the 1st and 2nd components are used, some poverty and village potential variables may not be well represented by the biplot. To check this, we look at the squared cosine. If a variable is well represented, the sum of its squared cosines on the 1st and 2nd components is close to 1. The squared cosine of each variable is displayed in Table 6.

# Squared Cosine of variables
var$cos2
Table 6 Square cosine of the variable (Image by Author)

Based on Table 6, a few variables are not well represented: the percentage of the poor population and the average poverty line. Their squared cosine is quite small, below 0.7.

# Color by cos2 values: quality on the factor map
fviz_pca_var(res_pca,
             col.var = 'cos2',
             gradient.cols = c('#00AFBB', '#E7B800', '#FC4E07'),
             repel = TRUE # Avoid text overlapping
)
# Cos2 of variables on 1st component and 2nd component
fviz_cos2(res_pca,
          choice = 'var',
          axes = 1:2
)
Figure 3 Biplot analysis and square cosine of the variable (Image by Author)

The correlation of variables

The relationship between variables can be seen from their correlation. On the biplot, the correlation between two variables is approximated by the cosine of the angle between their vectors. If the vectors coincide, or the angle between them approaches 0, the correlation approaches 1. A correlation close to 1 indicates a strong positive linear relationship between the two variables. The male population and the female population have the highest correlation, around 0.9996. The next highest is between the P1 index and the P2 index, at 0.9910.
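The link between angle and correlation can be checked numerically. In the sketch below, the variable coordinates are the PCA loadings scaled by the component standard deviations. Using all components, the cosine between two variable vectors equals their correlation exactly; using only the first two components, as on a biplot, it is an approximation. The built-in `USArrests` data is again a stand-in for the poverty data, and all names are illustrative.

```r
# Stand-in data: built-in USArrests, NOT the poverty data from the article
X <- scale(USArrests)
pca <- prcomp(X)

# Variable coordinates: loadings scaled by the component standard deviations
var_coord <- pca$rotation %*% diag(pca$sdev)

# Cosine of the angle between every pair of variable vectors
cosine_matrix <- function(v) {
  norms <- sqrt(rowSums(v^2))
  (v %*% t(v)) / (norms %o% norms)
}

cosine_full <- cosine_matrix(var_coord)         # all components
cosine_2d   <- cosine_matrix(var_coord[, 1:2])  # biplot plane only

# Largest gap between the full-space cosines and the true correlations
max_gap_full <- max(abs(cosine_full - cor(USArrests)))
print(max_gap_full)

# The two-dimensional cosines only approximate the correlations
print(round(cosine_2d, 3))
print(round(cor(USArrests), 3))
```

The size of the gap between `cosine_2d` and the correlation matrix depends on how well each variable is represented on the first two components, which is why the squared cosine of the variables is worth checking before reading angles off the plot.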

The correlation of variables (Image by Author)

CONCLUSION AND INTERPRETATION

Based on Figure 1, the relationship between observations and variables can be grouped into four categories as follows:

  • The first category. It is dominated by the female population, male population, and household population. The provinces that spread around these variables are West Java, Central Java, East Java, and North Sumatra. This indicates that these provinces have a fairly large population compared to the others in Indonesia
  • The second category. It has only one variable, the poverty line. This category corresponds to Banten, Bangka Belitung, Riau, Bali, East Kalimantan, Central Kalimantan, South Kalimantan, Jambi, West Sumatra, and DKI Jakarta. These provinces have a higher poverty line than the others in Indonesia
  • The third category. It is dominated by the P1 index and the P2 index. It consists of Lampung, East Nusa Tenggara, and South Sulawesi. They have a higher P1 index and P2 index than the other provinces in Indonesia
  • The fourth category. It has two dominant variables: the percentage of the poor population and the percentage of the farmer population. It consists of West Nusa Tenggara, Central Sulawesi, North Sulawesi, Southeast Sulawesi, Nanggroe Aceh Darussalam, and Bengkulu

REFERENCES

[1] N. Nurwati. Kemiskinan: model pengukuran, permasalahan, dan kemiskinan alternatif kebijakan (2008), Jurnal Kependudukan Padjadjaran. 10(1): 1–11.

