Understanding Feature Extraction Using Correlation Matrix and Scatter Plots

Data in the real world is vast and must be handled thoughtfully to achieve any sensible outcome through a data science approach.

Tarun Acharya
Towards Data Science


This article covers a fundamental and important concept for working with a large number of features in a dataset.

Photo by Carlos Muza on Unsplash

Any typical machine learning or deep learning model is built to produce a single output from large amounts of data, be it structured or unstructured. The input factors may contribute to the required result with varying coefficients and degrees, so they need to be filtered based on their significance in determining the output, while also accounting for redundancy among them.

In supervised learning, there is always one output variable and n input variables. To understand this concept clearly, let's take the example of a simple linear regression problem.

In a simple linear regression model, we ultimately generate an equation of the form y = mx + c, where x is an independent variable and y is a dependent variable. Since there is only one input variable, y depends entirely on the value of x. In reality, there may be other, ignored external factors, such as air resistance when calculating the average velocity of a bus travelling from A to B. These do impact the output, but their significance is minimal. In such cases, common sense and experience help us pick the right factor: we keep the acceleration applied by the driver and ignore the air resistance. But what about complex situations where we have no idea how significant each input variable is to the output? Can mathematics solve this puzzle?

Yes! Here comes the concept of correlation.

Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. In simple terms, it tells us how much one variable changes for a slight change in another. It can take positive, negative, or zero values depending on the direction of the change. A high correlation between a dependent and an independent variable indicates that the independent variable is highly significant in determining the output. In a multiple regression setup with many factors, it is imperative to find the correlation between the dependent variable and all the independent variables to build a more viable, more accurate model. One must always remember that more features do not imply better accuracy: additional features can reduce accuracy if they are irrelevant and merely add noise to the model.

Correlation between two variables can be measured with various statistics, such as the Pearson r correlation, Kendall rank correlation, and Spearman rank correlation.

Pearson r correlation is the most widely used correlation statistic for measuring the degree of the relationship between linearly related variables. The Pearson correlation between any two variables x and y can be found using:

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )

where n is the number of observations, i denotes the i-th observation, and x̄ and ȳ are the means of x and y.
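As a quick illustration, here is a minimal sketch of this formula in Python; the x and y arrays are made-up example values, not taken from any dataset:

import numpy as np

# Made-up example data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson r: sum of co-deviations divided by the product of deviation norms
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

print(r)                         # manual computation
print(np.corrcoef(x, y)[0, 1])   # should match NumPy's built-in result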

Let us consider the 50_Startups dataset on new startups in New York, California, and Florida. The variables in the dataset are Profit, R&D Spend, Administration Spend, and Marketing Spend, where Profit is the dependent variable to be predicted.

Let us first apply linear regression to each independent variable separately to visualize its correlation with the dependent variable.
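A minimal sketch of that step, assuming the Kaggle CSV is available locally and that scikit-learn and matplotlib are installed (column names as in the dataset):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv('50_Startups.csv')  # assumed local path to the Kaggle CSV

for col in ['R&D Spend', 'Administration', 'Marketing Spend']:
    X = df[[col]]      # one independent variable at a time
    y = df['Profit']   # dependent variable
    model = LinearRegression().fit(X, y)

    # Scatter plot of the feature against Profit, with the fitted line
    plt.scatter(X, y)
    plt.plot(X, model.predict(X), color='red')
    plt.xlabel(col)
    plt.ylabel('Profit')
    plt.title(f'{col} vs Profit (R^2 = {model.score(X, y):.2f})')
    plt.show()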

From the scatter plots, we can see that R&D Spend and Profit have a very high correlation, implying greater significance in predicting the output, while Marketing Spend has a lower correlation with Profit than R&D Spend does.

But the scatter plot of Administration against Profit shows that the correlation between them is very low and might end up creating noise in the prediction. Thus, we can exclude this feature from our model for a better result.

This process eliminates the insignificant and irrelevant features from our model. But what about redundant features?

Redundant Features: Although some features are highly relevant to our target variable, they might be redundant. Any two independent variables are considered redundant if they are highly correlated with each other. Redundancy wastes time and space unnecessarily, and it too can be detected using correlation.

Note: A high correlation between the dependent and an independent variable is desired, whereas a high correlation between two independent variables is undesired.

The above two graphs show the correlation between independent variables. We can see a high correlation in the first graph and a very low one in the second. This means we can exclude either of the two features in the first graph, since correlation between two independent variables causes redundancy. But which one to remove? The answer is straightforward: the variable with the higher correlation with the target variable stays, and the other is removed.
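Here is a sketch of that rule in code, assuming df is the DataFrame loaded in the earlier sketch; the feature pair and the 0.8 threshold are illustrative choices, not values from the article:

# Correlation between two independent variables
corr_between = df['R&D Spend'].corr(df['Marketing Spend'])

# If they are highly correlated, keep whichever correlates more with the target
if abs(corr_between) > 0.8:  # illustrative threshold
    r1 = abs(df['R&D Spend'].corr(df['Profit']))
    r2 = abs(df['Marketing Spend'].corr(df['Profit']))
    drop = 'Marketing Spend' if r1 >= r2 else 'R&D Spend'
    print(f'Redundant pair; dropping {drop}')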

Determining the correlation between the variables:

import pandas as pd

# Assumes the Kaggle CSV is available locally
data = pd.read_csv('50_Startups.csv')
df = pd.DataFrame(data, columns=['R&D Spend', 'Administration', 'Marketing Spend', 'Profit'])
corrMatrix = df.corr()  # pairwise Pearson correlations
print(corrMatrix)

Output: The output is a 4×4 matrix showing the Pearson r correlation between every pair of variables.

Correlations among all the variables in the dataset.

Finally, we compare various multiple regression models based on their R² scores.

Scores by picking combinations of features.
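Here is a sketch of how such a comparison might be run, assuming df is the DataFrame built above; the train/test split and the exhaustive feature combinations are illustrative:

from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

features = ['R&D Spend', 'Administration', 'Marketing Spend']
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['Profit'], test_size=0.2, random_state=0)

# Fit a model on every non-empty feature combination and report its R^2
for k in range(1, len(features) + 1):
    for combo in combinations(features, k):
        cols = list(combo)
        model = LinearRegression().fit(X_train[cols], y_train)
        print(f'{cols}: R^2 = {model.score(X_test[cols], y_test):.3f}')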

From these scores, we observe that:

-> Independent variables with low correlation lead to lower R² scores (e.g., taking Administration alone).

-> Variables with higher correlation give a higher R² score in our model (e.g., R&D Spend and Marketing Spend).

> Eliminating redundant or irrelevant variables may cause at most a negligible loss in accuracy, but it makes the model far more efficient under many constraints.

For further reference:

Dataset: https://www.kaggle.com/farhanmd29/50-startups

Code: https://github.com/Tarun-Acharya/50_Startups_Regression
