Feature selection — Correlation and P-value

Vishal R
Towards Data Science
2 min read · Sep 11, 2018

This article is being moved to my Substack Publication. You can read the article for free here. This post will be deleted on 18th June 2022.

A new dataset often comes with a plethora of features, and not all of them are useful for building a machine learning model that makes the necessary prediction. Some features can even make the predictions worse. So, feature selection plays a huge role in building a machine learning model.

In this article we will explore two measures that we can use on the data to select the right features.

What is correlation?

Correlation is a statistical term which in common usage refers to how close two variables are to having a linear relationship with each other.

For example, two variables which are linearly dependent (say, x and y which depend on each other as x = 2y) will have a higher correlation than two variables which are non-linearly dependent (say, u and v which depend on each other as u = v²).
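The comparison above can be checked numerically. The sketch below is only an illustration: the `pearson` helper and the sample values are assumptions made for this example and are not part of the original article. It computes the Pearson correlation coefficient, the usual measure of linear relationship, for both pairs of variables.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative sample values, symmetric around zero.
y = [-3, -2, -1, 0, 1, 2, 3]
x = [2 * value for value in y]   # linear dependence: x = 2y

v = [-3, -2, -1, 0, 1, 2, 3]
u = [value ** 2 for value in v]  # non-linear dependence: u = v^2

r_linear = pearson(x, y)     # the linear pair is perfectly correlated
r_quadratic = pearson(u, v)  # the quadratic pair shows no linear correlation here

print(r_linear, r_quadratic)
```

On this sample the linear pair has a correlation of 1 and the quadratic pair has a correlation of 0, even though u is fully determined by v. The exact value for the quadratic pair depends on the range of v: if v took only positive values, the correlation would be high but still below 1.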

How does correlation help in feature selection?

Features that are highly correlated with each other are close to linearly dependent and hence have almost the same effect on the dependent variable. So, when two features have high correlation, we can drop one of the two.
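As a rough sketch of this idea, the code below keeps a feature only if it is not highly correlated with any feature that was already kept. The function name `drop_correlated`, the threshold of 0.9 and the toy columns are all assumptions chosen for illustration; the article does not prescribe a specific cut-off. With pandas, the pairwise correlations would normally come from `DataFrame.corr()` instead of a hand-written helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Return the names of the features to keep.

    A feature is kept only if its absolute correlation with every
    previously kept feature is below `threshold` (a hypothetical
    cut-off, to be tuned per dataset).
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[other])) < threshold for other in kept):
            kept.append(name)
    return kept


# Toy data: the same height in two units, plus an unrelated column.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [35, 22, 41, 29, 50],
}

selected = drop_correlated(features)
print(selected)  # ['height_cm', 'age']
```

Here `height_in` is dropped because it carries almost the same information as `height_cm`. Which feature of a correlated pair survives depends only on column order in this sketch; in practice that choice may deserve more thought.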

P-value

Before we try to understand the p-value, we need to know about the null hypothesis.

Null hypothesis is a general statement that there is no relationship between two measured phenomena.

Testing (accepting, approving, rejecting, or disproving) the null hypothesis — and thus concluding that there are or are not grounds for believing that there is a relationship between two phenomena (e.g. that a potential treatment has a measurable effect) — is a central task in the modern practice of science; the field of statistics gives precise criteria for rejecting a null hypothesis.

Source: Wikipedia

For more information about the null hypothesis, see the Wikipedia article cited above.
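To make the idea of testing a null hypothesis more concrete, here is a minimal sketch of a permutation test. It is one possible approach, not necessarily the one the rest of the article uses. The data (`dose`, `response`) and the helper `fraction_as_extreme` are made up for illustration. If the null hypothesis of "no relationship" were true, shuffling one variable should not matter, so we count how often shuffled data gives a correlation at least as strong as the observed one.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fraction_as_extreme(xs, ys, n_shuffles=1000, seed=0):
    """Fraction of shuffles whose |correlation| is at least the observed one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real pairing between xs and ys
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_shuffles


# Hypothetical measurements with a strong, nearly linear relationship.
dose = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
response = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2]

fraction = fraction_as_extreme(dose, response)
print(fraction)
```

A very small fraction means that data like ours would be rare if there were truly no relationship, which gives grounds for rejecting the null hypothesis. A large fraction means the data gives no such grounds.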

What is p-value?

The rest of this article has been moved to the publication Machine Learning — The Science, The Engineering, and The Ops. You can read the entire article for free here.


Data Scientist at Freshworks. Likes to talk about Machine Learning and plays the Harmonica. Slowly moving to Substack