
Learn how to do Feature Selection the Right Way

Is x really a predictor of y?

Ayushi Jain
Towards Data Science
12 min read · Jul 17, 2020


Struggling with finding the right features for your model? By right, I mean value-adding features. If you are working with high-dimensional data, say from IoT sensors or healthcare, with hundreds to thousands of features, it is tough to figure out which subset of features will give you a good, sustainable model. This article is all about feature selection and the implementation of its techniques using scikit-learn on the automobile dataset. I’ve tried my best to keep the tricky calculations aside, but some basic understanding of statistics will make your journey easier. 😃

A ‘raw’ dataset often comes with many irrelevant features that do not contribute much to the accuracy of your predictive model. Understand this using a music analogy — music engineers often employ various techniques to tune their tracks so that there is no unwanted noise and the vocals are crisp and clear. Similarly, datasets contain noise, and it’s crucial to remove it for better model optimization. That’s where feature selection comes into the picture!

Now, keeping the model accuracy aside, theoretically, feature selection

  • reduces overfitting (the ‘Curse of Dimensionality’) — if your dataset has more features/columns than samples, the model will be prone to overfitting. By removing irrelevant data/noise, the model gets to focus on essential features, leading to better generalization.
  • simplifies models — dimensionality adds many layers to a model, making it needlessly complicated. Over-engineering is fun, but such models may not be better than their simpler counterparts. Simpler models are easier to interpret and debug.
  • reduces training time — fewer features/dimensions mean less computation, speeding up model training.

Keep in mind, all these benefits depend heavily on the problem. Feature selection won’t always boost accuracy, but it will almost always leave you with a simpler, leaner model.

Is this Dimensionality Reduction?

Not exactly!

Often, feature selection and dimensionality reduction are used interchangeably, owing to their similar goal of reducing the number of features in a dataset. However, there is an important difference between them. Feature selection yields a subset of the original features that best represents the data. Dimensionality reduction, on the other hand, maps the data into a new, lower-dimensional feature space in which the original features are represented, typically by combining or transforming them rather than keeping them intact. To sum up, you can consider feature selection as a part of dimensionality reduction.

Dataset In Action

We will be using the automobile dataset from the UCI Machine Learning repository. The dataset contains information on car specifications, its insurance risk rating and its normalized losses in use as compared to other cars. The goal of the model would be to predict the ‘price’. A regression problem, it comprises a good mix of continuous and categorical variables, as shown below:

Attribute Description (Image by Author)

After considerable preprocessing of around 200 samples with 26 attributes each, I managed to get an R-squared value of 0.85. Since our focus is on assessing feature selection techniques, I won't go deep into the modeling process. For the complete regression code, check the Jupyter notebook at the GitHub link below.

Now, let's try to improve the model by feature selection!

Techniques

Concisely, feature selection methods can be divided into three major buckets: filter, wrapper & embedded.

I. Filter Methods

With filter methods, we primarily apply a statistical measure that suits our data to assign each feature column a score. Based on that score, we decide whether a feature will be kept or removed from our predictive model. These methods are computationally inexpensive and are best for eliminating redundant, irrelevant features. However, one downside is that they don't take feature correlations into consideration, since they work on each feature independently.

Moreover, we have Univariate filter methods that work on ranking a single feature and Multivariate filter methods that evaluate the entire feature space. Let's explore the most notable filter methods of feature selection:

1.) Missing Values Ratio

Data columns with too many missing values won't be of much value. As a rule of thumb, 25–30% is an acceptable threshold of missing values, beyond which we should drop those features from the analysis. If you have the domain knowledge, it's always better to make an educated guess about whether the feature is crucial to the model. In such a case, try imputing the missing values using the various techniques listed here. To get missing value percentages per feature, try this one-liner! Adding a Jupyter notebook for each technique was cumbersome, so I’ve added the output side by side using GitHub gists, all based on the same automobile dataset.

nlargest() returns, by default, the 5 features with the most missing values
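For reference, here is a minimal sketch of that one-liner, under the assumption that the raw data sits in a CSV called automobile.csv and that missing values are encoded as '?' (as in the UCI file); both names are assumptions, not taken from the original gist.

```python
import pandas as pd

# Assumed file name; the UCI automobile data marks missing values with "?"
df = pd.read_csv("automobile.csv", na_values="?")

# Percentage of missing values per feature, 5 worst offenders first
missing_ratio = df.isnull().mean().mul(100).nlargest(5)
print(missing_ratio)

# Drop the columns whose missing ratio crosses the ~30% rule of thumb
df = df.drop(columns=missing_ratio[missing_ratio > 30].index)
```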

2.) Variance Threshold

Features in which a single value occupies all of the samples are said to have zero variance. Such features carry little information, do not help in predicting the target variable, and can be dropped. You can adjust the threshold value; the default is 0, i.e. remove the features that have the same value in all samples. For quasi-constant features, which have the same value for a very large subset of samples, use a threshold such as 0.01. In other words, drop a column where roughly 99% of the values are identical.

Drops features with zero variance (customizable threshold)
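As a minimal sketch of how this could look with scikit-learn's VarianceThreshold (the toy data frame below is purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame: 'constant' never varies, 'quasi_constant' is identical in 99% of rows
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "constant":       np.zeros(100),
    "quasi_constant": np.r_[np.zeros(99), 1.0],
    "useful":         rng.normal(size=100),
})

# threshold=0 (the default) drops only zero-variance columns;
# 0.01 also catches quasi-constant columns like the one above (variance ~ 0.0099)
selector = VarianceThreshold(threshold=0.01)
selector.fit(X)

print("Kept features:", list(X.columns[selector.get_support()]))  # ['useful']
```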

3.) Correlation coefficient

Two independent features (X) are highly correlated if they have a strong relationship with each other and move together. In that case, you don't need to feed two similar features to the model if one will suffice. A correlation coefficient essentially summarizes how closely the data points cluster around a fitted line. There are various approaches to calculating correlation coefficients, and if a pair of columns crosses a certain threshold, the one that shows the higher correlation with the target variable (y) is kept and the other one is dropped.

Pearson correlation (for continuous data) is a parametric statistical test that measures the linear relationship between two variables. Got confused by the parametric term? It means that this test assumes the observed data follows a particular distribution (e.g. a normal/Gaussian distribution). Its coefficient value ‘r’ ranges between -1 (perfect negative correlation) and 1 (perfect positive correlation), indicating the strength and direction of the linear relationship. It also returns a p-value to determine whether the correlation between variables is significant, by comparing it to a significance level ‘alpha’ (α). If the p-value is less than α, the sample contains sufficient evidence to reject the null hypothesis and conclude that the correlation coefficient does not equal zero.

Source: Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the strength and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom).

Spearman rank correlation coefficient (for continuous + ordinal data) is a non-parametric statistical test that works similarly to Pearson, but it makes no assumptions about the underlying distribution of the data. Denoted by the symbol rho (-1 ≤ ρ ≤ 1), this test can be applied to both ordinal and continuous data that has failed the assumptions for conducting Pearson's correlation. For newbies, ordinal data is categorical data with a natural ranking/ordering (e.g. low, medium and high). One assumption to note here is that there should be a monotonic relationship between the variables, i.e. the variables increase in value together, or as one increases, the other decreases.

Kendall correlation coefficient (for discrete/ordinal data) - Similar to Spearman correlation, this coefficient compares the number of concordant and discordant pairs of data.

Let's say we have a pair of observations (xᵢ, yᵢ), (xⱼ, yⱼ), with i < j, they are:
* concordant if either (xᵢ > xⱼ and yᵢ > yⱼ) or (xᵢ < xⱼ and yᵢ < yⱼ)
* discordant if either (xᵢ < xⱼ and yᵢ > yⱼ) or (xᵢ > xⱼ and yᵢ < yⱼ)
* neither if there’s a tie in x (xᵢ = xⱼ) or a tie in y (yᵢ = yⱼ)

Denoted with the Greek letter tau (τ), this coefficient varies between -1 to 1 and is based on the difference in the counts of concordant and discordant pairs relative to the number of x-y pairs.

Pearson, Spearman, Kendall’s Correlation Coefficient using Scipy & Pandas
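As a rough sketch, the three coefficients can be computed with SciPy and pandas roughly as follows (the file name is an assumption, and 'engine-size' is just one example column from the automobile data):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("automobile.csv", na_values="?").dropna(subset=["engine-size", "price"])
x, y = df["engine-size"].astype(float), df["price"].astype(float)

# Pearson (linear, parametric), Spearman (monotonic, rank-based), Kendall (concordant pairs)
print("Pearson :", stats.pearsonr(x, y))
print("Spearman:", stats.spearmanr(x, y))
print("Kendall :", stats.kendalltau(x, y))

# Or the full pairwise correlation matrix straight from pandas
print(df.corr(method="pearson", numeric_only=True)["price"].sort_values(ascending=False))
```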

In the regression jupyter notebook above, I’ve used Pearson’s correlation since Spearman and Kendall work best only with ordinal variables and we have 60% continuous variables.

4.) Chi-Square Test of Independence(for categorical data)

Before diving into chi-square, let's understand an important concept: hypothesis testing! Imagine someone makes a claim that is the commonly accepted default; you call it the null hypothesis. You then come up with an alternative hypothesis, one that you think explains the phenomenon better, and work towards rejecting the null hypothesis.
In our case:
Null Hypothesis: The two variables are independent.
Alternative Hypothesis: The two variables are dependent.

So, chi-square tests come in two variations: one that evaluates goodness-of-fit, and the one we will focus on here, the test of independence. Primarily, it compares the observed data to the counts we would expect if the variables were truly independent. Then you check how far the observed counts deviate from those expected counts. If the deviation is too large, it is very likely that the variables are dependent, suggesting that the null hypothesis is incorrect!

The test returns a chi-square statistic along with a p-value to help us decide. On a high level, if the p-value is less than some critical value, the ‘level of significance’ (usually 0.05), we reject the null hypothesis and conclude that the variables are dependent!

Chi-square would not work with the automobile dataset since it needs categorical variables and non-negative values, and our target ‘price’ is continuous! For that reason, we can use Mutual Information & ANOVA instead. Still, a quick sketch of the test on hypothetical categorical data is shown below.
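Here is that minimal, hypothetical sketch of the test of independence using scipy.stats.chi2_contingency; the columns and values are made up for illustration and are not part of the automobile model:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical data: does 'body-style' depend on 'fuel-type'?
toy = pd.DataFrame({
    "fuel-type":  ["gas", "gas", "diesel", "gas", "diesel", "gas", "diesel", "gas"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan", "wagon", "hatchback", "sedan", "wagon"],
})

# Observed counts for every (fuel-type, body-style) combination
contingency = pd.crosstab(toy["fuel-type"], toy["body-style"])

# Compares the observed counts with those expected under independence
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}")
# p-value < 0.05 -> reject the null hypothesis and treat the variables as dependent
```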

5.) Mutual Information (for both regression & classification)

Mutual information measures the contribution of one variable towards another. In other words, how much would the target variable be impacted if we removed or added the feature? MI is 0 if the two variables are independent, and it grows as the dependency gets stronger; when X is completely determined by Y, MI equals the entropy of X, i.e. it quantifies the amount of information obtained about one random variable by observing the other.

The best thing about MI is that it allows one to detect non-linear relationships and works for both regression and classification. Cool! isn't it 😃
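Here is a brief sketch using scikit-learn's mutual_info_regression on the numeric automobile features (the file name and the simple median imputation are assumptions for illustration):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Assumed file name; keep only numeric features and median-impute the gaps
df = pd.read_csv("automobile.csv", na_values="?").dropna(subset=["price"])
X = df.select_dtypes("number").drop(columns="price")
X = X.fillna(X.median())
y = df["price"].astype(float)

# Higher score = more shared information between the feature and the target;
# unlike correlation, this also picks up non-linear relationships
mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)
print(mi.sort_values(ascending=False).head(10))
```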

6.) Analysis of Variance (ANOVA)

Okay, honestly, this one is a bit tricky, but let's understand it step by step. Firstly, instead of features we deal here with groups/levels. Groups are the different categories within the same independent (categorical) variable. ANOVA is essentially an extension of the t-test: with a t-test you can compare only two groups, while ANOVA lets you compare two or more groups (it is typically used with three or more) to see whether there is a difference in their means and whether they come from the same population.

Its hypotheses are
H0: The means of all groups are equal.
H1: At least one group mean is different.

Let’s say, from our automobile dataset, we use the feature ‘fuel-type’, which has 2 groups/levels: ‘diesel’ and ‘gas’. Our goal would be to determine whether these two groups are statistically different, by checking whether the mean of the target (‘price’) differs between them. ANOVA uses the F-test for statistical significance, which is the ratio of the variance between groups to the variance within groups; the larger this number is, the more likely it is that the group means really *are* different and that you should reject the null hypothesis.

Anova Feature Selection for Regression task
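A minimal sketch with SelectKBest and the univariate F-test for regression (f_regression); as before, the file name and preprocessing are assumptions, and for a classification target you would swap in f_classif:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Assumed file name; numeric features only, median-imputed, 'price' as the target
df = pd.read_csv("automobile.csv", na_values="?").dropna(subset=["price"])
X = df.select_dtypes("number").drop(columns="price")
X = X.fillna(X.median())
y = df["price"].astype(float)

# Univariate F-test for regression (use f_classif for a classification target)
selector = SelectKBest(score_func=f_regression, k=10).fit(X, y)

scores = pd.DataFrame({"f_score": selector.scores_, "p_value": selector.pvalues_},
                      index=X.columns)
print(scores.sort_values("f_score", ascending=False))
print("Selected:", list(X.columns[selector.get_support()]))
```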
Choosing a Feature selection Method (Image by Author)

II. Wrapper Methods

In wrapper methods, we choose a subset of features and train a model on it using a machine learning algorithm. Based on the inferences from this model, we employ a search strategy to look through the space of possible feature subsets and decide which feature to add or remove for the next model. This loop continues until we reach the desired number of features (k_features) or the model's performance stops improving.

The downside is that it becomes computationally expensive as the number of features grows; on the plus side, it takes care of interactions between features, ultimately finding the subset that gives your model the lowest possible error.

1.) Sequential Feature Selection

A greedy search algorithm, this comes in two variants: Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS). SFS starts with an empty set of features and looks for the single feature that minimizes the cost function. Once found, that feature is added to the subset, and in the same way, one by one, it builds up the right set of features for an optimal model. Sequential Backward Selection takes the opposite route: it starts with all the features and iteratively removes them one by one based on performance. Both algorithms have the same goal of attaining the lowest-cost model.

The main limitation of SFS is that it is unable to remove features that become non-useful after the addition of other features. The main limitation of SBS is its inability to reevaluate the usefulness of a feature after it has been discarded.

SFS (for Sequential Backward Selection, change the forward parameter to False)
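A rough sketch of SFS, assuming mlxtend's SequentialFeatureSelector (its forward flag matches the caption above; scikit-learn's own SequentialFeatureSelector uses a direction parameter instead), with the usual assumed file name and preprocessing:

```python
import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

# Assumed file name and preprocessing, as in the earlier sketches
df = pd.read_csv("automobile.csv", na_values="?").dropna(subset=["price"])
X = df.select_dtypes("number").drop(columns="price")
X = X.fillna(X.median())
y = df["price"].astype(float)

sfs = SFS(LinearRegression(),
          k_features=10,    # stop once 10 features have been selected
          forward=True,     # set to False for Sequential Backward Selection
          scoring="r2",
          cv=5)
sfs = sfs.fit(X, y)

print("Selected features:", sfs.k_feature_names_)
print("Cross-validated R2:", sfs.k_score_)
```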

2.) Recursive Feature Elimination (RFE)

Given an initial set of features, this greedy algorithm repeatedly builds a model, considering a smaller subset of features each time. How does it do that? After an estimator is trained on the features, it assigns each feature a rank based on the model’s coef_ or feature_importances_ attribute, conveying the importance of each feature. At each step, the least important features are pruned from the current set. This process is repeated recursively until the specified number of features is reached.

RFE for Regression
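A minimal sketch with scikit-learn's RFE and a linear regression estimator (same assumed file name and preprocessing as before):

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Assumed file name and preprocessing, as in the earlier sketches
df = pd.read_csv("automobile.csv", na_values="?").dropna(subset=["price"])
X = df.select_dtypes("number").drop(columns="price")
X = X.fillna(X.median())
y = df["price"].astype(float)

# Recursively drop the least important feature (per coef_) until 10 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=10, step=1).fit(X, y)

ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking)  # rank 1 = selected; higher ranks were pruned earlier
print("Selected:", list(X.columns[rfe.support_]))
```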

III. Embedded Methods

These methods combine the strengths of both filter and wrapper methods. The upside is that they perform feature selection during the training process itself, which is why they are called embedded! They are computationally much cheaper than wrapper methods while still accounting for interactions between features, which often makes them a good middle ground.

1.) L1 ( LASSO) Regularization

Before diving into L1, let's understand a bit about regularization. Primarily, it is a technique used to reduce overfitting in highly complex models: we add a penalty term to the cost function so that as the model's complexity increases, the cost function increases sharply. Coming back to LASSO (Least Absolute Shrinkage and Selection Operator) regularization, what you need to understand here is that it comes with a parameter, ‘alpha’; the higher the alpha, the more feature coefficients of the least important features are shrunk to exactly zero. Eventually, we get a much simpler model with the same or better accuracy!

However, if a certain feature matters and you don't want it dropped completely, you can try Ridge regularization (L2) or Elastic Net (a combination of L1 & L2), which shrink a feature's weight rather than necessarily removing it entirely.

Lasso Regression
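A brief sketch of L1-based selection with scikit-learn's Lasso (the file name, preprocessing and the example alpha are assumptions; in practice alpha would be tuned, e.g. with LassoCV):

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Assumed file name and preprocessing, as in the earlier sketches
df = pd.read_csv("automobile.csv", na_values="?").dropna(subset=["price"])
X = df.select_dtypes("number").drop(columns="price")
X = X.fillna(X.median())
y = df["price"].astype(float)

# Scaling matters for L1: the penalty acts directly on the coefficient sizes
X_scaled = StandardScaler().fit_transform(X)

# The larger alpha is, the more coefficients are shrunk exactly to zero
lasso = Lasso(alpha=100.0).fit(X_scaled, y)
coef = pd.Series(lasso.coef_, index=X.columns)

print("Dropped  features:", list(coef[coef == 0].index))
print("Selected features:", list(coef[coef != 0].index))
```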

2.) Tree Model (for Regression & Classification)

One of the most popular and accurate machine learning algorithms, a random forest is an ensemble of randomized decision trees, where an individual tree doesn't see all the features and samples. The reason we use these for feature selection is the way decision trees are constructed! During tree building, starting from the root, the algorithm tries the possible splits at each step, making conditional comparisons, and chooses the one that partitions the data into the most homogeneous (purest) groups. The importance of each feature is then derived from how much its splits increase the purity of the resulting groups.

Using Gini impurity for classification and variance for regression, we can identify the features that would lead to an optimal model. The same concept can be applied to CART (Classification and Regression Trees) and boosting tree algorithms as well.

Feature Importances using Decision tree
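A minimal sketch of impurity-based feature importances; the caption above refers to a single decision tree, so the sketch uses DecisionTreeRegressor (a RandomForestRegressor exposes the same feature_importances_ attribute), with the usual assumed file name and preprocessing:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Assumed file name and preprocessing, as in the earlier sketches
df = pd.read_csv("automobile.csv", na_values="?").dropna(subset=["price"])
X = df.select_dtypes("number").drop(columns="price")
X = X.fillna(X.median())
y = df["price"].astype(float)

# Impurity-based importances: how much each feature's splits reduce the variance overall
tree = DecisionTreeRegressor(random_state=0).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)

print(importances.sort_values(ascending=False).head(10))
```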

End Notes

That’s all! I hope you got a good intuition for how these statistical tests work as feature selection techniques. An important thing to remember is that applying a feature selection algorithm doesn't always guarantee better accuracy, but it will surely lead to a simpler model than before!

If you have any questions/thoughts, feel free to leave your feedback in the comment section below or you can reach me on Linkedin.

Stay tuned for more!😃
