
How to Perform Feature Selection in a Data Science Project

Four methods and a whole process for Feature Selection, with examples in Python

Photo by Vladislav Babienko on Unsplash

Feature selection is an essential part of a Data Science project. When you work with a (very) large dataset you should always ask yourself two questions:

  1. What do these features represent?
  2. Are all these features important?

The answer to the second question leads us to feature selection: you don’t want meaningless features in your data frame, because they are a waste of computation.

In this article, we will see four methods for feature selection, and I will also describe a process to follow (generally speaking, you will rarely be able to apply just one of the following methods to do the whole job of feature selection).

1. The Correlation Matrix

The correlation matrix helps us find the linear relations between the features and the label, and even between the features themselves. When some features are highly correlated, we can decide to drop (some of) them, because two highly correlated features carry largely the same information and have the same influence on the results.

So, let’s see what we can do with the correlation matrix. Below is an example taken from one of my projects; suppose we have a data frame "df" (the details of the dataset don’t matter here); let’s create its correlation matrix:

import matplotlib.pyplot as plt
import seaborn as sns
#figure size
plt.figure(figsize=(20, 10))
#heat map for correlation coefficient
sns.heatmap(df.corr(), annot=True, fmt="0.1")
A Correlation Matrix. Image by Author.

There are a lot of correlated features; for example:

  • baseline value and histogram_mode
  • baseline value and histogram_median
  • histogram_mode, histogram_mean, and histogram_median
  • histogram_width and histogram_min
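
If you prefer to extract these pairs programmatically instead of reading the heat map by eye, here is a minimal sketch (the 0.8 threshold is an arbitrary choice of mine, not a universal rule):

import numpy as np

# absolute correlations, upper triangle only (to avoid duplicate pairs)
corr = df.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# pairs of features whose absolute correlation exceeds the threshold
threshold = 0.8
high_corr = upper.stack()
print(high_corr[high_corr > threshold].sort_values(ascending=False))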

Let’s just focus on the correlations, not on what these features (and the label) represent; I want to see a graphical representation of the possible correlation between these variables:

#creating a sub-dataframe with the correlated features
a = df[['baseline value', 'histogram_mode', 'histogram_median', 'histogram_mean', 'histogram_width', 'histogram_min']]
#creating a unique plot with the regressions
g = sns.PairGrid(a)
g = g.map_upper(sns.regplot, scatter_kws={'alpha':0.15}, line_kws={'color': 'red'})
A scatterplot with regression lines of the above-mentioned features. Image by Author.

So, the features indicated above really are highly correlated, as can be seen from the graphs, and I can choose to delete some of them. For example, by cross-checking the various plots, I choose to eliminate the following features (eliminating more of them would mean losing information, because they are all correlated with each other); a sketch of how to drop them follows the list:

  • histogram_min
  • histogram_mean
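
A minimal sketch of how to drop them (assuming we keep working on the same data frame "df"):

# dropping the two features chosen above
df = df.drop(columns=['histogram_min', 'histogram_mean'])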

2. Lasso Regression

If you are working on a linear regression problem, another way to perform feature selection is to apply the Lasso regularized model.

In this type of regularization, some coefficients of the linear model can become zero, and the corresponding features can be eliminated from the model. This means that the Lasso Regression model also performs feature selection. If you want to know more details, I’ve written a dedicated post on it here.
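
As a quick illustration, here is a minimal sketch of Lasso-based feature selection with scikit-learn; the alpha value is an arbitrary choice of mine (in practice you would tune it, for example with LassoCV), and X, y stand for a generic feature data frame and target:

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Lasso needs scaled features, otherwise the penalty treats them unevenly
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso.fit(X, y)

# features whose coefficient has been shrunk to zero can be dropped
coefs = lasso.named_steps['lasso'].coef_
selected = [col for col, c in zip(X.columns, coefs) if c != 0]
print(selected)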

3. Mutual Information method

The methods seen so far are performed on the whole dataset. This method and the following one have to be performed on the split dataset (you can see an interesting discussion on the topic here; the last comment is the most important one).
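
For completeness, here is a minimal sketch of such a split; the label column name 'target' is just a placeholder for whatever your actual label is:

from sklearn.model_selection import train_test_split

# 'target' is a placeholder for the name of your label column
X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)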

Let’s consider the dataset seen in point 1 of this article, and suppose we have just split it into train and test sets, as sketched above. Now, we have:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

#mutual information, selecting all features
mutual = SelectKBest(score_func=mutual_info_classif, k='all')
#learn the relationship from the training data
mutual.fit(X_train, y_train)
#transform the train input data
X_train_mut = mutual.transform(X_train)
#transform the test input data
X_test_mut = mutual.transform(X_test)
#printing the scores of the features
for i in range(len(mutual.scores_)):
    print('Feature %d: %f' % (i, mutual.scores_[i]))
-------------------
>>>
Feature 0: 0.124999
Feature 1: 0.139990
Feature 2: 0.031640
Feature 3: 0.092322
Feature 4: 0.066883
Feature 5: 0.002289
Feature 6: 0.008455
Feature 7: 0.194067
Feature 8: 0.222438
Feature 9: 0.144378
Feature 10: 0.034891
Feature 11: 0.118958
Feature 12: 0.025970
Feature 13: 0.033416
Feature 14: 0.015075
Feature 15: 0.108909
Feature 16: 0.085122
Feature 17: 0.103669
Feature 18: 0.000000

So, this method assigns an importance score to each feature, using a scoring function. Let’s see it graphically:

#plot the scores
plt.bar([i for i in range(len(mutual.scores_))], mutual.scores_)
plt.show()
Mutual Information method scores. Image by Author.

In the conclusions, we will discuss when to use this method.

4. Anova f-test

Let’s consider the same dataset as before; we have just split the data, so we can apply the method directly:

from sklearn.feature_selection import SelectKBest, f_classif

# configure to select all features
an = SelectKBest(score_func=f_classif, k='all')
# learn relationship from training data
an.fit(X_train, y_train)
# transform train input data
X_train_an = an.transform(X_train)
# transform test input data
X_test_an = an.transform(X_test)
# printing scores of the features
for i in range(len(an.scores_)):
    print('Feature %d: %f' % (i, an.scores_[i]))
-------------------
>>>
Feature 0: 0.117919
Feature 1: 0.176444
Feature 2: 0.006887
Feature 3: 0.089149
Feature 4: 0.064985
Feature 5: 0.054356
Feature 6: 0.090783
Feature 7: 0.144446
Feature 8: 0.191335
Feature 9: 0.200292
Feature 10: 0.081927
Feature 11: 0.096509
Feature 12: 0.000000
Feature 13: 0.042977
Feature 14: 0.105467
Feature 15: 0.027062
Feature 16: 0.072015
Feature 17: 0.198037
Feature 18: 0.018785

And here it is graphically:

# plot the scores
plt.bar([i for i in range(len(an.scores_))], an.scores_)
plt.show()
Anova f-test method scores. Image by Author.

As you can see, the results in terms of feature importance are quite different between this method and the Mutual Information one; we’ll see in the next paragraph when to choose one and when the other.

A process for feature selection, and conclusions

For selecting features, you can follow this process:

  1. Plot the Correlation Matrix, calculated on the whole dataset, and decide if there are some features that can be eliminated (we are talking about the highly correlated features; plot a scatterplot with a regression line to be sure of the linear relationship, as I did above).
  2. Choose the Lasso regressor only if the problem is a regression, the model to be used is linear, and there is a "high" number of features.
  3. Split the dataset into train and test sets, and then choose one method between Mutual Information and the Anova f-test. The Anova f-test is able to "feel" the linear dependence between each feature and the target, while mutual information "feels" any kind of dependence, in particular non-linear ones (read also the documentation here). In the case seen above, considering the results obtained with the correlation matrix and the consequent elimination of two features based on the linear-regression plots, the mutual information method coherently indicates, better than the Anova f-test, which features really matter for this type of problem (see the sketch after this list).
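
Once you have settled on a method, here is a minimal sketch of how to actually keep only the top-scoring features; k=10 is an arbitrary choice of mine, and mutual information is used as the scoring function:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# keep only the 10 best-scoring features, fitted on the training set only
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# names of the surviving columns (assuming X_train is a pandas DataFrame)
print(X_train.columns[selector.get_support()])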


Let’s connect!

MEDIUM

LINKEDIN (send me a connection request)

If you want, you can subscribe to my mailing list so you can always stay updated!


Consider becoming a member: you could support me and other writers like me with no additional fee. Click here to become a member.

