
Feature selection is a common component in many Machine Learning tasks.
It is one of the methods used to reduce dimensionality, helping to maintain model performance while reducing overfitting and computational demand.
Whether you are building inference models or predictive models, you will achieve better results by first verifying that you have chosen the optimal set of features to train your model with.
Here, I give a quick overview of some of the common ways to identify the features best suited for building a machine learning model.
Examples of Feature Selection
There are many criteria that can be used to decide which features to keep or omit.
Each type of feature selection will be demonstrated using a copyright-free dataset containing information on house prices. The dataset is accessible here.

The target label in this dataset is ‘property_value’. This feature will not be considered during the feature selection process.
1. Selecting features based on missing values
There are many ways to deal with missing values in datasets. Although deleting records without data is an option, it is discouraged as it entails abandoning valuable information.
That being said, in some cases, removing a feature altogether might be the only option if the majority of the records don’t assign a value for that feature.
In Python, you can easily identify if there are features with too many missing values.

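Something along the following lines works with pandas; the file name below is a placeholder for wherever you saved the CSV:

```python
import pandas as pd

# Load the house price dataset (the file name is a placeholder)
df = pd.read_csv('jiffs_house_price_prediction_dataset.csv')

# Percentage of missing values per feature, largest first
missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct)
```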
As shown in the output, none of the features have any missing values, so no feature elimination is needed.
2. Selecting features based on variance
A feature’s values need to display some level of variance in order to have predictive capabilities.
Therefore, one criterion for evaluating features is the variance of their values.
Conveniently, the sklearn module offers the VarianceThreshold, a feature selector that removes all features whose variance falls below a given threshold.
Below is a simple example of how to implement the VarianceThreshold:

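A minimal sketch; the variance threshold used here is illustrative, so the exact value behind the counts quoted below may differ:

```python
from sklearn.feature_selection import VarianceThreshold

# Separate the predictive features from the target label
X = df.drop('property_value', axis=1)

# Keep only features whose variance exceeds the chosen threshold
# (0.1 is an illustrative value; tune it for your own data)
selector = VarianceThreshold(threshold=0.1)
selector.fit(X)

# Names of the features that pass the threshold
print(X.columns[selector.get_support()])
```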
The VarianceThreshold selects 9 predictive features in the dataset that meet the required variance while omitting the 6 features that do not.
3. Selecting features based on correlation with other features
For building inference models, it is ideal for features to have no relationship with each other. Strongly correlated predictive features will only produce more noise, leading to higher variance in feature coefficient estimates.
This will make it more difficult to obtain insights from an analysis based on such models.
The phenomenon of predictive features having strong correlations with other features is known as multicollinearity. I give an overview of the subject here.
Multicollinearity can be detected by first identifying features that exhibit strong correlations.
One way to find such features is by building a heatmap that displays the correlation coefficient values for all pairs of features.

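A sketch with seaborn, assuming X is the DataFrame of predictive features defined earlier:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlation coefficients between the predictive features
corr_matrix = X.corr()

# Display the coefficients as an annotated heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```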
For this dataset, it is clear that a strong correlation is present between:
- ‘house_size_sqm’ and ‘land_size_sqm’
- ‘land_size_sqm’ and ‘number_of_rooms’
- ‘house_size_sqm’ and ‘number_of_rooms’
Intuitively, these observations make sense. For instance, ‘house_size_sqm’ and ‘land_size_sqm’ practically give us the same information. Including both features will diminish the reliability of any causal model that is trained with this data.
To determine which feature(s) should be eliminated, we can find the features with high variance inflation factor (VIF) values and remove them.
Note: Typically, a VIF value of 10 or more is deemed to be too high.

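A sketch using the statsmodels implementation, assuming all features in X are numeric:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Add an intercept term, since VIF is computed from regressions with a constant
X_const = add_constant(X)

# Compute the VIF for every predictive feature (skipping the constant column)
vif = pd.DataFrame()
vif['feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X_const.values, i + 1)
              for i in range(X.shape[1])]
print(vif.sort_values('VIF', ascending=False))
```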
The feature ‘land_size_sqm’ has the highest VIF value. Let’s compute the VIF values again after removing this feature.

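Continuing the sketch above:

```python
# Drop the feature with the highest VIF and recompute
X_reduced = X.drop('land_size_sqm', axis=1)
X_reduced_const = add_constant(X_reduced)

vif_reduced = pd.DataFrame()
vif_reduced['feature'] = X_reduced.columns
vif_reduced['VIF'] = [variance_inflation_factor(X_reduced_const.values, i + 1)
                      for i in range(X_reduced.shape[1])]
print(vif_reduced.sort_values('VIF', ascending=False))
```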
It is evident that ‘land_size_sqm’ is the only feature that must be removed to address the multicollinearity within the dataset.
4. Selecting features based on model performance
It is also possible to let machine learning models select features.
A well-known feature selection algorithm that utilizes models is recursive feature elimination (RFE).
RFE differs from other feature selection approaches in that it asks you to specify the number of features that should be selected.
Thankfully, the sklearn module comes to the rescue again with its own RFE estimator.
Prior to feature selection with RFE, the data needs to be split into training and testing sets and normalized.
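Here is one way to do that; the split proportion and the choice of StandardScaler are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the predictive features and the target label into training and testing sets
y = df['property_value']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```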
Let’s use the RFE to identify the 10 features that should be selected. The RFE will use a linear regression model to select its features.

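A minimal sketch using sklearn's RFE with a LinearRegression estimator:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Recursively eliminate features until 10 remain,
# using a linear regression model to rank them
rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)
rfe.fit(X_train_scaled, y_train)

# Names of the selected features
print(X_train.columns[rfe.support_])
```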
A valuable feature of the RFE is that it can show the order in which the eliminated features are removed from consideration. This can be accomplished with the ‘.ranking_’ attribute.
Note: a feature with a rank of 1 is a selected feature. Any other ranking represents an eliminated feature.

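One way to view the ranking, continuing from the sketch above:

```python
# Pair each feature with its RFE ranking (rank 1 = selected)
ranking = pd.DataFrame({'feature': X_train.columns, 'rank': rfe.ranking_})
print(ranking.sort_values('rank'))
```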
As shown in the table, the RFE removes the ‘no_of_rooms’, ‘large_living_room’, ‘no_of_bathrooms’, ‘room_size’, and ‘parking_space’ features (in that order).
Conclusion

Now you have become more familiar with the various methods used for feature selection.
Keep in mind that the best approach towards feature selection varies from project to project.
Instead of indiscriminately applying every feature selection method under the sun, consider which ones are the most applicable for your projects.
I wish you the best of luck in your machine learning endeavors!
References
J. Issadeen. (2020). Jiffs house price prediction dataset, Version 3. Retrieved December 12, 2021 from https://www.kaggle.com/elakiricoder/jiffs-house-price-prediction-dataset.