
Feature Selection & Dimensionality Reduction Techniques to Improve Model Accuracy

A comprehensive guide to selecting the most important features in any dataset

Photo by ian dooley on Unsplash

Picture this. You’re excited to finally start your first machine learning project, having spent the last couple of weeks completing an online machine learning course. You come up with a problem that you would like to solve with machine learning, one that you think will properly put your new knowledge to the test. You happily jump onto Kaggle and find a dataset that you can work with. You open up a Jupyter notebook, then import and read the dataset.

All of a sudden, the initial confidence you had disappears as you stare hopelessly at the hundreds of features in the dataset, clueless as to where and how you should begin your analysis. You start to doubt your abilities, but then you recall the "curse of dimensionality" from your online course: the problem that arises when a dataset has so many features that it becomes difficult for a model to interpret and learn from the data.

Consider the following analogy. A kid walks into an ice cream shop, eager to buy some ice cream. But before deciding on which flavour of ice cream he wants, he first tries out the different flavours that are available, considering their taste, colour, texture, price and so on.

Now imagine that this ice cream shop sells hundreds of different flavours of ice cream. It would be almost impossible for him to remember all the specific details of every ice cream in the shop.

This illustrates the "curse of dimensionality". As the number of features increases, a problem becomes more complicated and difficult to analyse and solve.

Introduction

Feature selection and dimensionality reduction allow us to minimise the number of features in a dataset by only keeping features that are important. In other words, we want to retain features that contain the most useful information that is needed by our model to make accurate predictions while discarding redundant features that contain little to no information.

There are several benefits to performing feature selection and dimensionality reduction, including better model interpretability, less overfitting, and a smaller training set, which in turn reduces training time.

In this article, we will explore various feature selection and dimensionality reduction techniques using the Wisconsin breast cancer dataset on Kaggle. More specifically, we will consider the following techniques:

  • Variance inflation factor (VIF)
  • Univariate feature selection
  • Recursive feature elimination
  • Model-based feature selection
  • Principal component analysis (PCA)

You can find the complete notebook on my GitHub here.

Problem statement

Our goal is to train a machine learning model that can classify a given breast cancer cell observation as either benign or malignant. This can be seen as a binary classification problem in machine learning.

I have chosen the random forest classifier for this particular problem, but feel free to try out other classification models of your choice. To compare the effectiveness of each feature selection technique, I use the confusion matrix to assess model accuracy while taking into account the number of features needed to achieve that accuracy.
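For reference, here is a minimal sketch of this kind of setup (the full code is in the notebook linked above). It assumes the Kaggle version of the dataset with its `data.csv` file and column names; the split ratio, random seed and the `evaluate()` helper are illustrative choices rather than the exact settings behind the reported results.

```python
# A minimal setup sketch, assuming the Kaggle version of the dataset ("data.csv").
# The split ratio, random seed and the evaluate() helper are illustrative choices.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")
df = df.drop(columns=["id", "Unnamed: 32"], errors="ignore")  # non-feature columns, if present

X = df.drop(columns=["diagnosis"])
y = df["diagnosis"].map({"B": 0, "M": 1})  # benign = 0, malignant = 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

def evaluate(features):
    """Train a random forest on a subset of features and report its test accuracy."""
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train[features], y_train)
    preds = model.predict(X_test[features])
    print(confusion_matrix(y_test, preds))
    return accuracy_score(y_test, preds)
```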

Data description

The dataset contains 569 breast cancer observations, of which 357 are benign and 212 are malignant.

The columns represent 10 real-valued features of each cell nucleus:

  1. Radius
  2. Texture
  3. Perimeter
  4. Area
  5. Smoothness
  6. Compactness
  7. Concavity
  8. Concave points
  9. Symmetry
  10. Fractal dimension

The mean, standard error and "worst" (the mean of the three largest values) of each feature were also computed, resulting in a total of 10 x 3 = 30 feature columns in the dataset, excluding the target variable.

Just from the names of the features themselves, we can already foresee issues of multicollinearity, a phenomenon in which one predictor variable can be linearly predicted from the others. The most obvious case is between the radius, perimeter and area features.

We will further explore the issue of multicollinearity during exploratory data analysis.

Exploratory data analysis

Target variable (diagnosis)

There are more benign cancer cell observations than malignant ones.

Predictor variables

Since all the predictor variables are numerical variables, we can use a heatmap to visualise the correlation between the features in the dataset.

Heatmap of predictor variables
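A correlation heatmap like the one above can be drawn with seaborn in a few lines; the figure size and colour map below are arbitrary choices.

```python
# Correlation heatmap of the 30 predictor variables
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(18, 14))
sns.heatmap(X.corr(), annot=True, fmt=".1f", cmap="coolwarm")
plt.show()
```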

As expected, we have a severe problem of multicollinearity in our data. From the heatmap above, we can observe that the following features are positively correlated with each other:

  • Radius
  • Perimeter
  • Area
  • Compactness
  • Concavity
  • Concave points

Here’s a pair plot to visualise the relationships between these features.

Pair plot of correlated features
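A pair plot like the one above can be generated with seaborn's pairplot; the subset of "mean" columns below is just one way to slice the correlated features.

```python
# Pair plot of the strongly correlated "mean" features, coloured by diagnosis
correlated = ["radius_mean", "perimeter_mean", "area_mean",
              "compactness_mean", "concavity_mean", "concave points_mean"]
sns.pairplot(df, vars=correlated, hue="diagnosis")
plt.show()
```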

Explore the relationship between predictor variables and the target variable

In this section, we will visualise the relationship between our predictor variables and the target variable. The goal here is to investigate and determine features that are most important at distinguishing whether a cancer cell is benign or malignant.

Violin plot of predictor variables categorised by diagnosis
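One way to reproduce a violin plot like the one above is to standardise the features so they share a common scale, reshape the data to long format, and let seaborn split each violin by diagnosis. Plotting only the first 10 columns (the "mean" features in the Kaggle file) is an illustrative choice.

```python
# Violin plot of the standardised "mean" features, split by diagnosis
data = (X - X.mean()) / X.std()                                  # z-score each feature
data = pd.concat([df["diagnosis"], data.iloc[:, :10]], axis=1)   # first 10 columns = *_mean features
melted = pd.melt(data, id_vars="diagnosis", var_name="feature", value_name="value")

plt.figure(figsize=(12, 6))
sns.violinplot(x="feature", y="value", hue="diagnosis", data=melted, split=True)
plt.xticks(rotation=45)
plt.show()
```

Swapping sns.violinplot for sns.boxplot (and dropping split=True) produces the box plot shown further below.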

We can see that cancer cells that are malignant tend to have higher values in all of the features.

Besides fractal dimension, all the features look promising for classifying cancer cells.

Box plot of predictor variables categorised by diagnosis

Similarly, the box plot above illustrates that fractal dimension is not as good at distinguishing cancer cells as the other features in the dataset. The box plot also allows us to spot outliers in our dataset.

Feature selection

Now that we have a better sense of our data, we can move on to the main purpose of this article.

But before we dive into the different techniques, let’s first consider the base case scenario, i.e. using all the features in the dataset as the training set to train our random forest classifier.

Base case
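With the evaluate() helper sketched earlier, the base case boils down to a single call (again, a sketch rather than the exact notebook code).

```python
# Base case: train on all 30 features
all_features = X.columns.tolist()
print(evaluate(all_features))
```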

Base case accuracy

Our model achieved an accuracy of 98.25%. Not too shabby at all.

Let’s now explore the different feature selection and dimensionality reduction techniques and see if we can replicate this result using a much smaller training set, i.e. fewer features.

Variance inflation factor (VIF)

Variance inflation factor is a measure of collinearity among predictor variables within a multiple regression.
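Here is a sketch of how the VIF values can be computed with statsmodels; a constant term is added so that each VIF is measured against an intercept.

```python
# Compute the VIF of every predictor in the training set
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X_train)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
).drop("const")
print(vif.sort_values(ascending=False).head())
```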

These are the top 5 features in the dataset with the highest VIF:

  • radius_mean
  • perimeter_mean
  • radius_worst
  • perimeter_worst
  • area_mean

Unsurprisingly, these are the suspects that we have already identified during exploratory data analysis.

I proceed to drop those features from the training set.

VIF accuracy

After removing the top 5 features with the highest VIF, our model’s accuracy not only did not decrease, it actually increased to 98.83%.

Univariate feature selection

Univariate feature selection works by selecting the best features based on univariate statistical tests.
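In scikit-learn this is available through SelectKBest. The sketch below uses the ANOVA F-test (f_classif) as the scoring function, which is one common choice; the notebook may use a different one.

```python
# Keep the 5 best features according to a univariate ANOVA F-test
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=5)
selector.fit(X_train, y_train)
selected = list(X_train.columns[selector.get_support()])
print(selected)
print(evaluate(selected))
```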

The top 5 features under univariate feature selection are:

  • perimeter_mean
  • area_mean
  • area_se
  • perimeter_worst
  • area_worst

Univariate feature selection accuracy

Wow, this is remarkable!

Despite only using 5 features, that is 1/6 of our original training set, our model accuracy has only gone down by about 3 percentage points to 95.32%. This goes to show that these 5 features account for most of the information our model needs to classify cancer cells accurately.

Recursive feature elimination

Recursive feature elimination fits a model and removes the weakest features until the specified number of features is reached.
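A sketch using scikit-learn's RFE with the random forest as the base estimator, keeping 5 features so the result stays comparable with the previous technique:

```python
# Recursively eliminate features until only 5 remain
from sklearn.feature_selection import RFE

rfe = RFE(estimator=RandomForestClassifier(random_state=42), n_features_to_select=5)
rfe.fit(X_train, y_train)
selected = list(X_train.columns[rfe.support_])
print(selected)
print(evaluate(selected))
```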

The top 5 features under recursive feature elimination are:

  • concave points_mean
  • radius_worst
  • perimeter_worst
  • area_worst
  • concave points_worst

These features are slightly different from those selected under univariate feature selection.

Let us now test the model accuracy.

Recursive feature elimination accuracy

The accuracy came quite close to that under univariate feature selection, at 95.91%.

Model-based feature selection

Under the random forest model, below are the feature importances, sorted from most important to least important.
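One way to obtain and rank these importances, and to apply the 5% threshold discussed next, is sketched below.

```python
# Rank features by random forest importance and keep those above 5%
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))

selected = importances[importances > 0.05].index.tolist()
print(evaluate(selected))
```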

This is a rather subjective approach, but I decided to keep the features with an importance of over 5% as the training set; those features are:

  • concave points_worst
  • area_worst
  • perimeter_worst
  • radius_worst
  • area_mean
  • concave points_mean
  • perimeter_mean
  • concavity_mean
  • radius_mean

In terms of how our model performed using the 9 features above, the model accuracy came in just slightly below the 98.25% achieved in the base case scenario.

This is a rather impressive result considering we used significantly fewer features this time around.

Dimensionality reduction

Principal component analysis (PCA)

Principal component analysis is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

Last but not least, we have principal component analysis. This is not so much a feature selection technique as a dimensionality reduction technique.

The process involves transforming the original features into linear combinations of one another in such a way that most of the information is captured in the first few components.
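A sketch of how the explained-variance plot below can be produced; the features are standardised first, since PCA is sensitive to the scale of the inputs.

```python
# Scree plot: explained variance ratio of each principal component
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

pca = PCA()
pca.fit(X_train_scaled)

plt.plot(np.arange(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Explained variance ratio")
plt.show()
```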

Explained variance ratio against the number of components

From the plot, we can deduce that the optimal number of components is 4 by using the elbow method.
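We can then refit the PCA with 4 components and train the classifier on the transformed data; note that the test set is transformed with the scaler and PCA fitted on the training set only (again a sketch, reusing the objects defined above).

```python
# Train the random forest on the first 4 principal components
pca = PCA(n_components=4)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(scaler.transform(X_test))

model = RandomForestClassifier(random_state=42)
model.fit(X_train_pca, y_train)
print(model.score(X_test_pca, y_test))
```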

PCA accuracy

With just 4 components, our model is able to achieve a similar accuracy score to that under the model-based feature selection technique where 9 different features were used to train our model.

However, despite PCA’s impressive dimensionality reduction abilities, one of its disadvantages is that the predictor variables become less interpretable. In other words, PCA makes it harder to determine which of the actual features are important in classifying cancer cells. This is because the underlying algorithm turns the original features in the dataset into components that are combinations of different features.

Nevertheless, PCA remains a very robust technique for summarising a large number of features into a few key components, allowing a machine learning model to capture the necessary information in a dataset in order to make accurate predictions.

Conclusion

As we have seen from the results of performing feature selection and dimensionality reduction, we can come very close to replicating the accuracy score under the base case scenario despite using significantly fewer features.

The goal of feature selection and dimensionality reduction is to capture and summarise the important information in a dataset into just a few features or components. Our model can then learn from this information and subsequently make predictions that are equally accurate.

I hope you enjoyed this article and gained some value out of it. Thank you so much for reading.

Happy learning!
