
Feature Selection: Choosing the Right Features for Your Machine Learning Algorithm

Sometimes, less is more

To select, or not to select… Photo by Edu Grande on Unsplash

Why should we select some features and ignore the rest? Isn’t having more features good for the accuracy of our model?

Choosing the right features, and ignoring unsuitable ones, is a vital step in any machine learning project. This can result in good model performance and save you time down the line. It can also help you interpret the output of your model more easily. But having more features will mean the model has more data to train on, and should mean the model will be more accurate, right? Well, not exactly.

Having too many features can make the algorithm prone to overfitting. Overfitting is when the model fits noise and outliers in the training data instead of the underlying pattern, and therefore fails to generalize to new data. Another good reason to choose features carefully is the so-called curse of dimensionality. Each feature typically adds a dimension to the input space, and algorithms become harder to design in high dimensions because their running time often grows exponentially with the number of dimensions. So it makes sense, and offers a real benefit, to select the most suitable features and ignore the rest.

How can we select the best features for training?

There are two ways to select features. The first is to inspect features manually, for example by plotting them as histograms. The second is to select the best features automatically.

Doing things manually…

We can inspect features manually by plotting them, for example as histograms. Then, by identifying which features separate the classes well and which overlap, we can decide which ones will be the most useful. Let us look at an example.

We are going to have a look at the iris dataset. It contains data for 150 iris flowers from 3 species (Iris setosa, Iris virginica and Iris versicolor). Four features are available for each flower: the length and width of its sepals and petals. An excerpt of the dataset is shown below.

Here, you can see the four available features in the iris dataset. The first 5 rows are shown, all belonging to the species Iris setosa.
  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width   Species 
1          5.1          3.5           1.4          0.2    setosa
2          4.9          3.0           1.4          0.2    setosa
3          4.7          3.2           1.3          0.2    setosa
4          4.6          3.1           1.5          0.2    setosa
5          5.0          3.6           1.4          0.2    setosa
Note: Setting dim = 0 in the code below selects the feature Sepal.Length.
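
The original code snippet is not reproduced here; below is a minimal sketch of what it could look like, assuming scikit-learn's load_iris and matplotlib (my own choices, not necessarily the exact code used in the article):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
dim = 0  # 0 = Sepal.Length, 1 = Sepal.Width, 2 = Petal.Length, 3 = Petal.Width

# Draw one histogram per species for the selected feature.
for target, name in enumerate(iris.target_names):
    plt.hist(iris.data[iris.target == target, dim], alpha=0.5, label=name)

plt.xlabel(iris.feature_names[dim])
plt.ylabel("count")
plt.legend()
plt.show()
```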

With the above code, we draw a histogram for each of the three species of the iris data set, for a specific feature selected using the variable ‘dim’.

We can select a specific species using iris.target. For example, in the above code, iris.data[iris.target == 0, dim] gives us the sepal lengths (dim = 0) of the Iris setosa samples.

By looking at the resulting histogram, we see that the three species overlap considerably. This means that the feature we selected (sepal length), given by dim = 0, may not be good enough to separate the different types of iris flowers (setosa, versicolor and virginica).

The results of the above code: the histograms of the three species overlap.

Now, let us select a different feature. We will select the fourth feature (petal width) by setting dim = 3. The image below shows the resulting histogram.

Histogram for petal width (dim = 3), obtained from the above code.

As you can see, this feature provides a much better separation between the three types of flowers than the one we looked at before. Observing histograms in this way helps us build an intuition for the data we are working with and identify both suitable features and features that are not so useful.

The manual approach might not be practical when we are working with many features. In such situations, we can make use of automatic feature selection methods.

Note: In the dataset we used, all features are lengths measured in the same unit (centimeters). But some datasets contain features of very different kinds. For example, one feature may be a length in meters while another is a color. This can introduce its own set of complications, and we will then need to scale the features, which we look at at the end of this article.

Automatic feature selection

The general procedure for feature selection is:

  • Calculate the quality of each feature, either by comparing it with the ground truth or by comparing the variance among the classes for each feature.
  • Next, sort the features according to the calculated quality and keep only the best. The best features can be selected by using a quality threshold or by simply selecting the best n features (see the sketch below).
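
As a concrete sketch of this ranking step (not part of the original article), scikit-learn's SelectKBest scores every feature, here with the ANOVA F-value, and keeps the best n:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()

# Score every feature against the class labels and keep only the best two.
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(iris.data, iris.target)

print(selector.scores_)        # quality score per feature
print(selector.get_support())  # which features were kept
```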

To select a subset of features we can perform either forward feature selection, where we add the best dimension or feature step by step, or backward feature selection, where we start with all features and repeatedly remove the one with the worst quality.
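
For forward or backward selection, scikit-learn (0.24 and later) provides SequentialFeatureSelector; the wrapped classifier and the number of features below are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Greedy forward selection: start with no features and add, one at a time,
# the feature that improves the classifier most. direction="backward"
# starts from all features and removes the worst one at each step.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
)
sfs.fit(iris.data, iris.target)
print(sfs.get_support())  # mask of the selected features
```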

How can we calculate the quality of a feature?

The first quality index we will look at is the Correlation Coefficient (aka Pearson correlation). It is the ratio between the covariance of two variables and the product of their standard deviations. This ratio always lies between -1 and 1.

What is covariance?

If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values (that is, the variables tend to show similar behavior), the covariance is positive.

In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, covariance is negative.

– Weisstein, Eric W. "Covariance". MathWorld.

This can be understood clearly by observing the image below.

The sign of the covariance of two random variables X and Y

The covariance can be calculated with the following equation, with x̄ and ȳ representing the mean values of x and y respectively:

Equation to calculate covariance. Source: Covariance
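
The equation image is not reproduced here; in standard notation (assuming the population form that divides by n; the sample version divides by n − 1 instead), it is:

$$\operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}\,(x_i - \bar{x})(y_i - \bar{y})$$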

Thus, the Correlation Coefficient can be calculated as below:

Equation to calculate correlation coefficient
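
Again the image is not reproduced here; written out, the correlation coefficient is the covariance divided by the product of the two standard deviations:

$$r_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}$$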

If two features are stochastically independent, their correlation will be 0. However, keep in mind that a correlation of 0 does not necessarily mean that the variables are independent; there could be underlying (for example non-linear) dependencies that correlation does not capture.

Also note that correlation does not imply causation. Look at the chart below. Since the two time series move almost in lockstep, the correlation is extremely high. But does that mean that if you eat more cheese, you are more likely to become tangled in your bedsheets? The relationship here is purely coincidental.

Correlation between cheese consumption and deaths by becoming tangled in bedsheets. Author: Tyler Vigen

Disadvantages of correlation for Feature Selection:

  • Correlation only finds relationships that are linear
  • It also only works for problems with two classes (illustrated below)
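
As a rough sketch of correlation as a quality score (my own example, not from the original article), we can correlate each iris feature with the class label on a two-class subset using NumPy:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Pearson correlation of each feature with the class label, restricted to a
# two-class problem (setosa vs. versicolor), since correlation only works
# with two classes.
mask = iris.target < 2
for dim, name in enumerate(iris.feature_names):
    r = np.corrcoef(iris.data[mask, dim], iris.target[mask])[0, 1]
    print(f"{name}: r = {r:.2f}")
```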

Another quality measure that can be used is Fisher’s ratio. It measures the linear discriminative power of a variable and has the following formula.

Equation to calculate Fisher’s ratio.
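
The equation image is not shown here; in the notation used below (x̄ and ȳ for the two class means, σ²ₓ and σ²ᵧ for the class variances), Fisher’s ratio is:

$$F = \frac{(\bar{x} - \bar{y})^2}{\sigma_x^2 + \sigma_y^2}$$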

Here x̄ and ȳ represent the means of class 1 and class 2 respectively, and the variances of the two classes appear in the denominator. The benefit of this method is that it is faster to compute than more complex criteria.
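
A small sketch (my own, using NumPy and the iris data, not from the original article) of computing Fisher’s ratio per feature for two of the classes:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Fisher's ratio of every feature for setosa (class 0) vs. versicolor (class 1):
# squared difference of the class means divided by the sum of class variances.
a = iris.data[iris.target == 0]
b = iris.data[iris.target == 1]
fisher = (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0))

for name, f in zip(iris.feature_names, fisher):
    print(f"{name}: {f:.2f}")
```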

There are many other quality measurement tools available such as Kullback-Leibler Divergence, ANOVA and more, which are not discussed here.

Possible problems with feature selection

Even though most of these algorithms are relatively simple, they are not always applicable. Difficulties arise when deciding which quality measure to use, and when initializing greedy algorithms in cases where no single dimension leads to any results on its own.

Furthermore, even though features are treated as individual and independent of each other, they often depend on each other. As a result, feature selection based on per-feature quality measures can never capture the information that only becomes visible when two features are combined. It is therefore beneficial to make use of information shared among dimensions. This can be achieved by transforming the feature space (also known as compression). Principal Component Analysis (PCA) can be used for this, and we will look at PCA in another article.

Problem of having very different features

Above, we mentioned that having features of very different kinds can introduce problems, for example one feature that is a length in centimeters and another that is a color. To mitigate this, we can use feature scaling.

Scaling

  • In the Iris and Digits datasets, all features share a common scale (centimeters for Iris, pixel intensities for Digits).
  • If this is not the case, single features can bias the result.
  • A feature with a high variance, for example, dominates a distance measure.

Solution:

  • scale features to a mean of 0 and a variance of 1

If the population mean and population standard deviation are known, a raw score x is converted into a standard score by the following formula:

Equation to calculate standard score.
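
The equation image is not reproduced here; the standard score it refers to is:

$$z = \frac{x - \mu}{\sigma}$$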

where μ is the mean of the population and σ is the standard deviation of the population.
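
A minimal sketch of such scaling, using scikit-learn's StandardScaler (an assumption on my part; the article does not name a specific tool):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Scale every feature to mean 0 and variance 1 (z-scores).
X_scaled = StandardScaler().fit_transform(iris.data)

print(X_scaled.mean(axis=0).round(2))  # ~0 for each feature
print(X_scaled.std(axis=0).round(2))   # ~1 for each feature
```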

Conclusion

Having a large number of features can introduce complications when training a machine learning model, such as making the algorithm prone to overfitting and increasing training times. Therefore, it is very important to choose features that work well and ignore features that do not offer sufficient benefit. This can be done manually, by visualizing the data and observing how the features behave, or automatically, when there are too many features to inspect by hand. Both approaches have their advantages, and selecting the suitable one comes down to the problem at hand.


If you found this post useful, consider following me and joining Medium. Your membership directly supports me and the other writers you read.

Thank you for reading! See you in a future post.

