Feature Selection Using Random Forest

The Wisdom of Crowds

Akash Dubey
Towards Data Science


Random forests are one of the most popular machine learning algorithms. They are so successful because, in general, they provide good predictive performance, low overfitting, and easy interpretability. This interpretability comes from the fact that it is straightforward to derive the importance of each variable in the tree's decisions. In other words, it is easy to compute how much each variable contributes to the decision.

Feature selection using random forests falls under the category of embedded methods. Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection methods. Some of the benefits of embedded methods are:

  • They are highly accurate.
  • They generalize better.
  • They are interpretable.

How does Random forest select features?

Random forests consist of hundreds of decision trees (typically 400 to 1,200), each of them built over a random extraction of the observations from the dataset and a random extraction of the features. Not every tree sees all the features or all the observations, and this guarantees that the trees are de-correlated and therefore less prone to over-fitting. Each tree is also a sequence of yes-no questions based on a single feature or a combination of features. At each node (that is, at each question), the tree divides the dataset into two buckets, each of them hosting observations that are more similar among themselves and different from the ones in the other bucket. Therefore, the importance of each feature is derived from how “pure” each of the buckets is.

Does it work differently for Classification and Regression?

For classification, the measure of impurity is either the Gini impurity or the information gain/entropy.

For regression, the measure of impurity is the variance.
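As a quick illustration of these two impurity measures, here is a minimal sketch with made-up labels and target values (not tied to any dataset):

import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

# Classification bucket: a pure bucket has impurity 0, a 50/50 bucket has 0.5
print(gini_impurity([0, 0, 0, 0]))   # 0.0
print(gini_impurity([0, 1, 0, 1]))   # 0.5

# Regression bucket: the impurity is the variance of the target values
print(np.var([3.1, 2.9, 3.0, 8.5]))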

Therefore, when training a tree, it is possible to compute how much each feature decreases the impurity. The more a feature decreases the impurity, the more important the feature is. In random forests, the impurity decrease from each feature can be averaged across trees to determine the final importance of the variable.

To give a better intuition, features that are selected at the top of the trees are in general more important than features that are selected at the end nodes of the trees, as generally the top splits lead to bigger information gains.
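To make this concrete, here is a minimal sketch (using scikit-learn's built-in breast cancer data purely as an example) showing that the forest-level importances are, in essence, the per-tree impurity-based importances averaged across all the trees:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

features, target = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(features, target)

# Each individual tree exposes its own impurity-based importances;
# averaging them across the trees recovers the forest-level importances
# (the attribute also renormalizes, which is a no-op here).
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
print(np.allclose(per_tree.mean(axis=0), rf.feature_importances_))   # expected: True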

Let's see some Python code on how to select features using Random forest.

Here I will not tie the code to a specific dataset, but it can easily be applied to any real dataset; one way to get some data to experiment with is sketched below.
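For readers who want to follow along, one option (purely illustrative, using scikit-learn's built-in breast cancer dataset; any feature matrix X and target y will do) is:

import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load a toy dataset into a DataFrame so the snippets below have data to run on
dataset = load_breast_cancer()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.Series(dataset.target)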

1. Importing libraries

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

2. In all feature selection procedures, it is a good practice to select the features by examining only the training set. This is to avoid overfitting.

So, assuming we have a train and a test set, we select the features using the train set and then apply the same selection to the test set later.

# X holds the features and y holds the target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

3. Here I will do the model fitting and the feature selection together, in just a couple of lines of code.

  • First, I specify the random forest instance, indicating the number of trees.
  • Then I use the SelectFromModel object from sklearn to automatically select the features.
sel = SelectFromModel(RandomForestClassifier(n_estimators=100))
sel.fit(X_train, y_train)

By default, SelectFromModel will select the features whose importance is greater than the mean importance of all the features, but we can alter this threshold if we want (see the sketch below).
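If the default is not what we want, SelectFromModel also accepts an explicit threshold; a quick sketch (the variable names and the 0.01 cut-off below are arbitrary):

# Keep features whose importance is above the median instead of the mean
sel_median = SelectFromModel(RandomForestClassifier(n_estimators=100), threshold='median')

# Or keep features whose importance is above an absolute cut-off
sel_cutoff = SelectFromModel(RandomForestClassifier(n_estimators=100), threshold=0.01)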

4. To see which features are important, we can use the get_support method on the fitted model.

sel.get_support()

It will return an array of boolean values. True for the features whose importance is greater than the mean importance and False for the rest.

5. We can now make a list and count the selected features.

selected_feat= X_train.columns[(sel.get_support())]
len(selected_feat)

It will return an integer representing the number of features selected by the random forest.

6. To get the names of the selected features

print(selected_feat)

It will print the names of the selected features.

7. We can also check and plot the distribution of the feature importances.

pd.Series(sel.estimator_.feature_importances_.ravel()).hist()

It will plot a histogram of the importances of all the features, as computed by the fitted random forest.
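Finally, as mentioned in step 2, the selection learned on the train set is simply applied to the test set; a short sketch of that last step (the _rf variable names are just for illustration):

# Reduce both sets to the selected features only
X_train_rf = sel.transform(X_train)
X_test_rf = sel.transform(X_test)
print(X_train_rf.shape, X_test_rf.shape)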

We can of course tune the parameters of the decision trees. Where we put the cut-off to select features is a bit arbitrary. One way is to select the top 10 or 20 features. Alternatively, we could take the top 10th percentile. For this, we can use mutual information in combination with SelectKBest or SelectPercentile from sklearn, as in the sketch below.
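Here is a sketch of that alternative, using mutual information as the scoring function (the values of k and percentile are arbitrary):

from sklearn.feature_selection import SelectKBest, SelectPercentile, mutual_info_classif

# Keep the 10 features with the highest mutual information with the target
sel_k = SelectKBest(mutual_info_classif, k=10).fit(X_train, y_train)

# Or keep the top 10th percentile of features
sel_p = SelectPercentile(mutual_info_classif, percentile=10).fit(X_train, y_train)

print(X_train.columns[sel_k.get_support()])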

A few of the limitations of random forests are:

  • Correlated features will be given equal or similar importance, but overall reduced importance compared to the same tree built without correlated counterparts.
  • Random forests and decision trees, in general, give preference to features with high cardinality (trees are biased towards these types of variables).

Selecting features using tree-derived feature importance is a very straightforward, fast, and generally accurate way of selecting good features for machine learning, in particular if we are going to build tree-based models.

However, as I said, correlated features will show similar and lowered importance in a tree, compared to what their importance would be if the tree were built without their correlated counterparts.

In situations like this, it is better to select features recursively, rather than all together as we have done here.
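One way to do that is scikit-learn's recursive feature elimination (RFE), which refits the model after dropping the least important feature(s), so correlated features are re-evaluated once their counterparts are gone; a minimal sketch (the number of features to keep is arbitrary):

from sklearn.feature_selection import RFE

# Recursively eliminate features, refitting the forest after each removal
rfe = RFE(RandomForestClassifier(n_estimators=100), n_features_to_select=10)
rfe.fit(X_train, y_train)
print(X_train.columns[rfe.get_support()])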
