
Effective Feature Selection: Recursive Feature Elimination Using R


Photo by Anthony Martino on Unsplash

Developing an accurate and yet simple (and interpretable) model in machine learning can be a very challenging task. Depending on the modeling approach (e.g., neural networks vs. logistic regression), having too many features (i.e., predictors) in the model could either increase model complexity or lead to other problems such as multicollinearity and overfitting¹. Furthermore, with a highly complex model, it could be harder to acquire (or maintain) a large set of features for future predictions. Thus, it is important to select optimal features.

Sometimes, less is more. -William Shakespeare

_Recursive Feature Elimination_², or RFE for short, is a widely used algorithm for selecting the features that are most relevant for predicting the target variable in a predictive model – either regression or classification. RFE applies a backward selection process to find the optimal combination of features. First, it builds a model based on all features and calculates the importance of each feature in the model. Then, it rank-orders the features and iteratively removes the least important one(s), re-evaluating the model at each step with metrics such as RMSE, accuracy, or Kappa. Feature importance can be computed from the model itself (e.g., the random forest importance criterion) or with a model-independent metric (e.g., ROC curve analysis)⁴. This process continues until a smaller subset of features is retained in the model (technically, RFE can still keep all, or most, of the features in the final model).

RFE can be implemented in several programming languages, such as R, Python, and MATLAB. Thanks to the popular caret package³, implementing RFE in R is quite straightforward. In this article, I will demonstrate how to use RFE for feature selection in R. After reading this article, you will:

  • understand how RFE works for selecting important features when building predictive models, and
  • know how to use the caret package to implement RFE-based feature selection in R.

Example: Heart Disease Dataset

Photo by Jesse Orrico on Unsplash

In this example, we will use one of the most popular datasets from the UCI Machine Learning Repository – the Cleveland Heart Disease dataset. The Cleveland database has been used by researchers to predict the presence of heart disease in a sample of 303 patients. Although the original Cleveland database contains 76 variables (see the full variable list here), researchers often selected a subset of 14 variables (13 features and a target variable) when building predictive models (e.g., see Shubhankar Rawat’s post on Towards Data Science). So, we will be using the same subset of features:

  1. age: Age of the individual.
  2. sex: Gender of the individual: 1 = male; 0 = female.
  3. cp: The type of chest-pain experienced by the individual: 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic.
  4. trestbps: The resting blood pressure value of an individual in mmHg on admission to the hospital.
  5. chol: The serum cholesterol in mg/dl.
  6. fbs: The fasting blood sugar level of an individual. If fasting blood sugar > 120 mg/dl then 1; else 0.
  7. restecg: Resting electrocardiographic results: 0 = normal; 1 = having ST-T wave abnormality; 2 = left ventricular hypertrophy.
  8. thalach: Maximum heart rate achieved by an individual.
  9. exang: Exercise-induced angina: 1 = yes; 0 = no.
  10. oldpeak: ST depression induced by exercise relative to rest.
  11. slope: The slope of the peak exercise ST segment: 1 = upsloping; 2 = flat; 3 = downsloping.
  12. ca: Number of major vessels (0–3) colored by fluoroscopy.
  13. thal: Thalassemia status: 3 = normal; 6 = fixed defect; 7 = reversible defect.
  14. target: Diagnosis of heart disease: 0 = absence of heart disease; 1 = presence of heart disease.

Data Preparation

Before we begin the analysis, we will first activate (load) the R packages that we will need in this example (the loading code appears after the list below):

  • dplyr (for data wrangling),
  • faux (for simulating new variables – see the details below),
  • DataExplorer (for exploratory data analysis), and
  • caret and randomForest (for implementing RFE for feature selection).
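As a minimal sketch, loading these packages looks like this (assuming they are already installed):

# Load the packages needed for data wrangling, simulation,
# exploratory data analysis, and RFE-based feature selection
library(dplyr)
library(faux)
library(DataExplorer)
library(caret)
library(randomForest)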

Next, we will import the Cleveland Heart Disease dataset into R and preview the first six rows. You can find a copy of the Cleveland Heart Disease dataset, as well as the R code used in this article, in my GitHub repository.
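As a sketch, the import step could look like the following; the file name heart_cleveland.csv and the data frame name heart are assumptions, so adjust them to match your local copy of the data:

# Import the Cleveland Heart Disease dataset (file name is hypothetical)
heart <- read.csv("heart_cleveland.csv")

# Preview the first six rows
head(heart)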

First six rows of the Cleveland Heart Disease dataset

To make the example a bit more interesting, we will add four pseudo-variables to the dataset:

  • catvar: A random, categorical variable with the levels of a, b, and c – it has no correlation with the target variable (r = 0)
  • contvar1: A random, continuous variable with a mean of 10 and a standard deviation of 2 – it has no correlation with the target variable (r = 0)
  • contvar2: A continuous variable with a mean of 10 and a standard deviation of 2 – it has a low correlation with the target variable (r = 0.2)
  • contvar3: A continuous variable with a mean of 10 and a standard deviation of 2 – it has a moderate correlation with the target variable (r = 0.5)

We will use these variables in the feature selection process and see whether RFE will pick any of them as a feature to be included in the model. The new dataset including the pseudo-variables is available here.
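A sketch of how these pseudo-variables could be simulated with faux; rnorm_pre() generates a continuous variable with a specified correlation to an existing (numeric) vector, the variable names follow the list above, and the seed value is arbitrary:

# Make the simulated variables reproducible
set.seed(2021)

heart <- heart %>%
  mutate(
    # Random categorical variable (levels a, b, c), unrelated to the target
    catvar = sample(c("a", "b", "c"), size = n(), replace = TRUE),
    # Random continuous variable, unrelated to the target (r = 0)
    contvar1 = rnorm(n(), mean = 10, sd = 2),
    # Continuous variables correlated with the target (r = 0.2 and r = 0.5)
    contvar2 = rnorm_pre(target, mu = 10, sd = 2, r = 0.2),
    contvar3 = rnorm_pre(target, mu = 10, sd = 2, r = 0.5)
  )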

Finally, we will recode the categorical features and the target variable as factors and center (and scale) the numerical features in the dataset.
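As a sketch, assuming the columns listed in the dataset description above are the categorical ones:

# Recode the categorical features and the target variable as factors
heart <- heart %>%
  mutate(across(c(sex, cp, fbs, restecg, exang, slope, ca, thal,
                  catvar, target), as.factor))

# Center and scale the remaining numerical features
heart <- heart %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))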

Exploratory Data Analysis

Using the DataExplorer package, we will check the variables in the dataset. First, we will check the types of variables. Next, we will create bar graphs for the categorical variables. Finally, we will create a correlation plot showing the correlations among all variables in the dataset.
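A minimal sketch of these three checks with DataExplorer:

# Overview of variable types and missing values
plot_intro(heart)

# Bar graphs for the categorical (discrete) variables
plot_bar(heart)

# Correlation plot for all variables (factors are dummy-coded automatically)
plot_correlation(heart)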

The first plot from our exploratory analysis confirms the type of variables that we have in the data. There are no missing values in the data. Half of the variables are numerical, while the other half is categorical.

Types of variables in the dataset

The bar graphs show the categorical variables in the dataset. These graphs are particularly useful for detecting unexpected values for the variables. In our dataset, the levels of all categorical variables look right (based on the dataset description above). For two variables (restecg and thal), some levels seem to have very low frequencies. For now, we will keep the original levels for these variables and collapse them later on, if necessary.

Bar graphs for the categorical variables

The correlation matrix plot shows that the two levels of the target variable (see the rows or columns labeled as target_0 and target_1) appear to have a moderate correlation with several features in the dataset (e.g., chol, oldpeak, and thalach). Depending on the intercorrelations among these features, they are likely to be selected as optimal features in the RFE analysis.

Correlation matrix plot with all variables

Feature Selection

Using the features in the dataset (i.e., 13 features in the original dataset and 4 pseudo features that we have created), our goal is to build a model to predict the diagnosis of heart disease (0 = absence of heart disease; 1 = presence of heart disease). In the first part, we will implement RFE to identify which features in the original dataset (i.e., 13 features) can be used to predict the target variable. In the second part, we will add the four pseudo features into the RFE model and repeat the same analysis.

To implement RFE, we will use the rfe function from the caret package. The function requires four parameters:

  • x: A matrix or data frame of features
  • y: The target variable to be predicted
  • sizes: The number of features that should be retained in the feature selection process
  • rfeControl: A list of control options for the feature selection algorithm

So, before using the rfe function, we first need to select the control options for rfeControl. The caret package includes a number of algorithms for RFE, such as random forest, naive Bayes, bagged trees, and linear regression. In this example, we will use "random forest" (called rfFuncs) because it has a nice built-in mechanism for computing feature importance. In addition, we will use repeated 10-fold cross-validation with 5 repeats to improve the performance of feature selection with RFE.
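A sketch of the control setup described above:

# Control options for RFE: random forest functions with
# 10-fold cross-validation repeated 5 times
control <- rfeControl(functions = rfFuncs,
                      method = "repeatedcv",
                      number = 10,
                      repeats = 5)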

Next, we will randomly split the dataset into training (80%) and test (20%) datasets and then save the features and the target variable as separate datasets (e.g., x_train for features and y_train for the target variable).
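A sketch of the split, using caret's createDataPartition() for a stratified 80/20 split; the object names (inTrain, x_train, y_train, x_test, y_test) are the ones referenced in the rest of the article:

set.seed(2021)

# Stratified 80/20 split based on the target variable
inTrain <- createDataPartition(heart$target, p = 0.8, list = FALSE)[, 1]

# For the first analysis, keep only the 13 original features
pseudo_vars <- c("catvar", "contvar1", "contvar2", "contvar3")

x_train <- heart[inTrain, ] %>% select(-target, -all_of(pseudo_vars))
y_train <- heart$target[inTrain]

x_test  <- heart[-inTrain, ] %>% select(-target, -all_of(pseudo_vars))
y_test  <- heart$target[-inTrain]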

Finally, we will put everything together to implement feature selection with RFE. In the rfe function, we use sizes = c(1:13) so that the function tries all possible subset sizes (i.e., 1 feature, 2 features, …, 13 features) to find the optimal set of features. Alternatively, we could type something like sizes = c(1:5, 10, 13) to try only the subsets with 1 to 5 features, 10 features, and 13 features.
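Putting it together, the RFE call itself is short; result_rfe1 is the object used in the output and plots below:

# Run RFE with subset sizes from 1 to 13 features
result_rfe1 <- rfe(x = x_train,
                   y = y_train,
                   sizes = c(1:13),
                   rfeControl = control)

# Print the results
result_rfe1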

The following shows the output returned from the rfe function. The output indicates that RFE recommends 8 features for the model (see the asterisk in the "Selected" column). Both accuracy and Kappa reach their maximum when 8 features are retained in the model.

Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
         1    0.720 0.433     0.0833   0.169         
         2    0.715 0.418     0.0840   0.171         
         3    0.815 0.625     0.0851   0.171         
         4    0.802 0.599     0.0652   0.132         
         5    0.806 0.606     0.0764   0.154         
         6    0.813 0.621     0.0748   0.151         
         7    0.823 0.641     0.0677   0.137         
         8    0.836 0.668     0.0672   0.135        *
         9    0.833 0.662     0.0779   0.156         
        10    0.827 0.650     0.0670   0.136         
        11    0.835 0.665     0.0659   0.133         
        12    0.835 0.665     0.0661   0.133         
        13    0.827 0.650     0.0714   0.144
The top 5 variables (out of 8):
   ca, cp, thal, oldpeak, thalach

We can also see the same results visually in the following graphs (the blue dot represents the optimal solution – i.e., 8 features).
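As a sketch, similar graphs can be drawn from the resampling results stored in result_rfe1$results (caret also offers a built-in plot() method for rfe objects):

# Accuracy by number of features; replace Accuracy with Kappa for the second graph
library(ggplot2)

ggplot(result_rfe1$results, aes(x = Variables, y = Accuracy)) +
  geom_line() +
  geom_point() +
  theme_bw()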

Accuracy of the model based on the number of features
Kappa of the model based on the number of features

Now, we know that 8 features have been selected by RFE, but what are these features? predictors(result_rfe1) gives us a list of the selected features.

[1] "ca"      "cp"      "thal"    "oldpeak" "thalach" "exang"   "slope"   "sex"

Next, we will visually examine variable importance for the selected features and see which features are more important for predicting the target variable.
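A sketch of how the importance values could be extracted and plotted; varImp() applied to an rfe object returns a data frame of importance scores, and predictors() is used here to keep only the selected features:

# Importance scores for the selected features
imp <- varImp(result_rfe1)

varimp_data <- data.frame(feature = rownames(imp), importance = imp[, 1]) %>%
  filter(feature %in% predictors(result_rfe1))

# Bar graph of variable importance
ggplot(varimp_data, aes(x = reorder(feature, -importance), y = importance)) +
  geom_bar(stat = "identity") +
  labs(x = "Feature", y = "Variable importance") +
  theme_bw()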

The bar graph below shows that ca (the number of major vessels colored by fluoroscopy) is the most important feature, followed by cp (the type of chest-pain experienced by the individual) and thal (the thalassemia status).

A bar graph showing variable importance for the selected features

We can also check the model performance using the test dataset. Both accuracy and Kappa values appear to be similar to those obtained from the training dataset.
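A sketch of this check; with rfFuncs and a factor outcome, predict() on the rfe object returns a data frame whose pred column holds the predicted classes, and postResample() computes accuracy and Kappa against the observed classes:

# Predict the test set and compare against the observed target values
preds <- predict(result_rfe1, x_test)
postResample(preds$pred, y_test)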

Accuracy    Kappa 
  0.8333   0.6564

What if we also include the four pseudo features in the analysis? Would the results change or stay the same? Logically, we would expect RFE to eliminate the random variables (catvar and contvar1) as well as the variable with low correlation (contvar2) but to keep the variable that is moderately correlated with the target variable (contvar3) in the model. But, would the selected features from the previous analysis still be retained? Let’s find out.
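A sketch of the second run; x_train2 and x_test2 are hypothetical names for the feature sets that now also contain the four pseudo-variables, and sizes is extended to cover all 17 features:

# Feature sets including the four pseudo-variables (17 features in total)
x_train2 <- heart[inTrain, ] %>% select(-target)
x_test2  <- heart[-inTrain, ] %>% select(-target)

# Run RFE again, this time with subset sizes from 1 to 17
result_rfe2 <- rfe(x = x_train2,
                   y = y_train,
                   sizes = c(1:17),
                   rfeControl = control)

result_rfe2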

The output returned from the new analysis (with 17 features) shows that RFE selected a 5-feature solution (i.e., an even more parsimonious solution!). The accuracy and Kappa values have also increased – this improvement is not necessarily surprising because one of the pseudo features (contvar3) that we included had a moderate correlation with the target variable, so we would expect it to contribute positively to model performance.

Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
         1    0.618 0.228     0.0845   0.171         
         2    0.750 0.491     0.0829   0.168         
         3    0.804 0.602     0.0638   0.129         
         4    0.844 0.684     0.0800   0.161         
         5    0.869 0.735     0.0738   0.149        *
         6    0.854 0.705     0.0782   0.158         
         7    0.852 0.700     0.0790   0.160         
         8    0.856 0.708     0.0736   0.149         
         9    0.857 0.711     0.0709   0.144         
        10    0.859 0.715     0.0770   0.157         
        11    0.858 0.713     0.0709   0.145         
        12    0.858 0.712     0.0700   0.142         
        13    0.863 0.723     0.0761   0.155         
        14    0.860 0.716     0.0753   0.153         
        15    0.866 0.728     0.0683   0.140         
        16    0.861 0.717     0.0752   0.154         
        17    0.862 0.719     0.0680   0.13

The following bar graph confirms our hypothesis: contvar3 is indeed one of the selected features. In fact, it appears to be the most important feature in the model. As we anticipated, the other pseudo features (catvar, contvar1, and contvar2) are not among the selected features because they are not sufficiently associated with the target variable. After adding a feature that is moderately correlated with the target variable, RFE eliminated some of the features with low importance (e.g., exang, slope, and sex) and yielded a simpler solution with only 5 features. In sum, this simple experiment demonstrates the sensitivity of RFE in finding important features and eliminating less relevant ones.

A bar graph showing variable importance for the selected features

Conclusion

Overall, RFE is an effective method for finding an optimal set of features for both regression and classification tasks in machine learning. The caret package provides an easy-to-use function to implement RFE in R. Although I have demonstrated how to use the rfe function with the default options, it is possible to customize the control options with user-defined functions (see the Helper Functions section of the caret manual for nice examples). I hope you find the R code and procedures described in this article helpful for your future projects.

Note: RFE can be used in a variety of ways. In a recent blog post on my website, I demonstrated how to use RFE to shorten exams by selecting the most important questions (i.e., features) when predicting students’ scores on the exam.

References

[1] Kuhn, M., & Johnson, K. (2019). Feature engineering and selection: A practical approach for predictive models. CRC Press.

[2] Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1), 389–422.

[3] Kuhn, M. (2020). caret: Classification and regression training. Available from https://CRAN.R-project.org/package=caret.

[4] Kuhn, M. (2019). The caret package. Available from https://topepo.github.io/caret/index.html

