
During the survey and data collection step, we do not know which features have a strong influence on the output and which do not. Because of this, we collect or measure as many plausibly relevant attributes as possible at this stage.
As the number of features in the training dataset grows, the machine learning model becomes more complex and more computationally expensive.
The aim is to train a model on the minimal set of required features while still predicting with acceptable accuracy. We should neither oversimplify the model and lose significant information by pruning important features, nor keep a complex model burdened with redundant or less important features.
The Scikit-Learn library provides several methods to simplify the model by reducing the dimensionality of the training dataset with minimal impact on the model's prediction accuracy.
In this article, I will discuss recursive feature elimination with cross-validated selection to identify the optimal independent variables and reduce the dimension of the training dataset.
We will use the breast cancer dataset bundled with Scikit-Learn to explain Recursive Feature Elimination With Cross-Validation (RFECV).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
The breast cancer dataset has 569 records with 30 independent variables (features) and a binary class as the dependent variable.
X,y=load_breast_cancer(return_X_y=True,as_frame=True)
print(X.shape)

We will use the feature_importances_ attribute of RandomForestClassifier to measure the significance of the current feature set in each iteration.
In RFECV, feature importance is calculated with the selected estimator, and one or a few features are dropped in each iteration.
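To make this concrete, here is a minimal, self-contained sketch (my addition, not part of the original walkthrough) that fits a RandomForestClassifier on the full dataset and prints its five largest impurity-based importances:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
forest = RandomForestClassifier(max_depth=8, random_state=0).fit(X, y)
# feature_importances_ holds one impurity-based score per column.
top5 = sorted(zip(X.columns, forest.feature_importances_), key=lambda p: p[1], reverse=True)[:5]
for name, score in top5:
    print("%s: %.3f" % (name, score))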
In the code below, the step parameter indicates the number of features to remove at each iteration. Here, we drop one feature per iteration to identify the set of features with the strongest influence on the dependent variable. With cv=3, RFECV performs three-fold cross-validation to split the data into train/test sets and score each feature subset; for a classifier, the integer is interpreted as the Stratified K-Folds cross-validator.
You can learn more about StratifiedKFold in the Scikit-learn documentation.
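If you prefer the splitting strategy to be explicit rather than implied by an integer, you can pass a StratifiedKFold instance to the cv parameter yourself. A sketch equivalent to the cv=3 used below (default, unshuffled folds):
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

# Equivalent to cv=3 for a classifier: three stratified, unshuffled folds.
splitter = StratifiedKFold(n_splits=3)
selector = RFECV(RandomForestClassifier(max_depth=8, random_state=0), step=1, cv=splitter)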
rfc = RandomForestClassifier(max_depth=8, random_state=0)
clf = RFECV(rfc, step=1, cv=3)
clf.fit(X, y)
"nfeatures" provides the count of features which are crucial and have a strong influence in predicting the dependent variable.
print("Optimal number of features : %d" % clf.n_features_)
print("Optimal number of features : ", clf.n_features_)
The breast cancer dataset has 30 features. Based on feature_importances_, RFECV recommends 16 of them as significant for predicting the class of cancer.
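To see which 16 features were kept, the fitted selector's support_ attribute is a boolean mask over the original columns, so the names can be read straight off the DataFrame (this snippet reuses X and clf from above):
# support_ marks the selected columns in the original feature order.
selected = X.columns[clf.support_]
print(list(selected))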

"ranking_" attribute provides the importance order of each feature.
print(clf.ranking_)

Features ranked 1 are the most important and have the maximum influence on predicting the class of cancer. The higher a feature's rank number, the less important it is relatively.
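To read the ranking alongside the feature names, you can pair ranking_ with the DataFrame columns and sort; a small sketch using pandas, which as_frame=True already pulls in:
import pandas as pd

# Rank 1 = selected; higher ranks were eliminated in earlier iterations.
ranking = pd.Series(clf.ranking_, index=X.columns).sort_values()
print(ranking)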
Let us visualise how the cross-validation score varies with the number of features selected.
# grid_scores_ was removed in scikit-learn 1.2; cv_results_ replaces it.
mean_scores = clf.cv_results_["mean_test_score"]
plt.plot(range(1, len(mean_scores) + 1), mean_scores)
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.show()
We can see that the cross-validation score peaks with the 16 most important features. Increasing the dimension of the training dataset beyond that does not improve the prediction accuracy further.
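As a quick sanity check (my addition, not part of the original walkthrough), you can compare cross-validated accuracy on all 30 features against the 16-column subset that clf.transform selects. Note that scoring on the same data used for the selection is mildly optimistic; a held-out set would be the stricter test:
from sklearn.model_selection import cross_val_score

rfc = RandomForestClassifier(max_depth=8, random_state=0)
# Accuracy on the full feature set vs. the RFECV-selected subset.
score_all = cross_val_score(rfc, X, y, cv=3).mean()
score_sel = cross_val_score(rfc, clf.transform(X), y, cv=3).mean()
print("All features: %.3f  Selected subset: %.3f" % (score_all, score_sel))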

Key Takeaways and Conclusion
The aim of supervised machine learning is a generalised trained model built on the most important features.
We should aim to reduce the number of dimensions (features) on which the model is trained. This makes the model simpler and computationally economical.
Recursive Feature Elimination With Cross-Validation indicates which features are important, with an importance ranking. This enables us to build the model with the optimal number of dimensions.
Because RFECV identifies the best features by eliminating the less important or redundant features step by step, with cross-validation at every step, it is computationally very expensive. This is one of the disadvantages of Recursive Feature Elimination With Cross-Validation.
Learn to select the right independent variables with exploratory data analysis and statistics in the article How to identify the right independent variables for Machine Learning Supervised Algorithms?