Feature Selection for the Lazy Data Scientist

A comprehensive literature review and code on Filter-based methods for Feature Selection

David Harar
Towards Data Science


If you’re a data scientist and the curse of dimensionality has struck you, this post is for you. Here is a comprehensive survey (with examples) of feature selection algorithms. We finish by discussing how to integrate and evaluate an ensemble of different feature selectors for a rounded conclusion.

First, let’s start by defining what feature selection is not. If you’re facing the problem of high dimensionality, you probably have heard the terms “dimension reduction (or PCA/Auto Encoders)” and “feature selection.” Before we delve into feature selection, here is a short description of dimension reduction that will help you decide whether you should pursue this approach.

Dimension Reduction

Dimension reduction helps us reduce the number of features by converting data from the feature space into a latent space. This smaller space is supposed to represent the learned features in a more compressed manner. If we choose the dimension of the latent space correctly, we can even reduce noise and remove irrelevant information. The great downside of this approach is the loss of explainability. Moving from the feature space into the latent space, we no longer look directly at features, only at their representations, which generally don’t have the intuitive meaning that the original features have. This approach comes in handy when working on unstructured data, like images or signals. See here and here for further discussion of the meaning of latent spaces.

An example of a latent space in an autoencoder architecture, image by the author

To conclude, if explainability is essential for your project, or you need to keep the original feature representation for some other reason, dimension reduction isn’t for you, and you’re in the right place.

Feature Selection

Feature selection methods allow you to choose the best features in your dataset with respect to (w.r.t.) some criteria. Sounds obvious, right? Even so, there are A LOT of different methods, each of which defines the goodness of a feature in a slightly different way, and to correctly select the best features, you’ll probably need more than one method.

First, some terminology: Feature selection methods can be divided into three families:

(1) Filter methods: Feature selection is done as part of the pre-processing, that is, before we train a model. We filter out features that perform poorly based on some criteria; a naive criterion would be simple correlation. This approach shines when we have massive datasets, especially when we have a lot of features. Hopf and Reifenrath (2021) found in their work (which is covered in depth in the next section) that

..three properties make [filter] methods appear particularly suitable for certain dataset scenarios. These are robustness to noise, cost sensitivity to counteract class imbalance and the consideration of feature combinations (multivariate methods) for the detection of redundant features.

(2) Wrapper methods: Iteratively choose a subset of features, train your model, and keep the best combination. As you can already guess, this approach is very expensive and often impractical.

(3) Embedded methods: Not to be confused with embeddings, the vector representations of words. Embedded methods take advantage of the feature importance estimates that are built into the learning algorithm itself. For example, Random Forest considers a random subset of features at each split in order to avoid overfitting; after the model has been trained, we can look at how much each of our features contributed to the prediction performance. Nevertheless, just as in classification tasks, relying only on embedded methods risks overfitting the feature selection to a given algorithm, restricting us from using different algorithms and different datasets, especially with high-variance methods like decision trees. (A short sklearn sketch of both the wrapper and the embedded approach follows this list.)
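To make (2) and (3) a bit more concrete, here is a minimal sklearn sketch of both a wrapper (recursive feature elimination) and an embedded (Random Forest importances) selector. The dataset, estimator, and parameter values are illustrative only, not a recommendation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

# Wrapper: RFE repeatedly fits the model and drops the weakest features until 10 remain.
wrapper = RFE(RandomForestClassifier(n_estimators=100, random_state=0), n_features_to_select=10)
wrapper.fit(X, y)
wrapper_selected = np.where(wrapper.support_)[0]

# Embedded: reuse the importances that the fitted model exposes anyway.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
embedded_selected = np.argsort(rf.feature_importances_)[::-1][:10]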

Methods (2) and (3) are more costly. They require a trained model and implicitly assume we can train one using the whole data, which isn’t always the case. The following plot presents the distinction between different feature selection approaches.

Figure 4: Extended taxonomy of supervised feature selection methods and techniques, Danny Butvinik, here.

With that, we conclude the introduction and can attend to the pressing issues that brought us here today.

Filter Methods

Filter methods measure feature performance with respect to specific criteria. Some features shine under some setups while performing poorly in others. Therefore, using multiple criteria and integrating the features' scores is essential to get a rounded grasp of our data. I tried to be as inclusive as I could and collected different filtering methods from the literature ([1], [2], [3], [4]). Eventually, I ended up with this long article, and since I wanted to add some code and examples, I had to filter out some filtering methods (see what I did there?). In other words, if you’d like to get an even more profound grasp of the different techniques, I encourage you to go over the literature reviewed in this article.

Filter Methods — Methods :)

Bommert et al. [2] divide filter-based feature selectors into univariate, multivariate, MI-based methods, and Relief-based methods. I will do the same here.

Most of the methods that are covered can be used through FSM, which is available on my Github here.

Univariate Methods

Univariate methods rank each feature separately by its relation to the outcome variable, regardless of the other features. They can use statistical tests, like ANOVA, or predictive performance, like the AUC obtained from predicting the outcome with only one feature at a time.

Analysis of Variance (ANOVA): The ANOVA feature selector uses the F-statistic as the score for each feature. In general, the F-statistic asks how far apart the means of a given feature are across the different classes of our outcome variable. A more detailed explanation can be found in Kasra Manshaei’s answer here, or in this dedicated blog post. In a few words, the better a feature separates the different classes, the greater its F-score.

The separation between classes of each feature, x on the X-axis and y on the Y-axis, source

The implementation in this case is pretty straightforward using sklearn.

from sklearn.feature_selection import SelectKBest, f_classif

# keep the k features with the highest ANOVA F-scores
anova_filter = SelectKBest(f_classif, k=3)

And an example using FSM:

selectors = FSM.FSM(k=20, filler = -1)
selectors.anova_inference(X_,y_)
# Results:
# array(['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
# 'int_rate', 'annual_inc', 'dti', 'inq_last_6mths',
# 'mths_since_last_record', 'pub_rec', 'revol_util', 'out_prncp',
# 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
# 'total_rec_prncp', 'total_rec_late_fee', 'recoveries',
# 'collection_recovery_fee', 'last_pymnt_amnt'], dtype=object)

Kruskal: Applies a Kruskal–Wallis rank-sum test [5] to each feature. This is the non-parametric equivalent of the analysis of variance (that is, it doesn't assume that our features are normally distributed). Before we can calculate the Kruskal statistic, we first rank our observations from smallest to largest (regardless of their corresponding outcome) and sum the ranks within each class i. The Kruskal statistic is then given by

H = 12 / (n(n+1)) · Σ_i [ T_i² / n_i ] − 3(n+1)

where T_i is the sum of ranks within class i, n_i is the number of observations belonging to class i, and n is the total number of observations.

Again, higher values of this statistic mean that the corresponding feature's values differ across classes. We can run the Kruskal–Wallis test using scipy.

from scipy import stats

# compare the distribution of a single feature x_j across the two classes of y
stat, p_value = stats.kruskal(x_j[y == 0], x_j[y == 1])

And in FSM:

selectors = FSM.FSM(k=20, filler = -1)
selectors.kruskal_inference(X_,y_)
# Results:
# Index(['recoveries', 'collection_recovery_fee', 'out_prncp_inv', 'out_prncp',
# 'delinq_2yrs', 'pub_rec', 'total_rec_late_fee', 'acc_now_delinq',
# 'collections_12_mths_ex_med', 'inq_last_6mths', 'policy_code',
# 'revol_bal', 'dti', 'last_pymnt_amnt', 'total_rec_int',
# 'total_rec_prncp', 'total_pymnt', 'total_pymnt_inv', 'member_id',
# 'installment'],
# dtype='object')

Chi-squared: The value of the χ2 statistic between each feature and the target is used as the score. Note that sklearn's chi2 requires non-negative feature values.

from sklearn.feature_selection import SelectKBest, chi2
chi2_filter = SelectKBest(chi2, k=2)

And in FSM:

selectors = FSM.FSM(k=20, filler = -1)
selectors.chi2(X_,y_)
# Results:
# array(['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
# 'int_rate', 'installment', 'annual_inc', 'mths_since_last_record',
# 'revol_bal', 'revol_util', 'out_prncp', 'out_prncp_inv',
# 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
# 'total_rec_int', 'total_rec_late_fee', 'recoveries',
# 'collection_recovery_fee', 'last_pymnt_amnt'], dtype=object)

Multivariate Methods

Maximum Relevance Minimum Redundancy (MRMR): MRMR was introduced in 2005 and has regained popularity recently [6]. It is a greedy algorithm that works iteratively: at each iteration, it chooses the best feature according to the rule below and adds it to the set of previously selected features. It is called Maximum Relevance — Minimum Redundancy because at each iteration we want the feature with maximum relevance with respect to the target variable and minimum redundancy with respect to the features chosen in previous iterations.

Features selected by MRMR vs. Boruta, figure by Samuele Mazzanti, taken from here. In this example, Age (integer) is mechanically dependent on Age; IQ is heavily correlated with parents’ IQ, and the partner’s height with one’s height. By taking the high correlations between features into account (via the normalizing denominator), MRMR selects only the three features circled in red.

While there are many ways to calculate MRMR, I only present the FCQ method. Visit [7] for additional strategies.

Figure by Samuele Mazzanti, was taken from here.

The relevance of feature f at the i-th iteration (the numerator) is computed as the F-statistic between the feature and the target variable. The redundancy (the denominator) is the average absolute Pearson correlation between the feature and all the features selected in previous iterations. In symbols, score_i(f) = F(f, target) / [ (1/|S|) · Σ_{s ∈ S} |ρ(f, s)| ], where S is the set of already-selected features.
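To make the rule concrete, here is a hand-rolled sketch of the FCQ scoring loop. It is illustrative only (mrmr_fcq is my own toy function and it assumes a NaN-free numeric DataFrame); in practice, use the mrmr package shown right after:

import pandas as pd
from sklearn.feature_selection import f_classif

def mrmr_fcq(X: pd.DataFrame, y, k=10):
    # relevance: F-statistic of each feature against the target
    relevance = pd.Series(f_classif(X, y)[0], index=X.columns)
    corr = X.corr().abs()  # pairwise |Pearson correlation| between features
    selected, candidates = [], list(X.columns)
    for _ in range(k):
        if selected:
            redundancy = corr.loc[candidates, selected].mean(axis=1)
        else:
            redundancy = pd.Series(1.0, index=candidates)  # no redundancy term at the first step
        best = (relevance[candidates] / redundancy).idxmax()
        selected.append(best)
        candidates.remove(best)
    return selected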

import pandas as pd
from mrmr import mrmr_classif

data = pd.read_csv('/Users/davidharar/Downloads/data_loan.csv')
X = data._get_numeric_data().drop(['default_ind'], axis = 1)
y = data['default_ind']

# K is the size of the desired subset.
selected_features = mrmr_classif(X, y, K = 14)
selected_features

Predictive Performance

This is a kind of middle ground between filter methods and embedded methods. Under this approach, we iteratively run a model, say a Random Forest, with only one feature at a time. Then we rank our features by familiar KPIs such as AUC or accuracy. The AUC is usually preferred, as it is more resilient to unbalanced datasets and is assessed over multiple threshold choices. The downside of the AUC is that it is suitable for binary prediction only.
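A minimal sketch of this idea, assuming a NaN-free pandas DataFrame X and a binary target y (the estimator and the number of folds are arbitrary choices; single_feature_auc is my own helper):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def single_feature_auc(X, y, cv=5):
    # cross-validated AUC of a model trained on each feature alone
    scores = {}
    for col in X.columns:
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        scores[col] = cross_val_score(model, X[[col]], y, cv=cv, scoring="roc_auc").mean()
    return pd.Series(scores).sort_values(ascending=False)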

Mutual Information

Mutual information is a measure of the mutual dependence between two variables. Unlike the Pearson correlation, it isn’t limited to linear relations. The mutual information quantifies how different the joint distribution is from the product of the two marginal distributions. Recall that if two variables are independent, their joint distribution equals the product of their marginal distributions.

Entropy is another important quantity, closely related to mutual information. The entropy measures the uncertainty of a variable: it is high when all possible values occur with roughly the same probability, and low when the probabilities of occurrence are very different. The link to mutual information is that a high mutual information value indicates a large reduction in uncertainty: once we observe the value of one variable, we are relatively certain about the value of the other. If the mutual information is zero, the two random variables are independent, and knowing one tells us nothing about the plausible values of the other.

Continuous features have to be discretized before we can calculate either the mutual information or the entropy.
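As a side note, if you just want per-feature mutual information scores, sklearn already ships an estimator that handles continuous features internally (it uses a nearest-neighbor estimator, so explicit binning is optional). A sketch, assuming a NaN-free numeric DataFrame X and target y:

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

mi = mutual_info_classif(X, y, discrete_features="auto", random_state=0)
mi_ranking = pd.Series(mi, index=X.columns).sort_values(ascending=False)  # highest MI first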

Let X and Y be two discrete variables with (empirical) probability mass functions p(x) and p(y). The (unconditional) entropy of Y is given by

H(Y) = − Σ_y p(y) · log p(y)

The conditional entropy of Y given X is given by

H(Y|X) = − Σ_x Σ_y p(x, y) · log p(y|x)

The mutual information can then be written as

I(X; Y) = H(Y) − H(Y|X) = Σ_x Σ_y p(x, y) · log [ p(x, y) / (p(x) · p(y)) ]

A proof of the last identity can be found here, along with an excellent intuition for mutual information in this context:

..if entropy H(Y) is regarded as a measure of uncertainty about a random variable, then H(Y|X) is a measure of what X does not say about Y. This is “the amount of uncertainty remaining about Y after X is known”, and thus the right side of the second of these equalities can be read as “the amount of uncertainty in Y, minus the amount of uncertainty in Y which remains after X is known”.

Several methods use the above theoretical structure and explore the concepts of information, as well as redundancy between different features. In this part, we will cover some of them.

Venn diagram showing additive and subtractive relationships of various information measures associated with correlated variables X and Y, source: Wikipedia

Greedy Methods on Mutual Information Scores

Before we start, note that most of the algorithms below have a computational complexity that depends (sometimes quadratically) on the desired number of features. Most of the implementations include a dedicated parameter for the desired number of features. Explore it and save yourself a ton of time.

Joint mutual information (JMI): Even though it isn’t a greedy or iterative method, I added it here since you will see, from time to time, articles that use JMI as a feature selection method on its own. JMI is somewhat naive and resembles using simple correlations for filtering: we just calculate the joint mutual information between the target and each of our features, then keep the k highest ones. The implementation is included under JMIM below.

JMI feature ranking, Source.

Joint Mutual Information Maximization (JMIM) and Normalised Joint Mutual Information Maximisation (NJMIM) [8]: Like MRMR and CMIM, these two algorithms run a greedy search for features that maximize the mutual information with the target. The main difference between JMIM and CMIM is that the latter maximizes the amount of information the candidate feature adds conditional on the pre-selected features, whereas JMIM selects the feature that maximizes the joint mutual information of the new feature plus the pre-selected features with the target. The main takeaway is that JMIM also considers the redundancy between features, while CMIM only tries to maximize the conditional mutual information.

A note about the implementation: It wasn’t easy to find a detailed implementation, but I was lucky to find this great one here. The mifs package supports JMI, JMIM, and MRMR. The MRMR implementation from the corresponding subsection was faster in my experiments, and overall JMI is faster than JMIM. One last thing: when you set up the package, remember to change from sklearn.feature_selection.base import SelectorMixin to from sklearn.feature_selection._base import SelectorMixin . Other than that, mifs is flexible and clear.

import pandas as pd
import mifs

# define the MI_FS feature selection method
feat_selector = mifs.MutualInformationFeatureSelector(method = "JMIM", n_jobs=8)
# A few important params for MutualInformationFeatureSelector:
# 1. method: self-explanatory, can be "JMI", "JMIM" or "MRMR".
# 2. n_features: desired number of features after filtering.
# 3. n_jobs: how many CPUs should run in parallel.
#
# Important attributes:
# 1. feat_selector.get_support(): boolean vector of length p that
#    indicates whether a feature has been selected.
# 2. feat_selector.ranking_: vector of length p that is ordered by
#    importance; for example [52, 1, 53, ...] indicates that feature 52
#    was the most important one.

# Neither JMI nor JMIM is compatible with NaNs
X_filled = X.fillna(-1)
feat_selector.fit(X_filled, y)

# call transform() on X to filter it down to the selected features
X_filtered = feat_selector.transform(X_filled)
# use get_support() so the column names match transform()'s output order
X_filtered = pd.DataFrame(X_filtered, columns = X_filled.columns[feat_selector.get_support()])

Fast Correlation Based Filter (FCBF): A correlation-based filter method that takes the redundancy between features into account. A feature is good for the classification task if (a) its correlation with the target is high enough to make it relevant for prediction, and (b) its correlation with any other relevant feature is not so high that the other feature can predict it. Since linear correlation has some major limitations, FCBF’s correlation is based on entropy, specifically on symmetric uncertainty (SU).
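Since SU is at the heart of FCBF, here is a small sketch of how it can be computed. symmetric_uncertainty is my own helper (not part of FSM or any package), and it assumes discrete (or already discretized) inputs:

from sklearn.metrics import mutual_info_score

def symmetric_uncertainty(x, y):
    # SU(x, y) = 2 * I(x; y) / (H(x) + H(y)); 0 means independent, 1 means fully dependent
    i_xy = mutual_info_score(x, y)
    h_x = mutual_info_score(x, x)  # I(x; x) = H(x)
    h_y = mutual_info_score(y, y)  # I(y; y) = H(y)
    return 2 * i_xy / (h_x + h_y) if (h_x + h_y) > 0 else 0.0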

The algorithm works as follows:

  1. Relevant features: first, we calculate SU between each of our p features and the target and choose an SU threshold. Let S be the set of features with an SU score higher than the threshold, ordered by SU in decreasing order.
  2. Redundant features: starting from the first (highest-SU) feature j in S, we calculate SU(j, l) for every feature l that comes after it in S. If SU(j, l) ≥ SU(target, l), we remove l from S. Features removed this way are called “redundant peers of feature j.”
Implementation with FSM:

selectors = FSM.FSM(k=20, filler = -1)
selectors.fcbf(X_,y_)
# Note: fcbf uses a lower bound on SU rather than a fixed number k of features; the default threshold is 0.

Conditional Mutual Information Maximization (CMIM): A greedy algorithm that chooses features based on their conditional mutual information with the target variable. Each iteration selects the feature that shares the most information with the target given the pre-selected features. That is, we choose the feature that tells us the most about the part of the target variable that hasn’t been explained yet, regardless of how correlated the candidate is with the pre-selected features.

CMIM feature ranking, Source.

A note on the implementation: scikit-feature was written in Python 2. The writer here converted most of the utility functions to Python 3, so I ended up using those utils along with somewhat modified versions of scikit-feature's original functions. The code and a working example are in the accompanying notebook.

Implementation with FSM:

selectors = FSM.FSM(k=20, filler = -1)
selectors.cmim(X_,y_)
# Results:
# Index(['member_id', 'funded_amnt_inv', 'installment', 'open_acc', 'revol_util',
# 'total_pymnt', 'delinq_2yrs', 'annual_inc', 'dti', 'revol_bal',
# 'total_acc', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
# 'total_rec_late_fee', 'recoveries', 'inq_last_6mths',
# 'mths_since_last_record', 'collections_12_mths_ex_med',
# 'mths_since_last_major_derog'],
# dtype='object')

Double Input Symmetrical Relevance (DISR) [9]: DISR relies on the property that a combination of variables can return more information about the output class than the sum of the information provided by each variable individually. Under DISR, the JMI of each feature is normalized by how well the given feature complements the other features.

DISR feature ranking, Source.

Implementation with FSM:

selectors = FSM.FSM(k=20, filler = -1)
selectors.disr(X_,y_)
# Results:
# Index(['member_id', 'dti', 'recoveries', 'collection_recovery_fee',
# 'collections_12_mths_ex_med', 'mths_since_last_major_derog',
# 'policy_code', 'annual_inc_joint', 'total_rec_late_fee',
# 'total_rec_prncp', 'dti_joint', 'acc_now_delinq', 'tot_coll_amt',
# 'tot_cur_bal', 'total_pymnt', 'total_pymnt_inv', 'revol_bal',
# 'open_acc_6m', 'open_il_6m', 'open_il_12m'],
# dtype='object')

Mutual Information Based Feature Selection (MIFS) [10]: Uses a greedy selection of the best feature w.r.t. I(feature; target), with a penalty for redundancy with the pre-selected features.

MIFS feature ranking, Source.

Other developments in this family of information-based feature selection methods include Adaptive MIFS (AMIFS) [11], Conditional MIFS (CMIFS) [12], and more, which, due to space constraints, are left as an exercise for the reader.

Implement MIFS with FSM:

selectors = FSM.FSM(k=20, filler = -1)
selectors.mifs(X_,y_)
# Results:
# Index(['member_id', 'collections_12_mths_ex_med',
# 'mths_since_last_major_derog', 'policy_code', 'annual_inc_joint',
# 'dti_joint', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal',
# 'open_acc_6m', 'open_il_6m', 'open_il_12m', 'open_il_24m',
# 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m',
# 'open_rv_24m', 'max_bal_bc', 'all_util'],
# dtype='object')

Relief Based Methods

Relief [13] was introduced in 1992 for feature selection with binary targets, and it gained popularity once ReliefF generalized it to multi-class targets. Other members of the family include RReliefF, a Relief-based method for regression [14], as well as I-RELIEF, Tuned ReliefF, VLSReliefF, SURF, and ReliefSeq. Here is an excellent survey of Relief-based methods (RBMs) [15]. The original algorithm goes as follows: say we have p features and a binary target. We start by initializing a weights vector w, of length p, with all zeros. Then, B times, we

  • Draw a random example from our data, say from class 0.
  • Find the closest example (in terms of Euclidean distance) that belongs to the same class; we call that example the “near-hit.” Similarly, we take the closest point that belongs to the opposing class and call it the “near-miss.”
  • Update w: for each of the p features j, we update w_j ← w_j − (x_j − nearHit_j)² + (x_j − nearMiss_j)² (the scores are often averaged over the B draws).

Note that we penalize the importance of feature j (denoted by w_j) by the (squared) distance the drawn example has from the closest example of the same class. That is, Relief prefers less-sparse features: features in which examples of the same class are closer to each other are penalized less, while features in which examples from one class are far from examples of the other class receive higher importance.
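To make the procedure concrete, here is a didactic (and deliberately unoptimized) sketch of the binary-target version described above. relief_scores is my own illustrative function, not the scikit-rebate implementation:

import numpy as np

def relief_scores(X, y, n_draws=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_draws):
        i = rng.integers(n)
        xi, yi = X[i], y[i]
        dist = np.linalg.norm(X - xi, axis=1)  # Euclidean distance to every example
        dist[i] = np.inf                       # never match the drawn example with itself
        hits = np.where(y == yi)[0]
        misses = np.where(y != yi)[0]
        near_hit = X[hits[np.argmin(dist[hits])]]
        near_miss = X[misses[np.argmin(dist[misses])]]
        # reward features that separate the classes, penalize within-class spread
        w += -(xi - near_hit) ** 2 + (xi - near_miss) ** 2
    return w / n_draws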

As with your usual KNN estimator, taking only a few neighbors into account may lead to noisy scores and overfitting. The ReliefF algorithm, besides generalizing Relief to multi-class targets, also incorporates information from k neighbors when estimating importance. In general (this holds for other RBAs as well), more neighbors yield more accurate scores, up to a point, but take a longer time to compute.

In comparison with previous approaches, this one is totally non-parametric, and in contrast with the previous methods that relied on information theory, Relief derives importance from the concept of distance.

I-RELIEF proposed using an instance-weighting function over the entire set of hit and miss instances instead of only the closest one of each class. SURF uses a distance threshold T to define instances as neighbors (where T equals the average distance between all instance pairs in the data); however, in contrast with Iterative Relief, SURF uses equal instance weights for all instances defined as neighbors [15]. There are other variants, like VLSReliefF and ReliefSeq, which will not be covered here. This Medium post offers a more thorough introduction to Relief-based methods [16].

While they are not the same, Relief-based methods share similarities with Laplacian methods for feature selection. The Laplacian score method likewise assumes that observations of the same class tend to be closer to each other. In Laplacian feature selection, we embed the data in a nearest-neighbor graph, measure some distance, and compute the weight matrix. Then we calculate the Laplacian criterion for each feature, where the smallest values correspond to the most important features. In practice, when selecting a subset of features, a clustering algorithm (such as k-means) is usually used to select the most effective group [source].

Three Versions of Relief, figure by the author. The figure above demonstrates which neighbors are taken into account under different Relief Algorithms. The upper left panel presents the basic version which we have covered above. In this version, we take only the closest nearHit and nearMiss cases when updating weights. The upper right panel presents ReliefF which takes k different neighbors into account, 4 in this case. The bottom panel presents the neighbors which are taken into account when updating the weights, under SURF/SURF* algorithms. Instead of taking into account a specific number of neighbors, in these algorithms, we define a boundary by which neighbors who are closer than it are taken into account.

Two notes about Relief algorithms:

  • There are a few implementations of Relief (and its variants) in Python (sklearn-relief and [16], for example), but scikit-rebate seems the most detailed.
  • Maybe one of the reasons RBAs have become so popular is their flexibility: RBAs are compatible with a mix of categorical and continuous features, and with multi-class classification and regression. Nevertheless, as they rely on distance (and KNN), they suffer extensively from the curse of dimensionality. The user guide says:

In very large feature spaces users can expect core Relief-based algorithm scores to become less reliable when run on their own. This is because as the feature space becomes very large, the determination of nearest neighbors becomes more random. As a result, in very large feature spaces (e.g. > 10,000 features), users should consider combining a core Relief-based algorithm with an iterative approach such as TuRF (implemented here) or VLSRelieF, or Iterative Relief.

and

When scaling up to big data problems, keep in mind that the data aspect that slows down ReBATE methods the most is the number of training instances, since Relief algorithms scale linearly with the number of features, but quadratically with the number of training instances. This is the result of Relief-based methods needing to calculate a distance array (i.e. all pairwise distances between instances in the training dataset). If you have a very large number of training instances available, consider utilizing a class balanced random sampling of that dataset when running any ReBATE methods to save on memory and computational time.

And for the specific time complexity of Relief-Based Algorithms, visit Table 5 in [15].
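If you prefer to call scikit-rebate directly rather than through FSM, a minimal sketch could look like the following (the parameter values are illustrative; X_ and y_ are the same pandas objects used in the FSM examples, and scikit-rebate expects NumPy arrays):

import pandas as pd
from skrebate import ReliefF

fs = ReliefF(n_features_to_select=20, n_neighbors=100, n_jobs=-1)
fs.fit(X_.fillna(-1).values, y_.values)
relief_ranking = pd.Series(fs.feature_importances_, index=X_.columns).sort_values(ascending=False)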

Implementation with FSM:

selectors = FSM.FSM(k=20, filler = -1)
selectors.ReliefF(X_,y_)
selectors.SURF(X_,y_)
selectors.SURFstar(X_,y_)
selectors.MultiSURF(X_,y_)
# Note: X_ and y_ have N=10,000, which may lead to a long computation time.

Other methods

While we have covered a broad list of algorithms, we could only scratch the surface. We haven’t covered unsupervised algorithms for feature selection [17], [18], spectral methods for feature selection [19], or graph-based methods [20], [21], [22]. Other filter methods include Boruta, information fragment (IF), SOA, and more. As the headline says, this post is meant for the lazy data scientist; if you aren't lazy, I hope you’ll find this article's bibliography interesting.

Integration

OK, if you’ve made it this far, you may have a question that’s been bothering you: you tried different methods and got different scores, which are not necessarily consistent with each other. What should you do?

According to [23], when ensembling different selection methods, we have to consider the following:

  • Which and how many methods should we use? They find that the more, the better.
  • How many training sets, and of what size, should we use? Should we subset our data? Bootstrap? As we saw above, some of our algorithms are quadratic in the number of observations. Overall, the answer is yes; we should resample and select features on each sample, and eventually combine the information with the scoring methods below.
  • How many features do we want to end up with? This number can be defined as a percentage of all the features we have, or decided in advance.
  • How should we assess the aggregation method? I’ll devote this part to trying to answer this question.

Say you have E different lists of selected features; you can do one or more of the following.

  1. Combine your E lists into one based on a selection rule (examples below), then train one model using the combined list.
  2. Use a feature-selector ensemble: train E different “thin” models; later, when a new dataset arrives, predict the outcome with each of the E models and combine their predictions.
  3. A middle ground between 1 and 2: convert the E lists of features into E’ lists of combined features based on E’ different rules, then train E’ (<< E) models as an ensemble.

Ranking features across the different lists can be done in one of the following ways (a small aggregation sketch follows the list):

  • Min: assign to each element the smallest (best) rank it has achieved.
  • Median: assign to each element the median rank it has achieved.
  • Arithmetic Mean: assign to each element the mean of the ranks it has achieved.
  • Geometric Mean: assign to each element the geometric mean of the ranks it has achieved.
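A small aggregation sketch, assuming each selector produced a pandas Series mapping feature names to ranks (1 = best); the feature names and numbers below are toy values:

import numpy as np
import pandas as pd

rank_lists = [
    pd.Series({"int_rate": 1, "dti": 2, "annual_inc": 3}),
    pd.Series({"dti": 1, "annual_inc": 2, "int_rate": 3}),
]
ranks = pd.concat(rank_lists, axis=1)  # rows = features, columns = selectors
aggregated = pd.DataFrame({
    "min": ranks.min(axis=1),
    "median": ranks.median(axis=1),
    "mean": ranks.mean(axis=1),
    "geometric_mean": np.exp(np.log(ranks).mean(axis=1)),
}).sort_values("mean")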

We could use voting [24]:

from collections import Counter
from itertools import chain
counter = Counter(chain.from_iterable(features_selected_lists))  # flatten the list of lists, then count votes
features_voted = [k for k, v in counter.items() if v >= min_votes]

While FSM doesn’t tell you how to rank your features, it comes with a convenient inference function that uses the bootstrap and returns a list of features, ordered by importance, for each of the selectors and for each sample. All the preprocessing (discretization, normalization (recall that chi2, for example, can’t handle negative values), and encoding) is done internally. In the example below I used only two samplings, with a sample size of 100 observations each.

selectors = FSM.FSM(k=20, filler = -1)
results = selectors.Bootstrapper(X_, y_, B=2,Sample_size=100)
# Results:
# {'ANOVA': [array(['loan_amnt', 'int_rate', 'annual_inc', 'delinq_2yrs',
# 'inq_last_6mths', 'mths_since_last_delinq',
# 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
# 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
# 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_late_fee',
# 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt'],
# dtype=object),
# array(['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
# 'int_rate', 'annual_inc', 'inq_last_6mths',
# 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
# 'revol_util', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
# 'total_pymnt_inv', 'total_rec_prncp', 'recoveries',
# 'collection_recovery_fee', 'last_pymnt_amnt'], dtype=object)],
# 'chi2': [array(['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
# 'annual_inc', 'mths_since_last_delinq', 'mths_since_last_record',
# 'open_acc', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv',
# 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
# ...

How to Assess an Ensemble of Feature Selectors

Usually, we look at three KPIs. Diversity: just as in ensembles of decision trees, where we have no use for K different trees that always give the same results, we don’t need K different selectors that select the same features regardless of the data. Stability: we have to ensure that an ensemble returns similar rankings for similar datasets. Lastly, your performance KPI (which could be accuracy, AUC, or whatever): using redundant features can hurt performance, so we should expect better results, and faster training, from using the selected subset.

Diversity: Kuncheva and Whitaker [25] present a comprehensive survey of measures for the diversity of feature-selector ensembles. The setup is that we have two or more trained models, each trained on a different subset of features, together with their predictions: some correct, some incorrect, some identical, and some different. They present and compare ten measures of diversity: the Q-statistic, Correlation Coefficient, Disagreement Measure, Double-Fault Measure, Kohavi-Wolpert Variance, Interrater Agreement, Entropy Measure, Measure of Difficulty, Generalised Diversity, and Coincident Failure Diversity. Nevertheless, as this article is long enough as it is, I’ll go over the Q-statistic and leave the rest to the curious reader.

The Q-statistic is a pairwise test. Given two classifiers, i and j, we have two prediction vectors of length N (the number of observations). We cross-tabulate how often each classifier is correct, using the notation N11 for the number of observations both classify correctly, N00 for those both get wrong, and N10 and N01 for the disagreements:

Cross-Tab of the prediction of two distinct classifiers. Table by the author.

Then the Q-statistic for a pair of classifiers i and j is

Q(i,j) = (N11·N00 − N01·N10) / (N11·N00 + N01·N10)

And if we have L different classifiers, we report the average Q-statistic over all pairs:

Q_avg = 2 / (L(L−1)) · Σ_{i<j} Q(i,j)

For statistically independent classifiers, the expectation of Q(i,j) is 0. Q varies between −1 and 1. Classifiers that tend to recognize the same objects correctly will have positive values of Q, and those which commit errors on different objects will render Q negative.
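A short sketch of both quantities, assuming a boolean matrix whose rows mark which observations each classifier predicted correctly (the function names are mine):

import numpy as np
from itertools import combinations

def q_statistic(correct_i, correct_j):
    n11 = np.sum(correct_i & correct_j)    # both correct
    n00 = np.sum(~correct_i & ~correct_j)  # both wrong
    n10 = np.sum(correct_i & ~correct_j)
    n01 = np.sum(~correct_i & correct_j)
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

def average_q(correct_matrix):
    pairs = combinations(range(correct_matrix.shape[0]), 2)
    return np.mean([q_statistic(correct_matrix[i], correct_matrix[j]) for i, j in pairs])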

Stability ensures that an ensemble will return similar rankings for similar datasets.

There are several ways to calculate stability ([26] and [27], for example). I’ll present a more up-to-date one from [28], using the presentation in [29]. Say we have p original features, from which we want to select k. We split our data into m different subsets (or resample from it). Define h_j as the number of times feature j is chosen among the k most important features across the m subsets, and define q as the total number of selections summed over all p features (that is, h_1 + h_2 + … + h_p). Then the following

Nogueira et al. (2018), formula from Hopf and Reifenrath 2021

is a stability metric for a feature selection ensemble that is fully defined, strictly monotonic, and bounded from above and below, and that attains its maximum value when all selected feature sets are identical.
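For completeness, here is a sketch of that stability measure under my reading of [28]; nogueira_stability is my own helper, so double-check the exact form against the paper before relying on it:

import numpy as np

def nogueira_stability(selection_matrix):
    # rows = m runs, columns = p features; entry 1 if the feature was selected in that run
    Z = np.asarray(selection_matrix, dtype=float)
    m, p = Z.shape
    p_hat = Z.mean(axis=0)                        # selection frequency of each feature
    s2 = m / (m - 1) * p_hat * (1 - p_hat)        # per-feature sample variance
    k_bar = Z.sum(axis=1).mean()                  # average number of selected features per run
    return 1 - s2.mean() / ((k_bar / p) * (1 - k_bar / p))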

Performance is our primary goal here, and using redundant or simply unimportant features can decrease our model's performance. We can measure the success of the selection method by comparing our KPI after selection with the baseline scenario in which all the variables are used. [24] performs t-tests to compare the KPI of the post-selection model with that of the model without selection.

Bibliography

  1. Hopf, Konstantin, and Sascha Reifenrath. “Filter Methods for Feature Selection in Supervised Machine Learning Applications — Review and Benchmark.” arXiv preprint arXiv:2111.12140 (2021).
  2. Bommert, Andrea, et al. “Benchmark for filter methods for feature selection in high-dimensional classification data.” Computational Statistics & Data Analysis 143 (2020): 106839.
  3. Yu, Lei, and Huan Liu. “Feature selection for high-dimensional data: A fast correlation-based filter solution.” Proceedings of the 20th international conference on machine learning (ICML-03). 2003.
  4. Pilnenskiy, Nikita, and Ivan Smetannikov. “Feature selection algorithms as one of the python data analytical tools.” Future Internet 12.3 (2020): 54.
  5. Ross, Sheldon M. Introductory statistics. Academic Press, 2017.
  6. “MRMR” Explained Exactly How You Wished Someone Explained to You, Samuele Mazzanti, Towards Data Science, Medium.
  7. Zhao, Zhenyu, Radhika Anand, and Mallory Wang. “Maximum relevance and minimum redundancy feature selection methods for a marketing machine learning platform.” 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2019.
  8. Bennasar, Mohamed, Yulia Hicks, and Rossitza Setchi. “Feature selection using joint mutual information maximisation.” Expert Systems with Applications 42.22 (2015): 8520–8532.
  9. Meyer, Patrick E., and Gianluca Bontempi. “On the use of variable complementarity for feature selection in cancer classification.” Workshops on applications of evolutionary computation. Springer, Berlin, Heidelberg, 2006.
  10. Battiti, Roberto. “Using mutual information for selecting features in supervised neural net learning.” IEEE Transactions on neural networks 5.4 (1994): 537–550.
  11. Tesmer, Michel, and Pablo A. Estévez. “AMIFS: Adaptive feature selection by using mutual information.” 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. №04CH37541). Vol. 1. IEEE, 2004.
  12. Cheng, Hongrong, et al. “Conditional Mutual Information‐Based Feature Selection Analyzing for Synergy and Redundancy.” ETRI Journal 33.2 (2011): 210–218.
  13. Kira, Kenji, and Larry A. Rendell. “A practical approach to feature selection.” Machine learning proceedings 1992. Morgan Kaufmann, 1992. 249–256.
  14. Robnik-Šikonja, Marko, and Igor Kononenko. “Theoretical and empirical analysis of ReliefF and RReliefF.” Machine learning 53.1 (2003): 23–69.
  15. Urbanowicz, Ryan J., et al. “Relief-based feature selection: Introduction and review.” Journal of biomedical informatics 85 (2018): 189–203.
  16. Yash Dagli, Feature selection using Relief algorithms with python example., Medium.
  17. Chen, Renjie, et al. “Supervised feature selection with a stratified feature weighting method.” IEEE Access 6 (2018): 15087–15098.
  18. Wang, Shiping, and William Zhu. “Sparse graph embedding unsupervised feature selection.” IEEE Transactions on Systems, Man, and Cybernetics: Systems 48.3 (2016): 329–341.
  19. Zhao, Zheng Alan, and Huan Liu. Spectral feature selection for data mining. Taylor & Francis, 2012.
  20. Akhiat, Yassine, et al. “A new graph feature selection approach.” 2020 6th IEEE Congress on Information Science and Technology (CiSt). IEEE, 2021.
  21. Zhang, Zhihong, and Edwin R. Hancock. “A graph-based approach to feature selection.” International workshop on graph-based representations in pattern recognition. Springer, Berlin, Heidelberg, 2011.
  22. Hu, Rongyao, et al. “Graph self-representation method for unsupervised feature selection.” Neurocomputing 220 (2017): 130–137.
  23. Bolón-Canedo, Verónica, and Amparo Alonso-Betanzos. “Ensembles for feature selection: A review and future trends.” Information Fusion 52 (2019): 1–12.
  24. Cohen, Cohen, and Atlas, Combining Feature Selection Methods, vatvengers, Medium.
  25. Kuncheva, Ludmila I., and Christopher J. Whitaker. “Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy.” Machine learning 51.2 (2003): 181–207.
  26. Kuncheva, Ludmila I. “A stability index for feature selection.” Artificial intelligence and applications. 2007.
  27. Jurman, Giuseppe, et al. “Canberra distance on ranked lists.” Proceedings of advances in ranking NIPS 09 workshop. Citeseer, 2009.
  28. Nogueira, Sarah, Konstantinos Sechidis, and Gavin Brown. “On the stability of feature selection algorithms.” J. Mach. Learn. Res. 18.1 (2017): 6345–6398.
  29. Hopf, Konstantin, and Sascha Reifenrath. “Filter Methods for Feature Selection in Supervised Machine Learning Applications — Review and Benchmark.” arXiv preprint arXiv:2111.12140 (2021).
