Are you dropping too many correlated features?

An analysis of current methods and a proposed solution

Brian Pietracatella
Towards Data Science


Photo by Karla Rivera on Unsplash

Update: The updated Python correlation function described in this article can be found in the exploretransform package on PyPI.

Summary

Some commonly used correlation filtering methods tend to drop more features than required. This problem is amplified as datasets grow larger and contain more pairwise correlations above a specified threshold. If we drop more variables than necessary, less information is available, potentially leading to suboptimal model performance. In this article, I will demonstrate the shortcomings of current methods and propose a possible solution.

Example

Let’s look at an example of how current methods drop features that should have remained in the dataset. We will use the Boston Housing revised dataset and show examples in both R and Python.

R: The code below uses the findCorrelation() function from the caret package to determine which columns should be dropped.

The function determined that [ ‘indus’, ‘nox’, ‘lstat’, ‘age’, ‘dis’ ] should be dropped based on the correlation cutoff of 0.6.

Python: Python doesn’t have a built-in function like findCorrelation(), so I wrote a function called corrX_orig().

We get the same result as R: drop columns [ ‘indus’, ‘nox’, ‘lstat’, ‘age’, ‘dis’ ]

Unfortunately, both are incorrect. The age column shouldn’t have been dropped. Let’s explore why.

How do these functions work?
A correlation matrix is created first. These numbers represent the pairwise correlations for all combinations of numeric variables.

Correlation Matrix for Boston Housing

Then, the mean correlation for each variable is calculated. This can be accomplished by taking the mean of every row or every column since they are equivalent.

Mean Correlations for Columns and Rows

Next, the lower triangle of the matrix and the diagonal are masked. We don’t need the lower triangle because the same information exists on either side of the diagonal (see matrix above). We don’t need the diagonal because it represents each variable’s correlation with itself (always 1).

Matrix with Lower Triangle and Diagonal Masked
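To make these preparation steps concrete, here is a minimal pandas/NumPy sketch. The dataframe below is random stand-in data with a few Boston-style column names, not the real dataset.

```python
import numpy as np
import pandas as pd

# Stand-in data; in the article this would be the numeric Boston Housing columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["indus", "nox", "age", "dis"])

cm = df.corr().abs()        # pairwise correlation matrix (absolute values)
avg_corr = cm.mean()        # mean correlation per variable (rows and columns agree)

# Mask the diagonal and lower triangle, keeping only the upper triangle
upper = cm.where(np.triu(np.ones(cm.shape, dtype=bool), k=1))
print(upper)
```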

Here is pseudocode to demonstrate how the rest of the function works. I hard-coded 0.6 as the correlation cutoff for this example:
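The original gist isn’t reproduced here; the sketch below implements the sequential behaviour it describes. The function name corr_drop_sequential is mine, not the author’s, and the 0.6 cutoff is hard-coded as in the article.

```python
import numpy as np
import pandas as pd

def corr_drop_sequential(df: pd.DataFrame, cut: float = 0.6) -> list:
    """Mimic the original behaviour: walk the upper triangle and immediately
    drop the variable with the higher mean correlation in each offending pair."""
    cm = df.corr().abs()
    avg_corr = cm.mean()
    upper = cm.where(np.triu(np.ones(cm.shape, dtype=bool), k=1))

    drop = []
    for v1 in upper.index:
        for v2 in upper.columns:
            if v1 in drop or v2 in drop:          # already dropped -- the decision is final
                continue
            corr = upper.loc[v1, v2]
            if pd.notna(corr) and corr > cut:
                # drop whichever variable has the higher mean correlation
                drop.append(v1 if avg_corr[v1] > avg_corr[v2] else v2)
    return drop
```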

Now to the part you’ve been waiting for. Why should the functions not drop age?

Below is a table showing the variable states I captured from the original function. Remember, the functions told us to drop [ ‘indus’, ‘nox’, ‘lstat’, ‘age’, ‘dis’ ]. So we manually eliminate [ ‘indus’, ‘nox’, ‘lstat’, ‘dis’ ] from the table. As you can see, there are no other variables left to compare against age to make a drop decision. Therefore, age should not be dropped.

But why is this happening?

Because of the sequential nature of the R and Python functions, they are unable to consider the state of all the variables holistically. The decision to drop variables happens in order and is final.

How can we prove age belongs in the dataset?

We can remove age from the drop list, resulting in [ ‘indus’, ‘nox’, ‘dis’, ‘lstat’ ], and then remove those four columns from the original dataset. When we rerun the functions on this subset of variables, we would expect ‘age’ in the output if it really should be dropped. If we get no output, ‘age’ should have stayed in the set.

As you will see below, both functions return no output. Age should have stayed.

R

Python
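The original gist isn’t reproduced here; below is a minimal sketch of the check, reusing the corr_drop_sequential() helper sketched earlier and assuming boston holds the numeric Boston Housing predictors.

```python
# boston: dataframe of numeric Boston Housing predictors (assumed already loaded)
subset = boston.drop(columns=['indus', 'nox', 'dis', 'lstat'])

print(corr_drop_sequential(subset, cut=0.6))
# Expected: an empty list -- no further drops, so 'age' should have been kept
```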

Brief Recap

In this example, we have demonstrated that commonly used correlation filter functions drop more columns than necessary. My assertion is that this is due to the sequential way each cell in the correlation matrix is evaluated and dropped.

So what is the solution?

  1. Log the variable states based on the original logic
  2. Calculate which variables to drop at the end, using the log

Original: The original solution drops the columns sequentially, immediately, and with finality.

Revised (1): Capture the variable states, without dropping, into a dataframe res (sketched below).

Revised (2): Calculate which variables to drop using res.
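A sketch of the revised capture step. corrX_new here follows the description above (log every pair over the cutoff, plus the provisional drop decision, into a dataframe res); it is a reconstruction, not necessarily the author’s exact code.

```python
import numpy as np
import pandas as pd

def corrX_new(df: pd.DataFrame, cut: float = 0.6) -> pd.DataFrame:
    """Log the state of every pair above the cutoff instead of dropping immediately."""
    cm = df.corr().abs()
    avg_corr = cm.mean()
    upper = cm.where(np.triu(np.ones(cm.shape, dtype=bool), k=1))

    rows = []
    for v1 in upper.index:
        for v2 in upper.columns:
            corr = upper.loc[v1, v2]
            if pd.notna(corr) and corr > cut:
                rows.append({
                    'v1': v1, 'v2': v2,
                    'v1.mean': avg_corr[v1], 'v2.mean': avg_corr[v2],
                    'corr': corr,
                    # provisional decision: the variable with the higher mean correlation
                    'drop': v1 if avg_corr[v1] > avg_corr[v2] else v2,
                })
    return pd.DataFrame(rows, columns=['v1', 'v2', 'v1.mean', 'v2.mean', 'corr', 'drop'])
```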

Below is the output of res, containing the variable states, along with the variable definitions:

v1, v2: The row and column being analyzed
v1.mean, v2.mean: The average correlation for v1 and v2, respectively
corr: The pairwise correlation between v1 and v2
drop: The initial drop decision: the variable with the higher of v1.mean and v2.mean

Captured variable states (res)

Revised (2): Steps in the drop calculation

I would encourage the reader to manually walk through the steps below using the captured variable states (res) shown above. Code for each step is included, and the entire calcDrop() function appears at the end of this section.

Step 1: all_vars_corr = All variables that exceeded the correlation cutoff of 0.6. Since our logic will capture variables meeting this condition, this will be the set of unique variables in columns v1 + v2 from the res table above.
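One way to write this step (res is the captured-state dataframe produced by the corrX_new sketch above):

```python
# Step 1: unique variables appearing in either the v1 or v2 column
all_vars_corr = list(set(res['v1'].tolist() + res['v2'].tolist()))
```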

Result: [‘tax’, ‘indus’, ‘lstat’, ‘rm’, ‘zn’, ‘age’, ‘nox’, ‘dis’]

Step 2: poss_drop = Unique variables from the drop column. These may or may not be dropped in the end.
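Continuing with the same res dataframe:

```python
# Step 2: candidate drops -- every variable that appears in the drop column
poss_drop = list(set(res['drop'].tolist()))
```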

Result: [‘indus’, ‘lstat’, ‘age’, ‘nox’, ‘dis’]

Step 3: keep = Variables from v1 and v2 not in poss_drop. Essentially, any variable that isn’t a possible drop will be kept.
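Reusing all_vars_corr and poss_drop from the previous steps, this might look like:

```python
# Step 3: variables never flagged as a drop candidate are definitely kept
keep = list(set(all_vars_corr).difference(set(poss_drop)))
```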

Result: [‘zn’, ‘tax’, ‘rm’]

Step 4: drop = Variables from v1 and v2 appearing in the same row as keep. If we know which variables to keep, then any variable paired with those will be dropped.
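A sketch of this step (keep comes from Step 3):

```python
# Step 4: collect everything sharing a row with a keep variable, then remove the keeps
paired = res[res['v1'].isin(keep) | res['v2'].isin(keep)][['v1', 'v2']]
paired_vars = set(paired['v1'].tolist() + paired['v2'].tolist())
drop = list(paired_vars.difference(keep))
```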

Result: [‘lstat’, ‘nox’, ‘dis’, ‘indus’]

Step 5: poss_drop = Remove drop variables from poss_drop. We are removing variables we know we are dropping from the list of possibles.
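In code:

```python
# Step 5: anything already in drop no longer needs to be considered
poss_drop = list(set(poss_drop).difference(set(drop)))
```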

Result: [‘age’] This is the last variable left out of the possibles.

Step 6: Subset the dataframe to rows where poss_drop variables appear in v1 or v2. We want to see if there is any reason to drop age.
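This subsetting might look like the following (the name subset is mine):

```python
# Step 6: rows where a remaining candidate (here only 'age') appears in v1 or v2
subset = res[res['v1'].isin(poss_drop) | res['v2'].isin(poss_drop)][['v1', 'v2', 'drop']]
```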

Result of Step 6

Step 7: Remove rows where drop variables appear in v1 or v2 and store the unique variables from the drop column in more_drop. Here we are removing rows we know contain variables we are dropping. In this small example, we get an empty set, since every row contained a variable we already know we are dropping. This is the correct result: age is not in this set.
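A possible implementation, filtering the Step 6 subset:

```python
# Step 7: discard rows involving a confirmed drop, then collect what is left in the drop column
remaining = subset[~subset['v1'].isin(drop) & ~subset['v2'].isin(drop)]
more_drop = set(remaining['drop'].tolist())
```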

Result: set()

Step 8: Add the more_drop variables to drop and return drop.
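And the last step of the sketch:

```python
# Step 8: combine the confirmed drops with anything surviving in more_drop
for var in more_drop:
    if var not in drop:
        drop.append(var)
```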

Result: [‘lstat’, ‘nox’, ‘dis’, ‘indus’]. After manually completing the steps on the res table, more_drop doesn’t contain age, which is exactly what we expect.

Here is the entire calcDrop() function:
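(The published version is in the exploretransform package; the version below simply assembles the eight steps above into one function and may differ from the author’s code in detail.)

```python
import pandas as pd

def calcDrop(res: pd.DataFrame) -> list:
    # Step 1: every variable involved in a correlation above the cutoff
    all_vars_corr = list(set(res['v1'].tolist() + res['v2'].tolist()))

    # Step 2: candidate drops (variables appearing in the drop column)
    poss_drop = list(set(res['drop'].tolist()))

    # Step 3: variables never flagged for dropping are kept
    keep = list(set(all_vars_corr).difference(set(poss_drop)))

    # Step 4: anything paired with a keep variable gets dropped
    paired = res[res['v1'].isin(keep) | res['v2'].isin(keep)][['v1', 'v2']]
    paired_vars = set(paired['v1'].tolist() + paired['v2'].tolist())
    drop = list(paired_vars.difference(keep))

    # Step 5: remove confirmed drops from the candidate list
    poss_drop = list(set(poss_drop).difference(set(drop)))

    # Step 6: rows involving the remaining candidates
    subset = res[res['v1'].isin(poss_drop) | res['v2'].isin(poss_drop)][['v1', 'v2', 'drop']]

    # Step 7: ignore rows containing confirmed drops; whatever remains must also go
    remaining = subset[~subset['v1'].isin(drop) & ~subset['v2'].isin(drop)]
    more_drop = set(remaining['drop'].tolist())

    # Step 8: combine and return
    for var in more_drop:
        if var not in drop:
            drop.append(var)
    return drop
```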

Brief Recap

In this example, we have demonstrated a revised pair of functions for filtering variables based on correlation. The functions work in the following way:

  1. corrX_new: Log the variable states based on the original logic
  2. calcDrop: Calculate which variables to drop

Final Example

Let’s use the mdrr dataset from R’s caret package, which contains many correlated features. We will use the old and new functions in this section, and it will be less verbose since we’ve already covered the general testing routine.

R (original)

findCorrelation() drops 203 columns

Python (original)

corrX_orig() drops 203 columns

Python (revised)
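The original gist isn’t shown here; usage of the revised pair might look roughly like the sketch below. The CSV export and file name are hypothetical (the mdrr predictors would first need to be written out from R, e.g. with write.csv(mdrrDescr, "mdrr.csv")), and the cutoff value is illustrative.

```python
import pandas as pd

# Hypothetical export of the mdrr predictors from R
mdrr = pd.read_csv("mdrr.csv")

res = corrX_new(mdrr, cut=0.6)   # capture the variable states
drop = calcDrop(res)             # decide the drops from the captured states
print(len(drop))
```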

The revised functions identify 9 columns that shouldn’t have been dropped from the dataset. Let’s confirm this in R and Python.

R

When the columns identified by Python are added back to the main set in R, no further drops are identified.

Python

The results in Python are identical. The columns [‘DDI’, ‘ZM1V’, ‘X2v’, ‘piPC05’, ‘VAR’, ‘SPAN’, ‘QYYe’, ‘GMTIV’, ‘X5sol’] shouldn’t have been dropped originally.

Conclusion

In this article, we have demonstrated how commonly used correlation filtering methods tend to drop features unnecessarily. We’ve shown how the problem is exacerbated as the data becomes larger. Although we haven’t shown evidence, it’s a fair assumption that unnecessary feature removal can have a negative effect on model performance.

We have also provided an effective solution with code, explanations, and examples. In a future article, we will extend this solution by adding target correlation to the filtering decision.

Feel free to reach out to me on LinkedIn.
