Brute force techniques of variable selection for classification problems

Abhishek Jhunjhunwala
Towards Data Science
5 min read · May 18, 2019


Variable selection is an important step in building accurate and reliable prediction models, and one that requires a lot of creativity, intuition and experience. You must have come across hundreds of articles on variable selection techniques, many of which talk about understanding the data and getting a feel for it from a statistical point of view: checking which variables have low variance, which ones are highly correlated with each other and with the target variable, and so on. All of these are useful techniques, until you come across a dataset containing around 20k variables without comprehensive metadata! I work for a major automobile company as a Data Scientist and was trying to build a model using the internal signals generated by ADAS (Advanced Driver Assistance System). These signals generally comprise various velocities, accelerations, distances from detected objects, the brake pedal state and a lot of state variables for the system.

Now, I had a lot of use cases requiring predictive modelling, but we will consider just one of them so that we can relate better. ADAS has a collision warning system that issues a warning whenever it detects a possible collision with a nearby object. The warning is mostly correct, but at times it is a false warning, which is inconvenient for the driver. Our goal is to build a model that can predict whether a collision warning is false or not. Let's not be swayed by the problem too much and concentrate on the data: our target variable is binary, and we have around 20,000 variables comprising continuous, categorical and discrete variables.

So, our variable selection technique will take one of two directions based on our requirement: do we need to be able to explain our model in detail to the end users, stating the variables on which it is based? Or do we just want maximum accuracy? If the former is the case, we need to filter variables on the basis of their predictive power, i.e. their influence on the target variable.

Point Biserial Correlation

If the target variable is binary, point biserial correlation is one good way to select variables. A point-biserial correlation measures the strength and direction of the association between one continuous variable and one dichotomous variable. It is a special case of Pearson's product-moment correlation, which is applied when you have two continuous variables, whereas in this case one of the variables is binary. Like other correlation coefficients, the point biserial ranges from −1 to +1, where 0 means no relationship and ±1 a perfect relationship.

We calculate this coefficient between every continuous variable and the target variable, then select the most highly correlated variables based on a threshold on the absolute coefficient (say 0.5, for example). Using this, I was able to narrow around 15,000 variables in my dataset down to 50.
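
As a minimal sketch (df and target are placeholder names, not the author's original code), this filter can be written with scipy's pointbiserialr:

```python
from scipy.stats import pointbiserialr

def select_by_point_biserial(df, target, threshold=0.5):
    """Keep continuous columns whose absolute point-biserial correlation
    with the binary target meets the threshold."""
    selected = []
    for col in df.drop(columns=[target]).columns:
        r, _ = pointbiserialr(df[target], df[col])  # (correlation, p-value)
        if abs(r) >= threshold:
            selected.append(col)
    return selected
```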

Chi Square

The next technique is the chi-squared test, which is used to test the independence of two events. Given the data for two events, we can compute the observed counts and the expected counts, and the test measures how much the two deviate from each other. The chi-square statistic is commonly used for testing relationships between categorical variables; the null hypothesis of the test is that no relationship exists between the categorical variables in the population. You can read more about how to conduct the test here: https://www.statisticssolutions.com/using-chi-square-statistic-in-research. So I won't go into the details of the test, but the idea is to use it to filter out the categorical variables that are independent of the target.
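
As a hedged sketch of that filter (df, target and the 0.05 significance level are assumptions, not values from the article), we can build the observed counts with a contingency table and let scipy compute the test:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def select_by_chi_square(df, categorical_cols, target, alpha=0.05):
    """Keep categorical columns for which the chi-square test rejects
    independence from the target at significance level alpha."""
    selected = []
    for col in categorical_cols:
        observed = pd.crosstab(df[col], df[target])  # observed counts
        chi2_stat, p_value, dof, expected = chi2_contingency(observed)
        if p_value < alpha:  # dependent on the target, so keep it
            selected.append(col)
    return selected
```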

Feature Variance

Another technique for filtering out variables that won't add value to our model is the variance threshold method, which removes all features whose variance does not meet some threshold. This method can be used before any of the above methods as a starting point, and it doesn't really involve the target variable at all. By default it removes all zero-variance features, i.e. features that have the same value in every sample.
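
A minimal sketch with scikit-learn's VarianceThreshold, assuming df holds only numeric feature columns:

```python
from sklearn.feature_selection import VarianceThreshold

def drop_low_variance(df, threshold=0.0):
    """Return df without the columns whose variance does not exceed the
    threshold (the default 0.0 drops only constant columns)."""
    selector = VarianceThreshold(threshold=threshold).fit(df)
    return df.loc[:, selector.get_support()]
```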

Feature Importance

We can use the feature_importances_ attribute of a tree-based classifier to get the importance of each feature with respect to the target variable:
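
The original snippet is not reproduced here, so below is a minimal sketch with scikit-learn's RandomForestClassifier; X (a DataFrame of candidate features) and y (the binary target) are assumed:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Rank features by the model's importance scores.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(20))
```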

We can then select the top n features based on their importance, which will give good results with whatever classification algorithm we might want to use.

Linear Discriminant Analysis

Next, let us take a look at some transformation techniques, where we won't be able to retain our original variables but can get much higher accuracy. LDA and PCA are two such techniques, and although PCA is known to perform well in a lot of cases, we will mostly focus on LDA and its application. But first, let's quickly take a look at some of the main differences between LDA and PCA:

Linear Discriminant Analysis often outperforms PCA in multi-class classification tasks where the class labels are known. In some cases, however, PCA performs better, usually when the sample size for each class is relatively small; a good example is the comparison of classification accuracies in image recognition. Linear Discriminant Analysis helps reduce a high-dimensional dataset to a lower-dimensional space, the goal being to do this while keeping a decent separation between classes and reducing the resources and cost of computation. We can use the explained_variance_ratio_ attribute of the LDA class to decide the number of components that best suits our needs. Take a look:
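
Again the original snippet is not reproduced, so here is a sketch with scikit-learn's LinearDiscriminantAnalysis (X and y assumed as before). Note that LDA yields at most (number of classes − 1) components, so a binary target gives a single discriminant axis:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)  # binary target: 1 component
X_lda = lda.fit_transform(X, y)
print(lda.explained_variance_ratio_)  # variance explained per component
```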

If we are not interested in retaining the original feature set, LDA provides a nice way to project our data onto a small number of new dimensions that make a better input for a classification algorithm.

Now, let's put together all the techniques we discussed for the non-transformation case as a series of steps in the form of code:
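
The original code block is not reproduced here, so the following is a reconstruction under stated assumptions: df is a pandas DataFrame containing a binary target column, the caller supplies the lists of continuous and categorical columns, and every default threshold is a placeholder to be tuned rather than a value from the article:

```python
import pandas as pd
from scipy.stats import pointbiserialr, chi2_contingency
from sklearn.ensemble import RandomForestClassifier

def variable_selection(df, target, continuous_cols, categorical_cols,
                       var_threshold=0.0, pb_threshold=0.5,
                       chi2_alpha=0.05, importance_threshold=0.01):
    """Return a dataframe with only the columns that survive the variance,
    point-biserial, chi-square and feature-importance filters."""
    y = df[target]

    # Step 1: variance filter on the continuous features.
    cont = [c for c in continuous_cols if df[c].var() > var_threshold]

    # Step 2: point-biserial correlation with the binary target.
    cont = [c for c in cont
            if abs(pointbiserialr(y, df[c])[0]) >= pb_threshold]

    # Step 3: chi-square independence test on the categorical features.
    cats = []
    for c in categorical_cols:
        _, p_value, _, _ = chi2_contingency(pd.crosstab(df[c], y))
        if p_value < chi2_alpha:
            cats.append(c)

    # Step 4: feature importance from a tree-based model on the survivors.
    X = pd.get_dummies(df[cont + cats], columns=cats)  # encode categoricals
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)

    return X.loc[:, importances >= importance_threshold]
```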

The function variable_selection returns a new dataframe containing only the columns selected after passing through the various filters. The inputs to the function, such as the variance threshold, point biserial coefficient threshold, chi-square test threshold and feature importance threshold, should be chosen carefully to get the desired output.

Note that this is just a framework for variable selection, which is essentially a manual exercise for the most part. I wrote this code because I felt a lot of the process can be automated or semi-automated to save time. My advice is to use this function as a starting point, make whatever changes you deem fit for a particular dataset and problem statement, and then use it.

Good luck!
