
Feature Selection for Data Science: Simple Approaches

By getting rid of features that are irrelevant to our prediction task, we can build a better predictive model.

Every data scientist has either confronted or will come across this problem: a huge dataset with so many features that they don't even know where to start. While there are many advanced methods for selecting the best collection of features for a dataset, simple approaches can provide a great baseline for your analysis and are sometimes all you need to select your dataset's best features.

Photo by UX Indonesia on Unsplash

Feature Selection: Why does it matter?

Feature selection is important because a dataset with too many features can lead to high computational costs and to models that rely on redundant or unimportant attributes for their predictions. By getting rid of features that are irrelevant to our prediction target, we can build a better predictive model on a stronger foundation. While there are advanced ways of using machine learning algorithms for feature selection, today I want to explore two simple approaches that can help steer the direction of any analysis.

Dataset

The dataset we will use today is the Student Performance Dataset from Kaggle.com. There are 33 features in this dataset, which are:

school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences, G1, G2, G3

If you haven’t counted, that’s 33 different features for this dataset! Let’s take some simple approaches to pick out the best ones for us!

Preprocessing

Before diving in, we will want to do some preprocessing on the dataset. First, let's take a look at it.

import pandas as pd 
df = pd.read_csv('student_data.csv')
df.head()

Visually, it looks like there are some columns that are categorical in nature. Before converting these columns, we also want to check if there are any missing values.

#Checking for null values -- False means no null values
df.isnull().values.any()

Running this code returned False, so no null values had to be filled in.
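Had the check come back True instead, a simple fallback would be to fill the numeric columns with their medians and drop any remaining incomplete rows. Here is a minimal sketch of that idea; it isn't needed for this particular dataset:

#Only needed if the null check above had returned True
if df.isnull().values.any():
    numeric_cols = df.select_dtypes(include=['number']).columns
    #Fill numeric gaps with each column's median -- a simple baseline strategy
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    #Drop any rows that still contain missing (non-numeric) values
    df = df.dropna()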

df.dtypes

Running the code above shows the data type of each column. It confirmed that there were in fact categorical columns that need to be converted to numerical values, so I created a simple mapping function that handles this quickly.

#Function that converts a categorical column to numerical values through a dictionary. 
def mapper(df_column):
    map_dict = {}

    for i, x in enumerate(df_column.unique()):
        map_dict[x] = i 

    print(map_dict) #Print this out to see the mapping 
    return df_column.replace(map_dict)

Using this function, we can change all of the object columns to numerical.

def categorical_converter(df):
    for i in df.select_dtypes(include=['object']):
        df[i] = mapper(df[i])

    return df

categorical_converter(df)
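To confirm the conversion worked, a quick sanity check (reusing the same df from above) is to look at the dtypes again; after the mapping, there should be no object columns left:

#Sanity check: no object (string) columns should remain after conversion
print(df.dtypes.value_counts())
print(df.select_dtypes(include=['object']).columns)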

Method 1: Assess the Correlations of Features

If two or more features are highly correlated with each other, they may be explaining the dependent variable in the same way, which is a reason to remove one of them from the model. If you are unsure which feature to remove, you can always build two models, one with each feature, and compare them. To get the correlation matrix, simply call .corr() on your DataFrame.

df.corr()
Img: Part of the Correlation Matrix

Obviously, with this many features, the correlation matrix will be huge.
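Rather than scanning the whole matrix by eye, you can also list the feature pairs whose absolute correlation exceeds some cutoff. Here is a minimal sketch; the 0.8 cutoff is just an illustrative choice, not a rule:

import numpy as np

#List feature pairs with absolute correlation above an illustrative 0.8 cutoff
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)) #keep upper triangle only
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.8])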

Correlations for G1, G2, and G3

The features G1, G2, and G3 were highly correlated with one another, so we want to remove two of them from the dataset. I removed G2 and G3. Additionally, I set my X data equal to all of the columns besides G1, which became my Y data.

df.drop(['G2','G3'], axis=1, inplace=True)
X = df.drop(['G1'], axis=1)
Y = df['G1']

Method 2: Remove Features with Low Variance

Any feature that has low variance should be considered for removal from the dataset. Why? Let's think about it. If I am trying to compare which of two students performs better and both students take the exact same classes each month, there is little variance in that "classes" feature between the students to help explain why one student performs better than the other. When a feature is near-constant, we can treat it as a constant and remove it from the dataset (as always, this depends!). For this, we will use VarianceThreshold() from scikit-learn's feature_selection module.

from sklearn.feature_selection import VarianceThreshold
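Before picking a threshold, it can help to look at the raw variance of each feature, since VarianceThreshold compares unscaled variances and features on larger scales will clear a low cutoff more easily. A minimal sketch for eyeballing this:

#Per-feature variances (unscaled) -- useful for choosing a sensible threshold
print(X.var().sort_values())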

The following code creates a variance threshold object, fits it to our X data, and keeps only the features whose variance exceeds the threshold.

vthresh = VarianceThreshold(threshold=.05)
vthresh.fit(X)
selected = vthresh.get_support()
X = X.loc[:, selected]

Running this code returns an X data frame with 9 features!

X Data with Selected Features

The selected features were:

age, Medu (Mother's education), Fedu (Father's education), Mjob (Mother's job), reason, goout, Walc, health, and absences.

While these may not be the final features we end up with, this gives us a good initial idea of the factors that could be affecting a student's grade, and we got there by looking at correlations and variances alone! These two simple preliminary approaches reduced the dataset from 33 columns down to 10 (nine selected features plus Y, which we can't forget!), and now we can begin building models with the smaller dataset!
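As a quick illustration of that next step (a minimal sketch, not part of the original analysis), a simple baseline regression on the reduced feature set could look like this:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#Purely illustrative baseline model on the reduced feature set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test)) #R^2 on the held-out split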

If you enjoyed today’s reading, PLEASE give me a follow and let me know if there is another topic you would like me to explore (This really helps me out more than you can imagine)! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!

Citations

Student Performance Dataset (Kaggle). CC0: Public Domain; approved for public use, and all rights by the author were waived with no copyright restrictions.

