Every data scientist has either confronted this problem or will come across it eventually: a huge dataset with so many features that they don’t even know where to start. While there are many advanced methods for selecting the best collection of features, sometimes simple approaches provide a great baseline for your analysis and may even be the only approach you need for picking your dataset’s best features.

Feature Selection: Why does it matter?
Feature selection is important because a dataset with too many features can lead to high computational costs as well as redundant or unimportant attributes influencing model predictions. By getting rid of irrelevant features, we can create a better predictive model built on a strong foundation. While there are advanced ways of using machine learning algorithms for feature selection, today I want to explore two simple approaches that can help steer the direction of any analysis.
Dataset
The dataset we will use today is the Student Performance Dataset from Kaggle.com. There are 33 features in this dataset, which are:
school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences, G1, G2, G3
If you haven’t counted, that’s 33 different features for this dataset! Let’s take some simple approaches to pick out the best ones for us!
Preprocessing
Before diving in, we will want to do some preprocessing on the dataset. First, let’s take a look at the data.
import pandas as pd
df = pd.read_csv('student_data.csv')
df.head()

Visually, it looks like some columns are categorical in nature. Before converting these columns, we also want to check whether there are any missing values.
#Checking for null values -- False means no null values
df.isnull().values.any()
Running this code returned False, so no null values had to be filled in.
df.dtypes
Running the above code shows the type of each column. It confirmed that there were in fact categorical columns that need to be converted to numerical values, so I created a simple mapping function that does this pretty quickly for you.
#Function that converts a categorical column to numerical through a dictionary.
def mapper(df_column):
    map_dict = {}
    for i, x in enumerate(df_column.unique()):
        map_dict[x] = i
    print(map_dict) #Print this out to see the mapping
    df_column.replace(map_dict, inplace=True)
    return df_column
Using this function, we can change all of the object columns to numerical.
#Apply the mapper to every object (categorical) column in the data frame.
def categorical_converter(df):
    for i in df.select_dtypes(include=['object']):
        mapper(df[i])
    return df
categorical_converter(df)
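As a quick sanity check (not part of the original walkthrough, just a habit of mine), we can confirm that no object columns remain after the conversion:
#Should print an empty list if every column is now numerical
print(list(df.select_dtypes(include=['object']).columns))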
Method 1: Assess the Correlations of Features
If two or more features are highly correlated, they may be explaining the dependent variable in the same way, which is a reason to remove one of them from the model. If you are unsure which feature to remove, you can always consider building two models, one with each feature, and comparing them. To get the correlation matrix, simply call .corr() on your data frame.
df.corr()

Obviously, with this many features, the correlation matrix will be huge.
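Rather than scanning the whole matrix by eye, a small helper like the sketch below can print only the strongly correlated pairs (the 0.8 cutoff here is my own assumption, not a hard rule):
#Print every pair of features whose absolute correlation exceeds a cutoff
corr = df.corr().abs()
cutoff = 0.8 #Assumed cutoff -- tune to taste
for i, col_a in enumerate(corr.columns):
    for col_b in corr.columns[i + 1:]:
        if corr.loc[col_a, col_b] > cutoff:
            print(col_a, col_b, round(corr.loc[col_a, col_b], 2))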

The features G1, G2, and G3 were highly correlated with one another, so we want to remove two of them from the dataset. I removed G2 and G3. I then set my X data equal to all of the remaining columns besides G1, which became my Y data.
df.drop(['G2','G3'], axis=1, inplace=True)
X = df.drop(['G1'], axis=1)
Y = df['G1']
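A quick shape check (optional) confirms the drop worked and that 30 candidate features remain in X:
#X should now have 30 columns; Y is a single column of G1 grades
print(X.shape, Y.shape)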
Method 2: Remove Features with Low Variance
Any feature that has low variance should be considered for removal from the dataset. Why? Think about it this way: if I am trying to compare which of two students is better and both students take the exact same classes each month, there is little variance in that "classes" feature between them, so it cannot help explain why one student performs better than the other. When a feature is near-constant, we can treat it as constant and remove it from the dataset (as always, this depends!). For this, we will use VarianceThreshold() from scikit-learn’s feature selection module.
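Before committing to a threshold, it can also help to eyeball the raw per-feature variances (a quick optional check; the sorting is just for readability):
#Look at each feature's variance, smallest first, to help pick a threshold
print(X.var().sort_values())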
from sklearn.feature_selection import VarianceThreshold
The following code creates a variance threshold object, fits it to our X data, and keeps only the features that pass the threshold.
vthresh = VarianceThreshold(threshold=.05)
vthresh.fit(X)
selected = vthresh.get_support() #Boolean mask of the features to keep
X = X.loc[:, selected]
Running this code returns an X data frame with 9 features!
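To see exactly which columns made the cut, we can print them out:
#The columns that survived the variance threshold
print(list(X.columns))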

The selected features were:
age, Medu (Mother’s education), Fedu (Father’s education), Mjob (Mother’s job), reason, goout, Walc, health, and absences.
While these may not be the final features we end up with, this gives us a good initial idea of the factors that could be affecting a student’s grade, and we got there just by looking at correlations and variances! These two simple preliminary approaches reduced the dataset from 33 features to 10 (can’t forget Y!), and now we can begin building models with the smaller dataset!
If you enjoyed today’s reading, PLEASE give me a follow and let me know if there is another topic you would like me to explore (This really helps me out more than you can imagine)! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!
Citations
Student Performance Dataset – CC0: Public Domain. Approved for public use; all rights by the author were waived with no copyright restrictions.