Need for Feature Engineering in Machine Learning

Ashish Bansal
Towards Data Science
7 min read · Apr 28, 2019


Photo by Franki Chamaki on Unsplash

Feature Selection: is it really important?

Feature selection/extraction is one of the most important concepts in machine learning. It is the process of selecting the subset of features/attributes (such as columns in tabular data) that are most relevant to the modelling and business objective of the problem, while ignoring the irrelevant features in the data set.

Yes, feature selection is really important. Irrelevant or partially relevant features can negatively impact model performance.

It also becomes important when the number of features is very large: we do not need to use every feature at our disposal.

Benefits of Feature Engineering on your Dataset

1. Reduces overfitting

2. Improves accuracy

3. Reduces training time

Let’s get into practice and see how we can apply various feature engineering techniques to a dataset when the number of features is large and we don’t know how to select the relevant information from it.

Method 1: Find the features whose standard deviation is zero. These are the constant features; since they don't vary, they have no effect on model performance and can be dropped.

import pandas as pd
import numpy as np
data = pd.read_csv('./train.csv')
print("Original data shape- ",data.shape)
# Remove Constant Features
constant_features = [feat for feat in data.columns if data[feat].std() == 0]
data.drop(labels=constant_features, axis=1, inplace=True)
print("Reduced feature dataset shape-",data.shape)

Method 2: Find the features which have low variance. This can be done by setting a threshold value with VarianceThreshold from the sklearn library.

from sklearn.feature_selection import VarianceThreshold
# Keep only features whose variance exceeds the chosen threshold
# (assumes all columns are numeric)
sel = VarianceThreshold(threshold=0.18)
sel.fit(data)
mask = sel.get_support()
reduced_df = data.loc[:, mask]
print("Original data shape- ", data.shape)
print("Reduced feature dataset shape-", reduced_df.shape)
print("Dimensionality reduced from {} to {}.".format(data.shape[1], reduced_df.shape[1]))

Method 3: Remove features which have a high correlation with each other, since they carry largely redundant information. Correlation can be positive (an increase in one feature's value goes with an increase in the other) or negative (an increase in one feature's value goes with a decrease in the other).

Pearson’s correlation coefficient between two variables x and y is given as:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

Features can then be removed using a threshold value, i.e. drop one feature from each pair whose correlation coefficient exceeds the threshold (e.g. > 0.8; the example below uses > 0.5).

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# df_iter is the working DataFrame of features
corr = df_iter.corr()
# Mask the upper triangle so each pair appears only once
mask = np.triu(np.ones_like(corr, dtype=bool))
# Add the mask to the heatmap
sns.heatmap(corr, mask=mask, center=0, linewidths=1, annot=True, fmt=".2f")
plt.show()
Heat map of the pairwise correlation coefficients
corr_matrix = df_iter.corr().abs()
# Create a True/False mask and apply it
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
tri_df = corr_matrix.mask(mask)
# List column names of highly correlated features (r > 0.5)
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.5)]
# Drop the features in the to_drop list
reduced_df = df_iter.drop(to_drop, axis=1)
print("The reduced_df dataframe has {} columns".format(reduced_df.shape[1]))

Method 4: Find the coefficient of each feature using logistic regression, and remove the features which have a low coefficient (lr_coef).

from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Standardize the features so the coefficient magnitudes are comparable
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
# Perform a 75-25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.25, random_state=0)
# Create the logistic regression model and fit it to the data
lr = LogisticRegression()
lr.fit(X_train, y_train)
# Calculate the accuracy on the test set
acc = accuracy_score(y_test, lr.predict(X_test))
print("{0:.1%} accuracy on test set.".format(acc))
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))
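The printed dictionary maps each column to the absolute value of its standardized coefficient. To actually carry out the pruning step, a minimal sketch is to keep only the columns whose absolute coefficient exceeds a chosen cutoff (the 0.1 value below is purely illustrative, not a recommendation):

# Keep only the features whose absolute standardized coefficient exceeds the cutoff
coef_cutoff = 0.1  # hypothetical threshold; tune for your data
coefs = pd.Series(abs(lr.coef_[0]), index=X.columns)
to_keep = coefs[coefs > coef_cutoff].index
reduced_X = X[to_keep]
print("Kept {} of {} features".format(len(to_keep), X.shape[1]))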

Method 5: Calculate the feature importance using XGBoost.

Feature importance gives you a score for each feature of your data: the higher the score, the more important or relevant the feature is to your output variable.

import xgboost as xgb
import matplotlib.pyplot as plt
housing_dmatrix = xgb.DMatrix(X, y)
# Create the parameter dictionary: params
# ("reg:squarederror" replaces the deprecated "reg:linear" objective)
params = {"objective": "reg:squarederror", "max_depth": 4}
# Train the model: xg_reg
xg_reg = xgb.train(dtrain=housing_dmatrix, params=params, num_boost_round=10)
# Plot the feature importances
xgb.plot_importance(xg_reg)
plt.show()
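Beyond visualizing the scores, a rough sketch for acting on them (assuming X is a DataFrame, so the booster keeps its column names) is to rank the features with get_score() and keep the top k; k = 20 is an arbitrary choice:

# Rank features by the booster's importance scores and keep the top k
k = 20
scores = xg_reg.get_score(importance_type="weight")
top_features = sorted(scores, key=scores.get, reverse=True)[:k]
reduced_X = X[top_features]
print("Kept top {} features:".format(k), top_features)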

Method 6: Feature importance using the Extra Trees classifier.

Tree-based estimators (see the sklearn.tree module and the forests of trees in the sklearn.ensemble module) can be used to compute feature importances, which in turn can be used to discard irrelevant features.

from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
X = df.iloc[:, 0:370]   # independent columns
y = df.iloc[:, -1]      # target column
model = ExtraTreesClassifier()
model.fit(X, y)
# Use the built-in feature_importances_ attribute of tree-based classifiers
print(model.feature_importances_)
# Plot the feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()
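To go one step further and actually discard the low-importance columns, a minimal sketch is to keep only the k most important features; the value of k (20 here, matching the plot) is an illustrative choice rather than a rule:

# Keep only the k most important features according to the fitted model
top_k = 20
selected_cols = feat_importances.nlargest(top_k).index
X_reduced = X[selected_cols]
print("Reduced from {} to {} features".format(X.shape[1], X_reduced.shape[1]))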

Method 7: Recursive Feature Elimination (RFE)

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is repeated recursively on the pruned set until the desired number of features to select is reached.

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
# Recursively eliminate 2 features per step until 3 remain
rfe = RFE(estimator=RandomForestClassifier(random_state=0), n_features_to_select=3, step=2, verbose=1)
rfe.fit(X_train, y_train)
mask = rfe.support_
X_new = X.loc[:, mask]
print(X_new.columns)

Method 8: Univariate Feature Selection (ANOVA)

This works by selecting the best features based on univariate statistical tests (ANOVA). Methods based on the F-test estimate the degree of linear dependency between two random variables. They assume a linear relationship between the feature and the target, and that the variables follow a Gaussian distribution.

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_classif, f_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile
df= pd.read_csv('./train.csv')
X = df.drop(['ID','TARGET'], axis=1)
y = df['TARGET']
df.head()
# Calculate Univariate Statistical measure between each variable and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
univariate = f_classif(X_train.fillna(0), y_train)
# Capture P values in a series
univariate = pd.Series(univariate[1])
univariate.index = X_train.columns
univariate.sort_values(ascending=False, inplace=True)
# Plot the P values
univariate.sort_values(ascending=False).plot.bar(figsize=(20,8))
# Select K best Features
k_best_features = SelectKBest(f_classif, k=10).fit(X_train.fillna(0), y_train)
X_train.columns[k_best_features.get_support()]
# Apply the transformed features to dataset 
X_train = k_best_features.transform(X_train.fillna(0))
X_train.shape
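SelectPercentile is imported above but not used; as an alternative to fixing k, it keeps a percentage of features ranked by the same ANOVA F-test. A quick sketch on the same (untransformed) split, with an illustrative 10% cutoff:

# Keep the top 10% of features by ANOVA F-score (the percentile is illustrative)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=101)
sel = SelectPercentile(f_classif, percentile=10).fit(X_tr.fillna(0), y_tr)
print("SelectPercentile kept:", list(X_tr.columns[sel.get_support()]))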

Dimension Reduction Techniques

PCA (Principal Component Analysis):

Here the original data has 9 columns, and the code below projects this 9-dimensional data into 2 dimensions. Note that after dimensionality reduction there usually isn't a particular meaning assigned to each principal component; the new components are just the two main directions of variation.

from sklearn.decomposition import PCA
dt = pd.read_csv('./dataset.csv')
X = dt.iloc[:, 0:-1]
y = dt.iloc[:, -1]
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])
print("Dimension of dataframe before PCA", dt.shape)
print("Dimension of dataframe after PCA", principalDf.shape)
print(principalDf.head())
finalDf = pd.concat([principalDf, y], axis=1)
print("finalDf")
print(finalDf.head())
# Visualize the 2D projection
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)
ax.set_title('2 component PCA', fontsize=20)
targets = [0, 1]
colors = ['r', 'g']
for target, color in zip(targets, colors):
    # 'Class' is the name of the target column in this dataset
    indicesToKeep = finalDf['Class'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c=color, s=50)
ax.legend(targets)
ax.grid()
plt.show()

Explained Variance

The explained variance tells you how much information (variance) can be attributed to each of the principal components. This matters because when you project a 371-dimensional space down to a 2-dimensional space, you lose some of the variance (information). Using the explained_variance_ratio_ attribute, you can see that the first principal component contains 88.85% of the variance and the second principal component contains 0.06% of the variance. Together, the two components retain 88.91% of the information.
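These figures come straight from the fitted PCA object; a quick check looks like this (the exact percentages will of course depend on your data):

print(pca.explained_variance_ratio_)  # e.g. [0.8885, 0.0006] as quoted above
print("Total variance retained: {:.2%}".format(pca.explained_variance_ratio_.sum()))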

Thank you!

Follow me on my Youtube Channel

https://www.youtube.com/channel/UCSp0BoeXI_EK2W0GzG7TxEw

Connect with me here:

Linkedin: https://www.linkedin.com/in/ashishban...

Github: https://github.com/Ashishb21

Medium: https://medium.com/@ashishb21

Website: http://techplanetai.com/

Email : ashishb21@gmail.com , techplanetai@gmail.com

