
An Overview of Data Preprocessing: Feature Enrichment, Automatic Feature Selection

Useful feature engineering methods with Python implementations in one view


The dataset should be made suitable for training so that the machine learning algorithm can produce more successful predictions. Looking at a dataset, it is often seen that some features are more important than others, that is, they have more impact on the output. For example, replacing a feature with its logarithm, or applying another mathematical operation such as the square root or an exponent, can yield better results. The key is to choose the data preprocessing methods that suit the model and the project. This article presents different ways of looking at a dataset to make it easier for algorithms to learn from it. All of the methods are illustrated with Python applications.

Table of Contents (TOC)
1. Binning
2. Polynomial & Interaction Features
3. Non-Linear Transform
3.1. Log Transform
3.2. Square Root Transform
3.3. Exponential Transform
3.4. Box-Cox Transform
3.5. Reciprocal Transform
4. Automatic Feature Selection
4.1. Analysis of Variance (ANOVA)
4.2. Model-Based Feature Selection
4.3. Iterative Feature Selection
Photo by Tamara Gak on Unsplash

1. Binning

In the previous article, methods for digitizing categorical data so that an algorithm can process it were explained. Binning goes in the opposite direction: it converts numeric data into categorical data, which can make the model more flexible. Based on the range of the numeric data, a number of bins chosen by the user is created; every value falls into one of these ranges and is relabeled accordingly. Now let’s apply binning to the age column in the dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
IN[1]
data=pd.read_csv('toy_dataset.csv')
data[['Age']].describe()
OUT[1]
count 150000.000000
mean 44.950200
std 11.572486
min 25.000000
25% 35.000000
50% 45.000000
75% 55.000000
max 65.000000
IN[2]
def binnings(col, number_of_bins, labels):
    # compute equal-width bin borders between the column's min and max
    min_val = col.min()
    max_val = col.max()
    space = (max_val - min_val) / number_of_bins
    bin_borders = []
    for i in range(number_of_bins + 1):
        bin_borders.append(min_val + space * i)
    # assign every value to a labeled bin; include_lowest keeps the minimum in the first bin
    binned_col = pd.cut(col, bins=bin_borders, labels=labels, include_lowest=True)
    return bin_borders, binned_col
IN[3]
labels=["young_1","young_2","young_3","young_4","young_5","young_6","young_7","young_8","young_9","young_10"]
binnings(data['Age'],10,labels)
OUT[3]
([25.0, 29.0, 33.0, 37.0, 41.0, 45.0, 49.0, 53.0, 57.0, 61.0, 65.0],
 0         young_4
 1         young_8
 2         young_5
 3         young_4
 4         young_6
            ...   
 149995    young_6
 149996    young_1
 149997    young_1
 149998    young_1
 149999    young_3
 Name: Age, Length: 150000, dtype: category

By dividing the age range in the dataset into 10 equal intervals (11 bin borders), 10 bins were created. Each range is given the selected label (young_1 … young_10). Now, if we would like to add the binned values as a new column to the dataset:

IN[4]
data['binned_age'] = binnings(data['Age'], 10, labels)[1]   # keep only the binned Series (second return value)
data
Figure 1. OUT[4], Image by author

Now let’s see the effect of binning on model accuracy using a small synthetic dataset.

IN[5]
x = np.random.rand(100, 1)
y = 100 + 5 * x + np.random.randn(100, 1)
plt.scatter(x,y)
plt.xlabel('input')
plt.ylabel('output')
Figure 2. OUT[5], Image by author
IN[6] without bin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=2021)
lr = LinearRegression()
lr.fit(x_train,y_train)
print("score=",lr.score(x_test,y_test))
OUT[6]
score= 0.7120200116547827

Now let’s create bins and test the new dataset with bins.

IN[7] create bins
bins=np.linspace(x.min()-0.01,x.max()+0.01,9)
print(bins)
datas_to_bins = np.digitize(x, bins=bins,right=False)
np.unique(datas_to_bins)
OUT[7]
[-0.00374919  0.12264801  0.24904522  0.37544243  0.50183963  0.62823684 0.75463404  0.88103125  1.00742846]
array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int64)
IN[8] with bins
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse=False)
x_binned=encoder.fit_transform(datas_to_bins)
x_binned_train,x_binned_test,y_binned_train,y_binned_test=train_test_split(x_binned,y,test_size=0.2,random_state=2021)
lr.fit(x_binned_train,y_binned_train)
print("score=",lr.score(x_binned_test,y_binned_test))
OUT[8]
score= 0.7433952534198586

As can be seen, when the range is divided by 9 border points into 8 bins, the score on the test set increases from 0.71 to 0.74.
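
As a side note, scikit-learn bundles this digitize-then-one-hot step into a single transformer, KBinsDiscretizer. A minimal sketch on the same synthetic x (the parameter choices here are my own assumptions, not the code used above):

from sklearn.preprocessing import KBinsDiscretizer
# strategy='uniform' gives equal-width bins; encode='onehot-dense' returns a dense 0/1 array
kbins = KBinsDiscretizer(n_bins=8, encode='onehot-dense', strategy='uniform')
x_binned_alt = kbins.fit_transform(x)   # x is the (100, 1) array created above
print(x_binned_alt.shape)               # one column per bin
print(kbins.bin_edges_)                 # the learned bin borders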

Binning does not help tree-based algorithms, because they already learn their own splits of the data during training. On the other hand, it is quite effective for linear models.
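
A quick way to check this on the same synthetic data is to fit a decision tree on the raw x and on the one-hot binned version. A rough sketch, assuming x, y, and x_binned from the cells above are still in memory (the exact scores will vary with the random data, but the tree typically gains little or nothing from binning):

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

dt = DecisionTreeRegressor(max_depth=3, random_state=2021)

# raw feature
xr_train, xr_test, yr_train, yr_test = train_test_split(x, y, test_size=0.2, random_state=2021)
print("tree, raw x:   ", dt.fit(xr_train, yr_train).score(xr_test, yr_test))

# binned (one-hot) feature
xb_train, xb_test, yb_train, yb_test = train_test_split(x_binned, y, test_size=0.2, random_state=2021)
print("tree, binned x:", dt.fit(xb_train, yb_train).score(xb_test, yb_test))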

2. Polynomial & Interaction Features

Another improvement that can be made to the dataset is to add interaction features and polynomial features. Consider the dataset created in the previous section and the binning operation: various mathematical combinations can be built from them. For example, take the binned data, which was converted from a numeric variable to a categorical one and then back to numeric with OneHotEncoder. We grouped the 100 random values between 0 and 1 with binning; now let’s create new datasets by (1) stacking the binned data with the product of the original values and the binned data, (2) stacking the binned data with the original values, and (3) stacking the binned data with the binned data divided by the original values. Then let’s fit linear regression on each of these configurations and compare the scores.

IN[9]
x_combined=np.hstack([x_binned,x*x_binned])
print(x_binned.shape)
print(x.shape)
print(x_combined.shape)
x_combined_train,x_combined_test,y_combined_train,y_combined_test=train_test_split(x_combined,y,test_size=0.2,random_state=2021)
lr.fit(x_combined_train,y_combined_train)
print("score=",lr.score(x_combined_test,y_combined_test))
OUT[9]
(100, 3)
(100, 1)
(100, 6)
score= 0.7910475179261578
IN[10]
x_combined2=np.hstack([x_binned,x])
x_combined2_train,x_combined2_test,y_combined2_train,y_combined2_test=train_test_split(x_combined2,y,test_size=0.2,random_state=2021)
lr.fit(x_combined2_train,y_combined2_train)
print("score=",lr.score(x_combined2_test,y_combined2_test))
OUT[10]
score= 0.7203969392138159
IN[11]
x_combined3=np.hstack([x_binned,x_binned/x])
x_combined3_train,x_combined3_test,y_combined3_train,y_combined3_test=train_test_split(x_combined3,y,test_size=0.2,random_state=2021)
lr.fit(x_combined3_train,y_combined3_train)
print("score=",lr.score(x_combined3_test,y_combined3_test))
OUT[11]
score= 0.7019604516773869

Another way to enrich the dataset is with polynomial features. PolynomialFeatures expands the dataset by raising each feature to powers up to the specified degree. For example, when degree=4 is set in the PolynomialFeatures preprocessor, which is easily used with the sklearn library, the single feature x is expanded into x, x², x³, x⁴. Now let’s observe the results by adding polynomial features to the same dataset.

IN[12]
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=4, include_bias=False)
x_poly=poly.fit_transform(x)
poly.get_feature_names()
OUT[12]
['x0', 'x0^2', 'x0^3', 'x0^4']
IN[13]
x_poly_train,x_poly_test,y_poly_train,y_poly_test=train_test_split(x_poly,y,test_size=0.2,random_state=2021)
lr.fit(x_poly_train,y_poly_train)
print("score=",lr.score(x_poly_test,y_poly_test))
OUT[13]
score= 0.7459793178415801
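
The interaction side of “Polynomial & Interaction Features” can also be generated automatically: with interaction_only=True, PolynomialFeatures produces only products of different columns and skips the pure powers. A rough sketch on the binned + raw feature matrix from above (names and scores depend on the random data, so treat this as an illustration only):

from sklearn.preprocessing import PolynomialFeatures

# degree-2 cross terms only, no squared columns
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
x_inter = inter.fit_transform(np.hstack([x_binned, x]))
print(x_inter.shape)   # original columns plus one product column per pair

x_i_train, x_i_test, y_i_train, y_i_test = train_test_split(x_inter, y, test_size=0.2, random_state=2021)
lr.fit(x_i_train, y_i_train)
print("score=", lr.score(x_i_test, y_i_test))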

The reason polynomial & interaction features do not produce great results here is that the dataset was created randomly. In real projects these methods are frequently used and can be highly effective.

3. Non-Linear Transform

Numeric values that follow a Gaussian distribution are generally preferable for a model to learn from and make predictions with. Some mathematical operations can bring a dataset closer to a Gaussian distribution. This is like looking at the same dataset from another angle, just as the Fourier transform gives a frequency-domain view of the same signal. The same mathematical operation is applied to every value in the column. Before we look at these methods, let’s prepare the column and the qq_plot function we will use.

IN[14]
data=pd.read_csv('CarPrice_Assignment.csv')
data.columns
OUT[14]
Index(['car_ID', 'symboling', 'CarName', 'fueltype', 'aspiration',
       'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase', 'carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype', 'cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke','compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg', 'price'],
      dtype='object')
IN[15]   column we will use
data['price'].describe()
OUT[15]
count      205.000000
mean     13276.710571
std       7988.852332
min       5118.000000
25%       7788.000000
50%      10295.000000
75%      16503.000000
max      45400.000000
Name: price, dtype: float64
IN[16] create histogram and qq plot
import scipy.stats as stat
import pylab
def qq_plot(data,feature):
    plt.figure(figsize=(12,4))
    plt.subplot(1,2,1)
    data[feature].hist()
    plt.title('histogram')
    plt.subplot(1,2,2)
    stat.probplot(data[feature],dist='norm',plot=pylab)
    plt.show()
IN[17]
qq_plot(data,'price')
Figure 3. OUT[17], histogram(left) and probability plot(right), Image by author

3.1. Log Transform

The logarithm of all data in the column.

IN[18]
data['log'] = np.log(data['price'])
qq_plot(data,'log')
OUT[18]
Figure 4. OUT[18], log transform, histogram(left) and probability plot(right), Image by author

3.2. Square Root Transform

The square root of all data in the column.

IN[19]
data['squareroot'] = data.price**(1/2)
qq_plot(data,'squareroot')
Figure 5. OUT[19], square root transform, histogram(left) and probability plot(right), Image by author

3.3. Exponential Transform

The user-selected exponent of all data in the column.

IN[20]
data['exp'] = data.price**(1/1.5)
qq_plot(data,'exp')
Figure 6. OUT[20], exponential transform, histogram(left) and probability plot(right), Image by author

3.4. Box-Cox Transform

The Box-Cox transform is applied to the column according to the equation below, where λ is estimated from the data:

y(λ) = (y^λ − 1) / λ, if λ ≠ 0
y(λ) = ln(y), if λ = 0
IN[21]
data['boxcox'], parameters = stat.boxcox(data['price'])
print(parameters)   # the lambda estimated by scipy
qq_plot(data, 'boxcox')
Figure 7. OUT[21], box-cox transform, histogram(left) and probability plot(right), Image by author
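
As a pipeline-friendly alternative to scipy’s boxcox, sklearn.preprocessing.PowerTransformer implements the same idea (method='box-cox' requires strictly positive values, which price satisfies; method='yeo-johnson' also accepts zeros and negatives). A minimal sketch; the column name boxcox_sk is just an illustrative choice:

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='box-cox', standardize=False)
data['boxcox_sk'] = pt.fit_transform(data[['price']]).ravel()
print(pt.lambdas_)        # fitted lambda, comparable to the scipy parameter above
qq_plot(data, 'boxcox_sk')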

3.5. Reciprocal Transform

Each value in the column is replaced by 1 divided by the value (its reciprocal).

IN[22]
data['reciprocal'] = 1/data.price
qq_plot(data,'reciprocal')
Figure 8. OUT[22], reciprocal transform, histogram(left) and probability plot(right), Image by author
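
To compare the transforms with a number rather than only by eye, the skewness of each new column can be checked; values closer to 0 indicate a more symmetric, Gaussian-like shape. A short sketch over the columns created above:

from scipy.stats import skew

for col in ['price', 'log', 'squareroot', 'exp', 'boxcox', 'reciprocal']:
    print(f"{col:12s} skew = {skew(data[col]):.3f}")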

4. Automatic Feature Selection

In the previous sections we enriched the features and expanded the dataset, but this expansion creates a more complex dataset and may cause overfitting. Now let’s examine methods that reduce the features according to their importance, for high-dimensional or complex datasets.

4.1. Analysis of Variance – ANOVA

The relationship of each feature with the target is analyzed individually, and the features whose relationship with the target is weakest are eliminated, at a rate selected by the user. This feature-target relationship is measured with a univariate test, and the features with a high p-value are eliminated first. Now let’s import the breast cancer dataset, then apply linear regression and a decision tree.

IN[22]
from sklearn.datasets import load_breast_cancer
data=load_breast_cancer()
x=data.data
y=data.target
IN[23]
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=2021)
lr = LinearRegression()
lr.fit(x_train,y_train)
print("score=",lr.score(x_test,y_test))
OUT[23]
score= 0.7494572559981934
IN[24]
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state=42)
dt.fit(x_train, y_train)
print("score on test set: ", dt.score(x_test, y_test))
OUT[24]
score on test set:  0.8115079365079365

DecisionTreeRegressor is added in order to see feature importance:

IN[25]
print(dt.feature_importances_)
OUT[25]
[0.         0.01755418 0.         0.         0.01690402 0.00845201
 0.01173891 0.         0.         0.00375645 0.00725021 0.01126935
 0.00907901 0.00991574 0.00223873 0.         0.         0.
 0.         0.         0.00782594 0.0397066  0.         0.71559469
 0.         0.         0.01979559 0.11891856 0.         0.        ]
IN[26]
np.argsort(dt.feature_importances_)
OUT[26]
array([ 0, 25, 24, 22, 19, 18, 17, 16, 15, 28, 29,  2,  3,  8,  7, 14,  9, 10, 20,  5, 12, 13, 11,  6,  4,  1, 26, 21, 27, 23], dtype=int64)

Now let’s apply ANOVA:

IN[27]
from sklearn.feature_selection import SelectPercentile
select = SelectPercentile(percentile=30)
select.fit(x_train, y_train)
# transform training set
x_train_selected = select.transform(x_train)
print("X_train.shape: {}".format(x_train.shape))
print("X_train_selected.shape: {}".format(x_train_selected.shape))
OUT[27]
X_train.shape: (455, 30)
X_train_selected.shape: (455, 9)

The percentile is set to 30, so that 30% of the features (9 of 30) are selected.

Linear Regression with selected features:

IN[28]
x_test_selected = select.transform(x_test)
lr.fit(x_train_selected, y_train)
print("Score with only selected features: {:.3f}".format(
lr.score(x_test_selected, y_test)))
OUT[28]
Score with only selected features: 0.712

0.712 is obtained with only 9 features while 0.749 is obtained with 30 features. Now let’s see which features are chosen by SelectPercentile:

IN[29]
mask = select.get_support()
print(mask)
OUT[29]
[ True False  True  True False False  True  True False False False False False False False False False False False False  True False  True True False False False  True False False]
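
Because the elimination is driven by a univariate test, the per-feature statistics can be inspected directly: after fitting, SelectPercentile exposes them as scores_ and pvalues_. A short sketch (the sorting is only for readability):

# ANOVA F-scores and p-values computed during select.fit
scores = select.scores_
pvalues = select.pvalues_
order = np.argsort(pvalues)   # smallest p-value = strongest relation to the target
for i in order[:9]:
    print(f"feature {i:2d}  F = {scores[i]:10.2f}  p = {pvalues[i]:.2e}")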

4.2. Model-Based Feature Selection

It evaluates all features at once with a supervised model, so it can take their interactions into account. It keeps the features whose importance is above the threshold set by the user. For example, if threshold="median" is selected, roughly 50% of the features are kept. The default threshold in sklearn is the mean importance.

IN[30]
from sklearn.feature_selection import SelectFromModel
selection = SelectFromModel(LinearRegression(), threshold="median")
selection.fit(x_train, y_train)
x_train_select = selection.transform(x_train)
print("X_train.shape:",x_train.shape)
print("X_train_l1.shape:",x_train_select.shape)
OUT[30]
X_train.shape: (455, 30)
X_train_l1.shape: (455, 15)
IN[31]
mask = selection.get_support()
print(mask)
OUT[31]
[False False False False  True  True  True  True False  True False False False False  True  True  True  True  True  True False False False False True False False  True  True  True]

Linear Regression with selected features:

IN[32]
x_test_select = selection.transform(x_test)
print(lr.fit(x_train_select, y_train).score(x_test_select, y_test))
OUT[32]
0.6919232554797755
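
SelectFromModel works with any estimator that exposes coef_ or feature_importances_ after fitting, so a tree ensemble can be swapped in for the linear model; with the default threshold (the mean importance), it keeps the features the forest considers most informative. A hedged sketch reusing the same train/test split:

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

forest_select = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42))
forest_select.fit(x_train, y_train)
x_train_forest = forest_select.transform(x_train)
x_test_forest = forest_select.transform(x_test)
print("kept features:", x_train_forest.shape[1], "of", x_train.shape[1])
print("score:", lr.fit(x_train_forest, y_train).score(x_test_forest, y_test))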

4.3. Iterative Feature Selection

It works in two ways toward a given stopping condition: the first starts with zero features and keeps adding the most important ones until the condition is reached; the second starts with all features and eliminates them one by one until the condition is reached. As the name suggests, Recursive Feature Elimination (RFE) takes the second route: it starts with all features and eliminates features until the specified condition is met.

IN[33]
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
select = RFE(LogisticRegression(), n_features_to_select=20)
select.fit(x_train, y_train)
# visualize the selected features:
mask = select.get_support()
print(mask)
OUT[33]
[ True  True False False  True  True  True  True  True False  True  True True  True False False False False False False  True  True  True False True  True  True  True  True  True]

The condition is set to 20 features: RFE starts with all 30 features and eliminates them one by one until 20 features are left.

IN[34]
x_train_rfe= select.transform(x_train)
x_test_rfe= select.transform(x_test)
lr.fit(x_train_rfe, y_train)
print("score:",lr.score(x_test_rfe, y_test))
OUT[34]
score: 0.7140632795679898
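
The first variant described above, which starts with zero features and adds them one by one, is available in scikit-learn 0.24+ as SequentialFeatureSelector with direction='forward'. A hedged sketch; it is slower than RFE because every candidate feature is scored with cross-validation:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(LogisticRegression(max_iter=5000),
                                n_features_to_select=20,
                                direction='forward')
sfs.fit(x_train, y_train)
print(sfs.get_support())
print("score:", lr.fit(sfs.transform(x_train), y_train).score(sfs.transform(x_test), y_test))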

Back to the guideline: Machine Learning Guideline.

