Feature Engineering in Python

“The goal is to turn data into information, and information into insight.” — Carly Fiorina

Diva Dugar
Towards Data Science


Photo by h heyerlein on Unsplash

We are living in a data-driven economy, a world where having tons of data, understanding it, and knowing what to do with it is power. Understanding your data is not the most difficult part of data science, but it is time-consuming. Interpreting data is most effective when we know its source.

Data

“In God we trust. All others must bring data.” — W. Edwards Deming
First, we need to understand our data. There are 3 types of data:

  1. Numerical
    This represents some sort of quantitative measurement. Examples: the height of people, stock prices, page load time, etc. It can be further broken into 2 parts:
    Discrete data: This is integer-based, often a count of some event. Examples: “How many songs does a user like?” or “How many times did a coin land on heads?”
    Continuous data: This has an infinite number of possible values. Example: “How much time did it take for a user to make a purchase?”
  2. Categorical
    This represents qualitative data with no inherent mathematical meaning. Examples: Yes or No, sex, race, marital status, etc. These can be assigned numbers, like Yes (0) and No (1), but the numbers carry no mathematical meaning.
  3. Ordinal
    This has features of both numerical and categorical data: the values fall into categories, but the numbers assigned to those categories do have mathematical meaning. Example: movie ratings on a scale of 1–5. (A small sketch of these types in pandas follows this list.)

Because understanding data is time-consuming, we tend to overlook this step and jump straight into the water.
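
To make the distinction concrete, here is a minimal sketch (with made-up toy values, not from any particular dataset) of how pandas can represent categorical versus ordinal data:

import pandas as pd

# Hypothetical toy values, purely for illustration
marital = pd.Series(['single', 'married', 'single'], dtype='category')  # categorical: no order
ratings = pd.Categorical([3, 5, 1], categories=[1, 2, 3, 4, 5], ordered=True)  # ordinal: order matters

print(marital.cat.categories)        # the distinct, unordered categories
print(ratings.min(), ratings.max())  # ordered categories support min/max: 1 5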

Feature Engineering

Machine learning fits mathematical models to data in order to derive insights. These models take features as input. A feature is generally a numeric representation of an aspect of a real-world phenomenon. Just as a maze has dead ends, the path through data is filled with noise and missing pieces. Our job as data scientists is to find a clear path to the end goal: insight.

Photo by Victor Garcia on Unsplash

Mathematical formulas work on numerical quantities, and raw data isn’t exactly numerical. Feature engineering is the process of extracting features from raw data and transforming them into formats that are suitable for machine learning algorithms.

It is divided into three broad categories:

Feature Selection: All features aren’t equal. Feature selection is about choosing a small subset of features from a large pool: we select the attributes that best explain the relationship between the independent variables and the target variable, since certain features matter more than others to the accuracy of the model. It differs from dimensionality reduction, which combines existing attributes into new ones, whereas feature selection simply includes or excludes existing features.
Common feature selection methods include the chi-squared test, correlation coefficient scores, and regularized models such as LASSO and Ridge regression.
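
As an illustration, here is a minimal sketch of filter-style feature selection using scikit-learn’s SelectKBest with the chi-squared test; the feature matrix X and target y below are randomly generated stand-ins, not this article’s data:

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative features (chi2 requires non-negative values) and a binary target
X = np.abs(np.random.randn(100, 8))
y = np.random.randint(0, 2, size=100)

# Keep the 3 features with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=3)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask showing which features were kept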

Feature Transformation: This means transforming our original features into functions of those features. Scaling, discretization, binning and filling in missing data values are the most common forms of transformation. For example, to reduce right skew in a feature, we apply a log transform.
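
For example, a minimal sketch of a log transform on a right-skewed feature (the values are invented for illustration):

import numpy as np
import pandas as pd

# Hypothetical right-skewed values, e.g. ticket prices with a long right tail
prices = pd.Series([1200, 1500, 1800, 2500, 4000, 22000])

# log1p computes log(1 + x), which compresses the long tail and is safe at zero
log_prices = np.log1p(prices)
print(log_prices.round(2))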

Feature Extraction: When the data to be processed by an algorithm is very large, much of it is generally redundant. Analysis with a large number of variables uses a lot of computation power and memory, so we should reduce the dimensionality of such data. Feature extraction is a term for constructing new variables as combinations of the original ones. For tabular data, we can use PCA to reduce features; for images, we can use line or edge detection.
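
As a sketch, dimensionality reduction with scikit-learn’s PCA might look like this (the matrix is random stand-in data, not this article’s dataset):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical tabular data: 500 rows, 20 numeric features
X = np.random.randn(500, 20)

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())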

We start with this dataset from MachineHack.
We will first import all the packages necessary for feature engineering.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

We will load the data using pandas and set the display option to the maximum, so that all columns are shown with their details:

pd.set_option('display.max_columns', None)

data_train = pd.read_excel('Data_Train.xlsx')
data_test = pd.read_excel('Data_Test.xlsx')

Before we start pre-processing the data, we will store the target variable, or label, separately. We then combine our training and testing datasets, after removing the label from the training set. The reason for combining the train and test sets is that machine learning models aren’t great at extrapolation, i.e., they aren’t good at inferring something that has not been explicitly represented in the data they were fit on. So if a category in the test set isn’t represented the same way as in the training set, the predictions won’t be reliable.

price_train = data_train.Price

# Concatenate training and test sets
data = pd.concat([data_train.drop(['Price'], axis=1), data_test])

This is the output we get after running data.columns:

Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route','Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
'Additional_Info', 'Price'], dtype='object')

To check the first five rows of the data, type data.head().

To see the broader picture, we use the data.info() method.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13354 entries, 0 to 2670
Data columns (total 10 columns):
Airline 13354 non-null object
Date_of_Journey 13354 non-null object
Source 13354 non-null object
Destination 13354 non-null object
Route 13353 non-null object
Dep_Time 13354 non-null object
Arrival_Time 13354 non-null object
Duration 13354 non-null object
Total_Stops 13353 non-null object
Additional_Info 13354 non-null object
dtypes: int64(1), object(10)
memory usage: 918.1+ KB

To understand the distribution of our data, use data.describe(include='all').

We would like to analyse the data and remove all duplicate rows.

data = data.drop_duplicates()

We’d want to check for any null values in our data, so we run data.isnull().sum().

Airline            0
Date_of_Journey 0
Source 0
Destination 0
Route 1
Dep_Time 0
Arrival_Time 0
Duration 0
Total_Stops 1
Additional_Info 0
Price 0
dtype: int64

Therefore, we use the following code to remove the row containing the null values.

data = data.drop(data.loc[data['Route'].isnull()].index)

Airlines

Let’s check the Airline column. We notice that it contains categorical values. After running data['Airline'].unique(), we notice that some airline names appear in more than one variant.

We first want to visualize the column:

sns.countplot(x='Airline', data=data)
plt.xticks(rotation=90)

This visualization helps us see that certain airlines have been split into two categories. For example, Jet Airways has another variant called Jet Airways Business. We would like to combine these two categories.

data['Airline'] = np.where(data['Airline']=='Vistara Premium economy', 'Vistara', data['Airline'])
data['Airline'] = np.where(data['Airline']=='Jet Airways Business', 'Jet Airways', data['Airline'])
data['Airline'] = np.where(data['Airline']=='Multiple carriers Premium economy', 'Multiple carriers', data['Airline'])

This is how our values in the column will look now.

Flight’s Destination

The same goes for Destination. We find that Delhi and New Delhi have been made two different categories, so we’ll combine them into one.

data['Destination'].unique()
data['Destination'] = np.where(data['Destination']=='Delhi', 'New Delhi', data['Destination'])

Date of Journey

We check this column with data['Date_of_Journey'] and find that it is in this format:

24/03/2019
1/05/2019

This is just raw text. Our model cannot use it as-is, since it carries no numeric value in this form. To extract useful features from this column, we will convert it into weekdays and months (the dates are in day/month/year format, so we parse with an explicit format string):

data['Date_of_Journey'] = pd.to_datetime(data['Date_of_Journey'], format='%d/%m/%Y')

OUTPUT
2019-03-24
2019-05-01

And then, to get the weekday from it:

data['day_of_week'] = data['Date_of_Journey'].dt.day_name()

OUTPUT
Sunday
Wednesday

And from Date_of_Journey, now parsed as a datetime, we will also get the month.

data['Journey_Month'] = data['Date_of_Journey'].dt.month_name()

OUTPUT
March
May

Departure Time of Airlines

The departure time is in 24-hour format (e.g., 22:20); we would like to bin it to gain insights.

We will create fixed-width bins, where each bin covers a specific numeric range. Generally, these ranges are set manually and have a fixed size. Here, I have decided to group the hours into 4 bins: [0–5], [6–11], [12–17] and [18–23]. One caveat: if the data has large gaps, fixed-width binning can produce empty bins with no data. That problem is solved by adaptive binning, which positions the bin edges based on the distribution of the data (a quantile-based sketch follows this subsection’s code).

data['Departure_t'] = pd.to_datetime(data.Dep_Time, format='%H:%M')

# Cut the departure hour into four fixed-width bins
a = data.assign(dept_session=pd.cut(data.Departure_t.dt.hour,
                                    [0, 6, 12, 18, 24],
                                    labels=['Night', 'Morning', 'Afternoon', 'Evening']))
data['Departure_S'] = a['dept_session']

We fill the null values in the ‘Departure_S’ column with “Night”: with these bin edges, pd.cut leaves an hour of exactly 0 outside the first interval (0, 6], which produces the nulls.

data['Departure_S'].fillna("Night", inplace = True)
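
The adaptive, distribution-based binning mentioned above can be sketched with pd.qcut, which places the bin edges at quantiles so each bin receives roughly the same number of rows. The column name Departure_Q is our own, hypothetical addition:

# Quantile-based (adaptive) binning into up to four equal-count bins;
# duplicates='drop' guards against repeated quantile edges
data['Departure_Q'] = pd.qcut(data['Departure_t'].dt.hour, q=4, duplicates='drop')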

Duration

Our Duration column has times written in a format like 2h 50m. To help the machine learning algorithm derive useful insights, we will convert this text into numbers.

duration = list(data['Duration'])

# Normalise entries so every duration string has both an hours and a minutes part
for i in range(len(duration)):
    if len(duration[i].split()) != 2:
        if 'h' in duration[i]:
            duration[i] = duration[i].strip() + ' 0m'
        elif 'm' in duration[i]:
            duration[i] = '0h {}'.format(duration[i].strip())

dur_hours = []
dur_minutes = []

# Strip the trailing 'h'/'m' and convert each part to an integer
for i in range(len(duration)):
    dur_hours.append(int(duration[i].split()[0][:-1]))
    dur_minutes.append(int(duration[i].split()[1][:-1]))

data['Duration_hours'] = dur_hours
data['Duration_minutes'] = dur_minutes

# Convert hours to minutes, then add the two parts for a total in minutes
data.loc[:, 'Duration_hours'] *= 60
data['Duration_Total_mins'] = data['Duration_hours'] + data['Duration_minutes']

The result we will now get is continuous in nature.

After visualizing the data, it makes sense to delete rows whose duration is less than 60 minutes.

# Get the indexes of rows whose total duration is under 60 minutes
indexNames = data[data.Duration_Total_mins < 60].index

# Delete these rows from the DataFrame
data.drop(indexNames, inplace=True)

We will now drop the columns that contain noise or raw text that would not help our model; we have already transformed them into new columns that capture the useful information.

data.drop(labels = ['Arrival_Time','Dep_Time','Date_of_Journey','Duration','Departure_t','Duration_hours','Duration_minutes'], axis=1, inplace = True)

Dummy Variables

We have engineered almost all the features. We have dealt with missing values and binned numerical data, and now it’s time to transform all the categorical variables into numeric ones. We will use get_dummies() to do the transformation.

cat_vars = ['Airline', 'Source', 'Destination', 'Route', 'Total_Stops',
            'Additional_Info', 'day_of_week', 'Journey_Month', 'Departure_S']

# One-hot encode each categorical column and join the dummy columns back on
for var in cat_vars:
    cat_list = pd.get_dummies(data[var], prefix=var)
    data = data.join(cat_list)

# Keep every column except the original categorical ones
data_vars = data.columns.values.tolist()
to_keep = [i for i in data_vars if i not in cat_vars]
data_final = data[to_keep]

Next Step

We will have to divide the dataset back into training and test sets (a sketch follows). Feature engineering helps extract information from raw data, and in the process it has created a lot of features; now we need to find the most important ones among them, because having too many features leads to the curse of dimensionality. We combat feature generation with feature selection.
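
A minimal sketch of dividing the combined frame back: because rows were dropped after concatenation, slicing by the original training length could misalign, so this assumes an is_train flag (our own addition, not part of the code above) was attached to each set before pd.concat:

# Hypothetical flag added before concatenation:
#   data_train['is_train'] = 1
#   data_test['is_train'] = 0
train_final = data_final[data_final['is_train'] == 1].drop(columns='is_train')
test_final = data_final[data_final['is_train'] == 0].drop(columns='is_train')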

Thanks for reading!

You can also give support here: https://buymeacoffee.com/divadugar
