Data Handling using Pandas; Machine Learning in Real Life

Saptashwa Bhattacharyya
Towards Data Science
8 min read · Oct 10, 2018


Today we will see some essential techniques for handling data that is a bit more complex than the sklearn data-sets I have used in earlier examples, using various features of pandas. This post will help you to arrange a complex data-set arising from a real-life problem, and eventually we will work our way through an example of logistic regression on the data. For more on data cleaning you can check this post.

You can download the data file from my github repository under the name ‘bank.csv’ or from the original source, where a detailed description of the data-set is available.

Before describing the data file, let’s import it and see the basic shape

import pandas as pd

bankdf = pd.read_csv('bank.csv', sep=';')  # check the csv file first: the separator here is ';'
print(bankdf.head(3))
print(list(bankdf.columns))  # show the features and label
print(bankdf.shape)  # instances vs (features + label): (4521, 17)

Output is as below

bank data-set brief description

From the output we see that the data-set has 16 features, and the label is designated 'y'. A detailed description of the features is given in the main repository. The overview of the data-set, as found in the main repository, is

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to assess if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

We can produce a seaborn count plot to see how the output is dominated by one of the classes.

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)
countplt=sns.countplot(x='y', data=bankdf, palette ='hls')
plt.show()
Label (‘y’) of the bank data is dominated with ‘no’ variable

We can count the numbers with the code snippet below

count_no_sub = len(bankdf[bankdf['y'] == 'no'])
print(count_no_sub)
>>> 4000
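To put that count in proportion, we can compute the share of the majority class (a quick sketch; the counts are taken from the output above and the data-set shape):

```python
count_no_sub = 4000   # number of 'no' labels, from the count above
total = 4521          # total number of rows in bank.csv
pct_no = round(100.0 * count_no_sub / total, 2)
print("percentage of 'no':", pct_no)  # → 88.48
```

So roughly 88.5% of the customers did not subscribe, which confirms the strong class imbalance seen in the count plot.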

Since the labels of the data-set are given as ‘yes’ and ‘no’, it’s necessary to replace them with numbers, say 1 and 0 respectively, so that they can be used in modelling. In the first step we will convert the output labels of the data-set from the binary strings yes/no to the integers 1/0.

bankdf['y'] = (bankdf['y'] == 'yes').astype(int)  # changing 'yes' to 1 and 'no' to 0
print(bankdf['y'].value_counts())
>>> 0    4000
1     521
Name: y, dtype: int64

Since the output labels are now integers, we can use the groupby feature of pandas to investigate the data-set a bit more. Depending upon the output label (yes/no), we can see how the numbers in the features vary.

import numpy as np

out_label = bankdf.groupby('y')
print(out_label.agg(np.mean))
# the above two lines can be written as a single line of code
# print(bankdf.groupby('y').mean())
>>>
      age    balance     day  duration  campaign   pdays  previous
y
0   40.99  1403.2117  15.948   226.347     2.862  36.006     0.471
1   42.49  1571.9558  15.658   552.742     2.266  68.639     1.090

First, we see only 7 features out of 16 here, as the remaining features are objects, not integers or floats. You can check this by typing bankdf.info(). We see that the feature ‘duration’, which tells us the duration of the last call in seconds, is more than twice as long for the customers who bought the product as for those who didn’t. ‘campaign’, which denotes the number of calls made during the current campaign, is lower for customers who purchased the product. So groupby can give us some important information about the relationship between features and labels. Interested readers can run a similar groupby operation on the ‘education’ feature to verify that customers with tertiary education have the highest ‘balance’ (average yearly balance, in Euros)!
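A sketch of that mini exercise could look as follows. Here a tiny made-up data-frame stands in for the real bank.csv, so the numbers are illustrative only; on the real data you would group bankdf itself:

```python
import pandas as pd

# hypothetical mini data-frame standing in for bankdf; the real bank.csv has 4521 rows
bankdf = pd.DataFrame({
    'education': ['primary', 'secondary', 'tertiary', 'tertiary', 'unknown'],
    'balance':   [500, 900, 2100, 1800, 700],
})

# average yearly balance (in Euros) for each education level
avg_balance = bankdf.groupby('education')['balance'].mean()
print(avg_balance)
```

On the real data-set the same one-liner, bankdf.groupby('education')['balance'].mean(), answers the question directly.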

Some of the features of the data-set have many categories, which can be checked by using the unique method of a series object. Examples are as below

print(bankdf["education"].unique())
print(bankdf["marital"].unique())
>>> ['primary' 'secondary' 'tertiary' 'unknown']
['married' 'single' 'divorced']

These variables are known as categorical variables, and in pandas terms they have dtype ‘object’. To use the information in the categorical variables, we need to convert them into ‘dummy’ variables so that they can be used for modelling. We do that with the pandas.get_dummies function. First we create a list of the categorical variables

cat_list = ['job','marital','education','default','housing','loan','contact','month','poutcome']

Then we convert these variables into dummy variables as below

for ele in cat_list:
    add = pd.get_dummies(bankdf[ele], prefix=ele)
    bankdf1 = bankdf.join(add)  # join the dummy columns with the old dataframe
    bankdf = bankdf1
# print(bankdf.head(3))
# print(bankdf.info())

We have created dummy variables for each categorical variable, and printing the head of the new data-frame gives the result below

Dummy variables from categorical variables

You can see how the categorical variables are converted to dummy variables, ready to be used in the modelling of this data-set. But we have a slight problem here: the original categorical columns still exist, and they need to be removed to make the data-frame ready for machine learning. We do that by first converting the column headers of the new data-frame to a list using the tolist() method. Then we create a new list of column headers containing no categorical variables. We do this with the following code

bank_vars = bankdf.columns.values.tolist()  # column headers converted into a list
to_keep = [i for i in bank_vars if i not in cat_list]  # keep only headers not in 'cat_list'
print(to_keep)  # check the list of headers to make sure no categorical variable remains

We are ready to create a new data-frame with no categorical variables and we do this by -

bank_final = bankdf[to_keep]

Carefully note that to create the new data-frame, we are passing a list (‘to_keep’) to the indexing operator of ‘bankdf’. To select multiple columns as a data-frame, we should pass a list of column names to the indexing operator; passing multiple names without the list brackets returns a KeyError. However, you can select a single column as a ‘series’, as you can see below

bank_final = bankdf[to_keep]  # to_keep is a 'list'
print(type(bank_final))
>>> <class 'pandas.core.frame.DataFrame'>

bank_final = bankdf['age']
print(type(bank_final))
>>> <class 'pandas.core.series.Series'>

bank_final = bankdf['age', 'y']  # a tuple, not a list, so this fails
print(type(bank_final))
>>> KeyError: ('age', 'y')

We can verify the column headers of the new data-frame bank_final.

print(bank_final.columns.values)
>>> ['age' 'balance' 'day' 'duration' 'campaign' 'pdays' 'previous' 'y' 'job_admin.' 'job_blue-collar' 'job_entrepreneur' 'job_housemaid' 'job_management' 'job_retired' 'job_self-employed' 'job_services' 'job_student' 'job_technician' 'job_unemployed' 'job_unknown' 'marital_divorced' 'marital_married' 'marital_single' 'education_primary' 'education_secondary' 'education_tertiary' 'education_unknown' 'default_no' 'default_yes' 'housing_no' 'housing_yes' 'loan_no' 'loan_yes' 'contact_cellular' 'contact_telephone' 'contact_unknown' 'month_apr' 'month_aug' 'month_dec' 'month_feb' 'month_jan' 'month_jul' 'month_jun' 'month_mar' 'month_may' 'month_nov' 'month_oct' 'month_sep' 'poutcome_failure' 'poutcome_other' 'poutcome_success' 'poutcome_unknown']

We are now in a position to separate the feature variables from the label, so that we can test some machine learning algorithms on the data-set. Selecting the features and the label from this new data-frame is done with the code below

bank_final_vars = bank_final.columns.values.tolist()  # just like before, convert the headers into a list
Y = ['y']
X = [i for i in bank_final_vars if i not in Y]

Since there are too many features, we can choose some of the most important ones with Recursive Feature Elimination (RFE) from sklearn. In a later post I may discuss why feature selection is not possible with logistic regression itself, but for now let’s use RFE to select a few of the important features. First the classifier is passed to RFE along with the number of features to be selected, and then the fit method is called. This is shown in the code below

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=15)  # we have selected 15 features here
rfe = rfe.fit(bank_final[X], bank_final[Y])

We can use the support_ attribute to find which features are selected.

print(rfe.support_)
>>> [False False False False False False False False False False False False True False False False False False False False  True False False False False False  True False False False False  True False False  True False False  True False  True  True  True  True False False  True  True  True False  True  True]

rfe.support_ produces an array in which the selected features are labelled True, and you can count 15 of them, since we asked for the best 15 features. Another attribute of RFE is ranking_, where the value 1 in the array highlights the selected features.

print(rfe.ranking_)
>>> [33 37 32 35 23 36 31 18 11 29 27 30  1 28 17  7 12 10  5  9  1 21 16 25 22  4  1 26 24 13 20  1 14 15  1 34  6  1 19  1  1  1  1  3  2  1  1  1 8  1  1]

We can explicitly print the names of the features selected by RFE with the code below

rfe_rankinglist = rfe.ranking_.tolist()
selected_columns = []
for im in range(len(X)):
    if rfe_rankinglist[im] == 1:
        selected_columns.append(X[im])
print(selected_columns)
>>> ['job_retired', 'marital_married', 'default_no', 'loan_yes', 'contact_unknown', 'month_dec', 'month_jan', 'month_jul', 'month_jun', 'month_mar', 'month_oct', 'month_sep', 'poutcome_failure', 'poutcome_success', 'poutcome_unknown']
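As an aside, the same list can be built more compactly from the boolean support_ mask instead of the ranking_ loop. A sketch, with made-up stand-ins for rfe.support_ and the feature list X:

```python
import numpy as np

# hypothetical stand-ins for the fitted mask rfe.support_ and the feature list X
support = np.array([True, False, True, False])
X = ['job_retired', 'age', 'marital_married', 'balance']

# keep only the feature names where the mask is True
selected_columns = [col for col, keep in zip(X, support) if keep]
print(selected_columns)  # → ['job_retired', 'marital_married']
```

With the real fitted RFE object, zip(X, rfe.support_) gives the same result as the loop above.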

Finally, we can proceed with the .fit() and .score() methods to check how well the model performs.

In a separate post I will discuss the mathematics behind logistic regression in detail, and we will see that logistic regression cannot select features by itself; it just shrinks the coefficients of a linear model, similar to ridge regression. Below is the code you can use to check the effect of feature selection. Here we have used the whole data-set, but best practice is to divide the data into a training and a test set. As a mini exercise you can try this; remember that the label of the data-set is highly skewed, so using stratify can be a good idea. Good luck!

X_new = bank_final[selected_columns]
Y = bank_final['y']
X_old = bank_final[X]
clasf = LogisticRegression()
clasf_sel = LogisticRegression()
clasf.fit(X_old, Y)
clasf_sel.fit(X_new, Y)
print("score using all features", clasf.score(X_old, Y))
print("score using selected features", clasf_sel.score(X_new, Y))
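A minimal sketch of that mini exercise, using synthetic stand-ins for X_old and Y (with the real data you would pass bank_final[X] and bank_final['y'] instead): stratify keeps the skewed label proportions the same in the training and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# synthetic stand-ins for X_old / Y; the labels are imbalanced, like the bank data
rng = np.random.RandomState(0)
X_old = rng.rand(200, 5)
Y = (rng.rand(200) < 0.12).astype(int)

# stratify=Y preserves the yes/no proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_old, Y, test_size=0.3, random_state=42, stratify=Y)

clasf = LogisticRegression()
clasf.fit(X_train, y_train)
print("test score", clasf.score(X_test, y_test))
```

Scoring on the held-out X_test gives a more honest estimate of performance than scoring on the data the model was fitted on.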

So, to conclude this post, let’s summarize the most important points

  • We have learnt to use pandas to deal with some of the problems that a realistic data-set can have.
  • We have learnt to convert string labels (‘yes’, ‘no’) to binary variables (1, 0).
  • How the groupby method of a pandas data-frame can help us understand some of the key connections between features and labels.
  • Changing categorical variables to dummy variables and using them in modelling of the data-set.
  • How to select part of a data-frame by passing a list to the indexing operator.
  • Using RFE to select some of the most important features of a complex data-set.

For more on data cleaning and processing, you can check my post on data handling using pandas. For more on using Pandas Groupby and Crosstab, you can check my Global Terrorism Data analysis post. Hopefully this post will help you to be a bit more confident in dealing with realistic data-sets. Stay strong and happy. Cheers!!
