
Bank Marketing Campaign: costs prediction – EDA (I)
About the project
The dataset contains information, from 2008 to 2015, about a marketing (telemarketing) operation implemented by a Portuguese bank’s marketing team to attract customers to subscribe to term deposits, with the results recorded as ‘yes’ or ‘no’ in a binary categorical variable.
Until then, the strategy was to reach the maximum number of clients, indiscriminately, and try to sell them the financial product over the phone. Besides consuming a lot of resources, that approach was also very uncomfortable for the many clients disturbed by this type of action.
To determine the costs of the campaign, the marketing team concluded the following:
- For each customer identified as a good candidate and therefore defined as a target, but who doesn’t subscribe to the deposit, the bank has a cost of 500 EUR.
- For each customer identified as a bad candidate and excluded from the target, but who would subscribe to the product, the bank has a cost of 2,000 EUR.
Machine Learning problem and objectives
We’re facing a binary classification problem. The goal is to train the best machine learning model, one able to predict the optimal set of candidates to be targeted in order to minimize costs and maximize efficiency.
Project structure
The project divides into three categories:
- EDA: Exploratory Data Analysis
- Data Wrangling: Cleaning and Feature Selection
- Machine Learning: Predictive Modelling
In this article, I’ll be focusing only on the first section, the Exploratory Data Analysis (EDA).
Performance Metric
The metric used for evaluation is the total cost, since the objective is to minimize the cost of the campaign.
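To make the metric concrete, here is a minimal sketch (my own helper, not part of the project code) of how the total campaign cost could be computed from true labels and predictions, assuming 500 EUR per targeted client who doesn’t subscribe and 2,000 EUR per excluded client who would have subscribed:
# sketch of the cost metric (hypothetical helper, based on the costs stated above)
import numpy as np

def campaign_cost(y_true, y_pred, fp_cost=500, fn_cost=2000):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    false_positives = np.sum((y_pred == 1) & (y_true == 0))  # targeted but didn't subscribe
    false_negatives = np.sum((y_pred == 0) & (y_true == 1))  # excluded but would have subscribed
    return fp_cost * false_positives + fn_cost * false_negatives
A lower total cost means a better targeting strategy.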
You will find the entire code of this project [here](https://bit.ly/31a0EAL). The ‘bank_marketing_campaign.csv’ dataset can be downloaded here.

The first thing to do is to import the libraries and dependencies required.
# import libraries
import pandas as pd
from pandas.plotting import table
import numpy as np
import seaborn as sns
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline
Let’s load the dataset (I’ll assign it to ‘df’) and inspect the first rows.
df = pd.read_csv('bank_marketing_campaign.csv') # load the dataset
df.head() # print the data

head() is a method used to display the first ‘n’ rows of a dataframe, and tail() the last ‘n’ rows. The dependent variable, or target (the last column on the right), labeled ‘y’, is a binary categorical variable. Let’s start by converting it into a binary numeric variable, which will take the value 1 if the client subscribed and 0 otherwise. A new column ‘target’ will replace ‘y’ (to be dropped).
# converting into a binary numeric variable
df['target'] = df.apply(lambda row: 1 if row["y"] == "yes" else 0, axis=1)
df.drop(["y"],axis=1,inplace=True)
I will also rename some columns replacing the dots by underscores.
# Renaming some columns for better typing and calling variables
df.rename(columns={"emp.var.rate":"emp_var_rate", "cons.price.idx":"cons_price_idx", "cons.conf.idx":"cons_conf_idx", "nr.employed":"nr_employed"}, inplace=True)
df.head()

Basic info of the dataset
- How many features are available?
- How many clients are in the dataset?
- Are there any duplicated records?
- How many clients subscribed to the term deposit and how many didn’t?
# Printing number of observations, variables including the target, and duplicate samples
print(f"Number of clients: {df.shape[0]}")
print(f"Number of variables: {df.shape[1]} incl. target")
print(f"Number of duplicate entries: {df.duplicated().sum()}")
Number of clients: 41188
Number of variables: 16 incl. target
Number of duplicate entries: 5853
Since the dataset has no unique client identifier, I’ll conclude that these apparent duplicates are actually records of different clients with identical profiles, and keep them.
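To sanity-check that assumption, a quick sketch (not in the original notebook) is to inspect some of the duplicated rows side by side:
# inspect a few duplicated profiles to confirm they look like plausible repeats
dups = df[df.duplicated(keep=False)].sort_values(list(df.columns))
display(dups.head(10))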
# How many clients have subscribed and how many didn't?
absolut = df.target.value_counts().to_frame().rename(columns={"target":"clients"})
percent = (df.target.value_counts(normalize=True) *100).to_frame().rename(columns={"target":"%"})
df_bal = pd.concat([absolut,percent],axis=1).round(decimals=2)
print(f"[0] Number of clients that haven't subscribed the term deposit: {df.target.value_counts()[0]}")
print(f"[1] Number of clients that have subscribed the term deposit: {df.target.value_counts()[1]}")
display(df_bal)
absolut.plot(kind='pie', subplots=True, autopct='%1.2f%%',
explode= (0.05, 0.05), startangle=80,
legend=False, fontsize=12, figsize=(14,6));
The dataset is highly imbalanced:
[0] Number of clients that haven’t subscribed the term deposit: 36548
[1] Number of clients that have subscribed the term deposit: 4640

Exploratory Data Analysis (EDA)
Let’s now check the variable types, missing values, and correlations, as well as display some statistical descriptions.
# Type of variables
df.dtypes.sort_values(ascending=True)
age int64
pdays int64
previous int64
target int64
emp_var_rate float64
cons_price_idx float64
cons_conf_idx float64
euribor3m float64
nr_employed float64
job object
marital object
education object
default object
housing object
loan object
poutcome object
dtype: object
# Counting variables by type
df.dtypes.value_counts(ascending=True)
int64 4
float64 5
object 7
dtype: int64
# Detecting missing values
print(f"Are there any missing values? {df.isnull().values.any()}")
Are there any missing values? False
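Note that isnull() only detects true NaN values; the ‘unknown’ labels found in several categorical columns (treated as missing values later in this article) are not counted. A small sketch of my own to tally them:
# count the 'unknown' labels per categorical column (they are not NaN, so isnull() misses them)
unknown_counts = (df.select_dtypes('object') == 'unknown').sum().sort_values(ascending=False)
print(unknown_counts)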
# Visualization of correlations (heatmap)
mask = np.triu(df.corr(), 1)
plt.figure(figsize=(19, 9))
sns.heatmap(df.corr(), annot=True, vmax=1, vmin = -1, square=True, cmap='BrBG', mask=mask);

The variables 'emp_var_rate', 'nr_employed' and 'euribor3m' are highly redundant. 'nr_employed' is the one most correlated with the target.
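A quick way to back up that reading (my own snippet, assuming pandas >= 1.5 for the numeric_only argument) is to rank the numeric features by their absolute correlation with the target:
# rank numeric features by absolute correlation with the target
corr_with_target = df.corr(numeric_only=True)['target'].drop('target')
print(corr_with_target.reindex(corr_with_target.abs().sort_values(ascending=False).index))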
To get a clearer and more accurate sense of the data, let’s display some general statistics.
# General stats of numeric variables, adding a 'variance' row
# (DataFrame.append was removed in pandas 2.0, so the row is added with .loc instead)
describe = df.describe()
describe.loc['variance'] = df.var(numeric_only=True)
describe

Age: the youngest client is 17 years old and the oldest is 98, with a median of 38 years and an average of 40. The distribution is skewed to the right (the mean sits above the median), which possibly indicates the presence of outliers.
Pdays: number of days that passed after the client was last contacted during a previous campaign. The majority of clients carry the value 999, which means they were not contacted in a previous campaign; those 999 values are considered ‘out of range’ (a quick count follows right after this list).
Previous: number of contacts performed before this campaign for each client. The vast majority had never been contacted before.
Emp_var_rate: employment variation rate. During the period, the index varied within [-3.4, 1.4].
Cons_price_idx: the consumer price index varied within [92.2, 94.8].
Cons_conf_idx: the consumer confidence index remained negative throughout, varying within [-51, -27]. These negative values might be explained by the recession that severely affected Portugal during the global financial crisis, the same period in which the data was recorded.
Euribor3m: there was a huge variation of the Euribor rate over the period of analysis, from 5% down to 0.6%. This fluctuation, together with the negative confidence levels noted above, reinforces the hypothesis that the data comes from a crisis period.
Nr_employed: the number of employed people (a quarterly indicator) varied by around 200 during the campaign.
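As the quick count mentioned above (a snippet of mine), here is how dominant the 999 ‘out of range’ code actually is in pdays:
# how many clients carry the 999 'out of range' code in pdays?
n_999 = (df['pdays'] == 999).sum()
print(f"pdays == 999: {n_999} clients ({n_999 / len(df) * 100:.1f}% of the dataset)")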
df.describe(include=['object']) # General stats of categoric variables

Job: there are 12 job categories; the administrative role is the most common, with almost 10.5k clients.
Marital: the majority of clients are married, with almost 25k records.
Education: more than 12k people have a university degree.
Default: of all 41,188 clients, 32,588 do not have any credit in default.
Housing: almost half of the customers have a housing loan.
Loan: almost 34k clients do not have any personal loans.
Poutcome: for most clients there is no recorded outcome from a previous marketing campaign.
Statistical description of all variables
To analyze both numeric and categorical variables, I’ll start by creating separate lists of features, one per variable type.
# creating indexes
cat_features = list(df.select_dtypes('object').columns)
int_features = list(df.select_dtypes('int64').columns)
float_features = list(df.select_dtypes('float64').columns)
num_features = int_features+float_features
The next step of the EDA process consists of providing a full description of all variables in the dataset starting with the numeric ones.
# Visualization of the numeric distribution
df[num_features].hist(figsize=(10,8), bins=25, xlabelsize=8, ylabelsize=8, alpha=0.9, grid=False)
plt.tight_layout();

Age
# creating a dataframe
stats_age = {'Designation': ['Value'],
'Variable': 'age',
'Description': 'clients` age',
'Type of variable': df.age.dtype,
'Type of distribution': 'continuous',
'Total observations': df.age.shape[0],
'Missing values': df.age.isnull().sum(),
'Unique values': df.age.nunique(),
'Min': df.age.min(),
'25%': int(df.age.quantile(q=[.25]).iloc[-1]),
'Median': df.age.median(),
'75%': int(df.age.quantile(q=[.75]).iloc[-1]),
'Max': df.age.max(),
'Mean': df.age.mean(),
'Std dev': df.age.std(),
'Variance': df.age.var(),
'Skewness': scipy.stats.skew(df.age),
'Kurtosis': scipy.stats.kurtosis(df.age)
}
st_age = pd.DataFrame(stats_age, columns = ['Designation',
'Variable',
'Description',
'Type of variable',
'Type of distribution',
'Total observations',
'Missing values',
'Unique values',
'Min',
'25%',
'Median',
'75%',
'Max',
'Mean',
'Std dev',
'Variance',
'Skewness',
'Kurtosis'
])
st_age.set_index("Designation", inplace=True)
results = st_age.T # transposing the dataframe
resultsT = st_age
display(results)

It is very important to visualize the distribution and dispersion as follows.
# Visualization of 'age'
# creating a distribution graph and boxplot combined
age = df.age
mean = age.mean()
median = age.median()
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (0.5, 2)})
sns.boxplot(x=age, ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')
sns.histplot(age, kde=True, ax=ax_hist)  # distplot is deprecated in recent seaborn versions
ax_hist.axvline(mean, color='r', linestyle='--')
ax_hist.axvline(median, color='g', linestyle='-')
plt.legend({'Mean': mean, 'Median': median})
ax_box.set(xlabel='')
plt.show()

Calculation of percentiles: 1%, 5%, 95%, 99%
display(df.age.quantile(q=[.01, .05, .95, .99]))
0.01 23.0
0.05 26.0
0.95 58.0
0.99 71.0
Name: age, dtype: float64
To visualize the variable against the target, let’s start by defining two groups: the clients who subscribed and those who didn’t (let’s call them "Good" and "Bad" clients, respectively).
# Visualization of the variable vs. the target
import warnings
warnings.filterwarnings('ignore')  # np.warnings was removed from recent NumPy versions
age_0 = df.loc[df.target == 0, 'age']  # clients who didn't subscribe
age_1 = df.loc[df.target == 1, 'age']  # clients who subscribed
plt.hist(age_0, bins=40, density=True, color="r", alpha=0.6, label='Bad client')
plt.hist(age_1, bins=40, density=True, color="g", alpha=0.6, label='Good client')
plt.legend(loc='upper right')
plt.title('age', fontsize=14)
plt.xlabel('age')
plt.ylabel('relative frequency (density)');

Proportionally, there are more subscribers among clients younger than 30 and older than 60 than among those between 30 and 60 years old.
Let’s take a step further and group the clients by age into 3 buckets: young (<=30), adult (>30 to <=60), and senior (>60), using the cut method, and then look at the subscription rate in each bucket.
df['age_bins'] = pd.cut(df['age'], bins = [df['age'].min(), 30, 60, df['age'].max()],labels=['Young', 'Adult', 'Senior'])
group_age_target = df.groupby(['age_bins'])['target'].mean().multiply(100)
display(group_age_target)
group_age_target.plot.barh()
plt.xlabel('Subscribed [%]');

- 45.5% of Seniors (60+ years old) subscribed to the term deposit.
- Fewer than 1 in 10 Adults (>30 and <=60 years old) subscribed.
- Young clients were the 2nd group most likely to subscribe, with roughly 1 in 6 of them subscribing.
- Seniors alone subscribed almost as much as Young and Adult clients combined (a quick check of the absolute counts follows this list).
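As the quick check mentioned in the last point (my own snippet), the absolute number of subscribers per age bucket complements the rates above:
# absolute number of subscribers in each age bucket
subs_per_bucket = df.loc[df['target'] == 1, 'age_bins'].value_counts()
print(subs_per_bucket)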
Pdays
Following the previous structure, I’ll only be displaying the visuals here for easier reading. The full code can be seen here.

Calculation of percentiles: 1%, 5%, 95%, 99%
display(df.pdays.quantile(q=[.01, .05, .95, .99]))
0.01 3.0
0.05 999.0
0.95 999.0
0.99 999.0
Name: pdays, dtype: float64
Next, let’s visualize the distribution of the data and the variable against the target.

Considering only the clients who subscribed, let’s look at the number of days that passed after they were contacted in a previous campaign. Most of these clients responded around the 6th day, or within 7 to 8 days, as you can observe below.
dummy = df.loc[(df['pdays'] != 999) & (df['target'] == 1), 'pdays']
print('Median: {:.2}'.format(dummy.median()))
dummy.hist(grid=False)
plt.title('Histogram')
plt.xlabel('Days passed after contact, for those who subscribed');

Previous

Calculation of percentiles: 1%, 5%, 95%, 99%
display(df.previous.quantile(q=[.01, .05, .95, .99]))
0.01 0.0
0.05 0.0
0.95 1.0
0.99 2.0
Name: previous, dtype: float64

The graph below shows that clients who were previously contacted subscribed to the term deposit at a much higher rate.
Among clients who had never been reached, only about 10% subscribed, whereas for clients previously contacted two or more times the success rate of the campaign rose above 45%.
group = df.groupby(['previous'])['target'].mean().multiply(100)
group.plot.barh()
plt.xlabel('Subscribed [%]');
print('How many people got previously contacted? {}'.format(df.loc[df['previous'] != 0].shape[0]))
print('How many people got contacted 7 times? {}'.format(df.loc[df['previous'] == 7, 'previous'].count()))
# positional access with [] on a labeled Series is deprecated, so .iloc is used instead
print('How many people got previously contacted with success? {}'.format(df.poutcome.value_counts().iloc[1]))
print('How many people got previously contacted without success? {}'.format(df.poutcome.value_counts().iloc[2]))
- How many people got previously contacted? 5625
- How many people got contacted 7 times? 1
- How many people got previously contacted with success? 4525
- How many people got previously contacted without success? 1373

Index variables
There are 4 macroeconomic variables, or economic indexes, in the dataset: ’emp_var_rate’, ‘cons_price_idx’, ‘cons_conf_idx’ and ‘euribor3m’.
Let’s dig a bit further and briefly investigate how these indexes correlate with each other and whether there are any trends or patterns between them and the target. To do that, I’ll create a list with only these specific variables and display them side by side (pairplot method).
Note: we will return to the subject of correlation later in this project (in another article), in the section Data Wrangling: Cleaning and Feature Engineering.
# creating a list
idx_list = ["cons_price_idx", "cons_conf_idx", "euribor3m", "emp_var_rate", "target"]
df[idx_list].corr()

We can see that euribor3m, cons_price_idx, and emp_var_rate are highly correlated. Next, let’s visualize the correlations between the index variables using the pairplot method.
sns.pairplot(df[idx_list], hue="target")
plt.show()

What can we observe from the scatter plots? Please keep in mind that 0 = NO (blue) and 1 = YES (orange).
- euribor3m increases when there is a positive variation of emp_var_rate.
- cons_conf_idx varies linearly with cons_price_idx and emp_var_rate: the higher prices and the employment variation rate get, the lower the confidence index becomes.
- when emp_var_rate (employment variation rate) increases, cons_price_idx (consumer price index) also increases.
Can the outcome of the campaign be affected by the index variables? The answer comes from observing the distribution plots.
- the lower euribor3m is, the higher the number of subscriptions.
- when cons_price_idx (consumer price index) increases, there is a strong negative response in the clients’ subscriptions.
- when emp_var_rate (employment variation rate) is negative, there is a higher positive response to the campaign (see the quick check right after this list).
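As the quick check mentioned in the last point (a snippet of mine), we can compare the subscription rate when the employment variation rate is negative versus non-negative:
# subscription rate by the sign of emp_var_rate
sign = np.where(df['emp_var_rate'] < 0, 'emp_var_rate < 0', 'emp_var_rate >= 0')
print(df.groupby(sign)['target'].mean().multiply(100).round(2))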
Nr_employed

Calculation of percentiles: 1%, 5%, 95%, 99%
display(df.nr_employed.quantile(q=[.01, .05, .95, .99]))
0.01 4963.6
0.05 5017.5
0.95 5228.1
0.99 5228.1
Name: nr_employed, dtype: float64

From the analysis, we already know that clients who had been contacted before subscribed at higher rates. The histogram above also suggests that the first contacts were made mainly to already known clients, resulting in a much more efficient campaign during the period when the number of employed people was lower.
Let’s move on to the description of the categorical variables.
Job
‘Job’ has 12 unique values. There is a class labeled ‘unknown’, which is considered a missing value (I won’t do anything about missing values for the time being).
stats_job = {'Designation': ['Value'],
'Variable': 'job',
'Description': 'type of job',
'Type of variable': df.job.dtype,
'Total observations': df.job.shape[0],
'Unique values': df.job.nunique(),
}
st_job = pd.DataFrame(stats_job, columns = ['Designation',
'Variable',
'Description',
'Type of variable',
'Total observations',
'Unique values',
])
st_job.set_index("Designation", inplace=True)
results = st_job.T
resultsT = st_job
display(results)
print(f"List of unique values: {df.job.unique()}")

How are the clients distributed concerning their jobs?
data_count = df['job'].value_counts().plot(kind='bar', figsize=(6,4), fontsize=12)
plt.title("Categorical variable 'job'", fontsize=12)
plt.xlabel('Jobs')
plt.ylabel('Absolute frequency');

Below, we can find the absolute values (counts per job) and their proportions. The 5 most common jobs alone represent about 80% of the data (a quick cumulative check follows the code below).
num_obs = df.job.value_counts()
num_o = pd.DataFrame(num_obs)
num_o.rename(columns={"job":"Freq abs"}, inplace=True)
num_o_pc = (df.job.value_counts(normalize=True) * 100).round(decimals=2)
num_obs_pc = pd.DataFrame(num_o_pc)
num_obs_pc.rename(columns={"job":"percent %"}, inplace=True)
n_obs = pd.concat([num_o,num_obs_pc], axis=1)
display(n_obs)
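To verify the 80% claim mentioned above (a quick sketch of mine), the cumulative share of clients covered by the most common jobs can be computed directly:
# cumulative share of clients covered by the most frequent jobs
cum_share = df['job'].value_counts(normalize=True).cumsum().multiply(100).round(1)
print(cum_share.head(5))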

job_0 = df.loc[df.target == 0, 'job']  # clients who didn't subscribe
job_1 = df.loc[df.target == 1, 'job']  # clients who subscribed
warnings.filterwarnings('ignore')
plt.hist(job_0, bins=40, color="r", alpha=0.8, label='Bad client', align="left")
plt.hist(job_1, bins=40, color="g", alpha=0.8, label='Good client', align="right")
plt.legend(loc='upper right')
plt.title('job', fontsize=12)
plt.xlabel('Professional occupation')
plt.xticks(rotation='vertical')
plt.ylabel('Absolute frequency');

The classes ‘admin.’, ‘blue-collar’, ‘technician’, ‘management’ and ‘services’ are the jobs with the most subscribers in absolute terms, but in terms of proportion, ‘student’ and ‘retired’ are the most representative.
Below we see that clients in the ‘retired’ and ‘housemaid’ categories are the oldest and the ones who accepted the subscription more than any of the other classes.
type_pivot = df.pivot_table(columns="target", index="job", values="age", aggfunc="mean")
type_pivot.sort_values(by=["job"], ascending=True).plot(kind="bar", title="Type of customer by professional occupation and age", figsize=(6,4), fontsize=12);

Marital

‘Marital’ has 4 unique values. Below, the chart and table show that the dominant class is ‘married’, with 61% of clients, while ‘divorced’ clients correspond to 11% of the total.
# pie chart
df['marital'].value_counts(dropna=False).plot(kind='pie', figsize=(14,9), explode = (0.01, 0.01, 0.01, 0.01), autopct='%1.1f%%', startangle=120);
# table
marital_obs = df.marital.value_counts()
marital_o = pd.DataFrame(marital_obs)
marital_o.rename(columns={"marital":"Freq abs"}, inplace=True)
marital_o_pc = (df.marital.value_counts(normalize=True) *100).round(decimals=2)
marital_obs_pc = pd.DataFrame(marital_o_pc)
marital_obs_pc.rename(columns={"marital":"percent %"}, inplace=True)
marital_obs = pd.concat([marital_o,marital_obs_pc], axis=1)
marital_obs


Education
‘Education’ has 8 unique values. The top 4 education levels account for 80% of the data. Clients with a 4-year basic education or who are illiterate are the oldest and the most prone to subscribe to the product.
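Since the education visuals are omitted here, a small sketch of mine (mirroring the job analysis above) summarizes the subscription rate and mean age per education level:
# subscription rate and mean age per education level
edu_summary = df.groupby('education').agg(subscription_rate=('target', 'mean'), mean_age=('age', 'mean'))
edu_summary['subscription_rate'] = edu_summary['subscription_rate'] * 100
print(edu_summary.sort_values('subscription_rate', ascending=False).round(2))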



Default
With 3 unique values, and the class ‘yes’ being residual, this variable is uninformative and totally imbalanced.


Housing
This variable has 3 unique values, with ‘unknown’ (interpreted as a missing value) representing 2% of the observations. The proportions of ‘yes’ and ‘no’ are very close, which might reduce its predictive power.


Loan
Although, generally speaking, ‘Loan’ shows a high number of non-subscribers, this variable is similar to ‘housing’ in the sense that, proportionally, ‘yes’ and ‘no’ are very even. Once again, this might reduce its predictive power.


Poutcome
Interesting fact: among the clients contacted in previous promotional campaigns where the contact actually succeeded, the majority subscribed this time as well.
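As a quick check (my own snippet), the subscription rate by previous campaign outcome can be computed directly:
# subscription rate by the outcome of the previous campaign
print(df.groupby('poutcome')['target'].mean().multiply(100).round(2))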


What are the most common categories?
The detailed analysis gave us a ton of information about the data, the customers, and the variables’ behavior. Let’s take an overview.
n = len(cat_features)
i = 1
plt.figure(figsize=(16,14))
for feature in df[cat_features]:
    plt.subplot(round(n/2), round(n/3), i)
    df[feature].value_counts().plot.bar()
    plt.xticks(rotation=90)
    plt.title(feature)
    i += 1
plt.tight_layout();


Most common:
- job: administrative
- marital state: married
- education: university degree
- credit in default: no
- housing: yes, although not having a housing loan is a very close second
- loan: no
- poutcome: did not participate in previous campaigns
Notice that all the features include the category ‘unknown’ except the ‘poutcome’ variable.
How do these categories influence the target variable?
n = len(cat_features)
i = 1
plt.figure(figsize=(16,14))
for feature in df[cat_features]:
    plt.subplot(round(n/2), round(n/3), i)
    df.groupby([feature])['target'].mean().multiply(100).plot.barh()
    plt.xlabel('Subscribed [%]')
    plt.title(feature)
    i += 1
plt.tight_layout();


Observations:
- ‘Student’ and ‘retired’ have the highest percentage of subscriptions (>25%) whereas ‘blue-collar’ and ‘services’ have the lowest.
- ‘Illiterate’ people have the highest percentage of subscriptions (>20%), on the other hand ‘basic 9y’, ‘basic 6y’ and ‘basic 4y’ have the lowest.
- People with credit in default did not subscribe.
- More than 60% of the people previously contacted in other campaigns subscribed.
- Marital status, the existence of loans, and housing do not influence the subscription rate much.
Conclusion
I’ve focused on how to begin a machine learning project, starting with one of its most important parts, the one that allows us to look beyond the data: the Exploratory Data Analysis. This is a crucial step because if you don’t deeply understand the data you’re about to deal with, the rest of the project will be a mess.
In the next article (Data Cleaning and Feature Selection) I’ll be doing feature engineering and feature selection, cleaning the dataset, and taking care of anomalies we might find along the way as well as imputations and transformations.
In the last part and third article (Machine Learning: Predictive Modelling) we are going to train some machine learning models to be able to choose the best one and find out the most important features in predicting the minimum costs for the marketing campaign.
- You can find the entire code of this project here.
Check out other articles you might also like:
Descriptive Statistics: Expectations vs. Reality (Exploratory Data Analysis – EDA)
Pandas made easy (the guide – I)
Costs prediction of a Marketing Campaign (Data Cleaning & Feature Selection – Part II)
That’s it! I hope you enjoyed reading this article as much as I enjoyed writing it.
Good readings, great codings!