Logistic Regression classifier on Census Income Data

Animesh Agarwal
Towards Data Science
10 min read · Oct 22, 2018

In this blog, we will analyze the Census Dataset from the UCI Machine Learning Repository.

The dataset contains three files:

  • adult.data: the training set
  • adult.names: a description of the dataset and its attributes
  • adult.test: the test set

We will use Logistic Regression to build the classifier. If you don’t know about Logistic Regression you can go through my previous blog.

We will see how to build a practical machine learning project. In general, any machine learning project requires the following steps:

  • Defining the problem statement
  • Exploratory Data Analysis
  • Training the model
  • Fine tuning the model
  • Saving the model

So let’s get started.

Defining the problem statement

The data contains anonymized information such as age, occupation, education, working class, etc. The goal is to train a binary classifier to predict the income, which has two possible values: ‘>50K’ and ‘<=50K’. There are 48842 instances and 14 attributes in the dataset. The data contains a good blend of categorical, numerical and missing values.

First, we will import the required libraries

import numpy as np
import pandas as pd
import io
import requests
import seaborn as sns
from matplotlib import pyplot as plt
import pickle
import os
from pandas.api.types import CategoricalDtype
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import cross_val_score
%matplotlib inline

Downloading the data

Instead of manually downloading the dataset, we will write a small script which downloads the content for a list of URLs and saves them in a folder. Later, we can directly use the downloaded data instead of hitting the network every time. You can also use this code for any machine learning project.

def load_dataset(path, urls):
    if not os.path.exists(path):
        os.mkdir(path)

    for url in urls:
        data = requests.get(url).content
        filename = os.path.join(path, os.path.basename(url))
        with open(filename, "wb") as file:
            file.write(data)

We will create a data folder in the current working directory and store the content of the URLs.

urls = ["http://archive.ics.uci.edu/ml/machine-learning-  databases/adult/adult.data",
"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names",
"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"]
load_dataset('data', urls)

Next, we load the data into a pandas dataframe using the read_csv function.

columns = ["age", "workClass", "fnlwgt", "education", "education-
num","marital-status", "occupation", "relationship",
"race", "sex", "capital-gain", "capital-loss", "hours-per-
week", "native-country", "income"]
train_data = pd.read_csv('data/adult.data', names=columns,
sep=' *, *', na_values='?')
test_data = pd.read_csv('data/adult.test', names=columns,
sep=' *, *', skiprows=1, na_values='?')

There are some whitespaces before and after the data values. To trim all the whitespaces, we use the separator ' *, *'; since this separator is a regular expression, we also pass engine='python'. The test dataset has a weird first line, hence we skip it using skiprows=1. The missing values in the dataset are indicated by '?'.

Next, we will explore the data. This is an important step before building the model.

Exploratory Data Analysis

Let’s get more information about the training data using train_data.info()

RangeIndex: 32561 entries, 0 to 32560 
Data columns (total 15 columns):
age 32561 non-null int64
workClass 30725 non-null object
fnlwgt 32561 non-null int64
education 32561 non-null object
education-num 32561 non-null int64
marital-status 32561 non-null object
occupation 30718 non-null object
relationship 32561 non-null object
race 32561 non-null object
sex 32561 non-null object
capital-gain 32561 non-null int64
capital-loss 32561 non-null int64
hours-per-week 32561 non-null int64
native-country 31978 non-null object
income 32561 non-null object

Observations

  • There are 32561 samples in the training dataset
  • There are both categorical and numerical columns in the dataset
  • The columns workClass, occupation, native-country have missing values

Similarly, for the test dataset

  • There are 16281 samples
  • There are no missing values

Let’s look at the numerical and the categorical data with the help of some visualizations.

Handling Numerical Columns

We select the numerical columns using the select_dtypes function.

num_attributes = train_data.select_dtypes(include=['int'])
print(num_attributes.columns)
['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

The variables age, hours-per-week are self-explanatory.

  • fnlwgt: sampling weight
  • education-num: number of years of education in total
  • capital-gain/capital-loss: income from investment sources other than salary/wages

fnlwgt is not related to the target variable income and will be removed before building the model

Data Visualizations

num_attributes.hist(figsize=(10,10))

More information about the data can be gathered by using train_data.describe()

Observations

  • None of the numerical attributes have missing values
  • The values are on different scales. Many machine learning models require the values to be on the same scale. We will use StandardScaler from the sklearn library to scale the features; a quick sketch of what this does follows below.
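To make the idea of scaling concrete, here is a minimal sketch, outside of the pipelines we build later, of what StandardScaler does: it centers each numerical column to zero mean and unit variance.

# Illustrative only: standardize the numerical columns directly.
scaler = StandardScaler()
scaled = scaler.fit_transform(num_attributes)

# Each column now has (approximately) zero mean and unit standard deviation.
print(scaled.mean(axis=0).round(3))
print(scaled.std(axis=0).round(3))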

Handling Categorical Columns

cat_attributes = train_data.select_dtypes(include=['object'])
print(cat_attributes.columns)
['workClass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'income']

Data Visualization

We will use countplot from the seaborn package.

sns.countplot(y='workClass', hue='income', data=cat_attributes)
plt.show()
sns.countplot(y='occupation', hue='income', data=cat_attributes)
plt.show()

Observations

  • The column education is just a string representation of the column education-num. We will drop the education column.
  • The variables workClass, occupation, native-country have missing values. We will replace the missing values in each column with the most frequently occurring value of that column.

We need to handle the numerical and categorical attributes differently. Numerical attributes need to be scaled, whereas for categorical columns we need to fill the missing values and then encode the categorical values into numerical values. To apply this sequence of transformations we will use sklearn’s Pipeline class. We will also build custom transformers that can be directly used with Pipeline.

Creating Pipelines

sklearn has many in-built transformers. However, if the in-built ones don’t get the job done for you, you can build a custom transformer. All you need to do is to inherit BaseEstimator and TransformerMixin classes. You also need to implement the fit and transform methods.

  • fit: should return an instance of self
  • transform: the transformation logic can be added here

ColumnsSelector Pipeline

sklearn doesn’t provide transformers to directly manipulate a pandas dataframe. We will write our own custom transformer which selects the corresponding attributes (either numerical or categorical).

class ColumnsSelector(BaseEstimator, TransformerMixin):

    def __init__(self, type):
        self.type = type

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.select_dtypes(include=[self.type])

Numerical Data Pipeline

We select the numerical attributes using the ColumnsSelector transformer defined above and then scale the values using the StandardScaler.

num_pipeline = Pipeline(steps=[
    ("num_attr_selector", ColumnsSelector(type='int')),
    ("scaler", StandardScaler())
])

If we call the fit and transform methods of num_pipeline, it internally calls the fit and transform methods of all the transformers defined in the pipeline, in this case the ColumnsSelector and StandardScaler transformers.
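As a quick illustrative check, assuming the train_data dataframe loaded earlier, we can run just the numerical pipeline on the training data (the full pipeline is assembled later):

# fit_transform runs ColumnsSelector first (selects the int columns) and then
# StandardScaler (scales them); the result is a NumPy array with one column
# per numerical attribute.
num_processed = num_pipeline.fit_transform(train_data)
print(num_processed.shape)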

Categorical Data Pipeline

We need to replace the missing values in the categorical columns. We will replace them with the most frequently occurring value in each column. sklearn comes with Imputer to handle missing values. However, Imputer works only with numerical values. We will write a custom transformer which accepts a list of columns whose missing values need to be replaced, and the strategy used to fill them.

class CategoricalImputer(BaseEstimator, TransformerMixin):

    def __init__(self, columns=None, strategy='most_frequent'):
        self.columns = columns
        self.strategy = strategy

    def fit(self, X, y=None):
        if self.columns is None:
            self.columns = X.columns

        if self.strategy == 'most_frequent':
            self.fill = {column: X[column].value_counts().index[0]
                         for column in self.columns}
        else:
            self.fill = {column: '0' for column in self.columns}

        return self

    def transform(self, X):
        X_copy = X.copy()
        for column in self.columns:
            X_copy[column] = X_copy[column].fillna(self.fill[column])
        return X_copy

All the machine learning models expect numerical values. We will use pd.get_dummies to convert the categorical values to numerical values. This is similar to using OneHotEncoder except that OneHotEncoder requires numerical columns.

We need to merge the train and test dataset before using pd.get_dummies as there might be classes in the test dataset that might not be present in the training dataset. For this, in the fit method, we will concatenate the train and test dataset and find out all the possible values for a column. In the transform method, we will convert each column to Categorical Type and specify the list of categories that the column can take. pd.get_dummies will create a column of all zeros for the category not present in the list of the categories for that column.
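As a small standalone illustration of why fixing the categories matters (toy data, not the census dataset): pd.get_dummies emits a column for every declared category, even one that never occurs in the data, so the train and test matrices end up with the same columns.

# 'blue' never occurs in the Series, but because it is declared as a category,
# get_dummies still creates an all-zero 'blue' column.
colors = pd.Series(['red', 'green', 'red'])
colors = colors.astype(CategoricalDtype(categories=['red', 'green', 'blue']))
print(pd.get_dummies(colors))                   # columns: red, green, blue
print(pd.get_dummies(colors, drop_first=True))  # the 'red' column is dropped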

The transformer also takes an argument dropFirst which indicates whether we should drop the first column after creating dummy columns using pd.get_dummies. We should drop the first column to avoid multicollinearity. By default, the value is set to True

class CategoricalEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, dropFirst=True):
        self.categories = dict()
        self.dropFirst = dropFirst

    def fit(self, X, y=None):
        join_df = pd.concat([train_data, test_data])
        join_df = join_df.select_dtypes(include=['object'])
        for column in join_df.columns:
            self.categories[column] = join_df[column].value_counts().index.tolist()
        return self

    def transform(self, X):
        X_copy = X.copy()
        X_copy = X_copy.select_dtypes(include=['object'])
        for column in X_copy.columns:
            X_copy[column] = X_copy[column].astype(
                CategoricalDtype(self.categories[column]))
        return pd.get_dummies(X_copy, drop_first=self.dropFirst)

Complete Categorical Pipeline

cat_pipeline = Pipeline(steps=[
    ("cat_attr_selector", ColumnsSelector(type='object')),
    ("cat_imputer", CategoricalImputer(columns=['workClass', 'occupation',
                                                'native-country'])),
    ("encoder", CategoricalEncoder(dropFirst=True))
])

Complete Pipeline

We have two transformer pipelines, i.e., num_pipeline and cat_pipeline. We can merge them using FeatureUnion.

full_pipeline = FeatureUnion([("num_pipe", num_pipeline),
                              ("cat_pipeline", cat_pipeline)])

Now that we have all the pipelines for building the model, let's prepare the data and build it.

We will drop the columns that aren’t required

train_data.drop(['fnlwgt', 'education'], axis=1, inplace=True)
test_data.drop(['fnlwgt', 'education'], axis=1, inplace=True)

Training the model

We will create a copy of the training dataset and separate the feature vectors and the target values.

train_copy = train_data.copy()
train_copy["income"] = train_copy["income"].apply(
    lambda x: 0 if x == '<=50K' else 1)
X_train = train_copy.drop('income', axis=1)
Y_train = train_copy['income']

Next, we pass the X_train to the full_pipeline we built.

X_train_processed = full_pipeline.fit_transform(X_train)

model = LogisticRegression(random_state=0)
model.fit(X_train_processed, Y_train)

Testing the model

test_copy = test_data.copy()
test_copy["income"] = test_copy["income"].apply(
    lambda x: 0 if x == '<=50K.' else 1)
X_test = test_copy.drop('income', axis=1)
Y_test = test_copy['income']

We apply the same transformations, fitted on the training data, to the test dataset.

# Use transform (not fit_transform) so that the imputer and scaler fitted on
# the training data are reused for the test data.
X_test_processed = full_pipeline.transform(X_test)
predicted_classes = model.predict(X_test_processed)

Model Evaluation

We will use accuracy_score from sklearn to find the accuracy of the model

accuracy_score(predicted_classes, Y_test.values)

The accuracy is 85.2%
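Since classification_report is already imported, we can also look at the per-class precision and recall, which is often more informative than overall accuracy on an imbalanced dataset like this one:

# Precision, recall and F1-score for each class (0: <=50K, 1: >50K)
print(classification_report(Y_test.values, predicted_classes))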

Let’s plot the confusion matrix.

# confusion_matrix expects the actual classes first and the predictions second,
# so the rows (y-axis) are the actual classes and the columns (x-axis) the
# predicted classes.
cfm = confusion_matrix(Y_test.values, predicted_classes)
sns.heatmap(cfm, annot=True)
plt.xlabel('Predicted classes')
plt.ylabel('Actual classes')

The X-axis represents the predicted classes and the Y-axis represents the actual classes. How do we interpret the confusion matrix? The model correctly predicted class 0 about 1.2e+04 times when the actual class was 0. Similar conclusions can be drawn for the remaining cells.
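For a binary problem, the four counts can also be read off directly. With the actual classes passed first and the predictions second, ravel() returns them in the order true negatives, false positives, false negatives, true positives:

# tn: correctly predicted <=50K, tp: correctly predicted >50K
tn, fp, fn, tp = confusion_matrix(Y_test.values, predicted_classes).ravel()
print(tn, fp, fn, tp)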

Cross-Validation

We will use StratifiedKFold to divide our dataset into k folds. In each iteration, k-1 folds are used as the training set and the remaining fold is used for validation. We use StratifiedKFold because it preserves the percentage of samples from each class.

If we use KFold, we run the risk of introducing sampling bias, i.e., the training set might contain a large number of samples where the income is greater than 50K while the test set contains more samples where the income is less than 50K. In this case, the model built from the training data will not generalize well to the test dataset, whereas StratifiedKFold ensures that there are enough samples of each class in both the training and test sets.
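A minimal sketch of this property, assuming the X_train_processed and Y_train prepared in the training step above: every validation fold produced by StratifiedKFold keeps roughly the same fraction of '>50K' samples as the full training set.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
overall = Y_train.mean()  # fraction of '>50K' samples in the training data
for train_idx, val_idx in skf.split(X_train_processed, Y_train):
    # Each fold's class balance stays close to the overall balance.
    print(round(Y_train.iloc[val_idx].mean(), 3), round(overall, 3))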

We will use cross_val_score for cross-validation. The parameter cv determines the cross-validation strategy. StratifiedKFold is used if an integer value is passed to cv.

cross_val_model = LogisticRegression(random_state=0)
scores = cross_val_score(cross_val_model, X_train_processed,
                         Y_train, cv=5)
print(np.mean(scores))

We take the mean of all the scores obtained in each iteration as the final score of our model.

The accuracy with cross-validation is 85.0%.

Fine Tuning the model

We can fine-tune our model by playing around with the parameters. sklearn comes with GridSearchCV to do an exhaustive search over specified parameter values for an estimator.

# penalty specifies the norm used in the penalization.
# Note: the 'l1' penalty requires a solver that supports it, such as
# 'liblinear' (the default in older versions of sklearn).
penalty = ['l1', 'l2']

# C is the inverse of the regularization strength
C = np.logspace(0, 4, 10)
random_state = [0]

# creating a dictionary of hyperparameters
hyperparameters = dict(C=C, penalty=penalty, random_state=random_state)

Using GridSearchCV to find the optimal parameters

clf = GridSearchCV(estimator=model, param_grid=hyperparameters, cv=5)
best_model = clf.fit(X_train_processed, Y_train)
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])

The best parameters are

Best Penalty: l1 
Best C: 1.0

Predicting using the best model

best_predicted_values = best_model.predict(X_test_processed)
accuracy_score(best_predicted_values, Y_test.values)

The accuracy of the model with the best parameters is 85.2%

Saving the model

We have done all the hard work of creating and testing the model. It would be good if we could save the model for future use rather than retrain it. We will save our model using pickle.

filename = 'final_model.sav'
pickle.dump(model, open(filename, 'wb'))

Loading the model from the pickle

saved_model = pickle.load(open(filename, 'rb')) 

Now we can use the model for prediction purposes.
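For example, assuming X_test_processed and Y_test from the earlier steps are still in memory, the reloaded model scores the test set exactly like the original one:

loaded_predictions = saved_model.predict(X_test_processed)
print(accuracy_score(Y_test.values, loaded_predictions))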

That’s all for this blog. The complete Jupyter Notebook can be found here.

Final Remarks

We have learned to build a complete machine learning project. In the process, we built custom transformers that can be used with sklearn’s Pipeline class. We also learned to fine-tune our model and save it for further use.

Please let me know if there is any part I could have done better.

Thanks for Reading!!
