Logistic Regression classifier on Census Income Data
In this blog, we will analyze the Census Dataset from the UCI Machine Learning Repository.
The dataset contains three files:
- adult.data: the training dataset
- adult.names: dataset description
- adult.test: test dataset
We will use Logistic Regression to build the classifier. If you don’t know about Logistic Regression you can go through my previous blog.
We will see how to build a practical machine learning project. In general, any machine learning project requires the following steps:
- Defining the problem statement
- Exploratory Data Analysis
- Training the model
- Fine tuning the model
- Save the model
So let’s get started.
Defining the problem statement
The data contains anonymous information such as age, occupation, education, working class, etc. The goal is to train a binary classifier to predict the income which has two possible values ‘>50K’ and ‘<50K’. There are 48842 instances and 14 attributes in the dataset. The data contains a good blend of categorical, numerical and missing values.
First, we will import the required libraries
import numpy as np
import pandas as pd
import io
import requests
import seaborn as sns
from matplotlib import pyplot as plt
import pickle
import os
from pandas.api.types import CategoricalDtypefrom sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import cross_val_score%matplotlib inline
Downloading the data
Instead of manually downloading the dataset, we will write a small script which downloads the content for a list of URLs and saves them in a folder. Later, we can directly use the downloaded data instead of hitting the network every time. You can also use this code for any machine learning project.
def load_dataset(path, urls):
if not os.path.exists(path):
os.mkdir(path)
for url in urls:
data = requests.get(url).content
filename = os.path.join(path, os.path.basename(url))
with open(filename, "wb") as file:
file.write(data)
We will create a data
folder in the current working directory and store the content of the URLs.
urls = ["http://archive.ics.uci.edu/ml/machine-learning- databases/adult/adult.data",
"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names",
"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"]load_dataset('data', urls)
Next, we load the data into a pandas
dataframe using the read_csv
function.
columns = ["age", "workClass", "fnlwgt", "education", "education-
num","marital-status", "occupation", "relationship",
"race", "sex", "capital-gain", "capital-loss", "hours-per-
week", "native-country", "income"]train_data = pd.read_csv('data/adult.data', names=columns,
sep=' *, *', na_values='?')
test_data = pd.read_csv('data/adult.test', names=columns,
sep=' *, *', skiprows=1, na_values='?')
There are some whitespaces before and after the data values. To trim all the whitespaces we use the separator ‘ *, *’
. The test dataset has a weird first line, hence we skip the line using skiprows=1
. The missing values in the dataset are indicated by ?
Next, we will explore the data. This is an important step before going building the model.
Exploratory Data Analysis
Let’s get more information about the training data using train_data.info()
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age 32561 non-null int64
workClass 30725 non-null object
fnlwgt 32561 non-null int64
education 32561 non-null object
education-num 32561 non-null int64
marital-status 32561 non-null object
occupation 30718 non-null object
relationship 32561 non-null object
race 32561 non-null object
sex 32561 non-null object
capital-gain 32561 non-null int64
capital-loss 32561 non-null int64
hours-per-week 32561 non-null int64
native-country 31978 non-null object
income 32561 non-null object
Observations
- There are 32561 samples in the training dataset
- There are both categorical and numerical columns in the dataset
- The columns workClass, occupation, native-country have missing values
Similarly, for the test dataset
- There are 16281 samples
- There are no missing values
Let’s look the numerical and the categorical data with the help of some visualizations.
Handling Numerical Columns
We select the numerical columns using the select_dtypes
function.
num_attributes = train_data.select_dtypes(include=['int'])
print(num_attributes.columns)['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
The variables age, hours-per-week are self-explanatory.
- fnlwgt: sampling weight
- education-num: number of years of education in total
- capital-gain/capital-loss: income from investment sources other than salary/wages
fnlwgt is not related to the target variable income and will be removed before building the model
Data Visualizations
num_attributes.hist(figsize=(10,10))
More information about the data can be gathered by using train_data.describe()
Observations
- None of the numerical attributes have missing values
- The values are on different scales. Many machine learning models require the values to be on the same scale. We will use StandardScaler from the sklearn library to scale the features.
Handling Categorical Columns
cat_attributes = train_data.select_dtypes(include=['object'])
print(cat_attributes.columns)['workClass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'income']
Data Visualization
We will use countplot
from the seaborn package.
sns.countplot(y='workClass', hue='income', data = cat_attributes)
sns.countplot(y='occupation', hue='income', data = cat_attributes)
Observations
- The column education is just a string representation of the column education-num. We will drop the education column.
- The variables workClass, occupation, native-country have missing values. We will replace the missing values in each column with the most_frequent occurring value of that column.
We need to handle the numerical and categorical attributes differently. Numerical attributes need to be scaled, whereas for categorical columns we need to fill the missing values and then encode the categorical values into numerical values. To apply these sequence of transformations we will use the sklearn’s Pipeline class. We will also build custom transformers that can be directly used with Pipeline.
Creating Pipelines
sklearn has many in-built transformers. However, if the in-built ones don’t get the job done for you, you can build a custom transformer. All you need to do is to inherit BaseEstimator and TransformerMixin classes. You also need to implement the fit and transform methods.
- fit: should return an instance of self
- transform: the transformation logic can be added here
ColumnsSelector Pipeline
sklearn doesn’t provide libraries to directly manipulate with pandas dataframe. We will write our own custom transformer which will select the corresponding attributes (either numerical or categorical)
class ColumnsSelector(BaseEstimator, TransformerMixin):
def __init__(self, type):
self.type = type
def fit(self, X, y=None):
return self
def transform(self,X):
return X.select_dtypes(include=[self.type])
Numerical Data Pipeline
We select the numerical attributes using the ColumnsSelector transformer defined above and then scale the values using the StandardScaler.
num_pipeline = Pipeline(steps=[
("num_attr_selector", ColumnsSelector(type='int')),
("scaler", StandardScaler())
])
If we call the fit
and transform
methods for the num_pipeline
it internally calls the fit
and transform
methods for all the transformers defined in the pipeline. In this case, the ColumnsSelector and StandardScaler transformers.
Categorical Data Pipeline
We need to replace the missing values in the categorical columns. We will replace the missing values with the most frequently occurring value in each column. sklearn comes with Imputer to handle missing values. However, Imputer works only with numerical values. We will write a custom transformer which will accept a list of columns for which we need to replace the missing values and the strategy used to fill the missing values
class CategoricalImputer(BaseEstimator, TransformerMixin):
def __init__(self, columns = None, strategy='most_frequent'):
self.columns = columns
self.strategy = strategy
def fit(self,X, y=None):
if self.columns is None:
self.columns = X.columns
if self.strategy is 'most_frequent':
self.fill = {column: X[column].value_counts().index[0] for
column in self.columns}
else:
self.fill ={column: '0' for column in self.columns}
return self
def transform(self,X):
X_copy = X.copy()
for column in self.columns:
X_copy[column] = X_copy[column].fillna(self.fill[column])
return X_copy
All the machine learning models expect numerical values. We will use pd.get_dummies to convert the categorical values to numerical values. This is similar to using OneHotEncoder except that OneHotEncoder requires numerical columns.
We need to merge the train and test dataset before using pd.get_dummies as there might be classes in the test dataset that might not be present in the training dataset. For this, in the fit
method, we will concatenate the train and test dataset and find out all the possible values for a column. In the transform
method, we will convert each column to Categorical Type and specify the list of categories that the column can take. pd.get_dummies will create a column of all zeros for the category not present in the list of the categories for that column.
The transformer also takes an argument dropFirst
which indicates whether we should drop the first column after creating dummy columns using pd.get_dummies. We should drop the first column to avoid multicollinearity. By default, the value is set to True
class CategoricalEncoder(BaseEstimator, TransformerMixin):
def __init__(self, dropFirst=True):
self.categories=dict()
self.dropFirst=dropFirst
def fit(self, X, y=None):
join_df = pd.concat([train_data, test_data])
join_df = join_df.select_dtypes(include=['object'])
for column in join_df.columns:
self.categories[column] =
join_df[column].value_counts().index.tolist()
return self
def transform(self, X):
X_copy = X.copy()
X_copy = X_copy.select_dtypes(include=['object'])
for column in X_copy.columns:
X_copy[column] = X_copy[column].astype({column:
CategoricalDtype(self.categories[column])})
return pd.get_dummies(X_copy, drop_first=self.dropFirst)
Complete Categorical Pipeline
cat_pipeline = Pipeline(steps=[
("cat_attr_selector", ColumnsSelector(type='object')),
("cat_imputer", CategoricalImputer(columns=
['workClass','occupation', 'native-country'])),
("encoder", CategoricalEncoder(dropFirst=True))
])
Complete Pipeline
We have two transformer pipeline i.e, num_pipeline and cat_pipeline. We can merge them using FeatureUnion
full_pipeline = FeatureUnion([("num_pipe", num_pipeline),
("cat_pipeline", cat_pipeline)])
Now we have all the pipelines for building the model, let’s prepare the data for the model and build it.
We will drop the columns that aren’t required
train_data.drop(['fnlwgt', 'education'], axis=1, inplace=True)
test_data.drop(['fnlwgt', 'education'], axis=1, inplace=True)
Training the model
We will create a copy of the training dataset and separate the feature vectors and the target values.
train_copy = train_data.copy()
train_copy["income"] = train_copy["income"].apply(lambda x:0 if
x=='<=50K' else 1)X_train = train_copy.drop('income', axis =1)
Y_train = train_copy['income']
Next, we pass the X_train
to the full_pipeline
we built.
X_train_processed=full_pipeline.fit_transform(X_train)model = LogisticRegression(random_state=0)
model.fit(X_train_processed, Y_train)
Testing the model
test_copy = test_data.copy()
test_copy["income"] = test_copy["income"].apply(lambda x:0 if
x=='<=50K.' else 1)X_test = test_copy.drop('income', axis =1)
Y_test = test_copy['income']
We apply the same transformations to the test dataset that we applied to the training dataset.
X_test_processed = full_pipeline.fit_transform(X_test)predicted_classes = model.predict(X_test_processed)
Model Evaluation
We will use accuracy_score from sklearn to find the accuracy of the model
accuracy_score(predicted_classes, Y_test.values)
The accuracy is 85.2%
Let’s plot the confusion matrix.
cfm = confusion_matrix(predicted_classes, Y_test.values)
sns.heatmap(cfm, annot=True)
plt.xlabel('Predicted classes')
plt.ylabel('Actual classes')
The X-axis represents the Predicted classes and the Y-axis represents the Actual classes. How do we interpret the confusion matrix? 1.2e+04
times the model correctly predicted the class 0 when the actual class was 0. Similarly, conclusions can be drawn for the remaining cases.
Cross-Validation
We will use StratifiedKFold to divide our dataset into k folds. In each iteration, k-1
folds are used as the training set and the remaining fold is used as the validation. We use StratifiedKFold because it preserves the percentage of samples from each class.
If we use KFold, we might run the risk of introducing sampling bias i.e, the training set might contain a large number of samples where income is greater than 50K and test set contains more samples where income is less than 50K. In this case, the model build from training data will not generalize well for test dataset. Whereas StratifiedKFold will ensure that there are enough samples of each class in both the train and test dataset.
We will use cross_val_score for cross-validation. The parameter cv
determines the cross-validation strategy. StatifiedKFold is used if an integer value is passed to cv
cross_val_model = LogisticRegression(random_state=0)
scores = cross_val_score(cross_val_model, X_train_processed,
Y_train, cv=5)
print(np.mean(scores))
We take the mean of all the score obtained in each iteration as the final score of our model.
The accuracy with cross-validation is 85.0%.
Fine Tuning the model
We can fine-tune our model by playing around with the parameters. sklearn comes with GridSearchCV to do an exhaustive search over specified parameter values for an estimator.
# penalty specifies the norm in the penalization
penalty = ['l1', 'l2']# C is the inverese of regularization parameter
C = np.logspace(0, 4, 10)random_state=[0]# creating a dictionary of hyperparameters
hyperparameters = dict(C=C, penalty=penalty,
random_state=random_state)
Using GridSearchCV to find the optimal parameters
clf = GridSearchCV(estimator = model, param_grid = hyperparameters,
cv=5)
best_model = clf.fit(X_train_processed, Y_train)
print('Best Penalty:', best_model.best_estimator_.get_params() ['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])
The best parameters are
Best Penalty: l1
Best C: 1.0
Predicting using the best model
best_predicted_values = best_model.predict(X_test_processed)
accuracy_score(best_predicted_values, Y_test.values)
The accuracy of the model with the best parameters is 85.2%
Saving the model
We have done all the hard work of creating and testing the model. It would be good if we could save the model for future use rather than retrain it. We will save our model in the pickle.
filename = 'final_model.sav'
pickle.dump(model, open(filename, 'wb'))
Loading the model from the pickle
saved_model = pickle.load(open(filename, 'rb'))
Now we can use the model for prediction purposes.
That’s all for this blog. The complete Jupyter Notebook can be found here.
Final Remarks
We have learned to build a complete machine learning project. In the process, we built custom transformers that can be used with sklearn’s Pipeline class. We also learned to fine-tune our model and save it for further use.
Please let me know if there is any part I could have done better.
Thanks for Reading!!