Scikit-learn is a powerful machine learning library in Python. It provides many tools for classification, regression, and clustering tasks. In this post we will discuss some popular tools for building classification models with scikit-learn.
Let’s get started!
For our purposes we will be working with the Bank Churn Modeling data set. The data can be found here.
To start, let’s import the Pandas library, relax display limits and print the first five rows of data:
import pandas as pd
df = pd.read_csv("Bank_churn_modelling.csv")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print(df.head())
CLASSIFICATION
Let’s consider the task of building a classification model for predicting whether a customer will churn (stop using a service or product). For simplicity, let’s use ‘CreditScore’, ‘Age’, ‘Tenure’ and ‘NumOfProducts’ to predict churn (‘Exited’):
X = df[['CreditScore', 'Age', 'Tenure', 'NumOfProducts']]
y = df['Exited']
Let’s also split our data into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Let’s begin by building a simple logistic regression model. Let’s import the logistic regression module from scikit-learn:
from sklearn.linear_model import LogisticRegression
Next we will define a model object, fit our model and make predictions on our test data:
model_logr = LogisticRegression()
model_logr.fit(X_train, y_train)
y_pred = model_logr.predict(X_test)
We can evaluate our model using a confusion matrix. To visualize it, we can use a snippet of code borrowed from seralouk on StackOverflow:
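The original StackOverflow snippet is not reproduced here; the lines below are a minimal sketch of the same idea, computing the confusion matrix with scikit-learn and rendering it as an annotated heatmap with matplotlib (the plotting details are my own, not the borrowed code):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
# Compute the confusion matrix from the test labels and predictions
conmat = confusion_matrix(y_test, y_pred)
# Render it as a simple heatmap with the count annotated in each cell
fig, ax = plt.subplots()
ax.matshow(conmat, cmap=plt.cm.Blues, alpha=0.5)
for (i, j), count in np.ndenumerate(conmat):
    ax.text(j, i, str(count), ha='center', va='center')
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')
ax.set_title('Confusion Matrix')
plt.show()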
We see that our model is really good at predicting negative cases (people who don’t churn) but performs poorly when it comes to predicting actual churners. Let’s inspect the number of churners and non-churners using the Counter class from the collections module:
from collections import Counter
print(Counter(df['Exited']))
We see that this data has a significant imbalance in labels. One way to remedy this is to balance the training data. Instead of using ‘train_test_split’, we can use the Pandas sample method on our original dataframe to generate training and test sets:
df_train = df.sample(int(0.67*len(df)), random_state = 42)
df_test = df[~df['CustomerID'].isin(list(df_train['CustomerID']))]
We can now balance the training data while leaving the distribution in labels in the test set untouched:
sample_in = min(Counter(df_train['Exited']).values()) - 1
df_1 = df_train[df_train['Exited'] == 0]
df_2 = df_train[df_train['Exited'] == 1]
df_1 = df_1.sample(n=sample_in, random_state=42)
df_2 = df_2.sample(n=sample_in, random_state=24)
df_train = pd.concat([df_1, df_2])
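As a quick sanity check (not part of the original code), we can confirm that the rebalanced training labels are now evenly split:
print(Counter(df_train['Exited']))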
Now if we train our model and predict on the test set we get:
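Since the balanced frames replace the earlier split, we first need to rebuild the feature and label arrays. The lines below are a minimal sketch of this step; the ‘features’ helper list is introduced here for convenience and is not from the original post:
# Rebuild features and labels from the rebalanced training frame and the untouched test frame
features = ['CreditScore', 'Age', 'Tenure', 'NumOfProducts']
X_train, y_train = df_train[features], df_train['Exited']
X_test, y_test = df_test[features], df_test['Exited']
# Retrain the logistic regression model and predict on the test set
model_logr = LogisticRegression()
model_logr.fit(X_train, y_train)
y_pred = model_logr.predict(X_test)
print(confusion_matrix(y_test, y_pred))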
This looks much better. Before we look at how to train additional classification models like random forests and support vector machines (SVMs), let’s define a function that allows us to choose between training a logistic regression, a random forest, or an SVM model:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

def train_model(model_type):
    # Map each model name to an unfitted scikit-learn estimator
    models = {'logistic_regression': LogisticRegression(),
              'random_forests': RandomForestClassifier(random_state=42),
              'SVM': SVC()}
    model = models[model_type]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    conmat = confusion_matrix(y_test, y_pred)
    conmat = np.mat(conmat)
    return y_pred, conmat
Now let’s call our function with ‘logistic_regression’:
y_pred, conmat = train_model('logistic_regression')
And when we run our script we should reproduce the same confusion matrix.
Now let’s call our function with ‘random_forests’. If you are unfamiliar with random forests I recommend Tony Yiu’s article Understanding Random Forests:
y_pred, conmat = train_model('random_forests')
This can be further improved with hyperparameter tuning. We can use the following function to search for the best hyperparameters, using the RandomizedSearchCV class from sklearn.model_selection. (If you want to learn more about random forest parameters I recommend reading the documentation.) Within our function, let’s define a grid of search values:
def get_rf_parameters():
    n_estimators = [10, 50, 100]
    max_features = ['auto', 'sqrt', 'log2']
    max_depth = [5, 10, 20, 50, None]
    min_samples_split = [2, 4, 6, 8]
    min_samples_leaf = [1, 2, 4, 6]
    bootstrap = [True, False]
    random_grid = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf,
                   'bootstrap': bootstrap}
    ...
Now let’s import RandomizedSearchCV, initialize a model object, pass the model into a RandomizedSearchCV object, fit it, and find the best parameters. We will run 5 random search iterations and choose the best parameters using 3-fold cross validation (n_iter=5, cv=3):
from sklearn.model_selection import RandomizedSearchCV

def get_rf_parameters():
    ...
    model = RandomForestClassifier(random_state=42)
    rf_random = RandomizedSearchCV(estimator=model, param_distributions=random_grid,
                                   n_iter=5, cv=3, verbose=2, random_state=42)
    rf_random.fit(X_train, y_train)
    parameters = rf_random.best_params_
    print("Best Parameters: ", parameters)
    return parameters
Let’s call our function and store the return value in a new variable:
rf_par = get_rf_parameters()
We should see a series of CV fits being run, followed by the best parameters being printed.
We’ll also update our ‘train_model’ function so that it takes the random forest parameters as input:
def train_model(model_type, rf_parameters):
    models = {'logistic_regression': LogisticRegression(),
              'random_forests': RandomForestClassifier(**rf_parameters, random_state=42),
              'SVM': SVC()}
    ...
Now let’s call ‘train_model’ with the random forest parameters and run our script:
y_pred, conmat = train_model('random_forests', rf_par)
We see that performance improves upon optimizing the random forest hyperparameters. Now let’s do the same for support vector machines (documentation). First let’s define our SVM grid search function and find the best SVM parameters:
def get_svm_parameters():
    C = [0.1, 1, 10]
    gamma = [1, 0.1, 0.01]
    kernel = ['rbf', 'linear']
    random_grid = {'C': C, 'gamma': gamma, 'kernel': kernel}
    model = SVC(random_state=42)
    svm_random = RandomizedSearchCV(estimator=model, param_distributions=random_grid,
                                    n_iter=5, cv=3, verbose=2, random_state=42)
    svm_random.fit(X_train, y_train)
    parameters = svm_random.best_params_
    return parameters
svm_par = get_svm_parameters()
Next, let’s modify our ‘train_model’ function so that it takes SVM parameters:
def train_model(model_type, rf_parameters, svm_parameters):
    if rf_parameters:
        models = {'logistic_regression': LogisticRegression(),
                  'random_forests': RandomForestClassifier(**rf_parameters, random_state=42),
                  'SVM': SVC()}
    elif svm_parameters:
        models = {'logistic_regression': LogisticRegression(),
                  'random_forests': RandomForestClassifier(random_state=42),
                  'SVM': SVC(**svm_parameters)}
    else:
        # Fall back to default models if no tuned parameters are passed
        models = {'logistic_regression': LogisticRegression(),
                  'random_forests': RandomForestClassifier(random_state=42),
                  'SVM': SVC()}
    model = models[model_type]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    conmat = confusion_matrix(y_test, y_pred)
    conmat = np.mat(conmat)
    return y_pred, conmat
Now we can call ‘train_model’ with the SVM parameters and run our script:
y_pred, conmat = train_model('SVM', None, svm_par)
I’ll stop here but I encourage you to play around with the data and code yourself.
CONCLUSIONS
To summarize, in this post we discussed how to build classification models using the Python machine learning library scikit-learn. First we showed how to build a logistic regression model. We also showed how to improve performance by balancing the training data. Next we discussed how to train a random forest model and perform a hyperparameter search to optimize performance. Finally, we repeated this process for a support vector machine classifier. If you are interested in learning about the basics of Python programming, data manipulation with Pandas, and machine learning in Python, check out Python for Data Science and Machine Learning: Python Programming, Pandas and Scikit-learn Tutorials for Beginners. I hope you found this post useful/interesting. The code from this post is available on GitHub. Thank you for reading!