Building a Deployable ML Classifier in Python

Sambit Mahapatra
Towards Data Science
6 min read · Mar 5, 2018


Nowadays, machine learning has become a necessary, effective, and efficient way to solve problems, thanks to the complexity of the problems involved and the huge amounts of data associated with them. In most resources, a machine learning model is developed on structured data just to check the model's accuracy. But in real-world use, some of the major requirements while developing a machine learning model are handling imbalanced data during model building, tuning the model's parameters, and saving the model to the file system for later use or deployment. Here, we will see how to design a binary classifier in Python while dealing with all three of these requirements.

While developing a machine learning model, we generally follow a standard workflow: getting the data, doing feature engineering on it, building a model with proper parameters through iterative training and testing, and deploying the built model to production.

Machine Learning Work Flow

We will go through this workflow by building a binary classifier that predicts the quality of red wine from the available features. The dataset is publicly available in the UCI Machine Learning Repository. The scikit-learn library is used here for the classifier design. For the source code, the GitHub link is:

First we need to import all the necessary dependencies and load the dataset. We almost always need NumPy and pandas in any ML model design, since data frame, matrix, and array operations are involved throughout.

import numpy as np
import pandas as pd
df = pd.read_csv("winequality-red.csv", sep=";")
#the original UCI file is semicolon-separated; drop sep if your copy uses commas
df.head()

The dataset looks like this:

As can be seen here, the quality is represented by numbers from 3 to 8. To make this a binary classification problem, let's treat quality > 5 as good and anything else as bad.

df["quality_bin"] = np.zeros(df.shape[0])
df["quality_bin"] = df["quality_bin"].where(df["quality"]>=6, 1)
#1 means good quality and 0 means bad quality
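
To double-check the new label, you can inspect the class counts (a quick sanity check; the counts below follow from the dataset's quality distribution):

print(df["quality_bin"].value_counts())
#expected: 855 good (1) and 744 bad (0)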

To get a statistical summary of the data:

df.describe()

As visible from the snapshot, the value ranges differ quite a bit across attributes. It is good practice to standardize the values, as this brings the variances to a comparable level. Also, since many algorithms rely on Euclidean distance internally, having scaled features helps in model building.

from sklearn.preprocessing import StandardScaler
X_data = df.iloc[:, :11].values   # the 11 physicochemical features
y_data = df.iloc[:, 12].values    # the binary quality label created above
scaler = StandardScaler()
X_data = scaler.fit_transform(X_data)
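
As a quick optional check, the transformed features should now have roughly zero mean and unit variance:

print(X_data.mean(axis=0).round(2))  # ~0 for every feature
print(X_data.std(axis=0).round(2))   # ~1 for every feature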

Here fit_transform is used so that the StandardScaler fits on X_data and transforms it in one step. If you need to fit on one dataset and transform another, you can also call the fit and transform functions separately. Now, we have 1599 data instances in total, of which 855 are of good quality and 744 of bad quality. The data is clearly imbalanced. Since the dataset is small, we will go for oversampling rather than undersampling, which would discard data. But it is important to note that resampling should always be done on the training data only, never on the testing/validation data. Now, let's divide the dataset into training and testing sets for model building.

from sklearn.model_selection import train_test_split
#(train_test_split moved here from the removed sklearn.cross_validation module)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=42)
#30% of the data is randomly held out for testing
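
A side note on calling fit and transform separately: in this walkthrough the scaler was fitted on the full dataset before splitting, which is simple but can leak test-set statistics into training. A stricter variant (a sketch of the alternative order, with hypothetical names; not part of the running example) splits first and fits the scaler on the training portion only:

#alternative: split the raw features first, then fit the scaler on the training part only
X_tr, X_te, y_tr, y_te = train_test_split(df.iloc[:, :11].values, y_data, test_size=0.3, random_state=42)
scaler2 = StandardScaler()
X_tr = scaler2.fit_transform(X_tr)  # statistics learned from training data only
X_te = scaler2.transform(X_te)      # same transformation applied to the test data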

Now we have 588 good quality and 531 bad quality instances for training, while 267 good quality and 213 bad quality instances remain for testing. It's time to resample the training data in order to balance it so that the model won't be biased; we will use the SMOTE algorithm for oversampling here. As an aside, instead of a single train/test split you can also use cross-validation, which is generally believed to give a more reliable performance estimate.
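
A minimal cross-validation sketch (an illustration on the scaled data with a default SGDClassifier; not part of the main pipeline):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier
scores = cross_val_score(SGDClassifier(random_state=42), X_data, y_data, cv=5)
print(scores.mean())  # average accuracy over the 5 folds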

from imblearn.over_sampling import SMOTE
#resampling must be done on the training dataset only
X_train_res, y_train_res = SMOTE().fit_resample(X_train, y_train)
#(fit_resample was called fit_sample in older imbalanced-learn versions)
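
To confirm that the resampling balanced the classes, a quick check:

from collections import Counter
print(Counter(y_train_res))  # both classes should now have 588 instances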

After oversampling, there are 588 instances each of good and bad quality wine in the training set. Now it is time for model selection. I have chosen the Stochastic Gradient Descent (SGD) classifier here, but you can try a few models, compare their accuracies, and select the most suitable one.
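
For instance, a quick comparison loop might look like this (a sketch; the alternative models are illustrative choices, not from the original walkthrough):

from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
for m in [SGDClassifier(random_state=42), LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)]:
    m.fit(X_train_res, y_train_res)  # train on the balanced data
    print(type(m).__name__, accuracy_score(y_test, m.predict(X_test)))  # evaluate on the untouched test set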

from sklearn.linear_model import SGDClassifier
sg = SGDClassifier(random_state=42)
sg.fit(X_train_res,y_train_res)
pred = sg.predict(X_test)
from sklearn.metrics import classification_report,accuracy_score
print(classification_report(y_test, pred))
print(accuracy_score(y_test, pred))

The result looks like this:

Classification Report

The accuracy obtained is 65.625%. Parameters like the learning rate, the loss function, etc. play a major role in the model's performance. We can effectively select the best parameters for the model using GridSearchCV.

#parameter tuning
from sklearn.model_selection import GridSearchCV
#model
model = SGDClassifier(random_state=42)
#parameter grid to search over
params = {'loss': ["hinge", "log_loss", "perceptron"],
          'alpha': [0.001, 0.0001, 0.00001]}
#("log_loss" is named "log" in scikit-learn versions before 1.1)
#carrying out the grid search
clf = GridSearchCV(model, params)
clf.fit(X_train_res, y_train_res)
#the parameters selected by the grid search
print(clf.best_estimator_)

Best Parameters for the Classifier
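
Besides best_estimator_, the fitted GridSearchCV object also exposes the chosen parameter values and the cross-validated score, and with the default refit=True it can be used for prediction directly:

print(clf.best_params_)  # e.g. {'alpha': 0.001, 'loss': 'hinge'}
print(clf.best_score_)   # mean cross-validated accuracy of the best combination
#clf.predict(X_test) would delegate to the refitted best model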

As can be seen here, only the loss function and alpha were supplied for tuning; the same can be done with other parameters. The best option for the loss function turns out to be 'hinge', i.e. a linear SVM, and the best alpha value is 0.001. Now, we will build a model using the best parameters selected by the grid search.

#final model by taking suitable parameters
clf = SGDClassifier(random_state=42, loss="hinge", alpha=0.001)
clf.fit(X_train_res, y_train_res)
pred = clf.predict(X_test)

Now that we have selected the model and tuned its parameters, it is time to validate the model before deployment.

print(classification_report(y_test, pred))
print(accuracy_score(y_test, pred))

Classification Report

As can be seen here, after tuning the parameters the metric values improved by 2–3%, and the accuracy improved from 65.625% to 70.625%. If you are still not satisfied with the model, you can try other algorithms through a few more training and testing iterations. Now that the model is built, it needs to be saved to the file system for later use or deployment elsewhere.

import joblib
#(older scikit-learn versions used "from sklearn.externals import joblib", which has been removed)
joblib.dump(clf, "wine_quality_clf.pkl")

When you need the classifier again, it can simply be loaded with joblib, and a feature array can be passed in to get the prediction.

clf1 = joblib.load("wine_quality_clf.pkl")
clf1.predict([X_test[0]])  # predict for a single, already scaled feature vector
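
One caveat for real deployment: X_test here is already standardized. Fresh input must go through the same scaler that was fitted during training, so it is worth persisting the scaler alongside the classifier. A sketch (the file name and the sample feature values are illustrative):

joblib.dump(scaler, "wine_quality_scaler.pkl")  # save the fitted scaler as well
scaler1 = joblib.load("wine_quality_scaler.pkl")
raw = [[7.4, 0.7, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4]]  # 11 raw feature values for one wine
clf1.predict(scaler1.transform(raw))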

Congrats! Now you are ready to design a deployable machine learning model. :-D



