It’s 8 PM and you are still cleaning the data, performing EDA, and creating more features. Your initial discussion with the business is first thing tomorrow, and the expectation is to present which "classification" methodologies are under consideration and what key features are coming out of the models considered. As a data scientist, you want to try as many methodologies as possible and walk into that meeting with the best model prediction you can get.
But, as always, there is a time crunch!
Nowadays, data scientists (DS) carry cost-saving and revenue targets, and the work pressure is tremendous. With data cleaning, wrangling, and feature engineering eating up a large portion of the time, a DS never has sufficient time to try all the relevant methodologies on the data.
Some experts in the industry do have a sense of which kind of classification algorithm will work on a specific type of data, but it is usually a difficult choice to make. A judicious decision involves too many considerations: data linearity, model training time, type of data (categorical vs. numerical), explainability, ease of ongoing monitoring, and of course accuracy.
There is surely a need to speed up the initial screening of models: shortlist the top "n" models and then fine-tune them to improve accuracy and other key metrics.
To shortlist a few methodologies, one can run the models one at a time and compare the various output metrics. However, that is inefficient and time-consuming, and there are certainly better ways of doing it.
Below, we discuss three effective ways to perform this initial shortlisting. Each methodology has its pros and cons, and depending on the situation one is in, it is worthwhile to choose the right one from the quiver.
1. LazyPredict: LazyPredict is a Python package that encapsulates around 30 classification models and 40 regression models. It automates the training pipeline and returns various metrics for each model. After comparing the metrics, one can fine-tune the hyperparameters of the shortlisted models to improve the output metrics.
Some key classification models trained are SVC, AdaBoost, Logistic Regression, ExtraTreeClassifier, Random Forest, Gaussian Naïve Bayes, XGBoost, Ridge Classifier, etc.
Python Code:
Assuming X_train, X_test, y_train, and y_test have already been defined and EDA has been performed:

pip install lazypredict

from lazypredict.Supervised import LazyClassifier

# Train ~30 classifiers on the train/test split and collect their metrics in one call
clf = LazyClassifier(ignore_warnings=False, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
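The fit call returns a pandas DataFrame of metrics (accuracy, balanced accuracy, F1 score, training time) with one row per model, so the strongest candidates can be surfaced with a simple sort. A minimal sketch (column names such as "Balanced Accuracy" follow lazypredict's output and may differ slightly across versions):

# models is a DataFrame indexed by model name; sort by the metric that matters for the problem
print(models.sort_values(by="Balanced Accuracy", ascending=False).head(10))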
2. Model List Iteration: Though this is an orthodox way of getting the job done, it provides the flexibility to choose the scoring methodology, create various metrics (like a confusion matrix), and perform cross-validation. The idea here is to create a list of candidate models and iterate through it while collecting the model metrics.
Python Code:
from pandas import read_csv
from matplotlib import pyplot
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Load the data; the first 20 columns are features, the last column is the target
dataframe = read_csv('classificationdata.csv')
array_a1 = dataframe.values
X = array_a1[:, 0:20]
Y = array_a1[:, 20]
# Create a list of (name, model) pairs to test
models_list = []
models_list.append(('LogitReg', LogisticRegression()))
models_list.append(('SVM', SVC()))
models_list.append(('KNN_Algo', KNeighborsClassifier()))
models_list.append(('LDA_Algo', LinearDiscriminantAnalysis()))
models_list.append(('CART_DT', DecisionTreeClassifier()))
model_results = []
model_names = []

# This could be 'f1', 'roc_auc', etc. depending on the problem statement
scoring_type = 'accuracy'

# Iterate through the model list and find the best model using the cross-validation score
for pseudoname, modelname in models_list:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    crossval_results = cross_val_score(modelname, X, Y, cv=kfold, scoring=scoring_type)
    model_results.append(crossval_results)
    model_names.append(pseudoname)
    print(pseudoname, crossval_results.mean(), crossval_results.std())
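Since the per-fold scores are retained in model_results, a quick box plot makes the comparison visual (and puts the pyplot import above to use). A minimal sketch:

# Compare the spread of cross-validation scores across the candidate models
pyplot.boxplot(model_results, labels=model_names)
pyplot.title('Algorithm Comparison (' + scoring_type + ')')
pyplot.show()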
3. AutoML Functions: Of the three, this is the most effective and most sought-after solution. There are a variety of Python packages that support AutoML, some key ones being auto-sklearn, AutoKeras, and TPOT. Note that these packages not only automate the model filtering but also speed up the data processing, feature engineering, etc., as sketched below.
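As an illustration, here is a minimal sketch using TPOT's classic API (the generations and population_size values are arbitrary small numbers chosen only to keep the search short; newer TPOT releases may differ):

pip install tpot

from tpot import TPOTClassifier

# Genetic-programming search over preprocessing + model pipelines
tpot = TPOTClassifier(generations=5, population_size=20, random_state=7, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('best_pipeline.py')  # writes the winning pipeline out as a Python script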
Conclusion: LazyPredict, model list iteration, and AutoML are three ways to train a list of models before shortlisting and fine-tuning them. With the unprecedented pace at which data science is evolving, I am positive there are many other ways of getting better models faster. (Please do share in the comments if you have come across any.)
However, it doesn’t hurt to always have options handy – you never know when that meeting with the business gets scheduled.
Disclaimer: The views expressed in this article are the opinion of the author in his personal capacity and not of his employer.