
Evaluating machine learning algorithms is a common task performed by data scientists. While a data scientist needs to know which types of machine learning algorithms suit which types of problems, it is nevertheless paramount to put the different algorithms to work on the specific dataset at hand. Only by doing that will you get a better sense of which algorithm to use to train the model and how to perform hyper-parameter tuning afterwards. However, choosing the right algorithm is a time-consuming and exhausting process. Ideally, there should be an automated process where you just need to supply your data and the ideal machine learning algorithm to use would be chosen for you.
The answer to this is LazyPredict. LazyPredict is a Python library that helps you partially automate the process of selecting the best algorithm for your dataset. By supplying your data, you let LazyPredict train models using more than 60 ML algorithms and present the results to you. From there, you can choose the best-performing ML algorithm to train or refine further on your dataset.
Selecting Machine Learning (ML) Models the Manual Way
To appreciate the beauty of LazyPredict, it is always good to understand how things are usually done manually. So for this section, I am going to make use of the diabetes dataset as an example and see how we can use it to evaluate several ML algorithms and choose the one that works best with it. For simplicity, we are going to use the following ML algorithms:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Support Vector Machines
Diabetes Dataset: https://www.kaggle.com/datasets/mathchi/diabetes-data-set. Licensing: CC0: Public Domain
Loading the data
The first step would be to load the diabetes.csv file into a Pandas DataFrame and then print out its details:
import numpy as np
import pandas as pd
df = pd.read_csv('diabetes.csv')
df.info()
Specifically, there are no NaN values in the dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Let’s take a look at the dataframe itself:
df
Observe that some columns have 0 values, such as Pregnancies, SkinThickness, Insulin, and Outcome:

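A quick way to confirm this at a glance (an optional extra step, not part of the original workflow) is to look at the summary statistics – several columns have a minimum value of 0:
#---summary statistics; a minimum of 0 for columns such as Glucose,
#   BloodPressure, SkinThickness, Insulin, and BMI hints at missing values---
df.describe()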
Cleaning the data
Since there are no NaN values in the dataframe, let’s now check to see which specific columns have 0 values in them:
#---check for 0s---
print(df.eq(0).sum())
From the output below, you can see that only the DiabetesPedigreeFunction and Age columns have no 0 values:
Pregnancies 111
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
DiabetesPedigreeFunction 0
Age 0
Outcome 500
dtype: int64
Of the columns that have 0 values in them, only Pregnancies and Outcome are allowed to have them – a 0 for Pregnancies simply means that the patient was never pregnant, and a 0 for Outcome means that the patient is not diabetic. For the other columns, a value of 0 is simply not logical – 0 skin thickness, really?
So let’s now replace the 0 values in these columns with something more meaningful. The first step is to replace the 0’s with NaN:
df[['Glucose','BloodPressure','SkinThickness',
    'Insulin','BMI']] = \
    df[['Glucose','BloodPressure','SkinThickness',
        'Insulin','BMI']].replace(0, np.nan)
df

Then, replace the NaNs with the mean of each column:
df.fillna(df.mean(), inplace = True) # replace the rest of the NaNs with the mean
You can now verify that all columns have no 0 values except Pregnancies and Outcome:
print(df.eq(0).sum())
Pregnancies 111
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 500
dtype: int64
Examining the Correlation Between the Features
While we have several features in the dataset, not all of them contribute towards the outcome. Hence it is useful to calculate the correlation factor of each column:
corr = df.corr()
print(corr)
Here is the outcome:
                          Pregnancies  Glucose  BloodPressure  SkinThickness
Pregnancies                      1.00     0.13           0.21           0.08
Glucose                          0.13     1.00           0.22           0.19
BloodPressure                    0.21     0.22           1.00           0.19
SkinThickness                    0.08     0.19           0.19           1.00
Insulin                          0.06     0.42           0.07           0.16
BMI                              0.02     0.23           0.28           0.54
DiabetesPedigreeFunction        -0.03     0.14          -0.00           0.10
Age                              0.54     0.27           0.32           0.13
Outcome                          0.22     0.49           0.17           0.22

                          Insulin   BMI  DiabetesPedigreeFunction   Age  Outcome
Pregnancies                  0.06  0.02                     -0.03  0.54     0.22
Glucose                      0.42  0.23                      0.14  0.27     0.49
BloodPressure                0.07  0.28                     -0.00  0.32     0.17
SkinThickness                0.16  0.54                      0.10  0.13     0.22
Insulin                      1.00  0.17                      0.10  0.14     0.21
BMI                          0.17  1.00                      0.15  0.03     0.31
DiabetesPedigreeFunction     0.10  0.15                      1.00  0.03     0.17
Age                          0.14  0.03                      0.03  1.00     0.24
Outcome                      0.21  0.31                      0.17  0.24     1.00
Plotting the Correlation Between Features
Visualizing the correlations between features using a heatmap makes understanding the numbers much easier:
%matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 10))
cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0, len(df.columns), 1)
ax.set_xticks(ticks)
ax.set_xticklabels(df.columns)
plt.xticks(rotation = 90)
ax.set_yticks(ticks)
ax.set_yticklabels(df.columns)
#---print the correlation factor in each cell---
for i in range(df.shape[1]):
    for j in range(df.shape[1]):
        text = ax.text(j, i, round(corr.iloc[i, j], 2),
                       ha="center", va="center", color="w")
plt.show()
Here is the heat map of the correlation factors. We are interested in seeing which features are highly correlated (either positively or negatively) with the Outcome, so we will look at the Outcome column and focus on the cells that are dark red (positively correlated) and dark blue (negatively correlated; none in this case):

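As a side note, if you have the seaborn library installed, you can produce an equivalent annotated heatmap with far less code (this is just an optional alternative to the matplotlib code above):
import seaborn as sns
import matplotlib.pyplot as plt
#---annotated correlation heatmap; an optional alternative to matshow()---
plt.figure(figsize=(10, 10))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.show()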
You can also find the top correlated features programmatically:
#---get the top four features that have the highest correlation---
print(df.corr().nlargest(4, 'Outcome').index)
#---print the top 4 correlation values---
print(df.corr().nlargest(4, 'Outcome').values[:,8])
You can see that the top 3 correlated features to Outcome are Glucose, BMI, and Age:
Index(['Outcome', 'Glucose', 'BMI', 'Age'], dtype='object')
[1. 0.49292767 0.31192439 0.23835598]
Evaluating the Machine Learning Algorithms
With the data cleaned, the next step would be to choose the different algorithms to train the model using your data.
Using LogisticRegression
Let’s first use logistic regression to train a model. We shall use cross validation to score the model:
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
#---features---
X = df[['Glucose','BMI','Age']]
#---label---
y = df.iloc[:,8]
log_regress = linear_model.LogisticRegression()
log_regress_score = cross_val_score(log_regress, X, y, cv=10, scoring='accuracy').mean()
print(log_regress_score)
For logistic regression, I obtained a score of 0.7669856459330144. I will append the result to a list so that later on we can do a comparison among all the other models:
result = []
result.append(log_regress_score)
Using K-Nearest Neighbors
Next up, we will use the K-Nearest Neighbors algorithm to train the model:
from sklearn.neighbors import KNeighborsClassifier
#---empty list that will hold the cross-validation scores---
cv_scores = []
#---number of folds---
folds = 10
#---creating an odd list of K values for KNN---
ks = list(range(1, int(len(X) * ((folds - 1)/folds)), 2))
#---perform k-fold cross validation---
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=folds, scoring='accuracy').mean()
    cv_scores.append(score)
#---get the maximum score---
knn_score = max(cv_scores)
#---find the optimal k that gives the highest score---
optimal_k = ks[cv_scores.index(knn_score)]
print(f"The optimal number of neighbors is {optimal_k}")
print(knn_score)
result.append(knn_score)
In the code above, we try different values of K and score each one. We then pick the highest score and print out the optimal value of K. Here is the result:
The optimal number of neighbors is 19
0.7721462747778537
Using Support Vector Machines
The final algorithm that we want to use is Support Vector Machines (SVM). We will try two kernels for SVM. Let’s try the linear kernel first:
from sklearn import svm
linear_svm = svm.SVC(kernel='linear')
linear_svm_score = cross_val_score(linear_svm, X, y,
                                   cv=10, scoring='accuracy').mean()
print(linear_svm_score)
result.append(linear_svm_score)
Followed by the rbf (Radial Basis Function) kernel:
rbf = svm.SVC(kernel='rbf')
rbf_score = cross_val_score(rbf, X, y, cv=10, scoring='accuracy').mean()
print(rbf_score)
result.append(rbf_score)
Selecting the Best Performing Algorithms
Now that we have trained the dataset using the different algorithms, we can collate all the results and display them for comparison:
algorithms = ["Logistic Regression", "K Nearest Neighbors", "SVM Linear Kernel", "SVM RBF Kernel"]
cv_mean = pd.DataFrame(result,index = algorithms)
cv_mean.columns=["Accuracy"]
cv_mean.sort_values(by="Accuracy",ascending=False)
As you can see from the figure below, KNN is the winner, but the others are not too far behind.

With this result, you now know that KNN is the best algorithm to use for this particular dataset.
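If you want to stop here, a minimal sketch of putting that finding to work is to refit KNN with the optimal k found above and use it on new data (the patient values below are made up purely for illustration):
from sklearn.neighbors import KNeighborsClassifier
#---refit KNN on the full dataset using the optimal k found earlier---
final_knn = KNeighborsClassifier(n_neighbors=optimal_k)
final_knn.fit(X, y)
#---predict the outcome for a hypothetical patient (Glucose, BMI, Age)---
new_patient = pd.DataFrame([[130, 28.5, 45]], columns=['Glucose', 'BMI', 'Age'])
print(final_knn.predict(new_patient))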
Using LazyPredict for Classification Problems
While we know from the previous result that KNN performs the best among the four algorithms we tried, the conclusion is not definitive. For all you know, there might be other algorithms that are more suitable for your dataset. This is where you can use LazyPredict to automatically train your dataset using the different algorithms available.
LazyPredict supports regression and classification algorithms.
As for the dataset, I am going to use the one that I cleaned earlier. You can use the raw data obtained from reading the CSV file, and LazyPredict will automatically preprocess it – it will replace missing values with the mean (for numeric columns) or a constant value (for categorical columns). It will then standardize your numeric columns and encode your categorical columns.
However, it is always better to perform the data preprocessing yourself, as you are the best person to understand your own data (as evidenced in our dataset, where 0 values are not acceptable for certain columns).
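For reference, the automatic preprocessing described above is roughly equivalent to a scikit-learn pipeline like the following sketch (this is only an approximation for illustration, not LazyPredict's actual internal code):
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
features = df.iloc[:, :8]   # the eight feature columns
#---numeric columns: impute missing values with the mean, then standardize---
numeric_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])
#---categorical columns: impute with a constant, then one-hot encode---
categorical_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])
#---our diabetes dataset has no categorical columns, so only the numeric branch applies---
preprocessor = ColumnTransformer([
    ('numeric', numeric_pipeline, features.select_dtypes(include='number').columns),
    ('categorical', categorical_pipeline, features.select_dtypes(exclude='number').columns),
])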
And so I am going to extract the features from the first 8 columns of the cleaned dataframe df and use the ninth column as the label:
#---features---
X = df.iloc[:,:8]
#---label---
y = df.iloc[:,8]
Next, install LazyPredict:
!pip install lazypredict
For classification problems, import the LazyClassifier class. You also need to import the other required modules:
import lazypredict
# for classification problem
from lazypredict.Supervised import LazyClassifier
# split dataset into training and testing sets
from sklearn.model_selection import train_test_split
Initialize the LazyClassifier class; in particular, set the predictions parameter to True:
clf = LazyClassifier(verbose=0, ignore_warnings=True,
                     custom_metric=None, predictions=True)
Split the dataset into 80% training and 20% test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state = 42)
You can now use the clf classifier to fit (train) your data using the various classification algorithms and predict the outcome:
scores, predictions = clf.fit(X_train, X_test, y_train, y_test)
scores
The scores variable is a dataframe that shows the various ML models and their respective metrics, such as accuracy, ROC AUC, F1 score, etc.:

You can see that the result is pretty close to our initial test, where K-Nearest Neighbors performs pretty well (it is the second-best performing model in this case). Of course, we now know that the ExtraTreesClassifier algorithm works even better.
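If you want to take the top-ranked model forward, you can simply refit it with scikit-learn and continue refining it from there. A minimal sketch (using default hyper-parameters, which may not match exactly what LazyPredict used):
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
#---refit the top-ranked model on the training set for further refinement---
best_model = ExtraTreesClassifier(random_state=42)
best_model.fit(X_train, y_train)
print(accuracy_score(y_test, best_model.predict(X_test)))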
The predictions variable is a dataframe containing the predicted values for each model used:
predictions

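As a quick sanity check, you can score each model's predicted labels against the test labels yourself; the numbers should line up with the accuracy figures in the scores dataframe:
from sklearn.metrics import accuracy_score
#---recompute test-set accuracy for each model from its predicted labels---
for model_name in predictions.columns:
    print(f"{model_name}: {accuracy_score(y_test, predictions[model_name]):.4f}")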
Using LazyPredict for Regression Problems
Before I end this article, let’s use LazyPredict to work on a regression problem. This time, we will make use of the Boston dataset that is shipped with the scikit-learn library (note that load_boston has been removed in scikit-learn 1.2 and later, so this example requires an older version). For regression problems, use the LazyRegressor class:
from lazypredict.Supervised import LazyRegressor
from sklearn.model_selection import train_test_split
from sklearn import datasets
# load the Boston dataset
data = datasets.load_boston()
X, y = data.data, data.target
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
clf = LazyRegressor(predictions=True)
# fit the data using different algorithms
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models
The result of the evaluation is here:

And the predicted value for each algorithm:
predictions

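As noted earlier, load_boston no longer exists in scikit-learn 1.2 and later. If you are on a recent version, you can run the same workflow on a different built-in regression dataset, for example the California housing dataset:
from lazypredict.Supervised import LazyRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
#---alternative dataset for scikit-learn >= 1.2, where load_boston has been removed---
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
reg = LazyRegressor(predictions=True)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)
models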
Summary
This article showed you how the process of selecting Machine Learning algorithms can be simplified using the LazyPredict library. Once you have identified the ideal algorithm to use, you should further refine your model by using hyper-parameter tuning. If you want a quick introduction to this topic, check out my earlier article:
Tuning the Hyperparameters of your Machine Learning Model using GridSearchCV
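As a quick taste of what that looks like, here is a minimal, illustrative GridSearchCV sketch for the KNN model from this article (the parameter grid is just an assumption to adjust for your own data):
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
#---diabetes features and label used earlier in this article---
X = df[['Glucose', 'BMI', 'Age']]
y = df['Outcome']
#---illustrative hyper-parameter grid for KNN; tune the ranges to your own dataset---
param_grid = {
    'n_neighbors': list(range(1, 30, 2)),
    'weights': ['uniform', 'distance'],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_)
print(grid.best_score_)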