
Using LazyPredict for Evaluating ML Algorithms

Automate the process of selecting the best machine learning algorithms using the LazyPredict library

Photo by Victoriano Izquierdo on Unsplash

Evaluating machine learning algorithms is a common task performed by data scientists. While a data scientist needs to know which types of machine learning algorithms suit which types of problems, it is still essential to put the different algorithms to work on the specific dataset at hand. Only by doing so will you get a better sense of which algorithm to use to train the model and how to perform hyperparameter tuning afterwards. However, choosing the right algorithm is a time-consuming and exhausting process. Ideally, there should be an automated process where you simply supply your data and the ideal machine learning algorithm to use is chosen for you.

The answer to this is LazyPredict. LazyPredict is a Python library that helps you partially automate the process of selecting the best algorithm to train your dataset. By supplying your data, LazyPredict trains a model with each of more than 60 ML algorithms and presents the end result to you. From there, you can choose the best-performing algorithm and further train or refine it using your dataset.

Selecting Machine Learning (ML) Models the Manual Way

To appreciate the beauty of LazyPredict, it is always good to understand how things are usually done manually. So for this section, I am going to make use of the diabetes dataset as an example and see how we can use it to evaluate several ML algorithms and choose the one that works best with it. For simplicity, we are going to use the following ML algorithms:

  • Logistic Regression
  • K-Nearest Neighbors (KNN)
  • Support Vector Machines

Diabetes Dataset: https://www.kaggle.com/datasets/mathchi/diabetes-data-set. Licensing: CC0: Public Domain

Loading the data

The first step would be to load the diabetes.csv file into a Pandas DataFrame and then print out its details:

import numpy as np
import pandas as pd

df = pd.read_csv('diabetes.csv')
df.info()

You can see from the output that there are no NaN values in the dataframe:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Let’s take a look at the dataframe itself:

df

Observe that some columns contain 0 values, such as Pregnancies, SkinThickness, Insulin, and Outcome:

All images by author

Cleaning the data

Since there are no NaN values in the dataframe, let’s now check to see which specific columns have 0 values in them:

#---check for 0s---
print(df.eq(0).sum())

From the output below, you can see that only the DiabetesPedigreeFunction and Age columns have no 0 values:

Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64

Of the columns that contain 0 values, only Pregnancies and Outcome are allowed to have them: a 0 for Pregnancies simply means that the patient has never been pregnant, and a 0 for Outcome means that the patient is not diabetic. For the remaining columns, a value of 0 is simply not logical (0 skin thickness, really?).

So let’s now replace the 0 values in these columns with more meaningful values. The first step is to replace the 0s with NaN:

df[['Glucose','BloodPressure','SkinThickness',
    'Insulin','BMI']] = \
    df[['Glucose','BloodPressure','SkinThickness',
        'Insulin','BMI']].replace(0, np.nan)
df

Then, replace the NaNs with the mean of each column:

df.fillna(df.mean(), inplace = True)   # replace the rest of the NaNs with the mean
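
As an aside, the same mean imputation can also be done with scikit-learn’s SimpleImputer, which is handy if you later want to fold this step into a pipeline. Here is a minimal sketch of that alternative (the column names follow the dataframe above):

from sklearn.impute import SimpleImputer

#---columns where 0 is not a valid measurement (already converted to NaN above)---
cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

#---mean imputation with scikit-learn; equivalent to the fillna() call above---
imputer = SimpleImputer(strategy='mean')     # missing_values defaults to NaN
df[cols] = imputer.fit_transform(df[cols])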

You can now verify that all columns have no 0 values except Pregnancies and Outcome:

print(df.eq(0).sum())
Pregnancies                 111
Glucose                       0
BloodPressure                 0
SkinThickness                 0
Insulin                       0
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64

Examining the Correlation Between the Features

While we have several features in the dataset, not all of them contribute towards the outcome. Hence it is useful to calculate the correlation factor of each column:

corr = df.corr()
print(corr)

Here is the outcome:

                          Pregnancies  Glucose  BloodPressure  SkinThickness  
Pregnancies                      1.00     0.13           0.21           0.08   
Glucose                          0.13     1.00           0.22           0.19   
BloodPressure                    0.21     0.22           1.00           0.19   
SkinThickness                    0.08     0.19           0.19           1.00   
Insulin                          0.06     0.42           0.07           0.16   
BMI                              0.02     0.23           0.28           0.54   
DiabetesPedigreeFunction        -0.03     0.14          -0.00           0.10   
Age                              0.54     0.27           0.32           0.13   
Outcome                          0.22     0.49           0.17           0.22   

                          Insulin  BMI  DiabetesPedigreeFunction  Age  Outcome  
Pregnancies                  0.06 0.02                     -0.03 0.54     0.22  
Glucose                      0.42 0.23                      0.14 0.27     0.49  
BloodPressure                0.07 0.28                     -0.00 0.32     0.17  
SkinThickness                0.16 0.54                      0.10 0.13     0.22  
Insulin                      1.00 0.17                      0.10 0.14     0.21  
BMI                          0.17 1.00                      0.15 0.03     0.31  
DiabetesPedigreeFunction     0.10 0.15                      1.00 0.03     0.17  
Age                          0.14 0.03                      0.03 1.00     0.24  
Outcome                      0.21 0.31                      0.17 0.24     1.00  

Plotting the Correlation Between Features

Visualizing the correlations between features using a heatmap makes understanding the numbers much easier:

%matplotlib inline
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 10))
cax     = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)

fig.colorbar(cax)
ticks = np.arange(0,len(df.columns),1)
ax.set_xticks(ticks)

ax.set_xticklabels(df.columns)
plt.xticks(rotation = 90)

ax.set_yticks(ticks)
ax.set_yticklabels(df.columns)

#---print the correlation factor in each cell---
for i in range(df.shape[1]):
    for j in range(df.shape[1]):
        text = ax.text(j, i, round(corr.iloc[i, j], 2),
                       ha="center", va="center", color="w")
plt.show()

Here is the heat map of the correlation factors. We are interested in which features are strongly correlated (either positively or negatively) with the Outcome, so we will look at the Outcome column and focus on the cells that are dark red (positively correlated) or dark blue (negatively correlated; there are none in this case):
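
If you have seaborn installed, the same annotated heatmap can be produced with far less code. Here is a sketch of that alternative (assuming seaborn is available in your environment):

import seaborn as sns
import matplotlib.pyplot as plt

#---annotated correlation heatmap in a couple of lines---
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1, ax=ax)
plt.show()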

You can also find the top correlated features programmatically:

#---get the top four features that have the highest correlation with Outcome---
print(df.corr().nlargest(4, 'Outcome').index)

#---print the top 4 correlation values---
print(df.corr().nlargest(4, 'Outcome').values[:,8])

You can see that the top three features correlated with Outcome (other than Outcome itself) are Glucose, BMI, and Age:

Index(['Outcome', 'Glucose', 'BMI', 'Age'], dtype='object')
[1.         0.49292767 0.31192439 0.23835598]
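
If you want just the three feature names without Outcome itself, a small variation on the query above does the trick (a sketch):

#---same query, but dropping Outcome itself from the result---
top_features = df.corr().nlargest(4, 'Outcome').index.drop('Outcome')
print(top_features)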

Evaluating the Machine Learning Algorithms

With the data cleaned, the next step is to train the model using the different algorithms and see how each performs on your data.

Using LogisticRegression

Let’s first use logistic regression to train a model. We shall use cross validation to score the model:

from sklearn import linear_model
from sklearn.model_selection import cross_val_score

#---features---
X = df[['Glucose','BMI','Age']]

#---label---
y = df.iloc[:,8]

log_regress = linear_model.LogisticRegression()
log_regress_score = cross_val_score(log_regress, X, y, cv=10, scoring='accuracy').mean()

print(log_regress_score)

For logistic regression, I obtained a score of 0.7669856459330144. I will append the result to a list so that later on we can compare it with the other models:

result = []
result.append(log_regress_score)

Using K-Nearest Neighbors

Next up, we will use the K-Nearest Neighbors algorithm to train the model:

from sklearn.neighbors import KNeighborsClassifier

#---empty list that will hold the cross-validated scores---
cv_scores = []

#---number of folds---
folds = 10

#---create a list of odd values of k to try for KNN---
ks = list(range(1, int(len(X) * ((folds - 1) / folds)), 2))

#---perform k-fold cross validation---
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=folds, scoring='accuracy').mean()
    cv_scores.append(score)

#---get the maximum score---
knn_score = max(cv_scores)

#---find the optimal k that gives the highest score---
optimal_k = ks[cv_scores.index(knn_score)]

print(f"The optimal number of neighbors is {optimal_k}")
print(knn_score)
result.append(knn_score)

The code above tries different values of k, scores each one, picks the highest score, and prints out the optimal value of k. Here is the result:

The optimal number of neighbors is 19
0.7721462747778537
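
To see how the accuracy varies with k (and to confirm that the chosen value is not a fluke), you can plot the scores collected above. Here is a minimal sketch:

import matplotlib.pyplot as plt

#---plot the cross-validated accuracy for every value of k tried above---
plt.figure(figsize=(10, 5))
plt.plot(ks, cv_scores)
plt.axvline(optimal_k, color='r', linestyle='--',
            label=f'optimal k = {optimal_k}')
plt.xlabel('Number of neighbors (k)')
plt.ylabel('Cross-validated accuracy')
plt.legend()
plt.show()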

Using Support Vector Machines

The final algorithm that we want to use is Support Vector Machines (SVM). There are two kernels that we will try for SVM. Let’s try out the linear kernel first:

from sklearn import svm

linear_svm = svm.SVC(kernel='linear')
linear_svm_score = cross_val_score(linear_svm, X, y,
                                   cv=10, scoring='accuracy').mean()
print(linear_svm_score)
result.append(linear_svm_score)

Followed by the rbf (Radial Basis Function) kernel:

rbf = svm.SVC(kernel='rbf')
rbf_score = cross_val_score(rbf, X, y, cv=10, scoring='accuracy').mean()
print(rbf_score)
result.append(rbf_score)

Selecting the Best Performing Algorithms

Now that we have trained models on the dataset using the different algorithms, we can collate all the results and display them for comparison:

algorithms = ["Logistic Regression", "K Nearest Neighbors", "SVM Linear Kernel", "SVM RBF Kernel"]
cv_mean = pd.DataFrame(result,index = algorithms)
cv_mean.columns=["Accuracy"]
cv_mean.sort_values(by="Accuracy",ascending=False)

As you can see from the figure below, KNN is the winner, but the others are not far behind either.

With this result, you now know that KNN is the best algorithm to use for this particular dataset.
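
If you prefer a chart to a table, a quick bar plot of the same cv_mean dataframe makes the comparison easier to read. Here is a sketch:

import matplotlib.pyplot as plt

#---horizontal bar chart of the mean cross-validated accuracy per algorithm---
cv_mean.sort_values(by="Accuracy").plot.barh(legend=False)
plt.xlabel("Mean cross-validated accuracy")
plt.show()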

Using LazyPredict for Classification Problems

While we know from the previous result that KNN performs the best among the four algorithms we used, the conclusion is not definitive. For all you know, there might be other algorithms that are more suitable for your dataset. This is where you can use LazyPredict to automatically train your dataset using the different algorithms available.

LazyPredict supports regression and classification algorithms.

As for the dataset, I am going to use the one that I cleaned earlier. You could also use the raw data obtained from reading the CSV file, and LazyPredict would automatically preprocess it: missing values are replaced with the mean (for numeric columns) or a constant value (for categorical columns), after which numeric columns are standardized and categorical columns are encoded.

However, it is always better to perform the data preprocessing yourself, as you are the best person to understand your own data (as is evident in our dataset, where 0 values are not acceptable for certain columns).
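
For reference, LazyPredict’s built-in preprocessing amounts to roughly the following scikit-learn pipeline. This is only a sketch of the general idea, not LazyPredict’s exact internals:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

#---the eight feature columns of the cleaned dataframe---
features = df.iloc[:, :8]
numeric_columns = features.select_dtypes(include='number').columns
categorical_columns = features.select_dtypes(exclude='number').columns

#---numeric columns: mean imputation followed by standardization---
numeric_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

#---categorical columns: constant-value imputation followed by one-hot encoding---
categorical_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_columns),
    ('cat', categorical_pipeline, categorical_columns),
])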

And so I am going to use the first eight columns of the cleaned dataframe df as the features and the ninth column as the label:

#---features---
X = df.iloc[:,:8]

#---label---
y = df.iloc[:,8]

Next, install LazyPredict:

!pip install lazypredict

For classification problems, import the LazyClassifier class. You will also need to import the other required modules:

import lazypredict

# for classification problem
from lazypredict.Supervised import LazyClassifier

# split dataset into training and testing sets
from sklearn.model_selection import train_test_split

Initialize the LazyClassifier class; in particular, set the predictions parameter to True:

clf = LazyClassifier(verbose=0, ignore_warnings=True, 
                     custom_metric = None, predictions=True)

Split the dataset into an 80% training set and a 20% test set:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state = 42)

You can now use the clf classifier to fit (train) your data using the various classification algorithms and predict the outcome:

scores, predictions = clf.fit(X_train, X_test, y_train, y_test)
scores

The scores variable is a dataframe that shows the various ML models and their respective metrics such as accuracy, ROC AUC, F1 Score, etc:

You can see that the result is pretty close to our initial test: K-Nearest Neighbors performs well (it is the second-best performing model in this case). Of course, we now know that the ExtraTreesClassifier algorithm works even better.
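
Rather than eyeballing the table, you can also pick out the best model programmatically. Here is a sketch that assumes the scores dataframe exposes an 'Accuracy' column (as in recent versions of LazyPredict):

#---the scores dataframe is indexed by model name---
best_model_name = scores.sort_values(by='Accuracy', ascending=False).index[0]
print(best_model_name)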

The predictions variable is a dataframe containing the predicted value for each model used:

predictions

Using LazyPredict for Regression Problems

Before I end this article, let’s use LazyPredict to work on a regression problem. This time, we will make use of the Boston dataset that ships with the scikit-learn library (note that load_boston was removed in scikit-learn 1.2, so this example requires an older version of the library). For regression problems, use the LazyRegressor class:

from lazypredict.Supervised import LazyRegressor
from sklearn.model_selection import train_test_split
from sklearn import datasets

# load the Boston dataset
data = datasets.load_boston()
X, y = data.data, data.target

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

clf = LazyRegressor(predictions=True)

# fit the data using different algorithms
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models

Here is the result of the evaluation:

And the predicted value for each algorithm:

predictions
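
As with the classification case, you can rank the regressors programmatically. Here is a sketch that assumes the models dataframe exposes an 'RMSE' column (as in recent versions of LazyPredict):

#---rank the regressors by RMSE (lower is better)---
print(models.sort_values(by='RMSE').head())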


Summary

This article showed you how the process of selecting machine learning algorithms can be simplified using the LazyPredict library. Once you have identified the ideal algorithm, you should further refine your model using hyperparameter tuning. If you want a quick introduction to this topic, check out my earlier article:

Tuning the Hyperparameters of your Machine Learning Model using GridSearchCV

