Machine learning made easier with PyCaret

How PyCaret works

Phani Rohith
Towards Data Science


We often find ourselves in scenarios where we are under a time crunch to finish our tasks. In machine learning, a library that is very useful in such scenarios is PyCaret.

PyCaret is an open-source Python library that is extremely useful for several machine learning tasks. It can help you all the way from data preprocessing to model deployment. What makes PyCaret so useful and convenient is that almost anything can be achieved in very few lines of code, and the code is very simple to understand. We can concentrate more on performing experiments on the data rather than writing several lines of code. Apart from helping with data preprocessing and encoding categorical features, PyCaret also gives an understanding of which model is better by taking performance metrics such as Accuracy, F1 score, and Recall into consideration.

Photo by Prateek Katyal on Unsplash

Let’s dive into the working of PyCaret.

1. Installation:

The installation of PyCaret is very easy. It is just like any other Python library and can be installed by running the following command in your command line:

pip install pycaret

If you are using Google Colab, then PyCaret can be installed by using:

!pip install pycaret

Installing PyCaret will automatically install all of its dependencies for you.

You don't have to worry if you're unaware of these dependencies, as PyCaret takes care of them.

Once you have installed PyCaret, you can import it into your Jupyter notebook using:

import pycaret

2. Getting the data ready

Once PyCaret is imported, we must get the data ready for building our models. We can load the data in the following two ways:

1. Pandas Dataframe

2. Data from PyCaret’s Repository

Let us first discuss how the data can be loaded using a pandas dataframe. PyCaret supports pandas dataframes, and data can be loaded easily by using read_csv() with the path of the file.

import pandas as pd
data = pd.read_csv("data/train.csv")
data.head()

In a similar way, we can load all other types of data that pandas support.

The other way to load the data is by using PyCaret’s repository. It consists of datasets that can be imported very easily. If you want to know the datasets present in the PyCaret repository click here.

from pycaret.datasets import get_data
nba = get_data('nba')

3. Environment Setup

Before setting up the environment, we must import the module appropriate for our dataset. PyCaret supports the following six modules, any of which can be imported with a single line:

— Classification (pycaret.classification)

— Regression (pycaret.regression)

— Clustering (pycaret.clustering)

— Anomaly Detection (pycaret.anomaly)

— Natural Language Processing (pycaret.nlp)

— Association Rule Mining (pycaret.arules)

The core of the PyCaret environment lies in a function named setup().

The setup() function initializes the environment and creates the pipeline that prepares the data for modeling and deployment. It must be called before any other PyCaret function is executed.

The setup() function takes around 50 parameters, but we don't have to worry since most of them are optional and have sensible defaults. Only two parameters are mandatory: the dataframe {array-like, sparse matrix} and the name of the target column.

The setup() function automatically performs data preprocessing and data sampling in the background. It operates on default parameters, but these parameters can be changed according to one's requirements.

In our example, we take a dataset named “nba”, where the target variable is “TARGET_5Yrs”; it is a binary classification problem. Hence we import the classification module using:

from pycaret.classification import *
pycar = setup(nba, target = 'TARGET_5Yrs')

Once you run the above command, the classification experiment is set up and you will see the message “setup successfully completed!” together with a grid of parameters, each with its corresponding Description and Value.

Let’s look into the functionality of the setup() function now:

The setup() function performs data preprocessing on the input dataframe. In any ML project, data preprocessing plays a vital role in building the model. In the setup() function, PyCaret prepares the data for machine learning with more than 20 preprocessing features, and the machine learning pipeline is built based on the parameters defined there. Now I will explain in detail the preprocessing steps involved in the setup() function. Don’t panic if you are not familiar with these, as our friend PyCaret handles them for us.

The preprocessing features which PyCaret handles for you are:

— Sampling and Split

— Scale and Transform

— Data Preparation

— Feature Engineering

— Feature Selection

— Unsupervised

Let me elaborate on the preprocessing steps that PyCaret can perform.

  • Sampling and Split

(i) Train Test Split:

Any dataset in machine learning is split into a train and a test dataset. This is done because it is important to know how the machine learning model performs when unseen data is given. In PyCaret, 70% of the data belongs to the training dataset and 30% to the testing dataset by default.

from pycaret.classification import *
reg1 = setup(data = nba, target = 'TARGET_5Yrs')

Output:

However, these dataset sizes can be varied by just passing the train_size parameter in the function.

Parameter: train_size

from pycaret.classification import *
reg1 = setup(data = nba, target = 'TARGET_5Yrs', train_size = 0.6)

Output (data after splitting, with train_size = 0.6):

(ii) Sampling:

If the dataset is large, i.e., if it exceeds 25,000 samples, then sampling is done automatically by PyCaret. A base estimator is built with various sample sizes and a plot is produced showing a performance metric for each sample size; the desired sample size can then be entered in a text box. sampling is a Boolean parameter and its default value is True.

Sampling example:

from pycaret.datasets import get_data
income = get_data('income')
from pycaret.classification import *
model = setup(data = income, target = 'income >50K')

This functionality is only available in pycaret.classification and pycaret.regression modules.

  • Data Preparation

Having the right data is very important when building a machine learning pipeline. The data is often corrupted, has missing values, or contains features that need to be encoded. All of these play a vital role in building the model and need to be addressed before the data is used.

(i) Missing value Imputation:

It is very common to have missing records in the data, and this cannot be handled by machine learning algorithms on their own. PyCaret does missing value imputation automatically. The default imputation technique for numerical features is “mean” and the default for categorical features is “constant”. The corresponding parameters in the setup() function are:

Parameters:

numeric_imputation: string, default = ‘mean’

categorical_imputation: string, default = ‘constant’

These parameters can be changed according to the problem by just giving the parameter in the setup() function.

#import the hepatitis dataset from PyCaret repository
from pycaret.datasets import get_data
hepatitis = get_data('hepatitis')

Before initiating setup( ):

After initiating setup( ):
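If a different imputation strategy is preferred, the defaults can be overridden directly in setup(). A minimal sketch (assuming the target column of the hepatitis dataset is named 'Class'; 'median' and 'mode' are the alternative strategies these parameters accept):

from pycaret.datasets import get_data
from pycaret.classification import *
hepatitis = get_data('hepatitis')
#'Class' as the target column is an assumption here; replace it with your label column
model = setup(data = hepatitis, target = 'Class', numeric_imputation = 'median', categorical_imputation = 'mode')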

(ii) Changing data types:

PyCaret automatically detects the data types of the features present in the dataset. These inferred types can occasionally be wrong, and the problem can be solved by passing parameters such as:

Parameters:

numeric_features = [‘column_name’]

categorical_features = [‘column_name’] or date_features = ‘date_column_name’

ignore_features = [‘column_name’]

These parameters can be used to overwrite the data types that PyCaret detected. The ignore_features parameter can be used when we do not want a particular feature to be taken into consideration at all.

Example: If the feature “GP” is categorical but PyCaret interprets it as numerical, then this can be overwritten.

Code Snippet ( From Numerical to Categorical):

from pycaret.classification import *
pycar = setup(nba, target = 'TARGET_5Yrs', categorical_features = ['GP'])

Output:

Example for ignore_features:

Here we will ignore the MIN column of the data frame.

Code Snippet (Ignoring ‘MIN’ column from the dataset):

from pycaret.classification import *
pycar = setup(nba, target = 'TARGET_5Yrs', ignore_features = ['MIN'])

Output (‘MIN’ column has been ignored):

(iii) One hot encoding:

Categorical features cannot be used directly by most machine learning algorithms; they have to be encoded, for example with one-hot encoding. PyCaret automatically one-hot encodes the categorical features.

Example:

Here, Name is a categorical feature, hence PyCaret encodes it.

Output (‘Name’ column has been one-hot encoded):

(iv) Ordinal Encoding:

Categorical features with an inherent order, such as “Bad, Good, Excellent”, should be encoded differently from other categorical features. PyCaret does this through the following parameter:

Parameters: ordinal_features: dictionary

The default value is none.

Code:

from pycaret.datasets import get_data
emp = get_data('employee')
from pycaret.classification import *
model = setup(data = emp, target = 'left', ordinal_features = {'salary' : ['low', 'medium', 'high']})

Output:

model[0]

(v) Cardinal Encoding:

With one-hot encoding, features such as Zip codes or Countries can produce very large sparse vectors. We can use cardinal encoding to get around this problem. PyCaret has a parameter in setup() that does cardinal encoding for you:

Parameter: high_cardinality_features: string, default = None

Code:

from pycaret.datasets import get_data
inc = get_data('income')
from pycaret.classification import *
model = setup(data = inc, target = 'income >50K', high_cardinality_features = ['native-country'])

Output:

(vi) Handle unknown levels:

We often run into situations where the test data has levels that were not present in the training data. PyCaret handles this automatically by replacing them with the ‘most frequent’ or ‘least frequent’ level of that feature. The corresponding parameters in the setup() function are:

Parameters:

handle_unknown_categorical: bool, default = True

unknown_categorical_method: string, default = ‘least_frequent’
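A minimal sketch of switching the replacement strategy to the most frequent level, reusing the nba dataset from earlier:

from pycaret.datasets import get_data
from pycaret.classification import *
nba = get_data('nba')
#Replace unseen categorical levels in the test split with the most frequent level of that feature
pycar = setup(nba, target = 'TARGET_5Yrs', handle_unknown_categorical = True, unknown_categorical_method = 'most_frequent')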

  • Scale and Transform

Scaling and transformation are very important because the features may have widely different variances or be on entirely different scales.

(i) Normalize:

Normalizing is an essential preprocessing step that makes sure the numerical values are not too widely spread. PyCaret performs normalization when the parameter normalize is set to True. There are several ways to normalize data. The default value of the parameter normalize_method is ‘zscore’, which rescales each feature so that its mean is 0 and its standard deviation is 1. The other options are ‘minmax’ (rescales each feature to the 0 to 1 range), ‘maxabs’ (makes sure the maximum absolute value of each feature is 1), and ‘robust’ (normalizes according to the interquartile range; better when there are outliers).

Parameters: normalize, normalize_method

Example for normalize (Using default normalize_method: Z-score):

Code:

from pycaret.classification import *
pycar = setup(nba, target = 'TARGET_5Yrs', normalize = True)

Output:

Example for normalize_method:

Code (Using ‘minmax’ method for normalization):

from pycaret.classification import *
pycar = setup(nba, target = 'TARGET_5Yrs', normalize = True, normalize_method = 'minmax')

Output:

pycar[0]

(ii) Transformation:

Transformation is used to transform the data into a Gaussian or approximately Gaussian distribution. PyCaret applies the transformation when the parameter transformation is set to True. The default value of the parameter transformation_method is ‘yeo-johnson’; the other available value is ‘quantile’.

Parameters: transformation, transformation_method

Example:

from pycaret.classification import *
pycar = setup(nba, target = 'TARGET_5Yrs', transformation = True)
pycar[0]

Output:

  • Feature Engineering

Feature engineering is the creative side of machine learning. It transforms the data into another space by combining existing features, for example through multiplication, trigonometric functions, logarithmic functions, etc.

(i) Feature Interaction:

PyCaret allows the creation of new features from existing ones: two features can be multiplied or divided by each other to form new features. The parameters used are feature_interaction (multiplication) and feature_ratio (division). Both are set to False by default and can be changed in the setup() function in order to obtain feature interactions.

Parameters:

feature_interaction: bool, default = False

feature_ratio: bool, default = False

interaction_threshold: bool, default = 0.01

Example:

Importing ‘blood’ Dataset from PyCaret repository

Code:

from pycaret.datasets import get_data
data = get_data('blood')
from pycaret.classification import *
model = setup(data, target = 'Class', feature_interaction = True, feature_ratio = True)
model[0]

Output:

(ii) Polynomial Features:

Just like feature interaction, new features are created here by raising existing features to a polynomial degree (e.g., a²). The parameters used are polynomial_features, which is set to False by default, and polynomial_degree, an integer whose value is 2 by default. These parameters can be changed in the setup() function in order to obtain polynomial features.

Parameters:

polynomial_features: bool, default = False

polynomial_degree: int, default = 2

polynomial_threshold: float, default = 0.1

Example:

Importing ‘blood’ Dataset from PyCaret repository

Code:

from pycaret.datasets import get_data
data = get_data('blood')
from pycaret.classification import *
model = setup(data, target = 'Class', polynomial_features = True)
model[0]

Output:

(iii) Trigonometry features:

This is very similar to polynomial features. The parameter used is trigonometry_features, which is set to False by default and can be changed in the setup() function to obtain trigonometric features.

Parameter: trigonometry_features: bool, default = False
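A minimal sketch of enabling it on the nba dataset used earlier:

from pycaret.datasets import get_data
from pycaret.classification import *
nba = get_data('nba')
#Derive trigonometric (sin/cos/tan-based) features from the numeric columns
pycar = setup(nba, target = 'TARGET_5Yrs', trigonometry_features = True)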

(iv) Group features:

When features are related to each other, they can be grouped by using the group_features parameter in the setup() function. Aggregate statistics such as the mean and median are then extracted for each group. A list of related features (or a list of lists) is passed in the parameter group_features.

Parameters:

group_features: list or list of list, default = None

group_names: list, default = None
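A minimal sketch that groups two related playing-time columns of the nba dataset (GP and MIN appear earlier in this article; the group name is an arbitrary label):

from pycaret.datasets import get_data
from pycaret.classification import *
nba = get_data('nba')
#Treat GP and MIN as one group and extract aggregate statistics for it
pycar = setup(nba, target = 'TARGET_5Yrs', group_features = [['GP', 'MIN']], group_names = ['playing_time'])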

(v) Bin Numeric Features:

Sometimes continuous features have a wide range of values. In such cases we use feature binning. The parameter used in the setup() function is bin_numeric_features, which bins the listed numeric features.

Parameter: bin_numeric_features: list, default = None
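A minimal sketch that bins one numeric column of the nba dataset:

from pycaret.datasets import get_data
from pycaret.classification import *
nba = get_data('nba')
#Discretize the GP column into bins instead of using its raw values
pycar = setup(nba, target = 'TARGET_5Yrs', bin_numeric_features = ['GP'])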

(vi) Combine Rare levels:

Earlier we saw that one-hot encoding features like countries generates a sparse matrix, and the computational time of the model increases as the number of features increases. In such cases, the rare levels of high-cardinality features can be combined into a single level.

Parameters:

combine_rare_levels: bool, default = False

rare_level_threshold: float, default = 0.1
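A minimal sketch on the income dataset used earlier, where 'native-country' has many rare levels:

from pycaret.datasets import get_data
from pycaret.classification import *
inc = get_data('income')
#Collapse levels that appear in less than 10% of the rows into a single level
model = setup(data = inc, target = 'income >50K', combine_rare_levels = True, rare_level_threshold = 0.1)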

  • Feature Selection

It is very important to select good and useful features because it helps with the interpretation of the models.

(i) Feature Importance:

Feature importance is used to determine the features that matter most for predicting the target variable. The parameter used in the setup() function is feature_selection, which is False by default. Another parameter, feature_selection_threshold, should be tuned especially when polynomial features or feature interaction are used; its default value is 0.8.

Parameters:

feature_selection: bool, default = False

feature_selection_threshold: float, default = 0.8
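A minimal sketch of enabling feature selection on the nba dataset:

from pycaret.datasets import get_data
from pycaret.classification import *
nba = get_data('nba')
#Keep only the subset of features that PyCaret scores as most important
pycar = setup(nba, target = 'TARGET_5Yrs', feature_selection = True, feature_selection_threshold = 0.8)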

(ii) Remove multicollinearity:

Multicollinearity exists when one feature is highly correlated with another, which leads to unstable models. It can be removed by using the parameter remove_multicollinearity, which is set to False by default in the setup() function. The threshold above which correlated features are dropped is set with the parameter multicollinearity_threshold, which is 0.9 by default.

Parameters:

remove_multicollinearity: bool, default = False

multicollinearity_threshold: float, default = 0.9
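A quick sketch on the nba dataset:

from pycaret.datasets import get_data
from pycaret.classification import *
nba = get_data('nba')
#Drop one feature from every pair whose correlation exceeds the threshold
pycar = setup(nba, target = 'TARGET_5Yrs', remove_multicollinearity = True, multicollinearity_threshold = 0.9)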

(iii) Principal Component Analysis:

This is mainly used for dimensionality reduction when the dataset has a large number of dimensions, though some information is lost when PCA is applied. The parameter pca_method has a default value of ‘linear’; the other methods are ‘kernel’ (which uses an RBF kernel) and ‘incremental’. The parameter pca_components can take an int or a float value: an integer specifies the number of components to keep, while a float specifies the percentage of information to be retained.

Parameters:

pca: bool, default = False

pca_method: string, default = ‘linear’

pca_components: int/float, default = 0.99
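A minimal sketch on the nba dataset (0.95 is an arbitrary choice here, meaning roughly 95% of the information is retained):

from pycaret.datasets import get_data
from pycaret.classification import *
nba = get_data('nba')
#Reduce the feature space while retaining about 95% of the variance
pycar = setup(nba, target = 'TARGET_5Yrs', pca = True, pca_method = 'linear', pca_components = 0.95)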

(iv) Ignore low variance:

When a multi-level categorical feature has a skewed distribution in which one or two levels dominate all the others, the feature carries very little variance and can be ignored.

Before ignoring a feature the below criteria should be met (Reference) :

— Count of unique values in a feature / sample size < 10%

— Count of most common value / Count of second most common value > 20 times.

Parameters: ignore_low_variance: bool, default = False
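A quick sketch on the nba dataset:

from pycaret.datasets import get_data
from pycaret.classification import *
nba = get_data('nba')
#Drop categorical features whose distribution is too skewed to be informative
pycar = setup(nba, target = 'TARGET_5Yrs', ignore_low_variance = True)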

  • Unsupervised

(i) Create clusters:

Clusters are very important in unsupervised learning. When create_clusters is set to True, each point in the dataset is assigned to a cluster and the cluster label is used as a new feature. The cluster_iter parameter controls the number of iterations used to form the clusters.

Parameters:

create_clusters: bool, default = False

cluster_iter: int, default = 20
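A quick sketch on the nba dataset:

from pycaret.datasets import get_data
from pycaret.classification import *
nba = get_data('nba')
#Assign each row to a cluster and append the cluster label as a new feature
pycar = setup(nba, target = 'TARGET_5Yrs', create_clusters = True, cluster_iter = 20)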

(ii) Remove outliers:

Outliers can affect the performance of a model and hence need to be removed. PyCaret removes outliers using PCA (via SVD). Outliers are removed by setting the parameter remove_outliers to True. The percentage of rows treated as outliers is controlled by the parameter outliers_threshold, whose default value is 0.05.

Parameters:

remove_outliers: bool, default = False

outliers_threshold: float, default = 0.05
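A quick sketch on the nba dataset:

from pycaret.datasets import get_data
from pycaret.classification import *
nba = get_data('nba')
#Remove roughly the most extreme 5% of training rows before modeling
pycar = setup(nba, target = 'TARGET_5Yrs', remove_outliers = True, outliers_threshold = 0.05)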

setup() accepts all of the input parameters below, but remember that only two of them are mandatory, i.e., the data and the target; all the other values are optional and set to defaults.

setup(data, target, train_size = 0.7, sampling = True, sample_estimator = None, categorical_features = None, categorical_imputation = 'constant', ordinal_features = None, high_cardinality_features = None, high_cardinality_method = 'frequency', numeric_features = None, numeric_imputation = 'mean', date_features = None, ignore_features = None, normalize = False, normalize_method = 'zscore', transformation = False, transformation_method = 'yeo-johnson', handle_unknown_categorical = True, unknown_categorical_method = 'least_frequent', pca = False, pca_method = 'linear', pca_components = None, ignore_low_variance = False, combine_rare_levels = False, rare_level_threshold = 0.10, bin_numeric_features = None, remove_outliers = False, outliers_threshold = 0.05, remove_multicollinearity = False, multicollinearity_threshold = 0.9, create_clusters = False, cluster_iter = 20, polynomial_features = False, polynomial_degree = 2, trigonometry_features = False, polynomial_threshold = 0.1, group_features = None, group_names = None, feature_selection = False, feature_selection_threshold = 0.8, feature_interaction = False, feature_ratio = False, interaction_threshold = 0.01, session_id = None, silent = False, profile = False)

In short, you can just use the below code instead of giving all those parameters.

setup(data,target)

4. Comparing the models

One of the main uses of PyCaret is comparing several machine learning models based on performance metrics so that the best model can be determined. The models are evaluated using 10-fold cross-validation. This is very useful for understanding how the models behave and which model is better suited. The code to compare the models is very simple.

compare_models()

Result: The function returns a score grid that specifies the best models for each of the performance metrics.

The performance metrics used for classification are Accuracy, AUC, Recall, Precision, F1, Kappa.

The performance metrics used for regression are MAE, MSE, RMSE, R2, RMSLE, MAPE

The table is sorted by Accuracy by default, but this can be changed by passing a different metric to the sort parameter. The fold value is 10 by default and can also be changed according to the problem we are solving.
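For example, a quick sketch (assuming setup() has already been run as in step 3) that sorts the grid by Recall and uses 5-fold cross-validation:

#Sort the comparison grid by Recall and use 5 folds instead of the default 10
compare_models(sort = 'Recall', fold = 5)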

5. Creating a model

Once we get an understanding of which model is better, it is time to create a model. The code for creating a model is extremely simple.

create_model('model name')

For example, the following creates a k-NN model; the output is the trained model along with a score grid that shows the Accuracy, AUC, Recall, Precision, F1, and Kappa values.

knn_model = create_model('knn')

Result:

I have built a k-NN model by passing the string 'knn' to the create_model() function. In a similar way, you can build other models by using any of the representations below in create_model('model name').

By default, the model is created using a 10 fold CV. Instead, we can change it by using the fold parameter.

Create Model (using 7 fold CV):

knn_model = create_model('knn', fold = 7)

Output: The resultant knn_model obtained will be trained on 7 fold cross-validation.

Create Model (Round to 2 decimal points):

We can round off the performance metrics using the round parameter in the create_model() function.

knn_model = create_model('knn', round = 2)

Result: The metrics in the score grid will be rounded off to 2 digits.

6. Tune a Model

As the name says, we can tune a model using the tune_model() function. While create_model() builds a model with the default hyperparameters, tune_model() tunes the hyperparameters of a model on its own and produces a score grid as output.

Before Tuning:

knn_model = create_model('knn')

After Tuning:

tuned_knn = tune_model('knn')

Tune Model (using the optimize parameter):

The default optimization metric is Accuracy, but we can change it by using the optimize parameter of the tune_model() function.

tuned_knn = tune_model('knn', optimize = 'AUC')

In a similar way, we can use other optimization measures like ‘Recall’, ‘Precision’, ‘F1’.

Result: The performance metrics show an improvement in their scores.

7. Ensemble a Model

PyCaret also performs ensembling of the models. As ensembling increases the performance of the models (in most cases) we can ensemble our model using bagging, boosting, blending, and stacking in PyCaret.

ensemble_model(model_name)

Example: For this example, we will build a simple decision tree and perform ensembling on it.

Creating a simple decision tree:

Code:

dt = create_model('dt')

After Ensembling:

Code:

bag_dt = ensemble_model(dt)

We can see a significant difference after ensembling. Bagging is the default technique.

In a similar way, we can perform Boosting, Blending, and Stacking. Click here for more info about them.
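As a hedged sketch (reusing the dt model created above): boosting is requested through the method parameter of ensemble_model(), while blend_models() and stack_models() are the corresponding functions for blending and stacking.

#Boosting instead of the default bagging technique
boost_dt = ensemble_model(dt, method = 'Boosting')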

8. Plot and Evaluate the model:

After a model is created, it is very easy to plot the performance metrics of the model and analyze it. Different types of visualizations can be done using plot_model such as AUC, precision-recall curve, decision boundary, etc.

plot_model(model_name, plot = 'type')

Code for Plotting:

logreg = create_model('lr')
plot_model(logreg, plot = 'boundary')

We have used plot = 'boundary' in our code, which produces the decision boundary plot. In a similar way, we can create other plots using their string identifiers. The table below lists the plot types supported by PyCaret.

Also, for models that output probabilities, we can calibrate the predicted probabilities with the help of calibrated classifiers. Well-calibrated probability values increase interpretability and reduce uncertainty.

calibrate_model(model_name)

Code:

#Create a simple decision tree
dt = create_model('dt')
#Calibrate the model
calib_dt = calibrate_model(dt)

Apart from this, there is an extremely useful function, evaluate_model(model), which displays all the visualizations. It works only in the Jupyter notebook interface and provides an interactive user interface in which we can select the type of visualization we need.

Code:

evaluate_model(logreg)


These visualizations are different for different machine learning modules. Click here to know more about visualizations.

9. Interpreting the Model

Interpretation of the model is also possible in PyCaret. Feature importance is computed using SHAP values. The SHAP summary plot has an x-axis and a y-axis: the x-axis shows the SHAP value, i.e., the impact of a feature on the prediction in the positive or negative direction, while the y-axis lists the features, with the color of each point indicating the feature's value.

model = create_model('xgboost')
interpret_model(model)

The SHAP value mainly indicates how important a feature is for predicting the class label. Red points on the right-hand side mean that high values of that feature push the prediction toward the positive class.

10. Predicting the Model

All the results so far are based on k-fold cross-validation on the training dataset. Now we will evaluate the model's performance on the test (hold-out) dataset.

Code:

rf = create_model('rf')  #'rf' builds the random forest model whose hold-out predictions we evaluate
rf_holdout_pred = predict_model(rf)

Output after creating the model:

Output after predicting the model on the test dataset:

11. Saving the model

PyCaret enables us to save the entire model pipeline into a binary pickle file using save_model(model, model_name). Once the model is saved, we can load it whenever needed using load_model(). We will save the xgboost model that was created in step 9 under the name 'pycaret_model'.

Code:

save_model(model, 'pycaret_model')

We can simply load this saved model by using the load_model() function.

Code:

load_saved_model = load_model('pycaret_model')
#Now our previous model is loaded into load_saved_model and is ready to predict/classify.

Instead of saving only the model, the entire experiment can also be saved by using save_experiment(experiment_name = 'pycaret_experiment'). By doing this you save all the models and their outputs.

save_experiment(experiment_name = 'pycaret_experiment')

The saved experiment can be loaded back in the same way:

load_saved_exp = load_experiment('pycaret_experiment')

There it is: your final trained model or experiment can be reloaded and used with a single line of code. There is also a feature for deploying the built model on AWS, so we can build a whole machine learning pipeline using very few lines of code.
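As a hedged sketch of what that deployment can look like via deploy_model() (the S3 bucket name below is a placeholder, and valid AWS credentials are assumed to be configured beforehand):

#Deploy the xgboost model from step 9 to an S3 bucket on AWS; the bucket name is a placeholder
deploy_model(model, model_name = 'pycaret_model_aws', authentication = {'bucket' : 'my-s3-bucket'}, platform = 'aws')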

CONCLUSION

To sum up, PyCaret is a very useful library that can save you a great deal of time, provided you have a basic understanding of machine learning concepts such as how an algorithm works, performance metrics, data preprocessing, etc. PyCaret produces remarkable results in very little time. I would definitely suggest exploring PyCaret because it is going to be worth it!

I would like to thank Moez Ali and the team of PyCaret for this library.

Thank you for reading until the end. If there are any mistakes or suggestions please feel free to comment.

If you would like to get in touch, reach out to me on LinkedIn.
