PyCaret: The Machine Learning Omnibus — Part 2

Balaji Sundararaman
Towards Data Science
5 min read · Aug 11, 2020


…your one-stop shop for all your Machine Learning needs.

DATA PREPROCESSING & MODEL COMPARISON


This is the second in a series of articles on the amazing pycaret package in Python, which lets you fast-track and automate virtually every stage of the ML project life cycle with ridiculously few lines of code. If you missed the first part, click the link below. There we briefly covered the initial setup, which, in a single line of code, completes all aspects of data preprocessing and takes us right to the modelling stage.

In this article, we will look at several arguments that can be passed to the setup() function to further control the preprocessing done by pycaret. By default, the setup function requires only the dataframe and the target feature whose category labels we want to predict. However, the feature datatypes automatically inferred by the function may not always be correct, and in some instances we may need to step in. In the Titanic dataset we are using, for example, the setup function correctly infers Pclass (passenger class), SibSp (siblings and spouses onboard) and Parch (parents and children onboard) as categorical features, along with Sex. pycaret will automatically one-hot encode the categorical features, and in this case will do so for Pclass, SibSp, Parch and Sex. However, these features, except Sex, have an inherent order to their levels (ordinality), and it would be more appropriate to label encode them so that the order of the levels is captured.

So, when setup prompts us to confirm the inferred data types, we type quit and address this issue by re-running setup with the additional arguments discussed below. Note that we are excluding the Name, Ticket and Cabin features from the modelling using the ignore_features argument.
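
A minimal sketch of that first call, assuming the Titanic training data has been read into a pandas DataFrame and that the target column is Survived (the file path below is illustrative):

```python
import pandas as pd
from pycaret.classification import setup

# Illustrative path: point this at your copy of the Titanic training data.
df = pd.read_csv('titanic_train.csv')

# Only the data and the target are mandatory; ignore_features drops the
# listed columns from modelling altogether. setup() prints the data types
# it has inferred and waits for confirmation; typing 'quit' at the prompt
# lets us re-run it with the corrective arguments described below.
clf = setup(data=df,
            target='Survived',
            ignore_features=['Name', 'Ticket', 'Cabin'])
```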

ordinal_features

The ordinal_features argument takes a dictionary mapping each feature we want to mark as ordinal to its category levels, listed in order.
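
For example, a sketch of that dictionary for our three ordinal features; the level lists are illustrative and must match the category values actually present in the data:

```python
# Illustrative ordinal mapping: each feature maps to its levels in
# ascending order (lowest level first).
ordinal_map = {
    'Pclass': ['3', '2', '1'],                      # 3rd < 2nd < 1st class
    'SibSp':  ['0', '1', '2', '3', '4', '5', '8'],
    'Parch':  ['0', '1', '2', '3', '4', '5', '6'],
}
```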

Normalizing

We can also set the normalize argument to True (default is False) and specify the method, say minmax (default is zscore), using the normalize_method argument.
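
A sketch of the two arguments together, building on the same DataFrame as before:

```python
# Sketch: rescale numeric features to the [0, 1] range with min-max scaling
# instead of the default z-score standardization.
clf = setup(data=df,
            target='Survived',
            normalize=True,
            normalize_method='minmax')
```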

Binning Numeric Features

Let's also bin the numeric features, Age and Fare in this case, since it is unlikely that every unit change in their values will impact our predictions of the target.
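
In pycaret, binning is requested through the bin_numeric_features argument of setup(), which takes a list of column names; a sketch:

```python
# Sketch: discretize the continuous Age and Fare columns into bins.
clf = setup(data=df,
            target='Survived',
            bin_numeric_features=['Age', 'Fare'])
```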

The Setup

Putting all the above together, our revised setup command looks like this:
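
(A sketch of the combined call; the ordinal level lists are illustrative and should mirror the categories actually present in your data.)

```python
from pycaret.classification import setup

clf_setup = setup(
    data=df,
    target='Survived',
    ignore_features=['Name', 'Ticket', 'Cabin'],
    ordinal_features={
        'Pclass': ['3', '2', '1'],
        'SibSp':  ['0', '1', '2', '3', '4', '5', '8'],
        'Parch':  ['0', '1', '2', '3', '4', '5', '6'],
    },
    normalize=True,
    normalize_method='minmax',
    bin_numeric_features=['Age', 'Fare'],
)
```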

And we hit Enter when pycaret prompts us to confirm its inference. With that single setup function, pycaret has completed:

  • missing value imputation
  • encoding of the categorical features (the setup summary shows we now have 31 features after encoding)
  • normalization of the numeric features
  • binning of the numeric features
  • splitting of the dataset into train and test sets

Awesome, right? And that is not all. There are dozens of setup customizations you can explore, depending on the nature of your dataset, by passing additional arguments to setup. In the setup summary, wherever you see False/None we have left the default configuration untouched; you can go ahead and enable the relevant options based on the nature of your dataset and the transformations you want to apply. The syntax to configure these additional arguments is simple, and you can find the full list and documentation here for classification and here for regression.

So, preprocessing done and dusted with a single function!

Model Comparison with compare_models()

The compare_models() function enables us to train 15 classification models and compare their cross-validated performance across several classification metrics, all at once! The models are automatically sorted by the Accuracy metric (the default), which can be changed to any other metric you prefer using the sort argument.
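
A sketch of the call, run in the same session after setup():

```python
from pycaret.classification import compare_models

# Trains the available classifiers with cross-validation, prints a
# leaderboard sorted by Accuracy (the default) and returns the best model.
best_model = compare_models()
```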


What's more, you can click on any other preferred metric column header to interactively sort the model ranking based on the new metric. This is of great help in quickly identifying the best performing models on our preferred metric so that we can move on to hyperparameter tuning of the selected model.

compare_models() returns the best-performing model on the specified metric, which can be stored as a model object.

The output of compare_models() can be further customized by passing additional arguments (see the sketch after this list) to:

  • omit specific models (blacklist)
  • consider only specific models (whitelist)
  • specify the number of cross-validation folds (fold, default is 10)
  • round off decimals in the results (round, default is 4)
  • specify the metric used to sort the models (sort, default is Accuracy)
  • specify the number of top models to be returned (n_select, default is 1; if more than 1, a list of the top models is returned)
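
A sketch using the argument names from the pycaret release this article describes (later releases rename blacklist/whitelist to exclude/include); the listed model IDs are just examples:

```python
# Skip a couple of estimators, use 5-fold CV, round metrics to 2 decimals,
# rank by AUC and return the top 3 models as a list.
top3_models = compare_models(
    blacklist=['svm', 'knn'],
    fold=5,
    round=2,
    sort='AUC',
    n_select=3,
)
```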

To conclude, we have seen that with pycaret, in effectively just two lines of code, we are able to zoom through the preprocessing and preliminary model comparison stages of our ML project.

In the upcoming articles, we will check out the awesome capabilities of pycaret to create a model, tune model hyperparameters, ensemble models, stack models and more with functions that follow the *_model() syntax.

If you liked this article, you may also want to check out the articles below on Exploratory Data Analysis (EDA) and visualization with minimal code and maximum output.

Thanks for your time! Stay Safe!!


