PyCaret: The Machine Learning Omnibus
…the one-stop-shop for all your machine learning needs
Any machine learning project journey starts with loading the dataset and ends (or rather, continues!) with the finalization of the optimum model or ensemble of models for predictions on unseen data and production deployment.
As machine learning practitioners, we know there are several pit stops to be made along the way to the best possible prediction performance. These intermediate steps include Exploratory Data Analysis (EDA) and data preprocessing (missing value treatment, outlier treatment, changing data types, encoding categorical features, data transformation, feature engineering/selection, sampling, train-test split, and so on) before we can embark on model building, evaluation, and prediction.
We end up importing dozens of Python packages to get all of this done, which means getting familiar with the syntax and parameters of multiple function calls within each of them.
Have you wished that there could be a single package that can handle the entire journey end to end with a consistent syntax interface? I sure have!
Enter PyCaret
That wish was answered with the `pycaret` package, and it is now even more awesome with the release of PyCaret 2.0.
Starting with this article, I will post a series on how `pycaret` helps us zip through the various stages of an ML project.
Installation
Installation is a breeze and is over in a few minutes, with all dependencies installed along the way. It is recommended to install into a virtual environment, such as Python's `venv` or a conda environment, to avoid clashes with other pre-installed packages.
pip install pycaret==2.0
Once installed, we are ready to begin! We import the package into our notebook environment. We will take up a classification problem here; the respective PyCaret modules can similarly be imported for scenarios involving regression, clustering, anomaly detection, NLP, and association rule mining.
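Each task lives in its own module with a consistent set of function names. The mapping below reflects the PyCaret 2.0 module paths as I understand them; verify against your installed version:

```python
# The actual import used in this article's notebook (needs pycaret installed):
#   from pycaret.classification import *
#
# Each ML task has its own module exposing the same workflow functions
# (PyCaret 2.0 module paths; treat as an assumption for other versions):
task_modules = {
    "classification": "pycaret.classification",
    "regression": "pycaret.regression",
    "clustering": "pycaret.clustering",
    "anomaly detection": "pycaret.anomaly",
    "nlp": "pycaret.nlp",
    "association rules": "pycaret.arules",
}
```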
We will use the `titanic` dataset from kaggle.com. You can download the dataset from here. Let's check the first few rows of the dataset using the `head()` function:
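A minimal sketch of that first step (a few hand-typed rows stand in for the Kaggle `train.csv` so the snippet runs without the download):

```python
import pandas as pd

# In the notebook: df = pd.read_csv("train.csv")  # the Kaggle Titanic file
# A tiny stand-in with the familiar Titanic columns so this runs anywhere:
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived":    [0, 1, 1],
    "Pclass":      [3, 1, 3],
    "Name":        ["Braund, Mr. Owen Harris",
                    "Cumings, Mrs. John Bradley",
                    "Heikkinen, Miss. Laina"],
    "Sex":         ["male", "female", "female"],
    "Age":         [22.0, 38.0, 26.0],
    "Ticket":      ["A/5 21171", "PC 17599", "STON/O2. 3101282"],
    "Cabin":       [None, "C85", None],
})

print(df.head())  # first few rows: passenger details plus the Survived target
```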
Setup
The `setup()` function of `pycaret` does most (correction: ALL) of the heavy lifting that is otherwise done in dozens of lines of code, in just a single line! We just need to pass the dataframe and specify the name of the target feature as arguments. The setup command generates the following output.
`setup` has helpfully inferred the data types of the features in the dataset. If we agree with it, all we need to do is hit Enter. If you think the data types inferred by `setup` are not correct, you can type `quit` in the field at the bottom and go back to the `setup` function to make changes. We will see how to do that shortly. For now, let's hit Enter and see what happens.
Whew! A whole lot seems to have happened under the hood in just one line of innocuous-looking code! Let's take stock:
- checked for missing values
- identified numeric and categorical features
- created train and test datasets from the original dataset
- imputed missing values in continuous features with the mean
- imputed missing values in categorical features with a constant value
- performed label encoding
- …and a whole host of other options are available, including outlier treatment, data scaling, feature transformation, dimensionality reduction, multicollinearity treatment, feature selection, and handling imbalanced data!
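The core of those steps can be sketched manually in pandas. This is a rough illustration only; PyCaret's internals differ, and the column names here are assumed:

```python
import pandas as pd

# Toy frame with a numeric and a categorical feature containing gaps:
df = pd.DataFrame({
    "Age":      [22.0, None, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Survived": [0, 1, 1, 0],
})

df["Age"] = df["Age"].fillna(df["Age"].mean())           # mean imputation
df["Embarked"] = df["Embarked"].fillna("not_available")  # constant imputation
df["Embarked"] = df["Embarked"].astype("category").cat.codes  # label encoding

train = df.sample(frac=0.7, random_state=42)             # train-test split
test = df.drop(train.index)
```

With `setup()`, all of this (plus type inference and much more) collapses into one call.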
But hey, what is that on lines 11 & 12 of the output? The number of features in the train and test datasets is 1745? This looks like a case of encoding gone berserk, most probably driven by high-cardinality categorical features like `name`, `ticket` and `cabin`. Later in this article and in the next, we will look at how we can control the setup as per our requirements to address such cases proactively.
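The blow-up is easy to reproduce: encoding a near-unique column such as a passenger name creates one new column per distinct value. A toy demonstration with `get_dummies` (not PyCaret's exact encoder):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],  # all unique
    "sex":  ["male", "female", "female", "female"],
})

wide = pd.get_dummies(df)
# "name" alone contributes 4 columns here; on 891 Titanic rows it would
# contribute 891, which is how feature counts like 1745 arise.
print(wide.shape)
```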
Customizing `setup`
To start with, how can we exclude features from model building, such as the three features above? We pass the variables we want to exclude in the `ignore_features` argument of the `setup` function. Note that ID and DateTime columns, when inferred, are automatically set to be ignored for modelling.
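In code, the call looks like the comment below (it needs pycaret installed and an interactive confirmation, so it is not executed here); the `drop` line shows the effective result:

```python
import pandas as pd

df = pd.DataFrame({
    "Name":     ["Braund", "Cumings"],
    "Ticket":   ["A/5 21171", "PC 17599"],
    "Cabin":    [None, "C85"],
    "Age":      [22.0, 38.0],
    "Survived": [0, 1],
})

ignore = ["Name", "Ticket", "Cabin"]

# With PyCaret (2.0 signature; run in a notebook with pycaret installed):
# clf = setup(data=df, target="Survived", ignore_features=ignore)

# Effectively, the ignored columns never reach encoding or modelling:
trimmed = df.drop(columns=ignore)
print(list(trimmed.columns))
```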
Note below that `pycaret`, while asking for our confirmation, has dropped the three features mentioned above. Let's hit Enter and proceed.
In the resultant output (a truncated version is shown below), we can see that post setup, the dataset shape is much more manageable, with label encoding applied only to the remaining, more relevant categorical features:
In the next article in this series, we will look in detail at further data preprocessing tasks we can achieve on the dataset using this single `setup` function of `pycaret` by passing additional arguments.
But before we go, let's flash-forward to the amazing model-comparison capabilities of `pycaret` using the `compare_models()` function.
Boom! All it takes is just `compare_models()` to get 15 classification algorithms compared across various classification metrics with cross-validation. At a glance, we can see that the CatBoost classifier performs best across most of the metrics, with Naive Bayes doing well on recall and Gradient Boosting on precision. The top-performing model for each metric is highlighted automatically by `pycaret`.
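`compare_models()` returns a leaderboard much like the DataFrame below. The scores here are invented placeholders, not the actual Titanic results; the point is that picking the winner per metric is then a one-liner:

```python
import pandas as pd

# Invented placeholder scores -- the real leaderboard comes from
# compare_models() after setup():
leaderboard = pd.DataFrame({
    "Model":    ["CatBoost Classifier", "Naive Bayes", "Gradient Boosting"],
    "Accuracy": [0.83, 0.72, 0.81],
    "Recall":   [0.70, 0.78, 0.68],
    "Prec.":    [0.80, 0.65, 0.82],
})

# Top model per metric, as PyCaret highlights in its output table:
best_by_metric = {
    metric: leaderboard.loc[leaderboard[metric].idxmax(), "Model"]
    for metric in ["Accuracy", "Recall", "Prec."]
}
```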
Depending on the model evaluation metric(s) we are interested in, `pycaret` helps us zoom straight in on the top-performing model, which we can then tune further via its hyperparameters. More on this in the upcoming articles.
In conclusion, we have caught glimpses of how `pycaret` can help us fast-track through the ML project life cycle with minimal code, combined with extensive and comprehensive customization of the critical data preprocessing stages.
You may also be interested in my other articles on awesome packages that use minimal code to deliver maximum results in Exploratory Data Analysis (EDA) and visualization.
EDA in R with SmartEDA
Exploratory Data Analysis — the smarter and faster way..
towardsdatascience.com
Thanks for reading; I would love to hear your feedback. Cheers!