Introduction to Regression in Python with PyCaret

Moez Ali
Towards Data Science
15 min read · Dec 12, 2021


1. Introduction

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive.

In comparison with other open-source machine learning libraries, PyCaret is an alternative low-code library that can replace hundreds of lines of code with only a few lines. This makes experiments exponentially faster and more efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few more.

The design and simplicity of PyCaret are inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen Data Scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more technical expertise.

To learn more about PyCaret, you can check the official website or GitHub.

2. Tutorial Objective

In this tutorial we will learn:

  • Getting Data: How to import data from the PyCaret repository.
  • Setting up Environment: How to set up a regression experiment in PyCaret and get started with building regression models.
  • Create Model: How to create a model, perform cross-validation and evaluate regression metrics.
  • Tune Model: How to automatically tune the hyperparameters of a regression model.
  • Plot Model: How to analyze model performance using various plots.
  • Predict Model: How to make predictions on new/unseen data.
  • Save / Load Model: How to save/load a model for future use.

3. Installing PyCaret

Installation is easy and will only take a few minutes. PyCaret’s default installation from pip only installs hard dependencies as listed in the requirements.txt file.

pip install pycaret

To install the full version:

pip install pycaret[full]

4. What is Regression?

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the ‘outcome variable’, or ‘target’) and one or more independent variables (often called ‘features’, ‘predictors’, or ‘covariates’). The objective of regression in machine learning is to predict continuous values such as sales amount, quantity, temperature, etc.

Learn More about Regression

5. Overview of the Regression Module in PyCaret

PyCaret’s Regression module (pycaret.regression) is a supervised machine learning module that is used for predicting continuous values/outcomes using various techniques and algorithms. Regression can be used for predicting values/outcomes such as sales, units sold, temperature, or any number which is continuous.

PyCaret's regression module has over 25 algorithms and 10 plots to analyze the performance of models. Be it hyperparameter tuning, ensembling, or advanced techniques like stacking, PyCaret's regression module has it all.

6. Dataset for the Tutorial

In this tutorial, we will use a dataset from PyCaret's dataset repository. The dataset contains 6000 diamond records. Short descriptions of each column are as follows:

  • ID: Uniquely identifies each observation (diamond)
  • Carat Weight: The weight of the diamond in metric carats. One carat is equal to 0.2 grams, roughly the same weight as a paperclip
  • Cut: One of five values indicating the cut of the diamond in the following order of desirability (Signature-Ideal, Ideal, Very Good, Good, Fair)
  • Color: One of six values indicating the diamond’s color in the following order of desirability (D, E, F — Colorless, G, H, I — Near-colorless)
  • Clarity: One of seven values indicating the diamond’s clarity in the following order of desirability (F — Flawless, IF — Internally Flawless, VVS1 or VVS2 — Very, Very Slightly Included, or VS1 or VS2 — Very Slightly Included, SI1 — Slightly Included)
  • Polish: One of four values indicating the diamond’s polish (ID — Ideal, EX — Excellent, VG — Very Good, G — Good)
  • Symmetry: One of four values indicating the diamond’s symmetry (ID — Ideal, EX — Excellent, VG — Very Good, G — Good)
  • Report: One of two values, “AGSL” or “GIA”, indicating which grading agency reported the qualities of the diamond
  • Price: The amount in USD at which the diamond is valued (this is the target column)

from pycaret.datasets import get_data
dataset = get_data('diamond')

# check the shape of data
dataset.shape
>>> (6000, 8)

In order to demonstrate the use of the predict_model function on unseen data, a sample of 600 records has been withheld from the original dataset to be used for predictions. This should not be confused with a train/test split, as this particular split is performed to simulate a real-life scenario. Another way to think about this is that these 600 records were not available at the time this machine learning experiment was performed.

data = dataset.sample(frac=0.9, random_state=786)
data_unseen = dataset.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
>>> Data for Modeling: (5400, 8)
>>> Unseen Data For Predictions: (600, 8)

7. Setting up Environment in PyCaret

The setup function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup must be called before executing any other function in pycaret. It takes two mandatory parameters: a pandas dataframe and the name of the target column. All other parameters are optional and are used to customize the pre-processing pipeline (we will see them in later tutorials).

When setup is executed, PyCaret's inference algorithm automatically infers the data types for all features based on certain properties. The data types are usually inferred correctly, but this is not always the case. To account for this, PyCaret displays a table containing the features and their inferred data types after the setup function is executed. If all of the data types are correctly identified, you can press enter to continue, or type quit to end the experiment.

Ensuring that the data types are correct is really important in PyCaret as it automatically performs multiple type-specific preprocessing tasks which are imperative for machine learning models.

Alternatively, you can also use the numeric_features and categorical_features parameters in setup to pre-define the data types yourself (a short sketch follows the setup call below).

from pycaret.regression import *
s = setup(data = data, target = 'Price', session_id=123)
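If you would rather pre-define the data types yourself instead of relying on inference, a minimal sketch is shown below. The column split comes from the diamond dataset described above and is illustrative only; silent is a PyCaret 2.x setup option that skips the interactive confirmation of data types.

# A hedged sketch: pre-defining the data types instead of relying on inference
# (column names are from the diamond dataset; silent=True skips the prompt in PyCaret 2.x)
s = setup(data = data,
          target = 'Price',
          session_id = 123,
          numeric_features = ['Carat Weight'],
          categorical_features = ['Cut', 'Color', 'Clarity',
                                  'Polish', 'Symmetry', 'Report'],
          silent = True)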

Once the setup has been successfully executed, it displays an information grid that contains several important pieces of information. Most of it relates to the pre-processing pipeline constructed when setup is executed and is out of scope for the purposes of this tutorial. However, a few important things to note at this stage include:

  • session_id: A pseudo-random number distributed as a seed in all functions for later reproducibility. If no session_id is passed, a random number is automatically generated that is distributed to all functions. In this experiment, the session_id is set as 123 for later reproducibility.
  • Original Data: Displays the original shape of the dataset. For this experiment, (5400, 8) means 5400 samples and 8 features including the target column.
  • Missing Values: When there are missing values in the original data, this will show as True. For this experiment, there are no missing values in the dataset.
  • Numeric Features: Number of features inferred as numeric. In this dataset, 1 out of 8 features is inferred as numeric.
  • Categorical Features: Number of features inferred as categorical. In this dataset, 6 out of 8 features are inferred as categorical.
  • Transformed Train Set: Displays the shape of the transformed training set. Notice that the original shape of (5400, 8) is transformed into (3779, 28) for the transformed train set. The number of features has increased from 8 to 28 due to categorical encoding.
  • Transformed Test Set: Displays the shape of the transformed test/hold-out set. There are 1621 samples in the test/hold-out set. This split is based on the default 70/30 ratio, which can be changed using the train_size parameter in setup.

Notice how a few tasks that are imperative to perform modeling are automatically handled, such as missing value imputation (in this case there are no missing values in training data, but we still need imputers for unseen data), categorical encoding, etc. Most of the parameters in the setup are optional and used for customizing the pre-processing pipeline. These parameters are out of scope for this tutorial but as you progress to the intermediate and expert levels, we will cover them in much greater detail.
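If you want to verify the shapes shown in the information grid yourself, PyCaret 2.x exposes the objects created by setup through the get_config function. The sketch below is illustrative and assumes the PyCaret 2.x API.

# A hedged sketch (PyCaret 2.x): inspect the objects created by setup
from pycaret.regression import get_config

X_train = get_config('X_train')        # transformed training features
X_test = get_config('X_test')          # transformed hold-out features
prep_pipe = get_config('prep_pipe')    # the fitted preprocessing pipeline

print(X_train.shape, X_test.shape)     # should roughly match (3779, 28) and (1621, 28)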

8. Comparing All Models

Comparing all models to evaluate performance is the recommended starting point for modeling once the setup is completed (unless you exactly know what kind of model you need, which is often not the case). This function trains all models in the model library and scores them using k-fold cross-validation for metric evaluation. The output prints a scoring grid that shows average MAE, MSE, RMSE, R2, RMSLE, and MAPE across the folds (10 by default) along with the training time each model took.

best = compare_models()

One line of code and we have trained and evaluated over 20 models using cross-validation. The scoring grid printed above highlights the highest performing metric for comparison purposes only. The grid is sorted by R2 (highest to lowest) by default, which can be changed by passing the sort parameter. For example, compare_models(sort = 'RMSLE') will sort the grid by RMSLE (lowest to highest, since lower is better).

If you want to change the number of folds from the default value of 10, you can use the fold parameter. For example, compare_models(fold = 5) will compare all models using 5-fold cross-validation. Reducing the number of folds will improve the training time.

By default, compare_models returns the single best-performing model based on the default sort order, but it can also return a list of the top N models using the n_select parameter, as sketched below.
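For example, a minimal sketch combining these parameters (the specific values are illustrative only):

# A hedged sketch: sort the grid by RMSLE, use 5-fold CV, and keep the top 3 models
top3 = compare_models(sort = 'RMSLE', fold = 5, n_select = 3)

# top3 is a list of trained model objects, best first
print(top3)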

9. Create a Model

create_model is the most granular function in PyCaret and is often the foundation behind most of PyCaret's functionality. As the name suggests, this function trains and evaluates a model using cross-validation, whose number of folds can be set with the fold parameter. The output prints a scoring grid that shows MAE, MSE, RMSE, R2, RMSLE, and MAPE by fold.

For the remaining part of this tutorial, we will work with the below models as our candidate models. The selections are for illustration purposes only and do not necessarily mean they are the top-performing or ideal for this type of data.

  • AdaBoost Regressor (‘ada’)
  • Light Gradient Boosting Machine (‘lightgbm’)
  • Decision Tree (‘dt’)

There are 25 regressors available in the model library of PyCaret. To see the complete list of all regressors, either check the docstring or use the models() function.

models()

9.1 AdaBoost Regressor

ada = create_model('ada')

print(ada)
>>> OUTPUT
AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear', n_estimators=50, random_state=123)

9.2 Light Gradient Boosting Machine

lightgbm = create_model('lightgbm')

9.3 Decision Tree

dt = create_model('dt')

Notice that the mean score of each model matches the score printed by compare_models. This is because the metrics printed in the compare_models score grid are the average scores across all CV folds. Similar to compare_models, if you want to change the fold parameter from the default value of 10, you can use the fold parameter in the create_model function. For example, create_model('dt', fold = 5) creates a Decision Tree using 5-fold cross-validation.

10. Tune a Model

When a model is created using the create_model function, it uses the default hyperparameters to train the model. In order to tune hyperparameters, the tune_model function is used. This function automatically tunes the hyperparameters of a model using a random grid search over a pre-defined search space. The output prints a scoring grid that shows MAE, MSE, RMSE, R2, RMSLE, and MAPE by fold. To use a custom search grid, you can pass the custom_grid parameter to the tune_model function.
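Because the search is random, its thoroughness is controlled by the number of iterations. A minimal sketch is shown below; n_iter and optimize are standard tune_model parameters, and the values used here are illustrative only.

# A hedged sketch: run more random-search iterations and optimize a different metric
# (n_iter defaults to 10; larger values search more candidates at the cost of time)
tuned_ada_mae = tune_model(ada, n_iter = 50, optimize = 'MAE')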

10.1 AdaBoost Regressor

tuned_ada = tune_model(ada)

print(tuned_ada)
>>> OUTPUT
AdaBoostRegressor(base_estimator=None, learning_rate=0.05, loss='linear', n_estimators=90, random_state=123)

10.2 Light Gradient Boosting Machine

import numpy as np

lgbm_params = {'num_leaves': np.arange(10, 200, 10),
               'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
               'learning_rate': np.arange(0.1, 1, 0.1)
               }
tuned_lightgbm = tune_model(lightgbm, custom_grid = lgbm_params)

print(tuned_lightgbm)
>>> OUTPUT
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=60,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=120, objective=None,
              random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

10.3 Decision Tree

tuned_dt = tune_model(dt)

By default, tune_model optimizes R2, but this can be changed using the optimize parameter. For example, tune_model(dt, optimize = 'MAE') will search for the hyperparameters of a Decision Tree Regressor that result in the lowest MAE instead of the highest R2. For the purposes of this example, we have used the default metric R2 for the sake of simplicity only. The methodology behind selecting the right metric to evaluate a regressor is beyond the scope of this tutorial, but there are many good resources available on regression error metrics if you would like to develop a deeper understanding.

Metrics alone are not the only criteria you should consider when finalizing the best model for production. Other factors to consider include training time, the standard deviation of k-folds, etc. For now, let’s move forward considering the Tuned Light Gradient Boosting Machine stored in the tuned_lightgbm variable as our best model for the remainder of this tutorial.

11. Plot a Model

Before model finalization, the plot_model function can be used to analyze the performance across different aspects such as Residuals Plot, Prediction Error, Feature Importance, etc. This function takes a trained model object and returns a plot based on the test / hold-out set.

There are over 10 plots available; please see the plot_model documentation for the list of available plots.
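If you are working in a script rather than a notebook, plot_model can also write the figure to disk. A small sketch, assuming the standard save parameter and the documented 'learning' plot name:

# A hedged sketch: save the learning curve plot as a PNG in the working directory
plot_model(tuned_lightgbm, plot = 'learning', save = True)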

11.1 Residual Plot

plot_model(tuned_lightgbm)

11.2 Prediction Error Plot

plot_model(tuned_lightgbm, plot = 'error')

11.3 Feature Importance Plot

plot_model(tuned_lightgbm, plot='feature')

Another way to analyze the performance of models is to use the evaluate_model function which displays a user interface for all of the available plots for a given model. It internally uses the plot_model function.

evaluate_model(tuned_lightgbm)

12. Predict on Test / Hold-out Sample

Before finalizing the model, it is advisable to perform one final check by predicting the test/hold-out set and reviewing the evaluation metrics. If you look at the information grid in section 7 above, you will see that 30% (1621 samples) of the data has been separated out as a test/hold-out sample. All of the evaluation metrics we have seen above are cross-validated results based on the training set (70%) only. Now, using our final trained model stored in tuned_lightgbm, we will predict the hold-out sample and evaluate the metrics to see if they are materially different from the CV results.

predict_model(tuned_lightgbm);

The R2 on the test/hold-out set is 0.9652 compared to 0.9708 achieved on the tuned_lightgbm CV results (in section 10.2 above). This is not a significant difference. If there is a large variation between the test/hold-out and CV results, this would normally indicate over-fitting, but it could also be due to several other factors and would require further investigation. In this case, we will move forward with finalizing the model and predicting on unseen data (the 10% that we separated at the beginning and never exposed to PyCaret).

13. Finalize Model for Deployment

Model finalization is the last step in the experiment. A machine learning workflow in PyCaret starts with the setup, followed by comparing all models using compare_models and shortlisting a few candidate models (based on the metric of interest) to perform several modeling techniques such as hyperparameter tuning, ensembling, stacking, etc. This workflow will eventually lead you to the best model for use in making predictions on new and unseen data.

The finalize_model function fits the model onto the complete dataset including the test/hold-out sample (30% in this case). The purpose of this function is to train the model on the complete dataset before it is deployed in production.

final_lightgbm = finalize_model(tuned_lightgbm)

print(final_lightgbm)
>>> OUTPUT
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=60,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=120, objective=None,
              random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

Caution: one final word of caution. Once the model is finalized using finalize_model, the entire dataset, including the test/hold-out set, is used for training. As such, if the model is used to make predictions on the hold-out set after finalize_model has been called, the information grid printed will be misleading, as you would be predicting on the same data that was used for modeling.

In order to demonstrate this point, we will use final_lightgbm with predict_model and compare the resulting information grid with the one in section 12 above.

predict_model(final_lightgbm);

Notice how the R2 in the final_lightgbm has increased to 0.9891 from 0.9652, even though the model is the same. This is because the final_lightgbm variable is trained on the complete dataset including the test/hold-out set.

14. Predict on Unseen Data

The predict_model function is also used to predict on the unseen/new dataset. The only difference from the section above is that this time we will pass the data_unseen parameter. data_unseen is the variable created at the beginning of the tutorial and contains 10% (600 samples) of the original dataset which was never exposed to PyCaret (see section 6 for the explanation).

unseen_predictions = predict_model(final_lightgbm, data=data_unseen)
unseen_predictions.head()

The Label column is added onto the data_unseen set. Label is the predicted value from the final_lightgbm model. If you want predictions to be rounded, you can use the round parameter inside predict_model. You can also check the metrics on these predictions, since the actual target column Price is available. To do that, we will use the pycaret.utils module.

from pycaret.utils import check_metric
check_metric(unseen_predictions.Price, unseen_predictions.Label, 'R2')
>>> OUTPUT
0.9779
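
As mentioned above, the predictions themselves can be rounded at prediction time using the round parameter. A small sketch (the value is illustrative):

# A hedged sketch: round the predicted Label column to whole dollars
rounded_predictions = predict_model(final_lightgbm, data = data_unseen, round = 0)
rounded_predictions.head()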

15. Saving the Model

We have now finished the experiment by finalizing the tuned_lightgbm model, which is now stored in the final_lightgbm variable. We have also used the model stored in final_lightgbm to predict data_unseen. This brings us to the end of our experiment, but one question is still to be asked: what happens when you have more new data to predict? Do you have to go through the entire experiment again? The answer is no; PyCaret's built-in save_model function allows you to save the model along with the entire transformation pipeline for later use.

save_model(final_lightgbm, 'Final LightGBM Model 25Nov2020')
Transformation Pipeline and Model Successfully Saved

>>> OUTPUT
(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=[], target='Price',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strategy='...
                  LGBMRegressor(boosting_type='gbdt', class_weight=None,
                                colsample_bytree=1.0, importance_type='split',
                                learning_rate=0.1, max_depth=60,
                                min_child_samples=20, min_child_weight=0.001,
                                min_split_gain=0.0, n_estimators=100, n_jobs=-1,
                                num_leaves=120, objective=None, random_state=123,
                                reg_alpha=0.0, reg_lambda=0.0, silent=True,
                                subsample=1.0, subsample_for_bin=200000,
                                subsample_freq=0)]],
          verbose=False),
 'Final LightGBM Model 25Nov2020.pkl')

16. Loading the Saved Model

To load a saved model at a future date in the same or an alternative environment, we would use PyCaret’s load_model function and then easily apply the saved model on new unseen data for prediction.

saved_final_lightgbm = load_model('Final LightGBM Model 25Nov2020')
Transformation Pipeline and Model Successfully Loaded

Once the model is loaded into the environment, you can simply use it to predict any new data using the same predict_model function. Below we have applied the loaded model to predict the same data_unseen that we used in section 14 above.

new_prediction = predict_model(saved_final_lightgbm, data=data_unseen)
new_prediction.head()

Notice that the results of unseen_predictions and new_prediction are identical.

from pycaret.utils import check_metric
check_metric(new_prediction.Price, new_prediction.Label, 'R2')
>>> OUTPUT
0.9779

17. Wrap-up / Next Steps?

This tutorial has covered the entire machine learning pipeline, from data ingestion and pre-processing to training the model, hyperparameter tuning, prediction, and saving the model for later use. We have completed all of these steps in fewer than 10 commands that are naturally constructed and intuitive to remember, such as create_model(), tune_model(), and compare_models(). Re-creating the entire experiment without PyCaret would have taken well over 100 lines of code in most libraries.

We have only covered the basics of pycaret.regression. In future tutorials we will go deeper into advanced pre-processing, ensembling, generalized stacking, and other techniques that allow you to fully customize your machine learning pipeline and are a must-know for any data scientist.

Thank you for reading 🙏

Important Links

Tutorials New to PyCaret? Check out our official notebooks!
📋 Example Notebooks created by the community.
📙 Blog Tutorials and articles by contributors.
📚 Documentation The detailed API docs of PyCaret
📺 Video Tutorials Our video tutorial from various events.
📢 Discussions Have questions? Engage with community and contributors.
🛠️ Changelog Changes and version history.
🌳 Roadmap PyCaret’s software and community development plan.

Author:

I write about PyCaret and its use cases in the real world. If you would like to be notified automatically, you can follow me on Medium, LinkedIn, and Twitter.

