
How to Run Complex Machine Learning Models With 2 Lines of Code

Create a complete data science project using PyCaret. From data cleaning to complex machine learning models – code along


Image by Vlada Karpovich. Source: Pexels

A reader recently asked me in one of my blogs if I had tried PyCaret. I promised that I would, and I am so glad I did. PyCaret allows you to run a whole data science project, from data cleaning and dealing with class imbalance to hyper-tuned machine learning models, with two lines of code. Don't believe me? That's okay; I couldn't believe it either the first time I tried it, but the fact is that it works. Let me show you PyCaret in action first, and then we can dive deeper into the library. For demonstration purposes, I will use the Titanic survivor dataset, which includes categorical, numerical, and NaN values.
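Here is a minimal sketch of what those two lines look like. PyCaret bundles a version of this dataset; I am assuming its target column is named Survived, so check the column names on your copy:

# Two lines: prepare the data, then train and compare every model
from pycaret.datasets import get_data
from pycaret.classification import *

titanic = get_data('titanic')              # bundled Titanic dataset
clf = setup(titanic, target = 'Survived')  # line 1: full data preparation
best_model = compare_models()              # line 2: train and rank all models

Here is the result: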

Image by the Author

Here is what just happened: PyCaret dealt with the categorical data, divided the dataset into train and test sets, logged the experiment, checked for outliers, fixed the class imbalance, and ran models from Logistic Regression to XGBoost in less than 30 seconds. It also achieved an accuracy of 0.8154 with two lines of code. Now, let's understand more about how PyCaret works and how you can use it.

What is PyCaret?

Here is what their website says:

PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment.

It seems that PyCaret does what it promises. It goes from preparing your data to deploying supervised and unsupervised models with two or three lines of code. I tested it on a few different projects, including some old ones that I worked on in the past, and the results were very close to those I got after a week of work. To show you how it works, let's create a classification project together. You can find the notebook I used for this blog here. Soon I will publish a regression project as well. Stay tuned!

Installation

There are a few ways to install PyCaret through your terminal, and it is highly recommended to use a virtual environment to avoid conflicts with other libraries. Typing pip install pycaret installs a slim version of PyCaret with only its hard dependencies. To install the full version, which is the one I recommend, type pip install pycaret[full].

Starting a New Project

PyCaret's team is so confident in what the library delivers that they provide 55 datasets you can use to try it on your own. These are the problem types included: anomaly detection, association rule mining, binary and multiclass classification, clustering, NLP, and regression.

Once you have installed PyCaret, you can type the following in a Jupyter notebook to get the list of datasets they make available.

# Get the list of datasets bundled with PyCaret
from pycaret.datasets import get_data
from pycaret.classification import *
index = get_data('index')   # 'index' returns the table of all bundled datasets

The list of datasets is long, and I will not include all of them here, but these are the classification datasets you can choose from:

Image by the Author

For this project, I will go with the credit card default dataset. It's somewhat challenging, and it's very close to a real-life problem. To choose a dataset, you just type data = get_data('name_of_the_dataset'). Since we will use the credit card default dataset, we need to type data = get_data('credit') and run the cell.
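In a notebook cell, that is simply:

# Load the credit card default dataset; get_data also displays the first rows
data = get_data('credit')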

Now, we are ready for the data preparation, which takes one line of code. PyCaret will return a table with the dataset's information, and we will be able to make decisions about how we want to proceed. To do so, we pass the dataset, the target label, a name for the experiment, and so on. Here is the most important part of the code:

clf1 = setup(data, target = 'default', session_id=123, log_experiment=True, experiment_name='default1')

After the code above runs, we get the following prompt: "Following data types have been inferred automatically, if they are correct press enter to continue or type 'quit' otherwise." Press enter to continue.

Image by the Author

What the code above did was go through data cleaning, train/test splitting, log transformations, polynomial features, encoding of categorical data, class-imbalance handling, and just about any other data preparation step you can think of. These are all controlled by optional arguments to setup, and there are dozens of them. I highly encourage you to press shift + tab and check out everything you can do with the data preparation. I could write a whole article about it (and I might), but for now, let's move on with the basics.
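As a sketch, here is a setup call with a couple of extra preprocessing flags. normalize and fix_imbalance are real setup arguments, though the exact set of options depends on your PyCaret version:

clf1 = setup(data, target = 'default', session_id = 123,
             normalize = True,        # scale numeric features
             fix_imbalance = True,    # oversample the minority class (SMOTE by default)
             log_experiment = True, experiment_name = 'default1')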

Running Models

Now, it's time to run baseline models. So far, after loading the dataset, I have written one line of code. We will now run 16 machine learning models with another line, shown in the sketch below. I will use 5 folds for cross-validation.
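Here is that one line, assuming the setup above:

# Train and cross-validate every available classifier with 5 folds
best_model = compare_models(fold = 5)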

Image by the Author

Running all the models took 1:36 min, including some complex ensemble models. Let’s check the results:

Image by the Author

A few things to notice here. First, XGBoost failed. That's ok; we have 15 other models to analyze. PyCaret gives us multiple metrics and highlights the best result for each one. For this project, we should focus on recall, which matters when the cost of false negatives is high, and we got a 0.90 recall score. You can also create individual models just by typing, for example, qda = create_model('qda').

Image by the Author

qda stands for Quadratic Discriminant Analysis. By default, PyCaret uses 10-fold cross-validation; you can change that by typing qda = create_model('qda', fold = 5). To check the full list of models you can run, the library each one comes from, and its abbreviation, type models().
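As a quick sketch, those two calls in a notebook:

# Build a single QDA model with 5-fold cross-validation
qda = create_model('qda', fold = 5)

# List every available model, its source library, and its abbreviation
models()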

Image by the Author

And you can tune a model's hyperparameters with tune_model. In this case, tuning improved some metrics, but it actually decreased the recall score.
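Here is a sketch of the call, tuning the qda model created above. By default PyCaret optimizes accuracy; you can pass optimize = 'Recall' to target recall instead:

# Tune hyperparameters via a randomized search over a predefined grid
tuned_qda = tune_model(qda)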

Image by the Author

And there are plenty of other things you can do with PyCaret, such as analyzing and interpreting models, testing models on the holdout set, and saving and deploying models.
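Here is a sketch of a few of those calls (the file name is illustrative):

# Diagnostic plots, e.g. the confusion matrix
plot_model(qda, plot = 'confusion_matrix')

# Score the holdout set that setup() kept aside
predict_model(qda)

# Save the whole fitted pipeline to disk and load it back later
save_model(qda, 'qda_pipeline')
loaded_qda = load_model('qda_pipeline')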

Image by the Author

I encourage you to visit their website and check out their documentation, projects, and more tutorials. There is a lot more to cover, which I will keep for another article.

Final Thoughts

PyCaret seems to be amazing and delivers what it promises. However, as I always like to remind readers, data science is a complex role, and you cannot substitute a professional with a few lines of code. If you would like to use PyCaret as an additional step in your next project and to understand your dataset's potential, go ahead and do it. I tried it with multiple datasets, and the results were very close to those I was able to get after working a full week on a project. However, I don't recommend using it for your final results or as a shortcut. You can add it as part of the process and get quick insights without having to type dozens of lines of code. Have fun trying it, and let me know how it goes.

Soon I will also publish a regression project using PyCaret, so follow me here on Medium to stay up to date. If you enjoyed this article, don't forget to leave your applause 👏. It motivates me to keep writing.

