MACHINE LEARNING
How to Create a Complex Data Science Project With 2 Lines of Code

A reader recently asked me in one of my blogs if I had tried PyCaret. I promised that I would try it, and I am so glad I did. PyCaret allows you to run a whole data science project, from data cleaning and dealing with class imbalance to hyper-tuned machine learning models, with two lines of code. Don't believe me? That's okay, I couldn't believe it either the first time I tried it, but the fact is that it works. Let me show you PyCaret in action first, and then we can dive deeper into the library. For demonstration purposes, I will use the Titanic survivor dataset, which includes categorical, numerical, and NaN values. Here is the result:

Here is what just happened: PyCaret dealt with the categorical data, split the dataset into train and test sets, logged the experiment, checked for outliers, fixed the class imbalance, and ran models from Logistic Regression to XGBoost in less than 30 seconds. It also achieved an accuracy of 0.8154 with 2 lines of code. Now, let's understand more about how PyCaret works and how you can use it.
What is PyCaret?
Here is what their website says:
PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment.
It seems that PyCaret does what it promises. It goes from preparing your data to deploying supervised and unsupervised models with 2 or 3 lines of code. I tested it on a few different projects, including some old ones that I worked on in the past, and the results were very close to those I got after a week of work. To show you how it works, let's create a classification project together. You can find the notebook I used for this blog here. Soon I will publish a regression project as well. Stay tuned!
Installation
There are a few ways to install PyCaret through your terminal. It is highly recommended to use a virtual environment to avoid conflicts with other libraries. The first option is typing pip install pycaret in your terminal, which installs a slim version of PyCaret with its hard dependencies only. To install the full version, you can type pip install pycaret[full], which is the one I recommend.
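Put together, the installation steps above look like this in a terminal (the environment name is just an example, and the brackets in `pycaret[full]` are quoted so that shells like zsh don't expand them):

```shell
# Create and activate a virtual environment to isolate dependencies
python -m venv pycaret-env
source pycaret-env/bin/activate

# Slim install: PyCaret plus its hard dependencies only
pip install pycaret

# Full install: also pulls in the optional dependencies (recommended)
pip install "pycaret[full]"
```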
Starting a New Project
PyCaret is so confident in what it delivers that its team provides 55 datasets you can use to try the library on your own. These cover the following problem types: anomaly detection, association rule mining, binary and multiclass classification, clustering, NLP, and regression. Once you have installed PyCaret, you can type the following in a Jupyter notebook to get the list of datasets that they make available.
# Get data
from pycaret.datasets import get_data
from pycaret.classification import *
index = get_data('index')
The list of datasets is long, and I will not add all of them here, but these are the classification datasets that you can choose from:

For this project, I will go with the credit card default dataset. It's somewhat challenging and very close to a real-life problem. To choose a dataset, you just type data = get_data('name_of_the_dataset'). Since we will use the credit card default dataset, we need to type data = get_data('credit') and run the cell.

Now we are ready for the data preparation, which takes one line of code. PyCaret will return a table with the dataset's information, and we will be able to make decisions about how we want to proceed. To do so, we set up the dataset, the target label, a name for the experiment, etc. Here is the most important part of the code:
clf1 = setup(data, target = 'default', session_id=123, log_experiment=True, experiment_name='default1')
After the code above is run, we will get the following prompt: Following data types have been inferred automatically, if they are correct press enter to continue or type 'quit' otherwise. Press enter to continue.

The code above took care of data cleaning, the train-test split, testing log transformation and polynomial features, encoding categorical data, solving class imbalance, and just about any other data preparation step you can think of. You can see that I wrote log_transformation = True as an example, but there are dozens of data preparation options. I highly encourage you to type shift + tab and check out everything you can do with the data preparation. I could write a whole article about it (and I might), but for now, let's move on with the basics.
Running Models
Now, it's time to run baseline models. So far, after loading the dataset, I have written one line of code. We will now run 16 machine learning models with another line of code. I will use 5 folds for the cross-validation.

Running all the models took 1:36 min, including some complex ensemble models. Let’s check the results:

A few things to notice here. First, XGBoost failed. That's OK; we have 15 other models to analyze. PyCaret gives us multiple metrics and highlights the best result for each one. For this project, we need to consider the recall metric, which matters when the cost of false negatives is high, and we got a 0.90 recall score. You can also create individual models just by typing, for example, qda = create_model('qda').

qda stands for the Quadratic Discriminant Analysis model. By default, PyCaret cross-validates each model with 10 folds. You can change that by typing qda = create_model('qda', fold = 5). To check the full list of models that you can run, the library each one comes from, and its abbreviation, type models().

And you can type tune_model(lr) to tune a model's hyperparameters, which in this case improved some metrics but actually decreased the recall score.

And there are a ton of other things that you can do with PyCaret, such as analyzing and interpreting models, testing models on holdout sets, and saving and deploying models.

I encourage you to visit their website and check out their documentation, projects, and more tutorials. There is a lot more to cover, which I will save for another article.
Final Thoughts
PyCaret seems amazing and delivers what it promises. However, as I always like to remind readers, data science is a complex role, and you cannot substitute a professional with a few lines of code. If you would like to use PyCaret as an additional step in your next project to understand your dataset's potential, go ahead and do it. I tried it with multiple datasets, and the results were very close to those I got when I worked a full week on a project. However, I don't recommend using it for your final results or as a shortcut. You can add it as part of the process and get quick insights without having to type dozens of lines of code. Have fun trying it and let me know how it goes.
Soon I will also publish a regression project using PyCaret, so follow me here on Medium to stay up-to-date. If you enjoyed this article, don't forget to leave your applause 👏. It motivates me to keep writing.