
Why You Should Switch to Piskle for Exporting Your Scikit-Learn Models

Up to 2/3 reduction in file size

Written By: Amal Hasni & Dhia Hmila

Photo by Jonathan Pielmayer on Unsplash

Exporting your fitted model after the training phase is the last crucial step in every Data Science project. However, as important as it is, the methods we use to store our models weren't designed with Data Science in mind.

In fact, Python's pickle and the well-established joblib package, which we often use with Scikit-learn, are general-purpose serialization methods that work on any Python object. As a result, they're not as optimized as we'd like them to be.

By the end of this article, you'll see that we can do much better in terms of memory and time efficiency.

Table of contents

· What is piskle
· What makes piskle special
· How to use piskle
· Supported estimators (so far)
· Efficiency-wise, what should you expect in numbers


What is piskle

Piskle is a Python package we created that allows you to serialize Scikit-learn’s final models in an optimized way. If you’re not familiar with the term, here’s how Wikipedia defines it:

Serialization is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later

This is especially useful if you have lots and lots of estimators to store (maybe updated versions of the same model), or if you'd like to store them in the cloud for a web application or an API.

If you’re wondering about the naming choice, piskle is a combination of pickle and scikit-learn 😉

What makes piskle special

Piskle lets you store Scikit-learn models (and Python objects in general) efficiently by selectively keeping only the parts that actually need to be kept. In other words, piskle stores only the attributes used by a given action, such as the predict method: the attributes necessary to perform that action once the model is reloaded.

How to use piskle

To use piskle, you first need to install it with pip using the following command:

pip install piskle

The next thing you need is a model to export. You can use this as an example:
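For instance, here's a minimal sketch of a fitted model (the dataset and estimator are placeholders of our choosing; any fitted Scikit-learn model from the supported list would work just as well):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Fit a simple classifier on a built-in toy dataset
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)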

Exporting the model is then as easy as the following:

import piskle
piskle.dump(model, 'model.pskl')

Loading it is even easier:

model = piskle.load('model.pskl')

If you want even faster serialization, you can disable the optimize feature. Note, however, that this feature reduces the size of the exported file even further and improves loading time.

piskle.dump(model, 'model.pskl', optimize=False)

Supported estimators (so far)

So far, piskle supports 23 scikit-learn estimators and transformers. The included models have been tested using the latest version of Scikit-learn (currently 0.24.0). You can check the full list here.

Efficiency-wise, what should you expect in numbers

To demonstrate the potential of piskle, we can conduct a simple experiment. We will export the same model using three different methods and compare the sizes of the resulting files.

Before we start exporting Scikit-learn models, let's get a dataset big enough to highlight the difference piskle can make. For convenience, we'll use a Python package called datasets that lets you easily download more than 500 datasets. The dataset we chose is called Amazon US Reviews and has textual attributes we can use with TF-IDF, as follows:
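Here's a minimal sketch of that setup (assuming the datasets package is installed; the configuration name, the review_body column, and the split size are illustrative choices and may differ from what was used for the numbers below):

from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Download a slice of one configuration of the Amazon US Reviews dataset
dataset = load_dataset("amazon_us_reviews", "Video_Games_v1_00", split="train[:10000]")

# Fit a TF-IDF vectorizer on the review texts
texts = dataset["review_body"]
model = TfidfVectorizer()
model.fit(texts)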

To compare piskle with joblib and pickle, we export our model using the three packages and observe the resulting files using these lines of code:
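One way to do that comparison is sketched below (the file names and the os.path.getsize-based size check are our own choices, not the article's exact script):

import os
import pickle
import joblib
import piskle

# Export the same fitted model with the three packages
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
joblib.dump(model, 'model.joblib')
piskle.dump(model, 'model.pskl')

# Compare the resulting file sizes (in KB)
for path in ['model.pkl', 'model.joblib', 'model.pskl']:
    print(path, os.path.getsize(path) // 1024, 'KB')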

Here’s a recap of the resulting three file sizes:

+----------+----------+----------+
|  Pickle  |  Joblib  |  Piskle  |
+----------+----------+----------+
| 1186 KB  | 1186 KB  | 388 KB   |
+----------+----------+----------+

We can observe a significant size reduction when using piskle as opposed to pickle and joblib: roughly a 67% reduction in file size.

💡 Note that we can optimize this further using compression algorithms for the three packages.
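For example, joblib supports built-in compression and pickle output can be wrapped in gzip (a sketch; the compression level and file names are arbitrary choices):

import gzip
import pickle
import joblib

# joblib: built-in compression (0-9, higher = smaller files but slower)
joblib.dump(model, 'model_compressed.joblib', compress=3)

# pickle: wrap the output stream in gzip
with gzip.open('model.pkl.gz', 'wb') as f:
    pickle.dump(model, f)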

Final thoughts

Piskle was born out of a real need for an efficient way to export a Scikit-learn model for a web app. It has shown significant efficiency gains on the models tested so far and thus adds real value. Don't hesitate to try it out, especially if you intend to store your model in the cloud and/or you're short on space.

As piskle is still a work in progress with a lot of potential improvements planned, we would be pleased to receive your feedback and/or suggestions:

link to GitHub repository

Thank you for sticking with us this far and for your interest. Stay safe, and we will see you in our next article 😊 !

