Working with sparse data sets in pandas and sklearn

Dafni Sidiropoulou Velidou
Towards Data Science
Nov 5, 2019

In Machine Learning, there are several settings in which we encounter sparse data sets. Below are some examples:

  • User ratings for recommendation systems
  • User clicks for content recommendation
  • Document vectors in natural language processing

Sparse data sets are frequently large, making it hard to use standard Python machine learning tools such as pandas and sklearn. Often, the memory of an average local machine does not suffice to store or process a large sparse data set, and even when it does, processing time can increase significantly.

In this article, we will give a few simple tips we can follow when working with large sparse data sets in python for machine learning projects.

What is a sparse matrix?

A sparse matrix is a matrix in which most of the elements are zero. Conversely, a matrix in which the majority of elements are non-zero is called dense. We define the sparsity of a matrix as the number of zero elements divided by the total number of elements. A matrix with sparsity greater than 0.5 is considered sparse.
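As a quick illustration, sparsity can be computed with numpy; the toy array below is made up for the example.

import numpy as np

a = np.array([[0, 0, 3],
              [0, 5, 0],
              [0, 0, 0]])

# sparsity = number of zero elements / total number of elements
sparsity = 1.0 - np.count_nonzero(a) / a.size
print(sparsity)  # ~0.78, so the matrix is sparse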

Handling a sparse matrix as a dense one is frequently inefficient, making excessive use of memory.

When working with sparse matrices it is recommended to use dedicated data structures for efficient storage and processing. We will refer to some of the available structures in Python in the next sections.

Frequently, we start from a dense data set that includes categorical variables. Typically, we have to apply one-hot encoding for these variables. When these variables have high cardinality (large number of distinct values), one-hot encoding will generate a sparse data set.

Example

Consider the following table with user ratings for movies

[Table: dense matrix of user ratings for movies]

where “Rating” is the target variable for a multi-class classification problem.

Now imagine we want to train a Factorization Machines classifier. Factorization Machines (FMs) are a general-purpose predictor that performs well on problems with high sparsity, such as recommender systems. According to the original paper, we need to transform the data set into the format below:

[Table: sparse matrix with users and movies one-hot encoded]

In the structure above, both input attributes (users and movies) are one-hot encoded. In pandas this is a simple, one-line transformation. Yet, for large data sets it can become quite cumbersome.

Below we will demonstrate some ways that facilitate the transformation and processing of such data sets in pandas and sklearn.

Data set

We will use the MovieLens 100K public data set, available at https://grouplens.org/datasets/movielens/100k/. The training file contains 100,000 ratings by 943 users on 1,682 items. For the scope of this analysis we will ignore the timestamp column.

Let’s load the data into a pandas data frame.
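A minimal sketch of the loading step, assuming the u.data file from the MovieLens archive sits in the working directory (it is tab-separated with no header row):

import pandas as pd

# u.data columns: user id | item id | rating | timestamp
df = pd.read_csv('u.data', sep='\t',
                 names=['user_id', 'item_id', 'rating', 'timestamp'])

# the timestamp column is not used in this analysis
df = df.drop(columns='timestamp')
print(df.shape)  # (100000, 3)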

One-hot encoding

Assuming we want to transform this data set to the format shown in the section above, we have to one-hot encode the columns user_id and item_id. For the transformation we will use the pandas get_dummies function, which converts categorical variables into indicator variables.

Before we apply the transformation let’s check the memory usage of our original data frame. For that, we will use the memory_usage pandas function.
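Something along these lines; the print_memory_usage helper is our own, introduced here so we can reuse it in later snippets.

def print_memory_usage(df, label='Memory usage'):
    # deep=True accounts for the true footprint of object-dtype columns
    mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    print('{} is {:.1f} MB'.format(label, mb))

print_memory_usage(df, 'Memory usage of data frame')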

Memory usage of data frame is 2.4 MB

Now, let’s apply the transformation and check the memory usage of the transformed data frame.
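For example (df_dense is our name for the result; we pass dtype='uint8' explicitly because newer pandas versions default to bool):

df_dense = pd.get_dummies(df, columns=['user_id', 'item_id'], dtype='uint8')

print(df_dense.shape)
print_memory_usage(df_dense)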

After one-hot encoding, we have created one binary column for each user and one binary column for each item. So, the shape of the new data frame is 100,000 × 2,626, including the target column.

(100000, 2626)
Memory usage is 263.3 MB

We see that the memory usage of the transformed data frame is significantly larger compared to the original. This is expected as we have increased the number of columns of the data frame. Yet, most of the elements in the new data frame are zeros.

Tip 1: Use pandas sparse structures to store sparse data

Pandas Sparse Structures

Pandas provides data structures for efficient storage of sparse data. In these structures, zero values (or any other specified value) are not actually stored in the array.

Storing only the non-zero values and their positions is a common technique in storing sparse data sets.

We can use these structures to reduce the memory usage of our data set. You can think of this as a way to “compress” a data frame.

In our example, we will convert the one-hot encoded columns into SparseArrays, which are 1-d arrays where only non-zero values are stored.
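One way to do the conversion is astype with a pandas SparseDtype; this is a sketch, and the variable names are ours.

# every column except the target holds one-hot indicators
dummy_cols = [col for col in df_dense.columns if col != 'rating']

# Sparse[uint8, 0]: non-zero values are stored as uint8, 0 is the fill value
df_sparse = df_dense.astype({col: pd.SparseDtype('uint8', 0) for col in dummy_cols})

print(df_sparse.dtypes)
print_memory_usage(df_sparse)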

rating                      int64
user_id_1        Sparse[uint8, 0]
user_id_2        Sparse[uint8, 0]
user_id_3        Sparse[uint8, 0]
user_id_4        Sparse[uint8, 0]
                      ...
item_id_1678     Sparse[uint8, 0]
item_id_1679     Sparse[uint8, 0]
item_id_1680     Sparse[uint8, 0]
item_id_1681     Sparse[uint8, 0]
item_id_1682     Sparse[uint8, 0]
Length: 2626, dtype: object
Memory usage is 1.8 MB

If we check the dtypes of the new data frame we see that the columns we converted are now of type Sparse[uint8, 0]. This means that zero values are not stored and non-zero values are stored as uint8. The dtype of the non-zero elements can be set when converting to SparseArray.

Further, we see that we have managed to reduce the memory usage of our data frame significantly.

So far, we have managed to reduce the memory usage of the data frame, but to do so, we first created a large dense data frame in memory.

Tip 2: Use sparse option in pandas get_dummies

It is possible to create a sparse data frame directly, using the sparse parameter of pandas get_dummies. This parameter defaults to False; if True, the encoded columns are returned as SparseArrays. By setting sparse=True we create a sparse data frame directly, without ever holding the dense data frame in memory.
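For instance:

df_sparse = pd.get_dummies(df, columns=['user_id', 'item_id'],
                           sparse=True, dtype='uint8')

print(df_sparse.dtypes)
print_memory_usage(df_sparse)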

rating                      int64
user_id_1        Sparse[uint8, 0]
user_id_2        Sparse[uint8, 0]
user_id_3        Sparse[uint8, 0]
user_id_4        Sparse[uint8, 0]
                      ...
item_id_1678     Sparse[uint8, 0]
item_id_1679     Sparse[uint8, 0]
item_id_1680     Sparse[uint8, 0]
item_id_1681     Sparse[uint8, 0]
item_id_1682     Sparse[uint8, 0]
Length: 2626, dtype: object
Memory usage is 1.8 MB

Using the sparse option in one-hot encoding makes our workflow more efficient in terms of memory usage, as well as speed.

Let’s proceed with splitting the input and the target variables. We will create two sets of X, y vectors, using the dense and sparse data frames for comparison.

Split X, y
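A sketch of the split, keeping a dense and a sparse version side by side:

X = df_dense.drop(columns='rating')
y = df_dense['rating']

X_sparse = df_sparse.drop(columns='rating')
y_sparse = df_sparse['rating']

print_memory_usage(X)
print_memory_usage(X_sparse)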

Memory usage is 262.5 MB
Memory usage is 1.0 MB

Train-test split & Model training

Next, we move to sklearn to perform a train-test split on our data sets and train a Logistic Regression model. Although we used Factorization Machines as a reference model to create our training set, here we will train a simple Logistic Regression model in sklearn, only to demonstrate the differences in memory and speed between the dense and sparse data sets. The tips we discuss in this example are transferable to Python FM implementations such as xlearn.
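A minimal timing sketch; the split_and_train helper and the split parameters (test_size, random_state) are our own choices, not those of the original experiment.

import time

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def split_and_train(X, y, label):
    print(label)

    start = time.time()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)
    print('Train-test split: {:.2f} secs'.format(time.time() - start))

    start = time.time()
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    print('Training: {:.2f} secs'.format(time.time() - start))

split_and_train(X, y, 'Pandas dataframe')
split_and_train(X_sparse, y_sparse, 'Sparse pandas dataframe')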

Pandas dataframe
Train-test split: 6.75 secs
Training: 34.82 secs


Sparse pandas dataframe
Train-test split: 17.17 secs
Training: 41.69 secs

We notice that although X_sparse is smaller, processing it took longer than processing the dense X. The reason is that sklearn does not handle sparse data frames as such, according to the discussion here. Instead, sparse columns are converted to dense ones before being processed, causing the data frame size to explode.

Hence, the decrease in size achieved so far using sparse data types cannot be directly transferred into sklearn. At this point, we can make use of the scipy sparse formats and convert our pandas data frame into a scipy sparse matrix.

Tip 3: Convert to scipy sparse matrix

Scipy sparse matrices

The scipy package offers several types of sparse matrices for efficient storage. Sklearn and other machine learning packages such as imblearn accept sparse matrices as input. Therefore, when working with large sparse data sets, it is highly recommended to convert our pandas data frame into a sparse matrix before passing it to sklearn.

In this example we will use the lil and csr formats. The scipy docs describe the advantages and disadvantages of each format. To construct a matrix efficiently, it is advised to use either dok_matrix or lil_matrix. [source]

Below we define a function to convert a data frame to a scipy sparse matrix. We start by building a lil matrix column-wise and then convert it to csr.
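A sketch of such a function; we store the values as float32, and the names df_to_csr and X_csr are ours.

import numpy as np
from scipy import sparse

def df_to_csr(df):
    # lil_matrix supports efficient incremental construction
    mat = sparse.lil_matrix(df.shape, dtype=np.float32)
    for i, col in enumerate(df.columns):
        # densify one column at a time, so the full dense matrix
        # never has to be held in memory
        mat[:, i] = np.asarray(df[col]).reshape(-1, 1)
    # csr supports fast row slicing and arithmetic, and is what sklearn expects
    return mat.tocsr()

X_csr = df_to_csr(X_sparse)

# a csr matrix stores three arrays: data, indices and indptr
mb = (X_csr.data.nbytes + X_csr.indices.nbytes + X_csr.indptr.nbytes) / 10 ** 6
print('Memory usage is {} MB'.format(mb))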

Memory usage is 2.000004 MB

Let’s repeat the train-test split and model training, this time with the csr matrix.
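Reusing the timing helper from the previous section:

split_and_train(X, y, 'Pandas dataframe')
split_and_train(X_sparse, y_sparse, 'Sparse pandas dataframe')
split_and_train(X_csr, y, 'Scipy sparse matrix')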

Pandas dataframe
Train-test split: 0.82 secs
Training: 3.06 secs


Sparse pandas dataframe
Train-test split: 17.14 secs
Training: 36.93 secs


Scipy sparse matrix
Train-test split: 0.05 secs
Training: 1.58 secs

Both train_test_split and model training were significantly faster when using the scipy sparse matrix. Thus, we conclude that working with scipy sparse matrices is the most efficient option.

The advantage of sparse matrices will be even more apparent in larger data sets or data sets with higher sparsity.

Takeaways

  1. We can make use of pandas sparse dtypes while working with large sparse data frames in pandas
  2. We can also exploit the sparse option available in get_dummies, to automatically create sparse data frames
  3. We should consider converting our data to scipy sparse matrices when working with machine learning libraries

References

Rendle, S. (2010). Factorization Machines. In Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM). https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
