
How to Empower Pandas with GPUs

A quick introduction to cuDF, an NVIDIA framework for accelerating Pandas


Photo by BoliviaInteligente on Unsplash

Pandas remains a crucial tool in data analytics and machine learning, offering extensive capabilities for reading, transforming, cleaning, and writing data. However, its efficiency with large datasets is limited, which hinders its use in production environments and in building resilient data pipelines, despite its widespread adoption in data science projects.

Similar to Apache Spark, Pandas loads the data into memory for computation and transformation. But unlike Spark, Pandas is not a distributed compute platform, so everything must be done on a single system's CPU and memory (single-node processing). This limitation restricts the use of Pandas in two ways:

  1. Pandas on a single system cannot handle datasets larger than that system's memory.
  2. Even for data that does fit into a single system's memory, processing it on a single CPU may take considerable time.

Pandas on Steroids

The first issue is addressed by frameworks such as Dask. Dask DataFrame helps you process large tabular data by parallelizing Pandas across a distributed cluster of computers. In many ways, Pandas empowered by Dask is similar to Apache Spark (however, Spark still handles large datasets more efficiently, which is why it remains the preferred tool among data engineers).
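As a rough illustration (with placeholder file and column names, not tied to anything in this article), Dask keeps the familiar Pandas syntax but splits the work into partitions that can run in parallel on one machine or across a cluster:

import dask.dataframe as dd

ddf = dd.read_csv("data.csv")                                   # lazy, partitioned read
result = ddf.groupby("category")["value"].mean().compute()      # .compute() triggers parallel execution
print(result)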

Although Dask enables parallel processing of large datasets across a cluster of machines, in reality, the data for most Machine Learning projects can be accommodated within a single system’s memory. Consequently, employing a cluster of machines for such projects might be excessive. Thus, there is a need for a tool that efficiently executes Pandas operations in parallel on a single machine, addressing the second issue mentioned earlier.

Whenever someone talks about parallel processing, the first word that comes to most engineers' minds is GPU. For a long time, running Pandas on a GPU for efficient parallel computing was only a wish. That wish came true with the introduction of NVIDIA RAPIDS cuDF. cuDF (pronounced "KOO-dee-eff") is a GPU DataFrame library for processing data in parallel on the GPU.
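To give a sense of the API before diving into installation, a minimal cuDF sketch looks like the following (the file and column names are placeholders, the same ones used in the Dask sketch above):

import cudf

gdf = cudf.read_csv("data.csv")                   # the DataFrame lives in GPU memory
print(gdf.groupby("category")["value"].mean())    # Pandas-style syntax, computed on the GPU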

At NVIDIA GTC 2024, it was announced that RAPIDS cuDF can now bring GPU acceleration to 9.5 million pandas users without requiring them to change their code.

Installation

Before proceeding with the installation, ensure that you have access to an NVIDIA GPU on your system. If you possess an NVIDIA graphics card locally, you can execute cuDF on your machine. Alternatively, if you lack such hardware, you can still run your cuDF code on cloud platforms like Google Colab, which offers sufficient GPU resources for your use case. Personally, I have a GeForce RTX 3090 installed on my system, hence I opted to conduct this test locally. However, beginners may find it more convenient to run the cuDF codes in this tutorial on a Google Colab instance.

In addition, as of April 2024, cuDF runs only on Linux and requires Python 3.9 or higher. Please check this page for the full system requirements before starting the installation.

To make the guide easier to follow, I've divided the installation into two sections: the first helps users who want to install cuDF on their local system, and the second walks Google Colab users through the setup.

Installing on a System with a GPU

As the cuDF name suggests, it works on the CUDA platform (developed by NVIDIA). Therefore, you first need to install the CUDA toolkit on your system. I suggest installing CUDA 12.0 if you don't have it already; otherwise, any CUDA 11 or newer version should work fine.

If you are planning to run cuDF on your local system, after installing the CUDA toolkit, you need to install the RAPIDS framework. Installing RAPIDS using the UI command builder is easy and straightforward. You need to navigate to https://docs.rapids.ai/install and select your system CUDA and Python versions. You can choose the Standard Package (which includes cuDF, cuML, cuGraph, cuSpatial, cuProj, cuxfilter, cuCIM, RAFT) or choose cuDF only through "Choose Specific Package". The following screenshot is what I chose for my system.
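For reference, if you only need cuDF, the pip command the builder generates looks roughly like the one below; the exact package name (cudf-cu11 vs. cudf-cu12) and index URL depend on the CUDA and Python versions you select, so copy the command from the builder rather than from here.

 pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com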

Google Colab

If you are planning to run cuDF on Google Colab, you have an easier path: Google Colab and many similar services already have the CUDA toolkit installed on their GPU instances.

To install RAPIDS, as this document from Google explains in detail, you need to take two steps. Step 1 is to choose a Colab runtime that uses a GPU accelerator. Step 2 is to run the following commands in your notebook to install RAPIDS on your Colab instance via pip.

 !pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
 !rm -rf /usr/local/lib/python3.8/dist-packages/cupy*
 !pip install cupy-cuda11x 

Please read the "Getting Started with RAPIDS on Colab" section of the mentioned document for more details.
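Whichever route you take, a quick sanity check is to import cuDF and print its version:

import cudf

print(cudf.__version__)   # prints the installed cuDF version if the setup worked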

An Example to Test

To compare cuDF Pandas with standard Pandas, I ran a test using a fake dataset with 10 million records. You can find all the required files in this repository: https://github.com/tamiminaser/cudf-examples/tree/master/example1

First, we need to create a large test dataset. If you have access to a big dataset that you can load into Pandas, you can go ahead with your dataset. I decided to create a fake CSV dataset with 10 million rows and use it for this test. I used the following Python code (link to repo) to create my fake dataset and called it data_10000000.csv.

import csv
import datetime

from faker import Faker
from tqdm import tqdm

def datagenerate(records, headers):
    fake = Faker('en_US')
    with open("data_10000000.csv", 'wt', newline='') as csvFile:
        writer = csv.DictWriter(csvFile, fieldnames=headers)
        writer.writeheader()
        # Generate one fake record per row, with a progress bar from tqdm.
        for _ in tqdm(range(records)):
            writer.writerow({
                "user_id": fake.unique.random_int(min=1111111111111, max=9999999999999),
                "name": fake.name(),
                "date_of_birth": fake.date(pattern="%d-%m-%Y", end_datetime=datetime.date(2000, 1, 1)),
                "country_of_birth": fake.country(),
                "email": fake.email(),
                "record_year": fake.year(),
                "score": fake.random_int(min=1, max=999),
            })

records = 10_000_000  # 10 million rows
headers = ["user_id", "name", "date_of_birth", "country_of_birth", "email", "record_year", "score"]
datagenerate(records, headers)
print("CSV generation complete!")

Creating such a 10-million-row CSV file takes a long time, and the output file will be about 8 GB. You can lower the records value in the code (for example, to 3M) to produce a smaller dataset. Also, I used my NVIDIA GeForce RTX 3090, which has 24 GB of memory, so I had no issue loading such a large dataset; if your GPU has less memory, you might need to work with a smaller dataset.

After producing the fake CSV file, you can run the following test on it. First, let's begin with standard Pandas and observe the processing time for this dataset. Below is the Jupyter Notebook I used for the first phase of the test. (Note: I limited the code to 3M records here so that I could loop the test 10 times for a more accurate comparison.)
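In essence, the standard-Pandas notebook does something along the following lines (a simplified sketch; the exact aggregations and join in the repository notebook may differ):

import time

import pandas as pd

# Load the generated CSV, limited to 3M rows for the timed runs.
df = pd.read_csv("data_10000000.csv", nrows=3_000_000)

# Cell #3 (roughly): a simple groupby aggregation.
start = time.perf_counter()
country_avg = df.groupby("country_of_birth")["score"].mean()
print(f"groupby took {time.perf_counter() - start:.3f} s")

# Cell #4 (roughly): a heavier join plus transformation.
start = time.perf_counter()
agg = df.groupby("country_of_birth", as_index=False)["score"].mean()
agg = agg.rename(columns={"score": "country_avg_score"})
merged = df.merge(agg, on="country_of_birth")
merged["score_diff"] = merged["score"] - merged["country_avg_score"]
print(f"join + transform took {time.perf_counter() - start:.3f} s")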

Now, let’s do the same processing using cuDF Pandas.
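The cuDF notebook is essentially the same code; the only addition is the extension magic in the first cell (again shown as a simplified sketch):

%load_ext cudf.pandas

import pandas as pd        # now transparently backed by cuDF on the GPU

# Same code as before; the operations now execute on the GPU.
df = pd.read_csv("data_10000000.csv", nrows=3_000_000)
country_avg = df.groupby("country_of_birth")["score"].mean()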

As you see, both notebooks are almost identical; the only difference is in the first cell of the cuDF notebook. In a Jupyter Notebook, simply adding %load_ext cudf.pandas before importing Pandas lets us use cudf.pandas instead of standard Pandas and enjoy all of its benefits (it could not be simpler).
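The same accelerator also works outside Jupyter: according to the RAPIDS documentation, an ordinary Python script (here a hypothetical my_script.py) can be run through the cudf.pandas module loader without changing any code:

 python -m cudf.pandas my_script.py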

The groupby operation (cell #3) took about 1.06 seconds with standard Pandas and only 122 ms with cuDF Pandas, meaning cuDF Pandas is roughly 8 to 9 times faster than standard Pandas on the CPU.

In cell #4, we performed a more complicated series of joins and transformations. This took about 6 seconds with standard Pandas but only about 109 ms with cuDF Pandas; in other words, cuDF Pandas processed the same data roughly 55 times faster.

Summary

As you see, cuDF provides an API that lets us transform Pandas dataframes on GPUs with minimal effort. In our test, cuDF was roughly 9 to 55 times faster than standard Pandas on a CPU, depending on the operation.

As mentioned at the beginning, Dask is another popular tool that lets us run Pandas transformations and jobs on a cluster of machines. cuDF and Dask can be combined to push performance even further, taking advantage of both distributed computing and GPUs to process data faster (read more).
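As a rough sketch of that combination (assuming the dask-cudf package is installed alongside cuDF; the file and column names come from the example dataset above):

import dask_cudf

# The familiar DataFrame API, now partitioned across GPU workers.
ddf = dask_cudf.read_csv("data_10000000.csv")
result = ddf.groupby("country_of_birth")["score"].mean().compute()   # .compute() triggers execution
print(result.head())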

Update

At Google I/O 2024, Laurence Moroney, the Head of AI Advocacy at Google, announced the integration of RAPIDS cuDF into Google Colab. This enables developers to significantly speed up pandas code, achieving up to a 50x performance boost on Google Colab GPU instances, all while allowing them to continue using pandas seamlessly as their data scales. Read more here.

