
Doing Exploratory Data Analysis

Is it possible to do it entirely on a GPU?

I always felt in the dark about how useful GPUs are when we aren't interested in training deep learning models. Recently, however, I read a few articles about speeding up Pandas processing, and they left me curious. So I decided to power up my Data Science machine, throw a big file at the GPU, and see if I could do an entire EDA exercise on it. I felt a ray of sunlight in my darkness.

So can EDA be done entirely on a GPU? Let’s find out!


My Data Science Machine

Let us start by getting an overview of the host computer.

The main highlights of the system are:

I assembled the system myself and upgraded components, boards and GPU cards, as the price point allowed. This system can handle most tasks and should be sufficient for our challenge! I hope you agree it is safe to say there are no hardware impediments.


A data set to experiment on

I chose a data file with 7M records for the GPU experiment. The 7+ Million Company Dataset is described on Kaggle [1] and used in other articles such as "Exploratory Data Analysis of 7 Million Companies using Python."

The file is 1.1 GB, with 7+ million rows and 11 columns. It kills Excel, which cannot handle that many rows. Alteryx swallows it whole. Power BI and Power Query can work with the file, but I had to build summary views to get anything into Excel.

Before downloading and using a dataset, it is best to check the license and determine ownership and permissions. The 7+ Million Company dataset is licensed under Creative Commons CC0 1.0: "You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission."

There are also publisher-specific rules, such as here on Towards Data Science. So please be careful with the choice of dataset and source, matching the license to the publisher's terms & conditions.


GPU set-up

Usually, calling nvidia-smi is the first thing I do to check whether a GPU is available. We see that CUDA 11.4 is available with NVIDIA driver 470.141.03. All consistent with the set-up I did when I installed the Quadro earlier this year.
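From a Jupyter notebook, the same check is a shell escape away; the output lists the driver and CUDA versions and each process's memory allocation:

# show driver/CUDA versions and per-process GPU memory use
!nvidia-smi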

You might notice that there are already some workloads on the GPU, which only leads to problems: almost 1 GB of the card's RAM is already allocated. Therefore, we must shift the visual display workloads to the Intel onboard GPU. Re-configuring the Xorg screen settings freed the NVIDIA GPU and dedicated it to data work! I do enjoy watching Netflix in 4K!
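For reference, the idea is to point Xorg at the integrated Intel GPU so the display allocates nothing on the NVIDIA card. A minimal sketch; the file path, driver, and BusID below are assumptions, so confirm yours with lspci:

# /etc/X11/xorg.conf.d/10-intel.conf (hypothetical path and BusID)
Section "Device"
    Identifier "IntelGPU"
    Driver     "modesetting"
    BusID      "PCI:0:2:0"
EndSection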

Next, we need to install RAPIDS.

Copy the command generated by the RAPIDS release selector and paste it into the terminal. Let conda take care of everything; just make the correct selections based on your system to avoid conflicts and issues. Finally, activate the environment (rapids-22.08), launch Jupyter, and do a quick test.
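For my set-up, the selector produced something like the following (a sketch; python=3.9 is my assumption, and you should match the CUDA version to your driver):

conda create -n rapids-22.08 -c rapidsai -c conda-forge -c nvidia \
    rapids=22.08 python=3.9 cudatoolkit=11.4
conda activate rapids-22.08
jupyter notebook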

OK, we understand our hardware and have our software configured; now it is time for some EDA. Thanks to the RAPIDS team, I didn't face a complicated dependency challenge.


Let’s just be lazy and throw the file on the card.

Since RAPIDS is installed in its own environment, you must first activate that environment. I used the Anaconda Navigator and simply selected from the available environments. I am using rapids-22.08, the environment created by the RAPIDS conda install command – see earlier.

I was surprised by the quick and dirty approach. It worked!

Let us go through the code in a few blocks

Loading the modules

Importing the usual suspects plus cuDF. As I am unsure how far I can go, I am using both cudf (as cu) and pandas (as pd).
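The imports, roughly; numpy and matplotlib are my assumption for the "usual suspects":

import cudf as cu                 # GPU dataframes (RAPIDS)
import pandas as pd               # CPU dataframes, for comparison
import numpy as np                # assumed "usual suspect"
import matplotlib.pyplot as plt   # assumed "usual suspect"
from datetime import datetime     # for the simple timings below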

Building a GPU-based data frame

Note the .read_csv() method, which looks just like the pandas method, though we would need to get under the hood to see what is different. For now, I just want to get a feel for how it performs, if at all!
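A minimal sketch of the load, assuming the imports above and the Kaggle file name companies_sorted.csv:

start = datetime.now()
df = cu.read_csv('companies_sorted.csv')   # parsed and placed directly in GPU memory
print("GPU load took:", datetime.now() - start)
df.info()                                  # columns, dtypes and memory use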

Use the traditional approach of RAM and CPU computation

For the benchmark, and perhaps to try a cudf.DataFrame.from_pandas!
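The CPU equivalent, with the same assumed file name (pddf is the name used later in the article):

start = datetime.now()
pddf = pd.read_csv('companies_sorted.csv')   # parsed into host RAM by pandas
print("CPU load took:", datetime.now() - start)

# an existing pandas frame can also be copied onto the GPU
gdf = cu.DataFrame.from_pandas(pddf)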

Results

Surprisingly, this worked out of the box. Previously, I did a lot of 'sudo-ing' and painful resolution of dependency issues.

I am still surprised that such a lazy approach worked – just about 2 seconds to read that large text file and transfer it to the GPU. Impressive! The file is 1.1 GB on disk; notice the memory use is 1.3+ GB, hmm! Alteryx reported 6.4 seconds through Parallels, with Windows 11, on a Mac Mini M1. Power Query was still running a query in the background after several minutes.

Unsurprisingly, loading with Pandas works out of the box. It took 16 seconds to read the file, create the data structure, and load the data frame into RAM – almost 8× slower. But check out the memory footprint: 602.0+ MB versus 1.3+ GB. To the eye, the data frames have the same columns and datatypes on both devices.

Using nvidia-smi again

Indeed, the GPU now reports Python as a user, with 1,605 MiB of memory allocated. This is impressive.


Cleaning

So we know we can load the large file to the GPU, but what about cleaning? Missing values, in particular, can be computationally expensive.

# if any values are missing, print a per-column summary
if df.isnull().values.any():
    missing = df.isnull().sum()
    print(missing)
    print(" ")
    # cardinality of each column
    for col in df.columns:
        print(col, "\t", df[col].nunique())

# drop every row with a missing value, then confirm what remains
df.dropna(axis=0, inplace=True)
df.notnull().sum()

Basic code to print a missing-value summary, show the number of unique items per column, drop all rows with missing values, and then dump a summary of the non-missing values that remain.

That code is lightning-fast on the GPU but is slow on the conventional CPU/RAM approach.
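To produce the timings below, I wrapped the calls in a simple timer (a sketch of the pattern; %%timeit is avoided for the reason given under Warning further down):

start = datetime.now()
missing = df.isnull().sum()   # the same call is timed against df (GPU) and pddf (CPU)
print("That took :", datetime.now() - start)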

Summarize missing (CPU/RAM): 0:00:30.536083
Summarize missing (GPU):     0:00:01.035225

30 seconds on a conventional configuration versus almost nothing on the GPU. Aggregation and groupby are also extremely fast, and the syntax is the same as Pandas.

df.groupby('size range').agg({
    'year founded': ['min', 'max', 'count'],
    'country': lambda x: x.nunique(),
    'current employee estimate': 'median',
})

# keep only plausible founding years, on both the GPU and CPU frames
df = df[df['year founded'] < 2023].copy()
pddf = pddf[pddf['year founded'] < 2023].copy()

Subsetting and copying data frames also behaves the same in both approaches. Cool!

Visualization

OK, cleaning a dataframe works better on the GPU, but what about a few graphs?

Well, here is the news you might already be waiting for: cuDF does not wrap matplotlib the way Pandas does. Therefore, we cannot plot directly off the GPU dataframe. Instead, we must copy the data from the GPU card back to the motherboard's RAM.

df.to_pandas().plot()   # copy back to host RAM as pandas, then plot as usual
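In practice, you would aggregate on the GPU first and move only the small result across. A sketch, using the country column from the earlier groupby:

import matplotlib.pyplot as plt

# reduce on the GPU, copy the tiny result to the CPU, then plot
df['country'].value_counts().head(10).to_pandas().plot(kind='barh')
plt.title('Top 10 countries by company count')
plt.show()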

Naturally, that is where we stop: I cannot do an entire EDA exercise on a GPU after all. But I could prepare a dataframe pretty quickly, and it was virtually painless and speedy!

Warning

GPU devices do not behave like the conventional approach. Cell magics like %%timeit cause memory errors on the GPU, presumably because the repeated runs keep allocating device memory.

del pddf   # release the host RAM held by the pandas frame
del df     # release the GPU memory held by the cuDF frame

Deleting a dataframe releases its allocation on the device and allows more data frames to be loaded. You must take care to release allocations, or the device will complain.

Repository

I loaded my code to GitHub, and you can find the notebook at the following link.

ReadingList/cudf.ipynb at main · CognitiveDave/ReadingList

Why not join Medium? $50 invested over 12 months is a good return.

Join Medium with my referral link – David Moore

References / Citations

[1] The 7+ Million Company Dataset, People Data Labs, via Kaggle, licensed under Creative Commons CC0 1.0. You can request a copy directly from People Data Labs and read more about them here. The data is used here for loading on a GPU and for the creation of tactical summaries.

