Quick Install Guide: Nvidia RAPIDS + BlazingSQL on AWS SageMaker

Iván Venzor
Towards Data Science
4 min read · Sep 27, 2019


RAPIDS was announced on October 10, 2018, and since then the folks at NVIDIA have worked day and night to add an impressive number of features with each release. The installation methods supported in the current version (0.9) are Conda and Docker (pip support was dropped in 0.7). In addition, RAPIDS is available for free in Google Colab, and Microsoft's Azure Machine Learning service is also supported.

However, there may be people like me who would like (or need) to use RAPIDS on AWS SageMaker, mainly because their data is already on S3. This is intended as a quick installation guide. It's far from perfect, but it might save you several hours of trial and error.

I will also include BlazingSQL, a SQL engine built on top of cuDF. As a data scientist, the ability to query data with SQL is extremely useful!

Requirements:

There are two main requirements to install RAPIDS on SageMaker:

  1. Obviously, you need a GPU instance. SageMaker currently offers only two types of accelerated instances: ml.p2 (NVIDIA K80) and ml.p3 (NVIDIA V100). However, as RAPIDS requires the NVIDIA Pascal architecture or newer, only ml.p3 instances will work.
  2. RAPIDS requires NVIDIA driver v410.48+ (for CUDA 10). AWS updated the driver in May; therefore, RAPIDS v0.7 was the first version that could be installed on SageMaker.
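To confirm the second requirement on a running instance, you can compare the driver version reported by `nvidia-smi` against the RAPIDS minimum. A small sketch (the helper name and the example version strings are mine, for illustration):

```python
# Sketch: check an NVIDIA driver version string against the RAPIDS
# minimum (410.48, the CUDA 10 driver). On the instance, the string
# comes from: nvidia-smi --query-gpu=driver_version --format=csv,noheader
MIN_DRIVER = (410, 48)

def driver_ok(version_string):
    """Compare a dotted driver version against MIN_DRIVER, tuple-wise."""
    parts = tuple(int(p) for p in version_string.strip().split("."))
    return parts >= MIN_DRIVER

print(driver_ok("418.87.00"))  # True: new enough for RAPIDS 0.9
print(driver_ok("396.44"))     # False: too old for CUDA 10
```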

Installation Procedure

The installation procedure for the current RAPIDS stable release (0.9) is as follows:

  1. Start or create your ml.p3 SageMaker notebook instance. Once the instance is InService, open it. I will be using JupyterLab for the remainder of this guide.
  2. In JupyterLab, go to Git -> Open Terminal to open a shell, and execute the following:
source /home/ec2-user/anaconda3/etc/profile.d/conda.sh
conda create --name rapids_blazing python=3.7
conda activate rapids_blazing

I strongly recommend creating a new environment. If you try to install RAPIDS in SageMaker's conda_python3 environment, it will take hours to solve the environment, and it is also likely to do strange things (for example, trying to install Python 2, which RAPIDS doesn't support).

3. Conda installs RAPIDS (0.9), BlazingSQL (0.4.3), and a few other packages (in particular, boto3 and s3fs are needed to work with S3 files), as well as some dependencies for the SageMaker package, which will be pip-installed in the next step. Note that in RAPIDS 0.9, dask-cudf was merged into the cuDF branch. Solving this environment takes about 8 minutes:

conda install -c rapidsai -c nvidia -c numba -c conda-forge \
-c anaconda -c rapidsai/label/xgboost \
-c blazingsql/label/cuda10.0 -c blazingsql \
"blazingsql-calcite" "blazingsql-orchestrator" \
"blazingsql-ral" "blazingsql-python" \
"rapidsai/label/xgboost::xgboost>=0.9" "cudf=0.9" \
"cuml=0.9" "cugraph=0.9" "dask-cudf=0.9" \
"python=3.7" "ipykernel" "boto3" \
"PyYAML>=3.10,<4.3" "urllib3<1.25,>=1.21" \
"idna<2.8,>=2.5" "boto" "s3fs" "dask" \
"anaconda::cudatoolkit=10.0"
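Once the solve completes, a quick way to confirm that the key packages landed in the new environment is to try importing them. A hypothetical helper (the function name and package list are mine; any list of module names works):

```python
import importlib

def missing_packages(names):
    """Return the subset of module names that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# An empty list means every package imported cleanly.
print(missing_packages(["cudf", "cuml", "cugraph", "blazingsql"]))
```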

4. Install the SageMaker and flatbuffers packages, and register the kernel so it can be used in JupyterLab:

pip install flatbuffers sagemaker
ipython kernel install --user --name=rapids_blazing

5. Wait about a minute, then open or create a notebook; you should be able to select the new kernel: Kernel -> Change Kernel -> conda_rapids_blazing. Note: please do not use the rapids_blazing kernel instead of conda_rapids_blazing, as BlazingSQL won't work with that kernel.

6. First, import the RAPIDS and BlazingSQL packages:

import cudf
import cuml
import dask
import pandas as pd
import dask_cudf
from blazingsql import BlazingContext
bc = BlazingContext()

We should get a “connection established” message.

7. Let's run a first test to check that cuDF is working:

df = cudf.DataFrame()
df['key'] = [0, 1, 2, 3, 4]
df['val'] = [float(i + 10) for i in range(5)]
print(df)

8. Test cuML:

df_float = cudf.DataFrame()
df_float['0'] = [1.0, 2.0, 5.0]
df_float['1'] = [4.0, 2.0, 1.0]
df_float['2'] = [4.0, 2.0, 1.0]
dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(df_float)
print(dbscan_float.labels_)
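For readers curious what to expect from `labels_` here: with `min_samples=1` every point is a core point, and points only share a cluster if they lie within `eps` of one another. Since all pairwise distances in this toy data exceed 1.0, each row becomes its own cluster. A minimal CPU sketch of the algorithm (an illustration of DBSCAN's logic, not cuML's implementation):

```python
import math

def dbscan(points, eps, min_samples):
    """Tiny DBSCAN: label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)  # None = unvisited
    cluster = -1

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1  # noise (may later become a border point)
            continue
        cluster += 1
        labels[i] = cluster
        seeds = [j for j in nbrs if labels[j] is None]
        while seeds:  # expand the cluster through density-connected points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_samples:
                seeds.extend(k for k in jn if labels[k] is None)
    return labels

# The three rows of df_float above, as (x, y, z) points:
print(dbscan([(1.0, 4.0, 4.0), (2.0, 2.0, 2.0), (5.0, 1.0, 1.0)],
             eps=1.0, min_samples=1))  # [0, 1, 2]
```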

9. If there were no errors, we have successfully imported and used basic cuDF and cuML functionality. The next step is to read and use data stored in S3, for example, some CSV files with gzip compression:

import boto3
import sagemaker
from sagemaker import get_execution_role
role = get_execution_role()
df = dask_cudf.read_csv('s3://your-bucket/your-path-to-files/files*.csv.gz', compression='gzip')
df2 = df.compute()
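For reference, the file format being read here is nothing exotic: a gzip-compressed CSV. A CPU-only standard-library sketch of the same decompress-and-parse step (dask_cudf does this on the GPU, across many files in parallel, which is where the speedup comes from):

```python
import csv
import gzip
import io

# Fake a small gzip-compressed CSV in memory (stand-in for one
# files*.csv.gz object on S3).
raw = gzip.compress(b"key,val\n0,10.0\n1,11.0\n")

# Decompress and parse it with the standard library.
with gzip.open(io.BytesIO(raw), mode="rt", newline="") as fh:
    rows = list(csv.DictReader(fh))

print(rows[0]["val"])  # 10.0
```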

10. Now we can use BlazingSQL to query our data:

bc.create_table('test', df2)
result = bc.sql('SELECT count(*) FROM test').get()
result_gdf = result.columns
print(result_gdf)
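If you want to sanity-check the SQL itself without a GPU, the same `COUNT(*)` query can be run against the toy data from step 7 using the standard library's sqlite3. This is only an illustration of the query, of course, not of BlazingSQL's GPU execution:

```python
import sqlite3

# In-memory SQLite stand-in for the cuDF table registered above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (key INTEGER, val REAL)")
conn.executemany("INSERT INTO test VALUES (?, ?)",
                 [(i, float(i + 10)) for i in range(5)])

# The same query BlazingSQL runs on the GPU.
count, = conn.execute("SELECT count(*) FROM test").fetchone()
print(count)  # 5
```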

I will try to update and extend this guide beyond the installation process. Meanwhile, here are three interesting results I got:

  • 7X speedup with cuDF (v0.10) read_csv compared to pandas read_csv.
  • 32X speedup with cuML LogisticRegression vs. scikit-learn LogisticRegression.
  • 7X speedup with GPU XGBoost ('tree_method': 'gpu_hist') vs. CPU XGBoost ('tree_method': 'hist').
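Speedup ratios like these can be measured with a simple wall-clock harness. A sketch of one way to do it (this helper is mine; it is not how the numbers above were necessarily produced), using best-of-N timing to reduce noise:

```python
import timeit

def speedup(fast_fn, slow_fn, repeat=3, number=1):
    """Return the slow/fast wall-time ratio, using best-of-`repeat` runs."""
    fast = min(timeit.repeat(fast_fn, repeat=repeat, number=number))
    slow = min(timeit.repeat(slow_fn, repeat=repeat, number=number))
    return slow / fast

# Usage would look like (GPU vs. CPU read of the same file):
#   speedup(lambda: cudf.read_csv(path), lambda: pd.read_csv(path))
```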

Conclusion:

Looking ahead, RAPIDS 0.10 will include some nice features for AWS users. For example, cudf.read_csv will be able to read S3 files directly, and a bug in dask_cudf.read_parquet when reading S3 files has been fixed and will also land in 0.10. I thank the RAPIDS team for their quick attention to, and resolution of, some of the GitHub issues I have reported.

Any comments on this guide are welcome. May the GPU speed up your analysis!


Advanced Analytics (DS & ML) Deputy Director at Banregio Bank; Neural Networks Lecturer at UANL; PhD; MSc Astrophysics; MBA; BSc Computer Science