
Speed Cubing for Machine Learning

Episode 1: How to optimize Python code, CPU and I/O utilization, along with deployment in the Cloud.

Photo by Linus Ekenstam on Unsplash

Introduction

At some point in a machine learning project, you will want to overcome the limitations of your local machine (number of cores, memory, etc.), whether you want to generate a large amount of data to feed deep learning networks or train your algorithms as fast as possible. In this article, we will take you on a journey, sharing the difficulties and the solutions we found, from a naive implementation to more advanced techniques, taking advantage of the various computational resources available in the Cloud. We hope this will help you manage your machine learning projects more efficiently and improve your productivity.

Generating numerous chunks of data (like… A lot!)

For our experiment, we want to be able to generate a large amount (let’s say 1,000,000) of chunks of data, in a few minutes, to serve as inputs for some Generative Adversarial Networks (GANs). If you have never heard of GANs, you can read my introductory article here [1].

The chunks we are talking about will be cubes of dimension n=100 (i.e. 1,000,000 data points each). This specific size of 1 megavoxel (a voxel being the volumetric equivalent of a pixel) has been chosen to match the current state of the art in generating fake faces, which can be seen on the website thispersondoesnotexist.com. The technique used there is called StyleGAN2 [2] (a variant of GANs), where generated images are of size 1024 x 1024, thus on the order of magnitude of 1 megapixel.

In order to reflect some data preparation from the real world, additional constraints will apply when building our cubes: they will be derived from dataframes and will be the result of stacking numpy arrays. For the sake of simplicity, those dataframes will have random values.

The objective here is to create cubes as fast as possible!

The first steps will be done on our local machine, whose specs are described in Table 1. As you can see, this is already a pretty nice setup.

Table 1: Specifications of our local machine.

Input Data

We first generate 100 dataframes of size 1000 x 1000, containing random floats (rounded to 10 decimals), thanks to the numpy.random.rand function. The dataframes are then saved to disk using the pandas.DataFrame.to_csv function. Each csv file is about 13 MB.
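
As a minimal sketch (the file names and the data/ folder below are placeholders, not necessarily the exact ones we used), this generation step could look like:

# GENERATING 100 RANDOM DATAFRAMES AND SAVING THEM TO CSV (SKETCH)
import numpy as np
import pandas as pd

ndf, nrows, ncols = 100, 1000, 1000
for k in range(ndf):
    data = np.round(np.random.rand(nrows, ncols), 10)  # random floats, rounded to 10 decimals
    pd.DataFrame(data).to_csv(f"data/df_{k:03d}.csv", index=False)  # hypothetical output path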

For each cube, we will have to use all the dataframes to extract a random subset from them. Using the pandas.read_csv function, it takes 20.87s to read all the dataframes, hence 4.79 files/s. This is very slow: for the record, if we wanted to build 1,000,000 cubes at this pace, it would take more than 240 days!

Instead of using the csv format, let’s consider the parquet file format to save our dataframes. Using the fastparquet engine [3], each saved parquet file is now only 8 MB. If you want to learn more about the parquet file format, you can check the official website here [4]. This time, reading all 100 dataframes only takes 6.57s (or 15.21 files/s)! This represents a 3.2x speedup. These first results are gathered in Table 2.
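
A hedged sketch of the switch to parquet and of the timed read loop (paths and the timing helper are assumptions; only the I/O calls change compared to the csv version):

# SAVING TO PARQUET AND TIMING THE READ LOOP (SKETCH)
import glob
import time
import pandas as pd

# In the generation loop, the save call simply becomes:
#   pd.DataFrame(data).to_parquet(f"data/df_{k:03d}.parquet", engine="fastparquet")

files_list = sorted(glob.glob("data/*.parquet"))  # hypothetical folder
t0 = time.perf_counter()
dfs = [pd.read_parquet(f, engine="fastparquet") for f in files_list]
print(f"{len(dfs) / (time.perf_counter() - t0):.2f} files/s")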

Table 2: Reading speed comparison, csv versus parquet (100 files).

Building Our First Cubes

Photo by Max Henk on Unsplash

Following the constraints mentioned above, these steps will apply to build one cube:

  1. For each dataframe, we extract a sub-dataframe containing the first 100 rows and 100 random columns,
  2. Each sub-dataframe is then converted to a numpy array of dimension 100 x 100,
  3. The numpy arrays are stacked to create one cube.
# CREATING THE CUBES FROM FILES (READ IN A LOOP)
import random
import numpy as np
import pandas as pd

ncubes = 100
dim = 100
for n in range(ncubes):
    cube = np.zeros(shape=(dim, dim, dim))
    # 'files_list' is the list of the parquet file paths
    for i, f in enumerate(files_list):
        df = pd.read_parquet(f)
        df = df.head(dim)  # first 100 rows
        rnd_cols = random.sample(range(1, df.shape[1]), dim)  # 100 random columns
        df = df.iloc[:, rnd_cols]
        layer = df.to_numpy()
        cube[i, :, :] = layer  # stack the layer into the cube

The total time to get a batch of 100 cubes is about 661s (almost 11 minutes), hence a rate of 0.15 cubes/s.

I’m sure you already spotted the mistake here. Indeed, for each cube, we read the same 100 parquet files every time! In practice, you surely don’t want to re-read those files for every single cube. The next improvement, regarding the data structure, is going to fix this.

Improving Speed – Step 1: The Data Structure

Photo by Fezbot2000 on Unsplash

Since we don’t want to read the parquet files inside the loop for each cube, it could be a good idea to perform this task only once, upfront. So, we can build a dictionary df_dict, having file names as keys and dataframes as values. This operation is pretty fast: the dictionary is built in only 7.33s.
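
A minimal sketch of this one-off step, reusing the files_list of parquet paths from before (using file basenames as keys is an assumption):

# BUILDING A DICTIONARY OF DATAFRAMES, READ ONCE AND UPFRONT (SKETCH)
import os
import pandas as pd

# 'files_list' is the list of the parquet file paths
df_dict = {os.path.basename(f): pd.read_parquet(f) for f in files_list}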

Now, we are going to write a function to create a cube, taking advantage of that dictionary, which already holds the dataframes as its values.

# FUNCTION CREATING A CUBE FROM A DICTIONARY OF DATAFRAMES
def create_cube(dimc, dict_df):
    cube = np.zeros(shape=(dimc, dimc, dimc))
    for i, df in enumerate(dict_df.values()):
        df = df.head(dimc)
        rnd_cols = random.sample(range(1, df.shape[1]), dimc)
        df = df.iloc[:, rnd_cols]
        layer = df.to_numpy()
        cube[i, :, :] = layer
    return cube
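
As an illustration, the batch of 100 cubes can then be timed with a small sketch like the following (assuming the df_dict built above and the dim and ncubes values defined earlier):

# TIMING A BATCH OF 100 CUBES BUILT FROM THE DICTIONARY (SKETCH)
import time

t0 = time.perf_counter()
cubes = [create_cube(dim, df_dict) for _ in range(ncubes)]
print(f"{ncubes / (time.perf_counter() - t0):.2f} cubes/s")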

This time, creating 100 cubes only took 6.61s, for a rate of 15.13 cubes/s. This represents a 100x speedup compared to the previous version not using the dictionary of dataframes. Creating our batch of 1,000,000 cubes would now take nearly 20 hours, instead of the initial 240 days.

However, we still use dataframes to build our cubes; maybe it’s time to go full NumPy to increase our speed.

Improving Speed – Step 2: NumPy Rocks!

Photo by Hanson Lu on Unsplash

The previous idea of using a dictionary of dataframes was interesting, but it can be improved by building, from the very beginning, a numpy.ndarray derived from the parquet files, from which we will sub-sample along the columns to create our cubes. Let’s first create this big boy:

# CREATING THE RAW DATA (NUMPY FORMAT)
arr_data = np.zeros(shape=(100, 1000, 1000))
# 'files_list' is the list of the parquet file paths
for i, f in enumerate(files_list):
    df = pd.read_parquet(f)
    layer = df.to_numpy()
    arr_data[i, :, :] = layer

Then, we have to modify our create_cube function accordingly and implement a full vectorization:

# FUNCTION CREATING A CUBE FROM RAW DATA (FULL NUMPY VERSION)
def create_cube_np(dimc):
    # 'arr_data' is the (100, 1000, 1000) array built above
    rnd_cols = random.sample(range(1, 1000), dimc)
    # First 100 rows, 100 random columns (vectorization)
    cube = arr_data[:, :100, rnd_cols]
    return cube
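
A quick sanity check on the output shape (just a sketch, reusing dim = 100 from earlier):

# QUICK SANITY CHECK ON THE VECTORIZED VERSION (SKETCH)
cube = create_cube_np(dim)
print(cube.shape)  # expected: (100, 100, 100)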

Using this new version, we are able to create 100 cubes in just 1.31s, hence a nice rate of 76.26 cubes/s.

Now, we can move on to the next step to go even faster. You guessed it, time for parallelization!

Improving Speed – Step 3: Parallelization

Photo by Marc-Olivier Jodoin on Unsplash

There are several ways to perform parallelization in Python [5][6]. Here, we will use the native multiprocessing Python package along with the imap_unordered function to run asynchronous jobs. We plan to take advantage of the 12 cores of our local machine.

# PARALLELIZATION
from multiprocessing.pool import ThreadPool

proc = 12  # Number of workers
ncubes = 100
dim = 100

def work(none=None):
    return create_cube_np(dim)

with ThreadPool(processes=proc) as pool:
    cubes = pool.imap_unordered(work, (None for i in range(ncubes)))
    for n in range(ncubes):
        c = next(cubes)  # Cube is retrieved here

The ThreadPool class is used here (instead of the usual Pool) because we want to:

  • stay inside the same process,
  • avoid transferring data between processes,
  • get around the Python Global Interpreter Lock (GIL) by relying on numpy-only operations (most heavy numpy computations release the GIL).

You can learn more about the difference between multiprocessing and multithreading in Python in this nice blog post [7].
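
If you want to convince yourself that the threads are not serialized by the GIL here, a quick sketch (not part of the benchmark itself) is to rerun the same pool with 1 worker and then with 12 workers and compare the rates; if the GIL were serializing the numpy work, the two rates would be nearly identical:

# COMPARING 1 WORKER VS 12 WORKERS (SKETCH)
import time

for workers in (1, 12):
    t0 = time.perf_counter()
    with ThreadPool(processes=workers) as pool:
        list(pool.imap_unordered(work, (None for _ in range(ncubes))))
    print(f"{workers} worker(s): {ncubes / (time.perf_counter() - t0):.1f} cubes/s")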

Using this multithreading approach, we only need 0.28s to create one batch of 100 cubes. We reach a very good rate of 355.50 cubes/s, hence a 2370x speedup compared to the very first version (Table 3). Regarding our 1,000,000 cubes, the generation time has dropped under an hour.

Table 3: Local generation of cubes, speed results.

Now, it’s time to fly by using virtual machine instances in the Cloud!

Improving Speed – Step 4: The Cloud

Photo by SpaceX on Unsplash

If we talk about Machine Learning as a Service (MLaaS), the top 4 cloud solutions are Microsoft Azure, Amazon Web Services (AWS), IBM Watson and Google Cloud Platform (GCP). In this study, we chose GCP, but any other provider would do the job. You can select or customize your own configuration among a lot of different virtual machine instances, and then execute your code inside a notebook.

The first question you want to ask yourself is the following one:

"What kind of instance do I want to create that matches my computation needs ?"

Basically, you can find three types of machines: general-purpose, memory-optimized or compute-optimized (Table 4).

Table 4: Machine type recommendations for different workloads.

To compute the numpy.ndarray from Step 2, the parquet files are first stored in a bucket in the Cloud. Then, several tests are conducted on different VM instances (Table 5), keeping the same multithreading code as in Step 3 and progressively increasing the number of vCPUs (workers). An example of results for one virtual machine is presented in Table 6.
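
As a hedged sketch, one simple way to make the bucket’s parquet files available to the VM is to copy them locally first (the bucket and folder names below are placeholders); the cube-building code from Steps 2 and 3 then runs unchanged:

# COPYING THE PARQUET FILES FROM A GCS BUCKET TO THE VM (SKETCH)
# In a terminal on the VM (placeholder bucket and folder names):
#   gsutil -m cp "gs://my-bucket/data/*.parquet" ./data/
import glob

files_list = sorted(glob.glob("data/*.parquet"))  # then reuse the Step 2/3 code as-is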

Table 5: Selection of some virtual machine instances.
Table 6: Speed results for the "n2-highcpu-80" instance.

In a terminal connected to your virtual machine, you can also visualize the activity of your vCPUs using the htop Linux command (Figure 1).

Figure 1: A look at our army of 160 cores working together ("m1-ultramem-160" instance).

Conclusion

Figure 2: Performance of all the tested VM instances.

Looking at Figure 2, except for the m1-ultramem-160 instance (which is the most expensive), all the other instances perform pretty well and follow the same pattern: the rate increases almost linearly with the number of workers and peaks at around 60 vCPUs. Beyond that limit, the rate drops drastically, most probably because of multithreading overhead.

Among our selection, the winner is the n2-highcpu-80 instance (the second cheapest), reaching a rate of 2026.62 cubes/s, almost 2 billion data points per second. At this pace, we can generate 1,000,000 cubes in only 8 minutes.

Our initial objective was successfully achieved!

This whole experiment demonstrates that not only the code matters, but the hardware too. We started at a rate of 0.15 cubes/s on our local machine and reached a very quick rate of 2,027 cubes/s using the Cloud. This is more than a 13,500x speedup!

And this is just the beginning… We can level up by using more advanced technologies and infrastructures. That will be the subject of Episode 2.

References

[1] N. Morizet, Introduction to Generative Adversarial Networks (2020), Advestis Tech Report.

[2] T. Karras et al., Analyzing and Improving the Image Quality of StyleGAN (2019), NVIDIA Research.

[3] "fastparquet" official documentation (2020).

[4] "Apache parquet" official documentation (2020).

[5] Multiprocessing – Process-based parallelism (2020), The Python Standard Library.

[6] "Ray" framework official documentation (2020).

[7] N. Grigg, Illustrating Python multithreading vs multiprocessing (2015), blog post from nathangrigg.com.

About Us

Advestis is a European Contract Research Organization (CRO) with a deep understanding and practice of statistics, and interpretable machine learning techniques. The expertise of Advestis covers the modeling of complex systems and predictive analysis for temporal phenomena. LinkedIn: https://www.linkedin.com/company/advestis/
