
Train & Predict Sci-Kit Learn Models in Parallel

Parallelize your Model Training and Prediction on Sci-Kit Learn

Image taken by Marc Kargel from Unsplash

This article will outline how to train and tune certain models in scikit-learn in parallel to save time. I will also outline how you can parallelize the prediction process for these models. I’ll cover partitioning the data into multiple segments and generating predictions for each partition, as well as passing the entire dataset into multiple models. I’ve previously written an article on multiprocessing and how to use it for functions with single and multiple parameters; you can read it here, as we will be using the multiprocessing module throughout this article. The following is the outline of this article:

Table of Contents

  • Parallelization on Sci-Kit Learn
  • Data
  • Requirements
  • Generate Data
  • Training Models
  • Normal Training
  • Parallel Training
  • Parallelize Model Evaluation (Cross Validation)
  • Normal Cross Validation
  • Parallel Cross Validation
  • Parallelize Model Predictions – Singular Model
  • Architecture
  • Normal Implementation
  • Parallel Implementation
  • Parallelize Model Predictions – Multiple Models
  • Architecture
  • Normal Implementation
  • Parallel Implementation
  • Concluding Remarks
  • Resources

Parallelization on Sci-Kit Learn

Most machine learning projects have 4 major components which take a large amount of computational power and time.

  1. Model training

    • Training multiple ML models on various train test splits
  2. Hyper-parameter tuning of models

    • Tuning the various hyperparameters associated with a model in order to maximize its performance without overfitting to the training data
  3. Model evaluations

    • Evaluating the model across various evaluation methods like cross validation, accuracy, classification report, etc.
  4. Model prediction

    • Generating predictions for the model such that the inference time is low.
    • Inference time associated with a machine learning model refers to the amount of time it takes for a model to process the input data and generate predictions.
    • You want to maximize the model performance while minimizing the inference time. Having a low inference time aids in scalable machine learning in production environments.

These 4 components are crucial to the data science pipeline; each plays a major role and each takes a large amount of time. Thankfully, when doing machine learning modelling with scikit-learn (an open-source library for model development), much of the parallelization is already built into the common functions and models you train. The n_jobs parameter of a scikit-learn function specifies the number of cores to use when running that function.

n_jobs is an integer, specifying the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. For example with n_jobs=-2, all CPUs but one are used. n_jobs is None by default, which means unset; it will generally be interpreted as n_jobs=1 – [1] https://scikit-learn.org/stable/glossary.html#term-n-jobs

Do note that not all functions and models have this parameter available, meaning that not all features of the sklearn module are parallelized. For example, models like Random Forest and K-Nearest Neighbours have the n_jobs parameter, while models like Elastic Net and Gradient Boosting do not. To identify whether this parameter is available for a given sklearn feature, go to its documentation and check whether n_jobs appears in the parameters section.
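For estimators that do expose it, n_jobs is set at instantiation time. A minimal illustration using KNeighborsClassifier and RandomForestClassifier, both of which support the parameter:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# n_jobs=-1 tells the estimator to use all available CPU cores
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
```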

n_jobs parameter from KNeighborsClassifier. Image provided by the author.

Data

We’re going to synthesize data for the tutorial below. I’ll highlight how to generate some fake data and then train, tune, evaluate and predict with machine learning models both the regular way and through parallelization. The main modules we will be referencing for this tutorial are sklearn, multiprocessing, pandas and numpy. The following are the versions and requirements necessary to follow along. If you just want to reference the Jupyter Notebook associated with this tutorial, you can find it on my GitHub here.

Requirements

Python>=3.8.8
pandas>=1.2.4
numpy>=1.20.1
sklearn>=0.24.1

We will also be dependent on the multiprocessing, random and itertools libraries. Do not worry as those libraries come pre-installed with Python.

Generate Data
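Below is a minimal sketch of the kind of data-generation function used here. The exact column names (book_name, book_pages, year_published, book_rating, price, book_genre, book_language) and value ranges are assumptions made for illustration.

```python
import random
import string

import numpy as np
import pandas as pd


def generate_data(n_rows=1000, path='book_data.csv'):
    """Synthesize a fake book dataset and save it as a CSV.

    Column names and value ranges are illustrative assumptions."""
    random.seed(42)
    np.random.seed(42)
    df = pd.DataFrame({
        'book_name': [''.join(random.choices(string.ascii_lowercase, k=8))
                      for _ in range(n_rows)],
        'book_pages': np.random.randint(50, 1200, size=n_rows),
        'year_published': np.random.randint(1950, 2022, size=n_rows),
        'book_rating': np.round(np.random.uniform(1, 5, size=n_rows), 2),
        'price': np.round(np.random.uniform(5, 120, size=n_rows), 2),
        # classification targets used later by the genre and language models
        'book_genre': np.random.randint(0, 5, size=n_rows),
        'book_language': np.random.randint(0, 3, size=n_rows),
    })
    df.to_csv(path, index=False)
    return df


book_df = generate_data()
print(book_df.head())
```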

The function above will randomly generate a CSV of book-related data. The data will be used to train two models, one normally without parallelization, and another through parallelization. There won’t be any feature engineering or preprocessing done on the data, as the scope of this article is to highlight how to train and predict with sklearn models. The function above should yield a sample DataFrame as shown below:

Sample book data generated. Image provided by the author.

Training Models

Normal Training

The gradient boosting classifier doesn’t have an n_jobs parameter, so you’re unable to parallelize the model training process for this model.
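A minimal sketch of this single-core training step, assuming the synthetic book data and feature columns from the generation step above:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# numeric feature columns from the synthetic book data
features = ['book_pages', 'year_published', 'book_rating', 'price']

X_train, X_test, y_train, y_test = train_test_split(
    book_df[features], book_df['book_genre'], test_size=0.3, random_state=42
)

# GradientBoostingClassifier has no n_jobs parameter, so it trains on a single core
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)
print(gb_model.score(X_test, y_test))
```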

Parallel Training

The random forest classifier has the n_jobs parameter, so you can set it to the number of cores you want to use.
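A corresponding sketch for the random forest, reusing the train/test split from the snippet above; n_jobs=-1 uses all available cores:

```python
from sklearn.ensemble import RandomForestClassifier

# reuses X_train/y_train from the normal-training snippet;
# n_jobs=-1 builds the trees of the forest across all available cores
rf_model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf_model.fit(X_train, y_train)
print(rf_model.score(X_test, y_test))
```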

Parallelize Model Evaluation (Cross Validation)

Normal Cross Validation
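A minimal sketch of sequential cross validation with cross_val_score, leaving n_jobs at its default so the folds run on a single core (data and features as defined earlier):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# n_jobs defaults to None, so the 5 folds are evaluated sequentially
scores = cross_val_score(
    GradientBoostingClassifier(random_state=42),
    book_df[features], book_df['book_genre'], cv=5
)
print(scores.mean())
```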

Parallel Cross Validation
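The same evaluation parallelized by passing n_jobs=-1 to cross_val_score, which distributes the folds across all available cores:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# n_jobs=-1 runs the cross-validation folds on all available cores
scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    book_df[features], book_df['book_genre'], cv=5, n_jobs=-1
)
print(scores.mean())
```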

Parallelize Model Predictions – Singular Model

We’re going to split the input data we pass into the model into multiple batches. The number of batches corresponds to the number of cores you want to use to parallelize the predictions, and each batch will be roughly equal in size.

Architecture

Model architecture for splitting the input data into multiple partitions and passing each partition through the model in parallel. Image provided by the author.

Normal Implementation
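A minimal sketch of a sequential prediction helper; the function name and signature are assumptions, but the parallel version below reuses it:

```python
def predict(data, model, feats):
    """Generate predictions for `data` using `model` on the columns in `feats`."""
    return model.predict(data[feats])


# sequential prediction over the full dataset
genre_preds = predict(book_df, rf_model, features)
```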

Parallel Implementation

For the parallel implementation, we will reuse the function we created during the normal implementation, splitting its data argument into n partitions using the array_split function provided by numpy.
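A sketch of that parallel version, assuming the predict helper above and using numpy's array_split together with a multiprocessing Pool (in a standalone script, the Pool call should sit under an if __name__ == '__main__': guard):

```python
from multiprocessing import Pool, cpu_count

import numpy as np


def parallel_predict(data, model, feats, n_cores=cpu_count()):
    """Split `data` into n_cores partitions and predict each partition in parallel."""
    partitions = np.array_split(data, n_cores)
    with Pool(n_cores) as pool:
        results = pool.starmap(predict, [(part, model, feats) for part in partitions])
    # stitch the partition-level predictions back together in order
    return np.concatenate(results)


genre_preds = parallel_predict(book_df, rf_model, features)
```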

Parallelize Model Predictions – Multiple Models

Now we’ll aim to pass two individual models to two different cores: the first model will be the genre model we’ve generated above, and the second will be the language model. We will pass the same data into each of these models and generate predictions for both models in parallel.

Architecture

Architecture for passing the same data through to multiple models in parallel. Each model will belong to its own core. Image provided by the author.

Normal Implementation

We’ll begin by creating model_data, a nested list which holds the model, the feature columns and the target name. We can iteratively pass the input data and the associated features to generate predictions.
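A minimal sketch of this sequential multi-model setup. The second (language) model is fitted here purely so the sketch is self-contained, and the multi_prediction name mirrors the function referenced in the parallel implementation below:

```python
from sklearn.ensemble import RandomForestClassifier

# genre model trained earlier; a second model for the language target,
# fitted here only to make the sketch self-contained
genre_model = rf_model
language_model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
language_model.fit(book_df[features], book_df['book_language'])

# each entry: [fitted model, feature columns, target name]
model_data = [
    [genre_model, features, 'book_genre'],
    [language_model, features, 'book_language'],
]


def multi_prediction(model_data, data):
    """Loop over (model, features, target) triples and predict sequentially."""
    predictions = {}
    for model, feats, target in model_data:
        predictions[target] = model.predict(data[feats])
    return predictions


preds = multi_prediction(model_data, book_df)
```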

Parallel Implementation

We can once again utilize the array_split function from numpy to split the model_data we created above and pass each chunk of model data to its own core. Each core will essentially run the multi_prediction function outlined in the normal implementation above.
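A sketch of the parallel version, assuming model_data and multi_prediction from the normal implementation above; the same if __name__ == '__main__' caveat applies to the Pool call:

```python
from multiprocessing import Pool

import numpy as np


def parallel_multi_prediction(model_data, data, n_cores=2):
    """Split model_data across n_cores and run multi_prediction on each chunk in parallel."""
    # dtype=object keeps each [model, features, target] row intact when splitting
    chunks = np.array_split(np.array(model_data, dtype=object), n_cores)
    with Pool(n_cores) as pool:
        results = pool.starmap(multi_prediction, [(chunk, data) for chunk in chunks])
    # merge the per-chunk prediction dictionaries back into one
    merged = {}
    for res in results:
        merged.update(res)
    return merged


preds = parallel_multi_prediction(model_data, book_df)
```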

Concluding Remarks

This article outlines how you can minimize the time spent on training, tuning and predicting models in scikit-learn through parallel processing. The majority of the heavy lifting for training your model in parallel is already done in-house by scikit-learn.

I provided an in-depth tutorial showcasing different methods to generate predictions in parallel. One method applies when you have multiple models and want to pass the data through each of those models to yield predictions; each of the models you are predicting with runs as a separate process in parallel. Another method applies when you have a large dataset for which you want to generate predictions more quickly. In this method, you partition your dataset into multiple smaller segments (at most, the number of segments should equal the number of cores you want to use). You can pass each segment along with the model to yield predictions in parallel, then merge the results back together.

Do note that this is mainly useful when you have a substantial number of cores and a large dataset. When running this process on a small amount of data, the normal (non-parallel) implementation will run in a similar amount of time as the parallel implementation. The notebook associated with this article runs on a small amount of data for efficiency, as I don’t have many cores or much memory available. However, you can easily increase the amount of data generated and thus increase the amount of data passed to the training, evaluation and prediction components.

Another way to optimize the amount of time it takes for your model to predict is through sparse matrices. This concept is outside of the scope of this article but if you’re interested, you can read about it here.

If you want to follow along with the code associated with this article, you can check out my GitHub where I’ve posted the notebook here.

If you’re looking to transition into the data industry and want mentorship and guidance from seasoned mentors, then you might want to check out Sharpest Minds. Sharpest Minds is a mentorship platform where mentors (seasoned practicing data scientists, machine learning engineers, research scientists, CTOs, etc.) aid in your development and learning to help you land a job in data. Check them out here.

Resources

[1] scikit-learn Glossary, n_jobs – https://scikit-learn.org/stable/glossary.html#term-n-jobs
If you enjoyed reading this article, you might also find others I’ve written about data science and machine learning to be interesting. Check them out below.

Recommendation System – Matrix Factorization (SVD) Explained

Link Prediction Recommendation Engines with Node2Vec

Recommendation Systems Explained

