
This article outlines how to train and tune certain scikit-learn models in parallel to reduce runtime, and how to parallelize the prediction process for those models. I’ll cover partitioning the data into multiple segments and generating predictions for each partition, as well as passing the entire dataset into multiple models. I’ve previously written an article on multiprocessing and how to use it for functions with single and multiple parameters; you can read it here, as we will be using the multiprocessing module throughout this article. The following is the outline of this article:
Table of Contents
- Parallelization on Scikit-Learn
- Data
- Requirements
- Generate Data
- Training Models
- Normal Training
- Parallel Training
- Parallelize Model Evaluation (Cross Validation)
- Normal Implementation
- Parallel Implementation
- Parallelize Model Prediction – Singular Model
- Architecture
- Normal Implementation
- Parallel Implementation
- Parallelize Model Prediction – Multiple Models
- Architecture
- Normal Implementation
- Parallel Implementation
- Concluding Remarks
- Resources
Parallelization on Scikit-Learn
Most machine learning projects have 4 major components which take a large amount of computational power and time.
- Model training
  - Training multiple ML models on various train-test splits
- Hyper-parameter tuning of models
  - Tuning the various hyperparameters associated with the models in order to maximize model performance without overfitting to the original data
- Model evaluation
  - Evaluating the model across various evaluation methods like cross validation, accuracy, classification reports, etc.
- Model prediction
  - Generating predictions with the model such that the inference time is low.
  - Inference time refers to the amount of time it takes for a model to process the input data and generate predictions.
  - You want to maximize model performance while minimizing inference time. Having a low inference time aids in scalable machine learning in production environments.
These four components are crucial to the data science pipeline; each plays a major role and each takes a large amount of time. Thankfully, when working on machine learning modelling through scikit-learn (an open-source library for model development), much of the parallelization is already built into the common functions and models you train on. The n_jobs parameter of a scikit-learn function specifies the number of cores you want to use when running that function.
n_jobs is an integer, specifying the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. For example with n_jobs=-2, all CPUs but one are used. n_jobs is None by default, which means unset; it will generally be interpreted as n_jobs=1. – [1] https://scikit-learn.org/stable/glossary.html#term-n-jobs
Do note that not all functions and models have this parameter available, meaning that not all features of the sklearn module are parallelized. For example, models like Random Forest and K-Nearest Neighbours have the n_jobs parameter, whereas models like Elastic Net and Gradient Boosting do not. To identify whether this parameter is available for the sklearn feature you’re using, go to that feature’s documentation and check whether n_jobs appears in its parameters section.
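If you prefer to check programmatically, you can also inspect an estimator’s constructor signature. A minimal sketch:

```python
import inspect
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# check whether each estimator's constructor accepts an n_jobs argument
print("n_jobs" in inspect.signature(RandomForestClassifier.__init__).parameters)      # True
print("n_jobs" in inspect.signature(GradientBoostingClassifier.__init__).parameters)  # False
```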

Data
We’re going to synthesize data for the tutorial below. I’ll highlight how to generate some fake data, then train, tune, evaluate and predict with machine learning models both the regular way and through parallelization. The main modules we’ll reference in this tutorial are sklearn, multiprocessing, pandas and numpy. The following are the versions and requirements necessary to follow along. If you just want to reference the Jupyter Notebook associated with this tutorial, you can find it on my GitHub here.
Requirements
Python>=3.8.8
pandas>=1.2.4
numpy>=1.20.1
sklearn>=0.24.1
We will also be dependent on the multiprocessing, random and itertools libraries. Do not worry, as those libraries come pre-installed with Python.
Generate Data
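The data-generation helper itself lives in the notebook linked above; the sketch below shows what such a function could look like. The column names and value ranges are purely illustrative assumptions, and the later snippets in this article reuse this same schema: a few numeric features plus two categorical targets, genre and language.

```python
import random
import numpy as np
import pandas as pd

def generate_data(n_rows=10_000, path="book_data.csv"):
    """Synthesize a fake, book-related dataset and save it to a CSV (illustrative schema)."""
    genres = ["fiction", "non-fiction", "fantasy", "mystery"]
    languages = ["english", "french", "spanish", "german"]
    df = pd.DataFrame({
        "num_pages": np.random.randint(50, 1200, n_rows),
        "book_rating": np.round(np.random.uniform(1, 5, n_rows), 2),
        "price": np.round(np.random.uniform(5, 60, n_rows), 2),
        "num_reviews": np.random.randint(0, 5000, n_rows),
        "genre": [random.choice(genres) for _ in range(n_rows)],        # target 1
        "language": [random.choice(languages) for _ in range(n_rows)],  # target 2
    })
    df.to_csv(path, index=False)
    return df

df = generate_data()
df.head()
```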
The function above will randomly generate a CSV associated with book-related data. The data is going to be used to generate two models, one normally without parallelization and another through parallelization. There won’t be any feature engineering or preprocessing done on the data, as the scope of this article is to highlight how to train and predict on sklearn models. The function above should yield a sample DataFrame as shown below:

Training Models
Normal Training
The gradient boosting classifier doesn’t have an n_jobs parameter, so you’re unable to parallelize the model training process for this model.
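A minimal sketch of what this non-parallel training step could look like, using the illustrative schema from the data-generation sketch above:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# df comes from the data-generation step; feature and target names are illustrative
features = ["num_pages", "book_rating", "price", "num_reviews"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["genre"], test_size=0.25, random_state=42
)

# GradientBoostingClassifier exposes no n_jobs parameter, so it trains on a single core
gb_model = GradientBoostingClassifier()
gb_model.fit(X_train, y_train)
```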
Parallel Training
The random forest classifier has the n_jobs parameter, so you can set it to the number of cores you want to use.
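A sketch of the parallel training step, reusing the train-test split from above. Setting n_jobs=-1 uses every available core; an explicit positive integer caps the core count instead:

```python
import multiprocessing
from sklearn.ensemble import RandomForestClassifier

print(multiprocessing.cpu_count())  # number of cores available on this machine

# n_jobs=-1 spreads tree building across all available cores
rf_model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rf_model.fit(X_train, y_train)
```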
Parallelize Model Evaluation (Cross Validation)
Normal Cross Validation
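A sketch of a standard single-core cross validation run, assuming the df and features defined in the sketches above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

# n_jobs is left at its default (None), so the 5 folds are evaluated sequentially
scores = cross_val_score(GradientBoostingClassifier(), df[features], df["genre"], cv=5)
print(scores.mean())
```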
Parallel Cross Validation
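The same evaluation can be parallelized by passing n_jobs=-1 to cross_val_score, so each fold is scored on its own core. A sketch:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 spreads the 5 folds across all available cores
parallel_scores = cross_val_score(
    RandomForestClassifier(), df[features], df["genre"], cv=5, n_jobs=-1
)
print(parallel_scores.mean())
```

Note that setting n_jobs on both the estimator and cross_val_score can oversubscribe your cores, so it is usually enough to set it in one place.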
Parallelize Model Predictions – Singular Model
We’re going to split the input data we pass into the model into multiple batches of roughly equal size, one for each core you want to use to parallelize the predictions.
Architecture

Normal Implementation
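A minimal sketch of the non-parallel prediction step. The helper scores whatever chunk of data it is given, which is what makes it easy to parallelize later (the function name and arguments are illustrative):

```python
def predict(data, model, feature_cols):
    """Score one chunk of data with the given model."""
    return model.predict(data[feature_cols])

# normal implementation: the whole dataset is scored in a single call on one core
predictions = predict(df, rf_model, features)
```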
Parallel Implementation
For the parallel implementation, we will reuse the function we created in the normal implementation, but split its data argument into n partitions using the array_split function provided by numpy.
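A sketch of that parallel version, assuming the predict helper and rf_model from the sketches above. The rows are split into one partition per core with numpy’s array_split, and each partition is scored in its own process:

```python
import multiprocessing as mp
import numpy as np

n_cores = mp.cpu_count()

# split the row positions into one chunk per core, then slice the DataFrame accordingly
partitions = [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), n_cores)]

if __name__ == "__main__":  # the guard matters when running this as a script
    with mp.Pool(n_cores) as pool:
        # each partition is scored by the same model in a separate process
        results = pool.starmap(predict, [(part, rf_model, features) for part in partitions])
    predictions = np.concatenate(results)
```

Keep in mind that every worker process has to receive a pickled copy of its partition and the model, so on small datasets this overhead can cancel out the gains.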
Parallelize Model Predictions – Multiple Models
Now we’ll pass two individual models to two different cores: the first is the genre model we generated above, and the second is the language model. We will pass the same data into each of these models and generate predictions for both in parallel.
Architecture

Normal Implementation
We’ll begin by creating model_data, a nested list which holds each model, its feature columns and its target name. We can then iterate over it, passing the input data and the associated features into each model to generate predictions.
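A sketch of this setup. The genre model comes from the training step above; since a language model isn’t trained earlier in this text, one is fit here purely for illustration:

```python
from sklearn.ensemble import RandomForestClassifier

# the genre model was trained above; fit a second model on the language target
genre_model = rf_model
language_model = RandomForestClassifier(n_jobs=-1).fit(df[features], df["language"])

# model_data: one entry per model -> [model, feature columns, target name]
model_data = [
    [genre_model, features, "genre"],
    [language_model, features, "language"],
]

def multi_prediction(data, model_data):
    """Generate predictions for every (model, feature columns, target) triple."""
    predictions = {}
    for model, feature_cols, target in model_data:
        predictions[target] = model.predict(data[feature_cols])
    return predictions

# normal implementation: the models are scored one after another on a single core
preds = multi_prediction(df, model_data)
```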
Parallel Implementation
We can once again utilize the array_split function from numpy to split the model_data we created above and pass each chunk to a separate core. Each core will essentially run the multi_prediction function outlined in the normal implementation above.
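A sketch of that parallel version. model_data is split into one chunk per model, and each chunk is handed to multi_prediction in its own process:

```python
import multiprocessing as mp
import numpy as np

# one chunk of model_data per core (here, a single model per chunk);
# dtype=object keeps numpy from trying to broadcast the nested entries
model_chunks = np.array_split(np.array(model_data, dtype=object), len(model_data))

if __name__ == "__main__":
    with mp.Pool(len(model_data)) as pool:
        results = pool.starmap(multi_prediction, [(df, chunk) for chunk in model_chunks])

    # merge the per-process results back into a single {target: predictions} mapping
    predictions = {target: preds for result in results for target, preds in result.items()}
```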
Concluding Remarks
This article outlines how you can minimize the time spent on training, tuning and predicting models in scikit-learn through parallel processing. The majority of the heavy lifting required to train your models in parallel is already done in-house by scikit-learn.
I provided an in-depth tutorial showcasing different methods to generate predictions in parallel. One method applies when you have multiple models and want to pass the data through each of those models to yield predictions; each model you are predicting with runs as a separate process in parallel. Another method applies when you have a large dataset and want to generate predictions for it more quickly. In that case, you can partition your dataset into multiple smaller segments (at most as many segments as the number of cores you want to use), pass each segment along with the model to yield predictions in parallel, and merge the results back together.
Do note that this is mainly useful when you have a substantial number of cores and a large dataset. When running this process on a small amount of data, the normal (non-parallel) implementation will run in a similar amount of time as the parallel implementation. The notebook associated with this article runs on a small amount of data for efficiency, as I don’t have many cores or much memory available. However, you can easily increase the amount of data generated, and thus the amount of data passed to the training, evaluation and prediction components.
Another way to optimize the amount of time it takes for your model to predict is through sparse matrices. This concept is outside of the scope of this article but if you’re interested, you can read about it here.
If you want to follow along with the code associated with this article, you can check out my GitHub where I’ve posted the notebook here.
If you’re looking to transition into the data industry and want mentorship and guidance from seasoned mentors then you might want to check out Sharpest Minds. Sharpest Minds is a mentorship platform where mentors (who are seasoned practicing data scientists, machine learning engineers, research scientists, CTO, etc.) would aid in your development and learning to land a job in data. Check them out here.
Resources
If you enjoyed reading this article, you might also find others I’ve written about data science and machine learning to be interesting. Check them out below.
Recommendation System – Matrix Factorization (SVD) Explained