
Parallel and distributed computing
Learn to best use your compute resources
Shout out to Zeev Waks, who has been part of the project and the writing of this post.
Following our companion blog on sequential hyperparameter optimization, here we discuss the engineering considerations around run time and cost. Specifically, we dive into approaches for speeding up the parameter search using parallel or distributed computing. This is important since hyperparameter optimization (HPO) is often one of the costliest and slowest aspects of model development.
We optimized our hyperparameters using AWS virtual machines (EC2 instances) as the hardware and Optuna as the software framework. Optuna is a relatively new open-source framework for HPO developed by Preferred Networks, Inc.
Parallel and distributed computing

The goal of parallel and distributed computing is to make optimal use of hardware resources in order to speed up computational tasks. While the two terms sound similar, and both indeed refer to running multiple processes simultaneously, there is an important distinction.
- Parallel computing refers to running multiple tasks simultaneously on the different processors of a single machine.
- Distributed computing refers to the ability to run tasks simultaneously on multiple autonomous machines.
Both parallel and distributed computing enable scaling in order to shorten run duration. With parallelism, run time is shortened by scaling vertically, which means improving an individual machine, for example by adding more processors or memory. With distributed computing, scaling horizontally, i.e., adding more machines, can likewise improve performance.
There are many more aspects regarding the differences and the relationship between parallel and distributed computing [1], but we will not dive further into those in this blog.
Use case
The details of our use case can be found here. We used 9 hyperparameters and trained the model on a range of 18k to 80k examples (the range depended on the hyperparameter values), each containing roughly 80 features.
The code below shows how simple it is to define parallel or distributed computing in Optuna.
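The sketch below is illustrative rather than our exact code: the toy objective, study name, and Redis endpoint are placeholders, and the RedisStorage backend assumes an Optuna release that still ships this experimental storage (newer releases use journal-based Redis storage instead).

```python
import optuna

# A toy objective just to show the mechanics; our real objective trained a
# regression model on the tabular data described above.
def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

# Parallel: run several trials simultaneously on a single machine.
study = optuna.create_study()
study.optimize(objective, n_trials=100, n_jobs=4)  # n_jobs = trials run in parallel

# Distributed: every machine runs the same script against shared storage,
# so each instance pulls new hyperparameter combinations from the same study.
study = optuna.create_study(
    study_name="hpo-distributed",  # placeholder study name
    storage=optuna.storages.RedisStorage("redis://<endpoint>:6379"),  # placeholder endpoint
    load_if_exists=True,
)
study.optimize(objective, n_trials=100)
```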
In our case, parallel computing can be used in two places: for training the regression model, and for searching multiple hyperparameter combinations (i.e., Optuna trials) simultaneously on the same machine. In contrast, distributed computing can be used here primarily for searching different hyperparameter combinations on multiple machines. To coordinate the distributed job, we used a Redis DB endpoint under our AWS account (read here about other storage options).
Results
We measured run time and cost using various machines and parallel/distributed configurations to better understand the optimal composition for running our HPO search.

Both our model libraries, Scikit-learn (we used RandomForestRegressor) and XGBoost, as well as Optuna itself, support parallelism through a built-in parameter, n_jobs. The value of n_jobs sets the number of jobs to run in parallel, with -1 meaning all available processors.
In our case, the best configuration in terms of run duration was Optuna n_jobs=1 and model n_jobs=-1, meaning that we trained each model in parallel but did not test hyperparameter combinations in parallel. Since this setup was optimal on a single machine, we did not continue checking additional configurations for the distributed run.
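As a concrete illustration of that winning combination, here is a sketch under assumed placeholders: the synthetic dataset, search space, and scoring below are stand-ins roughly shaped like our use case, not our production code.

```python
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for our tabular data (~80 features per example).
X, y = make_regression(n_samples=20_000, n_features=80, random_state=0)

def objective(trial):
    model = RandomForestRegressor(
        n_estimators=trial.suggest_int("n_estimators", 100, 500),
        max_depth=trial.suggest_int("max_depth", 5, 30),
        n_jobs=-1,  # model n_jobs=-1: each forest trains on all available cores
    )
    return cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error").mean()

study = optuna.create_study(direction="maximize")
# Optuna n_jobs=1: one trial at a time, so trials don't compete for the cores
# that model training is already using.
study.optimize(objective, n_trials=50, n_jobs=1)
```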

Interestingly, although the m5.24xlarge EC2 instance type is 6x bigger than the m5.4xlarge instance, it reduced study time by only 40% (a bit disappointing). However, using distributed computation with the two smaller m5.4xlarge instances, we both slightly improved study time (a 43% reduction versus 40%) and did so at about a third of the cost.
The good news is that we can add more instances to further improve run times. Moreover, one of the machines was idle for about 6 minutes because it was allocated a set of trials that collectively had a shorter run time; this imbalance would likely become less meaningful when running more trials.
In addition, Optuna's parallelism uses threads, which are often well-suited for I/O-bound tasks but are not optimal for CPU-intensive tasks such as training a model (a known issue). Read about multithreading and multiprocessing here. In an attempt to overcome this issue, we ran two separate Python processes on the same m5.24xlarge machine and synced them using Redis; however, this did not improve the run time in our case.
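A sketch of that process-based workaround, using the same placeholder study name and Redis endpoint as above: two worker processes on one machine coordinate through the shared study instead of Optuna's thread-based n_jobs.

```python
import optuna
from multiprocessing import Process

REDIS_URL = "redis://<endpoint>:6379"  # placeholder for the shared Redis endpoint
STUDY_NAME = "hpo-multiprocess"        # placeholder study name

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

def run_worker(n_trials):
    # Each worker is a separate OS process, so CPU-bound model training is not
    # limited by the GIL the way Optuna's thread-based n_jobs parallelism is.
    study = optuna.load_study(
        study_name=STUDY_NAME,
        storage=optuna.storages.RedisStorage(REDIS_URL),
    )
    study.optimize(objective, n_trials=n_trials)

if __name__ == "__main__":
    # Create the shared study once, then launch two workers on the same machine.
    optuna.create_study(
        study_name=STUDY_NAME,
        storage=optuna.storages.RedisStorage(REDIS_URL),
        load_if_exists=True,
    )
    workers = [Process(target=run_worker, args=(50,)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```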
Summary
We demonstrated that, when using Optuna for hyperparameter optimization, scaling horizontally (i.e., adding more machines) can reduce both run duration and cost, at least in our case, more substantially than scaling vertically.
We encourage readers to try different configurations, since the optimal setup in our situation may not be the ideal configuration for other scenarios.
Hai Rozencwajg is a Senior Data Scientist at Skyline AI, the company building the AI mastermind to solve RE.
References
[1] M. Raynal, Parallel Computing vs. Distributed Computing: A Great Confusion? (Position Paper) (2015), SpringerLink