
Submitting your MATLAB jobs to High-Performance Clusters using Slurm

A Tutorial on using SLURM for job submissions to High-Performance Clusters

Image created by the author using a MATLAB script

In my previous article, I wrote about using the PBS job scheduler to submit jobs to High-Performance Clusters (HPC) to meet our computation needs. However, not all HPC systems support PBS jobs. Recently, my institution also decided to use another kind of job scheduler, called Slurm, for its newly installed clusters.

How to Submit R Code Jobs using PBS to High-Performance Clusters

As described in its documentation¹, Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters. As a cluster workload manager, Slurm has three key functions:

  1. It allocates access to resources to users for some duration of time so they can execute their computational jobs.
  2. It provides a framework to start, execute, and monitor work (normally a parallel job) on the set of allocated nodes.
  3. It arbitrates contention for resources by managing a queue of pending work.

In this article, I will demonstrate how to submit MATLAB jobs to an HPC cluster that uses the Slurm job scheduler.

You must be wondering why MATLAB when, after all, it is not open source. There are ample reasons for that. First, MATLAB supports true parallelism, unlike most Python packages, except for those that are actually implemented in C++ and merely provide Python-level wrappers. MATLAB has excellent support for distributed computing, and its underlying implementation is indeed in C++, which is extremely fast. Besides, MATLAB provides out-of-the-box algorithms for applications in Control Engineering, Digital Signal Processing, Robotics, and so on, where the Python community still lags. Chances are that if you are part of a large organization, such as a university, your organization has an institutional license for MATLAB.

Writing Slurm scripts for Job submission

Slurm submission scripts have two parts: (1) resource requests and (2) job execution.

The first part of the script specifies the number of nodes, the maximum CPU time, the maximum amount of RAM, whether GPUs are needed, and so on, that the job will request for running the computation task.

The second part of the script specifies which modules to load, what data files to load, and which program or code to run.

A Working Example

I present a working example in which I want to compute something in parallel by leveraging the Parallel Computing Toolbox provided by MATLAB. For that, I can use parfor instead of a conventional single-core for loop. Since the HPC system where I want to run this code has 94 cores, I will specify 94 workers so that 94 parallel jobs² can be created.

Let’s begin. First, create a new file called minimal_parfor.m in the /home/u100/username directory:

% Sample the interval (0, 200] at 800,000 points
L = linspace(0.000001, 200, 800000);
index = 1:length(L);
% Timestamp used to build a unique output file name
formatOut = 'yyyy_mm_dd_HH_MM_SS_FFF';
dt = datestr(now,formatOut);
datafolder = "/home/u100/username";
outputfile = datafolder + "/" + "SIN_" + dt + ".csv";
% Create the CSV file with a header row if it does not already exist
if ~isfile(outputfile)
        fprintf("SIN Output file doesn't exist .. creating\n");
        fileID = fopen(outputfile,'w');
        fprintf(fileID,'L,Sin(L)\n');
        fclose(fileID);
else
        fprintf("SIN Output file exists .. not creating\n");
end
% Start a parallel pool with 94 workers
parpool(94);
parfor ii = index
    S = sin(L(ii));
    % Append each result to the CSV file as it is computed
    fileID = fopen(outputfile,'a');
    fprintf(fileID,'%.10f,%.10f\n', ...
        L(ii),S);
    fclose(fileID);
end

As we can see, I am requesting 94 workers using parpool. Further, I am computing sine values in parallel and saving them to a file. I prefer saving the data to a file rather than storing it in a variable because I can inspect the file while execution is still running. parfor distributes the loop indices across the 94 workers, so the iterations do not complete in order. As a result, I won’t find the indices written to the file sequentially. However, I can always sort them later, for example with the short sketch below.
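For completeness, here is a minimal sketch of that post-processing step. It is my own suggestion rather than part of the original script, and it assumes it runs in the same session as minimal_parfor.m so that outputfile, datafolder, and dt are still defined. It reads the unordered CSV back in, sorts the rows by L, and writes a sorted copy:

% Minimal post-processing sketch (assumes outputfile, datafolder, and dt
% from the script above are still in the workspace)
T = readtable(outputfile);      % two columns: L and Sin(L)
T = sortrows(T, 1);             % sort rows by the first column (L)
writetable(T, datafolder + "/" + "SIN_sorted_" + dt + ".csv");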

In the second step, I create a Slurm batch script in the /home/u100/username directory and name it minimal.slurm:

#!/bin/bash
# --------------------------------------------------------------
### PART 1: Requests resources to run your job.
# --------------------------------------------------------------
### Optional. Set the job name
#SBATCH --job-name=minimal_parfor
### Optional. Set the output filename.
### SLURM reads %x as the job name and %j as the job ID
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
### REQUIRED. Specify the PI group for this job
#SBATCH --account=manager
### Optional. Request email when job begins and ends
### Specify high priority jobs
#SBATCH --qos=user_qos_manager
# SBATCH --mail-type=ALL
### Optional. Specify email address to use for notification
# SBATCH [email protected]
### REQUIRED. Set the partition for your job.
#SBATCH --partition=standard
### REQUIRED. Set the number of cores that will be used for this job.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=94
### REQUIRED. Set the memory required for this job.
#SBATCH --mem=450gb
### REQUIRED. Specify the time required for this job, hhh:mm:ss
#SBATCH --time=200:00:00
# --------------------------------------------------------------
### PART 2: Executes bash commands to run your job
# --------------------------------------------------------------
### SLURM Inherits your environment. cd $SLURM_SUBMIT_DIR not needed
pwd; hostname; date
echo "CPUs per task: $SLURM_CPUS_PER_TASK"
### Load required modules/libraries if needed
module load matlab/r2020b
### This was recommended by MATLAB technical support
ulimit -u 63536 
cd $PWD
### Run MATLAB non-interactively on the script and redirect console output to a log file
matlab -nodisplay -nosplash -softwareopengl < /home/u100/username/minimal_parfor.m > /home/u100/username/out_minimal.txt
date

minimal.slurm is a bash script that specifies the resources to request on the HPC system and how to execute the MATLAB job. I specify 94 CPUs using the directive #SBATCH --cpus-per-task=94 so that they are available to MATLAB when it requests 94 workers through parpool. Further, I request 450 GB of RAM, which will be available when my job starts running.
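As a side note, instead of hard-coding 94 workers in the MATLAB script, the pool size can be derived from the job environment so that it always matches whatever --cpus-per-task was granted. Here is a minimal sketch of that idea; it is my own suggestion rather than part of the original script, and it relies on the SLURM_CPUS_PER_TASK environment variable that Slurm sets inside a job:

% Size the parallel pool from the Slurm allocation instead of hard-coding it
nWorkers = str2double(getenv('SLURM_CPUS_PER_TASK'));
if isnan(nWorkers)
    % Fallback when running outside a Slurm job (e.g., on a local machine)
    nWorkers = feature('numcores');
end
parpool(nWorkers);

Depending on the cluster profile in use, the profile’s NumWorkers limit may also need to be raised before parpool can start that many workers.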

To submit the job to the HPC cluster, type:

sbatch minimal.slurm

To get the status of your submitted job, you can type:

sacct

or

squeue | grep username

Once the job starts, finishes, or is terminated for any reason, I expect to receive an email at the address specified in my Slurm file (for this to happen, the --mail-type and --mail-user lines above must be uncommented, i.e., written as #SBATCH with no space after the #).

I hope this article will be helpful to anyone using Slurm for job submission who wants to harness the power of MATLAB’s Parallel Computing Toolbox. For more detail on MATLAB’s parallel and distributed computing tools, please check out the MathWorks documentation².

References

  1. https://slurm.schedmd.com/quickstart.html
  2. https://www.mathworks.com/help/parallel-computing/parpool.html

Note: This article is in no way a solicitation for MATLAB or any product by MathWorks. The author holds no affiliation with MathWorks or any related entity.

