Writing Programs for Supercomputers

When, What and How to change your code.

Anuradha Wickramarachchi
Towards Data Science

--

Server Racks (Image Source)

An abstract and imprecise impression

To give an idea of a supercomputer, think of a powerful set of machines that you can use as one. Weird enough? It is not necessarily a single machine with a lot of memory or CPU power. Supercomputers do have great computing power, but not simply because of higher clock speeds or more memory. Let's have a look at NCI (which I currently happen to work with). These are the specs at NCI.

  • 89,256 cores in 4,500 Intel Xeon Sandy Bridge, Broadwell and Skylake nodes
  • 128 GPUs in 32 NVIDIA Tesla K80 and P100 nodes
  • 32 64-core Intel Xeon Phi processors in 32 nodes
  • 64 cores in 4 IBM Power8 nodes
  • 300 terabytes of memory
  • 8 petabytes of operational disk storage
  • Hybrid FDR/EDR Mellanox Infiniband full fat tree interconnect (up to 100 Gb/sec)

Since this is a shared computational facility that thousands of users around Australia use, programs are accepted to run in the form of jobs, submitted to a queue under a specific project grant (usually funded by research institutions). Different queues charge varying amounts, but basically you are charged per CPU hour, multiplied by the queue cost (determined by the queue's specs and priority). If you have a 1-CPU program running for 1 hour, you consume 1 service unit multiplied by the queue cost (read more if you like). Grants are usually in the order of thousands or millions of CPU hours.

Meaning of Specifications

At the time of your job submission you can specify the resources you need. The most important thing here is that the system architecture is always x64 for CPUs and CUDA for GPUs. So if your program compiles and runs on Linux, you can run it on the supercomputer. Your job submission will look like this.

#!/bin/bash
#PBS -P ch35
#PBS -q biodev
#PBS -l ncpus=28
#PBS -l mem=512GB
#PBS -l jobfs=1GB
#PBS -l walltime=100:00:00
#PBS -l wd
module load java/jdk1.8.0_60
module use -a /short/qr59/aw5153/modules/
module load canu/1.8
java_path=$(which java)
canu -p meta -pacbio-raw cluster-0cluster-0.fastq -d canu-cluster-0cluster-0/ maxMemory=512 maxThreads=28 useGrid=false java=$java_path genomeSize=50m

This is a real script I used to assemble a metagenomic sample (millions of reads; think of it as just some program). I have requested 28 cores/CPUs from project ch35, with 512GB of memory and 100 hours of walltime. So I will consume 28 * 100 * 1.5 = 2800 * 1.5 = 4200 service units, where 1.5 is the cost of the biodev queue. Memory is usually not charged: when you ask for 28 CPUs they allocate a complete node, so we essentially have 576GB of RAM per node.

You can see that we have 89,256 cores in 4,500 nodes. These nodes can be considered as separate computers connected by InfiniBand (very fast networking). The nodes can share data via the Message Passing Interface (MPI), but they do not have shared memory. This means if you ask for 56 cores, you will get 2 nodes, 28 cores each, with up to 576GB of RAM each (1152GB, roughly 1.1TB, in total). But you cannot load one program with more than 576GB, because memory access is non-uniform (NUMA): a process on one node cannot access the RAM of the other node. So if your program is bigger, you are done! Or are you?
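To make the "no shared memory across nodes" point concrete, here is a minimal MPI sketch (my own illustration, not NCI-specific code): since each rank can only touch its own node's RAM, rank 0 has to explicitly send its buffer to rank 1.

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Each rank (typically one per node in this discussion) owns its own memory
  std::vector<int> buffer(4);

  if (rank == 0) {
    buffer = {1, 2, 3, 4};
    // There is no shared address space, so the data must be sent explicitly
    MPI_Send(buffer.data(), (int)buffer.size(), MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    // Rank 1 receives its own copy into its own node's RAM
    MPI_Recv(buffer.data(), (int)buffer.size(), MPI_INT, 0, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    std::printf("rank 1 got %d %d %d %d\n",
                buffer[0], buffer[1], buffer[2], buffer[3]);
  }

  MPI_Finalize();
  return 0;
}

Built with mpicxx and launched with something like mpirun -np 2, these two processes may land on different nodes and can still only exchange data through such messages.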

Let’s look at this scenario with a sample program

Suppose you want to count all possible n-grams in a huge dataset (millions of genomic reads, i.e. strings of about 100k characters each), keep a lookup table of each n-gram's count, and then annotate every string with the counts of its n-grams across the entire corpus. The most efficient way is to keep a huge hash table in memory as the lookup table, then iterate over each 100k-character string, extracting its n-grams and annotating it. Assume the lookup table takes 25GB. Doing this sequentially takes a lot of time, so we want to process the strings in parallel.
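As a rough sketch of the sequential version (my own illustration: the function names, the choice of k = 4 and the toy reads are made up, and real n-gram counting would use a more compact encoding), it could look like this:

#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Build the corpus-wide lookup table: count of every k-length substring (n-gram)
std::unordered_map<std::string, long> count_ngrams(
    const std::vector<std::string>& reads, std::size_t k) {
  std::unordered_map<std::string, long> counts;
  for (const std::string& read : reads)
    for (std::size_t i = 0; i + k <= read.size(); i++)
      counts[read.substr(i, k)]++;
  return counts;
}

// Annotate one read with the corpus-wide count of each of its n-grams
std::vector<long> annotate(const std::string& read, std::size_t k,
                           const std::unordered_map<std::string, long>& counts) {
  std::vector<long> annotation;
  for (std::size_t i = 0; i + k <= read.size(); i++)
    annotation.push_back(counts.at(read.substr(i, k)));
  return annotation;
}

int main() {
  // Toy stand-in for millions of ~100k character reads loaded from disk
  std::vector<std::string> reads = {"ACGTACGT", "TTACGTAA"};
  auto counts = count_ngrams(reads, 4);
  for (long c : annotate(reads[0], 4, counts))
    std::cout << c << ' ';
  std::cout << '\n';
  return 0;
}

Note that the whole counts table lives inside a single process's memory; that detail drives everything that follows.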

When to Change?

In our sample scenario, we can have up to 28 threads per node (RAM is not a big issue for this particular scenario). But running this with just 28 cores will take days. We need to make it use more than 28 cores, more likely 100 or even more.

Reality: if I just ask for 1000 cores (this will automatically pick 36 nodes) and submit my job, it will stall and thrash. Although I asked for 1000 cores, my program will run on 1 node, because one process can run on only one node.

This is when we know our program has to change to suit supercomputing environments or grid systems.

What to Change?

We can overcome the above challenge if we can run 1000 parallel processes, because separate processes can run on separate nodes: the operating system (or the grid system) can schedule them effectively. But now our RAM requirement escalates at once, since each process needs its own lookup table: 25GB * 1000 = 25TB of RAM. Across 36 nodes we only have 36 * 576GB, roughly 20TB. This is where we need both multiprocessing and multithreading, so we have to change both the threading and the process-management aspects of the program to make optimal use of our service units. Let's see how we'd do that.

How to Change?

As we are trying to improve performance, I will explain in terms of C++ and Python, which are quite often used in scientific computing. For multithreading we could use OpenMP (C++) or the multiprocessing library of Python (you must use its shared-memory facilities; if not, you'll lose RAM again). For multiprocessing across nodes you can use Open MPI (both C++ and Python bindings are available). Now we can simply deploy processes with around 28 threads per process, so each process uses only 25GB of RAM and resides on a single node. Likewise, we can load more 100k-character strings into RAM at once to process with 28 threads (as we have a lot of RAM left on each node after loading the lookup table). This program runs a lot faster.

You'll need some pre- and post-processing to chunk the initial dataset into 36 parts and to gather all 36 outputs, which can be simple split and cat commands outside of the job queue. Following is a modified example from the MPI tutorials.

#include <mpi.h>
#include <omp.h>
#include <cstddef>
#include <iostream>
#include <vector>
#include <string>

int main(int argc, char** argv) {
  // Initialize the MPI environment
  MPI_Init(NULL, NULL);

  // Get the number of processes
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Get the rank of the process
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Load the chunk of the dataset belonging to this rank,
  // to be processed with 28 threads on this node
  std::vector<std::string> data;

  #pragma omp parallel for num_threads(28)
  for (std::size_t i = 0; i < data.size(); i++) {
    // process data[i]
  }

  // Finalize the MPI environment.
  MPI_Finalize();
  return 0;
}

Your real program will be a lot more complex than this. If you are doing your work in Python, make sure you have read the documentation on sharing memory between workers without redundant copies (otherwise every worker will clone all shared variables, including the lookup table).

I intend to share the complete implementation, but not at this point, as I intend to cover the many optimizations you could make in a future article.

Hope it was a good read, at least for beginners.

Cheers!
