
Which is faster, Python threads or processes? Some insightful examples

A series of examples which explain the advantages and disadvantages of threads vs. processes. With Dask code for your own experiments.

James Fulton
Towards Data Science
9 min read · Nov 4, 2021


Photo by Jonathan Kemper on Unsplash

If you are reading this, you have likely been trying to work out what the difference is between threads and processes in Python, and when you should use each. In order to explain some of the key differences, I’m going to show some example functions, and analyse how long they take to run using both threads and processes. Importantly, I’ll talk about why threads and processes have the different timings in each case.

I’m going to be using Dask to run the example functions using threads and processes. More details on how to do this with Dask are at the bottom of the article. For now, I’m going to focus on the differences between threads and processes.

1. A function which activates the GIL

Let’s start with a simple function, just a Python for-loop. It takes an argument n, an integer specifying how many iterations the loop runs.

def basic_python_loop(n):
    """Runs simple native Python loop which is memory light
    and involves no data input or output."""
    mydict = {}
    for i in range(n):
        mydict[i % 10] = i
    return None

This function only creates a 10-item dictionary, so it uses very little memory, and it isn’t very demanding on the CPU. Your computer actually spends most of its time interpreting the Python code rather than running it.

The plot below shows the times taken to execute the function basic_python_loop(n=10_000_000) 16 times. For comparison, running this function just once took about 0.8 seconds.

Image by Author

In the figure, each of the 16 calls to the function are assigned a task number 1–16. The orange shaded bar shows when each function call started and when it finished. For example, task 8 in the left panel started ~5 seconds after the start of the computation, and ran until 5.6 seconds. The blue bar at the top shows the entire time required to complete all 16 function calls.

The three panels in the figure show the timings when the function calls are run in a simple loop, with multi-threading, and with parallel processing.

The full 16-part calculation takes the same length of time using threads as it takes using a loop. And interestingly, each of the individual tasks takes a lot longer using threads. This is because of the global interpreter lock (GIL). In Python, only one thread can read the code at once. This is a core feature of CPython, the standard Python interpreter, but most other programming languages do not have this limitation. This means that other articles you read about multi-threading may not apply to Python.

In Python, threads work like a team of cooks sharing a single recipe book. Let’s say they have 3 dishes to prepare (3 threads), and there are 3 cooks (3 cores on your computer). A cook will read one line from the recipe, and go and complete it. Once they have completed the step they join the line to read their next step. If the steps in the recipe are short (like they are in a basic Python for-loop), the cooks will complete them very quickly and so will spend most of their time just waiting for their turn to read the recipe book. So each recipe takes longer to make than if a cook could have sole access to the book.

The 16 tasks shown above were run using 3 threads, which means 3 function calls (3 recipes) were being handled at once. Since the steps are simple (like accessing a dictionary), the threads spend most of their time waiting for their turn to read the Python code. So although we are simultaneously running 3 function calls, each one takes 3 times longer to complete. There is no benefit to using multi-threading here!

In the figure, you can see that processes were about 3 times faster than using a loop or using multi-threading to complete all 16 tasks. Each individual task took the same length of time when run using processes as it did using loops. This is because each process has its own GIL, so processes do not lock each other out like threads.

In Python, parallel processing is like a team of cooks, but every cook has their own kitchen and recipe book.
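
If you want to replicate this comparison without Dask, here is a minimal sketch using only the standard library’s concurrent.futures. This is my own harness, not the one used for the plots, and timings will vary by machine:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def basic_python_loop(n):
    """Same GIL-bound loop as above."""
    mydict = {}
    for i in range(n):
        mydict[i % 10] = i
    return None

def run_all(executor_cls, n_tasks=16, n=10_000_000, workers=3):
    """Time n_tasks calls spread over a pool of workers."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as ex:
        for future in [ex.submit(basic_python_loop, n) for _ in range(n_tasks)]:
            future.result()
    return time.perf_counter() - start

if __name__ == "__main__":  # guard needed for process pools on Windows/macOS
    print(f"threads:   {run_all(ThreadPoolExecutor):.1f} s")
    print(f"processes: {run_all(ProcessPoolExecutor):.1f} s")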

2. Loading data from CSV

In this function we load a randomly chosen CSV from a directory. All CSVs in the directory are the same size.

import glob

import numpy as np
import pandas as pd

def load_csv():
    """Load, but do not return, a CSV file."""
    # Choose a random CSV from the directory
    file = np.random.choice(glob.glob(f"{temp_dir}/*.csv"))
    df = pd.read_csv(file)
    return None

Image by Author
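
For load_csv to have something to read, temp_dir must point at a directory of CSV files. The article doesn’t show this setup, so here is a hypothetical one; the directory name, file count, and file contents are my assumptions:

import os
import numpy as np
import pandas as pd

temp_dir = "csv_temp"  # assumed name; any writable directory works
os.makedirs(temp_dir, exist_ok=True)
for i in range(16):
    # identical-size files, as described in the article
    pd.DataFrame(np.random.normal(0, 1, (100_000, 10))).to_csv(
        f"{temp_dir}/file_{i}.csv", index=False
    )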

Threads and processes took about as long as each other, and both were faster than using a loop. In this function, unlike the previous one, each task completed by threads takes the same amount of time as when completed by the loop. But why don’t threads get slowed down by the GIL here?

In this function, most of the time is spent running the line pd.read_csv(file). This function causes Python to run a large chunk of C code to load the data from file. When running this C code, a thread releases the Python interpreter so that other threads can read the Python code, and it won’t need to read the Python code again until it has finished loading. This means that the threads don’t lock each other out as much as in the previous example. They aren’t all fighting to read the code at once.

Most of the functions in NumPy, SciPy, and pandas are written in C, so they also cause threads to release the GIL and avoid locking each other out.

In the recipe-cook analogy, this is like having a step in the recipe which says “knead the dough for 5 minutes”. The instruction is quick to read, but takes a long time to complete.

3. A NumPy function which uses a lot of CPU

This next function uses NumPy to create a random array and find its inverse. Basically, it is just a computationally heavy calculation.

def numpy_cpu_heavy_function(n):
    """Runs a CPU intensive calculation, but involves
    no data input or output."""
    x = np.linalg.inv(np.random.normal(0, 1, (n, n)))
    return None

The timings for this function are shown below, using n=2000 so that the array has shape 2000×2000.

Image by Author

Strangely, we didn’t get much of a speed-up using either threads or processes. But why? In the previous examples we got a 3 times speed-up because the computer used has 3 cores. The reason is that NumPy itself uses multi-threading and already spreads this calculation over multiple cores. So when you try to run many of these NumPy functions in parallel, whether with threads or processes, you are limited by total computing power: each worker is trying to recruit the same cores for its own NumPy calculation. Once all of your cores are running at 100%, there is no way to get more computing power out of them.
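
One way to check this explanation (my suggestion, not something from the article) is to cap NumPy’s internal thread pool with the threadpoolctl package, so that any speed-up has to come from our own workers:

import numpy as np
from concurrent.futures import ThreadPoolExecutor
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

def numpy_cpu_heavy_function(n):
    x = np.linalg.inv(np.random.normal(0, 1, (n, n)))
    return None

with threadpool_limits(limits=1):  # one internal BLAS thread per call
    with ThreadPoolExecutor(max_workers=3) as ex:
        list(ex.map(numpy_cpu_heavy_function, [2000] * 16))

With the internal BLAS threads limited to one, a pool of 3 threads should get closer to a 3 times speed-up, since np.linalg.inv releases the GIL while it runs.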

4. Functions with large inputs or outputs

So far, the functions we have used have either had no input arguments, or else the input argument is just a single integer. A key difference between processes and threads is shown when we have sizeable inputs or outputs.

def transfer_data(x):
    """Transfer data into and out of a function."""
    return x * 2

In the function above, we pass in an array x and return the array doubled. In the results shown below, x was an array of dimensions 10,000×1,000 and took 80 MB in memory.
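
For reference, the input can be constructed as below; a 10,000×1,000 array of float64 values occupies 10,000 × 1,000 × 8 bytes = 80 MB:

import numpy as np

x = np.random.normal(0, 1, (10_000, 1_000))
print(x.nbytes / 1e6)  # 80.0 (MB)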

Image by Author

In this case, threads completed the task in 0.36 seconds and the loop took 0.51 seconds, but processes took over 14 times longer. This is because processes each have their own separate pool of memory. When we pass the array x into the function and run it using processes, x must be copied into each process from the main Python session. This took about 3.5 seconds. Once the array was copied, the processes could double it very quickly, but then it took another 3.5 seconds to copy the doubled arrays back to the main session. In contrast, threads share the same memory space as the main Python session, so there is no need to copy the array across and back again.
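
You can get a rough feel for this copying cost with pickle, since Dask’s ‘processes’ scheduler serialises data to move it between processes. This is my approximation rather than the article’s measurement, and the real transfer also pays for inter-process pipes, so treat it as a lower bound:

import pickle
import time
import numpy as np

x = np.random.normal(0, 1, (10_000, 1_000))  # the 80 MB input from above

start = time.perf_counter()
blob = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)  # copy out
x_roundtrip = pickle.loads(blob)                          # copy back
print(f"serialise + deserialise: {time.perf_counter() - start:.2f} s")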

In the cook analogy, processes are like 3 cooks, each with their own recipe book and own kitchen. The kitchens are in separate locations, so if we want the cooks to run some recipes, we need to carry the ingredients to them, and once they have cooked the dishes, we need to go and collect them. Sometimes it is easier to use threads and have the cooks work in the same kitchen, even if that means they have to share the same recipe book.

Threads vs. Processes - A Brief Summary

Let’s summarise:

  • The GIL means that only one thread can read the Python code at once. This means multiple threads can lock each other out.
  • Using external calls to C code, as in NumPy, SciPy, and pandas functions, means threads release the GIL while these functions run. This means threads are less likely to have to wait for a chance to read the code.
  • Processes each have their own memory pool. This means it is slow to copy large amounts of data into them, or out of them. For example when running functions on large input arrays or DataFrames.
  • Threads share the same memory as the main Python session, so there is no need to copy data to or from them.

These lead to some rules of thumb for speeding up calculations.

  • First of all, on data transfer: if your functions take in or return large chunks of data, use threads; otherwise you will waste too much time transferring data. However, ask yourself whether you really need the data as an argument. Could you load the data inside the function instead? If so, you could still use processes (see the sketch after this list).
  • Next, on Python loops: if your function must use simple native Python loops, use processes. However, ask yourself whether these loops could be replaced with NumPy array operations. If so, you could still use threads.
  • If you are using computationally expensive NumPy/SciPy-style operations, then you may not gain much by using threads or processes.
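
As an example of the first rule of thumb, here is a minimal sketch of the refactor it suggests. The helper double_from_file and the file path are hypothetical, not from the article:

import pandas as pd

def double_from_file(path):
    """Load the data inside the worker, so only a short string
    goes in and a single number comes out."""
    df = pd.read_csv(path)
    return (df * 2).to_numpy().sum()  # return a scalar, not the large frame

# Cheap to run with processes: no large arrays cross the process boundary.
# result = dask.delayed(double_from_file)("data/file_0.csv").compute()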

Running functions with threads and processes using Dask

Thoroughly covering Dask in this article would make it too long, so instead I’ll cover the essential parts and link to a notebook, so that these plots are reproducible.

If you are interested, I also have a Dask course on DataCamp where I cover Dask more thoroughly with interactive videos and exercises. The first chapter of that course is free, takes about 30 minutes, and covers all the Dask used below and in the notebooks.

We can use Dask to run calculations using threads or processes. First we import Dask, and use the dask.delayed function to create a list of lazily evaluated results.

import dask

n = 10_000_000
lazy_results = []
for i in range(16):
    lazy_results.append(dask.delayed(basic_python_loop)(n))

Note that the function basic_python_loop hasn’t actually been run yet since it is lazily evaluated. Instead, only the instructions to run it have been stored.
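
Printing one of the entries confirms this; each is a Delayed object wrapping the function call rather than a result:

print(lazy_results[0])  # e.g. Delayed('basic_python_loop-...')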

We can run the calculation using multi-threading like:

results = dask.compute(lazy_results, scheduler='threads')

Or we can run the calculation using multi-processing like:

results = dask.compute(lazy_results, scheduler='processes')

These are the simplest methods, but in the experiments for this article, I wanted more control over the number of threads and processes used. To do this, you can create a client which sets up a pool of processes and/or threads which you use to complete the computation.

For example, to create and use a pool of 3 processes, you can use:

from dask.distributed import Client

process_client = Client(
    processes=True,
    n_workers=3,
    threads_per_worker=1,
)
# compute returns futures; gather blocks and collects the results
futures = process_client.compute(lazy_results)
results = process_client.gather(futures)

Some similar stories and further reading

I took a lot of inspiration from Brendan Fortuner’s Medium post from a few years ago. In fact, a lot of what I’ve done here recreates his examples, but I wanted to go a little deeper than the original article.

If you are using native Python loops a lot in your code, then you should definitely be using Numba. It can speed up these loops to speeds approaching that of C. Best of all, if you use Numba correctly (see notebook), you can set it so that your loop functions do not hold the GIL. This means you can take your Numba loop functions, which are already much faster, and run them in parallel with multi-threading.
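
As a rough sketch of that pattern (illustrative, not taken from the article’s notebook): passing nogil=True to numba.njit asks Numba to release the GIL while the compiled function runs, so threads executing it can run truly in parallel.

import numba

@numba.njit(nogil=True)
def fast_loop(n):
    """Compiled loop that releases the GIL while it runs."""
    total = 0
    for i in range(n):
        total += i % 10
    return total

fast_loop(10)  # first call triggers JIT compilation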

Finally, the rules of thumb for using threads and processes, that we arrived at in this article, are quite similar to the Dask best practices described here.
