
Speed up Pandas code with Numpy

But I can't vectorise this, can I? … Yes, you probably can!

Image by AI (Dalle-3)

In one of the first articles I wrote on Medium, I talked about using the apply() method on Pandas dataframes and said it should be avoided, if possible, on larger dataframes. I’ll put a link to that article at the end of this one if you want to check it out.

Although I talked a little then about the alternative, i.e. using vectorisation, I didn't give many examples, so I intend to remedy that here. Specifically, I want to talk about how NumPy and a couple of its lesser-known functions (where() and select()) can be used to speed up Pandas operations that involve complex if/then/else conditions.

Vectorisation in the context of Pandas refers to the method of applying operations to entire blocks of data at once rather than iterating through them row by row or element by element. This approach is possible due to Pandas’ reliance on NumPy, which supports vectorised operations that are highly optimized and written in C, enabling faster processing. When you use vectorised operations in Pandas, such as applying arithmetic operations or functions to DataFrame or Series objects, the operations are dispatched to multiple data elements simultaneously.

This not only leads to more concise and readable code but can significantly boost performance by reducing the overhead of Python loops and taking advantage of modern CPUs’ capabilities to perform operations on multiple data points in parallel. Vectorisation is a key feature that makes Pandas powerful for data manipulation and analysis in Python.
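As a tiny illustration of the difference (a sketch on synthetic data, not one of this article's benchmarks), compare a Python-level loop with the equivalent vectorised expression. Both produce the same result, but the second pushes the iteration down into NumPy's optimised C code.

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(1_000_000))

# Element by element: Python executes the loop body a million times
doubled_loop = pd.Series([x * 2 for x in s])

# Vectorised: one call, the loop runs inside NumPy's C code
doubled_vec = s * 2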

Some vectorised operations are obvious and happen without you even realising that's what's going on under the hood. Often, though, when the operations you want to perform become more complex, it's difficult to see how vectorisation can be applied.

In this article, I’ll discuss some common scenarios in which this is the case and show you how vectorisation can be applied.

Getting some data to work with

In many of my articles looking at the performance of Python libraries and database systems, I use a synthetic set of sales data for my testing. The schema of this data set looks like this,

  • order_id (int)
  • order_date (date)
  • customer_id (int)
  • customer_name (str)
  • product_id (int)
  • product_names (str)
  • categories (str)
  • quantity (int)
  • price (float)
  • total (float)

Here is a Python program that you can use to generate such a data set. It produces a CSV file; the number of records to create and the location of the output file are configurable. It uses the NumPy and Polars libraries, so you must install these before running it.

For this test, as I have a fairly high-spec PC, I’ll create and use a 1 million record CSV.

# generate the 1m record CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )

    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )

    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100

    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days

    # Create random dates as np.array and convert to string format
    order_dates = np.array([(start_date + timedelta(days=np.random.randint(0, date_range))).strftime('%Y-%m-%d') for _ in range(nrows)])

    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }

    # Create Polars DataFrame and write to CSV with explicit delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=',', include_header=True)  # Ensure comma is used as the delimiter

# Generate data with random order_date and save to CSV
generate(1_000_000, "/mnt/d/sales_data/sales_data_1m.csv")

Setting up our development environment

Before we get to the example code, let’s set up a separate development environment. That way, what we do won’t interfere with the library versions and code we have on the go for other projects.

I use Miniconda for this, but you can use whatever method suits you best.

If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. Get it using this link,

Miniconda – Anaconda documentation

Once the environment is created, switch to it using the activate command, and then install Jupyter and any required Python libraries.

#create our test environment
(base) C:\Users\thoma>conda create -n pandas_vect python=3.12 -y
# Now activate it
(base) C:\Users\thoma>conda activate pandas_vect
# Install python libraries, etc ...
(pandas_vect) C:\Users\thoma>conda install pandas numpy jupyter -y

Now type jupyter notebook into your command prompt. You should see a Jupyter notebook open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after the jupyter notebook command.

Near the bottom, there will be a URL that you should copy and paste into your browser to initiate the Jupyter Notebook.

Your URL will be different to mine, but it should look something like this:

http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69da

The code

To begin, let’s read our data set into a Pandas dataframe. We’ll time everything to get an idea of any speed-ups we gain.

import pandas as pd
import numpy as np
import time

# Start the timer
start_time = time.time()

# Path to the CSV file
file_path = "d:sales_datasales_data_1m.csv"

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Display the first five rows of the DataFrame
print(df.head())

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken to read the CSV file : {elapsed_time} seconds")

#
# Here is our output
#

  order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   

  product_names   categories  quantity  price   total  
0    Smartphone  Electronics         3  90.02  270.06  
1       Printer  Electronics         6  12.74   76.44  
2      Notebook   Stationery         8  48.35  386.80  
3        Laptop  Electronics         3  74.85  224.55  
4       Cabinet       Office         6  53.77  322.62  

Time taken to read the CSV file : 1.0295870304107666 seconds

Example 1 – Setting the scene

So, many operations you do on a Pandas dataframe are inherently vectorised. For example, suppose we want to multiply the quantity field by 5 and update the total value column.

# Start the timer
start_time = time.time()

df['quantity'] *= 5

# Update 'total' to reflect the new 'quantity'
df['total'] = df['quantity'] * df['price']

# Display the updated DataFrame
print(df.head())

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")

#
# Output
#

   order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   

  product_names   categories  quantity  price    total  
0    Smartphone  Electronics        15  90.02  1350.30  
1       Printer  Electronics        30  12.74   382.20  
2      Notebook   Stationery        40  48.35  1934.00  
3        Laptop  Electronics        15  74.85  1122.75  
4       Cabinet       Office        30  53.77  1613.10  

Time taken : 0.009307861328125 seconds

It took less than one-hundredth of a second to process 1 million records. How do we know the above was a vectorisation process? Apart from the minuscule amount of time it took to run, we can check to see how long it takes to do the same thing using a non-vectorised method—specifically, using the apply() method.

If you aren’t very experienced with Pandas, this is the method you might have devised to solve this problem in the first place.

Using the apply() method

The apply() method lets you run a function against every record in a dataframe. This should be quicker than using a for loop to iterate over the rows, but how does it compare to the original vectorised code?
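For reference, here is a rough sketch of the explicit loop that apply() saves us from writing, using iterrows() and per-cell .at writes. It's shown for comparison only; this style is typically even slower than apply(), so don't run it on a million-row dataframe.

# Explicit row-by-row loop -- the slowest approach of all
for idx, row in df.iterrows():
    new_quantity = row['quantity'] * 5
    df.at[idx, 'quantity'] = new_quantity
    df.at[idx, 'total'] = new_quantity * row['price']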

# Start the timer
start_time = time.time()

# Define the function to update 'quantity' and 'total'
def update_row(row):
    row['quantity'] *= 5
    row['total'] = row['quantity'] * row['price']
    return row

# Apply the function to each row
df = df.apply(update_row, axis=1)

# Display the updated DataFrame
print(df.head())

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")

#
# Output
#

   order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   

  product_names   categories  quantity  price    total  
0    Smartphone  Electronics        15  90.02  1350.30  
1       Printer  Electronics        30  12.74   382.20  
2      Notebook   Stationery        40  48.35  1934.00  
3        Laptop  Electronics        15  74.85  1122.75  
4       Cabinet       Office        30  53.77  1613.10 

Time taken : 75.53943586349487 seconds

So, as you can see from the timing of the above operation, we’ve established that vectorisation is essential when processing large datasets in Pandas, and that it often just happens "behind the scenes" for simpler coding problems.

But what happens when our coding needs are slightly more complicated?

Example 2 – Vectorise an if/then/else condition

In this example, say we want to do the same operation as before (quantity * 5), but this time only for Smartphone products. For any other product, we want quantity * 2.

Here is the naive apply() implementation.

def update_record(row):
    if row['product_names'] == 'Smartphone':
        row['quantity'] *= 5
    else:
        row['quantity'] *= 2
    row['total'] = row['quantity'] * row['price']
    return row

# Start the timer
start_time = time.time()

# Apply the update_record function to each row
df = df.apply(update_record, axis=1)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time

# Display the updated DataFrame
print(df)
print(f"Time taken: {elapsed_time} seconds")

        order_id  order_date  customer_id   customer_name  product_id  
0              0  2022-08-01          245    Customer_884         201   
1              1  2022-02-19          701   Customer_1672         205   
2              2  2017-01-01          184  Customer_21720         208   
3              3  2013-03-09          275  Customer_23770         200   
4              4  2022-04-23          960  Customer_23790         210   
...          ...         ...          ...             ...         ...   
999995    999995  2011-05-08          408  Customer_26518         202   
999996    999996  2019-02-11          850   Customer_4581         208   
999997    999997  2021-11-19          399  Customer_28681         205   
999998    999998  2016-05-02          714  Customer_12693         209   
999999    999999  2018-08-12          324  Customer_28553         207   

       product_names   categories  quantity  price    total  
0         Smartphone  Electronics        15  90.02  1350.30  
1            Printer  Electronics        12  12.74   152.88  
2           Notebook   Stationery        16  48.35   773.60  
3             Laptop  Electronics         6  74.85   449.10  
4            Cabinet       Office        12  53.77   645.24  
...              ...          ...       ...    ...      ...  
999995          Desk       Office        12  32.29   387.48  
999996      Notebook   Stationery         6   8.16    48.96  
999997       Printer  Electronics         6  92.69   556.14  
999998  Coffee Maker  Electronics         4  18.10    72.40  
999999           Pen   Stationery        16  93.04  1488.64  

[1000000 rows x 10 columns]
Time taken: 78.65310955047607 seconds

At almost 80 seconds, the run time is far too long.

There are a couple of ways we can improve this using vectorisation. The first is the obvious way and probably what most experienced Pandas coders would turn to as it seems natural. It’s quick, too.

# Start the timer
start_time = time.time()

# Multiply 'quantity' by 5 for Smartphones, by 2 for others
df.loc[df['product_names'] == 'Smartphone', 'quantity'] *= 5
df.loc[df['product_names'] != 'Smartphone', 'quantity'] *= 2

# Update 'total' based on new 'quantity'
df['total'] = df['quantity'] * df['price']

# Display the updated DataFrame
print(df.head())

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")

#
# Output
#

   order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   

  product_names   categories  quantity  price    total  
0    Smartphone  Electronics        15  90.02  1350.30  
1       Printer  Electronics        12  12.74   152.88  
2      Notebook   Stationery        16  48.35   773.60  
3        Laptop  Electronics         6  74.85   449.10  
4       Cabinet       Office        12  53.77   645.24  
Time taken : 0.14528226852416992 seconds

The second way I’ll show uses a NumPy function you might not have encountered before: numpy.where(). It’s even faster than the boolean-mask approach above. numpy.where() takes three arguments. The first is the condition you’re testing for; the second is what’s returned where the condition is True; the third is what’s returned where it’s False.
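Here's a quick standalone illustration of those three arguments on toy data (not our sales set):

# Toy example: condition, value-if-True, value-if-False
qty = np.array([1, 8, 3])
np.where(qty > 5, qty * 5, qty * 2)   # -> array([ 2, 40,  6])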

# Start the timer
start_time = time.time()

# Update 'quantity' using numpy.where
df['quantity'] = np.where(df['product_names'].values == 'Smartphone', df['quantity'].values * 5, df['quantity'].values * 2)

# Recalculate 'total' based on the new 'quantity'
df['total'] = df['quantity'] * df['price']

# Display the updated DataFrame
print(df.head())

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")

#
# Output
#

   order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   

  product_names   categories  quantity  price    total  
0    Smartphone  Electronics        15  90.02  1350.30  
1       Printer  Electronics        12  12.74   152.88  
2      Notebook   Stationery        16  48.35   773.60  
3        Laptop  Electronics         6  74.85   449.10  
4       Cabinet       Office        12  53.77   645.24  

Time taken : 0.026806116104125977 seconds

Using NumPy’s where(), in this case, was more than 5x faster than the boolean-mask vectorisation and, going by the timings above, a whopping 3,000x or so faster than using apply().

Example 3 – Vectorise multiple if/then/else conditions

Nested if/then/else conditions can be handled with nested numpy.where() calls, but such code quickly becomes harder to read and maintain when there are many cases. If you’ve ever used nested IF statements in Excel, you’ll know what I mean.
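For example, even two levels of nesting start to obscure the logic. This hypothetical sketch expresses "multiply Smartphone quantities over 5 by 1.5, other Smartphone quantities by 2, and everything else by 2.5":

# Nested np.where: workable at two levels, painful at five or six
df['quantity'] = np.where(
    df['product_names'] == 'Smartphone',
    np.where(df['quantity'] > 5, df['quantity'] * 1.5, df['quantity'] * 2),
    df['quantity'] * 2.5,
)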

In this instance, we can use another pretty cool NumPy function that not many people have heard of: select().

Like where(), select() takes three arguments. The first is a Python list of conditions to test for. The second is a Python list of choices that tells NumPy what to return when the corresponding condition is met. The third is a default value to return if none of the conditions are met.

The order of these is important. The first item in the choices list goes with the first item in the conditions list, and so on.

In a sense, it’s a bit like the case or switch statement you’d find in languages such as C or Java.
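Here's a minimal, self-contained illustration of that ordering with toy values (not our sales data):

x = np.array([1, 7, 12])
conditions = [x < 5, x < 10]        # tested in order; the first match wins
choices = [x * 10, x * 100]         # paired with conditions by position
np.select(conditions, choices, default=-1)   # -> array([ 10, 700,  -1])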

Take our previous example to an extreme. Say we want to multiply our initial quantity by a different amount for each different type of product.

import pandas as pd
import numpy as np
import time

# Start the timer
start_time = time.time()

# Define conditions
conditions = [
    df['product_names'] == 'Smartphone',
    df['product_names'] == 'Printer',
    df['product_names'] == 'Notebook',
    df['product_names'] == 'Laptop',
    df['product_names'] == 'Cabinet'
]

# Define choices
choices = [
    df['quantity'] * 1.5,
    df['quantity'] * 2.5,
    df['quantity'] * 3.5,
    df['quantity'] * 4.5,
    df['quantity'] * 5.5
]

# Default value if none of the conditions are met
default_choice = 0

# Update 'quantity' using numpy.select
df['quantity'] = np.select(conditions, choices, default=default_choice)

# Recalculate 'total' based on the new 'quantity'
df['total'] = df['quantity'] * df['price']

# Display the updated DataFrame
print(df.head(10))

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")

#
# Output
#

 order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   
5         5  2019-07-10          197  Customer_25587         202   
6         6  2014-11-12          510   Customer_6912         204   
7         7  2016-07-12          150  Customer_17761         200   
8         8  2016-11-12          997  Customer_23801         209   
9         9  2017-01-23          151  Customer_30325         207   

  product_names   categories  quantity  price     total  
0    Smartphone  Electronics       4.5  90.02   405.090  
1       Printer  Electronics      15.0  12.74   191.100  
2      Notebook   Stationery      28.0  48.35  1353.800  
3        Laptop  Electronics      13.5  74.85  1010.475  
4       Cabinet       Office      33.0  53.77  1774.410  
5          Desk       Office       0.0  47.17     0.000  
6       Monitor  Electronics       0.0  22.50     0.000  
7        Laptop  Electronics      40.5  49.33  1997.865  
8  Coffee Maker  Electronics       0.0  47.22     0.000  
9           Pen   Stationery       0.0   3.50     0.000  

Time taken : 0.26836657524108887 seconds

Example 4 – Vectorise nested multiple if/then/else conditions

My final example is similar to the previous one, but now the quantity for Smartphones only gets multiplied by 1.5 if the existing quantity for Smartphones is > 5. Otherwise, the quantity gets multiplied by 2.

The other conditions are unchanged. In other words, we have a nested if/then/else scenario. If this logic were in a regular Python function, it would look similar to this,

def adjust_quantity(row):
    if row['product_names'] == 'Smartphone':
        if row['quantity'] > 5:
            row['quantity'] *= 1.5
        else:
            row['quantity'] *= 2
    elif row['product_names'] == 'Printer':
        row['quantity'] *= 2.5
    elif row['product_names'] == 'Notebook':
        row['quantity'] *= 3.5
    elif row['product_names'] == 'Laptop':
        row['quantity'] *= 4.5
    elif row['product_names'] == 'Cabinet':
        row['quantity'] *= 5.5
    else:
        row['quantity'] = 0  # Default case if none of the conditions are met
    return row

Can we vectorise this? Yes, we can, by extending the same method as before. This time, we add the extra boolean tests to the conditions list and the corresponding actions to the choices list.

I won’t repeat the whole code, just the changes.

# Define conditions
conditions = [
  (df['product_names'] == 'Smartphone') & (df['quantity'] > 5),
  (df['product_names'] == 'Smartphone') & (df['quantity'] <= 5),
  df['product_names'] == 'Printer',
  df['product_names'] == 'Notebook',
  df['product_names'] == 'Laptop',
  df['product_names'] == 'Cabinet'
]

# Define choices
choices = [
  df['quantity'] * 1.5,
  df['quantity'] * 2,    
  df['quantity'] * 2.5,
  df['quantity'] * 3.5,
  df['quantity'] * 4.5,
  df['quantity'] * 5.5
]

Running our new code, we get this output,

   order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   
5         5  2019-07-10          197  Customer_25587         202   
6         6  2014-11-12          510   Customer_6912         204   
7         7  2016-07-12          150  Customer_17761         200   
8         8  2016-11-12          997  Customer_23801         209   
9         9  2017-01-23          151  Customer_30325         207   

  product_names   categories  quantity  price     total  
0    Smartphone  Electronics       6.0  90.02   540.120  
1       Printer  Electronics      15.0  12.74   191.100  
2      Notebook   Stationery      28.0  48.35  1353.800  
3        Laptop  Electronics      13.5  74.85  1010.475  
4       Cabinet       Office      33.0  53.77  1774.410  
5          Desk       Office       0.0  47.17     0.000  
6       Monitor  Electronics       0.0  22.50     0.000  
7        Laptop  Electronics      40.5  49.33  1997.865  
8  Coffee Maker  Electronics       0.0  47.22     0.000  
9           Pen   Stationery       0.0   3.50     0.000  
Time taken : 0.33173537254333496 seconds

The run time is slightly slower than the previous run as more work is being done.

Summary

Hopefully, you’ll take away two things from this article. The first is that vectorisation is an essential ingredient for performant Pandas code when dealing with medium to large data sets. The second is that even if the operations you need to apply to dataframe records are complex, they can still often be vectorised.

I explained why vectorisation in Pandas is essential for performant code, and I showed several examples of how we can use vectorisation even in complex data-processing scenarios.

That’s all from me for now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content.

If you liked this content, Medium thinks you’ll find these articles interesting, too.

Structured output in the OpenAI API

PySpark Explained: Delta Tables

Here’s the link to the story I referred to at the beginning about avoiding the use of the apply() method in Pandas when dealing with large dataframes.

Thinking of using Pandas apply() on a big dataframe: Stop! – read this instead
