
In one of the first articles I wrote on Medium, I talked about using the apply() method on Pandas dataframes and said it should be avoided, if possible, on larger dataframes. I’ll put a link to that article at the end of this one if you want to check it out.
Although I talked then a bit about possible alternatives, i.e. using vectorisation, I didn't give many examples, so I intend to remedy that here. Specifically, I want to talk about how NumPy and a couple of its lesser-known methods (where and select) can be used to speed up Pandas operations that involve complex if/then/else conditions.
Vectorisation in the context of Pandas refers to the method of applying operations to entire blocks of data at once rather than iterating through them row by row or element by element. This approach is possible due to Pandas' reliance on NumPy, which supports vectorised operations that are highly optimised and written in C, enabling faster processing. When you use vectorised operations in Pandas, such as applying arithmetic operations or functions to DataFrame or Series objects, the operations are dispatched to multiple data elements simultaneously.
This not only leads to more concise and readable code but can significantly boost performance by reducing the overhead of Python loops and taking advantage of modern CPUs' capabilities to perform operations on multiple data points in parallel. Vectorisation is a key feature that makes Pandas powerful for data manipulation and analysis in Python.
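To see the difference in miniature, here is a quick sketch comparing an explicit Python loop with the equivalent single NumPy expression. The 10-million element array is an arbitrary choice for this illustration, and exact timings will vary by machine:
import time
import numpy as np
# An arbitrary array of 10 million random floats
data = np.random.rand(10_000_000)
# Non-vectorised: an explicit Python loop over every element
start = time.time()
total = 0.0
for x in data:
    total += x * 2.0
print(f"Python loop: {time.time() - start:.3f} seconds")
# Vectorised: one NumPy expression applied to the whole array at once
start = time.time()
total = (data * 2.0).sum()
print(f"Vectorised:  {time.time() - start:.3f} seconds")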
The problem is that some vectorisation operations are obvious and happen without you even realising that's what's being done under the hood. Often, though, when the operations you want to perform become more complex, it's difficult to see how vectorisation can be applied.
In this article, I’ll discuss some common scenarios in which this is the case and show you how vectorisation can be applied.
Getting some data to work with
In many of my articles looking at the performance of Python libraries and database systems, I use a synthetic set of sales data for my testing. The schema of this data set looks like this:
order_id (int)
order_date (date)
customer_id (int)
customer_name (str)
product_id (int)
product_name (str)
category (str)
quantity (int)
price (float)
total (float)
Here is a Python program that you can use to generate such a data set. It produces a CSV file; the number of records to create and the location of the output file are configurable. It uses the NumPy and Polars libraries, so you must install these before running it.
For this test, as I have a fairly high-spec PC, I’ll create and use a 1 million record CSV.
# generate the 1m record CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )
    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )
    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100
    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days
    # Create random dates as np.array and convert to string format
    order_dates = np.array(
        [
            (start_date + timedelta(days=np.random.randint(0, date_range))).strftime("%Y-%m-%d")
            for _ in range(nrows)
        ]
    )
    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }
    # Create Polars DataFrame and write to CSV with explicit delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=",", include_header=True)  # Ensure comma is used as the delimiter

# Generate data with random order_date and save to CSV
generate(1_000_000, "/mnt/d/sales_data/sales_data_1m.csv")
Setting up our development environment
Before we get to the example code, let's set up a separate development environment. That way, what we do won't interfere with other versions of libraries we might have on the go for other projects we're working on.
I use Miniconda for this, but you can use whatever method suits you best.
If you want to go down the Miniconda route and don't already have it, you must install Miniconda first; you can download it from the Miniconda website.
Once the environment is created, switch to it using the activate command, and then install Jupyter and any required Python libraries.
# Create our test environment
(base) C:\Users\thoma>conda create -n pandas_vect python=3.12 -y
# Now activate it
(base) C:\Users\thoma>conda activate pandas_vect
# Install python libraries, etc ...
(pandas_vect) C:\Users\thoma>conda install pandas numpy jupyter -y
Now type jupyter notebook into your command prompt. You should see a Jupyter notebook open in your browser. If that doesn't happen automatically, you'll likely see a screenful of information after the jupyter notebook command. Near the bottom, there will be a URL that you should copy and paste into your browser to launch the Jupyter notebook.
Your URL will be different to mine, but it should look something like this:
http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69da
The code
To begin, let’s read our data set into a Pandas dataframe. We’ll time everything to get an idea of any speed-ups we gain.
import pandas as pd
import numpy as np
import time
# Start the timer
start_time = time.time()
# Path to the CSV file
file_path = r"d:\sales_data\sales_data_1m.csv"
# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)
# Display the first five rows of the DataFrame
print(df.head())
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken to read the CSV file : {elapsed_time} seconds")
#
# Here is our output
#
order_id order_date customer_id customer_name product_id
0 0 2022-08-01 245 Customer_884 201
1 1 2022-02-19 701 Customer_1672 205
2 2 2017-01-01 184 Customer_21720 208
3 3 2013-03-09 275 Customer_23770 200
4 4 2022-04-23 960 Customer_23790 210
product_names categories quantity price total
0 Smartphone Electronics 3 90.02 270.06
1 Printer Electronics 6 12.74 76.44
2 Notebook Stationery 8 48.35 386.80
3 Laptop Electronics 3 74.85 224.55
4 Cabinet Office 6 53.77 322.62
Time taken to read the CSV file : 1.0295870304107666 seconds
Example 1 – Setting the scene
So, many operations you do on a Pandas dataframe are inherently vectorised. For example, suppose we want to multiply the quantity field by 5 and update the total value column.
# Start the timer
start_time = time.time()
df['quantity'] *= 5
# Update 'total' to reflect the new 'quantity'
df['total'] = df['quantity'] * df['price']
# Display the updated DataFrame
print(df.head())
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")
#
# Output
#
order_id order_date customer_id customer_name product_id
0 0 2022-08-01 245 Customer_884 201
1 1 2022-02-19 701 Customer_1672 205
2 2 2017-01-01 184 Customer_21720 208
3 3 2013-03-09 275 Customer_23770 200
4 4 2022-04-23 960 Customer_23790 210
product_names categories quantity price total
0 Smartphone Electronics 15 90.02 1350.30
1 Printer Electronics 30 12.74 382.20
2 Notebook Stationery 40 48.35 1934.00
3 Laptop Electronics 15 74.85 1122.75
4 Cabinet Office 30 53.77 1613.10
Time taken : 0.009307861328125 seconds
It took less than one-hundredth of a second to process 1 million records. How do we know the above was a vectorised operation? Apart from the minuscule run time, we can check how long it takes to do the same thing with a non-vectorised method: specifically, the apply() method.
If you aren't very experienced with Pandas, apply() may well be the method you would have reached for in the first place.
Using the apply() method
apply() lets you run a function against every record in a dataframe. This should be quicker than using a for loop to iterate over the rows, but how does it compare to the original vectorised code?
# Start the timer
start_time = time.time()
# Define the function to update 'quantity' and 'total'
def update_row(row):
    row['quantity'] *= 5
    row['total'] = row['quantity'] * row['price']
    return row
# Apply the function to each row
df = df.apply(update_row, axis=1)
# Display the updated DataFrame
print(df.head())
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")
#
# Output
#
order_id order_date customer_id customer_name product_id
0 0 2022-08-01 245 Customer_884 201
1 1 2022-02-19 701 Customer_1672 205
2 2 2017-01-01 184 Customer_21720 208
3 3 2013-03-09 275 Customer_23770 200
4 4 2022-04-23 960 Customer_23790 210
product_names categories quantity price total
0 Smartphone Electronics 15 90.02 1350.30
1 Printer Electronics 30 12.74 382.20
2 Notebook Stationery 40 48.35 1934.00
3 Laptop Electronics 15 74.85 1122.75
4 Cabinet Office 30 53.77 1613.10
Time taken : 75.53943586349487 seconds
So, as you can see from the timing of the above operation, we've established that vectorisation is essential when processing large datasets in Pandas, and that for simpler problems it often just happens "behind the scenes".
But what happens when our coding needs are slightly more complicated?
Example 2 – Vectorise an if/then/else condition
In this example, say we want to do the same operation as before (multiply quantity by 5), but this time only for Smartphone products. For any other product, we want to multiply quantity by 2.
Here is the naive apply() implementation.
def update_record(row):
    if row['product_names'] == 'Smartphone':
        row['quantity'] *= 5
    else:
        row['quantity'] *= 2
    row['total'] = row['quantity'] * row['price']
    return row
# Start the timer
start_time = time.time()
# Apply the update_record function to each row
df = df.apply(update_record, axis=1)
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
# Display the updated DataFrame
print(df)
print(f"Time taken: {elapsed_time} seconds")
order_id order_date customer_id customer_name product_id
0 0 2022-08-01 245 Customer_884 201
1 1 2022-02-19 701 Customer_1672 205
2 2 2017-01-01 184 Customer_21720 208
3 3 2013-03-09 275 Customer_23770 200
4 4 2022-04-23 960 Customer_23790 210
... ... ... ... ... ...
999995 999995 2011-05-08 408 Customer_26518 202
999996 999996 2019-02-11 850 Customer_4581 208
999997 999997 2021-11-19 399 Customer_28681 205
999998 999998 2016-05-02 714 Customer_12693 209
999999 999999 2018-08-12 324 Customer_28553 207
product_names categories quantity price total
0 Smartphone Electronics 15 90.02 1350.30
1 Printer Electronics 12 12.74 152.88
2 Notebook Stationery 16 48.35 773.60
3 Laptop Electronics 6 74.85 449.10
4 Cabinet Office 12 53.77 645.24
... ... ... ... ... ...
999995 Desk Office 12 32.29 387.48
999996 Notebook Stationery 6 8.16 48.96
999997 Printer Electronics 6 92.69 556.14
999998 Coffee Maker Electronics 4 18.10 72.40
999999 Pen Stationery 16 93.04 1488.64
[1000000 rows x 10 columns]
Time taken: 78.65310955047607 seconds
At almost 80 seconds, the run time is far too long.
There are a couple of ways we can improve this using vectorisation. The first is the obvious one, and probably what most experienced Pandas coders would reach for, as it feels natural. It's quick, too.
# Start the timer
start_time = time.time()
# Multiply 'quantity' by 5 for Smartphones, by 2 for others
df.loc[df['product_names'] == 'Smartphone', 'quantity'] *= 5
df.loc[df['product_names'] != 'Smartphone', 'quantity'] *= 2
# Update 'total' based on new 'quantity'
df['total'] = df['quantity'] * df['price']
# Display the updated DataFrame
print(df.head())
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")
#
# Output
#
order_id order_date customer_id customer_name product_id
0 0 2022-08-01 245 Customer_884 201
1 1 2022-02-19 701 Customer_1672 205
2 2 2017-01-01 184 Customer_21720 208
3 3 2013-03-09 275 Customer_23770 200
4 4 2022-04-23 960 Customer_23790 210
product_names categories quantity price total
0 Smartphone Electronics 15 90.02 1350.30
1 Printer Electronics 12 12.74 152.88
2 Notebook Stationery 16 48.35 773.60
3 Laptop Electronics 6 74.85 449.10
4 Cabinet Office 12 53.77 645.24
Time taken : 0.14528226852416992 seconds
The second way I'll show uses a NumPy method you might not have encountered before: numpy.where(). It's even faster than the regular vectorised approach. numpy.where() takes three arguments: the first is the condition you're testing for, the second is what is returned where the condition is True, and the third is what is returned where it is False. Note that in the code below I pass the underlying NumPy arrays (via .values) rather than the Pandas Series, which shaves off a little more Pandas overhead.
# Start the timer
start_time = time.time()
# Update 'quantity' using numpy.where
df['quantity'] = np.where(df['product_names'].values == 'Smartphone', df['quantity'].values * 5, df['quantity'].values * 2)
# Recalculate 'total' based on the new 'quantity'
df['total'] = df['quantity'] * df['price']
# Display the updated DataFrame
print(df.head())
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")
#
# Output
#
order_id order_date customer_id customer_name product_id
0 0 2022-08-01 245 Customer_884 201
1 1 2022-02-19 701 Customer_1672 205
2 2 2017-01-01 184 Customer_21720 208
3 3 2013-03-09 275 Customer_23770 200
4 4 2022-04-23 960 Customer_23790 210
product_names categories quantity price total
0 Smartphone Electronics 15 90.02 1350.30
1 Printer Electronics 12 12.74 152.88
2 Notebook Stationery 16 48.35 773.60
3 Laptop Electronics 6 74.85 449.10
4 Cabinet Office 12 53.77 645.24
Time taken : 0.026806116104125977 seconds
Using NumPy in this case was more than 5x faster than the regular vectorised approach and almost 3,000x faster than using apply().
Example 3 – Vectorise multiple if/then/else conditions
Nested if/then/else conditions can be handled with nested numpy.where() calls, but such code quickly gets harder to read and maintain as the number of cases grows. If you've ever used nested IF statements in Excel, you'll know what I mean.
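To make that concrete, here is a sketch, purely for illustration, of what just two cases would look like with nested np.where() calls (the products are borrowed from our data set; the multipliers are arbitrary):
# Two cases handled with nested np.where calls - readable at this
# depth, but each extra case adds another level of nesting
df['quantity'] = np.where(
    df['product_names'] == 'Smartphone', df['quantity'] * 5,
    np.where(df['product_names'] == 'Printer', df['quantity'] * 2.5,
             df['quantity'] * 2)
)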
In this instance, we can use another pretty cool NumPy method not many people have heard of, which is called select().
Like where(), select() also takes three arguments. The first is a Python list of conditions to test for. The second is a Python list of choices that tells NumPy what to return where the corresponding condition is met. The third is a default value to return where none of the conditions are met.
The order of these is important. The first item in the choices list goes with the first item in the conditions list, and so on.
In a sense, it's a bit like a switch/case statement in C or Java, or a match statement in modern Python.
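To loosely illustrate that analogy, here is a sketch (Python 3.10+) of the same kind of dispatch written as a scalar match statement; quantity_multiplier is a hypothetical helper, not part of the example code that follows:
# Hypothetical scalar analogue of np.select's conditions/choices pairing
def quantity_multiplier(product_name: str) -> float:
    match product_name:
        case 'Smartphone':
            return 1.5
        case 'Printer':
            return 2.5
        case _:  # plays the same role as np.select's 'default' argument
            return 0.0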
Take our previous example to an extreme. Say we want to multiply our initial quantity by a different amount for each different type of product.
import pandas as pd
import numpy as np
import time
# Start the timer
start_time = time.time()
# Define conditions
conditions = [
    df['product_names'] == 'Smartphone',
    df['product_names'] == 'Printer',
    df['product_names'] == 'Notebook',
    df['product_names'] == 'Laptop',
    df['product_names'] == 'Cabinet'
]
# Define choices
choices = [
    df['quantity'] * 1.5,
    df['quantity'] * 2.5,
    df['quantity'] * 3.5,
    df['quantity'] * 4.5,
    df['quantity'] * 5.5
]
# Default value if none of the conditions are met
default_choice = 0
# Update 'quantity' using numpy.select
df['quantity'] = np.select(conditions, choices, default=default_choice)
# Recalculate 'total' based on the new 'quantity'
df['total'] = df['quantity'] * df['price']
# Display the updated DataFrame
print(df.head(10))
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")
#
# Output
#
order_id order_date customer_id customer_name product_id
0 0 2022-08-01 245 Customer_884 201
1 1 2022-02-19 701 Customer_1672 205
2 2 2017-01-01 184 Customer_21720 208
3 3 2013-03-09 275 Customer_23770 200
4 4 2022-04-23 960 Customer_23790 210
5 5 2019-07-10 197 Customer_25587 202
6 6 2014-11-12 510 Customer_6912 204
7 7 2016-07-12 150 Customer_17761 200
8 8 2016-11-12 997 Customer_23801 209
9 9 2017-01-23 151 Customer_30325 207
product_names categories quantity price total
0 Smartphone Electronics 4.5 90.02 405.090
1 Printer Electronics 15.0 12.74 191.100
2 Notebook Stationery 28.0 48.35 1353.800
3 Laptop Electronics 13.5 74.85 1010.475
4 Cabinet Office 33.0 53.77 1774.410
5 Desk Office 0.0 47.17 0.000
6 Monitor Electronics 0.0 22.50 0.000
7 Laptop Electronics 40.5 49.33 1997.865
8 Coffee Maker Electronics 0.0 47.22 0.000
9 Pen Stationery 0.0 3.50 0.000
Time taken : 0.26836657524108887 seconds
Example 4 – Vectorise nested multiple if/then/else conditions
My final example is similar to the previous one, but now the quantity for Smartphones only gets multiplied by 1.5 if the existing quantity for Smartphones is > 5. Otherwise, the quantity gets multiplied by 2.
The other conditions are unchanged. In other words, we have a nested if/then/else scenario. If this logic were written as a regular Python function, it would look something like this:
def adjust_quantity(row):
    if row['product_names'] == 'Smartphone':
        if row['quantity'] > 5:
            row['quantity'] *= 1.5
        else:
            row['quantity'] *= 2
    elif row['product_names'] == 'Printer':
        row['quantity'] *= 2.5
    elif row['product_names'] == 'Notebook':
        row['quantity'] *= 3.5
    elif row['product_names'] == 'Laptop':
        row['quantity'] *= 4.5
    elif row['product_names'] == 'Cabinet':
        row['quantity'] *= 5.5
    else:
        row['quantity'] = 0  # Default case if none of the conditions are met
    return row
Can we vectorise this? Yes, we can, by modifying the same method as before. This time, we add the extra boolean tests to the conditions list and the corresponding actions to the choices list.
I won't repeat the whole code, just the changes.
# Define conditions
conditions = [
    (df['product_names'] == 'Smartphone') & (df['quantity'] > 5),
    (df['product_names'] == 'Smartphone') & (df['quantity'] <= 5),
    df['product_names'] == 'Printer',
    df['product_names'] == 'Notebook',
    df['product_names'] == 'Laptop',
    df['product_names'] == 'Cabinet'
]
# Define choices
choices = [
    df['quantity'] * 1.5,
    df['quantity'] * 2,
    df['quantity'] * 2.5,
    df['quantity'] * 3.5,
    df['quantity'] * 4.5,
    df['quantity'] * 5.5
]
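The np.select call itself and the recalculation of total are exactly as in Example 3:
# Update 'quantity' using numpy.select (unchanged from Example 3)
df['quantity'] = np.select(conditions, choices, default=0)
# Recalculate 'total' based on the new 'quantity'
df['total'] = df['quantity'] * df['price']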
Running our new code, we get this output,
order_id order_date customer_id customer_name product_id
0 0 2022-08-01 245 Customer_884 201
1 1 2022-02-19 701 Customer_1672 205
2 2 2017-01-01 184 Customer_21720 208
3 3 2013-03-09 275 Customer_23770 200
4 4 2022-04-23 960 Customer_23790 210
5 5 2019-07-10 197 Customer_25587 202
6 6 2014-11-12 510 Customer_6912 204
7 7 2016-07-12 150 Customer_17761 200
8 8 2016-11-12 997 Customer_23801 209
9 9 2017-01-23 151 Customer_30325 207
product_names categories quantity price total
0 Smartphone Electronics 6.0 90.02 540.120
1 Printer Electronics 15.0 12.74 191.100
2 Notebook Stationery 28.0 48.35 1353.800
3 Laptop Electronics 13.5 74.85 1010.475
4 Cabinet Office 33.0 53.77 1774.410
5 Desk Office 0.0 47.17 0.000
6 Monitor Electronics 0.0 22.50 0.000
7 Laptop Electronics 40.5 49.33 1997.865
8 Coffee Maker Electronics 0.0 47.22 0.000
9 Pen Stationery 0.0 3.50 0.000
Time taken : 0.33173537254333496 seconds
The run time is slightly slower than the previous run as more work is being done.
Summary
Hopefully, you’ll take away two things from this article. The first is that vectorisation is an essential ingredient for performant Pandas code when dealing with medium to large data sets. The second is that even if the operations you need to apply to dataframe records are complex, they can still often be vectorised.
I explained why vectorisation in Pandas is essential for performant code, and I showed several examples of how to use vectorisation even in complex data-processing scenarios.
That's all from me for now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content.
Here’s the link to the story I referred to at the beginning about avoiding the use of the apply() method in Pandas when dealing with large dataframes.
Thinking of using Pandas apply() on a big dataframe: Stop! – read this instead