
As some of you may know, I’m a big fan of the DuckDB Python library, and I’ve written many articles on it. I was also one of the first to write an article about an even newer Python library called Fireducks and helped bring that to people’s attention.
If you’ve never heard of these useful libraries, check out the links below for an introduction to them.
Both libraries are taking on an increasing share of data science workloads, where it could be argued that data manipulation and general wrangling are at least as important as the analysis and insight that the machine learning side brings.
The core foundations of both tools are very different; DuckDB is a modern, embedded analytics database designed for efficient processing and querying of gigabytes of data from various sources. Fireducks is designed to be a much faster replacement for Pandas.
Their key commonality, however, is that they are both highly performant for general mid-sized data processing tasks. If that’s your use case, which one should you choose? That’s what we’ll find out today.
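To make the contrast concrete, here is a minimal sketch of the same toy aggregation in each library. The file name and column names here are placeholders for illustration, not the benchmark data we'll use below.
import duckdb
import fireducks.pandas as pd
# DuckDB: SQL against an embedded analytics engine
duckdb.sql("SELECT category, SUM(total) AS sales FROM 'sales.csv' GROUP BY category").show()
# Fireducks: a drop-in replacement for the Pandas API
df = pd.read_csv("sales.csv")
print(df.groupby("category")["total"].sum())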
Here are the tests I’ll perform.
- Read a large CSV file into memory, i.e. into a DuckDB table and a Fireducks dataframe
- Perform some typical data processing tasks against both sets of in-memory data
- Create a new column in the in-memory data sets based on existing table/dataframe column data
- Write out the updated in-memory data sets as CSV and Parquet
Input data set
I created a CSV file with fake sales data containing 100 million records.
The schema of the input data is as follows:
- order_id (int)
- order_date (date)
- customer_id (int)
- customer_name (str)
- product_id (int)
- product_names (str)
- categories (str)
- quantity (int)
- price (float)
- total (float)
Here is a Python program you can use to create the CSV file. On my system, this resulted in a file of approximately 7.5GB.
# generate the 100m CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )
    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )
    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100
    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days
    # Create random dates as np.array and convert to string format
    order_dates = np.array([(start_date + timedelta(days=np.random.randint(0, date_range))).strftime('%Y-%m-%d') for _ in range(nrows)])
    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }
    # Create Polars DataFrame and write to CSV with explicit delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=',', include_header=True)  # Ensure comma is used as the delimiter

# Generate data with random order_date and save to CSV
generate(100_000_000, "/mnt/d/sales_data/sales_data_100m.csv")
Installing WSL2 Ubuntu
Fireducks only runs under Linux, so as I usually run Windows, I’ll be using WSL2 Ubuntu for my Linux environment, but the same code should work on any Linux/Unix setup. I have a full guide on installing WSL2 here.
Setting up a dev environment
OK, we should set up a separate development environment before starting our coding examples. That way, whatever we do won’t interfere with the library versions and other projects we might have on the go.
I use Miniconda for this, but you can use whatever method suits you best.
If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. Get it using this link,
Once the environment is created, switch to it using the activate command, and then install Jupyter and any required Python libraries.
#create our test environment
(base) $ conda create -n duck_battle python=3.11 -y
# Now activate it
(base) $ conda activate duck_battle
# Install python libraries, etc ...
(duck_battle) $ pip install jupyter fireducks duckdb
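With the environment ready, here is an optional sanity check that the 100-million-row file is intact. This is a quick sketch, not part of any timed benchmark; it uses DuckDB to count the rows straight off the CSV.
import duckdb
# Count the rows directly off the CSV without materialising a table
print(duckdb.sql("SELECT COUNT(*) AS row_count FROM read_csv('/mnt/d/sales_data/sales_data_100m.csv', header=true)"))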
Test 1 – Read a large CSV file and display the last 10 records
DuckDB
import duckdb
print(duckdb.__version__)
'1.1.3'
# DuckDB read CSV file
#
import duckdb
import time
# Start the timer
start_time = time.time()
# Create a connection to an in-memory DuckDB database
con = duckdb.connect(':memory:')
# Create a table from the CSV file
con.execute(f"CREATE TABLE sales AS SELECT * FROM read_csv('/mnt/d/sales_data/sales_data_100m.csv',header=true)")
# Fetch the last 10 rows
query = "SELECT * FROM sales ORDER BY rowid DESC LIMIT 10"
df = con.execute(query).df()
# Display the last 10 rows
print("nLast 10 rows of the file:")
print(df)
# End the timer and calculate the total elapsed time
total_elapsed_time = time.time() - start_time
print(f"DuckDB: Time taken to read the CSV file and display the last 10 records: {total_elapsed_time} seconds")
#
# DuckDB output
#
Last 10 rows of the file:
order_id order_date customer_id customer_name product_id product_names
0 99999999 2023-06-16 102 Customer_9650 203 Chair
1 99999998 2022-03-02 709 Customer_23966 208 Notebook
2 99999997 2019-05-10 673 Customer_25709 202 Desk
3 99999996 2011-10-21 593 Customer_29352 200 Laptop
4 99999995 2011-10-24 501 Customer_29289 202 Desk
5 99999994 2023-09-27 119 Customer_15532 209 Coffee Maker
6 99999993 2015-01-15 294 Customer_27081 200 Laptop
7 99999992 2016-04-07 379 Customer_1353 207 Pen
8 99999991 2010-09-19 253 Customer_29439 204 Monitor
9 99999990 2016-05-19 174 Customer_11294 210 Cabinet
categories quantity price total
0 Office 4 59.58 238.32
1 Stationery 1 78.91 78.91
2 Office 5 9.12 45.60
3 Electronics 3 67.42 202.26
4 Office 7 53.78 376.46
5 Electronics 2 55.10 110.20
6 Electronics 9 86.01 774.09
7 Stationery 5 21.56 107.80
8 Electronics 4 5.17 20.68
9 Office 9 65.10 585.90
DuckDB: Time taken to read the CSV file and display the last 10 records: 59.23184013366699 seconds
Fireducks
import fireducks
import fireducks.pandas as pd
print(fireducks.__version__)
print(pd.__version__)
1.1.6
2.2.3
# Fireducks read CSV
#
import fireducks.pandas as pd
import time
# Start the timer
start_time = time.time()
# Path to the CSV file
file_path = "/mnt/d/sales_data/sales_data_100m.csv"
# Read the CSV file into a DataFrame
df_fire = pd.read_csv(file_path)
# Display the last 10 rows of the DataFrame
print(df_fire.tail(10))
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to read the CSV file and display the last 10 records: {elapsed_time} seconds")
#
# Fireducks output
#
order_id order_date customer_id customer_name product_id
99999990 99999990 2016-05-19 174 Customer_11294 210
99999991 99999991 2010-09-19 253 Customer_29439 204
99999992 99999992 2016-04-07 379 Customer_1353 207
99999993 99999993 2015-01-15 294 Customer_27081 200
99999994 99999994 2023-09-27 119 Customer_15532 209
99999995 99999995 2011-10-24 501 Customer_29289 202
99999996 99999996 2011-10-21 593 Customer_29352 200
99999997 99999997 2019-05-10 673 Customer_25709 202
99999998 99999998 2022-03-02 709 Customer_23966 208
99999999 99999999 2023-06-16 102 Customer_9650 203
product_names categories quantity price total
99999990 Cabinet Office 9 65.10 585.90
99999991 Monitor Electronics 4 5.17 20.68
99999992 Pen Stationery 5 21.56 107.80
99999993 Laptop Electronics 9 86.01 774.09
99999994 Coffee Maker Electronics 2 55.10 110.20
99999995 Desk Office 7 53.78 376.46
99999996 Laptop Electronics 3 67.42 202.26
99999997 Desk Office 5 9.12 45.60
99999998 Notebook Stationery 1 78.91 78.91
99999999 Chair Office 4 59.58 238.32
Fireducks: Time taken to read the CSV file and display the last 10 records: 65.69259881973267 seconds
There is not much in it; DuckDB edges it by about 6 seconds.
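One aside before moving on: with DuckDB, materialising a table isn’t compulsory. If you only need a single pass over the file, you can query the CSV in place; a small, untimed sketch:
import duckdb
# Query the CSV directly with no CREATE TABLE step
duckdb.sql("""
SELECT * FROM read_csv('/mnt/d/sales_data/sales_data_100m.csv', header=true)
LIMIT 10
""").show()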
Test 2— Calculate total sales by category
DuckDB
# duckdb process data
#
import duckdb
import time
# Start total runtime timer
query_sql="""
SELECT
categories,
SUM(total) AS total_sales
FROM sales
GROUP BY categories
ORDER BY total_sales DESC
"""
start_time = time.time()
# Total sales by category
results = con.execute(query_sql).df()
print(f"DuckDB: Time for sales by category calculation: {time.time() - start_time} seconds")
results
#
# DuckDb output
#
DuckDB: Time for sales by category calculation: 0.1401681900024414 seconds
categories total_sales
0 Electronics 1.168493e+10
1 Stationery 7.014109e+09
2 Office 7.006807e+09
3 Sundry 2.338428e+09
Fireducks
import fireducks.pandas as pd
import time
# Start the timer
start_time = time.time()
total_sales_by_category = df_fire.groupby('categories')['total'].sum().sort_values(ascending=False)
print(total_sales_by_category)
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate sales by category: {elapsed_time} seconds")
#
# Fireducks output
#
categories
Electronics 1.168493e+10
Stationery 7.014109e+09
Office 7.006807e+09
Sundry 2.338428e+09
Name: total, dtype: float64
Fireducks: Time taken to calculate sales by category: 0.13571524620056152 seconds
There is not much in it there, either. Fireducks shades it.
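A benchmarking caveat worth flagging at this point: Fireducks executes lazily, building a query plan and only running it when a result is actually needed, so timings must include an operation that forces execution; the print() calls in these tests do that. Fireducks’ benchmarking docs also describe a _evaluate() method for exactly this purpose; a hedged sketch, assuming that method behaves as documented:
import time
import fireducks.pandas as pd
start_time = time.time()
result = df_fire.groupby('categories')['total'].sum()
result._evaluate()  # assumption: forces the lazy pipeline to run, per Fireducks' benchmarking docs
print(f"Fireducks (forced evaluation): {time.time() - start_time} seconds")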
Test 3— Top 5 customer spend
DuckDB
# duckdb process data
#
import duckdb
import time
# Start total runtime timer
query_sql="""
SELECT
customer_id,
customer_name,
SUM(total) AS total_purchase
FROM sales
GROUP BY customer_id, customer_name
ORDER BY total_purchase DESC
LIMIT 5
"""
start_time = time.time()
# Top 5 customers by total purchase
results = con.execute(query_sql).df()
print(f"DuckdDB: Time to calculate top 5 customers: {time.time() - start_time} seconds")
results
#
# DuckDb output
#
DuckDB: Time to calculate top 5 customers: 1.4588654041290283 seconds
customer_id customer_name total_purchase
0 681 Customer_20387 6892.96
1 740 Customer_30499 6613.11
2 389 Customer_22686 6597.35
3 316 Customer_185 6565.38
4 529 Customer_1609 6494.35
Fireducks
import fireducks.pandas as pd
import time
# Start the timer
start_time = time.time()
top_5_customers = df_fire.groupby(['customer_id', 'customer_name'])['total'].sum().sort_values(ascending=False).head(5)
print(top_5_customers)
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate top 5 customers: {elapsed_time} seconds")
#
# Fireducks output
#
customer_id customer_name
681 Customer_20387 6892.96
740 Customer_30499 6613.11
389 Customer_22686 6597.35
316 Customer_1859 6565.38
529 Customer_1609 6494.35
Name: total, dtype: float64
Fireducks: Time taken to calculate top 5 customers: 2.823930263519287 seconds
DuckDB wins that one, being almost twice as fast as Fireducks.
Test 4— Monthly sales figures
DuckDB
import duckdb
import time
# Start total runtime timer
query_sql="""
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total) AS monthly_sales
FROM sales
GROUP BY month
ORDER BY month
"""
start_time = time.time()
# Monthly sales totals
results = con.execute(query_sql).df()
print(f"DuckDB: Time for seasonal trend calculation: {time.time() - start_time} seconds")
results
#
# DuckDB output
#
DuckDB: Time for seasonal trend calculation: 0.16109275817871094 seconds
month monthly_sales
0 2010-01-01 1.699500e+08
1 2010-02-01 1.535730e+08
2 2010-03-01 1.702968e+08
3 2010-04-01 1.646421e+08
4 2010-05-01 1.704506e+08
... ... ...
163 2023-08-01 1.699263e+08
164 2023-09-01 1.646018e+08
165 2023-10-01 1.692184e+08
166 2023-11-01 1.644883e+08
167 2023-12-01 1.643962e+08
168 rows × 2 columns
Fireducks
import fireducks.pandas as pd
import time
def seasonal_trend():
    # Ensure 'order_date' is datetime
    df_fire['order_date'] = pd.to_datetime(df_fire['order_date'])
    # Extract 'month' as string
    df_fire['month'] = df_fire['order_date'].dt.strftime('%Y-%m')
    # Group by 'month' and sum 'total'
    results = (
        df_fire.groupby('month')['total']
        .sum()
        .reset_index()
        .sort_values('month')
    )
    print(results)
start_time = time.time()
seasonal_trend()
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time for seasonal trend calculation: {time.time() - start_time} seconds")
#
# Fireducks Output
#
month total
0 2010-01 1.699500e+08
1 2010-02 1.535730e+08
2 2010-03 1.702968e+08
3 2010-04 1.646421e+08
4 2010-05 1.704506e+08
.. ... ...
163 2023-08 1.699263e+08
164 2023-09 1.646018e+08
165 2023-10 1.692184e+08
166 2023-11 1.644883e+08
167 2023-12 1.643962e+08
[168 rows x 2 columns]
Fireducks: Time for seasonal trend calculation: 3.109074354171753 seconds
DuckDB was significantly quicker on this one, roughly 19 times faster.
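Some of that gap may be down to the per-row strftime() string formatting in the Fireducks version rather than the group-by itself; DuckDB’s DATE_TRUNC never leaves the date domain. A hedged alternative sketch that avoids the string conversion, assuming Fireducks matches Pandas’ to_period() behaviour:
import fireducks.pandas as pd
# Group on a month Period instead of a formatted string
# (assumes 'order_date' has already been converted to datetime, as above)
monthly = (
    df_fire.groupby(df_fire['order_date'].dt.to_period('M'))['total']
    .sum()
    .sort_index()
)
print(monthly)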
Test 5— Average order by product
DuckDB
import duckdb
import time
# Start total runtime timer
query_sql="""
SELECT
product_id,
product_names,
AVG(total) AS avg_order_value
FROM sales
GROUP BY product_id, product_names
ORDER BY avg_order_value DESC
"""
start_time = time.time()
# Average order value by product
results = con.execute(query_sql).df()
print(f"DuckDB: Time for average order by product calculation: {time.time() - start_time} seconds")
results
#
# DuckDb output
#
DuckDB: Time for average order by product calculation: 0.13720130920410156 seconds
product_id product_names avg_order_value
0 206 Paper 280.529144
1 208 Notebook 280.497268
2 201 Smartphone 280.494779
3 207 Pen 280.491508
4 205 Printer 280.470150
5 200 Laptop 280.456913
6 209 Coffee Maker 280.445365
7 211 Plastic Cups 280.440161
8 210 Cabinet 280.426960
9 202 Desk 280.367135
10 203 Chair 280.364045
11 204 Monitor 280.329706
Fireducks
import fireducks.pandas as pd
import time
# Start the timer
start_time = time.time()
avg_order_value = df_fire.groupby(['product_id', 'product_names'])['total'].mean().sort_values(ascending=False)
print(avg_order_value)
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time for average order calculation: {time.time() - start_time} seconds")
#
# Fireducks output
#
product_id product_names
206 Paper 280.529144
208 Notebook 280.497268
201 Smartphone 280.494779
207 Pen 280.491508
205 Printer 280.470150
200 Laptop 280.456913
209 Coffee Maker 280.445365
211 Plastic Cups 280.440161
210 Cabinet 280.426960
202 Desk 280.367135
203 Chair 280.364045
204 Monitor 280.329706
Name: total, dtype: float64
Fireducks: Time for average order calculation: 0.06766319274902344 seconds
Fireducks gets one back there and was twice as fast as DuckDB.
Test 6— Product performance analysis
DuckDB
import duckdb
import time
# Start total runtime timer
query_sql="""
WITH yearly_sales AS (
SELECT
EXTRACT(YEAR FROM order_date) AS year,
SUM(total) AS total_sales
FROM sales
GROUP BY year
)
SELECT
year,
total_sales,
LAG(total_sales) OVER (ORDER BY year) AS prev_year_sales,
(total_sales - LAG(total_sales) OVER (ORDER BY year)) / LAG(total_sales) OVER (ORDER BY year) * 100 AS yoy_growth
FROM yearly_sales
ORDER BY year
"""
start_time = time.time()
# Year-on-year sales growth
results = con.execute(query_sql).df()
print(f"DuckDB: Time for product performance analysis calculation: {time.time() - start_time} seconds")
results
#
# DuckDb output
#
DuckDB: Time for product performance analysis calculation: 0.03958845138549805 seconds
year total_sales prev_year_sales yoy_growth
0 2010 2.002066e+09 NaN NaN
1 2011 2.002441e+09 2.002066e+09 0.018739
2 2012 2.008966e+09 2.002441e+09 0.325848
3 2013 2.002901e+09 2.008966e+09 -0.301900
4 2014 2.000773e+09 2.002901e+09 -0.106225
5 2015 2.001931e+09 2.000773e+09 0.057855
6 2016 2.008762e+09 2.001931e+09 0.341229
7 2017 2.002164e+09 2.008762e+09 -0.328457
8 2018 2.002383e+09 2.002164e+09 0.010927
9 2019 2.002891e+09 2.002383e+09 0.025383
10 2020 2.008585e+09 2.002891e+09 0.284318
11 2021 2.000244e+09 2.008585e+09 -0.415281
12 2022 2.004500e+09 2.000244e+09 0.212756
13 2023 1.995672e+09 2.004500e+09 -0.440401
Fireducks
import fireducks.pandas as pd
import time
# Start the timer
start_time = time.time()
df_fire['year'] = pd.to_datetime(df_fire['order_date']).dt.year
yearly_sales = df_fire.groupby('year')['total'].sum().sort_index()
yoy_growth = yearly_sales.pct_change() * 100
result = pd.DataFrame({
'year': yearly_sales.index,
'total_sales': yearly_sales.values,
'prev_year_sales': yearly_sales.shift().values,
'yoy_growth': yoy_growth.values
})
print(result)
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time for product performance analysis calculation: {time.time() - start_time} seconds")
#
# Fireducks output
#
year total_sales prev_year_sales yoy_growth
0 2010 2.002066e+09 NaN NaN
1 2011 2.002441e+09 2.002066e+09 0.018739
2 2012 2.008966e+09 2.002441e+09 0.325848
3 2013 2.002901e+09 2.008966e+09 -0.301900
4 2014 2.000773e+09 2.002901e+09 -0.106225
5 2015 2.001931e+09 2.000773e+09 0.057855
6 2016 2.008762e+09 2.001931e+09 0.341229
7 2017 2.002164e+09 2.008762e+09 -0.328457
8 2018 2.002383e+09 2.002164e+09 0.010927
9 2019 2.002891e+09 2.002383e+09 0.025383
10 2020 2.008585e+09 2.002891e+09 0.284318
11 2021 2.000244e+09 2.008585e+09 -0.415281
12 2022 2.004500e+09 2.000244e+09 0.212756
13 2023 1.995672e+09 2.004500e+09 -0.440401
Fireducks: Time for product performance analysis calculation: 0.17495489120483398 seconds
DuckDB is quicker this time.
Test 7 – Add a new column to the data set and update its value
DuckDB
import duckdb
import time
start_time = time.time()
# Add new columns
con.execute("""
ALTER TABLE sales ADD COLUMN total_with_tax FLOAT
"""
)
# Perform the calculations and update the table
con.execute("""
UPDATE sales
SET total_with_tax = CASE
WHEN total <= 100 THEN total * 1.125 -- 12.5% tax
WHEN total > 100 AND total <= 200 THEN total * 1.15 -- 15% tax
WHEN total > 200 AND total <= 500 THEN total * 1.17 -- 17% tax
WHEN total > 500 THEN total * 1.20 -- 20% tax
END;
""")
print(f"Time to add new column: {time.time() - start_time} seconds")
# Verify the new columns
result = con.execute("""
SELECT
*
FROM sales
LIMIT 10;
""").fetchdf()
print(result)
#
# DuckDB output
#
DuckDB: Time to add new column: 2.4016575813293457 seconds
order_id order_date customer_id customer_name product_id product_names
0 0 2021-11-25 238 Customer_25600 211 Plastic Cups
1 1 2017-06-10 534 Customer_14188 209 Coffee Maker
2 2 2010-02-15 924 Customer_14013 207 Pen
3 3 2011-01-26 633 Customer_6120 211 Plastic Cups
4 4 2014-01-11 561 Customer_1352 205 Printer
5 5 2021-04-19 533 Customer_5342 208 Notebook
6 6 2012-03-14 684 Customer_21604 207 Pen
7 7 2017-07-01 744 Customer_30291 201 Smartphone
8 8 2013-02-13 678 Customer_32618 204 Monitor
9 9 2023-01-04 340 Customer_16898 207 Pen
categories quantity price total total_with_tax
0 Sundry 2 99.80 199.60 229.539993
1 Electronics 8 7.19 57.52 64.709999
2 Stationery 6 70.98 425.88 498.279602
3 Sundry 6 94.38 566.28 679.536011
4 Electronics 4 44.68 178.72 205.528000
5 Stationery 4 21.85 87.40 98.324997
6 Stationery 3 93.66 280.98 328.746613
7 Electronics 6 39.41 236.46 276.658203
8 Electronics 2 4.30 8.60 9.675000
9 Stationery 2 6.67 13.34 15.007500
Fireducks
import numpy as np
import time
import fireducks.pandas as pd
# Start total runtime timer
start_time = time.time()
# Define tax rate conditions and choices
conditions = [
(df_fire['total'] <= 100),
(df_fire['total'] > 100) & (df_fire['total'] <= 200),
(df_fire['total'] > 200) & (df_fire['total'] <= 500),
(df_fire['total'] > 500)
]
choices = [1.125, 1.15, 1.17, 1.20]
# Calculate total_with_tax using np.select for efficiency
df_fire['total_with_tax'] = df_fire['total'] * np.select(conditions, choices)
# Print total runtime
print(f"Fireducks: Time to add new column: {time.time() - start_time} seconds")
print(df_fire)
#
# Fireducks output
#
Fireducks: Time to add new column: 2.7112433910369873 seconds
order_id order_date customer_id customer_name product_id
0 0 2021-11-25 238 Customer_25600 211
1 1 2017-06-10 534 Customer_14188 209
2 2 2010-02-15 924 Customer_14013 207
3 3 2011-01-26 633 Customer_6120 211
4 4 2014-01-11 561 Customer_1352 205
... ... ... ... ... ...
99999995 99999995 2011-10-24 501 Customer_29289 202
99999996 99999996 2011-10-21 593 Customer_29352 200
99999997 99999997 2019-05-10 673 Customer_25709 202
99999998 99999998 2022-03-02 709 Customer_23966 208
99999999 99999999 2023-06-16 102 Customer_9650 203
product_names categories quantity price total month year
0 Plastic Cups Sundry 2 99.80 199.60 2021-11 2021
1 Coffee Maker Electronics 8 7.19 57.52 2017-06 2017
2 Pen Stationery 6 70.98 425.88 2010-02 2010
3 Plastic Cups Sundry 6 94.38 566.28 2011-01 2011
4 Printer Electronics 4 44.68 178.72 2014-01 2014
... ... ... ... ... ... ... ...
99999995 Desk Office 7 53.78 376.46 2011-10 2011
99999996 Laptop Electronics 3 67.42 202.26 2011-10 2011
99999997 Desk Office 5 9.12 45.60 2019-05 2019
99999998 Notebook Stationery 1 78.91 78.91 2022-03 2022
99999999 Chair Office 4 59.58 238.32 2023-06 2023
total_with_tax
0 229.54000
1 64.71000
2 498.27960
3 679.53600
4 205.52800
... ...
99999995 440.45820
99999996 236.64420
99999997 51.30000
99999998 88.77375
99999999 278.83440
[100000000 rows x 13 columns]
They have very similar run times yet again. A draw.
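One DuckDB-specific note on this test: row-level UPDATEs are relatively expensive in columnar engines, so an alternative worth trying is deriving the new column in a single CREATE TABLE pass. A hedged, untimed sketch:
# Alternative sketch: compute the taxed total while building a new table
con.execute("""
CREATE OR REPLACE TABLE sales_taxed AS
SELECT
    *,
    CASE
        WHEN total <= 100 THEN total * 1.125 -- 12.5% tax
        WHEN total <= 200 THEN total * 1.15  -- 15% tax
        WHEN total <= 500 THEN total * 1.17  -- 17% tax
        ELSE total * 1.20                    -- 20% tax
    END AS total_with_tax
FROM sales
""")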
Test 8 – Write out the updated data to a CSV file
DuckDB
start_time = time.time()
# Write the modified sales_data table to a CSV file
con.execute("""
COPY (SELECT * FROM sales) TO '/mnt/d/sales_data/final_sales_data_duckdb.csv' WITH (HEADER TRUE, DELIMITER ',')
""")
print(f"DuckDB: Time to write CSV to file: {time.time() - start_time} seconds")
DuckDB: Time to write CSV to file: 54.899176597595215 seconds
Fireducks
# fireducks write data back to CSV
#
import fireducks.pandas as pd
# Tidy up DF before writing out
cols_to_drop = ['year', 'month']
df_fire = df_fire.drop(columns=cols_to_drop)
df_fire['total_with_tax'] = df_fire['total_with_tax'].round(2)
df_fire['order_date'] = df_fire['order_date'].dt.date
# Start total runtime timer
start_time = time.time()
df_fire.to_csv('/mnt/d/sales_data/fireducks_sales.csv',quoting=0,index=False)
# Print total runtime
print(f"Fireducks: Time to write CSV to file: {time.time() - start_time} seconds")
Fireducks: Time to write CSV to file: 54.490307331085205 seconds
Too close to call again.
Test 9— Write out the updated data to a parquet file
DuckDB
# DuckDB write Parquet data
#
start_time = time.time()
# Write the modified sales_data table to a Parquet file
con.execute("COPY sales TO '/mnt/d/sales_data/final_sales_data_duckdb.parquet' (FORMAT 'parquet');")
print(f"DuckDB: Time to write parquet to file: {time.time() - start_time} seconds")
DuckDB: Time to write parquet to file: 30.011869192123413 seconds
Fireducks
import fireducks.pandas as pd
import time
# Start total runtime timer
start_time = time.time()
df_fire.to_parquet('/mnt/d/sales_data/fireducks_sales.parquet')
# Print total runtime
print(f"Fireducks: Time to write Parquet to file: {time.time() - start_time} seconds")
Fireducks: Time to write Parquet to file: 86.29632377624512 seconds
That’s the first major discrepancy between run times: Fireducks took nearly three times as long as DuckDB to write its data out to Parquet.
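Before reading too much into that, it is worth checking whether the two libraries default to the same Parquet compression codec, as codec choice can dominate write time. A hedged sketch that pins the codec explicitly on both sides:
# DuckDB: name the codec in the COPY options
con.execute("""
COPY sales TO '/mnt/d/sales_data/final_sales_data_duckdb.parquet'
(FORMAT 'parquet', COMPRESSION 'snappy')
""")
# Fireducks (Pandas API): name the codec on to_parquet
df_fire.to_parquet('/mnt/d/sales_data/fireducks_sales.parquet', compression='snappy')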
Summary
So, what are we to make of all this? Simply put, there is very little between these two libraries. Both are super fast and capable of processing large data sets. Once your data is in memory, whether in a DuckDB table or a Fireducks dataframe, both libraries process it in double-quick time.
The choice of which one to use depends on your existing infrastructure and skill set.
If you’re a database person, DuckDB is the obvious library to use, as your SQL skills would be instantly transferable.
Alternatively, if you’re already embedded in the Pandas world, Fireducks would be a great choice for you.
_OK, that’s all for me just now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories, follow me or subscribe to get notified when I post new content._
If you like this content, you might find these articles interesting, too.