
Battle of the Ducks

DuckDB vs Fireducks: the ultimate throwdown

Image by AI (Dalle-3)

As some of you may know, I’m a big fan of the DuckDB Python library and have written many articles on it. I was also one of the first to write about an even newer Python library called Fireducks, and helped bring it to people’s attention.

If you’ve never heard of these useful libraries, check out the links below for an introduction to them.

DuckDB

New Pandas rival, FireDucks, brings the smoke!

Both libraries are taking on an increasing share of data science workloads, where it could be argued that data manipulation and general wrangling are at least as important as the analysis and insight that the machine learning side of things brings.

The core foundations of the two tools are very different: DuckDB is a modern, embedded analytics database designed for efficient processing and querying of gigabytes of data from various sources. Fireducks is designed to be a much faster, drop-in replacement for Pandas.
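To make that difference concrete, here is a minimal sketch (not one of the timed tests) of the two programming models: with DuckDB you express the work as SQL, while Fireducks keeps the familiar Pandas API. The tiny.csv file and its category/total columns are just placeholders.

# Hypothetical example: the same aggregation in both programming models
import duckdb
import fireducks.pandas as pd

# DuckDB: express the work as SQL (it can read the CSV directly)
sales_by_cat = duckdb.sql(
    "SELECT category, SUM(total) AS sales FROM 'tiny.csv' GROUP BY category"
).df()

# Fireducks: the same work with the Pandas API
df = pd.read_csv("tiny.csv")
sales_by_cat_fd = df.groupby("category")["total"].sum()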

Their key commonality, however, is that they are both highly performant for general mid-sized data processing tasks. If that’s your use case, which one should you choose? That’s what we’ll find out today.

Here are the tests I’ll perform; each one is timed with simple time.time() calls (see the sketch after this list).

  • Read a large CSV file into memory, i.e. into a DuckDB table and a Fireducks dataframe
  • Perform some typical data processing tasks against both sets of in-memory data
  • Create a new column in the in-memory data sets based on existing table/dataframe column data
  • Write out the updated in-memory data sets as CSV and Parquet
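As a purely optional aside, the repeated timing pattern could be wrapped in a small context manager. This hypothetical timer() helper is a sketch of the idea only; the test code below sticks to explicit time.time() calls so each listing stands alone.

import time
from contextlib import contextmanager

@contextmanager
def timer(label: str):
    # Print the elapsed wall-clock time for the enclosed block
    start = time.time()
    try:
        yield
    finally:
        print(f"{label}: {time.time() - start} seconds")

# Example usage:
# with timer("DuckDB: read CSV"):
#     ...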

Input data set

I created a CSV file with fake sales data containing 100 million records.

The schema of the input data is as follows:

  • order_id (int)
  • order_date (date)
  • customer_id (int)
  • customer_name (str)
  • product_id (int)
  • product_names (str)
  • categories (str)
  • quantity (int)
  • price (float)
  • total (float)

Here is a Python program you can use to create the CSV file. On my system, this resulted in a file of approximately 7.5GB.

# generate the 100m CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )

    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )

    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100

    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days

    # Create random dates as np.array and convert to string format
    order_dates = np.array([(start_date + timedelta(days=np.random.randint(0, date_range))).strftime('%Y-%m-%d') for _ in range(nrows)])

    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }

    # Create Polars DataFrame and write to CSV with explicit delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=',', include_header=True)  # ensure a comma is used as the delimiter

# Generate data with random order_date and save to CSV
generate(100_000_000, "/mnt/d/sales_data/sales_data_100m.csv")

Installing WSL2 Ubuntu

Fireducks only runs under Linux. As I usually run Windows, I’ll be using WSL2 Ubuntu for my Linux environment, but the same code should work on any Linux/Unix setup. I have a full guide on installing WSL2 here.

Setting up a dev environment

OK, we should set up a separate development environment before starting our coding examples. That way, what we do won’t interfere with the versions of libraries, packages, etc. we might have on the go for other projects.

I use Miniconda for this, but you can use whatever method suits you best.

If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. Get it using this link:

Miniconda – Anaconda documentation

Once the environment is created, switch to it using the activate command, and then install Jupyter and any required Python libraries.

# Create our test environment
(base) $ conda create -n duck_battle python=3.11 -y
# Now activate it
(base) $ conda activate duck_battle
# Install Python libraries, etc...
(duck_battle) $ pip install jupyter fireducks duckdb

Test 1 – Reading a large CSV file and displaying the last 10 records

DuckDB

import duckdb

print(duckdb.__version__)

'1.1.3'
# DuckDB read CSV file 
#
import duckdb
import time

# Start the timer
start_time = time.time()

# Create a connection to an in-memory DuckDB database
con = duckdb.connect(':memory:')

# Create a table from the CSV file
con.execute(f"CREATE TABLE sales AS SELECT * FROM read_csv('/mnt/d/sales_data/sales_data_100m.csv',header=true)")

# Fetch the last 10 rows
query = "SELECT * FROM sales ORDER BY rowid DESC LIMIT 10"
df = con.execute(query).df()

# Display the last 10 rows
print("nLast 10 rows of the file:")
print(df)

# End the timer and calculate the total elapsed time
total_elapsed_time = time.time() - start_time

print(f"DuckDB: Time taken to read the CSV file and display the last 10 records: {total_elapsed_time} seconds")

#
# DuckDB output
#

Last 10 rows of the file:
   order_id order_date  customer_id   customer_name  product_id product_names  
0  99999999 2023-06-16          102   Customer_9650         203         Chair   
1  99999998 2022-03-02          709  Customer_23966         208      Notebook   
2  99999997 2019-05-10          673  Customer_25709         202          Desk   
3  99999996 2011-10-21          593  Customer_29352         200        Laptop   
4  99999995 2011-10-24          501  Customer_29289         202          Desk   
5  99999994 2023-09-27          119  Customer_15532         209  Coffee Maker   
6  99999993 2015-01-15          294  Customer_27081         200        Laptop   
7  99999992 2016-04-07          379   Customer_1353         207           Pen   
8  99999991 2010-09-19          253  Customer_29439         204       Monitor   
9  99999990 2016-05-19          174  Customer_11294         210       Cabinet   

    categories  quantity  price   total  
0       Office         4  59.58  238.32  
1   Stationery         1  78.91   78.91  
2       Office         5   9.12   45.60  
3  Electronics         3  67.42  202.26  
4       Office         7  53.78  376.46  
5  Electronics         2  55.10  110.20  
6  Electronics         9  86.01  774.09  
7   Stationery         5  21.56  107.80  
8  Electronics         4   5.17   20.68  
9       Office         9  65.10  585.90  

DuckDB: Time taken to read the CSV file and display the last 10 records: 59.23184013366699 seconds

Fireducks

import fireducks
import fireducks.pandas as pd

print(fireducks.__version__)
print(pd.__version__)

1.1.6
2.2.3
# Fireducks read CSV
#
import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

# Path to the CSV file
file_path = "/mnt/d/sales_data/sales_data_100m.csv"

# Read the CSV file into a DataFrame
df_fire = pd.read_csv(file_path)

# Display the last 10 rows of the DataFrame
print(df_fire.tail(10))

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to read the CSV file and display the last 10 records: {elapsed_time} seconds")         

#
# Fireducks output
#

          order_id  order_date  customer_id   customer_name  product_id  
99999990  99999990  2016-05-19          174  Customer_11294         210   
99999991  99999991  2010-09-19          253  Customer_29439         204   
99999992  99999992  2016-04-07          379   Customer_1353         207   
99999993  99999993  2015-01-15          294  Customer_27081         200   
99999994  99999994  2023-09-27          119  Customer_15532         209   
99999995  99999995  2011-10-24          501  Customer_29289         202   
99999996  99999996  2011-10-21          593  Customer_29352         200   
99999997  99999997  2019-05-10          673  Customer_25709         202   
99999998  99999998  2022-03-02          709  Customer_23966         208   
99999999  99999999  2023-06-16          102   Customer_9650         203   

         product_names   categories  quantity  price   total  
99999990       Cabinet       Office         9  65.10  585.90  
99999991       Monitor  Electronics         4   5.17   20.68  
99999992           Pen   Stationery         5  21.56  107.80  
99999993        Laptop  Electronics         9  86.01  774.09  
99999994  Coffee Maker  Electronics         2  55.10  110.20  
99999995          Desk       Office         7  53.78  376.46  
99999996        Laptop  Electronics         3  67.42  202.26  
99999997          Desk       Office         5   9.12   45.60  
99999998      Notebook   Stationery         1  78.91   78.91  
99999999         Chair       Office         4  59.58  238.32 

Fireducks: Time taken to read the CSV file and display the last 10 records: 65.69259881973267 seconds

There is not much in it; DuckDB edges it by about 6 seconds.
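One aside before moving on: if you only need a one-off answer, DuckDB can also query the CSV file in place, without materialising a table first. A minimal sketch (not part of the timed tests):

# Query the CSV directly; no CREATE TABLE step required
import duckdb

row_count = duckdb.sql("""
    SELECT COUNT(*) AS rows
    FROM read_csv('/mnt/d/sales_data/sales_data_100m.csv', header=true)
""").df()
print(row_count)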

Test 2 – Calculate total sales by category

DuckDB

# duckdb process data
#
import duckdb
import time

# Total sales by category
query_sql = """
SELECT 
    categories, 
    SUM(total) AS total_sales
FROM sales
GROUP BY categories
ORDER BY total_sales DESC
"""

# Start the timer and run the query
start_time = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for sales by category calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time for sales by category calculation: 0.1401681900024414 seconds

  categories  total_sales
0 Electronics 1.168493e+10
1 Stationery  7.014109e+09
2 Office      7.006807e+09
3 Sundry      2.338428e+09

Fireducks

import fireducks.pandas as pd

# Start the timer
start_time = time.time()

total_sales_by_category = df_fire.groupby('categories')['total'].sum().sort_values(ascending=False)
print(total_sales_by_category)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate sales by category: {elapsed_time} seconds")

#
# Fireducks output
#

categories
Electronics    1.168493e+10
Stationery     7.014109e+09
Office         7.006807e+09
Sundry         2.338428e+09
Name: total, dtype: float64

Fireducks: Time taken to calculate sales by category:  0.13571524620056152 seconds

There is not much in it there, either. Fireducks shades it.

Test 3 – Top 5 customer spend

DuckDB

# duckdb process data
#
import duckdb
import time

# Top 5 customers by total spend
query_sql = """
SELECT 
    customer_id, 
    customer_name, 
    SUM(total) AS total_purchase
FROM sales
GROUP BY customer_id, customer_name
ORDER BY total_purchase DESC
LIMIT 5
"""

# Start the timer and run the query
start_time = time.time()
results = con.execute(query_sql).df()

print(f"DuckdDB: Time to calculate top 5 customers: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time to calculate top 5 customers: 1.4588654041290283 seconds

  customer_id customer_name  total_purchase
0 681         Customer_20387 6892.96
1 740         Customer_30499 6613.11
2 389         Customer_22686 6597.35
3 316         Customer_185   6565.38
4 529         Customer_1609  6494.35

Fireducks

import fireducks.pandas as pd

# Start the timer
start_time = time.time()

top_5_customers = df_fire.groupby(['customer_id', 'customer_name'])['total'].sum().sort_values(ascending=False).head(5)
print(top_5_customers)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate top 5 customers: {elapsed_time} seconds")

#
# Fireducks output
#

customer_id  customer_name 
681          Customer_20387    6892.96
740          Customer_30499    6613.11
389          Customer_22686    6597.35
316          Customer_1859     6565.38
529          Customer_1609     6494.35
Name: total, dtype: float64
Fireducks: Time taken to calculate top 5 customers: 2.823930263519287 seconds

DuckDB wins that one, being almost twice as fast as Fireducks.

Test 4 – Monthly sales figures

DuckDB

import duckdb
import time

# Monthly sales figures
query_sql = """
SELECT 
    DATE_TRUNC('month', order_date) AS month,
    SUM(total) AS monthly_sales
FROM sales
GROUP BY month
ORDER BY month
"""

# Start the timer and run the query
start_time = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for seasonal trend calculation: {time.time() - start_time} seconds")

results

# 
# DuckDB output
#

DuckDB: Time for seasonal trend calculation: 0.16109275817871094 seconds

  month        monthly_sales
0 2010-01-01   1.699500e+08
1 2010-02-01   1.535730e+08
2 2010-03-01   1.702968e+08
3 2010-04-01   1.646421e+08
4 2010-05-01   1.704506e+08
... ... ...
163 2023-08-01 1.699263e+08
164 2023-09-01 1.646018e+08
165 2023-10-01 1.692184e+08
166 2023-11-01 1.644883e+08
167 2023-12-01 1.643962e+08

168 rows × 2 columns

Fireducks

import fireducks.pandas as pd
import time

def seasonal_trend():
    # Ensure 'order_date' is datetime
    df_fire['order_date'] = pd.to_datetime(df_fire['order_date'])

    # Extract 'month' as string
    df_fire['month'] = df_fire['order_date'].dt.strftime('%Y-%m')

    # Group by 'month' and sum 'total'
    results = (
        df_fire.groupby('month')['total']
        .sum()
        .reset_index()
        .sort_values('month')
    )
    print(results)

start_time = time.time()
seasonal_trend()

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time for seasonal trend calculation: {elapsed_time} seconds")

#
# Fireducks Output
#

       month         total
0    2010-01  1.699500e+08
1    2010-02  1.535730e+08
2    2010-03  1.702968e+08
3    2010-04  1.646421e+08
4    2010-05  1.704506e+08
..       ...           ...
163  2023-08  1.699263e+08
164  2023-09  1.646018e+08
165  2023-10  1.692184e+08
166  2023-11  1.644883e+08
167  2023-12  1.643962e+08

[168 rows x 2 columns]
Fireducks: Time for seasonal trend calculation: 3.109074354171753 seconds

DuckDB was significantly quicker in this example.

Test 5 – Average order by product

DuckDB

import duckdb
import time

# Average order value by product
query_sql = """
SELECT 
    product_id,
    product_names,
    AVG(total) AS avg_order_value
FROM sales
GROUP BY product_id, product_names
ORDER BY avg_order_value DESC
"""

# Start the timer and run the query
start_time = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for average order by product calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time for average order by product calculation: 0.13720130920410156 seconds

  product_id product_names avg_order_value
0 206        Paper         280.529144
1 208        Notebook      280.497268
2 201        Smartphone    280.494779
3 207        Pen           280.491508
4 205        Printer       280.470150
5 200        Laptop        280.456913
6 209        Coffee Maker  280.445365
7 211        Plastic Cups  280.440161
8 210        Cabinet       280.426960
9 202        Desk          280.367135
10 203       Chair         280.364045
11 204       Monitor       280.329706

Fireducks

import fireducks.pandas as pd

# Start the timer
start_time = time.time()

avg_order_value = df_fire.groupby(['product_id', 'product_names'])['total'].mean().sort_values(ascending=False)
print(avg_order_value)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time

print(f"Fireducks: Time for average order calculation: {elapsed_time} seconds")

#
# Fireducks output
#

product_id  product_names
206         Paper            280.529144
208         Notebook         280.497268
201         Smartphone       280.494779
207         Pen              280.491508
205         Printer          280.470150
200         Laptop           280.456913
209         Coffee Maker     280.445365
211         Plastic Cups     280.440161
210         Cabinet          280.426960
202         Desk             280.367135
203         Chair            280.364045
204         Monitor          280.329706
Name: total, dtype: float64
Fireducks: Time for average order calculation: 0.06766319274902344 seconds

Fireducks gets one back there and was twice as fast as DuckDB.

Test 6 – Product performance analysis

DuckDB

import duckdb
import time

# Year-over-year sales growth
query_sql = """
WITH yearly_sales AS (
    SELECT 
        EXTRACT(YEAR FROM order_date) AS year,
        SUM(total) AS total_sales
    FROM sales
    GROUP BY year
)
SELECT 
    year,
    total_sales,
    LAG(total_sales) OVER (ORDER BY year) AS prev_year_sales,
    (total_sales - LAG(total_sales) OVER (ORDER BY year)) / LAG(total_sales) OVER (ORDER BY year) * 100 AS yoy_growth
FROM yearly_sales
ORDER BY year
"""

# Start the timer and run the query
start_time = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for product performance analysis calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time for product performance analysis calculation: 0.03958845138549805 seconds

   year total_sales prev_year_sales yoy_growth
0  2010 2.002066e+09 NaN            NaN
1  2011 2.002441e+09 2.002066e+09   0.018739
2  2012 2.008966e+09 2.002441e+09   0.325848
3  2013 2.002901e+09 2.008966e+09  -0.301900
4  2014 2.000773e+09 2.002901e+09  -0.106225
5  2015 2.001931e+09 2.000773e+09   0.057855
6  2016 2.008762e+09 2.001931e+09   0.341229
7  2017 2.002164e+09 2.008762e+09  -0.328457
8  2018 2.002383e+09 2.002164e+09   0.010927
9  2019 2.002891e+09 2.002383e+09   0.025383
10 2020 2.008585e+09 2.002891e+09   0.284318
11 2021 2.000244e+09 2.008585e+09  -0.415281
12 2022 2.004500e+09 2.000244e+09   0.212756
13 2023 1.995672e+09 2.004500e+09  -0.440401

Fireducks

import fireducks.pandas as pd

# Start the timer
start_time = time.time()

df_fire['year'] = pd.to_datetime(df_fire['order_date']).dt.year
yearly_sales = df_fire.groupby('year')['total'].sum().sort_index()
yoy_growth = yearly_sales.pct_change() * 100

result = pd.DataFrame({
    'year': yearly_sales.index,
    'total_sales': yearly_sales.values,
    'prev_year_sales': yearly_sales.shift().values,
    'yoy_growth': yoy_growth.values
})

print(result)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time for product performance analysis calculation: {elapsed_time} seconds")

#
# Fireducks output
#

    year   total_sales  prev_year_sales  yoy_growth
0   2010  2.002066e+09              NaN         NaN
1   2011  2.002441e+09     2.002066e+09    0.018739
2   2012  2.008966e+09     2.002441e+09    0.325848
3   2013  2.002901e+09     2.008966e+09   -0.301900
4   2014  2.000773e+09     2.002901e+09   -0.106225
5   2015  2.001931e+09     2.000773e+09    0.057855
6   2016  2.008762e+09     2.001931e+09    0.341229
7   2017  2.002164e+09     2.008762e+09   -0.328457
8   2018  2.002383e+09     2.002164e+09    0.010927
9   2019  2.002891e+09     2.002383e+09    0.025383
10  2020  2.008585e+09     2.002891e+09    0.284318
11  2021  2.000244e+09     2.008585e+09   -0.415281
12  2022  2.004500e+09     2.000244e+09    0.212756
13  2023  1.995672e+09     2.004500e+09   -0.440401

Fireducks: Time for product performance analysis calculation: 0.17495489120483398 seconds

DuckDB is quicker this time.

Test 7 – Add a new column to the data set and update its value

DuckDB

import duckdb
import time

start_time = time.time()

# Add new columns
con.execute("""
ALTER TABLE sales ADD COLUMN total_with_tax FLOAT
"""
)

# Perform the calculations and update the table
con.execute("""
UPDATE sales
SET total_with_tax = CASE 
    WHEN total <= 100 THEN total * 1.125  -- 12.5% tax
    WHEN total > 100 AND total <= 200 THEN total * 1.15   -- 15% tax
    WHEN total > 200 AND total <= 500 THEN total * 1.17   -- 17% tax
    WHEN total > 500 THEN total * 1.20   -- 20% tax
END;
""")

print(f"Time to add new column: {time.time() - start_time} seconds")

# Verify the new columns
result = con.execute("""
    SELECT 
        *
    FROM sales
    LIMIT 10;
""").fetchdf()

print(result)

#
# DuckDB output
#

DuckDB: Time to add new column: 2.4016575813293457 seconds

   order_id order_date  customer_id   customer_name  product_id product_names  
0         0 2021-11-25          238  Customer_25600         211  Plastic Cups   
1         1 2017-06-10          534  Customer_14188         209  Coffee Maker   
2         2 2010-02-15          924  Customer_14013         207           Pen   
3         3 2011-01-26          633   Customer_6120         211  Plastic Cups   
4         4 2014-01-11          561   Customer_1352         205       Printer   
5         5 2021-04-19          533   Customer_5342         208      Notebook   
6         6 2012-03-14          684  Customer_21604         207           Pen   
7         7 2017-07-01          744  Customer_30291         201    Smartphone   
8         8 2013-02-13          678  Customer_32618         204       Monitor   
9         9 2023-01-04          340  Customer_16898         207           Pen   

    categories  quantity  price   total  total_with_tax  
0       Sundry         2  99.80  199.60      229.539993  
1  Electronics         8   7.19   57.52       64.709999  
2   Stationery         6  70.98  425.88      498.279602  
3       Sundry         6  94.38  566.28      679.536011  
4  Electronics         4  44.68  178.72      205.528000  
5   Stationery         4  21.85   87.40       98.324997  
6   Stationery         3  93.66  280.98      328.746613  
7  Electronics         6  39.41  236.46      276.658203  
8  Electronics         2   4.30    8.60        9.675000  
9   Stationery         2   6.67   13.34       15.007500  

Fireducks

import numpy as np
import time
import fireducks.pandas as pd

# Start total runtime timer
start_time = time.time()
# Define tax rate conditions and choices
conditions = [
    (df_fire['total'] <= 100),
    (df_fire['total'] > 100) & (df_fire['total'] <= 200),
    (df_fire['total'] > 200) & (df_fire['total'] <= 500),
    (df_fire['total'] > 500)
]

choices = [1.125, 1.15, 1.17, 1.20]

# Calculate total_with_tax using np.select for efficiency
df_fire['total_with_tax'] = df_fire['total'] * np.select(conditions, choices)

# Print total runtime
print(f"Fireducks: Time to add new column: {time.time() - start_time} seconds")
print(df_fire)

#
# Fireducks output
#

Fireducks: Time to add new column: 2.7112433910369873 seconds

          order_id order_date  customer_id   customer_name  product_id  
0                0 2021-11-25          238  Customer_25600         211   
1                1 2017-06-10          534  Customer_14188         209   
2                2 2010-02-15          924  Customer_14013         207   
3                3 2011-01-26          633   Customer_6120         211   
4                4 2014-01-11          561   Customer_1352         205   
...            ...        ...          ...             ...         ...   
99999995  99999995 2011-10-24          501  Customer_29289         202   
99999996  99999996 2011-10-21          593  Customer_29352         200   
99999997  99999997 2019-05-10          673  Customer_25709         202   
99999998  99999998 2022-03-02          709  Customer_23966         208   
99999999  99999999 2023-06-16          102   Customer_9650         203   

         product_names   categories  quantity  price   total    month  year  
0         Plastic Cups       Sundry         2  99.80  199.60  2021-11  2021   
1         Coffee Maker  Electronics         8   7.19   57.52  2017-06  2017   
2                  Pen   Stationery         6  70.98  425.88  2010-02  2010   
3         Plastic Cups       Sundry         6  94.38  566.28  2011-01  2011   
4              Printer  Electronics         4  44.68  178.72  2014-01  2014   
...                ...          ...       ...    ...     ...      ...   ...   
99999995          Desk       Office         7  53.78  376.46  2011-10  2011   
99999996        Laptop  Electronics         3  67.42  202.26  2011-10  2011   
99999997          Desk       Office         5   9.12   45.60  2019-05  2019   
99999998      Notebook   Stationery         1  78.91   78.91  2022-03  2022   
99999999         Chair       Office         4  59.58  238.32  2023-06  2023   

          total_with_tax  
0              229.54000  
1               64.71000  
2              498.27960  
3              679.53600  
4              205.52800  
...                  ...  
99999995       440.45820  
99999996       236.64420  
99999997        51.30000  
99999998        88.77375  
99999999       278.83440  

[100000000 rows x 13 columns]

They have very similar run times yet again. A draw.

Test 8 – Write out the updated data to a CSV file

DuckDB

start_time = time.time()

# Write the modified sales table to a CSV file
con.execute("""
    COPY (SELECT * FROM sales) TO '/mnt/d/sales_data/final_sales_data_duckdb.csv' WITH (HEADER TRUE, DELIMITER ',')
""")

print(f"DuckDB: Time to write CSV to file: {time.time() - start_time} seconds")

DuckDB: Time to write CSV to file: 54.899176597595215 seconds

Fireducks

# fireducks write data back to CSV
#
import fireducks.pandas as pd

# Tidy up DF before writing out
cols_to_drop = ['year', 'month']
df_fire = df_fire.drop(columns=cols_to_drop)
df_fire['total_with_tax'] = df_fire['total_with_tax'].round(2) 
df_fire['order_date'] = df_fire['order_date'].dt.date

# Start total runtime timer
start_time = time.time()

df_fire.to_csv('/mnt/d/sales_data/fireducks_sales.csv', quoting=0, index=False)  # quoting=0 is csv.QUOTE_MINIMAL

# Print total runtime
print(f"Fireducks: Time to write CSV  to file: {time.time() - start_time} seconds")

Fireducks: Time to write CSV to file: 54.490307331085205 seconds

Too close to call again.

Test 9 – Write out the updated data to a Parquet file

DuckDB

# DuckDB write Parquet data
# 

start_time = time.time()

# Write the modified sales table to a Parquet file
con.execute("COPY sales TO '/mnt/d/sales_data/final_sales_data_duckdb.parquet' (FORMAT 'parquet');")

print(f"DuckDB: Time to write parquet to file: {time.time() - start_time} seconds")

DuckDB: Time to write parquet to file: 30.011869192123413 seconds

Fireducks

import fireducks.pandas as pd
import time

# Start total runtime timer
start_time = time.time()

df_fire.to_parquet('/mnt/d/sales_data/fireducks_sales.parquet')

# Print total runtime
print(f"Fireducks: Time to write Parquet to file: {time.time() - start_time} seconds")

Fireducks: Time to write Parquet to file: 86.29632377624512 seconds

That’s the first major discrepancy between run times. Fireducks took almost a minute longer to write out its data to Parquet than did DuckDB.

Summary

So, what are we to make of all this? Simply put, there is not much between these two libraries. Both are super fast and capable of processing large data sets. Once your data is in memory, whether in a DuckDB table or a Fireducks dataframe, both libraries are equally capable of processing it in double-quick time.

The choice of which one to use depends on your existing infrastructure and skill set.

If you’re a database person, DuckDB is the obvious library to use, as your SQL skills would be instantly transferable.

Alternatively, if you’re already embedded in the Pandas world, Fireducks would be a great choice for you.

OK, that’s all for me just now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories, follow me, or subscribe to get notified when I post new content.

If you like this content, you might find these articles interesting, too.

Building a Data Dashboard

Speed up Pandas code with Numpy

