
Three reasons why developers should use DuckDB

Developers often have to analyse data, e.g. when assessing the impact of an outage. DuckDB is a single tool for querying multiple data sources.


How software developers can use DuckDB for data analysis

Software developers have to wear many hats: from writing code and designing systems to analysing data dumps during an incident. Most of our tools are optimised for each task – for writing code we have powerful IDEs; for designing systems, feature-rich diagramming tools.

For data analysis, do software developers have the best tool for the job? In this article, I list three key reasons why DuckDB, an open-source analytical database, is a must-have tool for software developers.

Reason #1: Uses universally understood SQL

Imagine you work as a software developer for a food-delivery company. You receive an email reporting a sudden increase in payment-related customer complaints. The email includes a CSV file like this, with some orders categorised by the nature of the complaint. As a developer under pressure, you may be inclined to quickly look up how to analyse a CSV file on StackOverflow, which suggests using awk.

awk -F',' \
  'NR > 1 {count[$6]++}
   END {for (value in count) print value, count[value]}' \
  datagenerator/adjusted_transactions.csv | sort

CUSTOMER_SUPPORT_REFUND 8494
INSUFFICIENT_FUNDS 1232
MANUAL_ADJUSTMENT 162
REVERSED_PAYMENT 62815

It is natural to ask a follow-up question: how often are these errors reported per order? Answering such iterative questions with tools like awk can be challenging because of their unfamiliar syntax. Moreover, had the data been in another format like JSON, we would have to reach for a different tool like jq, with its completely different syntax and usage pattern.
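For illustration, here is what the equivalent count could look like with jq, assuming a hypothetical transactions.json holding an array of objects with a reason field – a completely different incantation for the same question:

# count complaints by reason from a JSON array (hypothetical file)
jq -r '.[].reason' transactions.json | sort | uniq -c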

DuckDB solves the problem of needing specific tooling for specific data formats by providing a unified SQL interface to a wide array of file types. Developers use SQL on a very regular basis, and it is the language used to query SQLite, the most widely deployed database in the world. Owing to the ubiquity of SQL, even non-relational data systems like MongoDB, Spark, Elasticsearch and AWS Athena have added support for accessing data using SQL.
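To illustrate, if the same complaints arrived as JSON instead of CSV (again assuming a hypothetical transactions.json), the query stays in the same familiar SQL – only the file name changes:

duckdb -c "select reason, count(*) as num_reports
           from 'transactions.json'
           group by reason order by reason;"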

Going back to the original CSV file, using DuckDB and SQL we can quite easily find how often an error is reported per order:

duckdb -c "
  with per_order_counts AS (
       select
         order_id, reason,
         count(transaction_id) as num_reports
       from 'datagenerator/adjusted_transactions.csv'
       group by 1,2
  )
  select reason, avg(num_reports) AS avg_per_order_count
  from per_order_counts group by 1 order by reason;"
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         reason          β”‚ avg_per_order_count β”‚
β”‚         varchar         β”‚       double        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ CUSTOMER_SUPPORT_REFUND β”‚  10.333333333333334 β”‚
β”‚ INSUFFICIENT_FUNDS      β”‚   2.871794871794872 β”‚
β”‚ MANUAL_ADJUSTMENT       β”‚                 1.2 β”‚
β”‚ REVERSED_PAYMENT        β”‚   50.57568438003221 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Reason #2: Supports multiple databases and file types

Assume our fictional food-delivery app is built using microservices. Say there is a users microservice which stores user information in PostgreSQL, and an orders microservice which stores order information in MySQL.

It is very hard to answer cross-microservice questions like the following: are VIP users affected more than non-VIP users?

Typical setups solve this with data pipelines that aggregate data from all microservices into one data warehouse, which is expensive and hard to keep updated in real time.

Using DuckDB, we can attach a MySQL database and a PostgreSQL database, join data across them, and filter against a CSV file. The database setup and code are available in this repository:

-- requires the postgres and mysql extensions:
--   INSTALL postgres; INSTALL mysql;
ATTACH 'host=localhost port=5432 dbname=flock user=swan password=mallard'
  AS pg_db (TYPE POSTGRES, READ_ONLY);

ATTACH 'host=localhost port=3306 database=flock user=swan password=mallard'
  AS mysql_db (TYPE MYSQL, READ_ONLY);

select u.tier,
       count(distinct o.id) as order_count
from pg_db.users u join mysql_db.orders o
     on u.id = o.created_by
where o.id in (
     select order_id
     from 'datagenerator/adjusted_transactions.csv'
)
group by 1;

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  tier   β”‚ order_count β”‚
β”‚ varchar β”‚    int64    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ plus    β”‚         276 β”‚
β”‚ normal  β”‚         696 β”‚
β”‚ club    β”‚         150 β”‚
β”‚ vip     β”‚         148 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

In the above snippet we queried PostgreSQL, MySQL and a CSV file, but DuckDB supports many other data sources like Microsoft Excel, JSON and S3 files – all through the same SQL interface.
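For instance, the following sketch (with hypothetical paths; querying S3 requires the httpfs extension and credentials) shows the same SQL shape against a local JSON file and Parquet files on S3:

-- local JSON file (hypothetical path)
select count(*) from 'datagenerator/adjusted_transactions.json';
-- Parquet files on S3 (hypothetical bucket; needs the httpfs extension)
select count(*) from 's3://my-bucket/transactions/*.parquet';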

Reason #3: Portability and Extensibility

DuckDB runs in the command-line shell as a standalone process without any additional dependencies (like a server process). This portability makes DuckDB comparable to other Unix tools like sed, jq, sort and awk.
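Like those tools, DuckDB composes with Unix pipes. A minimal sketch, assuming the same transactions CSV, that feeds data in over standard input:

# pipe a CSV into DuckDB on a Unix-like system
cat datagenerator/adjusted_transactions.csv |
  duckdb -c "select count(*) from read_csv_auto('/dev/stdin');"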

DuckDB can also be imported as a library in programs written in languages like Python and JavaScript. In fact, DuckDB can even run in the browser – in this link, a SQL query fetches Rich Hickey’s repositories from GitHub and groups them by language, entirely client-side:

Screenshot showing DuckDB running in the browser. Image by the author.

For features not built into DuckDB, community extensions add further functionality – cryptographic hash functions, for example, are available as a community extension.
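As a sketch, installing and using the crypto community extension could look like this (the function name follows that extension’s docs; treat the exact signature as an assumption):

-- install and load the crypto community extension
INSTALL crypto FROM community;
LOAD crypto;
-- hash a string; crypto_hash(algorithm, value) per the extension's docs
select crypto_hash('md5', 'duck');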

Conclusion

Data analysis is an iterative process of asking questions about the data to arrive at an explanation of why something is happening. Quoting Carl Jung: "to ask the right question is already half the solution to a problem".

With traditional command-line tools, between the data and the question, there is the additional step of figuring out how to answer that question. This interrupts the iterative questioning process.

According to the Unix philosophy, simple tools when "combined with other programs, become general and useful tools" (source). Because each of these tools has its own usage patterns, composing them breaks the iterative questioning method by introducing an additional step of figuring out how to answer a question. Image by the author.

DuckDB unifies the proliferation of tools: (1) it runs everywhere, (2) it can query multiple data sources, and (3) it speaks a declarative language that is widely understood. With DuckDB, the feedback loop for iterative analysis is much shorter, making it the one true tool that all developers should keep in their toolbox for analysing data.

DuckDB unifies the proliferation of data tools and provides a common SQL interface to different data sources, facilitating the iterative method of data analysis. Image by the author.
