Introduction
Apache Spark has been the de facto open source data processing engine for fifteen years. It was invented to solve a major problem that traditional data warehousing was not built to handle: processing massive amounts of data at horizontal scale (Zach used Spark to process 2,000 TBs per day at Netflix), whether in structured or semi-structured form.
When social media and IoT exploded, companies saw a major opportunity to mine that data for a competitive edge. The value they got from Spark strengthened a data engineering ideology of "just use Spark, always."
Fast forward to 2025:
Processing power on laptops has increased dramatically over the last twenty years. A single laptop can now accomplish what we needed a multi-node Spark cluster for ten years ago.
This shone a light on the need for a data processing tool that performs fast on a single machine, but with the same ease and flexibility Spark is known for.
We now know that Spark isn’t always the best choice.
DuckDB has entered the chat
DuckDB is an in-process analytical database that lets you run SQL against large datasets with ease and minimal tooling. It has been growing in popularity every year, and here are some core reasons why:
very simple install - fully packaged, no dependencies, no JVM; it literally installs in a couple of seconds via pip install duckdb and takes up only a few MBs
lightweight - runs in-process (in-memory by default) and does not require a server
rich SQL dialect and extensions - offers a lot of syntactic sugar
provides numerous client APIs - the most popular are the CLI and Python (a quick Python sketch follows this list)
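To give a feel for how little tooling is involved, here is a minimal sketch of the Python API querying a parquet file (the file name is just a placeholder):

```python
import duckdb  # pip install duckdb

# Query a parquet file directly with SQL -- no server, no cluster, no setup.
result = duckdb.sql("""
    SELECT COUNT(*) AS row_count
    FROM read_parquet('my_dataset.parquet')
""").fetchall()

print(result)
```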
On personal projects (here's one with Python, DuckDB, and Iceberg), my default choice is DuckDB for data pipelines, and I will only go to Spark when my datasets are >20GB in size. This is where I start to see DuckDB struggling to keep up. For context, my laptop has 16GB of RAM.
So How Fast Is DuckDB Exactly?
Instead of guessing whether Spark or DuckDB is faster, we built test datasets with an increasing number of rows and then timed query benchmarks on both DuckDB and Spark to illustrate just how fast DuckDB is.
Disclaimer - as with any benchmark, it's always good to run the tests yourself. This benchmark is for demonstration purposes; real-world datasets and workloads vary, so it's best practice to try multiple data processing engines and see which one fits your use case.
Generating The Test Data
Generating test data can be done with DuckDB very easily. Below is the function we will use to create the datasets:
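Here is a minimal sketch of such a generator function, assuming two columns (rand_dt, a random date, and rand_str, a random string) and a parquet output path; the exact column logic may differ from the original:

```python
import duckdb

def make_dataset(num_rows: int, path: str) -> None:
    """Generate num_rows rows with a random date and a random string, written to parquet."""
    duckdb.sql(f"""
        COPY (
            SELECT
                -- random date within roughly a five-year window
                DATE '2020-01-01' + CAST(floor(random() * 1825) AS INTEGER) AS rand_dt,
                -- random string built by hashing a random number
                md5(random()::VARCHAR) AS rand_str
            FROM range({num_rows})
        ) TO '{path}' (FORMAT PARQUET)
    """)
```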
And now let’s build 7 datasets:
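The pattern is simply to call the generator in a loop; the row counts below are illustrative placeholders rather than the exact sizes behind the files shown next:

```python
# Illustrative row counts -- the exact dataset sizes used in the runs may differ.
row_counts = [1_000_000, 5_000_000, 10_000_000, 50_000_000,
              100_000_000, 250_000_000, 500_000_000]

for n in row_counts:
    make_dataset(n, f"dataset_{n}.parquet")
```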
When we take a look at the files created, we see the following sizes:
Notice how that last one is 23GB, which exceeds the RAM on my MacBook Pro (16GB). It will be rather interesting to see how well DuckDB performs on this one.
The Benchmark Code
For our benchmark, we will have both DuckDB and Spark query the rand_dt column and perform a count distinct of the rand_str column for each dataset. Count distinct forces DuckDB and Spark to read the full parquet file (resulting in a high-fidelity test). The benchmark code is pretty straightforward:
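A minimal sketch of what that harness can look like, assuming the query groups by rand_dt and counts distinct rand_str (the function name, view name, and session config here are placeholders, not the exact code from the original run):

```python
import time
import duckdb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("duckdb_vs_spark").getOrCreate()

def benchmark(path: str) -> tuple[float, float]:
    """Return (duckdb_seconds, spark_seconds) for the same count-distinct query."""
    # DuckDB: query the parquet file directly.
    start = time.perf_counter()
    duckdb.sql(f"""
        SELECT rand_dt, COUNT(DISTINCT rand_str) AS distinct_strs
        FROM read_parquet('{path}')
        GROUP BY rand_dt
    """).fetchall()
    duckdb_secs = time.perf_counter() - start

    # Spark: register the file as a temp view and run the equivalent SQL.
    start = time.perf_counter()
    spark.read.parquet(path).createOrReplaceTempView("bench")
    spark.sql("""
        SELECT rand_dt, COUNT(DISTINCT rand_str) AS distinct_strs
        FROM bench
        GROUP BY rand_dt
    """).collect()
    spark_secs = time.perf_counter() - start

    return duckdb_secs, spark_secs
```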
The Results?
DuckDB outperformed Spark in every run by several orders of magnitude. Now, I’m not saying that you can plow ahead and use DuckDB as a replacement for Spark for everything, but I hope this simple demonstration helps you realize that you don’t always need a multi-node sledgehammer to crack a peanut. Below are the run results:
Additionally, I used matplotlib in Python to plot the results on a bar graph, which is also very telling:
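For reference, a sketch of the kind of plotting code involved, assuming the benchmark results were collected as (row_count, duckdb_seconds, spark_seconds) tuples; this is not the exact plotting code from the article:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_results(results):
    """Bar chart of DuckDB vs Spark query times per dataset size."""
    labels = [f"{n:,}" for n, _, _ in results]
    duck = [d for _, d, _ in results]
    spk = [s for _, _, s in results]

    x = np.arange(len(labels))
    width = 0.35

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(x - width / 2, duck, width, label="DuckDB")
    ax.bar(x + width / 2, spk, width, label="Spark")
    ax.set_xticks(x)
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_xlabel("Rows in dataset")
    ax.set_ylabel("Query time (seconds)")
    ax.legend()
    plt.tight_layout()
    plt.show()
```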
As you can see, DuckDB crushed Spark, even when it needed to scan half a billion rows!
Conclusion
DuckDB is a powerful substitute for Spark, even on medium-to-large datasets (when you get to TBs, you probably need multi-node Spark).
This analysis covered only a single benchmark query and did not touch JOIN performance, but in my experience, you'd be surprised how well DuckDB performs with joins. Try it out yourself and you'll be shocked at just how quickly you can author a new data pipeline on a single laptop!
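If you want to try joins yourself, a minimal starting point (file names below are placeholders) is just a SQL join across two parquet files:

```python
import duckdb

# Join two parquet files directly -- no tables to create, no cluster to spin up.
joined = duckdb.sql("""
    SELECT a.rand_dt, COUNT(*) AS matches
    FROM read_parquet('dataset_a.parquet') AS a
    JOIN read_parquet('dataset_b.parquet') AS b
      ON a.rand_str = b.rand_str
    GROUP BY a.rand_dt
""").fetchall()
```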
You can learn more about how to create highly performant pipelines with the DataExpert.io subscription. It gives you immediate access to learn OpenAI, Databricks, Snowflake, AWS, Airflow, Trino, Iceberg and more! Use code DUCKDB for 30% off!
Thanks for Reading,
Matt Martin, Zach Wilson
Subscribe to High Performance DE for more articles like this