Understanding Parquet Format for beginners
A walk-through of the most important file format to ever exist
There’s been a lot of talk about open table formats like Iceberg and Delta over the last few years. While these formats are awesome, many of the underlying efficiency and performance gains can be attributed to Parquet, with Iceberg/Delta serving as a nice management layer on top.
Think of Iceberg/Delta as the middle manager who gets all the credit, while Parquet is the senior engineer doing most of the work!
In this article, we are going to talk about:
Why Parquet matters
The two compression techniques
Comparing Parquet to CSV for data compression
The interplay between Parquet and compression algorithms like zstd

Why Parquet Matters
Parquet is a columnar storage format.
Instead of storing rows like this:
| user_id | country | status | amount |
|---------|---------|--------|--------|
| 1 | US | active | 100 |
| 2 | US | active | 200 |
| 3 | CA | churn | 150 |

Row-based formats (like CSV or JSON) store this row by row.
Parquet stores it column by column:
user_id: [1, 2, 3]
country: [US, US, CA]
status: [active, active, churn]
amount: [100, 200, 150]

This unlocks:
Column Pruning
If your query only needs amount, Parquet doesn’t read the other columns.
Predicate Pushdown
If you’re filtering on country = 'US', Parquet can skip entire row groups using metadata statistics.
Massive Compression
Columns contain similar data types and repeated values — perfect for compression.
You’ll notice a theme with Parquet: maximizing data skipping and compression!
The more data we skip, the lower our cloud bill!
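To make predicate pushdown concrete, here’s a toy Python sketch. This is not Parquet’s actual implementation — the row groups, their stats fields, and the scan function are all made up for illustration — but it shows the core trick: consult min/max statistics first, and only read the groups that could possibly match.

```python
# Each "row group" carries min/max stats for its columns, the way
# Parquet footers do. A reader can skip a whole group when the
# filter can't possibly match anything inside it.
row_groups = [
    {"amount_min": 100, "amount_max": 200, "rows": [100, 150, 200]},
    {"amount_min": 500, "amount_max": 900, "rows": [500, 700, 900]},
]

def scan_amount_greater_than(threshold, groups):
    """Return matching values, skipping groups whose max can't qualify."""
    results = []
    for group in groups:
        if group["amount_max"] <= threshold:
            continue  # predicate pushdown: skip the entire row group
        results.extend(v for v in group["rows"] if v > threshold)
    return results

print(scan_amount_greater_than(400, row_groups))  # → [500, 700, 900]
```

Only the second row group is ever scanned — the first is eliminated by its stats alone, without touching its data.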
Run-Length Encoding (RLE)
RLE compresses sequences of repeated values.
Example column:
country: [US, US, US, US, CA, CA, CA, US]

Instead of storing:
US, US, US, US, CA, CA, CA, US

RLE stores:
(US, 4), (CA, 3), (US, 1)

The more sorted your data is, the better the compression!
You could imagine if we sorted the data above, we would get:
country: [CA, CA, CA, US, US, US, US, US]

And RLE would store:
(CA, 3), (US, 5)

Just by sorting, we reduced our storage footprint by 33% (three runs down to two)!
This is extremely effective for:
Sorted columns
Partition columns
Low-cardinality values
Boolean columns
Status flags
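Here’s a minimal Python sketch of the idea — not Parquet’s actual encoder, just a hand-rolled function that reproduces the runs from the example above, including the win from sorting:

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

country = ["US", "US", "US", "US", "CA", "CA", "CA", "US"]
print(rle_encode(country))          # → [('US', 4), ('CA', 3), ('US', 1)]
print(rle_encode(sorted(country)))  # → [('CA', 3), ('US', 5)]
```

Fewer runs after sorting means fewer pairs to store — exactly the 33% reduction from the example.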
Example in PySpark
Let’s create data where RLE will shine:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (1, "US", "active"),
    (2, "US", "active"),
    (3, "US", "active"),
    (4, "CA", "churn"),
    (5, "CA", "churn"),
    (6, "CA", "churn"),
]
df = spark.createDataFrame(data, ["user_id", "country", "status"])
df = df.orderBy("country")
df.write \
    .option("compression", "zstd") \
    .mode("overwrite") \
    .parquet("/tmp/rle_example")

Because country and status are clustered together, Parquet’s internal RLE encoding compresses them efficiently.
Columnar data would look like this:
user_id: [1,2,3,4,5,6]
country: [US, US, US, CA, CA, CA]
status: [active, active, active, churn, churn, churn]

Run-length-encoded Parquet would look like:
user_id: [1,2,3,4,5,6]
country: [(US, 3), (CA, 3)]
status: [(active, 3), (churn, 3)]
Dictionary Encoding
Dictionary encoding works differently.
If a column has limited unique values:
country: [US, US, CA, CA, US]

Parquet builds a dictionary:
Dictionary:
0 -> US
1 -> CA

Then stores:
[0, 0, 1, 1, 0]

Instead of repeating strings, it stores small integers.
This is extremely powerful for:
Categorical columns
Enums
Status fields
Country codes
Device types
Event types
This is not very useful for:
UUIDs
User IDs
Timestamps
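A minimal Python sketch of the mechanism (Parquet’s real dictionary pages are more involved, but the mapping works like this):

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer ID, in first-seen order."""
    dictionary = {}
    ids = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)  # next available ID
        ids.append(dictionary[v])
    return dictionary, ids

country = ["US", "US", "CA", "CA", "US"]
dictionary, ids = dictionary_encode(country)
print(dictionary)  # → {'US': 0, 'CA': 1}
print(ids)         # → [0, 0, 1, 1, 0]
```

You can also see from this sketch why high-cardinality columns defeat it: if every value is unique (UUIDs, user IDs), the dictionary grows as large as the data itself and saves nothing.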
Example: Forcing Dictionary Encoding
You can control Parquet behavior in Spark:
df.write \
    .option("parquet.enable.dictionary", "true") \
    .option("parquet.dictionary.page.size", "1048576") \
    .parquet("/tmp/dict_example")

Dictionary encoding is usually enabled by default, but it may fall back if:
Cardinality is too high
The dictionary grows too large (set by parquet.dictionary.page.size, defaulting to 1MB per page)
Comparing CSV to Parquet
Imagine we had this CSV dataset
| country | status |
|---------|---------|
| US | active |
| US | active |
| US | active |
| US | active |
| US | active |
| US | active |
| US | active |
| US | active |
| CA | active |
| CA | active |
| CA | active |
| CA | active |
| CA | churn |
| CA | churn |
| CA | churn |
| CA | churn |
| CA | churn |
| CA | churn |
| CA | churn |
| CA | churn |

Let’s approximate bytes:
String sizes:
US/CA = 2 bytes
active = 6 bytes
churn = 5 bytes
Country bytes:
8×2 (US) = 16
12×2 (CA) = 24
Total = 40
Status bytes:
12×6 (active) = 72
8×5 (churn) = 40
Total = 112
Total (values only):
CSV ≈ 40 + 112 = 152 bytes
Parquet Size (dictionary + RLE, approx)
Dictionary
We store each unique string once:
US (2 bytes) → stored as 0
CA (2 bytes) → stored as 1
active (6 bytes) → stored as 0
churn (5 bytes) → stored as 1
Dictionary bytes = 15
RLE over dictionary IDs
After dictionary encoding:
country: [(0, 8), (1, 12)]
status: [(0, 12), (1, 8)]

If we approximate each run pair (value_id, run_length) as ~2 bytes (tiny ints), then:
Runs ≈ 4 × 2 = 8 bytes
So:
Parquet (dict + RLE) ≈ 15 + 8 = 23 bytes!
CSV takes up 152 bytes; Parquet takes up 23 bytes.
152 / 23 ≈ 6.6× smaller!
And that’s before zstd/snappy compression, and ignoring Parquet metadata overhead (which matters at tiny sizes but becomes negligible at real scale).
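If you want to sanity-check the arithmetic, here’s the same back-of-the-envelope math in Python. The ~2 bytes per run pair is the same rough assumption as above, not a real Parquet size:

```python
# Rebuild the toy dataset: 8 US rows, 12 CA rows; 12 active, 8 churn.
country = ["US"] * 8 + ["CA"] * 12
status = ["active"] * 12 + ["churn"] * 8

# CSV cost: every string stored in full, every row.
csv_bytes = sum(len(v) for v in country + status)

# Parquet (toy model): each unique string once, plus ~2 bytes per run pair.
dictionary_bytes = sum(len(v) for v in ["US", "CA", "active", "churn"])
run_pairs = 4              # country: (0,8),(1,12); status: (0,12),(1,8)
parquet_bytes = dictionary_bytes + run_pairs * 2

print(csv_bytes, parquet_bytes, round(csv_bytes / parquet_bytes, 1))
# → 152 23 6.6
```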
Now let’s add zstd to make it even smaller!
Here’s where it gets interesting.
Parquet uses two layers:
Encoding layer (RLE, dictionary)
Compression layer (zstd, snappy, gzip)
Think of encoding as restructuring data for compression.
Think of zstd as the final squeeze.
Before we add it to the CSV example above, let’s do a quick example of how zstd works:
Imagine this byte string:
ACTIVE_ACTIVE_ACTIVE_ACTIVE_

Raw size:
“ACTIVE_” = 7 bytes
4 repetitions = 28 bytes

zstd scans and sees repetition.
It might encode this as:
Literal: “ACTIVE_”
Backreference: (distance=7, length=21)

Meaning:
Write "ACTIVE_"
Then repeat the previous 7-byte chunk 3 more times
Instead of storing 28 bytes, it stores:
7 literal bytes
A small pointer describing repetition
That’s how dictionary + RLE inside Parquet creates ideal input for zstd — lots of repetition.
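zstd isn’t in Python’s standard library, but zlib uses the same LZ77-style “literal + backreference” idea, so we can demo the effect with it as a stand-in:

```python
import zlib

# 4 repetitions of "ACTIVE_" = 28 bytes of pure repetition.
raw = b"ACTIVE_" * 4
compressed = zlib.compress(raw, level=9)

# The compressor stores the 7 literal bytes once, then a small
# backreference for the rest -- smaller than the raw input.
print(len(raw), len(compressed))
```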
Putting zstd + run length encoding + dictionary encoding
Take our Parquet toy RLE output:
country: (0 × 8), (1 × 12)
status: (0 × 12), (1 × 8)

Internally, this might look like:
0 8 1 12 0 12 1 8Now zstd sees the repeated 0s, 1s, 12s, and 8s:
Very small integers
Very few distinct values
Repetition patterns
It compresses this to something like:
[compressed block header + compact bitstream]

Shrinking 8 bytes → 5-6 bytes.
Not dramatic here — because dictionary encoding and RLE already removed most redundancy.
The dictionary-encoded, run-length-encoded, zstd-compressed version of this dataset is ~20-21 bytes! That’s roughly 7-8 times smaller than CSV!
Which technique is actually doing the work here?
In this example, we can see that run-length encoding is doing 70-80% of the compression. Dictionary encoding is doing another 10%, and zstd compression is doing another 10%.
The problem with this is that run-length encoding requires sorting to be the most effective. Often, the sorting required to compress the data optimally costs MORE than the storage costs of storing the data non-ideally. Remember, zstd can compress repeated patterns even if the data is not sorted.
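You can see this with a quick toy experiment, again using zlib as a stand-in for zstd: both the sorted and the shuffled versions of a repetitive CSV payload compress dramatically, sorted just compresses better.

```python
import random
import zlib

random.seed(0)
rows = ["US,active"] * 1200 + ["CA,churn"] * 800  # already sorted
shuffled = rows[:]
random.shuffle(shuffled)

raw_len = len("\n".join(rows).encode())
sorted_comp = zlib.compress("\n".join(rows).encode(), 9)
shuffled_comp = zlib.compress("\n".join(shuffled).encode(), 9)

# Sorted input collapses into a few long backreferences; shuffled input
# still compresses well because the same two rows repeat throughout.
print(raw_len, len(sorted_comp), len(shuffled_comp))
```

So skipping the sort doesn’t forfeit compression entirely — it just leaves some of it on the table, which is exactly the storage-versus-compute tradeoff.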
This illustrates the painful storage and compute tradeoff that doesn’t have a correct solution. Optimizing the storage almost always costs compute!
How do you compress your data most effectively? Can we vow to never use CSV ever again, at least? Please share this with your friends if you found it useful!
You can use code PARQUET for 30% off the DataExpert.io academy, where we show how these techniques actually work in production!


