DataExpert.io Newsletter

How to save millions by optimizing data pipeline shuffling

Zach Wilson
Mar 11, 2024

Shuffling isn’t just a fancy dance move! Shuffling happens when you aggregate or join datasets in distributed environments like Spark or BigQuery. Once, when I was working at Facebook, we had a 50 TB table joined with a 150 TB table. The shuffle caused by that join consumed 30% of all of our compute! I eliminated that shuffle by bucketing th…
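
To make the idea concrete, here is a minimal pure-Python sketch of why bucketing can remove a join shuffle. It is not the Facebook pipeline described above. It only illustrates the general principle: if both tables are pre-partitioned on the join key with the same rule and the same bucket count, every matching pair of rows already sits in the same bucket, so each bucket can be joined locally. The names `NUM_BUCKETS`, `bucket_of`, `bucketize`, `bucketed_join`, and `naive_join` are hypothetical, and the modulo rule is a stand-in for whatever hash function a real engine uses.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_of(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Assign a join key to a bucket. Both tables must use the same rule."""
    return key % num_buckets  # stand-in for a real engine's hash function


def bucketize(rows, num_buckets: int = NUM_BUCKETS):
    """Group (key, value) rows by bucket, like a table written out pre-bucketed."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left_buckets, right_buckets):
    """Join bucket i of the left table only against bucket i of the right table.

    No row ever has to move to another bucket, which is the work a shuffle does.
    """
    joined = []
    for bucket_id, left_rows in left_buckets.items():
        # Build a lookup for the matching right-side bucket only.
        right_index = defaultdict(list)
        for key, value in right_buckets.get(bucket_id, []):
            right_index[key].append(value)
        for key, left_value in left_rows:
            for right_value in right_index.get(key, []):
                joined.append((key, left_value, right_value))
    return sorted(joined)


def naive_join(left_rows, right_rows):
    """Reference join that compares every left row with every right row."""
    return sorted(
        (lk, lv, rv)
        for lk, lv in left_rows
        for rk, rv in right_rows
        if lk == rk
    )
```

Joining bucket by bucket gives the same rows as the all-pairs join, but each bucket only ever reads its own counterpart. In Spark, the analogous approach is to write both tables with `bucketBy` on the join key and the same number of buckets, so the planner can skip the exchange step for that join.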

© 2025 Zach Wilson