How to save millions by optimizing data pipeline shuffling

Mar 11, 2024


Shuffling isn’t just a fancy dance move! A shuffle happens when you aggregate or join datasets in a distributed environment like Spark or BigQuery: rows that share a key have to be moved across the network so they end up on the same machine. When I was working at Facebook, we had a 50 TB table joined with a 150 TB table. The shuffle caused by that join took up 30% of all of our compute! I eliminated that shuffle by bucketing th…
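To make the idea of bucketing concrete, here is a minimal pure-Python sketch of why it can remove a join shuffle. It is an illustration under stated assumptions, not the pipeline described above: the table contents, `NUM_BUCKETS`, and the helpers `bucket_of`, `bucketize`, and `bucketed_join` are all hypothetical. The point it shows is that when both tables are written with the same hash function and the same bucket count on the join key, bucket i of one table only ever needs to meet bucket i of the other, so no rows have to be redistributed at join time.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count; both tables must use the same value


def bucket_of(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministically map a join key to a bucket id."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets: int = NUM_BUCKETS):
    """Group rows by the bucket of their join key (first field of each row).

    In a real system this happens once, when the table is written.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets):
    """Inner join where bucket i on the left is only compared to bucket i on the right."""
    joined = []
    for bucket_id, left_rows in left_buckets.items():
        # Index the matching right-side bucket by key; no other bucket can hold a match.
        right_by_key = defaultdict(list)
        for right_row in right_buckets.get(bucket_id, []):
            right_by_key[right_row[0]].append(right_row)
        for left_row in left_rows:
            for right_row in right_by_key.get(left_row[0], []):
                joined.append((left_row[0], left_row[1], right_row[1]))
    return joined


# Made-up sample data, purely for illustration.
users = [("u1", "Ana"), ("u2", "Bo"), ("u3", "Cy")]
events = [("u1", "click"), ("u3", "view"), ("u1", "view"), ("u4", "click")]

result = sorted(bucketed_join(bucketize(users), bucketize(events)))
print(result)
```

Each bucket pair can be joined independently, which is what lets a distributed engine skip the network-heavy redistribution step when both inputs are already bucketed the same way.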

Continue reading this post for free, courtesy of Zach Wilson.


DataExpert.io Newsletter
