Discussion about this post

Matt Martin

Good stuff. One thing that needs to be considered is the overhead of decompressing the data; there is a real compute and time cost to it. This is why you shouldn’t just have the mandate of “thou must only use Parquet”. It really does depend on the business problem you are trying to solve. If it’s hundreds of terabytes, then by all means use Parquet. But if you are talking about just a few thousand rows spread across 20 files or so, CSV is the better candidate: there’s no decompression hit, and the format is more portable and human-readable for quick spot checks and debugging.
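The trade-off described here can be sketched with a small, stdlib-only comparison. This is a rough illustration, not a Parquet benchmark: it uses `gzip` as a stand-in for columnar compression (reading real Parquet would need a library like `pyarrow`), and the row count and repeat count are arbitrary choices for demonstration.

```python
import csv
import gzip
import io
import timeit

# Build a small dataset, on the order of the "few thousand rows" case.
rows = [["id", "value"]] + [[str(i), str(i * 2)] for i in range(5000)]
buf = io.StringIO()
csv.writer(buf).writerows(rows)

plain = buf.getvalue().encode("utf-8")   # plain CSV bytes on "disk"
packed = gzip.compress(plain)            # compressed stand-in for a columnar file

def read_plain():
    # Read straight from the uncompressed bytes.
    return list(csv.reader(io.StringIO(plain.decode("utf-8"))))

def read_packed():
    # Same read, but pay a decompression step first.
    return list(csv.reader(io.StringIO(gzip.decompress(packed).decode("utf-8"))))

# Both paths recover identical data; the compressed path trades CPU for size.
assert read_plain() == read_packed()
assert len(packed) < len(plain)

t_plain = timeit.timeit(read_plain, number=50)
t_packed = timeit.timeit(read_packed, number=50)
print(f"plain: {t_plain:.3f}s  compressed: {t_packed:.3f}s  "
      f"size ratio: {len(packed) / len(plain):.2f}")
```

At this scale the decompression cost is measurable while the size savings buy little, which is the comment's point; at hundreds of terabytes, the I/O saved by compression and columnar layout dominates.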

Sugaan

Great Article Zach!!!
