34 Comments
User's avatar
mirudake's avatar

Can't wait for part2!!

lindhj's avatar

This is probably the most crucial addendum to the standard dimensional modeling books. There is a timeless core there and then a ton of now pointless trivia for creating brittle logic to deal with 2008 era RDBMS storage and compute constraints. Can’t recommend this and the followup post enough to anyone building or using analytics pipelines.

Sahar Massachi's avatar

Thank you, that’s quite kind. :-)

sivas subramaniyan's avatar

LOVED IT !

Sung Won Chung's avatar

I always dreaded recommending SCD2 implementations in both dbt and sqlmesh because it's so stateful and brittle. Glad to see this series shine a light on this tediousness.

Sahar Massachi's avatar

Thank you! We are going to show the way.

Ivan Lanuza's avatar

Excellent article - I built a data warehouse almost a decade ago and it followed this input-pipeline-output pattern. Back then I called it raw-aggregates-marts. And it’s all SQL! It’s still up and running to this day and is loved by users (they are surprised by the speed to access data despite having hundreds of millions of records). Thank you for taking the time to write this down - I feel validated that I did it right. Looking forward to part 2.

Firas Saddi's avatar

Great article, thank you! Really enjoyed learning about the differences between data warehouse tables vs production tables, and how the latter should feed the former. Will be waiting for part 2.

Sahar Massachi's avatar

Appreciate it, Firas. Yeah, in my first data science job, they just cloned the prod db as the “warehouse”. I’m glad they weren’t slowing down prod with queries — but it wasn’t a good time. I think part of the problem is naming — they’re warehouse TABLES and prod TABLES. But what if, idk, we had warehouse tables, and prod … countertops? Eh, I guess it probably won’t catch on. :-P

Annika's avatar

Such a great post! + supplied links! Thank you. Really looking forward to the part II 🦗

Sahar Massachi's avatar

Hi Annika, I appreciate the kind words. Part 2 is a doozy. It’s already written! You’re gonna love it. (Thank Zach for the links, they were mostly his)

Moshe Shamouilian's avatar

Very informative!

Dan Franks's avatar

Snapshots are inefficient and brittle. Stream the operational data stores in near real-time if you must seperate analytical from operational. I argue that the separation is no longer really necessary except in edge cases where data is actually big. You made a delineation between machine and human data consumption that I also argue is either no longer there or collapsing rapidly. Data mesh was a tad early, similar to data vault, but both are trending towards where the near future will land. Software and Data engineers just haven't realised it yet.

Donald Parish's avatar

I’m going to need some convincing to go against Kimball.

Scott Smith's avatar

What if the production data isn’t initially in the cloud, data changes are overwritten without a timestamp, and you can’t change that process? The compute ingestion costs could be potentially to costly.

Sahar Massachi's avatar

Hi Scott! I’m not sure if I understand the question/premise exactly. Are you saying something like: “What if it’s expensive to copy data from my prod db to data lake?”

If so, that feels like a Big Problem. Unclear how you can _have_ an effective data warehouse if putting data into it is expensive in the first place. ¯\_(ツ)_/¯

(What am I missing?)

Zach Wilson's avatar

Change data capture (CDC) is a way to capture all mutations (I have another article on that if you check the archives). Meta accepts that changes that happen within a day don’t matter that much and day over day changes are the only thing that matters.

Slawomir Piotr's avatar

But it shouldn’t really matter what production does. Or does it? In most cases, the data is copied from production (snapshot) and then the SCD nonsense happens. If the data is changing frequently during the day and you need those changes, I’d still prefer the complete dataset and just add a column with the hour (INT) of the snapshot (for example: 19 for 19:00) so it can be sorted. Or something like this.

Sahar Massachi's avatar

Not sure I quite understand the problem scope, but my hot take:

- If you need anything at sub-daily fidelity, just use logs as the upstream for the tables that need data that fine

- If you _always_ need sub-daily fidelity, do hourly instead of daily snapshots for that data

Scott Smith's avatar

If the table or file is multiple terabytes, then loading daily snapshots of the full dataset to the cloud will get costly. I think the proposed solution is elegant, however, there is an assumption that the upstream production process can be optimized and migrated to the cloud, if necessary.

Slawomir Piotr's avatar

Quite a problem to solve… Oracle used to have a technology called GoldenGate. It was real-time replication based on logs - DML statements. In the case of the mentioned production system or table, if it would be possible to access the SQL update history for a day, it could be an option. Otherwise, whatever works now is the best solution. 🙂

Kevin McLaughlin's avatar

Lots of a great stuff. But in my experience “pipeline” isn’t just ingestion but the whole process from data ingestion, through transformation in dbt or whatever, through to final results.

But the point about storage is really important. Many teams and engineers still operate like it’s 2010 and storage is very expensive. They optimize for performance instead of solving business problems.

Kevin McLaughlin's avatar

Storage is cheap! Your data team’s time is not!

So good. I talked to a team this week that wanted to spend the next month optimizing performance so we could save save $10 per dbt run in snowflake.

Kseniia Korostelova's avatar

I mean can data warehouse setup sound exciting? Apparently!

Mattias Thalén's avatar

Now, this is a style of writing I appreciate, directly calling out my multi-tasking with SoMe 😅

Naina Chaturvedi's avatar

++ Good Post, Also, start here Compilation of 100+ Most Asked System Design, ML System Design Case Studies and LLM System Design

https://open.substack.com/pub/naina0405/p/important-compilation-of-most-asked?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false