21 Comments
lindhj:

This is probably the most crucial addendum to the standard dimensional modeling books. There is a timeless core there, and then a ton of now-pointless trivia for creating brittle logic to deal with 2008-era RDBMS storage and compute constraints. Can’t recommend this and the follow-up post enough to anyone building or using analytics pipelines.

Sahar Massachi:

Thank you, that’s quite kind. :-)

Firas Saddi:

Great article, thank you! Really enjoyed learning about the differences between data warehouse tables and production tables, and how the latter should feed the former. Will be waiting for part 2.

Sahar Massachi:

Appreciate it, Firas. Yeah, in my first data science job, they just cloned the prod db as the “warehouse”. I’m glad they weren’t slowing down prod with queries — but it wasn’t a good time. I think part of the problem is naming — they’re warehouse TABLES and prod TABLES. But what if, idk, we had warehouse tables, and prod … countertops? Eh, I guess it probably won’t catch on. :-P

Sung Won Chung:

I always dreaded recommending SCD2 implementations in both dbt and sqlmesh because it's so stateful and brittle. Glad to see this series shine a light on this tediousness.

Sahar Massachi:

Thank you! We are going to show the way.

Ivan Lanuza:

Excellent article - I built a data warehouse almost a decade ago and it followed this input-pipeline-output pattern. Back then I called it raw-aggregates-marts. And it’s all SQL! It’s still up and running to this day and is loved by users (they are surprised by the speed to access data despite having hundreds of millions of records). Thank you for taking the time to write this down - I feel validated that I did it right. Looking forward to part 2.

Sahar Massachi:

:-)

Scott Smith:

What if the production data isn’t initially in the cloud, data changes are overwritten without a timestamp, and you can’t change that process? The ingestion and compute costs could be prohibitively high.

Sahar Massachi:

Hi Scott! I’m not sure if I understand the question/premise exactly. Are you saying something like: “What if it’s expensive to copy data from my prod db to data lake?”

If so, that feels like a Big Problem. Unclear how you can _have_ an effective data warehouse if putting data into it is expensive in the first place. ¯\_(ツ)_/¯

(What am I missing?)

Zach Wilson:

Change data capture (CDC) is a way to capture all mutations (I have another article on that if you check the archives). Meta accepts that changes within a day don’t matter that much; day-over-day changes are the only thing that matters.
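For illustration, a minimal SQL sketch of that day-over-day collapse over a CDC change log. The `cdc_events` table and its columns are illustrative assumptions, not from the article or the CDC post:

```sql
-- Assumed CDC feed (hypothetical names): cdc_events(user_id, op, plan, changed_at),
-- one row per mutation captured from the prod DB.
-- Keep only the last state of each row per day, so intraday churn is dropped
-- and the warehouse only sees day-over-day changes.
SELECT user_id, op, plan, changed_at
FROM (
  SELECT
    user_id,
    op,            -- 'insert' / 'update' / 'delete'
    plan,
    changed_at,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, CAST(changed_at AS DATE)
      ORDER BY changed_at DESC
    ) AS rn
  FROM cdc_events
) AS last_change_of_day
WHERE rn = 1
  AND op <> 'delete';
```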

Slawomir Piotr:

But it shouldn’t really matter what production does. Or does it? In most cases, the data is copied from production (snapshot) and then the SCD nonsense happens. If the data is changing frequently during the day and you need those changes, I’d still prefer the complete dataset and just add a column with the hour (INT) of the snapshot (for example: 19 for 19:00) so it can be sorted. Or something like this.
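For illustration, a minimal sketch of that snapshot-plus-hour-column layout (table and column names are made up, not from the article):

```sql
-- Hypothetical snapshot table: a full copy of the prod table, stamped with
-- the snapshot date and an hour column (INT) so intraday versions sort cleanly.
CREATE TABLE users_snapshot (
  snapshot_date DATE   NOT NULL,
  snapshot_hour INT    NOT NULL,   -- e.g. 19 for the 19:00 snapshot
  user_id       BIGINT NOT NULL,
  plan          VARCHAR(32),
  updated_at    TIMESTAMP
);

-- Latest intraday version of each row for a given day:
SELECT *
FROM (
  SELECT s.*,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY snapshot_hour DESC) AS rn
  FROM users_snapshot AS s
  WHERE snapshot_date = DATE '2025-01-01'
) AS latest
WHERE rn = 1;
```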

Sahar Massachi:

Not sure I quite understand the problem scope, but my hot take:

- If you need anything at sub-daily fidelity, just use logs as the upstream for the tables that need data that fine

- If you _always_ need sub-daily fidelity, do hourly instead of daily snapshots for that data

Scott Smith:

If the table or file is multiple terabytes, then loading daily snapshots of the full dataset to the cloud will get costly. I think the proposed solution is elegant; however, it assumes that the upstream production process can be optimized and migrated to the cloud, if necessary.

Slawomir Piotr:

Quite a problem to solve… Oracle used to have a technology called GoldenGate. It was real-time replication based on logs - DML statements. In the case of the mentioned production system or table, if it would be possible to access the SQL update history for a day, it could be an option. Otherwise, whatever works now is the best solution. 🙂

Annika:

Such a great post! + supplied links! Thank you. Really looking forward to part II 🦗

Sahar Massachi:

Hi Annika, I appreciate the kind words. Part 2 is a doozy. It’s already written! You’re gonna love it. (Thank Zach for the links, they were mostly his)

mirudake:

Can't wait for part 2!!

sivas subramaniyan:

LOVED IT!

Dan Franks:

Snapshots are inefficient and brittle. Stream the operational data stores in near real-time if you must separate analytical from operational. I argue that the separation is no longer really necessary except in edge cases where data is actually big. You made a delineation between machine and human data consumption that I also argue is either no longer there or collapsing rapidly. Data mesh was a tad early, similar to data vault, but both are trending towards where the near future will land. Software and data engineers just haven't realised it yet.

Donald Parish:

I’m going to need some convincing to go against Kimball.
