29 Comments
mirudake's avatar

Can't wait for part 2!!

lindhj's avatar

This is probably the most crucial addendum to the standard dimensional modeling books. There is a timeless core there and then a ton of now pointless trivia for creating brittle logic to deal with 2008 era RDBMS storage and compute constraints. Can’t recommend this and the followup post enough to anyone building or using analytics pipelines.

Sahar Massachi's avatar

Thank you, that’s quite kind. :-)

sivas subramaniyan's avatar

LOVED IT!

Sahar Massachi's avatar

Thank You! <3

Sung Won Chung's avatar

I always dreaded recommending SCD2 implementations in both dbt and sqlmesh because it's so stateful and brittle. Glad to see this series shine a light on this tediousness.
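For anyone who hasn't had the pleasure, the classic dbt snapshot block looks roughly like this (source and column names invented for illustration):

    {% snapshot customers_scd2 %}

    {{
        config(
          target_schema='snapshots',
          unique_key='customer_id',
          strategy='timestamp',
          updated_at='updated_at'
        )
    }}

    -- dbt compares this query's rows to the existing snapshot table and
    -- maintains dbt_valid_from / dbt_valid_to ranges for each version it sees
    select * from {{ source('prod', 'customers') }}

    {% endsnapshot %}

All the history lives only in that one incrementally maintained snapshot table, which is exactly the statefulness that makes it brittle: miss a run or mangle the table and the history is gone.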

Sahar Massachi's avatar

Thank you! We are going to show the way.

Ivan Lanuza's avatar

Excellent article - I built a data warehouse almost a decade ago and it followed this input-pipeline-output pattern. Back then I called it raw-aggregates-marts. And it’s all SQL! It’s still up and running to this day and is loved by users (they are surprised by the speed to access data despite having hundreds of millions of records). Thank you for taking the time to write this down - I feel validated that I did it right. Looking forward to part 2.
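Roughly this shape, with made-up table names, in case it helps anyone picture it:

    -- raw: untouched copies of the source tables
    -- aggregates: cleaned / rolled-up intermediates
    -- marts: what analysts and dashboards actually query

    create table aggregates.daily_orders as
    select
        cast(created_at as date) as order_date,
        customer_id,
        count(*)                 as order_count,
        sum(amount)              as revenue
    from raw.orders
    group by 1, 2;

    create table marts.customer_revenue as
    select
        customer_id,
        sum(revenue) as lifetime_revenue
    from aggregates.daily_orders
    group by 1;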

Firas Saddi's avatar

Great article, thank you! Really enjoyed learning about the differences between data warehouse tables vs production tables, and how the latter should feed the former. Will be waiting for part 2.

Sahar Massachi's avatar

Appreciate it, Firas. Yeah, in my first data science job, they just cloned the prod db as the “warehouse”. I’m glad they weren’t slowing down prod with queries — but it wasn’t a good time. I think part of the problem is naming — they’re warehouse TABLES and prod TABLES. But what if, idk, we had warehouse tables, and prod … countertops? Eh, I guess it probably won’t catch on. :-P

Annika's avatar

Such a great post! And the supplied links! Thank you. Really looking forward to part II 🦗

Sahar Massachi's avatar

Hi Annika, I appreciate the kind words. Part 2 is a doozy. It’s already written! You’re gonna love it. (Thank Zach for the links, they were mostly his)

Moshe Shamouilian's avatar

Very informative!

Dan Franks's avatar

Snapshots are inefficient and brittle. Stream the operational data stores in near real-time if you must separate analytical from operational. I argue that the separation is no longer really necessary except in edge cases where the data is actually big. You made a delineation between machine and human data consumption that I also argue is either no longer there or collapsing rapidly. Data mesh was a tad early, similar to data vault, but both are trending towards where the near future will land. Software and data engineers just haven't realised it yet.

Donald Parish's avatar

I’m going to need some convincing to go against Kimball.

Scott Smith's avatar

What if the production data isn't initially in the cloud, data changes are overwritten without a timestamp, and you can't change that process? The compute and ingestion costs could potentially be too high.

Sahar Massachi's avatar

Hi Scott! I’m not sure if I understand the question/premise exactly. Are you saying something like: “What if it’s expensive to copy data from my prod db to data lake?”

If so, that feels like a Big Problem. Unclear how you can _have_ an effective data warehouse if putting data into it is expensive in the first place. ¯\_(ツ)_/¯

(What am I missing?)

Zach Wilson's avatar

Change data capture (CDC) is a way to capture all mutations (I have another article on that if you check the archives). Meta accepts that changes within a day don't matter that much; day-over-day changes are the only thing that matters.
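A rough sketch of what that looks like (table and column names invented): compact the CDC stream into one row per key per day, keeping only the last state seen that day.

    select *
    from (
        select
            c.*,
            row_number() over (
                partition by c.id, cast(c.changed_at as date)
                order by c.changed_at desc
            ) as rn
        from cdc_changes c
    ) latest
    where rn = 1;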

Slawomir Piotr's avatar

But it shouldn’t really matter what production does. Or does it? In most cases, the data is copied from production (snapshot) and then the SCD nonsense happens. If the data is changing frequently during the day and you need those changes, I’d still prefer the complete dataset and just add a column with the hour (INT) of the snapshot (for example: 19 for 19:00) so it can be sorted. Or something like this.
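Something like this is what I mean (Postgres-style, table names invented):

    -- same full snapshot, plus an INT hour column so intraday copies can be sorted
    insert into warehouse.orders_snapshot
    select
        o.*,
        current_date                  as snapshot_date,
        extract(hour from now())::int as snapshot_hour   -- e.g. 19 for the 19:00 run
    from prod.orders o;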

Sahar Massachi's avatar

Not sure I quite understand the problem scope, but my hot take:

- If you need anything at sub-daily fidelity, just use logs as the upstream for the tables that need data that fine

- If you _always_ need sub-daily fidelity, do hourly instead of daily snapshots for that data
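For the first option, the idea is just to read the append-only event log directly instead of a snapshot of mutable state (table names invented):

    select
        order_id,
        event_type,
        event_at
    from events.order_events
    where event_at >= current_date   -- today's intraday activity
    order by order_id, event_at;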

Scott Smith's avatar

If the table or file is multiple terabytes, then loading daily snapshots of the full dataset to the cloud will get costly. I think the proposed solution is elegant; however, it assumes the upstream production process can be optimized and migrated to the cloud if necessary.

Slawomir Piotr's avatar

Quite a problem to solve… Oracle used to have a technology called GoldenGate. It was real-time replication based on logs of DML statements. For the production system or table in question, if it were possible to access a day's worth of SQL update history, that could be an option. Otherwise, whatever works now is the best solution. 🙂

Naina Chaturvedi's avatar

++ Good post. Also, start here: Compilation of 100+ Most Asked System Design, ML System Design Case Studies and LLM System Design

https://open.substack.com/pub/naina0405/p/important-compilation-of-most-asked?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

Fahim Kanji's avatar

For clarity, this is not the data warehouse that nobody taught you (generally speaking, not you specifically); it's the historically standard approach that has been lost on software engineers coming into the practice and pushing their always-better wisdom onto data engineers (aka ETL developers).

Also, there's no such thing as ELT outside of streaming analytics, where you do actually extract from a system, load a stream, and transform (and consume that transformation). If you warehouse, you *ALWAYS* L as your last step in the process ... so ELT as you've described it is simply staging ... it's really ELTL.
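In other words, the shape is roughly this (names invented):

    -- E+L: the raw extract has already been loaded into raw.orders_landed
    -- T:   transform into a staging table you can resume from on failure
    create table staging.orders_clean as
    select
        id,
        customer_id,
        cast(amount as decimal(12, 2)) as amount,
        updated_at
    from raw.orders_landed;

    -- final L: load the transformed output into the warehouse-facing table
    insert into warehouse.fct_orders
    select * from staging.orders_clean;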

And best of all - it's not new! It's what every data warehouse architect and developer did because storage and compute weren't cheap! You couldn't afford to run full scans, and in order to survive downstream failures you had to stage data so you could resume from a stage on failure!

Gonna guess I'm spoiling things for the next articles so I'll pause and we can reconvene in the next article!

Good job by the way! Very nice to read your style!

jesper's avatar

When is part II coming, please? I've paid.

Zach Wilson's avatar

Part two was released a few days ago!