Stop using Slowly-changing Dimensions! Part 1

Oct 24

Storage is cheap, your time is not!

21 Comments

This is probably the most crucial addendum to the standard dimensional modeling books. There is a timeless core there and then a ton of now pointless trivia for creating brittle logic to deal with 2008 era RDBMS storage and compute constraints. Can’t recommend this and the followup post enough to anyone building or using analytics pipelines.

Expand full comment

Reply (1)

Sahar Massachi

Thank you, that’s quite kind. :-)

Expand full comment

Firas Saddi

Great article, thank you! Really enjoyed learning about the differences between data warehouse tables vs production tables, and how the latter should feed the former. Will be waiting for part 2.

Expand full comment

Reply (1)

Sahar Massachi

Appreciate it, Firas. Yeah, in my first data science job, they just cloned the prod db as the “warehouse”. I’m glad they weren’t slowing down prod with queries — but it wasn’t a good time. I think part of the problem is naming — they’re warehouse TABLES and prod TABLES. But what if, idk, we had warehouse tables, and prod … countertops? Eh, I guess it probably won’t catch on. :-P

Expand full comment

Sung Won Chung

I always dreaded recommending SCD2 implementations in both dbt and sqlmesh because it's so stateful and brittle. Glad to see this series shine a light on this tediousness.

Expand full comment

Reply (1)

Sahar Massachi

Thank you! We are going to show the way.

Expand full comment

Ivan Lanuza

Excellent article - I built a data warehouse almost a decade ago and it followed this input-pipeline-output pattern. Back then I called it raw-aggregates-marts. And it’s all SQL! It’s still up and running to this day and is loved by users (they are surprised by the speed to access data despite having hundreds of millions of records). Thank you for taking the time to write this down - I feel validated that I did it right. Looking forward to part 2.

Expand full comment

Reply (1)

Sahar Massachi

:-)

Expand full comment

Scott Smith

4dEdited

What if the production data isn’t initially in the cloud, data changes are overwritten without a timestamp, and you can’t change that process? The compute ingestion costs could be potentially to costly.

Expand full comment

Reply (3)

Sahar Massachi

Hi Scott! I’m not sure if I understand the question/premise exactly. Are you saying something like: “What if it’s expensive to copy data from my prod db to data lake?”

If so, that feels like a Big Problem. Unclear how you can _have_ an effective data warehouse if putting data into it is expensive in the first place. ¯\_(ツ)_/¯

(What am I missing?)

Expand full comment

Zach Wilson

Change data capture (CDC) is a way to capture all mutations (I have another article on that if you check the archives). Meta accepts that changes that happen within a day don’t matter that much and day over day changes are the only thing that matters.

Expand full comment

Slawomir Piotr

But it shouldn’t really matter what production does. Or does it? In most cases, the data is copied from production (snapshot) and then the SCD nonsense happens. If the data is changing frequently during the day and you need those changes, I’d still prefer the complete dataset and just add a column with the hour (INT) of the snapshot (for example: 19 for 19:00) so it can be sorted. Or something like this.

Expand full comment

Reply (2)

Sahar Massachi

Not sure I quite understand the problem scope, but my hot take:

- If you need anything at sub-daily fidelity, just use logs as the upstream for the tables that need data that fine

- If you _always_ need sub-daily fidelity, do hourly instead of daily snapshots for that data

Expand full comment

Scott Smith

If the table or file is multiple terabytes, then loading daily snapshots of the full dataset to the cloud will get costly. I think the proposed solution is elegant, however, there is an assumption that the upstream production process can be optimized and migrated to the cloud, if necessary.

Expand full comment

Reply (1)

Slawomir Piotr

Quite a problem to solve… Oracle used to have a technology called GoldenGate. It was real-time replication based on logs - DML statements. In the case of the mentioned production system or table, if it would be possible to access the SQL update history for a day, it could be an option. Otherwise, whatever works now is the best solution. 🙂

Expand full comment

Annika

Such a great post! + supplied links! Thank you. Really looking forward to the part II 🦗

Expand full comment

Reply (1)

Sahar Massachi

Hi Annika, I appreciate the kind words. Part 2 is a doozy. It’s already written! You’re gonna love it. (Thank Zach for the links, they were mostly his)

Expand full comment

mirudake

Can't wait for part2!!

Expand full comment

sivas subramaniyan

LOVED IT !

Expand full comment

Dan Franks

Snapshots are inefficient and brittle. Stream the operational data stores in near real-time if you must seperate analytical from operational. I argue that the separation is no longer really necessary except in edge cases where data is actually big. You made a delineation between machine and human data consumption that I also argue is either no longer there or collapsing rapidly. Data mesh was a tad early, similar to data vault, but both are trending towards where the near future will land. Software and Data engineers just haven't realised it yet.

Expand full comment

Donald Parish

I’m going to need some convincing to go against Kimball.

Expand full comment

DataExpert.io Newsletter

Stop using Slowly-changing Dimensions! Part 1