Date-stamped snapshots are solid for analytic state questions and fast backfills. No argument from me on that, but where this breaks down is human-edited operational data: Salesforce, NetSuite, HubSpot, finance apps. Human-entered data. Those records flip after the fact, and auditors expect an immutable trail tied to who changed what and when.
If you need real auditability, capture changes first (CDC or event log). Then layer convenience: a “latest” view for analysts and, if needed, an SCD2 history table for business reporting. Snapshots alone are fine for product analytics; they’re not enough for regulated ops data.
So the choice isn’t “SCD2 vs datestamps.” It’s immutable facts at the base, with snapshots or SCD2 as views that match the model and the risk.
Your question might be: but how does SCD2 fix this?
It “fails” only if your ingestion cadence or diff method is too coarse. Fix that with CDC or higher-frequency diffs; SCD2 will happily store the extra versions.
Then SCD2 is the storage pattern and CDC is the microscope.
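As a rough sketch of that layering (table and column names here are hypothetical, not from the article): an immutable change log at the base, with a "latest" convenience view and an SCD2-style history view derived from it.
------------------------------------------------------------
-- Hypothetical change log captured via CDC:
-- account_change_log(account_id, changed_at, changed_by, ...)

-- Convenience "latest" view for analysts
CREATE VIEW accounts_latest AS
SELECT *
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (PARTITION BY account_id
                              ORDER BY changed_at DESC) AS rn
    FROM account_change_log c
) ranked
WHERE rn = 1;

-- SCD2-style history view: a validity window per captured version
CREATE VIEW accounts_history AS
SELECT c.*,
       changed_at AS valid_from,
       LEAD(changed_at) OVER (PARTITION BY account_id
                              ORDER BY changed_at) AS valid_to
FROM account_change_log c;
------------------------------------------------------------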
Great article! One gap: handling deletes and cancellations
The datestamp approach elegantly handles updates and backfills, but there's a missing piece around data that gets deleted or cancelled after the fact. Here are the main patterns I've seen (rough sketch after the list):
For Dimension Tables:
Soft delete flags in snapshots - Include deleted records in daily snapshots with is_deleted or is_active flags. Queries filter on WHERE is_active = 1 AND ds = '2025-01-16'. Use a deleted_at timestamp (with microsecond precision) to track the exact deletion time. Preserves history while maintaining idempotency.
For Fact Tables:
Reversal/negative records - Like accounting ledgers. Original transaction gets a +100 row, cancellation gets a -100 row with a later ds. Works great for SUM() aggregations but complicates count logic.
Status/flag columns - Each ds includes the status as-of that day (completed, cancelled, refunded). Queries filter WHERE status != 'cancelled'. Better for non-numeric data or complex state transitions.
Separate cancellation fact tables - Keep fct_orders immutable, add fct_order_cancellations. LEFT JOIN and filter. Clean separation of concerns.
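A rough sketch of the first two patterns, with hypothetical table and column names:
------------------------------------------------------------
-- Dimension: soft-delete flag inside the daily snapshot
SELECT *
FROM dim_customers
WHERE ds = '2025-01-16'
  AND is_deleted = false;

-- Fact: reversal records net out under SUM()
SELECT order_id,
       SUM(amount) AS net_amount    -- +100 original, -100 cancellation
FROM fct_order_lines
GROUP BY order_id
HAVING SUM(amount) <> 0;            -- drops fully cancelled orders
------------------------------------------------------------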
Great technique for it! I'm currently suffering with these SCD2 issues, so I'll make sure to propose and implement it now that I have an article to back me up. What about when the granularity of the date is not daily but a timestamp, and multiple states can happen within the day? Curious if there's a similar technique apart from bucketing into smaller "stamps".
Thank you! Glad you liked it.
On the granularity question - you'll still have logs for everything that happens, right? You can get the data you need from the logs (or even coalesce/aggregate them and put them in dim tables). So unless you need really fast turnaround (quasi-realtime), I think daily snapshots are fine.
But maybe I'm misunderstanding your use case! Tell me more.
Not a use case I have at the moment; I was just curious about states that change multiple times within the day. Zach is right, CDC could be the approach. I'll research it (I may even join the bootcamp😆)
After posting, I started wondering: how would you use date stamping with gaps in the data?
Change data capture is the strategy there and I teach that in my boot camp!
I had a similar question in mind, and I am not sure CDC is the answer. CDC or event logs make sense at the raw ingestion level, but what is the best way to transform them into useful dimensions when the exact time of change matters? For example, it will confuse the analyst or product team if users A and B became unconnected on 2025-10-10 and user A still posted something on user B's board that morning. A fact like "user A posted on user B's board and have_connection = false" would seem like a huge bug if that should not be possible. So here I would argue for SCD2 logic for the dimension, with event logs saved as-is? Or joining facts based on the last dimension change before the fact time?
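One hedged sketch of that last idea, joining each fact to the dimension version that was in effect at the fact's timestamp (the fact table, the history table, and its valid_from/valid_to columns are all made up for illustration):
------------------------------------------------------------
-- As-of join: pick the connection state in effect when the post happened
SELECT f.poster_id,
       f.board_owner_id,
       f.posted_at,
       d.have_connection
FROM fct_board_posts f
LEFT JOIN dim_connections_history d
  ON  f.poster_id      = d.user_id
  AND f.board_owner_id = d.friend_id
  AND f.posted_at >= d.valid_from
  AND (f.posted_at < d.valid_to OR d.valid_to IS NULL);
------------------------------------------------------------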
This is awesome, thanks! I do have one question: for facts, you just append new facts, right? Since they would already have a date, vs. dimensions where you datestamp the whole table every day?
Yeah, exactly! Facts (and their upstream events) get datestamped — but the date stamp is just “what day did this happen?”
Thanks for your reply! Maybe I'm overlooking something, but is the purpose of datestamping the facts just for partitioning? I'm thinking that when querying and joining to the latest dim, you won't really want to filter the fact by ds as well. For example, I work for a construction company; one of our dim tables is the jobs table (these are construction projects, so they have status, market segment, start date, etc.), and a related fact could be cost transactions. If I want the current state of a job, I would get the latest ds for the dim_job table, but then I wouldn't want to filter the costs by ds to get an aggregate (you would leave out older transactions). To me it makes total sense with the dims; I'm just trying to understand the why behind datestamping the facts so we can apply this method in the company!
Great question! I think there are a few reasons (quick sketch after the list):
1. Yes, partitioning
2. Getting logs once a day is easy/clear practice. And those filter down to fact tables
3. Your pipelines often wait on "did today's data land?" before running, right? If facts and dim tables both use datestamps, then you have an easy/standardized way to signal that.
4. Having half your data datestamped and half not would be a total mess.
Probably other reasons I'm not thinking of, honestly! But that's what I have straight off the dome
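For the construction example above, a hedged sketch of how that might look (dim_job and fct_cost_transactions as described in the question; the exact columns are assumptions): the ds filter pins the dim to its latest snapshot, while the fact is aggregated across all of its partitions.
------------------------------------------------------------
SELECT j.job_id,
       j.status,
       SUM(c.amount) AS total_cost
FROM dim_job j
LEFT JOIN fct_cost_transactions c
  ON c.job_id = j.job_id            -- no ds filter on the fact
WHERE j.ds = '2025-01-16'           -- latest dim snapshot only
GROUP BY j.job_id, j.status;
------------------------------------------------------------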
I appreciate you taking the time to clarify, thanks!
Love this!
Great article!
Great article. I think your approach is 100% compatible with the Kimball methodology. I would call the table fact_user_follower_daily_snapshot instead of a dim_users table. This is the preferred way to query historical data when storage is not an issue, and now it is not! The dim_user table would just be attributes of the user rather than their behavior, such as whether they are married or where they live. Now dim_user, too, can be snapshotted instead of using SCD Type 2. The dim_user attributes themselves can be flattened into fact_user_follower_daily_snapshot, since storage is cheap and having everything you need in one table makes end-user querying easier. The key is still to have a dimensional model with facts and dimensions, but you can stop clinging to SCD and primary keys as a necessary evil of storage limitations.
Thanks Suhail. One thing I want to point out -- maybe it was unclear in the toy example, but dim_user should be a pretty wide table. Not just followers, but tons of stuff should go in there. And sure, why not have some aggregate user behavior stuff in there too? (E.g. posts in the last 30 days, how many of the last 28 days did they use the app, etc.)
Denormalize! Duplicate! It's all good! Storage is cheap, but analysts making decisions based off an incorrect join is _expensive_
> And sure, why not have some aggregate user behavior stuff in there too? (E.g. posts in the last 30 days, how many of the last 28 days did they use the app, etc.)
I feel like with this approach we would be back to the backfill issue you described in the article: we have to have two code bases for backfill and daily, and the backfill depends on the past run, right? We would not be able to run, say, 10 runs simultaneously.
If it's a table of aggregate counts, then I would add a metric_type column and have a generic count column; then you have a flexible data model to store any type of count for the user. This too is one of Kimball's models, but in this case we are normalizing. As a defender of the faith, I just had to bring this up.
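A minimal sketch of that generic layout, with hypothetical table, metric, and column names:
------------------------------------------------------------
-- One row per user, metric, and day:
-- fct_user_metrics_daily(ds, user_id, metric_type, metric_count)
SELECT user_id,
       metric_count
FROM fct_user_metrics_daily
WHERE ds = '2025-01-16'
  AND metric_type = 'posts_l30d';   -- e.g. posts in the last 30 days
------------------------------------------------------------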
I was a bit confused by this one. Any chance there's a video on it?
How Meta Models Big Volume Event Data - Full 4 Hour Course - DataExpert.io Free Boot Camp Week 2
https://youtu.be/DQefW9sNmw0
Four hour course on it. There’s another video on one big table you could look at too.
"who now have 1000+ followers but had under 200 three months ago" - what exactly does it mean?
The user's number of followers is relatively high right now. But in the recent past, their number of followers was relatively low.
Does this help?
Great, thank you for the elaboration! Now it makes sense :)
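As a hedged sketch of the query being discussed (follower_count and the exact dates are assumptions), it is just a self-join of two snapshot partitions of dim_users:
------------------------------------------------------------
SELECT cur.user_id
FROM dim_users cur
JOIN dim_users past
  ON cur.user_id = past.user_id
WHERE cur.ds  = '2025-01-16'
  AND past.ds = '2024-10-16'        -- three months earlier
  AND cur.follower_count  >= 1000
  AND past.follower_count <  200;
------------------------------------------------------------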
Great article and well explained. Moving on from the Kimball method in the persistent layer (DW) and using something else is a good idea.
I hope you have looked at Data Vault, because the principle of only adding data (among a lot of other things) is there.
I love this article! I bookmarked it to my reading list, and I finally read the whole thing from top to bottom. The datestamp approach makes a lot of sense for simplifying historical queries.
However, I think there might be a bug in the retention query example??
Even though you did:
------------------------------------------------------------
LEFT JOIN dim_relationships past_rels
ON views.viewer_id = past_rels.user_id
AND views.poster_id = past_rels.friend_id
AND views.ds = past_rels.ds
AND past_rels.is_mutual = true
------------------------------------------------------------
the COUNT(DISTINCT views.post_id) counts all viewed posts, not just posts from mutuals.
Doesn't it make sense to create an additional CTE (shown below) and then LEFT JOIN it with user_device at the end? That way, it will count viewed posts from mutuals only.
------------------------------------------------------------
mutual_views AS (
SELECT DISTINCT
v.viewer_id,
v.post_id
FROM views v
INNER JOIN dim_relationships r -- INNER JOIN to filter
ON v.viewer_id = r.user_id
AND v.poster_id = r.friend_id
AND v.ds = r.ds
AND r.is_mutual = true
)
------------------------------------------------------------
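And the final join the comment describes might then look roughly like this, continuing from the mutual_views CTE above (the original retention query isn't shown here, so user_device's columns are assumptions):
------------------------------------------------------------
SELECT ud.user_id,
       COUNT(DISTINCT mv.post_id) AS mutual_posts_viewed
FROM user_device ud
LEFT JOIN mutual_views mv
  ON ud.user_id = mv.viewer_id
GROUP BY ud.user_id;
------------------------------------------------------------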
The proposal describes snapshotting the entire dimension every day (one new partition per ETL run).
So after 5 years, if you run daily ETL:
- You’d have roughly 1,825 partitions.
- Each partition contains the full dimension (say 10 million rows).
- That means you’re storing 10 million × 1,825 ≈ 18.25 billion rows total — even though maybe only a few thousand change per day.
And that's for a single table! And compute isn't cheap either.
Is this really a viable solution in large-scale environments?
Would love y'all's views on this!
Yep! You gotta sunset old partitions. Cold storage after some number of months worked for us. But I imagine you can do custom tables or columns for data you wanna keep.
https://blog.dataexpert.io/p/stop-using-slowly-changing-dimensions/comment/174310834?r=5xyah&utm_medium=ios
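As a rough sketch of that sunsetting step, assuming a Hive-style partitioned table (the exact DDL varies by engine):
------------------------------------------------------------
-- Drop (or archive elsewhere first) snapshot partitions older than
-- the retention window, e.g. anything more than ~3 months old
ALTER TABLE dim_users DROP IF EXISTS PARTITION (ds < '2024-10-16');
------------------------------------------------------------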
We have 420M unique rows in our users table, roughly 320GB in size for a daily snapshot. After a year, that adds up to ~117TB. Storage cost is cheap, but the data growth rate when adopting this strategy is no joke. Yes, storage tiering based on data age is one way to mitigate this, but that's also an overhead we need to manage.
Also, the whole point of adopting this strategy rather than the old-school SCD Type 2 approach is to gain query simplicity, backfill reliability, and pipeline maintainability. But if we move data to cold storage, it increases data retrieval time by an order of magnitude, resulting in high query latency. This also adds operational overhead.
Any thoughts on this trade-off?
One thing I forgot to make clear in the article is that --
1. If you partition your data, doing snapshots adds zero latency.
2. If you send partitions to cold storage after, say, 3 months, then your tables stay roughly the same size over time.