Discussion about this post

Sahar Massachi:

One thing I forgot to make clear in the article is that --

1. If you partition your data, doing snapshots adds 0 latency

2. If you send partitions to cold storage after, say, 3 months, then your tables stay roughly the same size over time.
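A minimal sketch of those two points, using sqlite3 with a `ds` column standing in for physical partitions (the `dim_users_snapshot` table and its columns are hypothetical): each day's snapshot overwrites only its own partition, so the write cost is independent of total table size, and expiring old partitions keeps the table bounded.

```python
import sqlite3

# Hypothetical snapshot table; in a real warehouse each ds would be
# a physical partition, here a plain column stands in for it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_users_snapshot (ds TEXT, user_id INTEGER, plan TEXT)")

def write_snapshot(ds, rows):
    """Idempotent daily snapshot: overwrite only the ds partition."""
    conn.execute("DELETE FROM dim_users_snapshot WHERE ds = ?", (ds,))
    conn.executemany(
        "INSERT INTO dim_users_snapshot VALUES (?, ?, ?)",
        [(ds, uid, plan) for uid, plan in rows],
    )

def expire_old_partitions(cutoff_ds):
    """Stand-in for shipping partitions to cold storage: drop ds older than cutoff."""
    conn.execute("DELETE FROM dim_users_snapshot WHERE ds < ?", (cutoff_ds,))

write_snapshot("2025-01-15", [(1, "free"), (2, "pro")])
write_snapshot("2025-01-16", [(1, "pro"), (2, "pro")])
expire_old_partitions("2025-01-16")
remaining = sorted(r[0] for r in conn.execute("SELECT DISTINCT ds FROM dim_users_snapshot"))
print(remaining)  # only the 2025-01-16 partition remains
```

Because `write_snapshot` only ever touches its own partition, rerunning it for the same `ds` is safe, which is the idempotency property the article relies on.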

Scott Smith:

Great article! One gap: handling deletes and cancellations

The datestamp approach elegantly handles updates and backfills, but there's a missing piece around data that gets deleted or cancelled after the fact. Here are the main patterns I've seen:

For Dimension Tables:

Soft delete flags in snapshots - Include deleted records in daily snapshots with an is_deleted or is_active flag. Queries filter on WHERE is_deleted = 0 AND ds = '2025-01-16'. Use a deleted_at timestamp (with microsecond precision) to track the exact deletion time. Preserves history while maintaining idempotency.
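A small sketch of the soft-delete pattern, using sqlite3 with hypothetical table and column names: the deleted row stays in later snapshots with its flag set, so history is preserved and the query-time filter does the hiding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_users (ds TEXT, user_id INTEGER, is_deleted INTEGER, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO dim_users VALUES (?, ?, ?, ?)",
    [
        ("2025-01-15", 1, 0, None),
        ("2025-01-15", 2, 0, None),
        # User 2 is deleted; the 2025-01-16 snapshot keeps the row,
        # flagged, with a microsecond-precision deleted_at timestamp.
        ("2025-01-16", 1, 0, None),
        ("2025-01-16", 2, 1, "2025-01-16 08:30:00.000123"),
    ],
)
active = conn.execute(
    "SELECT user_id FROM dim_users WHERE is_deleted = 0 AND ds = '2025-01-16'"
).fetchall()
print(active)  # user 2 is filtered out, but its history survives in the table
```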

For Fact Tables:

Reversal/negative records - Like accounting ledgers. Original transaction gets a +100 row, cancellation gets a -100 row with a later ds. Works great for SUM() aggregations but complicates count logic.
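A sketch of the ledger-style reversal pattern (the `fct_payments` table is hypothetical): the reversal row lands in a later ds with a negated amount, so SUM() needs no special-casing, while a naive COUNT() overstates activity, which is the complication noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_payments (ds TEXT, order_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO fct_payments VALUES (?, ?, ?)",
    [
        ("2025-01-15", 100, 100),   # original charge
        ("2025-01-15", 101, 250),   # another charge
        ("2025-01-16", 100, -100),  # reversal arrives in a later ds
    ],
)
# SUM() is correct with no special-casing: the +100 and -100 cancel.
total = conn.execute("SELECT SUM(amount) FROM fct_payments").fetchone()[0]
# ...but a naive row count sees 3 rows for 2 economic events.
naive_count = conn.execute("SELECT COUNT(*) FROM fct_payments").fetchone()[0]
print(total, naive_count)
```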

Status/flag columns - Each ds includes the status as-of that day (completed, cancelled, refunded). Queries filter WHERE status != 'cancelled'. Better for non-numeric data or complex state transitions.

Separate cancellation fact tables - Keep fct_orders immutable and add fct_order_cancellations. LEFT JOIN and filter out cancelled orders. Clean separation of concerns.
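The separate-table pattern can be sketched like this (column names beyond `fct_orders` and `fct_order_cancellations` are hypothetical): the orders fact table never mutates, and cancellations are excluded at query time with an anti-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_orders (order_id INTEGER, amount INTEGER)")
conn.execute("CREATE TABLE fct_order_cancellations (order_id INTEGER, cancelled_ds TEXT)")
conn.executemany("INSERT INTO fct_orders VALUES (?, ?)", [(1, 50), (2, 75), (3, 20)])
conn.execute("INSERT INTO fct_order_cancellations VALUES (2, '2025-01-16')")

# fct_orders stays immutable; LEFT JOIN + IS NULL filters out cancelled orders.
live = conn.execute(
    """
    SELECT o.order_id, o.amount
    FROM fct_orders o
    LEFT JOIN fct_order_cancellations c USING (order_id)
    WHERE c.order_id IS NULL
    """
).fetchall()
print(live)  # orders 1 and 3 remain; order 2 is cancelled
```

The anti-join keeps cancellation logic out of the write path entirely, which is what makes the separation of concerns clean.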
