<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[DataExpert.io Newsletter]]></title><description><![CDATA[A newsletter dedicated to talking about data engineering, AI, and data science trends]]></description><link>https://blog.dataexpert.io</link><image><url>https://substackcdn.com/image/fetch/$s_!2oBZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7babfd31-fa90-48c7-a46b-c155b3694ede_1280x1280.png</url><title>DataExpert.io Newsletter</title><link>https://blog.dataexpert.io</link></image><generator>Substack</generator><lastBuildDate>Sun, 31 May 2026 02:54:11 GMT</lastBuildDate><atom:link href="https://blog.dataexpert.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Zach Wilson]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[dataexpert@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[dataexpert@substack.com]]></itunes:email><itunes:name><![CDATA[Zach Wilson]]></itunes:name></itunes:owner><itunes:author><![CDATA[Zach Wilson]]></itunes:author><googleplay:owner><![CDATA[dataexpert@substack.com]]></googleplay:owner><googleplay:email><![CDATA[dataexpert@substack.com]]></googleplay:email><googleplay:author><![CDATA[Zach Wilson]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[A well-architected secretary is 76 agents in a trenchcoat]]></title><description><![CDATA[RAG and note takers won't save your business]]></description><link>https://blog.dataexpert.io/p/ai-architecture-is-misunderstood</link><guid 
isPermaLink="false">https://blog.dataexpert.io/p/ai-architecture-is-misunderstood</guid><dc:creator><![CDATA[Sahar Massachi]]></dc:creator><pubDate>Mon, 11 May 2026 16:02:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ned2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hundreds of startups (and maybe an internal team at your company) are trying, right now, to build and sell you on an army of AI scribes. That&#8217;s fine, and it&#8217;ll be useful. But what you&#8217;ll really need are <strong>competent, trusted, proactive secretaries. </strong>That&#8217;s <strong>much harder.</strong> But doable!</p><h4>This article is brought to you by:</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="http://www.eon.io/virtual-event/bigquery-day?utm_medium=third_party&amp;utm_source=inflluencer_zachwilson&amp;utm_campaign=26q2_tofu_virtual_event_bq_day_registration&amp;utm_content=newsletter" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S-7U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb070203a-2889-4809-84a7-f5d5aee9b09e_1280x641.png 424w, https://substackcdn.com/image/fetch/$s_!S-7U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb070203a-2889-4809-84a7-f5d5aee9b09e_1280x641.png 848w, https://substackcdn.com/image/fetch/$s_!S-7U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb070203a-2889-4809-84a7-f5d5aee9b09e_1280x641.png 1272w, 
https://substackcdn.com/image/fetch/$s_!S-7U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb070203a-2889-4809-84a7-f5d5aee9b09e_1280x641.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S-7U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb070203a-2889-4809-84a7-f5d5aee9b09e_1280x641.png" width="1280" height="641" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b070203a-2889-4809-84a7-f5d5aee9b09e_1280x641.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:641,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;http://www.eon.io/virtual-event/bigquery-day?utm_medium=third_party&amp;utm_source=inflluencer_zachwilson&amp;utm_campaign=26q2_tofu_virtual_event_bq_day_registration&amp;utm_content=newsletter&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!S-7U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb070203a-2889-4809-84a7-f5d5aee9b09e_1280x641.png 424w, https://substackcdn.com/image/fetch/$s_!S-7U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb070203a-2889-4809-84a7-f5d5aee9b09e_1280x641.png 848w, 
https://substackcdn.com/image/fetch/$s_!S-7U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb070203a-2889-4809-84a7-f5d5aee9b09e_1280x641.png 1272w, https://substackcdn.com/image/fetch/$s_!S-7U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb070203a-2889-4809-84a7-f5d5aee9b09e_1280x641.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Eon and Google Cloud are hosting <a 
href="http://www.eon.io/virtual-event/bigquery-day?utm_medium=third_party&amp;utm_source=inflluencer_zachwilson&amp;utm_campaign=26q2_tofu_virtual_event_bq_day_registration&amp;utm_content=newsletter">BigQuery Day</a>, a <a href="http://www.eon.io/virtual-event/bigquery-day?utm_medium=third_party&amp;utm_source=inflluencer_zachwilson&amp;utm_campaign=26q2_tofu_virtual_event_bq_day_registration&amp;utm_content=newsletter">free one-day virtual event</a> for teams running BigQuery in production. Sessions from Google&#8217;s VP of BigQuery Engineering, Lead Data Engineer at L.L.Bean, Google Cloud Developer Advocates, and more.</em></p><p></p><h3>AI has evolved into new UX patterns. This will continue.</h3><p>So, we have robots (LLMs, agents, whatever you want to call them) now. What can we use them for?</p><p>Well, robots have changed their shape over the last few years. We went through these affordances:</p><ul><li><p>GPT playground (autocomplete)</p></li><li><p>ChatGPT (chatbot)</p></li><li><p>custom (<a href="https://medium.com/@zilliz_learn/graphrag-explained-enhancing-rag-with-knowledge-graphs-3312065f99e1">RAG</a>) setups (understand <em>your</em> files specifically)</p></li><li><p>Claude Code (agents, dispatchers)</p></li><li><p>&#8220;Agent&#8221; workflows</p></li><li><p><a href="https://en.wikipedia.org/wiki/OpenClaw">OpenClaw</a> (autonomy + skills + texting interface)</p></li></ul><p>Each was useful! Businesses are being built on them.</p><p>So -- what next? And what can we learn?</p><p>Well, we&#8217;re about to see a new layer &#8211; the army of scribes. You&#8217;ll be tempted to use them directly. You&#8217;ll then be tempted to hook them up to your robot assistant. Fine. But the real action will be in giving that assistant enough context, power, and <em>team-aware-capabilities</em> so that it can actually be a useful clerk or secretary.</p><h4></h4><h3>The gold rush of AI scribes is coming</h3><p>Surely you also see the coming gold rush. 
Seemingly everyone is  working on versions of this vision (to greater and lesser degrees of insight). Via &#8220;<a href="https://www.notion.com/help/guides/category/ai">owning your docs</a>&#8221;, via &#8220;<a href="http://slack.com/features/ai">owning your communications</a>&#8221;. Maybe pitched as &#8220;<a href="https://www.cortex.io/">specially for developers</a>&#8221; or a <a href="https://www.evermuse.com/">sales-calls-to-product-insight loop</a>. Maybe as a <a href="https://www.youtube.com/shorts/ZpxJzGpv7mo">panicked pivot</a> from an entirely different product. Maybe <a href="https://openai.com/index/introducing-chatgpt-pulse/">over a specific dataset.</a> They&#8217;re all the same thing &#8211; scribes, taking notes, collating them, and giving them back to you.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2PJq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77addcbc-4947-45ae-b95e-9fb0ed54b966_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2PJq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77addcbc-4947-45ae-b95e-9fb0ed54b966_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!2PJq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77addcbc-4947-45ae-b95e-9fb0ed54b966_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!2PJq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77addcbc-4947-45ae-b95e-9fb0ed54b966_1920x1080.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2PJq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77addcbc-4947-45ae-b95e-9fb0ed54b966_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2PJq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77addcbc-4947-45ae-b95e-9fb0ed54b966_1920x1080.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77addcbc-4947-45ae-b95e-9fb0ed54b966_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:186354,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/193514667?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77addcbc-4947-45ae-b95e-9fb0ed54b966_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2PJq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77addcbc-4947-45ae-b95e-9fb0ed54b966_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!2PJq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77addcbc-4947-45ae-b95e-9fb0ed54b966_1920x1080.png 848w, 
https://substackcdn.com/image/fetch/$s_!2PJq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77addcbc-4947-45ae-b95e-9fb0ed54b966_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!2PJq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77addcbc-4947-45ae-b95e-9fb0ed54b966_1920x1080.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This will happen.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" 
target="_self">1</a></p><p>And it&#8217;ll be fine! It&#8217;ll be useful. But how often do you currently read meeting notes? How will it feel, really, to have action items more <em>efficiently</em> thrown your way? You&#8217;re still tied to the tyranny of staring at a screen and typing on a keyboard.</p><p>What&#8217;s <em>more </em>interesting is what will happen <em>next</em>.</p><h3>We can do better than delegating notes to robots</h3><p>Think about depictions of principals doing magnificent work across media.</p><p>A <a href="http://ted-lasso.fandom.com/">soccer coach</a> rallying his team. <a href="https://en.wikipedia.org/wiki/Star_Trek:_The_Original_Series_season_1">Space captains</a> conducting negotiations with aliens. <a href="https://en.wikipedia.org/wiki/Mad_Men">Ad executives</a> making deals in the 1960s. <a href="http://youtube.com/watch?v=podknqszmdy">A fantasy king leading his men into battle</a>. Business executives jockeying for power.</p><div class="pullquote"><p style="text-align: center;">How will it feel, really, to have action items more efficiently thrown your way? You&#8217;re still tied to the tyranny of staring at a screen and typing on a keyboard.</p></div><p>What do these people have in common? They have fun, creative jobs. They have meetings. They have big thoughts. They make decisions. They do not slave over a typewriter or Word document. They do not take meticulous notes. They have staff for that. (Notably, being the staff is generally a lot less pleasant.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>)</p><p>In the near future &#8211; that could be you. But if you want that life, you need staff to do the less fun work. Beyond just scribes taking notes. You need clerks. You need secretaries. And you need them to be as reliable and competent as those fantasy secretaries. 
But since they&#8217;re robots, you don&#8217;t actually need to worry about whether their jobs are rewarding.</p><p>But what do real-life secretaries actually do?</p><p>Well, in large part, you&#8217;re currently a secretary to yourself. Modern real-world workplaces spend a lot less money on logistical support than fantasy TV offices do. You do a lot of typing. You send memos. You route information between people on your team. You update documentation. You schedule and bargain and procure. You might communicate with vendors and outsiders or do other logistics. You type. You do it all in the background as part of supporting your &#8220;real&#8221; work. It may feel tactically rote, but it relies on a wealth of knowledge and trust and context that you couldn&#8217;t hand off cleanly &#8211; at least, until now.</p><div class="pullquote"><p>What do secretaries actually do? Well, in large part, you&#8217;re currently a secretary to yourself.</p></div><p>Now what does this look like in our near future?</p><h3>Mostly-reliable scribes are a commodity</h3><p>Your company will set up scribes. You will have an invisible system of robots doing certain menial tasks. Your conversations will be filed, your notes will be taken. The notes will be parsed into decisions. Jira tickets will be filed and closed. Executives will get daily summaries of what their teams are up to. 
Action items will be filed after each meeting.</p><p>This might even be getting built, in a half-baked way, on your team right now.</p><p>That&#8217;s the basic, obvious vision. Robots swarm around writing notes, reading notes, answering your questions about your notes. An army of scribes writing into a giant filing cabinet.</p><p>Importantly &#8211; you&#8217;ll need these scribes to be so good that you <em>don&#8217;t</em> spend a lot of time fixing their filing mistakes. Or, more interestingly, you&#8217;ll need some other entity to make sure they got it right.</p><p>(<em>&#8220;So far, so good! But weren&#8217;t we talking about secretaries?&#8221; </em>-- yes, shh. We&#8217;re going deeper).</p><blockquote><h4>SIDEBAR: What about OpenClaw?</h4><p>There&#8217;s an obvious reaction to all this: &#8220;hey, isn&#8217;t this just OpenClaw? OpenClaw/Hermes/IronClaw is already my personal assistant&#8221;. Well, kinda. Not really. 
The first prototypes of AI secretaries we see will likely be built on a foundation of *claws, but we need AI systems that have these core design principles:</p><ul><li><p>Built for teams (of humans and similar robots) from the ground up.</p></li><li><p>Coordinating across systems of code, people, and other AIs (invisibly, constantly).</p></li><li><p>Integrated with company systems (and trusted by IT not to leak or delete everything).</p></li><li><p>Likely multi-server and multi-model.</p></li><li><p>Opinionated on what happens invisibly/continuously and what needs user approval or visibility.</p></li></ul><p>And more (see below).</p><p>This will require a new and different type of architecture than looping agents on a server and connecting them to Telegram.</p></blockquote><h3>A scribe writes things down. A secretary proactively preps.</h3><p>But wait. A beautifully organized filing system doesn&#8217;t work well unless you can use it; at the same time, you don&#8217;t want to access it yourself. Otherwise you&#8217;re back in the drudgery of paperwork.</p><p>And you need to be able to give orders -- often implicitly phrased as &#8220;if X, then Y&#8221;. 
(Current tooling is pretty bad at this right now)</p><p>You need someone to proactively and reactively hand you the perfect manila folder each time you need it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9YB2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8481494a-cd0c-4f76-86e9-bf418352af52_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9YB2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8481494a-cd0c-4f76-86e9-bf418352af52_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!9YB2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8481494a-cd0c-4f76-86e9-bf418352af52_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!9YB2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8481494a-cd0c-4f76-86e9-bf418352af52_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!9YB2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8481494a-cd0c-4f76-86e9-bf418352af52_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9YB2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8481494a-cd0c-4f76-86e9-bf418352af52_1920x1080.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8481494a-cd0c-4f76-86e9-bf418352af52_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:244973,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/193514667?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8481494a-cd0c-4f76-86e9-bf418352af52_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9YB2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8481494a-cd0c-4f76-86e9-bf418352af52_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!9YB2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8481494a-cd0c-4f76-86e9-bf418352af52_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!9YB2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8481494a-cd0c-4f76-86e9-bf418352af52_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!9YB2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8481494a-cd0c-4f76-86e9-bf418352af52_1920x1080.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>And, suddenly, we&#8217;re not talking about a scribe at all. We&#8217;re discussing something  more interesting. The layer on top of that. The secretary. A clerk that talks to the invisible army of scribes on your behalf, sure, but so much more than that. It must continuously manage attention, context, information, and execution across time.</p><p>Let&#8217;s go further. 
You&#8217;re at <em>work</em>.</p><p>When you go to a meeting with someone else on your team, you want your secretary to know ahead of time, talk to <em>their</em> secretary, compare notes, write briefing docs for the meeting, hand them to you on a just-in-time basis (or maybe send you the pre-read in the morning and the prep doc 10 minutes ahead of the meeting) -- AND understand any decisions you make in the meeting and implement them.</p><p>When you work on an architecture plan, you want your clerk to interrupt you and tell you that team ABC tried something similar in the company two years ago.</p><p>The clerk has:</p><ul><li><p><strong>Simple retrieval: </strong>Pulled all the notes</p></li><li><p><strong>Social graph querying with sophisticated privacy / sensitivity features: </strong>Talked to the clerk of Albert, the lead architect of ABC, and got from them the sensitive learnings that <em>weren&#8217;t</em> in the official document.</p><ul><li><p>(&#8220;Joe is great at code but horrible at explaining things well &#8211; we should not have made him a project lead on this&#8221;, &#8220;direction X was really promising but we couldn&#8217;t pursue it due to office politics&#8221;, &#8220;I&#8217;ve subsequently learned about Y pattern that we should have used instead&#8221;)</p></li><li><p>Followed up with the clerks of Joe, Sally, and Martha, who were also on that project.</p></li></ul></li><li><p><strong>Negotiation with other clerks: </strong>Booked coffee with Albert, who will be in town next week (but does <em>not</em> show his free time on his calendar)</p></li><li><p><strong>Proactive pushing of info: </strong>Prepped Albert&#8217;s clerk with all <em>your</em> context</p></li><li><p><strong>Context-aware synthesis: </strong>Written you a report on what you can <em>actually</em> learn from project ABC</p></li></ul><p>That&#8217;s a vision. Maybe a compelling vision. 
It is to me.</p><div class="pullquote"><p>Suddenly, we&#8217;re not talking about a scribe at all. We&#8217;re discussing something  more interesting. The layer on top of that.  The secretary.</p></div><p>And I think we can build it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ned2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ned2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!Ned2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!Ned2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!Ned2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ned2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:311935,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/193514667?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ned2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!Ned2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!Ned2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!Ned2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6381cad-7e3d-4428-968f-5f7fd1bacf8e_1920x1080.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Can you trust your robot secretary?</h3><p>For all this to work, you need <em>reliability</em>, <em>trust</em>, and even <em>discretion</em>.</p><ul><li><p>You need to know that stuff won&#8217;t get dropped. Dropped is terrible. Forgotten about? Also terrible, but as long as it is filed <em>somewhere</em> you can recover.</p></li><li><p>You need to be able to fire your secretary. In this case, that means that if you stop using your particular AI system, porting to a new system needs to be incredibly easy.  The secretary&#8217;s notes need to play well with the filing/scribe system the company uses. 
It needs to be flexible enough to take a filing system as it exists and invisibly clean it up to its standard over time.</p></li><li><p>Some robots need to know about the existence of, say, an email or meeting (&#8220;the boss is meeting a lawyer to talk about his divorce&#8221;) but NEVER leak the content of it. And we need to trust that they never double-book, and might even offer polite fictions if necessary.</p></li></ul><blockquote><h4>SIDEBAR: Your secretary is many agents in a trenchcoat</h4><p>Note that I&#8217;m not using the word &#8220;agent&#8221; a lot here. That&#8217;s because your clerk, secretary, even your scribe &#8211; they&#8217;re not going to be <em>one</em> agent. Why should they be?</p><p>An agent is analogous to a <a href="https://en.wikipedia.org/wiki/Process_(computing)">process</a> on a machine. Processes can fork, or call other processes. They can die without taking each other down. A traditional &#8220;program&#8221; on your laptop has many processes running at once. But a SaaS app has so many different computers involved that talking about processes feels like you&#8217;re missing the point.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.dataexpert.io/p/ai-architecture-is-misunderstood?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.dataexpert.io/p/ai-architecture-is-misunderstood?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p>So too with anthropomorphized robot helpers. Maybe on one server there&#8217;s a dispatcher agent that calls research agents, runs async scripts, and negotiates with the company data layer. Maybe it coordinates with a dispatcher on another server that passes messages and information to other assistants. 
A third dispatcher on a third server might be the layer that talks to you (and also runs subagents for odd jobs rather than passing messages to the other two).</p><p>I&#8217;m not saying this is a good architecture, by the way. But it&#8217;s entirely feasible. And, to my point: who is the &#8220;agent&#8221; you&#8217;re talking to in this scenario? Even &#8220;<a href="https://ai.plainenglish.io/multi-ai-agent-architectures-and-patterns-a-complete-guide-to-learn-and-build-projects-4f1e9a0367e1">Multi-agent system</a>&#8221; is too low-level. We don&#8217;t quite have the right words for the emerging thing we&#8217;re describing. Agent OS? AI persona?</p></blockquote><p>I have some thoughts on how to build it. I&#8217;ve even started. The keys, so far, seem to be:</p><ul><li><p>Start from the assumption that your secretary is working on a team</p></li><li><p>Set up the right split between deterministic and LLM control flow</p></li><li><p>Use a really rich schema for how secretaries communicate in structured ways</p></li></ul><p>Some primitives that I&#8217;ve found useful:</p><ul><li><p>Dispatcher/subagent structure</p></li><li><p>Code. 
Gatekeepers, hooks, a message queue, and injection of prompts to both nudge and <em>enforce</em> behavior deterministically.</p></li><li><p>Rich schema for discretion, communication, decisions, and memories.</p></li><li><p>Secure enclaves of local LLMs that monitor the frontier model &#8220;kernel&#8221; and gatekeep its communication at the right level of discretion.</p></li><li><p>Federated CRMs and knowledge graphs within a team</p></li><li><p>An &#8220;If this, then that&#8221; data structure that points to the right memory when agents match a certain behavior (solving the &#8220;remember to look up a memory telling you to remember something&#8221; problem).</p></li><li><p>Logs and health check monitors everywhere</p></li></ul><p>But the crux, maybe, is this. The world of OpenClaw clones is running into two bottlenecks: security/privacy/reliability, and the need for an actual killer app. Those are related. Imagine a world where everyone has some sort of robot butler -- but their *claws talk to <em>each other</em> well. Invisibly, carefully.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.dataexpert.io/p/ai-architecture-is-misunderstood/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.dataexpert.io/p/ai-architecture-is-misunderstood/comments"><span>Leave a comment</span></a></p><blockquote><h4>SIDEBAR: Clerks and Secretaries and CoS</h4><p>Why &#8220;secretary&#8221;? Why not just &#8220;clerk&#8221;? Well, the term clerk doesn&#8217;t quite evoke the level of familiarity and trust that I&#8217;m going for. Does a clerk know about your sensitive meeting even when they don&#8217;t put it on the calendar? Clerks are closer to paper pushers than autonomous agents.</p><p>Another term you could use is &#8220;chief of staff&#8221;. 
But that has its own limitations &#8211; mostly that the term means very different things in different places. The CoS in a government agency basically runs it. A CoS in tech has extremely different responsibilities than one in nonprofits. And so on.</p><p>So, when using &#8220;secretary&#8221;, I&#8217;m trying to evoke the level of trust, competence, autonomy and brilliance that we see in a Joan or Peggy. Without anyone involved being as poor a boss or bad a person as a Don Draper.</p></blockquote><h3>Against communication explosion</h3><p>That&#8217;s the future I see. We shouldn&#8217;t just throw our robots into Slack<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> and call it a day. You don&#8217;t teach students by throwing them all into a gym and having them run around yelling at each other. You don&#8217;t run a company by having the CEO spy on every note the workers send each other.</p><p>We will have our AIs morph <em>between</em> <a href="https://www.geoffreylitt.com/2025/07/27/enough-ai-copilots-we-need-ai-huds">copilot and HUD</a>. They&#8217;ll share information in a smart way. They&#8217;ll know they&#8217;re on a team and work accordingly.</p><p>So don&#8217;t obsess over the coming scribes. That&#8217;s tomorrow&#8217;s tech. Build assuming that it exists. Focus on using it in smart ways rather than staying stuck as your own secretary.</p><p><em>Make sure to follow <a href="http://sahar.substack.com">Sahar&#8217;s blog</a> to understand <a href="http://sahar.substack.com">growth and what comes next</a>, both personally and for your business! Sahar is currently <a href="http://sahar.io/build-with-me">pursuing full-time product and engineering leadership</a> roles in New York City.</em></p><p><em>DataExpert is now open for more sponsorship. Like this post? Want people to know about how great your product is? 
Please email us at <a href="mailto:mitali@dataexpert.io">mitali@dataexpert.io</a> to start a conversation.</em></p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://blog.dataexpert.io/p/ai-architecture-is-misunderstood?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading DataExpert! This post is public (for now), so please share it, or dive into the comments to tell us what I missed, what you think, or share your favorite recipes.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.dataexpert.io/p/ai-architecture-is-misunderstood?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.dataexpert.io/p/ai-architecture-is-misunderstood?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.dataexpert.io/p/ai-architecture-is-misunderstood/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.dataexpert.io/p/ai-architecture-is-misunderstood/comments"><span>Leave a comment</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The interesting part of it is <em>how. </em>Which framing will resonate with your company? Which strategy for integrating it into your company will actually succeed? (I think the jargon for this is &#8220;change management&#8221;. 
)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Some of these very shows -- Mad Men and Ted Lasso, for example -- put a lot of narrative weight into underscoring the importance and brilliance of these support staff and what happens when the relationship gets sour.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>For one thing -- slacks are terrible at distinguishing between urgent and important. Don&#8217;t use Slack!</p></div></div>]]></content:encoded></item><item><title><![CDATA[The New Rules of Behavioral Interviews after AI]]></title><description><![CDATA[Our behavior needs to update faster than our tech]]></description><link>https://blog.dataexpert.io/p/how-ai-is-changing-behavioral-interviews</link><guid isPermaLink="false">https://blog.dataexpert.io/p/how-ai-is-changing-behavioral-interviews</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Tue, 28 Apr 2026 17:49:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ApaD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397b73bb-eab6-4c9e-a4a0-2a0c98b8ef7f_3786x1120.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When Meta dropped the Leetcode round in their interview process, people were stunned. Tools like Interview Coder have made the data structures round look like a piece of cake. </p><p>While Leetcode has historically been important for filtering candidates out, it has zero impact on the level candidates are hired at in Big Tech. The behavioral round and leadersh&#8230;</p>
      <p>
          <a href="https://blog.dataexpert.io/p/how-ai-is-changing-behavioral-interviews">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How do One Big Table and AI fit together]]></title><description><![CDATA[Dimensional Modeling is dead]]></description><link>https://blog.dataexpert.io/p/how-to-data-model-for-your-ai-context</link><guid isPermaLink="false">https://blog.dataexpert.io/p/how-to-data-model-for-your-ai-context</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Mon, 06 Apr 2026 20:54:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CqJ_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cf5f8e8-0ad9-410a-9466-705b0b123612_2204x1240.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data modeling is one of the few skills left for data engineers. <strong>Context engineering is starting to dominate.</strong> <br><br>A proper data model for AI is the difference between:</p><ul><li><p>Consistent correct answers and hallucinations</p></li><li><p>Low-latency answers and waiting forever</p></li><li><p>Maxing out your token limit in a day and being able to use a cheaper model</p><p></p></li></ul><p>In this article, we will go over ho&#8230;</p>
      <p>
          <a href="https://blog.dataexpert.io/p/how-to-data-model-for-your-ai-context">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Understanding Parquet Format for beginners]]></title><description><![CDATA[A walk through of the most important file format to ever exist]]></description><link>https://blog.dataexpert.io/p/parquet-can-shrink-your-data-100x</link><guid isPermaLink="false">https://blog.dataexpert.io/p/parquet-can-shrink-your-data-100x</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Tue, 03 Mar 2026 21:52:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!y8MP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F483d4048-b4d9-467c-90ec-89a1906036b6_2076x1170.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There&#8217;s been a lot of talk about open table formats like Iceberg and Delta over the last few years. While these formats are awesome, many of the underlying efficiency and performance gains can be attributed to Parquet, with Iceberg/Delta serving as a nice management layer on top. <br><br>Think of Iceberg/Delta as the middle manager who gets all the credit, whil&#8230;</p>
      <p>
          <a href="https://blog.dataexpert.io/p/parquet-can-shrink-your-data-100x">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Databricks is no longer about tuning knobs ]]></title><description><![CDATA[Databricks abstracts away almost all of the data engineering skills. Liquid clustering is the first place where things will get messy!]]></description><link>https://blog.dataexpert.io/p/databricks-is-for-data-analysts-not</link><guid isPermaLink="false">https://blog.dataexpert.io/p/databricks-is-for-data-analysts-not</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Tue, 24 Feb 2026 01:03:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/032129fc-5b3f-443c-8b8b-0ab902bf8e6a_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For years, Databricks positioned itself as the true home of serious data engineers. They offer Spark jobs, distributed systems, lakehouse architecture, the works. But that&#8217;s old Databricks. </p><p>But if you zoom out and look at the product direction over the last few years, a much different pattern emerges.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.dataexpert.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">DataExpert.io Newsletter is a reader-supported public&#8230;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>
      <p>
          <a href="https://blog.dataexpert.io/p/databricks-is-for-data-analysts-not">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The 2026 AI Data Engineer Roadmap]]></title><description><![CDATA[And how to avoid getting replaced]]></description><link>https://blog.dataexpert.io/p/the-2026-ai-data-engineer-roadmap</link><guid isPermaLink="false">https://blog.dataexpert.io/p/the-2026-ai-data-engineer-roadmap</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Thu, 05 Feb 2026 20:26:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!n6z2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e613a49-7270-49fc-98fd-7834a05a44a0_1890x2363.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI has made <strong>manually writing complex data pipelines mostly obsolete</strong>.</p><p>If AI can generate pipelines, DAGs, tests, and even migrations&#8230;<br>What&#8217;s left for data engineers to actually work on?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.dataexpert.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">DataExpert.io Newsletter is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><strong>Conceptual knowledge is no &#8230;</strong></p>
      <p>
          <a href="https://blog.dataexpert.io/p/the-2026-ai-data-engineer-roadmap">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Processing 1 TB with DuckDB in less than 30 seconds]]></title><description><![CDATA[And so can you]]></description><link>https://blog.dataexpert.io/p/i-processed-1-tb-with-duckdb-in-30</link><guid isPermaLink="false">https://blog.dataexpert.io/p/i-processed-1-tb-with-duckdb-in-30</guid><dc:creator><![CDATA[Matt Martin]]></dc:creator><pubDate>Tue, 23 Dec 2025 19:58:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Q-kL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Get ready to toss out all the norms and conventional wisdom about distributed compute! Today, we are eradicating the belief that DuckDB can only be used for &#8220;small&#8221; data. </p><p>In this article, we will attack the following beliefs:</p><ul><li><p>Only Spark can be used for terabytes of data (or it is ALWAYS the best choice)</p></li><li><p>You need a lot of time to process TBs of data</p></li></ul><p>We want to leave your head spinning at the end of this article. Wondering if everything you learned about MapReduce was wrong! </p><h3>This Article is brought to you by</h3><p>We want to give a shout-out to <a href="https://motherduck.com/">MotherDuck</a>, who is sponsoring this article and providing the infrastructure for the benchmarks! 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.motherduck.com?utm_source=dataexpert" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Sq8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56993f49-aab0-4570-9c60-4dcd4c4ca210_1426x486.png 424w, https://substackcdn.com/image/fetch/$s_!6Sq8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56993f49-aab0-4570-9c60-4dcd4c4ca210_1426x486.png 848w, https://substackcdn.com/image/fetch/$s_!6Sq8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56993f49-aab0-4570-9c60-4dcd4c4ca210_1426x486.png 1272w, https://substackcdn.com/image/fetch/$s_!6Sq8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56993f49-aab0-4570-9c60-4dcd4c4ca210_1426x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Sq8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56993f49-aab0-4570-9c60-4dcd4c4ca210_1426x486.png" width="1426" height="486" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56993f49-aab0-4570-9c60-4dcd4c4ca210_1426x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:486,&quot;width&quot;:1426,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67935,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.motherduck.com?utm_source=dataexpert&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/181474453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56993f49-aab0-4570-9c60-4dcd4c4ca210_1426x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!6Sq8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56993f49-aab0-4570-9c60-4dcd4c4ca210_1426x486.png 424w, https://substackcdn.com/image/fetch/$s_!6Sq8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56993f49-aab0-4570-9c60-4dcd4c4ca210_1426x486.png 848w, https://substackcdn.com/image/fetch/$s_!6Sq8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56993f49-aab0-4570-9c60-4dcd4c4ca210_1426x486.png 1272w, https://substackcdn.com/image/fetch/$s_!6Sq8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56993f49-aab0-4570-9c60-4dcd4c4ca210_1426x486.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>You Said to Use DuckDB On Small Data</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m2sU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f6b100-c763-41cf-b793-6768aa65f471_588x499.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m2sU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f6b100-c763-41cf-b793-6768aa65f471_588x499.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!m2sU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f6b100-c763-41cf-b793-6768aa65f471_588x499.jpeg 848w, https://substackcdn.com/image/fetch/$s_!m2sU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f6b100-c763-41cf-b793-6768aa65f471_588x499.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!m2sU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f6b100-c763-41cf-b793-6768aa65f471_588x499.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m2sU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f6b100-c763-41cf-b793-6768aa65f471_588x499.jpeg" width="588" height="499" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5f6b100-c763-41cf-b793-6768aa65f471_588x499.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:499,&quot;width&quot;:588,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!m2sU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f6b100-c763-41cf-b793-6768aa65f471_588x499.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!m2sU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f6b100-c763-41cf-b793-6768aa65f471_588x499.jpeg 848w, https://substackcdn.com/image/fetch/$s_!m2sU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f6b100-c763-41cf-b793-6768aa65f471_588x499.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!m2sU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f6b100-c763-41cf-b793-6768aa65f471_588x499.jpeg 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" 
y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Previously, I was a champion of using DuckDB for any dataset that was &#8220;small&#8221; (&lt; 20 GB). Recently, I was challenged on that remark on LinkedIn by some astute data engineers, who said I had a misconception about what DuckDB was capable of. Being the curious data engineer I am, I took the bait and decided to roll up my sleeves and benchmark much larger datasets. <br><br><em>But how much larger?</em></p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.dataexpert.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">DataExpert.io Newsletter is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><em>Also, if you want to subscribe to Matt&#8217;s Substack, you can click <a href="https://performancede.substack.com">here</a>.</em></p><div><hr></div><p>I first decided to go after ~200 GB. <strong>DuckDB read that data in &lt;10 seconds.</strong></p><p>This was too fast. It felt magical. What about 500 GB? Then I hit a wall: a physical wall. The hard drive on my Mac M2 didn&#8217;t have enough space for 500 GB. I strolled to my local Best Buy and picked up this thing: </p><blockquote><p><strong>Side Note - </strong>A 4 TB external hard drive might seem like overkill; this was one of those &#8220;Go big or go home&#8221; moments. 
I figured, &#8220;Well, if 500 GB works, I want to have enough runway for much larger tests down the road.&#8221;</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJjS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc655816b-c8fe-4782-9345-e4cf55346944_4284x2987.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJjS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc655816b-c8fe-4782-9345-e4cf55346944_4284x2987.jpeg 424w, https://substackcdn.com/image/fetch/$s_!qJjS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc655816b-c8fe-4782-9345-e4cf55346944_4284x2987.jpeg 848w, https://substackcdn.com/image/fetch/$s_!qJjS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc655816b-c8fe-4782-9345-e4cf55346944_4284x2987.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!qJjS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc655816b-c8fe-4782-9345-e4cf55346944_4284x2987.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qJjS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc655816b-c8fe-4782-9345-e4cf55346944_4284x2987.jpeg" width="4284" height="2987" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c655816b-c8fe-4782-9345-e4cf55346944_4284x2987.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2987,&quot;width&quot;:4284,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2499347,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/181474453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36087144-e477-4713-a428-eb0ab054f394_4284x5712.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJjS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc655816b-c8fe-4782-9345-e4cf55346944_4284x2987.jpeg 424w, https://substackcdn.com/image/fetch/$s_!qJjS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc655816b-c8fe-4782-9345-e4cf55346944_4284x2987.jpeg 848w, https://substackcdn.com/image/fetch/$s_!qJjS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc655816b-c8fe-4782-9345-e4cf55346944_4284x2987.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!qJjS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc655816b-c8fe-4782-9345-e4cf55346944_4284x2987.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I created a 500GB dataset on the external drive using DuckDB. <strong>It read that data in ~40 seconds.</strong></p><p>This made me realize I needed to set my sights on the big kahuna: <strong>1 full TB of data!</strong></p><h2>Building A 1 TB Dataset for DuckDB</h2><p>In my previous articles, you will see that I use a script that leverages DuckDB&#8217;s&nbsp;<strong>generate_series</strong>&nbsp;function to generate rows of data quickly. 
The gist of it looks like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2cfv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5f05a6-eeb4-4059-9a05-929e7c44f448_1830x1444.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2cfv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5f05a6-eeb4-4059-9a05-929e7c44f448_1830x1444.heic 424w, https://substackcdn.com/image/fetch/$s_!2cfv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5f05a6-eeb4-4059-9a05-929e7c44f448_1830x1444.heic 848w, https://substackcdn.com/image/fetch/$s_!2cfv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5f05a6-eeb4-4059-9a05-929e7c44f448_1830x1444.heic 1272w, https://substackcdn.com/image/fetch/$s_!2cfv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5f05a6-eeb4-4059-9a05-929e7c44f448_1830x1444.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2cfv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5f05a6-eeb4-4059-9a05-929e7c44f448_1830x1444.heic" width="1456" height="1149" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe5f05a6-eeb4-4059-9a05-929e7c44f448_1830x1444.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1149,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148167,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/181474453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5f05a6-eeb4-4059-9a05-929e7c44f448_1830x1444.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2cfv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5f05a6-eeb4-4059-9a05-929e7c44f448_1830x1444.heic 424w, https://substackcdn.com/image/fetch/$s_!2cfv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5f05a6-eeb4-4059-9a05-929e7c44f448_1830x1444.heic 848w, https://substackcdn.com/image/fetch/$s_!2cfv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5f05a6-eeb4-4059-9a05-929e7c44f448_1830x1444.heic 1272w, https://substackcdn.com/image/fetch/$s_!2cfv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5f05a6-eeb4-4059-9a05-929e7c44f448_1830x1444.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It&#8217;s straightforward:</p><ul><li><p>You pass in a row count and let it generate a Parquet file. </p></li><li><p>But what if you want to do this at scale without waiting several hours for a dreaded serial run?</p></li><li><p>Bring in the good ol&#8217; Python ProcessPoolExecutor and go parallel. <a href="https://github.com/mattmartin14/dream_machine/blob/main/substack/articles/2025.11.18-duckdb_1_tb/local_gen_data.py">(code here)</a> </p></li></ul><h2>Ok, But Did You Really Generate A Full TB Of Data?</h2><p>Yes. It took my M2 Pro (16GB of RAM) ~70 minutes to fry this egg with 10 workers in parallel. 
Here&#8217;s the proof:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iTEW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ff90a6-2375-44f4-be1f-6305d38421cc_579x126.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iTEW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ff90a6-2375-44f4-be1f-6305d38421cc_579x126.png 424w, https://substackcdn.com/image/fetch/$s_!iTEW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ff90a6-2375-44f4-be1f-6305d38421cc_579x126.png 848w, https://substackcdn.com/image/fetch/$s_!iTEW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ff90a6-2375-44f4-be1f-6305d38421cc_579x126.png 1272w, https://substackcdn.com/image/fetch/$s_!iTEW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ff90a6-2375-44f4-be1f-6305d38421cc_579x126.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iTEW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ff90a6-2375-44f4-be1f-6305d38421cc_579x126.png" width="579" height="126" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42ff90a6-2375-44f4-be1f-6305d38421cc_579x126.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:126,&quot;width&quot;:579,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13098,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/181474453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ff90a6-2375-44f4-be1f-6305d38421cc_579x126.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!iTEW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ff90a6-2375-44f4-be1f-6305d38421cc_579x126.png 424w, https://substackcdn.com/image/fetch/$s_!iTEW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ff90a6-2375-44f4-be1f-6305d38421cc_579x126.png 848w, https://substackcdn.com/image/fetch/$s_!iTEW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ff90a6-2375-44f4-be1f-6305d38421cc_579x126.png 1272w, https://substackcdn.com/image/fetch/$s_!iTEW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ff90a6-2375-44f4-be1f-6305d38421cc_579x126.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The dataset is: <strong>400 files, each  ~2.76GB in size.</strong> </p><p>So now, without further ado, let&#8217;s get cracking and run some benchmarks on 
this.</p><blockquote><p><strong>Side Note - </strong>If you have not moved from the old Python virtual environment workflow of &#8220;python -m venv&#8221; over to uv, do yourself a favor and do it now; uv installs packages faster and makes targeting specific Python environments easier. I could go on&#8230;you&#8217;ll thank me later.</p></blockquote><h2>The Benchmark&#8230;And What Exactly Are You Doing Here?</h2><p>Today&#8217;s benchmark will:</p><ul><li><p>run a common aggregation query across the 1TB dataset; </p></li><li><p>group by a date, count rows, and sum a value. </p></li></ul><p>This is a common analytics query pattern I have seen throughout my last two decades as a data engineer and BI leader; I did not cherry-pick it just to make DuckDB look good. This is what the benchmark query boils down to:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dZD_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88fafec0-2882-42de-9829-83cebb4f218e_525x430.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dZD_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88fafec0-2882-42de-9829-83cebb4f218e_525x430.png 424w, https://substackcdn.com/image/fetch/$s_!dZD_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88fafec0-2882-42de-9829-83cebb4f218e_525x430.png 848w, https://substackcdn.com/image/fetch/$s_!dZD_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88fafec0-2882-42de-9829-83cebb4f218e_525x430.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dZD_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88fafec0-2882-42de-9829-83cebb4f218e_525x430.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dZD_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88fafec0-2882-42de-9829-83cebb4f218e_525x430.png" width="525" height="430" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88fafec0-2882-42de-9829-83cebb4f218e_525x430.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:525,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52209,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/181474453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88fafec0-2882-42de-9829-83cebb4f218e_525x430.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!dZD_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88fafec0-2882-42de-9829-83cebb4f218e_525x430.png 424w, https://substackcdn.com/image/fetch/$s_!dZD_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88fafec0-2882-42de-9829-83cebb4f218e_525x430.png 848w, 
https://substackcdn.com/image/fetch/$s_!dZD_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88fafec0-2882-42de-9829-83cebb4f218e_525x430.png 1272w, https://substackcdn.com/image/fetch/$s_!dZD_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88fafec0-2882-42de-9829-83cebb4f218e_525x430.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For our benchmark, we will run it 5 times. 
Below are the results:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3iB6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7e03376-bf17-4a50-9faa-abc2eb15fd2f_682x389.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3iB6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7e03376-bf17-4a50-9faa-abc2eb15fd2f_682x389.png 424w, https://substackcdn.com/image/fetch/$s_!3iB6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7e03376-bf17-4a50-9faa-abc2eb15fd2f_682x389.png 848w, https://substackcdn.com/image/fetch/$s_!3iB6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7e03376-bf17-4a50-9faa-abc2eb15fd2f_682x389.png 1272w, https://substackcdn.com/image/fetch/$s_!3iB6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7e03376-bf17-4a50-9faa-abc2eb15fd2f_682x389.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3iB6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7e03376-bf17-4a50-9faa-abc2eb15fd2f_682x389.png" width="682" height="389" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7e03376-bf17-4a50-9faa-abc2eb15fd2f_682x389.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:389,&quot;width&quot;:682,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42096,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/181474453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7e03376-bf17-4a50-9faa-abc2eb15fd2f_682x389.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!3iB6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7e03376-bf17-4a50-9faa-abc2eb15fd2f_682x389.png 424w, https://substackcdn.com/image/fetch/$s_!3iB6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7e03376-bf17-4a50-9faa-abc2eb15fd2f_682x389.png 848w, https://substackcdn.com/image/fetch/$s_!3iB6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7e03376-bf17-4a50-9faa-abc2eb15fd2f_682x389.png 1272w, https://substackcdn.com/image/fetch/$s_!3iB6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7e03376-bf17-4a50-9faa-abc2eb15fd2f_682x389.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Average processing time locally: <strong>1 minute, 29 seconds.</strong> </p><h2>Hold On - You Said We Could Crush A Full TB In Under 30 Seconds!?</h2><p>I did say that. This first benchmark ran locally on my laptop, which is impressive in its own right. </p><p>What would happen if I were to create a full 1TB dataset in MotherDuck and try this again?</p><h2>Time To Join The Flock</h2><p>On MotherDuck, we have excellent options to load data. We could:</p><ul><li><p>Store CSV and Parquet files in S3/Azure/GCP</p></li><li><p>Import the <a href="https://duckdb.org/docs/stable/core_extensions/tpch">TPC-H dataset</a></p></li><li><p>Use our local CLI to generate data</p></li></ul><p>For this article, I chose the third option. 
<a href="https://github.com/mattmartin14/dream_machine/blob/main/substack/articles/2025.11.18-duckdb_1_tb/md_gen_data.py">Here is the script</a> that created the 1TB dataset in MotherDuck.</p><p>I created a view in a MotherDuck database that leveraged the <strong>generate_series</strong> function (like in the previous local benchmark). After that, I ran the script, which loops and inserts the generated data multiple times. </p><p>After 10 iterations, I saw it wasn&#8217;t quite at 1TB, so I manually ran the load process several more times until it reached 1TB. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NyOR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49609283-1ed8-4a86-870a-355f155d377d_895x311.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NyOR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49609283-1ed8-4a86-870a-355f155d377d_895x311.png 424w, https://substackcdn.com/image/fetch/$s_!NyOR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49609283-1ed8-4a86-870a-355f155d377d_895x311.png 848w, https://substackcdn.com/image/fetch/$s_!NyOR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49609283-1ed8-4a86-870a-355f155d377d_895x311.png 1272w, https://substackcdn.com/image/fetch/$s_!NyOR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49609283-1ed8-4a86-870a-355f155d377d_895x311.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NyOR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49609283-1ed8-4a86-870a-355f155d377d_895x311.png" width="895" height="311" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49609283-1ed8-4a86-870a-355f155d377d_895x311.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:311,&quot;width&quot;:895,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56283,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/181474453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49609283-1ed8-4a86-870a-355f155d377d_895x311.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!NyOR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49609283-1ed8-4a86-870a-355f155d377d_895x311.png 424w, https://substackcdn.com/image/fetch/$s_!NyOR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49609283-1ed8-4a86-870a-355f155d377d_895x311.png 848w, https://substackcdn.com/image/fetch/$s_!NyOR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49609283-1ed8-4a86-870a-355f155d377d_895x311.png 1272w, https://substackcdn.com/image/fetch/$s_!NyOR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49609283-1ed8-4a86-870a-355f155d377d_895x311.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">To check whether you have reached a full TB of data in MotherDuck, you can run this query against your data&#8217;s metadata.</figcaption></figure></div><p>We now have over 1TB of data. Now it&#8217;s time to choose our compute capacity. </p><p>In MotherDuck, we have four standard options for compute capacity: Pulse, Standard, Jumbo, and Mega. 
I went with Mega, given we are dealing with a full TB of data.</p><p>For the benchmark, we ran the exact same query we did locally, but with a caveat:</p><ul><li><p>MotherDuck has intelligent caching; if we run the same query five times, runs 2-5 will finish in about 5 seconds or less because they read from cache instead of actually scanning the data.</p></li></ul><p>So how do we get around that? </p><ul><li><p>Simple - for each iteration, we will have our aggregation sum values over a different lower and upper range. That makes each iteration&#8217;s query unique, so it can&#8217;t just be served from the cache layer. The query template looks like this:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ET_y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d0c34a8-7df9-4828-b41f-1eb1e6f0834a_1830x724.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ET_y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d0c34a8-7df9-4828-b41f-1eb1e6f0834a_1830x724.heic 424w, https://substackcdn.com/image/fetch/$s_!ET_y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d0c34a8-7df9-4828-b41f-1eb1e6f0834a_1830x724.heic 848w, https://substackcdn.com/image/fetch/$s_!ET_y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d0c34a8-7df9-4828-b41f-1eb1e6f0834a_1830x724.heic 1272w, https://substackcdn.com/image/fetch/$s_!ET_y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d0c34a8-7df9-4828-b41f-1eb1e6f0834a_1830x724.heic 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ET_y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d0c34a8-7df9-4828-b41f-1eb1e6f0834a_1830x724.heic" width="1456" height="576" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d0c34a8-7df9-4828-b41f-1eb1e6f0834a_1830x724.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:576,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51942,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/181474453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d0c34a8-7df9-4828-b41f-1eb1e6f0834a_1830x724.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ET_y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d0c34a8-7df9-4828-b41f-1eb1e6f0834a_1830x724.heic 424w, https://substackcdn.com/image/fetch/$s_!ET_y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d0c34a8-7df9-4828-b41f-1eb1e6f0834a_1830x724.heic 848w, https://substackcdn.com/image/fetch/$s_!ET_y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d0c34a8-7df9-4828-b41f-1eb1e6f0834a_1830x724.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!ET_y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d0c34a8-7df9-4828-b41f-1eb1e6f0834a_1830x724.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Below are the results of our benchmark:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Q-kL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q-kL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png 424w, https://substackcdn.com/image/fetch/$s_!Q-kL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png 848w, https://substackcdn.com/image/fetch/$s_!Q-kL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png 1272w, https://substackcdn.com/image/fetch/$s_!Q-kL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q-kL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png" width="679" height="433" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:433,&quot;width&quot;:679,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53822,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/181474453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Q-kL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png 424w, https://substackcdn.com/image/fetch/$s_!Q-kL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png 848w, https://substackcdn.com/image/fetch/$s_!Q-kL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png 1272w, https://substackcdn.com/image/fetch/$s_!Q-kL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a8b3d5-f6e0-40f3-9304-8cff1dd83307_679x433.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Holy Smokes</strong> - </p><ul><li><p>That clocked in at under 17 seconds on average. You might also ask - &#8220;Hey, what is that run 0, a.k.a. the cold start?&#8221; </p><ul><li><p>Remember, we are now dealing with a cloud data warehouse, which will not always keep our data readily available to rip from RAM; sometimes it will be on disk and have to be read in. Iteration 0 is therefore a warm-up run, in case the data has to be read off disk on first use. For datasets you query often in MotherDuck, this won&#8217;t be an issue, as your data will more than likely already be sitting in hot RAM, ready to process.</p></li></ul></li></ul><h2>Ok, Great Stuff! We Are Done, Right?</h2><p>We blew our promise of scanning a full TB in under 30 seconds out of the water, by nearly a factor of 2. 
But what if we wanted to go faster?</p><h2>I&#8217;m Ready - Let&#8217;s Go Deeper</h2><p>DuckDB supports <a href="https://duckdb.org/docs/stable/guides/performance/indexing">indexes</a>, but they don&#8217;t really push the concept much. The zonemap (min/max) index is a secret weapon: DuckDB tracks the minimum and maximum value of each column per row group in its metadata, so when the data is pre-sorted, a filter can skip any row group whose range cannot contain matching rows. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8z9M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16315362-24e9-4592-9d25-5629512246b9_1371x580.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8z9M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16315362-24e9-4592-9d25-5629512246b9_1371x580.png 424w, https://substackcdn.com/image/fetch/$s_!8z9M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16315362-24e9-4592-9d25-5629512246b9_1371x580.png 848w, https://substackcdn.com/image/fetch/$s_!8z9M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16315362-24e9-4592-9d25-5629512246b9_1371x580.png 1272w, https://substackcdn.com/image/fetch/$s_!8z9M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16315362-24e9-4592-9d25-5629512246b9_1371x580.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8z9M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16315362-24e9-4592-9d25-5629512246b9_1371x580.png" width="1371" height="580" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16315362-24e9-4592-9d25-5629512246b9_1371x580.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:580,&quot;width&quot;:1371,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138717,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/181474453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16315362-24e9-4592-9d25-5629512246b9_1371x580.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8z9M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16315362-24e9-4592-9d25-5629512246b9_1371x580.png 424w, https://substackcdn.com/image/fetch/$s_!8z9M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16315362-24e9-4592-9d25-5629512246b9_1371x580.png 848w, https://substackcdn.com/image/fetch/$s_!8z9M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16315362-24e9-4592-9d25-5629512246b9_1371x580.png 1272w, https://substackcdn.com/image/fetch/$s_!8z9M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16315362-24e9-4592-9d25-5629512246b9_1371x580.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>How would I implement these zone maps? </p><ul><li><p>Let&#8217;s reload our dataset where we sort and insert on the rand date. 
The load process looked like this:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vSTy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddc44530-b29a-44c3-bdf0-d982e9db0c48_2480x1408.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vSTy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddc44530-b29a-44c3-bdf0-d982e9db0c48_2480x1408.heic 424w, https://substackcdn.com/image/fetch/$s_!vSTy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddc44530-b29a-44c3-bdf0-d982e9db0c48_2480x1408.heic 848w, https://substackcdn.com/image/fetch/$s_!vSTy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddc44530-b29a-44c3-bdf0-d982e9db0c48_2480x1408.heic 1272w, https://substackcdn.com/image/fetch/$s_!vSTy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddc44530-b29a-44c3-bdf0-d982e9db0c48_2480x1408.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vSTy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddc44530-b29a-44c3-bdf0-d982e9db0c48_2480x1408.heic" width="1456" height="827" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ddc44530-b29a-44c3-bdf0-d982e9db0c48_2480x1408.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:827,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:132070,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/181474453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddc44530-b29a-44c3-bdf0-d982e9db0c48_2480x1408.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vSTy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddc44530-b29a-44c3-bdf0-d982e9db0c48_2480x1408.heic 424w, https://substackcdn.com/image/fetch/$s_!vSTy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddc44530-b29a-44c3-bdf0-d982e9db0c48_2480x1408.heic 848w, https://substackcdn.com/image/fetch/$s_!vSTy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddc44530-b29a-44c3-bdf0-d982e9db0c48_2480x1408.heic 1272w, https://substackcdn.com/image/fetch/$s_!vSTy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddc44530-b29a-44c3-bdf0-d982e9db0c48_2480x1408.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Once that was complete, I created another benchmark, and here were the results:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aRgO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68aa33f-476b-42a8-abca-7763098223b3_658x454.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aRgO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68aa33f-476b-42a8-abca-7763098223b3_658x454.png 424w, 
https://substackcdn.com/image/fetch/$s_!aRgO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68aa33f-476b-42a8-abca-7763098223b3_658x454.png 848w, https://substackcdn.com/image/fetch/$s_!aRgO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68aa33f-476b-42a8-abca-7763098223b3_658x454.png 1272w, https://substackcdn.com/image/fetch/$s_!aRgO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68aa33f-476b-42a8-abca-7763098223b3_658x454.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aRgO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68aa33f-476b-42a8-abca-7763098223b3_658x454.png" width="658" height="454" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f68aa33f-476b-42a8-abca-7763098223b3_658x454.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:454,&quot;width&quot;:658,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54570,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/181474453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68aa33f-476b-42a8-abca-7763098223b3_658x454.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!aRgO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68aa33f-476b-42a8-abca-7763098223b3_658x454.png 424w, https://substackcdn.com/image/fetch/$s_!aRgO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68aa33f-476b-42a8-abca-7763098223b3_658x454.png 848w, https://substackcdn.com/image/fetch/$s_!aRgO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68aa33f-476b-42a8-abca-7763098223b3_658x454.png 1272w, https://substackcdn.com/image/fetch/$s_!aRgO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68aa33f-476b-42a8-abca-7763098223b3_658x454.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Wow!! - Just Wow!&#8230;one of the iterations clocked in under 10 seconds! Tell me why you still need Spark here&#8230;tell me &#128518;.</figcaption></figure></div><p>The sorted (zonemap) dataset improved our benchmark time by roughly <em><strong>30%</strong></em>. That is an amazing tweak, simply by loading the table in a sorted order by the field we are grouping by.</p><h2>Summary</h2><p>This article showcased a paradigm shift in DuckDB&#8217;s capabilities! We have now shattered the belief of what &#8220;small&#8221; data is and what the duck can do. Even on my local laptop, we were still scanning 1TB of data in &lt;2 minutes. If my batch jobs refreshed my reports in 2 minutes without Spark, I would be very happy!</p><p>Here are all the code examples.</p><ul><li><p><a href="https://github.com/mattmartin14/dream_machine/blob/main/substack/articles/2025.11.18-duckdb_1_tb/local_gen_data.py">Local Data Generator</a></p></li><li><p><a href="https://github.com/mattmartin14/dream_machine/blob/main/substack/articles/2025.11.18-duckdb_1_tb/md_gen_data.py">Motherduck Data Generator (Unsorted)</a></p></li><li><p><a href="https://github.com/mattmartin14/dream_machine/blob/main/substack/articles/2025.11.18-duckdb_1_tb/md_gen_data_sorted.py">Motherduck Data Generator (Sorted)</a></p></li><li><p><a href="https://github.com/mattmartin14/dream_machine/blob/main/substack/articles/2025.11.18-duckdb_1_tb/benchmark_local.py">Local Benchmark</a></p></li><li><p><a href="https://github.com/mattmartin14/dream_machine/blob/main/substack/articles/2025.11.18-duckdb_1_tb/benchmark_md_unsorted.py">Motherduck Unsorted Data 
Benchmark</a></p></li><li><p><a href="https://github.com/mattmartin14/dream_machine/blob/main/substack/articles/2025.11.18-duckdb_1_tb/benchmark_md_sorted.py">Motherduck Sorted Data Benchmark</a></p></li></ul><p>Thanks again, MotherDuck, for providing us with the environment to showcase this capability!</p><p>Thanks for reading and happy holidays! If you found value in this article, make sure to comment and share with your friends! </p><p>Matt and Zach</p>]]></content:encoded></item><item><title><![CDATA[Data security shouldn't be an afterthought]]></title><description><![CDATA[A practical guide for Data Engineers]]></description><link>https://blog.dataexpert.io/p/how-to-secure-your-data-a-practical</link><guid isPermaLink="false">https://blog.dataexpert.io/p/how-to-secure-your-data-a-practical</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Wed, 26 Nov 2025 16:36:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zhh_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc679e7-7e9e-47ce-9bf9-8dc1fa097372_1324x930.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zhh_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc679e7-7e9e-47ce-9bf9-8dc1fa097372_1324x930.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zhh_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc679e7-7e9e-47ce-9bf9-8dc1fa097372_1324x930.png 424w, 
https://substackcdn.com/image/fetch/$s_!zhh_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc679e7-7e9e-47ce-9bf9-8dc1fa097372_1324x930.png 848w, https://substackcdn.com/image/fetch/$s_!zhh_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc679e7-7e9e-47ce-9bf9-8dc1fa097372_1324x930.png 1272w, https://substackcdn.com/image/fetch/$s_!zhh_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc679e7-7e9e-47ce-9bf9-8dc1fa097372_1324x930.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zhh_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc679e7-7e9e-47ce-9bf9-8dc1fa097372_1324x930.png" width="1324" height="930" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cc679e7-7e9e-47ce-9bf9-8dc1fa097372_1324x930.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:930,&quot;width&quot;:1324,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zhh_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc679e7-7e9e-47ce-9bf9-8dc1fa097372_1324x930.png 424w, 
https://substackcdn.com/image/fetch/$s_!zhh_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc679e7-7e9e-47ce-9bf9-8dc1fa097372_1324x930.png 848w, https://substackcdn.com/image/fetch/$s_!zhh_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc679e7-7e9e-47ce-9bf9-8dc1fa097372_1324x930.png 1272w, https://substackcdn.com/image/fetch/$s_!zhh_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc679e7-7e9e-47ce-9bf9-8dc1fa097372_1324x930.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Data engineers sit at the intersection of value and vulnerability. You build the pipelines that power analytics, AI, and business decisions. But the same systems that enable innovation also become prime targets for attackers.</p><p>You don&#8217;t need to become a full-time security engineer to protect your data. But you <em>do</em> need to think like one &#8212; because no amount&#8230;</p>
      <p>
          <a href="https://blog.dataexpert.io/p/how-to-secure-your-data-a-practical">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[SCD-2 considered harmful! Part 2]]></title><description><![CDATA[Date stamp your data!]]></description><link>https://blog.dataexpert.io/p/stop-using-slowly-changing-dimensions</link><guid isPermaLink="false">https://blog.dataexpert.io/p/stop-using-slowly-changing-dimensions</guid><dc:creator><![CDATA[Sahar Massachi]]></dc:creator><pubDate>Tue, 04 Nov 2025 20:44:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YLKk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3187461-2858-405e-9686-698a326165df_2262x1270.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Not everything you learned in college about data warehousing still applies in 2025. <br><br>This is part 2 of <a href="https://www.linkedin.com/in/saharmassachi/">Sahar&#8217;s</a> unlearning data warehousing concepts series. (Make sure to read part 1 first if you haven&#8217;t: <em><a href="https://blog.dataexpert.io/p/the-data-warehouse-setup-no-one-taught">&#8220;The Data Setup No One Ever Taught You&#8221; series</a></em>.)</p><p>Sahar and I learned a lot during our time working together in core growth at Facebook, on friending and notifications.</p><p>Let&#8217;s talk about the pain of unlearning and then let&#8217;s get to the magic.</p><p>Imagine you&#8217;re an analyst at a social media company. The retention team asks: &#8220;For users who now have 1000+ followers but had under 200 three months ago &#8211; what device were they primarily using back then? And of the posts they viewed during that growth period, how many were from accounts that were mutuals <em>at the time</em>?&#8221;</p><p>You need to join user data (follower counts then and now), device data (primary device then and now), relationships (who was a mutual then vs now), and post views &#8211; all <em>as of 3 months ago.</em></p><p>With most data warehouse setups, this query is somewhere between &#8220;nightmare&#8221; and &#8220;impossible.&#8221;</p><p>You&#8217;re dealing with state, not actions. State in the past, across multiple tables. There&#8217;s a name for this problem &#8211; slowly changing dimensions. <em>Whole chapters</em> of textbooks deal with various approaches. You could try logs (if you logged the right stuff). You could try slowly changing dimensions with <code>valid_from</code>/<code>valid_to</code> dates. You could try separate history tables. All of these approaches are painful, error-prone, and make backfilling a living hell. </p><p>There&#8217;s a better way. Through the magic of &#10024;<strong>datestamps</strong>&#10024; and idempotent pipelines, this query becomes straightforward. And backfills? 
They become a button you push.</p><p><a href="https://blog.dataexpert.io/p/the-data-warehouse-setup-no-one-taught">Part 1 </a>fixed weird columns, janky tables, and trusting your SQL. Part 3 will cover scaling your team and warehouse. But now &#8211; now we fix: <a href="https://blog.dataexpert.io/p/how-i-made-airbnb-millions-with-this?utm_source=publication-search">backfills</a>, 3am alerts, time complexity, data recovery, and historical queries.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> </p><h2>The old way was a mess</h2><p>Here&#8217;s what most teams do when they start out:</p><h3>Option 1: Overwrite everything daily</h3><p>Your pipeline runs every night, updates <code>dim_users</code> with today&#8217;s snapshot, overwrites yesterday&#8217;s data. Simple! Until six months later when someone asks &#8220;how many followers did users have in March?&#8221; and you realize: that data is gone. You have no history. You can&#8217;t answer the question. Oops.</p><p><em>(Jargon alert &#8211; Apparently this is <a href="https://medium.com/@deepakda1972/understanding-slowly-changing-dimension-scd-type-1-64b5ec571fb0">SCD Type-1 </a>&#175;\_(&#12484;)_/&#175; )</em></p><h3>Option 2: Try to track history manually</h3><p>Okay, you think, let&#8217;s be smarter. Add an <code>updated_at</code> column. Or maybe <code>valid_from</code> and <code>valid_to</code> dates, with an <code>is_current</code> flag. When a user&#8217;s follower count changes, don&#8217;t update their row &#8211; instead, mark the old row as outdated and insert a new one.</p><p><em>(Jargon alert &#8211; This is <a href="https://medium.com/@SaiKarthikaPuttha/understanding-slowly-changing-dimension-scd-type-2-ea1563714bd7">SCD Type-2</a>. Booo)</em></p><p>This is better! You have history. 
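</p><p>For concreteness, here is a minimal sketch of what that &#8220;mark the old row as outdated and insert a new one&#8221; step tends to look like. It assumes a hypothetical <code>stg_users</code> staging table holding today&#8217;s snapshot, a <code>dim_users</code> table with the columns named above, and an illustrative run date of 2024-10-02. None of these names come from a real pipeline, and the exact syntax varies by warehouse.</p><pre><code>-- Hypothetical sketch: stg_users and the literal run date are illustrative assumptions.
-- Step 1: close out current rows whose follower count changed.
UPDATE dim_users
SET valid_to = DATE '2024-10-02',
    is_current = false
WHERE is_current = true
  AND EXISTS (
      SELECT 1
      FROM stg_users s
      WHERE s.user_id = dim_users.user_id
        AND s.followers &lt;&gt; dim_users.followers);

-- Step 2: insert a fresh current row for brand-new users
-- and for users whose old row was closed out in step 1.
INSERT INTO dim_users (user_id, followers, valid_from, valid_to, is_current)
SELECT s.user_id, s.followers, DATE '2024-10-02', NULL, true
FROM stg_users s
LEFT JOIN dim_users d
    ON d.user_id = s.user_id
    AND d.is_current = true
WHERE d.user_id IS NULL;</code></pre><p>Each run mutates rows written by earlier runs, so the state of the table depends on which days have been processed and in what order.</p><p>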
But now:</p><ul><li><p>Your pipelines need custom logic to &#8220;close out&#8221; old rows before inserting new ones</p></li><li><p>If you mess up the <code>valid_to</code> dates, you get gaps or overlaps in history</p></li><li><p>Backfilling becomes a nightmare &#8211; you can&#8217;t just rerun a pipeline, you need to carefully update dates without breaking everything downstream</p></li><li><p><strong>Querying becomes a nightmare</strong>. To get user data &#8220;as of 3 months ago&#8221;, you need:</p></li></ul><p><code>SELECT * FROM dim_users WHERE user_id = 123 AND valid_from &lt;= '2024-10-01' AND (valid_to &gt; '2024-10-01' OR valid_to IS NULL)</code></p><p>Now imagine joining MULTIPLE historical tables (users, devices, relationships). Every join needs that same <code>valid_from</code>/<code>valid_to</code> range logic. Miss one and your results are silently wrong. Get the date math slightly off and you&#8217;re joining snapshots from different points in time. Good luck debugging that.</p><h3>Option 3: Separate current and history tables</h3><p>Some teams maintain <code>dim_users</code> (current snapshot) and <code>dim_users_history</code> (everything else). Now you&#8217;ve got two sources of truth to keep in sync. Analysts need to remember which table to query. Any analysis spanning current and historical data requires stitching across tables with <code>UNION ALL</code>. It&#8217;s a mess.</p><p>And depending on how the <code>dim_users_history</code> table works, it may not solve any of the problems you&#8217;d have in Option 2!</p><p><strong>All of these approaches share a problem:</strong> they&#8217;re trying to be clever about storage. They made sense when disk was expensive. They don&#8217;t anymore.</p><p><em>(Jargon alert &#8211; This is SCD Type-4. Note that I didn&#8217;t know this when I started writing this blog post because it&#8217;s <strong>useless</strong>, <strong>boring</strong>, <strong>outdated</strong> jargon. 
Ignore it.)</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YLKk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3187461-2858-405e-9686-698a326165df_2262x1270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YLKk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3187461-2858-405e-9686-698a326165df_2262x1270.png 424w, https://substackcdn.com/image/fetch/$s_!YLKk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3187461-2858-405e-9686-698a326165df_2262x1270.png 848w, https://substackcdn.com/image/fetch/$s_!YLKk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3187461-2858-405e-9686-698a326165df_2262x1270.png 1272w, https://substackcdn.com/image/fetch/$s_!YLKk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3187461-2858-405e-9686-698a326165df_2262x1270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YLKk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3187461-2858-405e-9686-698a326165df_2262x1270.png" width="1456" height="817" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3187461-2858-405e-9686-698a326165df_2262x1270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:441126,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/177927711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3187461-2858-405e-9686-698a326165df_2262x1270.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!YLKk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3187461-2858-405e-9686-698a326165df_2262x1270.png 424w, https://substackcdn.com/image/fetch/$s_!YLKk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3187461-2858-405e-9686-698a326165df_2262x1270.png 848w, https://substackcdn.com/image/fetch/$s_!YLKk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3187461-2858-405e-9686-698a326165df_2262x1270.png 1272w, https://substackcdn.com/image/fetch/$s_!YLKk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3187461-2858-405e-9686-698a326165df_2262x1270.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">There are other SCD types beyond these, you can find an in-depth video on them <a href="https://www.youtube.com/watch?v=emQM9gYh0Io">here</a></figcaption></figure></div><p></p><h2>Sponsorship</h2><p>If you want to learn more about data modeling and data architecture in detail, you can use code <a href="https://www.dataexpert.io/program/snowflake-and-dbt-boot-camp-starting-january-2nd-2630?code=SCDSUCKS">SCDSUCKS</a> <strong>by November 14th</strong> to get 35% off the <a href="http://DataExpert.io">DataExpert.io</a> Snowflake + dbt boot camp</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.dataexpert.io/program/snowflake-and-dbt-boot-camp-starting-january-2nd-2630?code=SCDSUCKS" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Fi1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd8d5719-f354-4bdd-bf66-c04ae0d31dbf_1002x904.png 424w, https://substackcdn.com/image/fetch/$s_!1Fi1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd8d5719-f354-4bdd-bf66-c04ae0d31dbf_1002x904.png 848w, https://substackcdn.com/image/fetch/$s_!1Fi1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd8d5719-f354-4bdd-bf66-c04ae0d31dbf_1002x904.png 1272w, https://substackcdn.com/image/fetch/$s_!1Fi1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd8d5719-f354-4bdd-bf66-c04ae0d31dbf_1002x904.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Fi1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd8d5719-f354-4bdd-bf66-c04ae0d31dbf_1002x904.png" width="1002" height="904" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd8d5719-f354-4bdd-bf66-c04ae0d31dbf_1002x904.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:904,&quot;width&quot;:1002,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214944,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.dataexpert.io/program/snowflake-and-dbt-boot-camp-starting-january-2nd-2630?code=SCDSUCKS&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/177927711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd8d5719-f354-4bdd-bf66-c04ae0d31dbf_1002x904.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!1Fi1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd8d5719-f354-4bdd-bf66-c04ae0d31dbf_1002x904.png 424w, https://substackcdn.com/image/fetch/$s_!1Fi1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd8d5719-f354-4bdd-bf66-c04ae0d31dbf_1002x904.png 848w, https://substackcdn.com/image/fetch/$s_!1Fi1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd8d5719-f354-4bdd-bf66-c04ae0d31dbf_1002x904.png 1272w, https://substackcdn.com/image/fetch/$s_!1Fi1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd8d5719-f354-4bdd-bf66-c04ae0d31dbf_1002x904.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>The new way: Just append everything</h2><p>You solve it with date stamps. You solve it with &#8220;functional data engineering&#8221;.</p><p>What you really want is a sort of table that tracks state &#8211; a dimension table &#8211;, but where you can access a version that tracks information about the world <em>today</em>, and another version that tracks information about the world <em>in the past</em>.</p><p>Maxime Beauchemin wrote the seminal <a href="https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a">public work on the idea here</a>. But, honestly, I think the concept can be explained more plainly and directly. 
So here we are.</p><p>The thinking goes like this:</p><ul><li><p>We&#8217;re getting new data all the time.</p></li><li><p>Let&#8217;s simplify it and say &#8211; we get new data every day. We copy over snapshots from our production database each evening.</p></li><li><p>There are complex, convoluted ways to keep track of what data is new and useful, and what data is a duplicate of yesterday&#8217;s.</p></li><li><p>But wait. Storage is cheap. Compute is cheap. Pipelines can run jobs for us while we sleep.</p></li><li><p>It&#8217;s annoying to keep one table with the data we need as of right now, plus specialized columns or separate tables to track history.</p></li><li><p>Instead, what if we just kept adding data to existing tables? Add a column for &#8220;date this information was true&#8221; to keep track.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gUA6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42622372-07ec-45ab-bfce-42a5456624d9_2260x1264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gUA6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42622372-07ec-45ab-bfce-42a5456624d9_2260x1264.png 424w, https://substackcdn.com/image/fetch/$s_!gUA6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42622372-07ec-45ab-bfce-42a5456624d9_2260x1264.png 848w, https://substackcdn.com/image/fetch/$s_!gUA6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42622372-07ec-45ab-bfce-42a5456624d9_2260x1264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gUA6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42622372-07ec-45ab-bfce-42a5456624d9_2260x1264.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gUA6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42622372-07ec-45ab-bfce-42a5456624d9_2260x1264.png" width="1456" height="814" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42622372-07ec-45ab-bfce-42a5456624d9_2260x1264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:300517,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/177927711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42622372-07ec-45ab-bfce-42a5456624d9_2260x1264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gUA6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42622372-07ec-45ab-bfce-42a5456624d9_2260x1264.png 424w, https://substackcdn.com/image/fetch/$s_!gUA6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42622372-07ec-45ab-bfce-42a5456624d9_2260x1264.png 848w, 
https://substackcdn.com/image/fetch/$s_!gUA6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42622372-07ec-45ab-bfce-42a5456624d9_2260x1264.png 1272w, https://substackcdn.com/image/fetch/$s_!gUA6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42622372-07ec-45ab-bfce-42a5456624d9_2260x1264.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here&#8217;s what it looks like in practice. 
Instead of overwriting your dimension tables every day, you append to them:</p><pre><code>dim_users
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474; user_id &#9474; followers &#9474; ds         &#9474;
&#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
&#9474; 123     &#9474; 150       &#9474; 2024-10-01 &#9474;
&#9474; 123     &#9474; 180       &#9474; 2024-10-02 &#9474;
&#9474; 123     &#9474; ...       &#9474; ...        &#9474;
&#9474; 123     &#9474; 1200      &#9474; 2025-01-16 &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;

dim_devices
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474; user_id &#9474; device  &#9474; ds         &#9474;
&#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
&#9474; 123     &#9474; mobile  &#9474; 2024-10-01 &#9474;
&#9474; 123     &#9474; mobile  &#9474; 2024-10-02 &#9474;
&#9474; 123     &#9474; ...     &#9474; ...        &#9474;
&#9474; 123     &#9474; desktop &#9474; 2025-01-16 &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;

dim_relationships
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474; user_id &#9474; friend_id &#9474; is_mutual &#9474; ds         &#9474;
&#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
&#9474; 123     &#9474; 789       &#9474; true      &#9474; 2024-10-01 &#9474;
&#9474; 123     &#9474; 789       &#9474; true      &#9474; 2024-10-02 &#9474;
&#9474; ...     &#9474; ...       &#9474; ...       &#9474; ...        &#9474;
&#9474; 123     &#9474; 789       &#9474; false     &#9474; 2025-01-16 &#9474; &#8592; changed
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;

fct_post_views
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474; post_id &#9474; viewer_id &#9474; poster_id &#9474; ds         &#9474;
&#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
&#9474; 5001    &#9474; 123       &#9474; 789       &#9474; 2024-10-01 &#9474;
&#9474; 5002    &#9474; 123       &#9474; 456       &#9474; 2024-10-01 &#9474;
&#9474; 5003    &#9474; 123       &#9474; 789       &#9474; 2024-10-05 &#9474;
&#9474; ...     &#9474; ...       &#9474; ...       &#9474; ...        &#9474;
&#9474; 9999    &#9474; 123       &#9474; 789       &#9474; 2025-01-15 &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></pre><p>Now that impossible retention query becomes straightforward. No <code>valid_from</code>/<code>valid_to</code> range logic on the dimension tables &#8211; just filter each table to the date (or, for the views, the date range) you want:</p><pre><code>-- For fast-growing users, what device did they use back then?

WITH
  today_users as (SELECT user_id, followers as today_followers
      FROM dim_users WHERE ds = '2025-01-16' AND followers &gt;= 1000),
  past_users as (SELECT user_id, followers as past_followers
      FROM dim_users WHERE ds = '2024-10-01' AND followers &lt; 200),
  past_device as (SELECT user_id, device
      FROM dim_devices WHERE ds = '2024-10-01'),
  user_device as (
      SELECT tu.user_id, today_followers, past_followers, pd.device
      FROM past_users pu
      JOIN today_users tu ON pu.user_id = tu.user_id
      JOIN past_device pd ON tu.user_id = pd.user_id),
  views as (
      SELECT post_id, viewer_id, poster_id, ds
      FROM fct_post_views
      WHERE ds BETWEEN '2024-10-01' AND '2025-01-16')
  SELECT
      ud.user_id,
      ud.device as device_during_growth,
      COUNT(DISTINCT CASE WHEN past_rels.friend_id IS NOT NULL THEN views.post_id END) as posts_from_mutuals -- count only views matched to a then-mutual
  FROM user_device ud
  LEFT JOIN views
      ON ud.user_id = views.viewer_id
  LEFT JOIN dim_relationships past_rels
      ON views.viewer_id = past_rels.user_id
      AND views.poster_id = past_rels.friend_id
      AND views.ds = past_rels.ds -- mutual status AS OF view date
      AND past_rels.is_mutual = true
  GROUP BY 1, 2</code></pre><p>Is this query complex? Sure.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> But the complexity is in the <em>business logic</em> (what you&#8217;re trying to measure), not in fighting with <code>valid_from</code>/<code>valid_to</code> dates. Each CTE just filters to the <code>ds</code> (or range of <code>ds</code> values) it needs. That&#8217;s it.</p><p>The idea is that you&#8217;re not <em>overwriting</em> existing tables. You are <em>appending</em>. <a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><blockquote><p><strong>Sidebar: Common Table Expressions</strong></p><p>If I had a SECOND &#8220;one weird trick&#8221; for data engineering, CTEs would be it. CTEs are just fucking fantastic. With liberal use of common table expressions (the <code>WITH</code> clause you saw in the retention query above), you can treat subqueries like variables &#8211; and then manipulating data feels more like code. Make sure your query engine (like Presto/Trino) flattens them for free &#8211; but if it does: wowee! SQL just got dirt simple. (There&#8217;s a free one-hour course on CTEs <a href="https://www.youtube.com/watch?v=vstJyDo88kA">here</a>.)</p></blockquote><p>When you grab data into your warehouse<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, append a special column. That column is usually called &#8220;<code>ds</code>&#8221; &#8211; probably short for datestamp. You want something small and unobtrusive. (Notice that &#8220;<code>date</code>&#8221; would be a bad name: people would confuse this date, the date the data was ingested, with the more obvious kind, the date the action happened.) For snapshots, copy over the entire snapshot, and have your <code>ds</code> column be &lt;today&#8217;s date&gt;. 
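</p><p>A minimal sketch of that snapshot load, assuming a hypothetical <code>raw_users_snapshot</code> table that holds this evening&#8217;s copy of production and an illustrative run date of 2025-01-16 (both are assumptions for the example, not names from a real pipeline). The delete-then-insert pair is one common way to make the load safe to rerun; many warehouses offer a partition overwrite that does the same job in one statement.</p><pre><code>-- Hypothetical sketch: raw_users_snapshot and the run date are illustrative assumptions.
-- Clear anything a previous run wrote for this date, so a rerun does not duplicate rows.
DELETE FROM dim_users
WHERE ds = '2025-01-16';

-- Append the whole snapshot, stamping every row with the run date.
INSERT INTO dim_users (user_id, followers, ds)
SELECT user_id, followers, '2025-01-16' AS ds
FROM raw_users_snapshot;</code></pre><p>Rows for every other <code>ds</code> are left untouched, which is what keeps older history intact when a single day is reloaded.</p><p>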
For logs, you can just grab the logs since yesterday, and set the <code>ds</code> column to &lt;today&#8217;s date&gt;.</p><blockquote><p><strong>Sidebar: Date stamps vs Date partitions<br></strong>I&#8217;ll mostly say &#8220;date stamps&#8221; in this piece &#8211; the concept of marking each row with when that data was valid/ingested.</p><p>&#8220;Date partitions&#8221; are how most warehouse tools <em>implement</em> date stamps. A partition is how your warehouse physically organizes data. Think of it like: all rows with <code>ds=2025-01-15</code> get grouped together in one chunk, <code>ds=2025-01-16</code> in another chunk, and so on. (In older systems, each partition was literally a separate folder. Modern cloud warehouses abstract this, but the concept remains.)</p><p>Why does this matter? When you query <code>WHERE ds='2025-01-15'</code>, your warehouse only scans that one partition instead of the entire table. This makes queries faster and cheaper (especially in cloud warehouses where you pay by the amount of data scanned).</p><p>People use the terms interchangeably. The important thing is the concept: tables with a date column that lets you query any point in history.</p></blockquote><p>Every pipeline downstream of your input tables should read with a filter (<code>WHERE ds={today}</code>) and write its output into the matching <code>ds={today}</code> slice of its own table. (The exception is a pipeline that deliberately <em>wants</em> to look into the past.)</p><p>That&#8217;s it! Now your naive setup (overwriting everything every day) has only changed a bit (append everything each day, and keep track of what you appended when) &#8211; but everything has become so much nicer.</p><h2>This is huge</h2><p>This has two major implications:</p><p>First, many types of analysis become much easier. Want to know about the state of the world yesterday? Filter with <code>WHERE ds = {yesterday}</code>. Need data from a month ago? 
Filter with <code>WHERE ds = {a month ago}</code>. You can even mix and match &#8211; comparing today&#8217;s data with historical data, all within simple queries.</p><p>Second, data engineering becomes both easier and much less error-prone. You can rerun jobs, create tables with historical data, and fix bugs in the past. Your pipelines will produce fast, reliable, consistent results.</p><h2>What &#8220;functional&#8221; actually means</h2><h4>(Aka &#8220;I don&#8217;t know what idempotent means and at this point I&#8217;m afraid to ask&#8221;)</h4><p>So, in Maxime&#8217;s article (<a href="https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a">link</a>) there&#8217;s all this talk about &#8220;functional data engineering&#8221;. What does that even mean? Let&#8217;s discuss.</p><p>First, we&#8217;re borrowing an idea from traditional programming. &#8220;Functional programs&#8221; (or pure functions) meet certain conditions:</p><ol><li><p>If you give them the same input, you get the same output. Every time.</p></li><li><p>State doesn&#8217;t change. Your inputs won&#8217;t change, hidden variables won&#8217;t change. It&#8217;s clean. (AKA &#8220;no side effects&#8221;)</p></li></ol><p>Okay, so what does that mean for pipelines? 
Functional pipelines:</p><ul><li><p>Given the same input, will give the same output</p></li><li><p>Don&#8217;t use (or rely on) magic secret variables</p></li></ul><p>This is what people mean when they say &#8220;<a href="https://www.youtube.com/live/JeeqpK3o3LQ">idempotent</a>&#8221; pipelines or &#8220;reproducible&#8221; data.</p><p>And here&#8217;s how to implement it: <em>datestamps</em>.</p><ul><li><p>Your rawest/most upstream data should never be deleted &#8211; just keep appending with datestamps</p></li><li><p>Pipelines work the same in backfill mode vs normal daily runs</p></li><li><p>If you find bugs, fix the pipeline and rerun &#8211; the corrected data overwrites the bad data</p></li><li><p>Time travel is built in &#8211; just filter to any ds you need</p></li></ul><p><strong>Datestamps also give you the nice side effect of making it </strong><em><strong>very clear</strong></em><strong> how fresh the data you&#8217;re looking at is</strong>. If the latest datestamp on your table is from a week ago, it&#8217;s instantly clear not only what&#8217;s wrong, but also where to start looking for why.</p><blockquote><p><strong>Sidebar &#8211; what this looks like in practice:</strong><br>Your SQL will look something like: <code>WHERE ds='{{ ds }}'</code> (Airflow&#8217;s templating syntax) or <code>WHERE ds=@run_date</code> (parameter binding).</p><p>Your orchestrator injects the date - whether it&#8217;s today&#8217;s scheduled run or a backfill from three months ago. Same SQL, different parameter. That&#8217;s the whole trick.</p></blockquote><h3>Backfilling is now easy, simple, magical</h3><p>Remember that retention query? Now imagine you built that analysis pipeline three months ago, but you just discovered a bug in your <code>dim_relationships</code> table. The <code>is_mutual</code> flag was wrong for two weeks in November. 
You fixed the bug going forward, but now all your retention metrics from that period are wrong.</p><p><strong>With the old SCD Type-2 approach, you&#8217;re in hell:</strong></p><p>You can&#8217;t just &#8220;rerun November.&#8221; Because each day&#8217;s pipeline depended on the previous day&#8217;s state. Day 15 updated rows from Day 14, which updated rows from Day 13, and so on. To fix November 15th, you&#8217;d need to:</p><ol><li><p>Rerun November 1st (building from October 31st&#8217;s state)</p></li><li><p>Wait for it to finish</p></li><li><p>Rerun November 2nd (building from your new November 1st)</p></li><li><p>Wait for it to finish</p></li><li><p>Rerun November 3rd...</p></li><li><p>...keep going for 30 days, sequentially, one at a time</p></li></ol><p>And this is assuming nothing breaks along the way. If Day 18 fails? Start over. Need to fix December too? Add another 31 sequential runs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zl2z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d89c1a8-90c7-4b3f-b8f6-f0c833679614_2264x1272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zl2z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d89c1a8-90c7-4b3f-b8f6-f0c833679614_2264x1272.png 424w, https://substackcdn.com/image/fetch/$s_!zl2z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d89c1a8-90c7-4b3f-b8f6-f0c833679614_2264x1272.png 848w, 
https://substackcdn.com/image/fetch/$s_!zl2z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d89c1a8-90c7-4b3f-b8f6-f0c833679614_2264x1272.png 1272w, https://substackcdn.com/image/fetch/$s_!zl2z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d89c1a8-90c7-4b3f-b8f6-f0c833679614_2264x1272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zl2z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d89c1a8-90c7-4b3f-b8f6-f0c833679614_2264x1272.png" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d89c1a8-90c7-4b3f-b8f6-f0c833679614_2264x1272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:432980,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/177927711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d89c1a8-90c7-4b3f-b8f6-f0c833679614_2264x1272.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zl2z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d89c1a8-90c7-4b3f-b8f6-f0c833679614_2264x1272.png 424w, 
https://substackcdn.com/image/fetch/$s_!zl2z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d89c1a8-90c7-4b3f-b8f6-f0c833679614_2264x1272.png 848w, https://substackcdn.com/image/fetch/$s_!zl2z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d89c1a8-90c7-4b3f-b8f6-f0c833679614_2264x1272.png 1272w, https://substackcdn.com/image/fetch/$s_!zl2z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d89c1a8-90c7-4b3f-b8f6-f0c833679614_2264x1272.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Now imagine that instead of backfilling six days of data, you&#8217;re backfilling 5 years. The speedup goes from about 6 times to hundreds of times (depending on your DAG&#8217;s concurrency limits).</figcaption></figure></div><p>In Airflow terms, this is what <code>depends_on_past=True</code> does to you. Each day is blocked until the previous day completes. <strong>Backfilling becomes painfully slow.</strong> But that&#8217;s by no means the worst part.</p><p><strong>You can&#8217;t just hit &#8220;backfill&#8221; and walk away.</strong> Your normal daily pipeline logic doesn&#8217;t work for backfills. Why? Because SCD Type-2 requires you to:</p><ul><li><p>Close out existing rows (set their <code>valid_to</code> date)</p></li><li><p>Insert new rows (with new <code>valid_from</code> dates)</p></li><li><p>Update <code>is_current</code> flags</p></li><li><p>Handle the case where a row changed <em>multiple times</em> during your backfill period</p></li></ul><p>Your daily pipeline probably has logic like:</p><pre><code>-- Daily SCD Type-2 pipeline (simplified)
-- Step 1: Close out changed rows
UPDATE dim_users
SET valid_to = CURRENT_DATE - 1, is_current = false
WHERE user_id IN (
    SELECT user_id FROM users_source_today
    WHERE &lt;something changed&gt;
)
  AND is_current = true;

-- Step 2: Insert new versions
-- (simplified: a real pipeline would insert only new or changed users here)
INSERT INTO dim_users (user_id, followers, valid_from, valid_to, is_current)
SELECT user_id, followers, CURRENT_DATE, NULL, true
FROM users_source_today;</code></pre><p>This works fine when you&#8217;re processing &#8220;today.&#8221; But for a backfill? You need <em>different</em> SQL:</p><ul><li><p>You need to carefully reconstruct <code>valid_from</code>/<code>valid_to</code> for historical dates</p></li><li><p>You need to handle the fact that a user might have changed <em>multiple</em> times during your backfill window</p></li><li><p>This gets messy fast</p></li><li><p>You&#8217;re essentially rewriting your pipeline (WHY?)</p></li></ul><p>So now you&#8217;re not just waiting on 30 sequential daily runs - you&#8217;re maintaining <em>two separate codebases</em>: one for daily runs, one for backfills. And every time you change your daily logic, you need to update your backfill logic to match. More code to write, more code to test, more places for bugs to hide. And all of that extra work is unnecessary.</p><p><em>Sidenote &#8211; even worse, if you&#8217;re outside your retention window (say, the source data from 90 days ago has been deleted), you can&#8217;t backfill at all. You&#8217;d need to completely rebuild the entire table from scratch, from whatever historical snapshots you still have. Which probably means... datestamped snapshots anyway. Womp womp.</em></p><p><strong>With datestamps, backfilling is trivial:</strong></p><p>Your pipeline for any given day just needs:</p><ul><li><p>Input tables filtered to <code>ds='2024-11-15'</code> (or whatever day you&#8217;re processing)</p></li><li><p>Output written to <code>ds='2024-11-15'</code></p></li></ul><p><strong>That&#8217;s it. November 15th doesn&#8217;t need November 14th. It just needs the snapshot from November 15th.</strong></p><p>So to fix your broken November data:</p><pre><code># In Airflow (or whatever orchestrator)
airflow dags backfill my_retention_pipeline \
    --start-date 2024-11-01 \
    --end-date 2024-11-30</code></pre><p>What happens behind the scenes?</p><ul><li><p>All 30 days kick off in <em>parallel</em> (up to your concurrency limits)</p></li><li><p>Each day independently reads from its ds partition</p></li><li><p>Each day independently writes to its ds partition</p></li><li><p>No coordination needed between days</p></li><li><p>With enough concurrency, the whole month finishes in roughly the time it takes to run one day</p></li></ul><p><strong>The exact same SQL that runs daily also handles backfills</strong> - no special logic, no custom code.</p><p>This changes everything:</p><p><strong>No more custom SQL for backfills</strong> - It&#8217;s just a button you push. Your orchestrator handles it.</p><p><strong>New tables get history for free</strong> - Created a new <code>dim_users_enriched</code> table today but want to populate it with the last year of data? Just backfill 365 days. Since your input tables have datestamps, the data is sitting there waiting.</p><p><strong>Bugs in old data become fixable</strong> - Fix your pipeline logic, backfill the affected date range, done. The old (wrong) data gets overwritten with the new (correct) data for those specific partitions. Everything downstream can reprocess automatically.</p><p><strong>Upstream changes cascade easily</strong> - Fixed a bug in <code>dim_users</code>? All downstream tables that depend on it can backfill the affected dates in parallel. The whole warehouse stays in sync.</p><p>This is possible because your pipelines are <strong>idempotent</strong>. Run them once, run them a thousand times - given the same input date, you get the same output. 
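</p><p>To make the idempotency claim concrete, here is a minimal sketch in Python using the standard-library <code>sqlite3</code> module. It is an illustration under assumptions, not this article&#8217;s actual pipeline: the <code>users_snapshot</code> table and the <code>run_for_ds</code> helper are hypothetical names, and a real warehouse would typically overwrite a partition rather than delete and insert rows. The point is only that each run replaces exactly one <code>ds</code>, so reruns and out-of-order runs converge on the same table.</p>

```python
import sqlite3


def run_for_ds(conn, ds):
    """Rebuild exactly one ds slice of dim_users from that day's snapshot.

    Hypothetical helper: it stands in for "the same SQL, different parameter".
    """
    with conn:  # one transaction: the delete and the insert commit together
        conn.execute("DELETE FROM dim_users WHERE ds = ?", (ds,))
        conn.execute(
            "INSERT INTO dim_users (user_id, followers, ds) "
            "SELECT user_id, followers, ds FROM users_snapshot WHERE ds = ?",
            (ds,),
        )


def demo():
    conn = sqlite3.connect(":memory:")
    # users_snapshot is an assumed name for the datestamped raw input table.
    conn.execute(
        "CREATE TABLE users_snapshot (user_id INTEGER, followers INTEGER, ds TEXT)"
    )
    conn.execute(
        "CREATE TABLE dim_users (user_id INTEGER, followers INTEGER, ds TEXT)"
    )
    conn.executemany(
        "INSERT INTO users_snapshot VALUES (?, ?, ?)",
        [(1, 10, "2024-11-14"), (1, 12, "2024-11-15"), (2, 5, "2024-11-15")],
    )
    # Days are processed out of order, and one day is rerun on purpose.
    for ds in ["2024-11-15", "2024-11-14", "2024-11-15"]:
        run_for_ds(conn, ds)
    return conn.execute(
        "SELECT ds, user_id, followers FROM dim_users ORDER BY ds, user_id"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

<p>Because the only input is the <code>ds</code> value, an orchestrator can hand each date to a separate worker without any coordination between days.</p><p>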
No hidden state, no &#8220;current&#8221; vs &#8220;historical&#8221; logic, no manual date math.</p><p><strong>One pattern to avoid:</strong> Tasks that depend on the previous day&#8217;s partition of their <em>own</em> table. If computing today&#8217;s <code>dim_users</code> requires yesterday&#8217;s <code>dim_users</code>, you&#8217;ve created a chain - backfilling 90 days means 90 sequential runs that can&#8217;t be parallelized. This is sometimes <a href="https://github.com/DataExpert-io/cumulative-table-design">necessary for cumulative metrics</a>, but most dimension tables don&#8217;t need it - just recompute from raw sources each day.</p><p>For most datestamped pipelines, <code>depends_on_past</code> should be <code>False</code>. Each day is independent - the only dependency is &#8220;does the upstream data exist for this ds?&#8221;</p><h2>Welcome to the magic of easy DE work</h2><p>We started this article staring at the prospect of <code>valid_from</code>/<code>valid_to</code> logic, sequential backfills that take days, custom SQL for every backfill, and manual cascading fixes for every bugfix. Yuck. Ew!</p><p>Or maybe &#8211; worse &#8211; with no sense of history at all. No ability to ask &#8220;what did the world look like yesterday&#8221;, much less &#8220;3 months ago&#8221;. I&#8217;ve seen startups and presidential campaigns and 500-million-dollar operations operate like this. &#128579;</p><p>Now you know the secret. Now you have the magic. What mature companies have been doing all along: <strong>snapshot your data daily, append it with datestamps, and write idempotent pipelines on top.</strong></p><p>That&#8217;s it. That&#8217;s the whole One Weird Trick. Add a <code>ds</code> column to <em>every</em> table. Filter on it. Write each day&#8217;s run to be independent of every other day&#8217;s. Have every pipeline be ds-aware. Storage is cheap. Your time is expensive. 
Getting your data wrong is <em>extra expensive</em>.</p><p>What you get in return:</p><ul><li><p>Backfills that run in parallel and finish in minutes instead of days</p></li><li><p>Backfills that are a button push instead of a custom SQL mess</p></li><li><p>Historical queries that are simple <code>WHERE ds='2024-10-01'</code> filters instead of date-range gymnastics</p></li><li><p>Pipelines that are the same whether you&#8217;re processing today or reprocessing last year</p></li><li><p>A built-in time machine for your entire warehouse</p></li><li><p>Bugs that are fixable instead of permanent scars on your data</p></li></ul><p>This is functional data engineering. Functional as in idempotent. And functional as in &#8220;it works&#8221;.</p><p>Your backfills are easy now. Your 3am alerts will be rarer. The complexity of handling time is tamed. Data recovery is trivial. Your job just became <em>so much easier.</em></p><p>But we&#8217;re not done yet. Part 3 will tackle how to scale your team and your warehouse. Parts 4 and 5 are gonna get me back on my <a href="http://integrityinstitute.org">&#8220;he who controls metrics controls the galaxy&#8221; soapbox</a>.</p><p>For now, go add some datestamps. Your future self will thank you.</p><p><em>Hey, it&#8217;s Zach again. Sahar is currently open to work in NYC. Make sure to <a href="http://sahar.substack.com">follow Sahar&#8217;s blog</a> to understand <a href="http://sahar.substack.com">growth and what comes next</a>, both personally and for your business! And more at <a href="http://sahar.io">sahar.io</a></em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Except naming. That&#8217;s on you. 
</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>But actually much simpler due to my favorite SQL tool &#8211; Common Table Expressions!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Technically you&#8217;re appending if today&#8217;s ds is empty and <em>replacing</em> if there is data in today&#8217;s ds</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Ideally daily. You might do logs hourly, but let&#8217;s ignore that for simplicity</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[The Data Warehouse Setup No One Taught You]]></title><description><![CDATA[Storage is cheap, your time is not!]]></description><link>https://blog.dataexpert.io/p/the-data-warehouse-setup-no-one-taught</link><guid isPermaLink="false">https://blog.dataexpert.io/p/the-data-warehouse-setup-no-one-taught</guid><dc:creator><![CDATA[Sahar Massachi]]></dc:creator><pubDate>Fri, 24 Oct 2025 21:03:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!n-30!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Running and using a data warehouse can suck. There are pitfalls. It doesn&#8217;t have to be so hard. In fact, it can be so ridiculously easy that you&#8217;d be surprised people are paying you so much to do your data engineering job. 
<a href="https://www.linkedin.com/in/saharmassachi/">My name is Sahar</a>. I&#8217;m an old coworker of Zach&#8217;s from Facebook. This is our story. <em>(<a href="https://blog.dataexpert.io/p/stop-using-slowly-changing-dimensions">Part two is here</a>)</em></p><p><strong>Data engineering can actually be easy, fast, and resilient! All you have to embrace is a simple concept:</strong> <strong>Date-stamping all your data.</strong></p><p>Why isn&#8217;t this the norm? Because, even in 2025, institutions haven&#8217;t really absorbed the implication that <strong>STORAGE IS CHEAP!</strong> (And your data team&#8217;s time is <em>expensive</em>.)</p><p>Datestamping solves <em>so many problems</em>. But you won&#8217;t find it in a standard textbook. They&#8217;ll teach you &#8220;<a href="https://en.wikipedia.org/wiki/Slowly_changing_dimension">slowly changing dimensions Type 2</a>&#8221; when the real answer is simpler and more powerful. 
You <em>will</em> find the answer in <a href="https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a">Maxime Beauchemin&#8217;s seminal article</a> on functional data engineering. Here&#8217;s the thing &#8211; I love <a href="https://www.linkedin.com/in/maximebeauchemin/">Max</a>, but that article is not helpful to the majority of people who could learn from it.</p><p>What if I told you:</p><ul><li><p>We can have resilient pipelines.</p></li><li><p>We can master changes to data over time.</p></li><li><p>We can use <strong>One Weird Trick</strong> to marry the benefits of order and structure with the benefits of chaos and exploration.</p></li></ul><p>That&#8217;s where this article comes in. It&#8217;s been 7 years in the making &#8211; all the stuff that you should know, but no one bothered to tell you yet. (At least, in plain English &#8211; sorry Max!)</p><ul><li><p><strong>Part One: How to set up a simple warehouse </strong>(and which small bits of jargon actually matter)</p></li><li><p><strong>Part Two:</strong> <strong>Date-stamping</strong>. Understand this and everyone&#8217;s life will become easier, happier, and 90% more bug-free.</p></li><li><p><strong>Part Three: Plugging metrics into A/B testing. </strong>Warehousing enables experimentation. Experimentation enables business velocity.</p></li><li><p><strong>Part Four: The limits of metrics and KPIs. </strong>It can be so captivating to chase short-term metrics to long-term doom.</p></li></ul><p>I&#8217;ll show you a practical intro to scalable analytics warehousing, where date stamps are the organizing principle, not an afterthought. In plain language, not tied to any specific tool, and useful to you today. Meta used this architecture even back in the early 2010s. It worked with the Hive metastore. 
It still works with Iceberg, Delta, and Hudi.<br><br>But first, to understand why all this matters, you need some context about how warehouses work. Then I&#8217;ll show you the magic.</p><h2><strong>Sponsorship</strong></h2><p><strong>Cut Code Review Time &amp; Bugs in Half</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="http://coderabbit.link/zach" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UdFM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5e32d3b-403f-443c-866f-f66e80e018fd_1600x800.png 424w, https://substackcdn.com/image/fetch/$s_!UdFM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5e32d3b-403f-443c-866f-f66e80e018fd_1600x800.png 848w, https://substackcdn.com/image/fetch/$s_!UdFM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5e32d3b-403f-443c-866f-f66e80e018fd_1600x800.png 1272w, https://substackcdn.com/image/fetch/$s_!UdFM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5e32d3b-403f-443c-866f-f66e80e018fd_1600x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UdFM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5e32d3b-403f-443c-866f-f66e80e018fd_1600x800.png" width="1456" height="728" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5e32d3b-403f-443c-866f-f66e80e018fd_1600x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;http://coderabbit.link/zach&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UdFM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5e32d3b-403f-443c-866f-f66e80e018fd_1600x800.png 424w, https://substackcdn.com/image/fetch/$s_!UdFM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5e32d3b-403f-443c-866f-f66e80e018fd_1600x800.png 848w, https://substackcdn.com/image/fetch/$s_!UdFM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5e32d3b-403f-443c-866f-f66e80e018fd_1600x800.png 1272w, https://substackcdn.com/image/fetch/$s_!UdFM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5e32d3b-403f-443c-866f-f66e80e018fd_1600x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Code reviews are critical but time-consuming. CodeRabbit acts as your AI co-pilot, providing instant Code review comments and potential impacts of every pull request.</p><p>Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.</p><p><a href="http://coderabbit.link/zach">CodeRabbit</a> has so far reviewed more than 10 million PRs, installed on 1 million repositories, and used by 70 thousand Open-source projects. CodeRabbit is free for all open-source repo&#8217;s.</p><h1><strong>Part one &#8212; A Simple Explanation of Modern Data Warehousing</strong></h1><p><strong>Our goals and our context</strong></p><p>We are here to build a system that gets all company data, tidily, in one place. 
That allows us to make dashboards that executives and managers look at, charts and tools that analysts and product managers can use to do deep dives, alerts on anomalies, and a breadth of linked data that allows data scientists and researchers to look for magic or product insights. The basic building blocks are tables, and the pipelines that create and maintain them.</p><blockquote><p><strong>Sidebar: DB vs Data lake? OLTP vs OLAP? Production vs warehouse? Here&#8217;s what you need to know.</strong></p><p>A basic point about a data warehouse (or lake, or pond, or whatever trendy buzzword people use today) is that it is <em>not</em> production. It must be a separate system from &#8220;the databases we use to power the product&#8221;.</p><p>Both are &#8220;databases&#8221;, both have &#8220;data&#8221;, including &#8220;tables&#8221; that might be similar or mirrored &#8211; but the similarity should end there.</p><ul><li><p>Your <em>production</em> database is meant to be fast, serve your product and users. It is optimized for code to read and write.</p></li><li><p>Your <em>warehouse</em> is meant to be human-usable, and serve people <em>inside</em> the business. It is optimized for breadth, for use by human analysts, and to have historical records.</p></li></ul><p>Put it this way &#8211; your ecommerce webapp needs to look up an item&#8217;s price and return it as fast as possible. Your warehouse needs to look up an item from a year ago, and look at how the price changed over the course of months. The database powering the webapp won&#8217;t even store the information, much less make it easy to compute. Meanwhile if you run a particularly difficult query, you don&#8217;t want your webapp to slow down.</p><p><strong>So &#8211; split them.</strong> (You might hear people talking about OLTP vs OLAP &#8211; it&#8217;s just this distinction. Ignore the confusing terminology. 
<a href="https://blog.dataexpert.io/p/how-to-data-model-correctly-kimball">Here&#8217;s a deep dive into the two types of OLAP data models (Kimball and One Big Table)</a>.)</p></blockquote><p>So, we want a warehouse. Ideally, it should:</p><ul><li><p>Be separate from our production databases</p></li><li><p>Collect all data that is useful to the company</p></li><li><p>Have tables that make queries easy</p></li><li><p>Be correct &#8211; with accurate, trusted information</p></li><li><p>Be reasonably up to date &#8211; with perhaps a daily lag, rather than a weekly or monthly one</p></li><li><p>Power charts and interactive tools, while also being useful for automatic and local queries</p></li></ul><p><strong>This used to be difficult! (It is not anymore!) </strong>There was a tradeoff between &#8220;big enough to have all the data we need&#8221; and &#8220;give answers fast enough to be useful&#8221;. A lot of hard work was put into reconciling those two needs.</p><p>Since 2015 or so, this has pretty much stopped being a problem. Presto/Trino, Spark, hosted warehouses (BigQuery, Snowflake, the AWS offerings), and other tools allow you to have arbitrarily huge data, accessed quickly. We live in a golden age.</p><blockquote><p><strong>Sidebar: At my old school&#8230;</strong><br>At Meta, they used HDFS and Hive to power their data lake and MySQL to power production. Once a day they took a &#8220;snapshot&#8221; of production with a corresponding date stamp and moved the data from MySQL to Hive.</p></blockquote><p>In a world where storage is cheap, access to data can be measured in seconds rather than minutes or hours, and data is overflowing, the bottleneck is engineering time and conceptual complexity. Solving <em>that</em> bottleneck allows us to break with annoyingly fiddly past best practices. 
That&#8217;s what I&#8217;m here to talk about.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oxWz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad662b0-d020-4912-8943-5b697d4bdb6a_3122x1748.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oxWz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad662b0-d020-4912-8943-5b697d4bdb6a_3122x1748.png 424w, https://substackcdn.com/image/fetch/$s_!oxWz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad662b0-d020-4912-8943-5b697d4bdb6a_3122x1748.png 848w, https://substackcdn.com/image/fetch/$s_!oxWz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad662b0-d020-4912-8943-5b697d4bdb6a_3122x1748.png 1272w, https://substackcdn.com/image/fetch/$s_!oxWz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad662b0-d020-4912-8943-5b697d4bdb6a_3122x1748.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oxWz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad662b0-d020-4912-8943-5b697d4bdb6a_3122x1748.png" width="1456" height="815" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ad662b0-d020-4912-8943-5b697d4bdb6a_3122x1748.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:481244,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/176954181?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad662b0-d020-4912-8943-5b697d4bdb6a_3122x1748.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oxWz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad662b0-d020-4912-8943-5b697d4bdb6a_3122x1748.png 424w, https://substackcdn.com/image/fetch/$s_!oxWz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad662b0-d020-4912-8943-5b697d4bdb6a_3122x1748.png 848w, https://substackcdn.com/image/fetch/$s_!oxWz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad662b0-d020-4912-8943-5b697d4bdb6a_3122x1748.png 1272w, https://substackcdn.com/image/fetch/$s_!oxWz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad662b0-d020-4912-8943-5b697d4bdb6a_3122x1748.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p></p><h3>A basic setup</h3><p>Imagine your warehouse as a giant box, holding many, many tables. Think of data flowing downhill through it.</p><ul><li><p>At the top: raw copies from production databases, marketing APIs, payment processors, whatever.</p></li><li><p>At the bottom: clean, trusted tables that analysts actually query.</p></li><li><p>In between: pipelines that flow data from table to table.</p></li></ul><pre><code>[Raw Input Tables]
&#9500;&#9472; users_production
&#9500;&#9472; events_raw
&#9500;&#9472; transactions_raw
&#9492;&#9472; ...

                 &#8595;

[Pipelines]
     Clean &#8594; Join &#8594; Enrich

                 &#8595;

[Clean Output Tables]
&#9500;&#9472; dim_users
&#9500;&#9472; fct_events
&#9492;&#9472; grp_daily_revenue</code></pre><p>How do we get from raw input to clean tables? <strong>Pipelines.</strong> (See buzzwords like ETL or ELT? Ignore the froth &#8211; replace them with &#8220;pipelines&#8221; and move on.)</p><p>Pipelines are the #1 tool of data engineering. In their most basic form, they&#8217;re pieces of code that take in one or more input tables, do something to the data, and output a different table.</p><p><strong>What language do you write pipelines in? </strong>Like it or not, the lingua franca of <em>editing</em> large-scale data is SQL. The lingua franca of <em>accessing</em> large-scale data is SQL. SQL is a constrained enough language that query engines can parallelize it easily. The tools that invisibly translate your simple snippets into complex mechanisms to grab data from different machines, transform it, join it, etc. &#8211; they are not only built with SQL in mind, they generally <em>cannot</em> do the same for arbitrary Python, Java, etc. Why? Because a traditional programming language gives you too much flexibility: there&#8217;s no guarantee that your imperative code <em>can</em> be parallelized nicely.</p><blockquote><p><strong>Sidebar: When non-SQL makes sense (or doesn&#8217;t)</strong></p><p>If you&#8217;re ingesting data from the outside world (calling APIs, reading streams, and so on), then Python, JavaScript, etc. could make sense. But once data is in the warehouse, beware anything that isn&#8217;t SQL &#8211; it&#8217;s likely unnecessary, and almost certainly going to be much slower than everything else.</p><p>Your tooling might offer a way to &#8220;backdoor&#8221; a bit of code (e.g. &#8220;write some Java code that calls an API and then writes the resultant variable to a column&#8221;). Think twice before you use it. 
Often, it&#8217;s easier and faster to import a new dataset into your warehouse so that you can recreate with SQL joins what you would have done using an imperative language.</p><p>You may be tempted to transform or analyze data in R, pandas, or whatnot &#8211; that&#8217;s fine, but you do that by interactively <em>reading</em> from the warehouse. Rule of thumb: if you&#8217;re writing <em>between</em> tables in a warehouse &#8211; SQL. <em>Into</em> a warehouse &#8211; you probably need some glue code somewhere. <em>Out</em> of a warehouse &#8211; that&#8217;s on you.</p></blockquote><p><strong>So here&#8217;s the simple setup:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n-30!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n-30!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png 424w, https://substackcdn.com/image/fetch/$s_!n-30!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png 848w, https://substackcdn.com/image/fetch/$s_!n-30!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png 1272w, https://substackcdn.com/image/fetch/$s_!n-30!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n-30!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png" width="1456" height="815" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:655121,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/176954181?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n-30!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png 424w, https://substackcdn.com/image/fetch/$s_!n-30!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png 848w, https://substackcdn.com/image/fetch/$s_!n-30!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png 1272w, 
https://substackcdn.com/image/fetch/$s_!n-30!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1d426c-755f-4c27-8c31-210f408b7568_2782x1558.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Each day, copy data into your warehouse. Copy in data from your production database, your marketing platform, your sales data, whatever. Don&#8217;t bother cleaning it as you pipe it over (ELT pattern NOT ETL!). 
Just do a straight copy, using whatever tools make sense.</figcaption></figure></div><p>Then, set up a system of pipelines to do this every day, as soon as the upstream data is ready:</p><ul><li><p>As each of these input tables gets the latest dump of data from outside: take that latest day&#8217;s data, deduplicate, clean it up a bit, rename the columns, and cascade it to a nicer, cleaner version of that table. <strong>(this is your <a href="https://www.databricks.com/glossary/medallion-architecture">silver tier data in medallion architecture</a>)</strong></p></li><li><p>Then, from that <em>nicer</em> input table, perform a host of transformations, joins, etc. to write to other downstream tables.<strong> (this is your master data)</strong></p></li><li><p>Master data is highly trusted, which makes building metrics and powering dashboards easy!<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p></li></ul><p>Every day, new data comes in, and your pipeline setup cascades new information into a host of tables downstream of it. That&#8217;s the setup.</p><h3>A well-ordered table structure</h3><p>Okay, so to review: the basic useful item in a warehouse is a <em>table</em>. Tables are created (and filled up) by <em>pipelines.</em></p><p>&#8220;Great, great,&#8221; you might say &#8211; &#8220;but which tables do I build?&#8221;</p><p><em>For the sake of example, let&#8217;s imagine our product is a social network. 
But this typology should work just as well for whichever business you are in &#8211; from B2B SaaS to e-commerce to astrophysics.</em></p><p>From the perspective of the data warehouse as a <em>product</em>, there are only three kinds of tables: input tables (copied from outside), staging tables (used by pipelines and machines), and output tables &#8211; also known as user-facing tables.</p><p>Output tables (in fact, almost all tables) really only come in three types:</p><ul><li><p>Tables where each row corresponds to a noun (e.g. &#8220;user&#8221;, or even &#8220;post&#8221; or &#8220;comment&#8221;). When done right, these are called <strong>dimension tables</strong>. Prefix their names with <em>dim_</em></p></li><li><p>Tables where each row corresponds to an action. Think of them as fancier versions of logs (e.g. &#8220;user X wrote post Y at time Z&#8221;). When done right, these are called <strong>fact tables</strong>. Prefix their names with <em>fct_</em></p></li><li><p>Everything else. Often these will be summary tables (e.g. &#8220;number of users who made at least 1 post, per country, per day&#8221;). If you&#8217;re proud of these, prefix them with <em>sum_ </em>or <em>agg_.</em></p></li></ul><blockquote><p><strong>Sidebar: more on naming</strong></p><p>YMMV, but I generally <em>don&#8217;t</em> prefix input tables. Input tables should be an <em>exact copy</em> of the table you&#8217;re importing from outside the warehouse. 
Changing names breaks that &#8211; and an unprefixed table name is a useful signal that the table is raw and should not be trusted as-is.</p><p>Staging and temporary tables are prefixed with <em>stg_</em> or <em>tmp_.</em></p></blockquote><p>Let&#8217;s talk more about dimension and fact tables, since they&#8217;re the core part of any clean warehouse.</p><p><strong>Dimension tables are the clean, user-friendly, mature form of </strong><em><strong>noun tables</strong></em><strong>.</strong></p><ul><li><p>Despite being focused on nouns (say, users), they can also roll up useful <em>verby</em> information (<a href="https://github.com/DataExpert-io/cumulative-table-design">leveraging cumulative table design</a>)</p></li><li><p>For instance, a <em>dim_users</em> table might include both basic attributes like: user id, date created, datetime last seen, number of friends, name; <em>AND</em> more aggregate &#8220;verby&#8221; information like: total number of posts written, comments made in the last 7 days, number of days active in the last month, number of views yesterday.</p></li><li><p>If a data analyst might consistently want that data &#8211; maybe add it to the table! Your small code tweak will save them hours of waiting every week.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p></li></ul><p><em>(Now, what&#8217;s to stop the table from being unusably wide? Say, with 500+ columns? Well, that&#8217;s mostly an internal culture problem, and somewhat a tooling problem. You could imagine, say, dim_users getting too large, so the more extraneous information moves to a dim_users_extras table, to be joined in when necessary. Or you could use complex data types to reduce the number of columns.)</em></p><p><strong>Fact tables are the clean, user-friendly, mature form of logs (or actions or verb tables).</strong></p><ul><li><p>Despite being verb-focused, fact tables contain noun information. 
(Zach chimes in: here&#8217;s a <a href="https://www.youtube.com/watch?v=DQefW9sNmw0">free 4-hour course</a> on everything you need to know about fact tables)</p></li><li><p>Unlike a plain log, which will be terse, they can also be enriched with data that would otherwise live in a dim table.</p></li><li><p>The essence of a good fact table is providing all the necessary context to do analysis of the event in question.</p></li><li><p>A fact table, fundamentally, helps you understand: &#8220;Thing X happened at time Y. And here&#8217;s a bunch of context Z that you might enjoy&#8221;.</p></li><li><p>So a log containing &#8220;User Z made comment Xa on post Xb at time Y&#8221; could turn into a fct_comment table, with fields like: commenter id, comment id, post id, time, time in the commenter&#8217;s time zone, comment text, post text, user id of the owner of the post, time zone of the owner of the parent post. Some of these fields are strictly speaking unnecessary &#8211; you could in theory do some joins to grab the post text, or the comment text, or the time zone of the owner of the parent post. But they&#8217;re useful to have handy for your users, so why not save them time and grab them anyway?</p></li></ul><p><strong>Q: Wait &#8211; so if dim tables also have </strong><em><strong>verb </strong></em><strong>data, and fact tables also have </strong><em><strong>noun</strong></em><strong> data, what&#8217;s the difference?</strong></p><p><strong>A: </strong>Glad you asked. Here&#8217;s what it boils down to &#8211; is there one row per noun in the table? Dim. One row per &#8220;a thing happened&#8221;? Fact. That&#8217;s it. You&#8217;re welcome.</p><p>Here, as in so much, we are spending <em>space</em> freely. We are duplicating data. We are also doing a macro form of caching &#8211; rather than forcing users to join or group data on the fly, we have pipelines do it ahead of time.</p><p>Compute is cheap, storage is cheap. Staff time is not. 
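</p><p>To make that concrete, here is a minimal sketch of the kind of pipeline step that does the joining ahead of time, building the <em>fct_comment</em> table described above. The input tables (<em>comments_raw</em>, <em>posts_raw</em>, <em>users_production</em>) and every column name are assumptions invented for illustration, and the exact <code>CREATE TABLE ... AS</code> syntax may differ slightly in your warehouse&#8217;s SQL dialect.</p><pre><code>-- Hypothetical pipeline step: one output row per comment ("a thing happened").
-- Table and column names are illustrative assumptions, not a real schema.
CREATE TABLE fct_comment AS
SELECT
    c.comment_id,
    c.user_id            AS commenter_id,
    c.post_id,
    c.created_at         AS commented_at,
    c.comment_text,
    p.post_text,                              -- duplicated on purpose
    p.user_id            AS post_owner_id,
    post_owner.time_zone AS post_owner_time_zone
FROM comments_raw AS c
JOIN posts_raw AS p
    ON p.post_id = c.post_id
JOIN users_production AS post_owner
    ON post_owner.user_id = p.user_id;</code></pre><p>With a table like this in place, an analyst could count comments per post-owner time zone with a single GROUP BY on <em>fct_comment</em>, without writing any of the joins themselves.</p><p>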
We want analysis to be fluid and low-latency &#8211; both technically in terms of compute, and in terms of mental overhead.</p><p><strong>Q: Wait! What about date stamps? Where&#8217;s the magic? You promised magic.</strong></p><p><strong>A: </strong>Patience, young grasshopper. Part of enlightenment is the journey. Part of understanding the magic is understanding what it builds on. And &#8211; hey &#8211; would YOU read a huge blog post all at once? Or would you prefer to read it in chunks? Yeah, you with your TikTok problem and inability to focus. I&#8217;m surprised you even made it this far.</p><p><strong>Stay tuned for part two where we:</strong></p><ul><li><p>Show you how to make warehousing dirt easy</p></li><li><p>Behold the glory of date stamping</p><ul><li><p>Through better data quality</p></li><li><p><a href="https://blog.dataexpert.io/p/how-to-avoid-pipeline-backfill-nightmares">How it avoids backfill nightmares</a></p></li><li><p>Bug fixes are easy now</p></li><li><p>You get a time machine for free</p></li></ul></li><li><p>Explore the dream of functional data engineering (what is that weird phrase?)</p></li><li><p>Throw SCD-2 and other outdated &#8220;solutions&#8221; into the dustbin of history</p></li></ul><p>Make sure to follow <a href="https://sahar.substack.com">Sahar&#8217;s blog</a> to understand growth and what comes next, both personally and for your business! <a href="https://www.linkedin.com/in/saharmassachi/">Sahar</a> is currently open to work; he is interested in DevRel and engineering leadership roles in New York City.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>For instance, join the data from your sales and marketing platforms to create a &#8220;customer&#8221; table. Or join various production tables to create a &#8220;user&#8221; table. 
Could you then combine &#8220;customer&#8221; and &#8220;user&#8221; to create a bigger table? You might add pipeline steps to create easy tables for analysts to use: &#8220;daily revenue grouped by country&#8221;, etc.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Here&#8217;s another key insight: data processing done while everyone is asleep is much better than data querying done while people are on the clock and fighting a deadline</p></div></div>]]></content:encoded></item><item><title><![CDATA[The 2025 AI + Data Engineering Roadmap]]></title><description><![CDATA[Getting a data engineering job is complicated.]]></description><link>https://blog.dataexpert.io/p/the-2025-breaking-into-data-engineering-roadmap</link><guid isPermaLink="false">https://blog.dataexpert.io/p/the-2025-breaking-into-data-engineering-roadmap</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Fri, 17 Oct 2025 22:35:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8YmI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Getting a data engineering job is complicated. After the crowd of people screaming <em>&#8220;LEARN PYTHON AND SQL,&#8221;</em> you&#8217;ll still find yourself lost in a sea of technologies like Spark, Flink, Iceberg, BigQuery, and now even AI-driven platforms.</p><p>Knowing where to start and how to get a handle on this requires some guidance. 
This newsletter is going to unveil the steps needed to break into data engineering in <strong>2025</strong> and how AI fits into the picture.</p><p>A lot of people still think that after you&#8217;ve done a magical number of Leetcode problems, a job falls into your lap. 
That&#8217;s almost never the case!</p><p>To get a job in 2025, you&#8217;ll need the following things:</p><ul><li><p>Demonstrable skills with:</p><ul><li><p><strong>SQL and Python</strong></p></li><li><p><strong>Distributed compute</strong> (Snowflake, Spark, BigQuery, DuckDB)</p></li><li><p><strong>Orchestration knowledge</strong> (Airflow, Mage, or Databricks workflows)</p></li><li><p><strong>Data modeling and data quality</strong></p></li><li><p><strong>AI/data integrations</strong> (vector databases, embeddings, RAG pipelines)</p></li></ul></li><li><p>An opportunity to demonstrate those skills via a <strong>portfolio project</strong></p></li><li><p><strong>A personal brand that radiates above the noise</strong> both on LinkedIn and in interviews</p></li></ul><p>Let&#8217;s dig into each of these areas and see how you can fast-track your way to success!</p><div><hr></div><h3>Sponsorship</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.tinybird.co/data-crash-course/?utm_campaign=sponsorships-dataexpert" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ElMM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b558d1-6988-454c-8219-153c1ecc1343_720x361.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ElMM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b558d1-6988-454c-8219-153c1ecc1343_720x361.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ElMM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b558d1-6988-454c-8219-153c1ecc1343_720x361.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!ElMM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b558d1-6988-454c-8219-153c1ecc1343_720x361.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ElMM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b558d1-6988-454c-8219-153c1ecc1343_720x361.jpeg" width="720" height="361" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9b558d1-6988-454c-8219-153c1ecc1343_720x361.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:361,&quot;width&quot;:720,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45412,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:&quot;https://www.tinybird.co/data-crash-course/?utm_campaign=sponsorships-dataexpert&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/170786637?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b558d1-6988-454c-8219-153c1ecc1343_720x361.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ElMM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b558d1-6988-454c-8219-153c1ecc1343_720x361.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ElMM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b558d1-6988-454c-8219-153c1ecc1343_720x361.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!ElMM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b558d1-6988-454c-8219-153c1ecc1343_720x361.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ElMM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b558d1-6988-454c-8219-153c1ecc1343_720x361.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><a href="https://www.tinybird.co/?utm_campaign=sponsorships-dataexpert">Tinybird</a> helps companies like Vercel and Framer transform 
massive streams of real-time data into instant insights and real-time user experiences. </p><p>They built <a href="https://www.tinybird.co/data-crash-course/?utm_campaign=sponsorships-dataexpert">a free course on real-time data</a> designed for data engineers who want to master the basics and build amazing data products.</p><div><hr></div><h2>Learning SQL</h2><p>Avoiding SQL is the same as avoiding a job in data engineering. This is the most fundamental language you need to know.</p><p>There are many resources out there to learn it! The ones I recommend are:</p><ul><li><p><a href="https://www.dataexpert.io/questions">DataExpert.io free SQL question practice</a></p></li><li><p><a href="https://datalemur.com/">DataLemur</a></p></li><li><p><a href="https://www.stratascratch.com/">StrataScratch</a></p></li></ul><p>Key things you should know in this bucket are:</p><ul><li><p>The basics</p><ul><li><p><strong>JOIN</strong>s</p><ul><li><p><strong>INNER, LEFT, FULL OUTER</strong></p><ul><li><p>Remember you should almost never use <strong>RIGHT JOIN</strong></p></li></ul></li></ul></li><li><p>Aggregations with <strong>GROUP BY</strong></p><ul><li><p>Know the differences between <strong>COUNT</strong> and <strong>COUNT(DISTINCT)</strong></p><ul><li><p>Remember that COUNT(DISTINCT) is much slower in distributed environments like Spark, because distinct values have to be shuffled between machines before they can be de-duplicated</p></li></ul></li><li><p>Know how to use aggregation functions with CASE WHEN statements</p><ul><li><p>Example: <strong>COUNT(CASE WHEN status = 'expired' THEN order_id END) </strong>counts the number of expired orders, because rows that fail the condition yield NULL and COUNT ignores NULLs</p></li></ul></li><li><p>Know about cardinality reduction and bucketing your dimensions</p><ul><li><p>Example:</p><ul><li><p><code>SELECT CASE WHEN age &gt; 30 THEN 'old' ELSE 'young' END AS age_bucket, COUNT(1) FROM users GROUP BY 1</code></p></li></ul></li><li><p>In this query, we take a high-cardinality dimension like age and reduce it to a low-cardinality one (just two 
values &#8220;old&#8221; and &#8220;young&#8221;)</p></li></ul></li></ul></li></ul></li><li><p>Common Table Expressions vs Subquery vs View vs Temp Table (<a href="https://www.youtube.com/watch?v=vstJyDo88kA">a great YouTube video here</a>)</p><ul><li><p>The key things here are:</p><ul><li><p>You should very rarely use subqueries (they hurt the readability of pipelines)</p></li><li><p>You should use a temp table if you need to reuse some logic, since a temp table gets materialized and will improve the performance of your pipeline</p></li><li><p>You should use a view when you need to store logic for longer than the duration of the pipeline execution</p></li><li><p>In all other cases, you should use common table expressions to improve readability!</p></li></ul></li></ul></li><li><p>Understand how SQL works in distributed environments</p><ul><li><p>Know what keywords trigger shuffle</p><ul><li><p>JOIN, GROUP BY, ORDER BY</p></li></ul></li><li><p>Know what keywords are extremely scalable (this means they&#8217;re executed entirely on the map-side)</p><ul><li><p>SELECT, FROM, WHERE, LIMIT</p></li></ul></li></ul></li><li><p>Know window functions thoroughly (<a href="https://www.youtube.com/watch?v=dqwhNcZoMOQ&amp;t=3s">great YouTube video here</a>)</p><ul><li><p>The basics</p><ul><li><p><code>SUM(&lt;column&gt;) OVER (PARTITION BY &lt;partition&gt; ORDER BY &lt;order by&gt; ROWS BETWEEN &lt;preceding&gt; PRECEDING AND &lt;following&gt; FOLLOWING)</code></p></li><li><p>You have the function (e.g. RANK, SUM, etc).</p></li><li><p>You have <strong>PARTITION BY</strong>, which divides the window up. Maybe you want to do window functions per department or country?</p></li><li><p>You have <strong>ORDER BY</strong>, which determines the sorting of the window</p></li><li><p>You have the <strong>ROWS BETWEEN </strong>clause to determine how many rows you should include in your window. If you don&#8217;t specify this but do have an ORDER BY, standard SQL defaults to a RANGE frame, which also includes any rows tied with the current row: <strong>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. 
</strong>So the default is the &#8220;cumulative&#8221; row definition.</p></li></ul></li><li><p>Understand <strong>RANK</strong> vs <strong>DENSE_RANK</strong> vs <strong>ROW_NUMBER</strong><a href="https://www.youtube.com/watch?v=-MFcNlHMLDY">(a quick two minute YouTube video about this)</a></p><ul><li><p>Key things:</p><ul><li><p>When there is no tie in your ORDER BY clause, these functions are identical</p></li><li><p>When there are ties,</p><ul><li><p>RANK skips values (e.g. a tie for 1st place means the next place is third)</p></li><li><p>DENSE_RANK does not skip values (e.g. a tie for 1st place means the next place is second)</p></li><li><p>ROW_NUMBER guarantees unique values with no ties (a tie for first place means one of them will get second place, this is based on the natural ordering of the data)</p></li></ul></li></ul></li><li><p>Understand how to do &#8220;rolling&#8221; calculations</p><ul><li><p>Rolling average and sum by department is a common interview question. You can solve it with a query like this:</p><ul><li><p><code>SELECT revenue_date, SUM(revenue) OVER (PARTITION BY department ORDER BY revenue_date ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) as thirty_day_rolling_revenue FROM daily_sales</code></p></li><li><p>You&#8217;ll see we split the window by department, then we look at the rolling 30 day period for each day and sum it up. You need to be careful here and ensure that there is data (even if it&#8217;s zero) for each date otherwise you&#8217;ll calculations will be wrong!</p></li></ul></li></ul></li></ul></li><li><p>Know about the differences between <strong>INSERT INTO, INSERT OVERWRITE</strong> and <strong>MERGE</strong></p><ul><li><p>INSERT INTO just copies the data from the result query into the table. <strong>THIS IS PRONE TO DUPLICATES! 
</strong>If you&#8217;re using <strong>INSERT INTO</strong>, it should always be coupled with either <strong>TRUNCATE</strong> or a <strong>DELETE</strong> statement!</p></li><li><p><strong>INSERT OVERWRITE</strong> is nice because it copies the data and replaces whatever existing data is in that partition. This is the most common one they use in big tech!</p></li><li><p><strong>MERGE</strong> is nice because it looks at the existing data and copies only the rows that are updated and/or deletes the rows that aren&#8217;t in the incoming data. The only minus of <strong>MERGE</strong> is the comparisons it needs to do to accomplish this can be very slow for large data sets!</p></li></ul></li></ul></li></ul><p><strong>What&#8217;s new in 2025?</strong></p><ul><li><p>AI copilots like <strong>Databricks Genie</strong> and <strong>Snowflake Cortex</strong> will happily write SQL for you. But interviewers now test whether you can <strong>verify, debug, and optimize AI-generated SQL.</strong></p></li><li><p>Understanding <em>how SQL runs in distributed environments</em> (shuffle, partitioning, scalability) is still your edge over someone blindly trusting a copilot.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8YmI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8YmI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png 424w, 
https://substackcdn.com/image/fetch/$s_!8YmI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png 848w, https://substackcdn.com/image/fetch/$s_!8YmI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png 1272w, https://substackcdn.com/image/fetch/$s_!8YmI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8YmI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png" width="1456" height="1820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2785027,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/170786637?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!8YmI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png 424w, https://substackcdn.com/image/fetch/$s_!8YmI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png 848w, https://substackcdn.com/image/fetch/$s_!8YmI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png 1272w, https://substackcdn.com/image/fetch/$s_!8YmI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bed7b1c-af42-4f10-bc88-f25ceffb80b4_2160x2700.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div><hr></div><h2>Learning Python (and AI Integration)</h2><p>SQL is great and can accomplish a lot in data engineering. There are limitations to it though. To overcome these limitations you need a more complete language like Python.</p><p><strong>Here are the concepts you need to learn:</strong></p><ul><li><p>Data types and data structures</p><ul><li><p>Basics</p><ul><li><p>strings, integers, decimals, booleans</p></li></ul></li><li><p>Complex</p><ul><li><p>lists (or arrays), dictionaries, stacks and queues</p></li></ul></li><li><p>Other data structures you probably don&#8217;t need to learn <strong>(although these might show up in interviews which sucks!)</strong></p><ul><li><p>heaps, trees, graphs, self-balancing trees, tries</p></li></ul></li></ul></li><li><p>Algorithms</p><ul><li><p>Basics</p><ul><li><p>Loops, linear search, binary search</p></li><li><p>Big O Notation</p><ul><li><p>You should be able to <a href="https://en.wikipedia.org/wiki/Big_O_notation">write down both the space and time complexity of algorithms</a></p></li></ul></li></ul></li><li><p>Algorithms you probably don&#8217;t need to know <strong>(although these might show up in interviews which sucks!)</strong></p><ul><li><p>Dijkstra's algorithm, dynamic programming, greedy algorithms</p></li></ul></li></ul></li><li><p>Using Python as an orchestrator</p><ul><li><p>One of the most common use cases for Python in data engineering is to construct Airflow DAGs</p></li></ul></li><li><p>Using Python to interact with REST APIs</p><ul><li><p>REST APIs are a common data source for data engineers. Learn how to make GET, POST, and PUT requests in Python. 
The <a href="https://pypi.org/project/requests/">requests</a> package in Python is great for this!</p></li></ul></li><li><p>Know how to test your code with <strong>pytest</strong> (in Python) or <strong>JUnit</strong> (in Scala)</p><ul><li><p>There&#8217;s a really solid library called <a href="https://medium.com/art-of-data-engineering/writing-pyspark-integration-tests-with-chispa-a89f5023b445">Chispa</a> that works well with pytest for testing your PySpark jobs you should check out!</p></li></ul></li></ul><p><strong>New in 2025:</strong><br>Python is now the glue between <strong>data engineering and AI</strong>. You&#8217;ll need to know:</p><ul><li><p>Calling LLM APIs (OpenAI, Anthropic, open-source models)</p></li><li><p>Generating and storing embeddings</p></li><li><p>Working with vector databases (Pinecone, Milvus, Weaviate, pgvector)</p></li><li><p>Building lightweight RAG (Retrieval-Augmented Generation) pipelines</p></li><li><p>Writing &#8220;AI validators&#8221; &#8212; Python jobs that use an LLM to check data quality or generate documentation</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XA-F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523448be-c1e3-44bd-9a2a-ed4dda43df89_2160x2700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XA-F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523448be-c1e3-44bd-9a2a-ed4dda43df89_2160x2700.png 424w, https://substackcdn.com/image/fetch/$s_!XA-F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523448be-c1e3-44bd-9a2a-ed4dda43df89_2160x2700.png 848w, 
https://substackcdn.com/image/fetch/$s_!XA-F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523448be-c1e3-44bd-9a2a-ed4dda43df89_2160x2700.png 1272w, https://substackcdn.com/image/fetch/$s_!XA-F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523448be-c1e3-44bd-9a2a-ed4dda43df89_2160x2700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XA-F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523448be-c1e3-44bd-9a2a-ed4dda43df89_2160x2700.png" width="1456" height="1820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/523448be-c1e3-44bd-9a2a-ed4dda43df89_2160x2700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2883996,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/170786637?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523448be-c1e3-44bd-9a2a-ed4dda43df89_2160x2700.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XA-F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523448be-c1e3-44bd-9a2a-ed4dda43df89_2160x2700.png 424w, 
https://substackcdn.com/image/fetch/$s_!XA-F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523448be-c1e3-44bd-9a2a-ed4dda43df89_2160x2700.png 848w, https://substackcdn.com/image/fetch/$s_!XA-F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523448be-c1e3-44bd-9a2a-ed4dda43df89_2160x2700.png 1272w, https://substackcdn.com/image/fetch/$s_!XA-F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523448be-c1e3-44bd-9a2a-ed4dda43df89_2160x2700.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div><hr></div><h2><strong>Learning Distributed Compute (either Spark, BigQuery or Snowflake)</strong></h2><p>Back in the mid-2000s, Hadoop was born and it brought the notion of distributed compute into the mainstream. This means that instead of having one fancy computer process all your data, you have a team of computers that each process a small chunk!<br><br>This concept unlocks the possibility of computing vast amounts of data in a small amount of time by leveraging teamwork! But this does not come without complexity!<br><br>Here are some things to consider:</p><ul><li><p>Shuffle</p><ul><li><p>If we are using teamwork, we need to guarantee certain data is on a certain machine (like if we are counting how many messages each user has received). The team accomplishes this guarantee by passing all of the data for a given key (like a user) to the same machine via shuffle (example in the diagram below). We only HAVE to do this when we do <strong>GROUP BY, JOIN, </strong>or<strong> ORDER BY</strong>. (<a href="https://www.youtube.com/watch?v=g23GHqJje40">a quick 2 minute video about how I managed this at petabyte scale at Netflix</a>)</p></li></ul></li></ul><ul><li><p>Remember, shuffling isn&#8217;t a bad thing! It actually is really good because it makes distributed compute mostly the same as single node compute! The only time it gets in the way of things is at very large scale!</p><ul><li><p>Things you should consider to reduce shuffling at very large scale</p><ul><li><p>Broadcast <strong>JOIN</strong></p><ul><li><p>If one side of your <strong>JOIN</strong> is small (&lt; 5 GB), you can &#8220;broadcast&#8221; the entire data set to your executors. This allows you to do the join without a shuffle, which is much faster</p></li></ul></li><li><p>Bucket <strong>JOIN</strong></p><ul><li><p>If both sides of your <strong>JOIN</strong> are large, you can bucket them first and then do the join, which lets the join itself run without a shuffle. 
Remember you&#8217;ll still have to shuffle the data once to bucket it, but if you&#8217;re doing multiple <strong>JOIN</strong> with this data set it will be worth it!</p></li></ul></li><li><p>Partitioning your data set</p><ul><li><p>Sometimes you&#8217;re just trying to <strong>JOIN</strong> too much data because you should <strong>JOIN</strong> one day of data not multiple. Think about how you could do your <strong>JOIN</strong> with less data</p></li></ul></li><li><p><a href="https://github.com/EcZachly/cumulative-table-design">Leverage cumulative table design</a></p><ul><li><p>Sometimes you&#8217;ll be asked to aggregate multiple days of data for things like &#8220;monthly active users.&#8221; Instead of scanning thirty days of data, leverage cumulative table design to dramatically improve your pipeline&#8217;s performance!</p></li></ul></li></ul></li><li><p>Shuffle can have problems too! What if one team member gets a lot more data than the rest? This is called skew and happens rather frequently! There are a few options here:</p><ul><li><p>In Spark 3+, you can enable adaptive execution. This solves the problem very quickly and I love Databricks for adding this feature!</p></li><li><p>In Spark &lt;3, you can <a href="https://medium.com/curious-data-catalog/sparks-salting-a-step-towards-mitigating-skew-problem-5b2e66791620">salt the </a><strong><a href="https://medium.com/curious-data-catalog/sparks-salting-a-step-towards-mitigating-skew-problem-5b2e66791620">JOIN</a></strong><a href="https://medium.com/curious-data-catalog/sparks-salting-a-step-towards-mitigating-skew-problem-5b2e66791620"> or </a><strong><a href="https://medium.com/curious-data-catalog/sparks-salting-a-step-towards-mitigating-skew-problem-5b2e66791620">GROUP BY</a>. 
</strong>Salting allows you to leverage random numbers so you get a more even distribution of your workload among your team members!</p></li></ul></li></ul></li><li><p>Output data</p><ul><li><p>Most often when you&#8217;re using Spark, you&#8217;re going to be writing Parquet files to S3 or GCP or Azure. Parquet files have some interesting properties that allow for dramatic file size reduction if you use them right. (<a href="https://www.youtube.com/watch?v=hFFP2OYFlTA&amp;t=1229s">an in depth YouTube video about this</a>).</p><ul><li><p>The key property of parquet files that you need to know about is a thing called <a href="https://en.wikipedia.org/wiki/Run-length_encoding#:~:text=Run%2Dlength%20encoding%20(RLE),than%20as%20the%20original%20run.">run length encoding compression</a>. I shrunk <a href="https://www.linkedin.com/feed/update/urn:li:activity:6983870693264281600/">pricing and availability datasets at Airbnb by over 90%</a> by leveraging this technique!</p></li></ul></li><li><p>If you&#8217;re using Snowflake or BigQuery, they use their own proprietary format that I know less about. So find another newsletter for that! Maybe <a href="https://seattledataguy.substack.com/">Seattle Data Guy&#8217;s</a>?</p></li></ul></li></ul><p><strong>What&#8217;s new in 2025?</strong></p><ul><li><p><strong>AI workloads run on your data stack (i.e. 
<a href="https://www.gable.ai/blog/shift-left-data-manifesto">Shift Left</a>)</strong></p><ul><li><p>Embedding generation at scale (GPU clusters with Spark or Ray)</p></li><li><p>Hybrid query engines that support both structured tables + vector search</p></li><li><p>Streaming inference pipelines with Kafka + Flink</p></li></ul></li></ul><div><hr></div><h2>Data Modeling and Data Quality</h2><p>Data engineering is ultimately about delivering <strong>usable, correct, privacy-compliant data.</strong></p><p><br>Here are the ways you can assess data quality:</p><ul><li><p><strong>Data should be correct</strong></p><ul><li><p>You should check for duplicates, NULLs, and proper formatting. Also check that there is the right number of rows in the data set!</p></li></ul></li><li><p><strong>Data should be usable and efficient</strong></p><ul><li><p>This means it should have proper documentation and good column names. The query patterns should also allow for fast answers to questions! Getting those answers shouldn&#8217;t give Jeff Bezos millions of dollars either!</p></li></ul></li><li><p><strong>Data should be privacy-compliant</strong></p><ul><li><p>An often overlooked part of the puzzle. You shouldn&#8217;t be hurting user privacy to make your analytics better!</p></li></ul></li></ul><p>So how do you achieve this output data set nirvana dream? It has multiple parts!</p><ul><li><p>Correctness should be handled in a few ways</p><ul><li><p>The first pass should be via validation from a data analyst. This is a part of the powerful <a href="https://medium.com/airbnb-engineering/data-quality-at-airbnb-e582465f3ef7">MIDAS process</a> at Airbnb that you should check out!</p></li><li><p>After that, you should build automated data quality checks into your pipeline with something like <a href="https://greatexpectations.io/">Great Expectations</a>. 
Remember to follow the <a href="https://www.dremio.com/wp-content/uploads/2022/05/Sam-Redai-The-Write-Audit-Publish-Pattern-via-Apache-Iceberg.pdf">write-audit-publish pattern</a> here so you don&#8217;t publish bad data into production that doesn&#8217;t pass data quality checks!</p></li></ul></li><li><p>Usability and efficiency are handled with a few things</p><ul><li><p>Documentation should be a big part of the process. Spec building and stakeholder sign-off should happen BEFORE you start building your pipeline. This will prevent a lot of redoing and undoing of work!</p></li><li><p>Leveraging efficient practices for your data lake:</p><ul><li><p>Manage your Apache Iceberg snapshots for time-traveling and disaster recovery</p></li><li><p>Set good retention policies on your data so you don&#8217;t have a huge cloud bill for data you do not use!</p></li></ul></li><li><p>Data modeling is going to be the other big piece of this puzzle</p><ul><li><p>There are a few camps here:</p><ul><li><p>Relational data modeling</p><ul><li><p>This type prioritizes data deduplication at the cost of more complex queries. <strong>Think of this as prioritizing storage at the cost of compute</strong>.</p></li></ul></li><li><p>Dimensional (or Kimball) data modeling</p><ul><li><p>This denormalizes the data into facts and dimensions which prioritizes larger queries but duplicates data a bit. <strong>Think of this as trying to balance the costs of storage and compute.</strong></p></li></ul></li><li><p>One Big Table</p><ul><li><p>This denormalizes the data even more where the facts and dimensions are in one table. You duplicate data more but you get extremely efficient queries from it. <strong>Think of this as prioritizing compute at the cost of storage.</strong></p></li></ul></li><li><p>I wrote a long-form article detailing the differences and when to pick which one <a href="https://blog.dataengineer.io/p/how-to-data-model-correctly-kimball">here</a>. 
If you prefer video format you can find it <a href="https://www.youtube.com/watch?v=ltQgbSs99WU">here</a></p></li></ul></li></ul></li></ul></li><li><p>Privacy compliant data sets</p><ul><li><p>Be mindful where you have personally identifiable information in your data sets and don&#8217;t hold onto that longer than you need.</p><ul><li><p>Remember that anything that can bring you back to a user is personally identifiable!</p></li></ul></li><li><p><a href="https://policies.google.com/technologies/anonymization?hl=en-US">Anonymizing the data</a> so you can hold onto it longer is a great strategy that balances user privacy and long-term analytical capabilities</p></li></ul></li></ul><p><strong>New in 2025:</strong></p><ul><li><p><strong>AI-assisted quality checks.</strong> Tools now let you point LLMs at logs or query results to spot anomalies.</p></li><li><p><strong>Semantic modeling.</strong> Using embeddings to cluster or enrich tables with &#8220;semantic meaning.&#8221;</p></li><li><p><strong>Privacy in AI.</strong> You must ensure embeddings don&#8217;t leak personally identifiable information. 
Vector search can accidentally memorize sensitive data if you&#8217;re careless.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_VNH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9289af35-e96f-487b-be54-d3c50e3914a8_2160x2700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_VNH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9289af35-e96f-487b-be54-d3c50e3914a8_2160x2700.png 424w, https://substackcdn.com/image/fetch/$s_!_VNH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9289af35-e96f-487b-be54-d3c50e3914a8_2160x2700.png 848w, https://substackcdn.com/image/fetch/$s_!_VNH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9289af35-e96f-487b-be54-d3c50e3914a8_2160x2700.png 1272w, https://substackcdn.com/image/fetch/$s_!_VNH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9289af35-e96f-487b-be54-d3c50e3914a8_2160x2700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_VNH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9289af35-e96f-487b-be54-d3c50e3914a8_2160x2700.png" width="1456" height="1820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9289af35-e96f-487b-be54-d3c50e3914a8_2160x2700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2964736,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/170786637?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9289af35-e96f-487b-be54-d3c50e3914a8_2160x2700.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_VNH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9289af35-e96f-487b-be54-d3c50e3914a8_2160x2700.png 424w, https://substackcdn.com/image/fetch/$s_!_VNH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9289af35-e96f-487b-be54-d3c50e3914a8_2160x2700.png 848w, https://substackcdn.com/image/fetch/$s_!_VNH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9289af35-e96f-487b-be54-d3c50e3914a8_2160x2700.png 1272w, https://substackcdn.com/image/fetch/$s_!_VNH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9289af35-e96f-487b-be54-d3c50e3914a8_2160x2700.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li></ul>
<div><hr></div><h2>Building a Portfolio Project</h2><p>This is where you separate yourself from the pack.</p><p>You should pick a project that:</p><ul><li><p>You care about deeply. I was obsessed with Magic: The Gathering and that was what allowed me to keep working on that project even though I wasn&#8217;t getting paid</p></li><li><p>You can work on for 3 months at 5-10 hours per week. A portfolio piece shouldn&#8217;t be easy to create. If it was easy to create, everybody would do it and then you wouldn&#8217;t stand out!</p></li><li><p>You build a piece that has a frontend! Another portfolio piece I built was <a href="https://www.halogods.com/">HaloGods.com</a>, a website that allowed me to <a href="https://www.youtube.com/watch?v=dbgK6cx--IY">reach 20th in the world in Halo 5 back in 2016</a>. 
Without a frontend, your data pipelines are a little bit harder to show off. This is why learning a skill like Tableau or Power BI can be a really solid way to make your portfolio shine even if those aren&#8217;t skills you end up using on the job!</p></li><li><p>You implement the following things in your portfolio</p><ul><li><p>Comprehensive documentation that details all the input and output data sets, the data quality checks, how often the pipeline runs, etc.</p></li><li><p>You have a pipeline running in production. Using something like Databricks Free Edition would be a great place to get started!</p></li><li><p>You leverage hot technologies like Spark, Snowflake, vector databases, Iceberg, and/or Delta Lake.</p></li><li><p>If you want to stand out, you build a JavaScript frontend like I did with HaloGods, and you&#8217;ll find yourself landing jobs in big tech even though you didn&#8217;t go to Stanford, MIT or IIT.</p></li></ul></li></ul><p>In 2025, you&#8217;ll want something that shows both <strong>data pipelines</strong> and <strong>AI integration. Remember, you can use AI to generate the boilerplate and then add your own details to get started!</strong></p><p>Ideas:</p><ul><li><p>E-commerce pipeline &#8594; warehouse model &#8594; AI recommendation system</p></li><li><p>Ingesting data from <a href="https://polygon.io/">Polygon.io</a> &#8594; stock market model &#8594; AI stock picker system</p></li><li><p>YouTube transcript ingestion &#8594; Iceberg tables &#8594; semantic search UI with RAG</p></li><li><p>Automated data quality alerts using GPT validators</p></li></ul><p>Remember, <strong>you might have to pay a tiny bit of money to build these projects</strong>. But it will be worth it over the long run!</p><div><hr></div><h2>Building a Personal Brand</h2><p>So you have the skills and the sexy portfolio project. 
Now all you have to do is not mess up the job interview and you&#8217;ll be golden.<br><br>Here are some things you should do to get there:</p><ul><li><p>Build relationships on LinkedIn</p><ul><li><p>You should be building relationships with</p><ul><li><p>Hiring managers and recruiters</p></li><li><p>Peers</p></li></ul></li><li><p>You should start talking with people <strong>BEFORE</strong> you ask for a referral, building up friendships and your network first. </p><ul><li><p>If you need a referral, it is better to DM people and ask them what the job is like. Leading with questions instead of asks will have a much higher hit rate! </p></li><li><p>Also remember that this DM game is a low-percentage play. If you send out 20 DMs, 1 or 2 people might respond, especially if you&#8217;re early in your career. Finding creators and employees who are near your own level will help increase that hit rate!</p></li></ul></li><li><p>Create your own content and talk about your learning journey. You&#8217;d be surprised how effective this is at landing you opportunities you didn&#8217;t even think were possible. For example, <a href="https://blog.dataengineer.io/p/how-i-quit-my-600k-data-engineering">I made $600k from LinkedIn in 7 months</a> after quitting my job! Content creation and branding are a very powerful combo that can change your life!</p><ul><li><p>Nowadays, you can use AI to aid in the content creation process! Just remember to remove the ugly emojis, em dashes, and </p></li></ul></li></ul></li><li><p>Interview like a person, not a robot</p><ul><li><p>When you go into the interview make sure you have:</p><ul><li><p>Researched the people who are interviewing you. 
You should know</p><ul><li><p>How long they&#8217;ve worked for the company</p></li><li><p>What they do for the company</p></li></ul></li><li><p>Asked the recruiter a lot of detailed questions about the role</p><ul><li><p>What technologies will I be using?</p></li><li><p>How many people will be on my team?</p></li><li><p>What is the culture like?</p></li></ul></li></ul></li><li><p>During the interview you should radiate</p><ul><li><p>positivity, enthusiasm and excitement for the role</p></li><li><p>competence and calm when asked questions</p></li><li><p>curiosity, both by engaging with even the stupid interview questions and by asking good follow-up questions about the role and what you&#8217;ll be doing</p></li></ul></li><li><p>Demonstrate technical skills during the interview</p><ul><li><p><a href="https://blog.dataexpert.io/p/how-to-pass-data-engineering-sql">The SQL Interview</a></p></li><li><p><a href="https://blog.dataexpert.io/p/how-to-pass-the-data-modeling-round">The Data Modeling Interview</a></p></li><li><p><a href="https://blog.dataexpert.io/p/how-to-pass-the-data-architecture">The Data Architecture Interview</a></p></li><li><p><a href="https://blog.dataexpert.io/p/the-hard-truth-about-data-engineering">The Data Structures and Algorithms Interview</a></p></li></ul></li></ul></li></ul><div><hr></div><h2>Conclusion</h2><p>Getting into data engineering in 2025 is still hard. The job market is competitive. But the best engineers and the best startups are built in times like these.</p><p>The difference now? <strong>AI is part of the job. </strong>You don&#8217;t need to be an AI researcher, but you <em>do</em> need to know how to integrate AI responsibly into your pipelines.</p><p>Follow this roadmap and you&#8217;ll be much closer to landing the data engineering role of your dreams.<br><br>I am launching a new AI Engineering Boot Camp starting on October 20th, where we will cover all the new AI topics necessary to be a good AI-enabled data engineer. 
You can get 30% off with code <strong>AIROADMAP <a href="https://www.dataexpert.io/program/the-ai-engineering-challenge-starting-october-20th-2025?code=AIROADMAP">here</a>.<br><br></strong>What else do you think is critical to learn in data engineering to excel in the field? Anything I missed here that you would add? Please share this with your friends and on LinkedIn if you found it useful!</p><p></p><p></p><p><br></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.dataexpert.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">DataExpert.io Newsletter is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Stop grinding leetcode for data engineer interviews!]]></title><description><![CDATA[Landing a role in big tech and &#8220;grinding leetcode&#8221; have gone together like peanut butter and jelly for the last ten years.]]></description><link>https://blog.dataexpert.io/p/how-ai-will-change-data-engineer</link><guid isPermaLink="false">https://blog.dataexpert.io/p/how-ai-will-change-data-engineer</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Wed, 24 Sep 2025 22:21:27 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!TI5-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc79edfe-d564-4f3e-bbfb-0bbd38e85366_929x674.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Landing a role in big tech and &#8220;grinding leetcode&#8221; have gone together like peanut butter and jelly for the last ten years. <br>This world is changing rapidly though. <a href="https://www.linkedin.com/in/roy-lee-goat/">Roy Lee</a> created <a href="https://www.interviewcoder.co/?utm_source=dataexpert">InterviewCoder</a> to cheat on these &#8220;leetcode-style&#8221; interviews and landed multiple offers from big tech with it. Instead of fighting the trend, <a href="https://www.wired.com/story/meta-ai-job-interview-coding/">Meta announced they will allow cand&#8230;</a></p>
      <p>
          <a href="https://blog.dataexpert.io/p/how-ai-will-change-data-engineer">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[DuckDB benchmarked against Spark]]></title><description><![CDATA[You Don't Always Need A Sledgehammer]]></description><link>https://blog.dataexpert.io/p/duckdb-can-be-100x-faster-than-spark</link><guid isPermaLink="false">https://blog.dataexpert.io/p/duckdb-can-be-100x-faster-than-spark</guid><dc:creator><![CDATA[Matt Martin]]></dc:creator><pubDate>Mon, 22 Sep 2025 20:13:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SvIv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SvIv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SvIv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png 424w, https://substackcdn.com/image/fetch/$s_!SvIv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png 848w, https://substackcdn.com/image/fetch/$s_!SvIv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SvIv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SvIv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png" width="825" height="521" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:521,&quot;width&quot;:825,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80968,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://performancede.substack.com/i/170260710?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!SvIv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png 424w, https://substackcdn.com/image/fetch/$s_!SvIv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png 848w, 
https://substackcdn.com/image/fetch/$s_!SvIv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png 1272w, https://substackcdn.com/image/fetch/$s_!SvIv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa47e2b-097b-4adc-b19f-995143a8e13f_825x521.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Introduction</h2><p><a href="https://en.wikipedia.org/wiki/Apache_Spark">Apache Spark</a> has been the de facto open source data 
processing engine for fifteen years. It was invented to solve a major problem that traditional data warehousing was not built to solve: processing massive amounts of data horizontally at scale <em>(<a href="https://www.youtube.com/watch?v=g23GHqJje40">Zach used Spark to process 2,000 TB per day at Netflix</a>)</em>, whether in a structured or semi-structured for&#8230;</p>
      <p>
          <a href="https://blog.dataexpert.io/p/duckdb-can-be-100x-faster-than-spark">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Migrating 13,000 Iceberg Tables in 4 hours to Glue Catalog]]></title><description><![CDATA[At midnight on September 16th, Jason Reid (data engineering advocate at Databricks) messaged me on LinkedIn saying, &#8220;Tabular will be sunsetted in 24 hours.]]></description><link>https://blog.dataexpert.io/p/how-i-migrated-13000-iceberg-tables</link><guid isPermaLink="false">https://blog.dataexpert.io/p/how-i-migrated-13000-iceberg-tables</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Wed, 17 Sep 2025 22:12:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_hl4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ee83343-fe15-4eb3-8cd7-fc35eaaeea41_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At midnight on September 16th, <a href="https://www.linkedin.com/in/jasonreid/">Jason Reid</a> (data engineering advocate at Databricks) messaged me on LinkedIn saying, <strong>&#8220;Tabular will be sunsetted in 24 hours. I hope you have migrated.&#8221;</strong> My lazy ass had not. <br><br>Panic immediately set in. I had 13,000 tables, 2,200 schemas and 3 terabytes of data managed by Tabular that my students had generated over the last tw&#8230;</p>
      <p>
          <a href="https://blog.dataexpert.io/p/how-i-migrated-13000-iceberg-tables">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Three Free Tech Bootcamps That Could Change Your Career]]></title><description><![CDATA[How to become a Solutions Architect, Mastering the Foundations of Cybersecurity and the Absolute Basics in Data Engineering]]></description><link>https://blog.dataexpert.io/p/three-free-tech-bootcamps-that-could</link><guid isPermaLink="false">https://blog.dataexpert.io/p/three-free-tech-bootcamps-that-could</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Wed, 27 Aug 2025 15:02:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bFEL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bFEL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bFEL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png 424w, https://substackcdn.com/image/fetch/$s_!bFEL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png 848w, https://substackcdn.com/image/fetch/$s_!bFEL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png 1272w, 
https://substackcdn.com/image/fetch/$s_!bFEL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bFEL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png" width="1000" height="400" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:591837,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/171808416?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bFEL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png 424w, https://substackcdn.com/image/fetch/$s_!bFEL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png 848w, https://substackcdn.com/image/fetch/$s_!bFEL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png 
1272w, https://substackcdn.com/image/fetch/$s_!bFEL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d05f6d9-d5ce-442a-8679-a97a26d5dbea_1000x400.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This month, I&#8217;m excited to present <strong>three fantastic opportunities</strong> to learn from top-tier practitioners without spending a dime. 
<br>Whether you&#8217;re trying to break into tech or looking to make a strategic pivot in your career, these free bootcamps are packed with real-world lessons, hands-on skills, and instruction from people who&#8217;ve been in the trenches.</p><p>We a&#8230;</p>
      <p>
          <a href="https://blog.dataexpert.io/p/three-free-tech-bootcamps-that-could">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Navigating AI's New Frontier with Chip Huyen]]></title><description><![CDATA[Building, Scaling and Thinking in the Age of AI]]></description><link>https://blog.dataexpert.io/p/navigating-ais-new-frontier-with</link><guid isPermaLink="false">https://blog.dataexpert.io/p/navigating-ais-new-frontier-with</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Mon, 25 Aug 2025 13:03:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kKrp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kKrp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kKrp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png 424w, https://substackcdn.com/image/fetch/$s_!kKrp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png 848w, https://substackcdn.com/image/fetch/$s_!kKrp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png 1272w, 
https://substackcdn.com/image/fetch/$s_!kKrp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kKrp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png" width="1456" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1001538,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/170787571?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kKrp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png 424w, https://substackcdn.com/image/fetch/$s_!kKrp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png 848w, 
https://substackcdn.com/image/fetch/$s_!kKrp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png 1272w, https://substackcdn.com/image/fetch/$s_!kKrp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc0bcec-d230-481d-ac40-8b5711dc2e5a_1500x600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There are some conversations that confirm what you already believe, and then there are conversations that take what you believe 
and frame it in a way that makes it actionable, precise and inevitable.</p><p>That&#8217;s what happened when I sat down with <strong>Chip Huyen </strong>during a <em>Tech Talk</em> we held at the <a href="https://learn.dataexpert.io/">DataExpert.io</a> AI Engineering Boot Camp.</p><p>Chip is an AI researcher, former big tech engineer, best-selling author and one of the most lucid thinkers on AI systems working today. Recently, she&#8217;s been in the public spotlight with her new book <a href="https://www.oreilly.com/library/view/ai-engineering/9781098166298/">AI Engineering</a>, which has quickly become the most comprehensive, well-structured guide to the essential aspects of building generative AI systems (we covered a lot of its content in the boot camp too).</p><p>This talk was neither a book launch nor a formal Q&amp;A. It was an honest, refreshing and grounded conversation that can be distilled into seven core takeaways, each one capturing ideas Chip shared that stuck with me, challenged me or reshaped how I think about building and leading in AI. This article covers the following:</p><ul><li><p>Chip&#8217;s journey into AI</p></li><li><p>Where most GenAI products go wrong</p></li><li><p>The underrated value of UX</p></li><li><p>How to build functional AI agents</p></li><li><p>What it really takes to ship value in the modern AI stack</p></li><li><p>How to stay informed without burning out</p></li><li><p>Building with clarity and conviction</p></li></ul><p>If you want to learn from other brilliant minds like Chip, we are launching a <strong><a href="https://www.dataexpert.io/">10-week Challenge Boot Camp</a></strong> on <strong>Sep 15th</strong> where we will be hosting insightful tech talks with 15 industry leaders in the data, analytics and AI engineering space. 
The first 5 people to register can use code <strong><a href="https://www.dataexpert.io/CHIP">CHIP</a></strong> for 30% off!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.dataexpert.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">DataExpert.io Newsletter is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>&#9997;&#127995; #1: The Primacy of Compute and Data</h2><p>To understand Chip&#8217;s journey into AI, we have to rewind to 2012, the year Deep Learning truly exploded onto the scene. That was the year <em>AlexNet</em>, a deep convolutional neural network, won the ImageNet competition by a massive margin (over 10 percentage points better than the next best model).</p><p>AlexNet rewrote the rules that have defined the last decade of AI. And notably, one of the paper's co-authors, <a href="https://en.wikipedia.org/wiki/Ilya_Sutskever">Ilya Sutskever</a>, would go on to co-found OpenAI, the organization that would later lead the charge on scaling up large language models.</p><p>Chip recounted how a single sentence from that 2012 paper changed her life trajectory:</p><blockquote><p>&#8220;Our experiments show that we can achieve better results by just waiting for more compute and more data.&#8221;</p></blockquote><p>That line reframed AI not as a field of breakthroughs, but as one of <strong>compounding scale</strong>. 
Chip went on to work at NVIDIA to understand compute infrastructure and later joined Snorkel AI to understand data workflows.</p><p>She also reflected on how OpenAI was initially dismissed for simply <em>scaling up</em>, with many academics saying it wasn't <em>real research</em>. But the turning point came in 2020 when the GPT-3 paper received a best paper award, and the conversation suddenly shifted.</p><blockquote><p>Everyone was like, &#8216;Wow, now it&#8217;s real research.&#8217;</p></blockquote><p>&#128161; I remember similar skepticism back in my days at Facebook. People dismissed what OpenAI was doing as brute force. It turns out brute force was the insight.</p><p>This key takeaway reframes my view of AI progress not as a parade of novel ideas but as an engineering problem of sufficient scale. Chip's clarity here gives me language to explain why things like GPT-5 didn't just appear but emerged from compute and data discipline.</p><h2>&#9997;&#127995; #2: The GenAI Hype Cycle and the Misuse of ML</h2><p>What makes Chip&#8217;s take on GenAI refreshing is that her relationship with AI long predates the hype. Before ChatGPT became a buzzword, she was building simple algorithms to test logic, even designing games her smartest friends couldn&#8217;t win. Not to outsmart them, but to understand how reasoning could be codified.</p>
      <p>
          <a href="https://blog.dataexpert.io/p/navigating-ais-new-frontier-with">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Stopping Silent Failures for Meta's Fake Accounts Pipeline]]></title><description><![CDATA[Data Orchestration Challenges I Faced at Airbnb, Netflix & Facebook &#8211; Part IV]]></description><link>https://blog.dataexpert.io/p/saving-metas-fake-accounts-pipeline</link><guid isPermaLink="false">https://blog.dataexpert.io/p/saving-metas-fake-accounts-pipeline</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Tue, 12 Aug 2025 19:38:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9n91!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of my final projects at Facebook was owning the data pipeline that tracked fake accounts. It may sound simple but, in reality, it was one of the most deceptively complex orchestration problems I&#8217;ve ever encountered, and it was made worse by a hidden upstream design choice that prioritized speed of delivery over data consistency.</p><p>Fake accounts come and go. Some are flagged incorrectly, others are later verified, and many are caught by internal ML systems. The goal of our pipeline was to trace the <strong>inflows and outflows</strong> of fake accounts daily. That meant building a reliable dataset that could track:</p><ul><li><p>Accounts <strong>unlabeled </strong>as fake (i.e. after submitting a valid ID)</p></li><li><p>Accounts <strong>relabeled </strong>as fake</p></li><li><p>Accounts flagged as fake<strong> for the first time</strong></p></li><li><p>Accounts that<strong> remained fake</strong></p></li></ul><p>The pattern was very straightforward: a classic cumulative table design. But the way it was wired to upstream data, specifically how it &#8220;waited&#8221; for inputs, created a non-deterministic nightmare. 
For weeks, I chased what I thought was a bug in my code, only to discover that the real problem had been there from day one.</p><p>In this article, I&#8217;ll cover the following aspects:</p><ul><li><p>Why relying on &#8220;latest&#8221; partition data broke everything</p></li><li><p>How upstream non-determinism leads to silent data mismatches</p></li><li><p>The simple fix using explicit partition dates</p></li><li><p>The tradeoff between latency and reproducibility</p></li><li><p>Engineering lessons that go beyond code</p></li></ul><p>If you want to learn more in depth about patterns like this, the <a href="https://www.dataexpert.io/FAKE">DataExpert.io</a> academy subscription has 200+ hours of content about system design, streaming pipelines, etc. The first 5 people can use code <strong>FAKE</strong> for 30% off!</p><p>If you enjoy this article, here are some more from my time in big tech:</p><ul><li><p><a href="https://blog.dataexpert.io/p/how-i-prepared-for-a-staff-data-engineer">How I prepared for Airbnb&#8217;s staff data engineer interview</a></p></li><li><p><a href="https://blog.dataexpert.io/p/how-i-got-a-12x-speed-up-in-a-50">How I achieved a 12x speed up on Facebook notification pipelines</a></p></li><li><p><a href="https://blog.dataexpert.io/p/scaling-netflixs-threat-detection">How I used the &#8220;Psycho&#8221; pattern to detect threats at Netflix</a></p></li><li><p><a href="https://blog.dataexpert.io/p/how-i-made-airbnb-millions-with-this">How I cut Airbnb&#8217;s pricing data backfill time by 95%</a></p></li></ul>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9n91!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9n91!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png 424w, https://substackcdn.com/image/fetch/$s_!9n91!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png 848w, https://substackcdn.com/image/fetch/$s_!9n91!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!9n91!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!9n91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png" width="1456" height="1048" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:727375,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/169466001?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9n91!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png 424w, https://substackcdn.com/image/fetch/$s_!9n91!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png 848w, https://substackcdn.com/image/fetch/$s_!9n91!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!9n91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60dc46a8-f7ab-48f0-902f-91efb61253a8_2000x1440.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Understanding Fake Account Flows</h2><p>At a high level, the pipeline&#8217;s job was to <strong>compare today&#8217;s and yesterday&#8217;s fake account snapshots</strong>, determine who had entered or exited the fake state, and store those inflow/outflow transitions for downstream analytics.</p><p>This was a classic cumulative table design pattern, built in plain vanilla SQL, which tracked four main fake states:</p><ul><li><p><strong>New fakes</strong> &#8594; New people who got labeled as fake</p></li><li><p><strong>Resolved accounts</strong> &#8594; People who were labeled fake 
earlier than today and passed a challenge to remove the label</p></li><li><p><strong>Persisting fakes</strong> &#8594; People who were labeled fake earlier than today and haven&#8217;t passed a challenge</p></li><li><p><strong>Relabeled fakes</strong> &#8594; People who were labeled fake, passed a challenge, and then continued to do fake activity</p></li></ul><p>Originally, it was set up like this:</p><ol><li><p>Fake accounts pipeline waited on the &#8220;latest&#8221; partition of the users table</p></li><li><p>Then joined it with fake_accounts to compute transitions</p></li><li><p>The job ran daily and published results</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NoFD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F991e8831-6cb8-46ee-a6cb-7601cefacb39_2160x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NoFD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F991e8831-6cb8-46ee-a6cb-7601cefacb39_2160x1400.png 424w, https://substackcdn.com/image/fetch/$s_!NoFD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F991e8831-6cb8-46ee-a6cb-7601cefacb39_2160x1400.png 848w, https://substackcdn.com/image/fetch/$s_!NoFD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F991e8831-6cb8-46ee-a6cb-7601cefacb39_2160x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!NoFD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F991e8831-6cb8-46ee-a6cb-7601cefacb39_2160x1400.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NoFD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F991e8831-6cb8-46ee-a6cb-7601cefacb39_2160x1400.png" width="1456" height="944" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/991e8831-6cb8-46ee-a6cb-7601cefacb39_2160x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:944,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:365266,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/169466001?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F991e8831-6cb8-46ee-a6cb-7601cefacb39_2160x1400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NoFD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F991e8831-6cb8-46ee-a6cb-7601cefacb39_2160x1400.png 424w, https://substackcdn.com/image/fetch/$s_!NoFD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F991e8831-6cb8-46ee-a6cb-7601cefacb39_2160x1400.png 848w, https://substackcdn.com/image/fetch/$s_!NoFD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F991e8831-6cb8-46ee-a6cb-7601cefacb39_2160x1400.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NoFD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F991e8831-6cb8-46ee-a6cb-7601cefacb39_2160x1400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><pre><code>-- Define struct and enum types

-- PostgreSQL's CREATE TYPE has no IF NOT EXISTS clause, so run this once at setup
CREATE TYPE daily_detection_stats AS (
    detection_date DATE,
    login_attempts INTEGER,
    friend_requests_sent INTEGER,
    posts_created INTEGER,
    flagged_reports INTEGER
);

CREATE TYPE fake_classification AS ENUM('new', 'resolved', 'persisting', 'relabeled');

-- Create cumulative table
CREATE TABLE IF NOT EXISTS fake_accounts
(
    account_id TEXT,
    country TEXT,
    sign_up_method TEXT,
    sign_up_date DATE,
    daily_detection_stats daily_detection_stats[],
    fake_classification fake_classification,
    days_since_last_detected INTEGER,
    current_detection_date DATE,
    PRIMARY KEY (account_id, current_detection_date)
);


-- Dynamically pick the "latest" date
WITH latest_date AS (
    SELECT MAX(detection_date) AS detection_date FROM account_daily_signals
),
yesterday AS (
    SELECT * 
    FROM fake_accounts 
    WHERE current_detection_date = (SELECT detection_date - INTERVAL '1 day' FROM latest_date)
),
today AS (
    SELECT * FROM account_daily_signals 
    WHERE detection_date = (SELECT detection_date FROM latest_date)
)

-- Non-idempotent insert
INSERT INTO fake_accounts
SELECT
    COALESCE(t.account_id, y.account_id) AS account_id,
    COALESCE(t.country, y.country) AS country,
    COALESCE(t.sign_up_method, y.sign_up_method) AS sign_up_method,
    COALESCE(t.sign_up_date, y.sign_up_date) AS sign_up_date,
    CASE
        WHEN y.daily_detection_stats IS NULL THEN ARRAY[ROW(
            t.detection_date,
            t.login_attempts,
            t.friend_requests_sent,
            t.posts_created,
            t.flagged_reports
        )::daily_detection_stats]
        WHEN t.detection_date IS NOT NULL THEN y.daily_detection_stats || ARRAY[ROW(
            t.detection_date,
            t.login_attempts,
            t.friend_requests_sent,
            t.posts_created,
            t.flagged_reports
        )::daily_detection_stats]
        ELSE y.daily_detection_stats
    END AS daily_detection_stats,
    CASE
        WHEN y.account_id IS NULL THEN 'new'
        WHEN t.account_id IS NULL THEN 'resolved'
        WHEN t.detection_date IS NOT NULL AND y.fake_classification = 'resolved' THEN 'relabeled'
        ELSE 'persisting'
    END::fake_classification AS fake_classification,

    CASE
        WHEN t.detection_date IS NOT NULL THEN 0
        ELSE y.days_since_last_detected + 1
    END AS days_since_last_detected,
    COALESCE(t.detection_date, y.current_detection_date + INTERVAL '1 day')::DATE AS current_detection_date

FROM today t
FULL OUTER JOIN yesterday y
ON t.account_id = y.account_id;</code></pre><h2><strong>The Architecture That Broke Everything</strong></h2>
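<p>For contrast with the query above, which resolves both of its dates from <code>MAX(detection_date)</code> at run time, here is a minimal sketch of the same two CTEs pinned to an explicit run date. The <code>params</code> CTE and the literal date are hypothetical placeholders for whatever value your scheduler passes in; they are not part of the original pipeline, and the sketch assumes PostgreSQL, where subtracting an integer from a date yields a date.</p><pre><code>-- Sketch only: run_date is a hypothetical, scheduler-supplied value
WITH params AS (
    SELECT DATE '2025-08-11' AS run_date
),
yesterday AS (
    SELECT f.*
    FROM fake_accounts f
    CROSS JOIN params p
    WHERE f.current_detection_date = p.run_date - 1
),
today AS (
    SELECT s.*
    FROM account_daily_signals s
    CROSS JOIN params p
    WHERE s.detection_date = p.run_date
)
-- Row counts for the two pinned partitions; a rerun for the same
-- run_date reads the same two dates, however much newer data has landed
SELECT
    (SELECT COUNT(*) FROM yesterday) AS yesterday_rows,
    (SELECT COUNT(*) FROM today) AS today_rows;</code></pre><p>With the date passed in rather than discovered, the two inputs no longer depend on when the job happens to start, which is the property the &#8220;latest&#8221; lookup cannot give you.</p>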
      <p>
          <a href="https://blog.dataexpert.io/p/saving-metas-fake-accounts-pipeline">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How I got a 12x speed up in a 50 TB pipeline at Meta]]></title><description><![CDATA[Data Orchestration Challenges I Faced at Airbnb, Netflix & Facebook &#8211; Part III]]></description><link>https://blog.dataexpert.io/p/how-i-got-a-12x-speed-up-in-a-50</link><guid isPermaLink="false">https://blog.dataexpert.io/p/how-i-got-a-12x-speed-up-in-a-50</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Mon, 04 Aug 2025 19:39:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7U8i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my time at Facebook, I worked on <em><strong>Notifications</strong></em><strong> </strong>which, along with <em>Messages </em>and<em> Ads</em>, was the most volume-heavy pipeline in the company. Every ping you get from likes, tags, shares, comments, events is backed by mountains of notification data.</p><p>One of my most challenging assignments was owning the pipeline that deduplicated all notification events. This dataset drove downstream metrics like CTRs, conversions, and even machine learning signal quality.</p><p>This pipeline presented one big problem: it was slow. Very, very slow.</p><p>When I joined Facebook, the deduped notifications pipeline ran a <strong>giant Hive GROUP BY job once a day at UTC midnight</strong> which took <strong>9.5 hours</strong> to complete. 
This latency issue represented a huge <strong>bottleneck for all downstream models </strong>in the Core Growth team.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7U8i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7U8i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png 424w, https://substackcdn.com/image/fetch/$s_!7U8i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png 848w, https://substackcdn.com/image/fetch/$s_!7U8i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!7U8i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7U8i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:726551,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/169453671?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7U8i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png 424w, https://substackcdn.com/image/fetch/$s_!7U8i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png 848w, https://substackcdn.com/image/fetch/$s_!7U8i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!7U8i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0631c986-9b08-4e6e-bdf6-bda599bb1913_2000x1440.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>This article is the story of how I brought 9.5-hour latency down to <strong>45 minutes</strong>, and what it taught me about I/O, orchestration, and never trusting a &#8220;simple&#8221; DAG.</p><p>Here I&#8217;ll be covering the following:</p><ul><li><p>Why streaming deduplication at scale failed</p></li><li><p>The hourly dedup job that exploded compute usage</p></li><li><p>A tree-based DAG design that saved the day</p></li><li><p>Key orchestration lessons from building a 300-step daily DAG</p></li><li><p>How this project got me promoted and why I almost gave up</p></li></ul><p>If you want to learn more in depth about patterns like this, the <a href="https://www.dataexpert.io/DEDUP">DataExpert.io</a> academy subscription has 200+ hours of content about system design, streaming pipelines, etc. 
The first 5 people can use code <strong>DEDUP</strong> for 30% off!</p><h2>The Stakes: Notifications at Facebook Scale</h2><p>The <strong>notif_events</strong> table contained every event tied to a notification:</p><ul><li><p><strong>Sent</strong> to your phone</p></li><li><p><strong>Delivered</strong> to device</p></li><li><p><strong>Clicked</strong></p></li><li><p><strong>Converted</strong></p></li></ul><p>Since one person might click the same notification <strong>multiple times</strong>, we had to dedup those. If we counted every click, we&#8217;d get click-through rates above 100%, which negatively impacted metric tracking and, worse, model training.</p><p>But the problem wasn&#8217;t logic. It was latency.</p><p>As I mentioned earlier, the dedup job ran once a day via Hive &amp; took 9.5 hours to complete. My managers wanted me to reduce latency dramatically.</p><h3>Approach 1: Stream It &#128547;</h3><p>At first, my manager&#8217;s request was straightforward: &#8216;<em>Let&#8217;s dedup in real-time&#8217;</em>.</p><p>So, I tried. 
I built a Spark Streaming job that listened to notification events and tried to hold recent activity in memory for comparison. But this approach meant holding <strong>50+ terabytes in RAM</strong> to do real-time deduping.</p><p>This wasn&#8217;t feasible. The streaming job crumbled under memory pressure. I had to come up with a better solution.</p><h3>Approach 2: Hourly Dedup + Merge &#129300;</h3><p>Our second approach was to dedup every hour. The idea was simple:</p><ol><li><p>Dedup the current hour&#8217;s data and write it to a sorted, bucketed table <a href="https://towardsdev.com/spark-beyond-basics-smb-join-in-apache-spark-no-shuffle-join-3c0559105b87">(to minimize shuffle later with SMB join)</a></p></li><li><p>Merge it with the previous hour&#8217;s table (a cumulative table of all previous hours&#8217; deduped data).</p></li><li><p>Output a deduped table up to the current hour.</p></li></ol><p>Deduping each hour and merging it into the previous hour&#8217;s cumulative table can be implemented with this simple SQL pattern inside your DAG:</p><pre><code><code>-- Hourly dedup logic 
-- Remember this table is sorted and bucketed on user_id

INSERT OVERWRITE TABLE notif_deduped_hourly PARTITION (ds='{{ ds }}', hour='{{ current_hour }}', channel='{{ channel }}')
  SELECT 
     notif_id,
     user_id, 
     -- count the number of events of each type
     -- a custom UDF that returned a MAP {"sent":1, "clicked":3}
     COUNT_MAP(event_type) as event_map_count
  FROM notif_events
  WHERE event_hour = '{{ current_hour }}'
  AND ds = '{{ ds }}'
  AND channel = '{{ channel }}'
  GROUP BY notif_id, user_id</code></code></pre><pre><code>-- Then we merged the cumulative previous hours and the current hour with FULL OUTER JOIN

INSERT OVERWRITE TABLE notif_deduped_combined_hourly PARTITION (ds='{{ ds }}', hour='{{ current_hour }}', channel='{{ channel }}')
WITH dedup_current_hour AS (
  SELECT 
     *
  FROM notif_deduped_hourly
  WHERE hour = '{{ current_hour }}'
  AND ds = '{{ ds }}'
  AND channel = '{{ channel }}'
),
previous_hour AS (
   SELECT 
     *
  FROM notif_deduped_combined_hourly
  WHERE hour = '{{ previous_hour }}'
  AND ds = '{{ ds }}'
  AND channel = '{{ channel }}'
)

SELECT  
   COALESCE(c.notif_id, p.notif_id) as notif_id,
   COALESCE(c.user_id, p.user_id) as user_id,
   -- udf that merges the keys of two maps
   -- {"sent": 1, "clicked": 3} + {"clicked":4} 
   -- = {"sent":1, "clicked": 7}
   COMBINE_MAPS(c.event_map_count, p.event_map_count)  as event_map_count
FROM dedup_current_hour c FULL OUTER JOIN previous_hour p 
ON c.notif_id = p.notif_id 
-- This condition triggers the SMB join because both tables are sorted and bucketed on user_id
AND c.user_id = p.user_id</code></pre><p>This approach actually worked. It lowered latency and produced correct results.</p><p>But it had a huge flaw: <strong>compute usage exploded. </strong>It used <strong>15 times </strong>more compute than the original 9.5-hour GROUP BY job.</p><p>How was that possible? Quite straightforward:</p><ul><li><p>Hour 1 processes 1 hour of data</p></li><li><p>Hour 2 reads and merges 2 hours</p></li><li><p>Hour 3 reads and merges 3 hours&#8230;</p></li></ul><p>By hour 22, you find yourself <strong>reprocessing nearly the entire day&#8217;s data</strong> on every run.</p><p>It looked like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4ELm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aaf7c75-1d7a-4a89-ba1c-ffbbe8c8f7ba_2160x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4ELm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aaf7c75-1d7a-4a89-ba1c-ffbbe8c8f7ba_2160x1400.png 424w, https://substackcdn.com/image/fetch/$s_!4ELm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aaf7c75-1d7a-4a89-ba1c-ffbbe8c8f7ba_2160x1400.png 848w, https://substackcdn.com/image/fetch/$s_!4ELm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aaf7c75-1d7a-4a89-ba1c-ffbbe8c8f7ba_2160x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!4ELm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aaf7c75-1d7a-4a89-ba1c-ffbbe8c8f7ba_2160x1400.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4ELm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aaf7c75-1d7a-4a89-ba1c-ffbbe8c8f7ba_2160x1400.png" width="1456" height="944" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4aaf7c75-1d7a-4a89-ba1c-ffbbe8c8f7ba_2160x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:944,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:368565,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/169453671?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aaf7c75-1d7a-4a89-ba1c-ffbbe8c8f7ba_2160x1400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4ELm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aaf7c75-1d7a-4a89-ba1c-ffbbe8c8f7ba_2160x1400.png 424w, https://substackcdn.com/image/fetch/$s_!4ELm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aaf7c75-1d7a-4a89-ba1c-ffbbe8c8f7ba_2160x1400.png 848w, https://substackcdn.com/image/fetch/$s_!4ELm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aaf7c75-1d7a-4a89-ba1c-ffbbe8c8f7ba_2160x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!4ELm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aaf7c75-1d7a-4a89-ba1c-ffbbe8c8f7ba_2160x1400.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Once again, this solution was not sustainable, especially not for one of Facebook&#8217;s biggest datasets.</p><h3>Approach 3: Tree-Style Merge DAG &#129321;</h3>
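<p>To see why a different DAG shape was needed, the cost of the hourly chain in Approach 2 can be estimated with a back-of-the-envelope query. This is a rough model, not a measurement: it assumes, as in the bullets above, that the run for hour <em>h</em> touches <em>h</em> hours of data, and it counts only data read, not join or write cost. It uses PostgreSQL&#8217;s <code>generate_series</code> purely as a calculator.</p><pre><code>-- Rough model: the run for hour h reads h hours of data
SELECT
    SUM(h) AS hour_units_chained,          -- 1 + 2 + ... + 24 = 300
    24 AS hour_units_single_pass,          -- one daily pass reads 24
    SUM(h) / 24.0 AS read_amplification    -- 300 / 24 = 12.5
FROM generate_series(1, 24) AS h;</code></pre><p>Under these assumptions the chain reads about 12.5 times as much data as a single daily pass, which is in the same ballpark as the 15x compute increase reported above.</p>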
      <p>
          <a href="https://blog.dataexpert.io/p/how-i-got-a-12x-speed-up-in-a-50">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Scaling Netflix's threat detection pipelines without streaming]]></title><description><![CDATA[Data orchestration challenges I faced at Netflix, Airbnb, & Facebook (Part II)]]></description><link>https://blog.dataexpert.io/p/scaling-netflixs-threat-detection</link><guid isPermaLink="false">https://blog.dataexpert.io/p/scaling-netflixs-threat-detection</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Fri, 25 Jul 2025 19:06:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5PEv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd044320f-324d-4967-87e9-360ac5bbb267_2000x1440.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Back in 2018, I was part of Netflix&#8217;s real-time threat detection team. I owned the orchestration and delivery layer of a detection pipeline that flagged fraudulent behavior, security breaches, and abuse patterns across our global platform.</p><p>At the time, we were using a creative hybrid architecture internally dubbed the <strong>&#8220;Psycho Pattern.&#8221;</strong> Think of i&#8230;</p>
      <p>
          <a href="https://blog.dataexpert.io/p/scaling-netflixs-threat-detection">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How I cut Airbnb's Pricing pipeline backfill time 95%]]></title><description><![CDATA[Data Orchestration challenges I faced in my years at Airbnb, Netflix & Facebook (Part I)]]></description><link>https://blog.dataexpert.io/p/how-i-made-airbnb-millions-with-this</link><guid isPermaLink="false">https://blog.dataexpert.io/p/how-i-made-airbnb-millions-with-this</guid><dc:creator><![CDATA[Zach Wilson]]></dc:creator><pubDate>Fri, 18 Jul 2025 18:18:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uVtk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I spent over three years at Airbnb as a Staff Engineer for <strong>Marketplace Dynamics</strong>, owning everything related to pricing, availability &amp; profitability.<br><br>One of my biggest projects was overhauling the Pricing &amp; Availability pipeline. Among other things, I fixed definitions, squashed time zone bugs, and rethought orchestration to turn weeks-long backfills into hours.</p><p>In this deep dive, I&#8217;ll walk you through the challenges I faced, the architectural mistakes I inherited, and the solutions that earned Airbnb millions. 
</p><p>This article covers the following topics:</p><ul><li><p>The subtle nuance in &#8216;availability&#8217; definitions</p></li><li><p>The original P&amp;A pipeline design and its pain points</p></li><li><p>Why massive backfills were so slow (and expensive)</p></li><li><p>Introducing staging tables for rapid iteration</p></li><li><p>What valuable lessons I learned</p></li><li><p>The business impact of my work and some personal reflections</p></li></ul><p>There&#8217;s a summary infographic of the entire data orchestration pipeline at the end of this article!</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uVtk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!uVtk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png 424w, https://substackcdn.com/image/fetch/$s_!uVtk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png 848w, https://substackcdn.com/image/fetch/$s_!uVtk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!uVtk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uVtk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png" width="1456" height="1048" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:712627,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/168272588?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uVtk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png 424w, https://substackcdn.com/image/fetch/$s_!uVtk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png 848w, https://substackcdn.com/image/fetch/$s_!uVtk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!uVtk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd597311a-753a-4b10-8bf6-cc283a29cd82_2000x1440.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<h3>The True Meaning of &#8220;Available&#8221;</h3><p>Airbnb&#8217;s legacy definition of an &#8220;available&#8221; night was simply:</p><blockquote><p><em>A host has not blocked this night, and it is not already reserved.</em></p></blockquote><p>On the surface, that sounds reasonable. In practice, it diverged from what guests could actually book, so Airbnb set out to establish a more reliable definition of &#8220;available&#8221;:</p><blockquote><p><em>A trip can be booked that contains this night.</em></p></blockquote><p>The two definitions matched only 96% of the time, and the mismatches are exactly the nuance the new definition captures.</p><p>Key edge cases:</p><ul><li><p><strong>Minimum stay requirements</strong>: Hosts or local regulations (e.g., a 30-day minimum in New York) made many unblocked nights unbookable.</p></li><li><p><strong>Last-minute / time zone bugs</strong>: The system evaluated availability one second before midnight UTC. 
So listings in Asia or Europe were sometimes effectively asking, &#8220;<em>Can I book yesterday?</em>&#8221;</p></li></ul><h3>Original Pipeline Architecture &amp; Pain Points</h3><p>Here&#8217;s what I inherited in the P&amp;A pipeline:</p><ol><li><p><strong>Fifteen raw datasets</strong> for blocked nights, calendar entries, regional regulations, minimum stays, etc.</p></li><li><p>A <strong>single Spark job</strong> that:</p><ol><li><p>Joins all fifteen tables in one massive operation</p></li><li><p>Calls the Airbnb Java P&amp;A library (via Scala Spark) to calculate availability</p></li><li><p>Writes out the master P&amp;A table for downstream models (e.g., Smart Pricing)</p></li></ol></li></ol><h4>Why this was a problem</h4><ul><li><p><strong>Massive, repeated joins</strong>: Every time we tweaked a rule, the pipeline re-joined all 15 tables across <strong>eight years</strong> of historical data.</p></li><li><p><strong>Unpredictable runtimes</strong>: Backfilling could take <strong>2&#189; weeks</strong>, despite only ~10 GB of daily data.</p></li><li><p><strong>High compute costs</strong>: Multiple backfill attempts (due to late requirement changes) meant wasted weeks and tens of thousands of dollars.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dPRg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04d61536-cf69-4d2a-a960-95305f075661_2125x850.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dPRg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04d61536-cf69-4d2a-a960-95305f075661_2125x850.png 424w, 
https://substackcdn.com/image/fetch/$s_!dPRg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04d61536-cf69-4d2a-a960-95305f075661_2125x850.png 848w, https://substackcdn.com/image/fetch/$s_!dPRg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04d61536-cf69-4d2a-a960-95305f075661_2125x850.png 1272w, https://substackcdn.com/image/fetch/$s_!dPRg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04d61536-cf69-4d2a-a960-95305f075661_2125x850.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dPRg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04d61536-cf69-4d2a-a960-95305f075661_2125x850.png" width="1456" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04d61536-cf69-4d2a-a960-95305f075661_2125x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:229212,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.dataexpert.io/i/168272588?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04d61536-cf69-4d2a-a960-95305f075661_2125x850.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!dPRg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04d61536-cf69-4d2a-a960-95305f075661_2125x850.png 424w, https://substackcdn.com/image/fetch/$s_!dPRg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04d61536-cf69-4d2a-a960-95305f075661_2125x850.png 848w, https://substackcdn.com/image/fetch/$s_!dPRg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04d61536-cf69-4d2a-a960-95305f075661_2125x850.png 1272w, https://substackcdn.com/image/fetch/$s_!dPRg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04d61536-cf69-4d2a-a960-95305f075661_2125x850.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h3>Why Backfills Were Glacially Slow</h3><p>I kept asking myself: &#8220;<em>This isn&#8217;t even Big Data. It is not Netflix-scale&#8230; so why is it so slow?</em>&#8221; A few realizations:</p><ul><li><p><strong>Monolithic joins</strong>: Spark spent most of its time shuffling data across executors for each join.</p></li><li><p><strong>Lack of decoupling</strong>: The join logic (inputs) and the calculation logic (P&amp;A library) were tightly coupled, so every change rippled across the entire dataset.</p></li><li><p><strong>Zero incrementalism</strong>: There was no opportunity to reuse intermediate results; every run was a full historical sweep.</p></li></ul><div><hr></div><h3>A Solution: Staging Tables &amp; Incremental Orchestration</h3><p>The breakthrough came when I introduced a <strong>staging table</strong> and materialized all raw P&amp;A inputs into one &#8220;inputs&#8221; dataset.<br></p>
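<p>To make the staging idea concrete, here is a minimal Python sketch. It is not the actual Airbnb implementation, which ran on Spark with a Java library. Every name in it is a hypothetical stand-in: <code>StagedInput</code>, <code>build_staging_partition</code>, the <code>open_run_length</code> field, and both rule functions. The assumption is that the expensive join runs once per daily partition, and a later rule change only re-reads the staged rows.</p>

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List, Set, Tuple

Key = Tuple[int, date]  # (listing_id, night)


@dataclass(frozen=True)
class StagedInput:
    """One row of the hypothetical staged "inputs" dataset."""
    listing_id: int
    night: date
    host_blocked: bool
    reserved: bool
    min_nights: int        # minimum stay set by the host or a regulation
    open_run_length: int   # consecutive open nights that include this night


def build_staging_partition(
    keys: List[Key],
    blocked: Set[Key],
    reserved: Set[Key],
    min_nights: Dict[int, int],
    open_runs: Dict[Key, int],
) -> List[StagedInput]:
    """Expensive step: combine the raw sources once and materialize the result."""
    return [
        StagedInput(
            listing_id=listing_id,
            night=night,
            host_blocked=(listing_id, night) in blocked,
            reserved=(listing_id, night) in reserved,
            min_nights=min_nights.get(listing_id, 1),
            open_run_length=open_runs.get((listing_id, night), 0),
        )
        for listing_id, night in keys
    ]


def legacy_available(row: StagedInput) -> bool:
    """Legacy rule: not blocked by the host and not already reserved."""
    return not row.host_blocked and not row.reserved


def bookable(row: StagedInput) -> bool:
    """Simplified newer rule: a trip meeting the minimum stay fits around this night."""
    return legacy_available(row) and row.open_run_length >= row.min_nights


def compute_availability(
    staged: List[StagedInput], rule: Callable[[StagedInput], bool]
) -> Dict[Key, bool]:
    """Cheap step: apply a rule to staged rows without touching the raw sources."""
    return {(row.listing_id, row.night): rule(row) for row in staged}


# Usage: stage once, then evaluate two rule versions against the same rows.
night = date(2019, 7, 1)
staged = build_staging_partition(
    keys=[(1, night), (2, night)],
    blocked=set(),
    reserved=set(),
    min_nights={1: 30, 2: 1},
    open_runs={(1, night): 5, (2, night): 3},
)
legacy = compute_availability(staged, legacy_available)
current = compute_availability(staged, bookable)
print(legacy[(1, night)], current[(1, night)])  # True False
```

<p>Listing 1 is open under the legacy rule but not bookable because of its 30-night minimum, which is the kind of mismatch described earlier. Swapping <code>bookable</code> for a revised rule only reruns <code>compute_availability</code>; the join in <code>build_staging_partition</code> is left alone.</p>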
      <p>
          <a href="https://blog.dataexpert.io/p/how-i-made-airbnb-millions-with-this">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>