DataExpert.io Newsletter

How I cut Airbnb's Pricing pipeline backfill time 95%

Data Orchestration challenges I faced in my years at Airbnb, Netflix & Facebook (Part I)

Zach Wilson
Jul 18, 2025

I spent over three years at Airbnb as a Staff Engineer for Marketplace Dynamics, owning everything related to pricing, availability & profitability.

One of my biggest projects was overhauling the Pricing & Availability pipeline. Among other things, I was fixing definitions, squashing time zone bugs and rethinking orchestration to turn weeks-long backfills into hours.

In this deep dive, I’ll walk you through the challenges I faced, the architectural mistakes I inherited and the solutions that earned Airbnb millions.

This article covers the following topics:

  • The subtle nuance in ‘availability’ definitions

  • The original P&A pipeline design and its pain points

  • Why massive backfills were so slow (and expensive)

  • Introducing staging tables for rapid iteration

  • The valuable lessons I learned

  • The business impact of my work and some personal reflections

There’s a summary infographic of the entire data orchestration pipeline at the end of this article!

The True Meaning of “Available”

Airbnb’s legacy definition of an “available” night was simply:

A host has not blocked this night, and it is not already reserved

On the surface, that sounds reasonable, but in practice it diverged from what guests could actually book. So Airbnb moved to a more reliable definition of “available”:

A trip can be booked that contains this night.

The two definitions matched only 96% of the time. The new wording is a subtle change, but it captures the cases where an unblocked, unreserved night still cannot be booked.
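
To make the gap between the two definitions concrete, here is a minimal Python sketch. It is not Airbnb's code: the function names, the `min_nights` parameter and the sample calendar are all assumptions chosen for illustration, and a minimum-stay rule is used as just one way the definitions can disagree.

```python
from datetime import date, timedelta


def legacy_available(night, blocked, reserved):
    """Legacy definition: the host has not blocked the night and it is not reserved."""
    return night not in blocked and night not in reserved


def bookable(night, blocked, reserved, min_nights):
    """Newer definition: some trip of at least `min_nights` open nights contains `night`.

    `min_nights` is a hypothetical stand-in for host or regulatory minimum stays.
    """
    if not legacy_available(night, blocked, reserved):
        return False
    run = 1  # length of the open run found so far, including `night`
    d = night - timedelta(days=1)
    while run < min_nights and legacy_available(d, blocked, reserved):
        run += 1
        d -= timedelta(days=1)
    d = night + timedelta(days=1)
    while run < min_nights and legacy_available(d, blocked, reserved):
        run += 1
        d += timedelta(days=1)
    return run >= min_nights


# Illustrative calendar: two open nights squeezed between two blocked ones.
blocked = {date(2025, 7, 10), date(2025, 7, 13)}
reserved = set()
night = date(2025, 7, 11)

print(legacy_available(night, blocked, reserved))        # open under the legacy rule
print(bookable(night, blocked, reserved, min_nights=3))  # no 3-night trip fits
```

With a three-night minimum, the night counts as available under the legacy rule even though no guest could book a trip that includes it.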

Key edge cases:

  • Minimum stay requirements: Hosts or local regulations (e.g., 30-day minimum in New York) made many unblocked nights unbookable.

  • Last-minute / time zone bugs: The system evaluated availability one second before midnight UTC. For listings in Asia or Europe, where the local date may already have rolled over, it was sometimes effectively asking, “Can I book yesterday?”
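
The time zone edge case can be reproduced in a few lines. This is a sketch under simplifying assumptions: it uses fixed UTC offsets rather than a real time zone database, and the function names are hypothetical. The only detail taken from the text above is the evaluation instant, one second before midnight UTC.

```python
from datetime import date, datetime, timedelta, timezone


def local_date_at_evaluation(utc_day: date, utc_offset_hours: int) -> date:
    """Local calendar date at 23:59:59 UTC on `utc_day` for a fixed UTC offset."""
    eval_instant = datetime(
        utc_day.year, utc_day.month, utc_day.day, 23, 59, 59, tzinfo=timezone.utc
    )
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    return eval_instant.astimezone(local_tz).date()


def asks_about_yesterday(utc_day: date, utc_offset_hours: int) -> bool:
    """True if `utc_day` is already in the past on the listing's local calendar."""
    return local_date_at_evaluation(utc_day, utc_offset_hours) > utc_day


day = date(2025, 7, 18)
for label, offset in [("UTC+9", 9), ("UTC+2", 2), ("UTC-7", -7)]:
    print(label, local_date_at_evaluation(day, offset), asks_about_yesterday(day, offset))
```

For any listing east of UTC, the local date has already moved on by the evaluation instant, so a check for the UTC date's night is a check for a night that has locally begun or passed.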

Original Pipeline Architecture & Pain Points

Here’s what I inherited in the P&A pipeline:

  1. Fifteen raw datasets for blocked nights, calendar entries, regional regulations, minimum stays, etc.

  2. A single Spark job that:

    1. Joins all fifteen tables in one massive operation

    2. Calls the Airbnb Java P&A library (via Scala Spark) to calculate availability

    3. Writes out the master P&A table for downstream models (e.g., Smart Pricing)
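
The shape of that job can be sketched as a toy model in plain Python. The real job was Scala Spark calling a Java library, so everything below is a hypothetical stand-in: the table names, the column names and the `compute_availability` placeholder are assumptions, not the actual P&A logic.

```python
def join_inputs(raw_tables, key):
    """Merge every raw table's columns for one (listing, night) key.

    Stands in for the single join across all the raw datasets.
    """
    joined = {}
    for table in raw_tables.values():
        joined.update(table.get(key, {}))
    return joined


def compute_availability(inputs):
    """Placeholder for the P&A library call; the real rules are not shown here."""
    return not inputs.get("blocked", False) and not inputs.get("reserved", False)


def monolithic_job(raw_tables, keys):
    """Join and compute in one pass, keeping no intermediate result."""
    return {key: compute_availability(join_inputs(raw_tables, key)) for key in keys}


# Three illustrative tables instead of fifteen.
raw_tables = {
    "blocked_nights": {("listing_1", "2025-07-18"): {"blocked": True}},
    "reservations": {("listing_1", "2025-07-19"): {"reserved": True}},
    "min_stays": {("listing_1", "2025-07-20"): {"min_nights": 2}},
}
keys = [("listing_1", d) for d in ("2025-07-18", "2025-07-19", "2025-07-20")]
master_table = monolithic_job(raw_tables, keys)
print(master_table)
```

Because `monolithic_job` fuses the join and the calculation, any change to `compute_availability` forces `join_inputs` to run again for every key.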

Why this was a problem

  • Massive, repeated joins: Every time we tweaked a rule, the pipeline re-joined all 15 tables across eight years of historical data.

  • Unpredictable runtimes: Backfilling could take 2½ weeks, despite only ~10 GB of daily data.

  • High compute costs: Multiple backfill attempts (due to late requirements changes) meant wasted weeks and tens of thousands of dollars.
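
A quick back-of-the-envelope calculation puts these numbers in perspective. It rests on two assumptions that the text does not state: that the ~10 GB daily volume held roughly constant across all eight years, and that the 95% reduction in the title applies to the 2½-week figure.

```python
DAILY_GB = 10          # approximate daily data volume
YEARS = 8              # span of historical data
BACKFILL_WEEKS = 2.5   # reported backfill duration
REDUCTION = 0.95       # headline reduction from the title

days = YEARS * 365                                    # ignoring leap days
total_tb = DAILY_GB * days / 1000                     # full-history volume in TB
before_hours = BACKFILL_WEEKS * 7 * 24                # backfill duration in hours
after_hours = round(before_hours * (1 - REDUCTION))   # duration after a 95% cut

print(f"{days} days of history, roughly {total_tb:.1f} TB in total")
print(f"backfill: about {before_hours:.0f} hours before, about {after_hours} hours after")
```

Under those assumptions, the full history is on the order of tens of terabytes, and a 95% cut takes a backfill from weeks to under a day.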

Why Backfills Were Glacially Slow

I kept asking myself: “This isn’t even Big Data. It’s not Netflix-scale… so why is it so slow?” A few realizations:

  • Monolithic joins: Spark spent most of its time shuffling data across executors for each join.

  • Lack of decoupling: The join logic (inputs) and the calculation logic (P&A library) were tightly coupled, so every change rippled across the entire dataset.

  • Zero incrementalism: No opportunity to reuse intermediate results; every run was a full historical sweep.


A Solution: Staging Tables & Incremental Orchestration

The breakthrough came when I introduced a staging table and materialized all raw P&A inputs into one “inputs” dataset.
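
The general idea of staging inputs can be sketched in plain Python. This is an illustration of the pattern, not the actual Airbnb implementation: the function names, table names, columns and both rule versions are hypothetical.

```python
def build_staging(raw_tables, keys):
    """Run the expensive join once and keep the result as an "inputs" dataset."""
    staged = {}
    for key in keys:
        row = {}
        for table in raw_tables.values():
            row.update(table.get(key, {}))
        staged[key] = row
    return staged


def rule_v1(row):
    """Hypothetical first version of the availability rule."""
    return not row.get("blocked", False) and not row.get("reserved", False)


def rule_v2(row):
    """Hypothetical revision: the open run of nights must also meet the minimum stay."""
    return rule_v1(row) and row.get("open_run", 1) >= row.get("min_nights", 1)


def recompute(staged, rule):
    """Apply a rule to the staged inputs; no join is involved."""
    return {key: rule(row) for key, row in staged.items()}


raw_tables = {
    "blocked_nights": {("listing_1", "2025-07-18"): {"blocked": True}},
    "min_stays": {("listing_1", "2025-07-19"): {"min_nights": 3}},
    "open_runs": {("listing_1", "2025-07-19"): {"open_run": 2}},
}
keys = [("listing_1", "2025-07-18"), ("listing_1", "2025-07-19")]

staged_inputs = build_staging(raw_tables, keys)  # expensive step, done once
v1 = recompute(staged_inputs, rule_v1)           # cheap to rerun
v2 = recompute(staged_inputs, rule_v2)           # rule changed, join not repeated
print(v1)
print(v2)
```

The design choice is to pay for the join once: when a rule changes, only `recompute` runs again, over data that is already in the shape the calculation needs.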

© 2025 Zach Wilson