Scaling Netflix's threat detection pipelines without streaming
Data orchestration challenges I faced at Netflix, Airbnb, & Facebook (Part II)
Back in 2018, I was part of Netflix’s real-time threat detection team. I owned the orchestration and delivery layer of a detection pipeline that flagged fraudulent behavior, security breaches, and abuse patterns across our global platform.
At the time, we were leveraging a creative hybrid architecture internally dubbed the “Psycho Pattern.” Think of it as a middle ground between batch and real-time, powered by Spark, Kafka, and Airflow.
On paper, it worked. In practice, it taught me some of the hardest lessons of my engineering career. This article covers the following topics:
What the Psycho Pattern is and how it works
The orchestration flaws that broke production
Deep dive into Spark batch processing
How memory tuning became a nightmare
Why migrating to Flink failed spectacularly
The real lesson: Stakeholders wanted signal, not speed
If you want to learn more in depth about patterns like this, the DataExpert.io academy subscription has 200+ hours of content about system design, streaming pipelines, etc. The first 5 people can use code PSYCHO for 30% off!
The Setup: Netflix’s Threat Detection Stack
The team needed a pipeline that could ingest security event data (from Kafka or SQS), apply machine learning-based threat detection, and raise alerts. All with minimal latency.
Instead of using Spark Streaming or Flink right away, we built a continuously running Spark batch pipeline using Airflow and a high watermark table.
The Psycho Pattern, explained
Unlike most pipelines, which run hourly or daily, the Psycho Pattern runs all the time, one job at a time. And unlike streaming, each job actually starts and stops.
Here is the end-to-end process of any job that operates under this pattern:
Spark reads the high watermark from the high watermark table (last processed file’s landing timestamp)
Spark pulls the events newer than the watermark from Kafka or SQS
Spark runs the fraud detection logic in the batch processor
Spark writes the results to the “processed events” database
Spark updates the watermark to the most recent file it saw.
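In rough PySpark terms, a single run looks something like the sketch below. This is a minimal illustration, not the actual Netflix code: the table names, topic, brokers, and the detect_threats / update_watermark helpers are all placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("psycho-pattern-run").getOrCreate()

# 1. Read the high watermark: the landing timestamp of the last processed data.
watermark = (
    spark.table("security.watermarks")                  # hypothetical watermark table
    .filter(F.col("pipeline") == "threat_detection")
    .agg(F.max("landing_ts"))
    .collect()[0][0]
)

# 2. Batch-read everything that landed in Kafka after the watermark.
events = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder brokers
    .option("subscribe", "security-events")             # placeholder topic
    .load()
    .filter(F.col("timestamp") > F.lit(watermark))      # landing time, not event time
)

# 3. Run the detection logic (stand-in for the real ML scoring step).
scored = detect_threats(events)                         # hypothetical function

# 4. Write results to the processed-events store.
scored.write.mode("append").saveAsTable("security.processed_events")

# 5. Advance the watermark to the newest landing timestamp seen in this run.
new_watermark = events.agg(F.max("timestamp")).collect()[0][0]
if new_watermark is not None:
    update_watermark("threat_detection", new_watermark) # hypothetical helper

The whole job is bounded: it reads a slice, scores it, commits it, moves the watermark forward, and exits.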
Jobs are orchestrated via an Airflow DAG configured to execute under a continuous schedule, one active run at a time:
schedule="@continuous"
max_active_runs=1
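As a minimal sketch in modern Airflow (2.6+, where the @continuous schedule preset exists), the wrapper DAG looks roughly like this; the DAG ID, task, and spark-submit path are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="threat_detection_psycho_pattern",
    schedule="@continuous",   # kick off a new run as soon as the previous one finishes
    max_active_runs=1,        # required with @continuous: only one run at a time
    start_date=datetime(2018, 1, 1),
    catchup=False,
) as dag:
    BashOperator(
        task_id="run_spark_batch",
        bash_command="spark-submit /opt/jobs/threat_detection.py",  # hypothetical job path
    )

The effect is the back-to-back micro-batch loop described above, with no cron gap between runs.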
Note that the watermark timestamp corresponds to the latest landing time in Kafka, not the event time. This allowed late-arriving data to show up without the risk of missing it (i.e., an event from yesterday would still be processed because it landed in Kafka today).
Back then, this setup was pretty solid and, because it keyed off file landing time rather than event time, it handled late-arriving data gracefully.
Why this pattern was fragile
The Psycho Pattern wasn’t broken by design. It was broken by scale and bugs, and it presented two main issues:
High latency in threat detection (~5-7 minutes, which was considered too much)
It struggled with memory spikes during delays
Memory footprint explosion
If the DAG didn’t run for a while (say, due to a bug), the watermark would lag. Then, on the next run, the job would suddenly ingest massive amounts of data, far more than usual.
Spark’s memory usage was unpredictable and hard to tune. One job would run fine with 2GB and another one would crash with 4GB.
My quick fix to ‘solve’ this problem was to max out the configs like this:
--executor-memory 16G
--driver-memory 8G
--conf spark.sql.shuffle.partitions=800
This was far from an optimal solution, but memory on the cloud is less expensive than the engineering time it takes to troubleshoot a memory issue like this.
But executors still failed under burst loads. We had no dynamic scaling. No backpressure. No resource elasticity.
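In hindsight, the lever we were missing is Spark’s dynamic allocation, which lets the executor count scale with the size of the backlog instead of staying fixed. A rough sketch of the extra flags (values are illustrative; shuffle tracking is needed when there is no external shuffle service):

--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.minExecutors=2
--conf spark.dynamicAllocation.maxExecutors=50
--conf spark.dynamicAllocation.shuffleTracking.enabled=true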
Spark Batch in a Real-Time World
Spark batch is rock solid when:
Input sizes are known
Processing volumes are consistent
Orchestration is predictable
But in a pattern like this:
Input sizes vary with the delay since the last run
DAGs run non-stop
Memory needs spike without warning
There’s no checkpointing. No recovery. You either finish the run, or you start from scratch.
Stakeholders wanted real-time processing and Flink. We delivered. We still failed.
Security teams came to us and said: “We need this to be real-time. Move off Airflow. Switch to Flink.”
We assumed they meant latency. So, we jumped into migrating one specific detection use case to a Flink-based pipeline. (I have a full three-hour course on Flink-based pipelines on YouTube.)
And it worked... sort of. Latency dropped from 6 minutes to 4 minutes.
What about performance? Meh.
What about signal quality? Same.
What about engineering costs? Enormous.
So, what were we doing wrong?
The first rookie mistake was chasing a real-time approach without asking what problem we actually needed to solve. It turned out that latency wasn’t the bottleneck. Signal quality was.
Our detection model had a high false positive rate. Moving to Flink wouldn’t fix that.
What Actually Needed Fixing
The Psycho Pattern wasn’t perfect, but it was close.
What was missing:
Better data validation at the source with cleaner event schemas
More context for the ML model and higher ML precision
False positives were a much bigger pain than time-to-detection
Flagging a security anomaly is a needle-in-a-haystack problem that is prone to false positives. It requires an enormous amount of context, which we could not get because of the focus on low latency and streaming.
Smarter memory tuning with dynamic Spark strategies
A lot of these things were solved in later versions of Spark, and greatly improved in Spark 4.0!
Iceberg and TableFlow make this problem much easier today! Sourcing from Iceberg instead of Kafka lets you partition the input data and re-read it in case of failure.
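For example, with an Iceberg source you can read incrementally between table snapshots and simply re-run the same snapshot range if a job fails. A rough sketch, assuming a Spark session with an Iceberg catalog configured (the table name and snapshot IDs are placeholders):

# Incremental read between two Iceberg snapshots, replacing the Kafka pull.
events = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "10963874102873")   # exclusive: last snapshot already processed
    .option("end-snapshot-id", "63874143573109")     # inclusive: newest snapshot to process
    .load("security.raw_events")                     # placeholder table
)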
Instead, we spent 6 months rewriting pipelines that already worked well enough.
Business Impact & Final Reflections
Looking back, the Psycho Pattern wasn’t a mistake. It did what it was supposed to, until scale, edge cases, and misunderstood expectations broke it.
What actually failed was communication.
Had we focused on improving signal accuracy rather than chasing lower latency, we would’ve delivered more value, faster, and with less complexity.
Here’s what I took away from that experience:
Micro-batch isn’t broken. If you can consistently hit 3–5 minute latency with Airflow + Spark + Kafka, you’re already operating at Netflix-grade.
High watermark logic is your lifeline. It’s the only state your batch job owns. Audit it, monitor it, and treat it like the heartbeat of your system (a minimal lag check is sketched after this list).
Memory management in Spark isn’t about throwing more RAM at the problem. Static configs can’t keep up with unpredictable bursts. Use spill monitoring, GC logs, and lean into dynamic allocation strategies.
Ask better questions. When someone says they want faster pipelines, what they often mean is better data. Don’t just ship quicker results. Ship better ones.
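For the watermark point above, the check can be as small as a lag alert that runs alongside the DAG. A minimal sketch, reusing the hypothetical watermark table from the earlier sketch and assuming timestamps are stored in UTC:

from datetime import datetime, timedelta

MAX_LAG = timedelta(minutes=15)   # alert threshold; tune it to your detection SLA

latest = spark.sql(
    "SELECT max(landing_ts) AS landing_ts "
    "FROM security.watermarks WHERE pipeline = 'threat_detection'"
).collect()[0]["landing_ts"]

if latest is None:
    raise RuntimeError("No watermark found; the pipeline has never committed a run")

lag = datetime.utcnow() - latest  # assumes UTC watermarks and a UTC Spark session timezone
if lag > MAX_LAG:
    raise RuntimeError(f"Watermark is {lag} behind; the pipeline has stalled")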
One final take on real-time data.
Real-time data is seductive. But without clear goals, it’s just the woman in the red dress… a beautiful distraction. Most of the time, the marginal gain of moving from micro-batch to streaming is not worth it.
Thanks for reading!