Netflix Live Origin: Engineering a Reliable Cloud Live Streaming Pipeline
11 March 2026
by Suraj Barman
Context & History
Netflix entered the live streaming arena to complement its massive on‑demand catalog. Early efforts relied on the same infrastructure that served video‑on‑demand (VOD), but live events demanded sub‑second latency, continuous segment generation, and fault‑tolerant delivery. To meet these needs, the engineering team created the Live Origin, a dedicated broker positioned between the cloud‑based live pipelines and Netflix's proprietary CDN, Open Connect. The service debuted alongside the Behind the Streams series, showcasing how Netflix adapts its architecture for live content while preserving the reliability users expect from VOD.
Implementation & Best Practices
The Live Origin is built as a multi‑tenant microservice running on Amazon EC2 instances within the AWS cloud. It communicates exclusively over standard HTTP: the packager stores segments with PUT requests, and Open Connect retrieves them with GET requests. Each segment's URL encodes its storage location, allowing the CDN to request the exact object without additional lookup steps. This design mirrors the VOD workflow but adds live‑specific logic for candidate selection and defect handling.
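As a minimal sketch of a URL that encodes its own storage location, the Python below parses a segment path into a storage key without consulting any index. The path layout `/live/<event>/<track>/<number>.<ext>`, the host name, and the zero-padded key format are assumptions made for illustration; the article does not describe the real URL scheme.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class SegmentLocation:
    event_id: str
    track: str
    segment_number: int

    def storage_key(self) -> str:
        # Hypothetical key layout: zero-padded so keys sort in segment order.
        return f"{self.event_id}/{self.track}/{self.segment_number:010d}"


def parse_segment_url(url: str) -> SegmentLocation:
    """Derive the storage location from the URL path alone, with no lookup."""
    parts = urlsplit(url).path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "live":
        raise ValueError(f"unexpected segment path: {url!r}")
    _, event_id, track, filename = parts
    stem, dot, _ext = filename.rpartition(".")
    if not dot or not (stem.isascii() and stem.isdigit()):
        raise ValueError(f"unexpected segment file name: {filename!r}")
    return SegmentLocation(event_id, track, int(stem))
```

A GET for `https://oca.example.net/live/evt42/video_1080p/1234.m4s` would map straight to the key `evt42/video_1080p/0000001234`, and a PUT from the packager to the same path would write that key.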
Resilience and Multi‑Region Design
Resilience is achieved through redundant regional pipelines. Two independent encoding streams run in separate AWS regions, each producing its own segment set. The Live Origin inspects metadata supplied by the packager (such as defect flags) and selects the first valid segment in a deterministic order. If both pipelines produce a defective segment, the origin forwards the defect information downstream, enabling client‑side concealment strategies. Because a viewer is affected only when both pipelines fail on the same segment, this dual‑pipeline approach sharply lowers the probability of a visible failure and helps keep playback uninterrupted for millions of concurrent viewers.
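The selection rule described above can be sketched in a few lines of Python. The `Candidate` type, the pipeline labels, and the defect flag names are hypothetical; only the behavior (first defect-free candidate in a fixed order, otherwise pass a defective one through with its flags) comes from the description.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    pipeline: str                   # hypothetical label, e.g. a region name
    data: bytes
    defects: Tuple[str, ...] = ()   # defect flags reported by the packager


def select_segment(
    candidates: Mapping[str, Candidate], order: Sequence[str]
) -> Optional[Candidate]:
    """Pick the first defect-free candidate in a fixed pipeline order.

    If every available candidate is defective, return the first one anyway so
    its defect flags travel downstream. Return None if nothing has arrived.
    """
    available = [candidates[name] for name in order if name in candidates]
    for candidate in available:
        if not candidate.defects:
            return candidate
    return available[0] if available else None
```

Because `order` is fixed, every origin instance that sees the same candidates makes the same choice, which keeps cached copies consistent across the CDN.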
Caching and Request‑Holding Mechanics
Traditional HTTP caching operates at second granularity, which is too coarse for segments generated every 2 seconds. To address this, Netflix extended nginx with millisecond‑level cache control. When a client requests a segment that has not yet been published, the Live Origin can return a 404 with a short‑lived cache directive, allowing Open Connect to cache the negative response until the segment becomes available. Alternatively, the origin can hold the request open: it keeps the connection alive and streams the segment as soon as the packager pushes it. This technique reduces network chatter and improves perceived latency.
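```nginx
# Illustrative nginx sketch of short-lived caching for live segments.
# Stock nginx parses proxy_cache_valid times in whole seconds, so the
# millisecond values below assume a build extended as described above.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=live_cache:10m
                 max_size=1g inactive=5s use_temp_path=off;

server {
    listen 80;

    location /live/ {
        # Assumes an upstream block named live-origin-upstream is defined elsewhere.
        proxy_pass http://live-origin-upstream;
        proxy_cache live_cache;
        proxy_cache_valid 200 200ms;  # short positive lifetime
        proxy_cache_valid 404 50ms;   # brief negative cache for unpublished segments
        proxy_ignore_headers Set-Cookie;

        # Retry the next upstream server on errors or a 404. This directive does
        # not hold requests open; request holding is done by the origin itself.
        proxy_next_upstream error timeout invalid_header http_404;
    }
}
```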
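Request holding itself is not something the nginx directives above provide, so here is a minimal Python sketch of the idea using `asyncio`: a request for an unpublished segment waits until the packager publishes it or a hold timeout expires. `SegmentStore`, its method names, and the timeout values are hypothetical and not taken from the Live Origin.

```python
import asyncio
from typing import Dict, Optional, Tuple


class SegmentStore:
    """In-memory store where readers can wait for a segment to be published."""

    def __init__(self) -> None:
        self._segments: Dict[str, bytes] = {}
        self._waiters: Dict[str, asyncio.Event] = {}

    def publish(self, key: str, data: bytes) -> None:
        """Called on the packager's PUT; wakes every request held on this key."""
        self._segments[key] = data
        waiter = self._waiters.pop(key, None)
        if waiter is not None:
            waiter.set()

    async def get(self, key: str, hold_timeout: float) -> Optional[bytes]:
        """Return the segment, holding the request up to hold_timeout seconds."""
        if key in self._segments:
            return self._segments[key]
        waiter = self._waiters.setdefault(key, asyncio.Event())
        try:
            await asyncio.wait_for(waiter.wait(), timeout=hold_timeout)
        except asyncio.TimeoutError:
            return None  # caller would answer 404 with a short cache lifetime
        return self._segments.get(key)


async def demo() -> Tuple[Optional[bytes], Optional[bytes]]:
    store = SegmentStore()
    # The request arrives before the segment exists and is held open.
    held = asyncio.create_task(store.get("evt42/video/7", hold_timeout=1.0))
    await asyncio.sleep(0.01)
    store.publish("evt42/video/7", b"segment-7")
    # A segment that never arrives times out and falls back to the 404 path.
    missed = await store.get("evt42/video/8", hold_timeout=0.05)
    return await held, missed
```

Running `asyncio.run(demo())` shows both paths: the held request resolves with the published bytes, and the request for the missing segment returns `None` once the hold timeout passes.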
Header‑Based Notifications and In‑Memory State
Every segment published by the packager includes custom HTTP headers that convey live‑event notifications (e.g., availability start time, segment number, or emergency alerts). The Live Origin injects these headers into the response stream, and Open Connect appliances extract them to maintain an in‑memory data structure keyed by event ID. This structure ensures that any OCA, regardless of how far behind a client is, always has the latest notification payload attached to subsequent segment responses. The approach eliminates the need for separate signaling channels and scales transparently with the number of concurrent live events.
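One way to picture the in‑memory structure is a small per‑event cache that remembers the newest notification headers and stamps them onto every outgoing response. The sketch below is an assumption-laden illustration: the `x-live-` header prefix, the class name, and the use of the segment number as a freshness marker are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Mapping

NOTIFY_PREFIX = "x-live-"  # hypothetical prefix for notification headers


@dataclass
class EventState:
    segment_number: int
    payload: Dict[str, str]


class NotificationCache:
    """Latest notification payload per live event, kept in memory."""

    def __init__(self) -> None:
        self._by_event: Dict[str, EventState] = {}

    def observe(self, event_id: str, segment_number: int,
                headers: Mapping[str, str]) -> None:
        """Record notification headers seen on a response from the origin."""
        payload = {k.lower(): v for k, v in headers.items()
                   if k.lower().startswith(NOTIFY_PREFIX)}
        if not payload:
            return
        current = self._by_event.get(event_id)
        # Keep only the newest notification; ignore headers from older segments.
        if current is None or segment_number >= current.segment_number:
            self._by_event[event_id] = EventState(segment_number, payload)

    def decorate(self, event_id: str, headers: Mapping[str, str]) -> Dict[str, str]:
        """Attach the latest payload to a response, even for an old segment."""
        out = dict(headers)
        state = self._by_event.get(event_id)
        if state is not None:
            out.update(state.payload)
        return out
```

A client that is several segments behind still receives the newest payload, because `decorate` uses the event's latest state rather than whatever headers were stored with the old segment.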
Fault Detection and Intelligent Invalidation
Live streams can suffer from discontinuities, audio/video sync errors, or corrupted frames. The packager performs lightweight media inspection and annotates problematic segments with defect metadata. The Live Origin records this information and, if a segment is deemed unusable, it can issue an invalidation command that clears all cached copies of the affected segment across the CDN tier. This rapid purge prevents defective content from propagating to end users and triggers the selection of an alternate pipeline segment.
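The purge-then-reselect flow can be sketched as a toy origin with attached edge caches. All class and method names here are hypothetical, and real invalidation across a CDN tier is asynchronous and far more involved; the sketch only shows the ordering of events.

```python
from typing import Dict, List, Optional, Set, Tuple


class Origin:
    """Toy origin holding one candidate per pipeline for each segment key."""

    def __init__(self) -> None:
        self._segments: Dict[str, Dict[str, bytes]] = {}
        self._unusable: Set[Tuple[str, str]] = set()
        self._caches: List["EdgeCache"] = []

    def attach(self, cache: "EdgeCache") -> None:
        self._caches.append(cache)

    def publish(self, key: str, pipeline: str, data: bytes) -> None:
        self._segments.setdefault(key, {})[pipeline] = data

    def fetch(self, key: str) -> Optional[bytes]:
        """Return the first candidate (in publish order) not marked unusable."""
        for pipeline, data in self._segments.get(key, {}).items():
            if (key, pipeline) not in self._unusable:
                return data
        return None

    def mark_unusable(self, key: str, pipeline: str) -> int:
        """Flag a candidate and purge cached copies; returns purged cache count."""
        self._unusable.add((key, pipeline))
        return sum(1 for cache in self._caches if cache.purge(key))


class EdgeCache:
    def __init__(self, origin: Origin) -> None:
        self._origin = origin
        self._objects: Dict[str, bytes] = {}
        origin.attach(self)

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._objects:
            data = self._origin.fetch(key)
            if data is None:
                return None
            self._objects[key] = data
        return self._objects[key]

    def purge(self, key: str) -> bool:
        return self._objects.pop(key, None) is not None
```

After `mark_unusable`, the next `get` on any purged cache misses, goes back to the origin, and picks up the alternate pipeline's segment.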
Operational Monitoring and Observability
To sustain high availability, the Live Origin emits detailed metrics via Prometheus: request latency, cache hit/miss ratios, defect rates, and pipeline health status. Dashboards aggregate these signals, allowing engineers to spot anomalies within seconds. Alerting policies trigger automatic failover to the secondary region when error rates exceed predefined thresholds, preserving continuity without manual intervention.
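A threshold-based failover trigger of this kind can be approximated with a sliding-window error rate. The sketch below is an assumption, not the Live Origin's alerting logic: the window length, the 5% threshold, and the minimum sample count are illustrative values, and the class name is hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateMonitor:
    """Sliding-window error rate with a failover threshold."""

    def __init__(self, window_seconds: float = 10.0, threshold: float = 0.05,
                 min_requests: int = 20) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self._samples: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, ok: bool) -> None:
        """Record one request outcome; timestamps are assumed non-decreasing."""
        self._samples.append((timestamp, ok))

    def _evict(self, now: float) -> None:
        # Drop samples that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def error_rate(self, now: float) -> float:
        self._evict(now)
        if not self._samples:
            return 0.0
        errors = sum(1 for _, ok in self._samples if not ok)
        return errors / len(self._samples)

    def should_fail_over(self, now: float) -> bool:
        """True when the window has enough traffic and too many errors."""
        self._evict(now)
        if len(self._samples) < self.min_requests:
            return False
        return self.error_rate(now) > self.threshold
```

The `min_requests` guard is a deliberate choice: without it, a single failed request during a quiet period would register as a 100% error rate and trigger a needless failover.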
Summary of Best Practices
- Deploy the Live Origin as a stateless microservice behind an autoscaling group.
- Leverage dual‑region pipelines with deterministic candidate ordering.
- Extend nginx for millisecond‑level caching and request‑holding.
- Use custom HTTP headers for real‑time notifications across the CDN.
- Implement defect‑aware invalidation to protect end‑user experience.
- Instrument with metrics and alerts for proactive resilience.