How Netflix Resolved Container Mount Bottlenecks on AWS Metal Instances

28 February 2026 by Suraj Barman

Netflix Container Scaling: Diagnosing Mount Table Bottlenecks on AWS Metal Instances

Netflix runs thousands of containers per second to serve video streams. When a new AWS instance boots, the kubelet schedules pods onto it, and each container requires dozens of bind mounts to construct its overlayfs root filesystem. During a launch surge, these mounts overwhelmed the kernel’s global mount lock, causing pod‑startup delays of up to 30 seconds.

Root Cause: Mount Table Lock Contention

The kernel VFS serializes every mount and unmount operation through a single lock. With images that contain 50+ layers, each container triggers hundreds of mount calls, and a burst of simultaneous launches multiplies that into tens of thousands, saturating the lock.

Per‑container mount count: 2 × (1 + layers + layers) operations per launch.
Example load: 100 containers × 50 layers → 100 × 2 × (1 + 50 + 50) = 20,200 mount operations.
Observed symptom: health‑check timeouts and systemd stalls.
Flamegraph insight: >90% of CPU time spent in path_init() waiting on a sequence lock.
Lock type: global mount_lock causing spin‑wait loops.
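
The arithmetic above can be checked with a short Python sketch. The function names are hypothetical and chosen for illustration; the formula is the one stated in the list, not a measurement of any particular runtime.

```python
def mounts_per_container(layers: int) -> int:
    """Mount operations for one container launch: 2 x (1 + layers + layers)."""
    return 2 * (1 + layers + layers)


def total_mounts(containers: int, layers: int) -> int:
    """Mount operations for a burst of simultaneous container launches."""
    return containers * mounts_per_container(layers)


if __name__ == "__main__":
    print(mounts_per_container(50))  # 202
    print(total_mounts(100, 50))     # 20200
```

Because the count grows linearly in both layers and containers, halving the layer depth roughly halves the pressure on the lock for the same launch rate.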

Impact of Instance Architecture (NUMA and Hyper‑Threading)

Dual‑socket metal instances expose two NUMA nodes. Remote memory accesses add latency to lock acquisition, amplifying the bottleneck on r5.metal servers.

NUMA definition: each group of CPUs has its own local memory; accesses to another group’s memory are remote and travel over an interconnect (Wikipedia).
Observed pattern: single‑socket instances maintained their launch rates, while dual‑socket r5.metal instances began failing near 100 concurrent pods.
Hyper‑Threading effect: more logical threads increase contention on the same lock.
Benchmark result: 48xl (2 NUMA nodes) showed higher failure rates than 24xl (1 NUMA node).
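
To check whether a host exposes more than one NUMA node, a minimal Python sketch can read the Linux sysfs file /sys/devices/system/node/online, which lists online node IDs in a range format such as "0-1". The helper names below are hypothetical, and the fallback to a single node when the file is missing is an assumption made for illustration.

```python
from pathlib import Path
from typing import List


def parse_range_list(text: str) -> List[int]:
    """Expand a sysfs range list such as "0-1" or "0,2-3" into integer IDs."""
    ids: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            ids.extend(range(int(low), int(high) + 1))
        else:
            ids.append(int(part))
    return ids


def online_numa_nodes(path: str = "/sys/devices/system/node/online") -> List[int]:
    """Return online NUMA node IDs; assume a single node 0 if sysfs is unavailable."""
    try:
        return parse_range_list(Path(path).read_text())
    except OSError:
        return [0]


if __name__ == "__main__":
    nodes = online_numa_nodes()
    print(f"{len(nodes)} NUMA node(s) online: {nodes}")
```

A result with two or more nodes indicates a topology like the dual‑socket case described above, where lock acquisition may involve remote memory accesses.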

Mitigation Strategies Implemented

Netflix introduced a series of runtime and orchestration changes to reduce lock pressure and improve launch latency.

Layer flattening: pre‑merge frequently used layers into a single image, cutting bind‑mount count.
Parallel mount batching: group bind mounts per NUMA node before invoking the global lock.
Switch to containerd with reduced user‑namespace calls: eliminated the second mount pass.
Instance type shift: migrate high‑concurrency workloads to m7a instances with better NUMA scaling.
Monitoring addition: expose mount‑lock latency via Prometheus so that alerts trigger before pod‑startup failures occur.
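
As an illustration of the monitoring idea, the following Python sketch decides whether to alert from a window of mount latency samples. It is not Netflix's implementation and does not use the Prometheus client library; the function names, the 1-second threshold, and the choice of the 99th percentile are all assumptions for the example.

```python
import math
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_alert(samples: List[float], threshold_s: float = 1.0, pct: float = 99.0) -> bool:
    """Alert when the chosen percentile of mount latency exceeds the threshold (seconds)."""
    return percentile(samples, pct) > threshold_s


if __name__ == "__main__":
    healthy = [0.01] * 100
    degraded = [0.01] * 98 + [2.0, 3.0]
    print(should_alert(healthy), should_alert(degraded))
```

Alerting on a high percentile rather than the mean is one way to surface lock stalls that affect only a small fraction of mounts before they grow into pod‑startup failures.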

Best Practices for Future Deployments

Teams can adopt the following guidelines to avoid similar bottlenecks when scaling container workloads.

Keep image layers shallow: aim for <10 layers for frequently launched services.
Prefer single‑socket or NUMA‑aware instance families: align pod placement with local memory.
Enable overlayfs caching: reuse lower‑dir mounts where possible.
Instrument mount latency: use perf or eBPF to detect lock spikes early.
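
Before reaching for perf or eBPF, a cheaper first signal is how many mounts a host currently holds. The Python sketch below counts entries by filesystem type from text in the /proc/self/mountinfo format, where the filesystem type is the first field after the " - " separator. The function name and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical sample in /proc/self/mountinfo format, for illustration only.
SAMPLE = """\
22 28 0:21 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
310 28 0:55 / /run/c1/rootfs rw,relatime shared:120 - overlay overlay rw,lowerdir=/l1:/l2
311 28 0:56 / /run/c2/rootfs rw,relatime - overlay overlay rw,lowerdir=/l1:/l3
"""


def count_mounts_by_fstype(mountinfo_text: str) -> Counter:
    """Count mount entries per filesystem type in mountinfo-formatted text."""
    counts: Counter = Counter()
    for line in mountinfo_text.splitlines():
        # Optional fields end at " - "; fstype is the first field after it.
        _, separator, tail = line.partition(" - ")
        if not separator:
            continue  # skip blank or malformed lines
        fields = tail.split()
        if fields:
            counts[fields[0]] += 1
    return counts


if __name__ == "__main__":
    print(count_mounts_by_fstype(SAMPLE))
```

On a Linux host, passing the contents of /proc/self/mountinfo instead of the sample gives a live count; a steadily growing overlay total during launch bursts suggests it is time to measure lock latency directly.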