Scaling Containers on Modern CPUs: Insights from Netflix's Engineering Team

1 May 2026 by Suraj Barman

Netflix delivers streaming video to millions of users worldwide. This article looks at the challenges its engineers faced while scaling containers on modern CPUs and the approaches they took to maintain a high-quality user experience.

The Importance of Efficient Container Scaling

Netflix relies on efficient container scaling to support the demands of its massive global user base. Containers are lightweight, portable environments that host application workloads. When users click play, hundreds of containers are orchestrated to ensure the requested content is delivered quickly and reliably. To achieve this, Netflix employs a sophisticated container runtime system that scales up server capacity almost instantaneously.

Despite efforts to modernize its container runtime, Netflix encountered a bottleneck rooted in the underlying CPU architecture. The issue threatened the company's ability to maintain a consistent streaming experience, especially during high-demand periods such as weekends and holidays.

Unveiling the Problem: Mount Table Length and System Stalls

As Netflix transitioned to a new container platform, engineers began observing unexpected delays and system stalls. Specifically, nodes running on r5.metal instances went unresponsive for prolonged periods. These issues were traced back to an unusual increase in mount table length, which significantly slowed down mount-related operations.

Mount tables, which track filesystem mounts, grew excessively large during container creation, delaying even basic health checks. In some cases, the systemd process responsible for managing these mounts became overwhelmed, leading to a complete system lockup. This was a critical discovery because it pointed to a deeper issue tied to the underlying hardware.
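To make "mount table length" concrete, here is a minimal sketch of how one might measure it on a Linux node. It parses the text format of /proc/self/mountinfo and counts entries per filesystem type. The function names summarize_mounts and read_mountinfo are hypothetical helpers introduced for illustration, not part of Netflix's tooling.

```python
from collections import Counter
from pathlib import Path


def summarize_mounts(mountinfo_text: str) -> tuple[int, Counter]:
    """Return (total mount count, mount count per filesystem type)."""
    fs_types: Counter = Counter()
    total = 0
    for line in mountinfo_text.splitlines():
        # In mountinfo, a lone "-" separates the variable-length optional
        # fields from the filesystem type, source, and super options.
        _, sep, tail = line.partition(" - ")
        if not sep or not tail.split():
            continue  # skip blank or malformed lines
        total += 1
        fs_types[tail.split()[0]] += 1
    return total, fs_types


def read_mountinfo(path: str = "/proc/self/mountinfo") -> str:
    """Read the current process's mount table (Linux only)."""
    return Path(path).read_text()
```

On a Linux host, summarize_mounts(read_mountinfo()) returns the current table length and a per-type breakdown; sampling it while containers start would show whether the table is growing.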

Analyzing Performance Bottlenecks

To identify the root cause, Netflix engineers examined flame graphs of the affected systems. The analysis revealed that containerd, the container runtime, spent most of its time acquiring kernel-level locks during mount-related activities. This lock contention was a direct result of the increased number of layers in container images.

The flame graph provided a visual representation of the system's performance bottlenecks. It became clear that the mounting process was causing significant delays, which cascaded into other system components such as the kubelet, further exacerbating performance issues.
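The kind of reading described above can also be done numerically. Flame graphs are commonly rendered from "folded" stack samples, where each line is a semicolon-separated stack followed by a sample count. The sketch below, which assumes that input format, computes the fraction of samples whose stack contains a lock-related frame. The function name lock_time_share and the marker strings passed to it are illustrative assumptions; the source does not say which kernel symbols dominated.

```python
def lock_time_share(folded_text: str, markers: tuple[str, ...]) -> float:
    """Fraction of samples whose stack contains any marker substring.

    Each input line is expected to look like "frameA;frameB;frameC 42".
    """
    total = 0
    matched = 0
    for raw in folded_text.splitlines():
        line = raw.strip()
        # The sample count is the last space-separated token on the line.
        stack, _, count_str = line.rpartition(" ")
        if not stack or not count_str.isdigit():
            continue  # skip blank or malformed lines
        count = int(count_str)
        total += count
        frames = stack.split(";")
        if any(marker in frame for frame in frames for marker in markers):
            matched += count
    return matched / total if total else 0.0
```

Calling lock_time_share(text, ("spin_lock",)) on folded samples from a stalled node would give a single number to track across kernel or runtime changes, instead of comparing flame graphs by eye.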

Hardware-Specific Challenges with r5.metal Instances

The affected nodes were primarily r5.metal instances, AWS EC2 bare-metal servers optimized for memory-intensive workloads. Despite their specifications, these instances struggled with the sudden surge in mount-related operations. The issue was particularly pronounced for containers with numerous filesystem layers, highlighting a limitation in how the CPU architecture handled concurrent tasks.
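A back-of-the-envelope model helps explain why extra cores do not help when mount work is serialized behind a lock. The sketch below is a toy estimate, not a measurement: it assumes one mount operation per image layer and a fixed lock hold time per operation. Both assumptions, and the function name mount_time_estimate, are introduced here for illustration and do not come from Netflix's analysis.

```python
def mount_time_estimate(containers: int, layers: int,
                        hold_us: int, cores: int) -> dict[str, int]:
    """Compare serialized and ideally parallel mount time, in microseconds."""
    # Assumption: each image layer costs exactly one mount operation.
    ops = containers * layers
    # With a single global lock, operations run one at a time,
    # so the core count does not appear in this term at all.
    serialized_us = ops * hold_us
    # Without contention, operations would spread evenly across cores
    # (ceiling division gives the number of rounds needed).
    ideal_parallel_us = -(-ops // cores) * hold_us
    return {
        "mount_ops": ops,
        "serialized_us": serialized_us,
        "ideal_parallel_us": ideal_parallel_us,
    }
```

Under these assumptions the serialized time grows with containers times layers regardless of core count, which is consistent with large machines stalling when many multi-layer containers start at once.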

This observation underscored the need for a deeper understanding of the interplay between modern hardware capabilities and containerized workloads. It also emphasized the importance of optimizing both software and hardware configurations to achieve the desired level of performance.

Lessons Learned and Future Directions

Netflix's experience with container scaling on modern CPUs provides valuable insights for the tech community. One key lesson is the critical role of hardware-level optimizations in supporting large-scale containerized environments. Addressing bottlenecks requires a holistic approach that considers both software processes and the underlying infrastructure.

Moving forward, Netflix aims to explore new strategies for improving container runtime efficiency. These include optimizing mount table management, reducing lock contention, and potentially re-evaluating the use of certain hardware configurations. Such efforts are crucial to the company's commitment to a seamless streaming experience for users worldwide.

Conclusion: Pioneering the Future of Scalable Streaming

By addressing the challenges of scaling containers on modern CPUs, Netflix continues to set a benchmark in cloud computing and distributed systems engineering. Its proactive approach to problem-solving not only ensures a better experience for users but also contributes to the broader understanding of containerization and system performance.
