Optimizing Container Scaling on Modern CPU Architectures

16 April 2026 by Suraj Barman

Netflix's streaming service depends on engineering that can scale quickly, and one critical part of that is scaling containers efficiently to meet user demand. Containers are lightweight, portable software units that package an application together with its dependencies, which makes them fast to deploy and scale. However, as Netflix modernized its container runtime, it encountered unexpected bottlenecks related to CPU architecture.

Understanding the Problem with Container Scaling

When Netflix scales out to accommodate increased application demand, it provisions new instances from AWS and immediately schedules containers onto them so that their resources are put to use. During the transition to a modern container platform, however, some nodes began to experience health check timeouts exceeding 30 seconds. These delays directly affected the system's responsiveness and reliability, presenting a significant engineering challenge.

Investigation showed that the root cause was a dramatic increase in the length of the mount table. The longer table increased the time required to process mounts, which caused system-level complications such as lock-ups and frequent kubelet timeouts. These events were particularly prevalent on r5.metal instances that were handling container images with numerous layers.
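To make the growth of the mount table concrete, here is a minimal Python sketch that counts mount entries per filesystem type from text in the Linux /proc/self/mountinfo format. The function name count_mounts_by_fstype and the sample entries are illustrative assumptions and are not taken from Netflix's tooling.

```python
from collections import Counter


def count_mounts_by_fstype(mountinfo_text: str) -> Counter:
    """Count mount table entries per filesystem type.

    Expects text in the /proc/self/mountinfo layout, where the
    filesystem type is the first field after a lone "-" separator.
    """
    counts: Counter = Counter()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if "-" not in fields:
            continue  # skip blank or malformed lines
        sep = fields.index("-")
        if sep + 1 < len(fields):
            counts[fields[sep + 1]] += 1
    return counts


# Hypothetical sample: one host filesystem and two container root filesystems.
SAMPLE = """\
22 1 8:1 / / rw,relatime - ext4 /dev/sda1 rw
35 22 0:31 / /run/c1/rootfs rw,relatime - overlay overlay rw,lowerdir=/l/a:/l/b
36 22 0:32 / /run/c2/rootfs rw,relatime shared:5 - overlay overlay rw,lowerdir=/l/a:/l/c
"""

if __name__ == "__main__":
    counts = count_mounts_by_fstype(SAMPLE)
    print("total mounts:", sum(counts.values()))
    for fstype, n in sorted(counts.items()):
        print(fstype, n)
```

On a Linux node, passing the contents of /proc/self/mountinfo instead of SAMPLE would give a rough view of how many entries the table holds while containers are starting.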

Diagnosing Mount Table Lock Contention

Netflix engineers conducted in-depth analyses to pinpoint the bottleneck. Flamegraphs revealed that the container runtime, containerd, spent most of its time attempting to acquire kernel-level locks during mount-related operations. These locks are critical for assembling container root filesystems but introduced significant latency during scaling scenarios. The contention over these locks not only slowed down individual nodes but also affected overall system stability.
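As a rough illustration of how such a flamegraph finding can be quantified, the sketch below reads stack samples in the folded format commonly used as flamegraph input (frames joined by semicolons, followed by a sample count) and reports the share of samples whose stack contains a lock-related frame. The frame names in the sample are hypothetical and are not the kernel symbols Netflix observed.

```python
def lock_share(folded: str, markers: tuple[str, ...]) -> float:
    """Return the fraction of samples whose stack has a frame matching a marker."""
    total = 0
    matched = 0
    for line in folded.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "frame1;frame2;...;frameN <count>".
        stack, _, count_text = line.rpartition(" ")
        count = int(count_text)
        total += count
        frames = stack.split(";")
        if any(marker in frame for frame in frames for marker in markers):
            matched += count
    return matched / total if total else 0.0


# Hypothetical folded stacks; frame names are placeholders for illustration.
SAMPLE_STACKS = """\
containerd;do_mount;mount_lock_slowpath 70
containerd;do_mount;copy_tree 10
containerd;main;gc 20
"""

if __name__ == "__main__":
    share = lock_share(SAMPLE_STACKS, ("lock",))
    print(f"share of samples waiting on locks: {share:.0%}")
```

Substring matching is a deliberately crude choice here. A real analysis would match specific symbols so that unrelated frames are not counted.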

Another contributing factor was the high number of layers in container images. While layered images optimize storage and reuse, they also increase the complexity of mount operations. This complexity amplified lock contention, making the process inefficient at scale.
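The link between layer count and mount complexity can be sketched as follows, assuming the root filesystem is assembled with the Linux overlay filesystem (the article does not name the storage driver, so this is an assumption). The helper overlay_options and the directory paths are hypothetical. It builds the option string for an overlay mount, in which every image layer becomes one entry in the lowerdir list.

```python
def overlay_options(layer_dirs: list[str], upper: str, work: str) -> str:
    """Build overlay mount options from image layers given in build order.

    layer_dirs lists layers from the base layer to the top layer.
    overlayfs expects lowerdir entries top-most first, separated by ":".
    """
    if not layer_dirs:
        raise ValueError("at least one layer directory is required")
    lower = ":".join(reversed(layer_dirs))
    return f"lowerdir={lower},upperdir={upper},workdir={work}"


if __name__ == "__main__":
    # Hypothetical image with five layers.
    layers = [f"/layers/{i}" for i in range(5)]
    opts = overlay_options(layers, "/c1/upper", "/c1/work")
    print(opts)
    print("lowerdir entries:", len(layers))
```

The option string grows linearly with the number of layers. If a runtime also has to prepare each layer with its own mount before combining them, the number of mount operations per container grows with the layer count too.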

Systemd's Role in Performance Challenges

Stack analysis of systemd showed that it was heavily occupied with processing mount events during periods of high demand. This activity added to the delays and, in some cases, led to complete system lock-ups. Because systemd manages system processes and services, its inability to process mount events efficiently caused cascading failures that reached the container runtime.
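One way to see why mount event processing can become expensive is the back-of-the-envelope model below. It assumes a listener that re-reads the whole mount table after every new mount. That behavior is an assumption made for illustration, since the article does not describe how systemd consumes mount events. The function name rescan_cost is hypothetical.

```python
def rescan_cost(existing_mounts: int, new_mounts: int) -> int:
    """Total table entries parsed if the full table is re-read after each new mount."""
    total = 0
    table_size = existing_mounts
    for _ in range(new_mounts):
        table_size += 1       # one mount is added
        total += table_size   # the listener re-reads every entry
    return total


if __name__ == "__main__":
    for n in (100, 1000):
        print(n, "new mounts ->", rescan_cost(0, n), "entries parsed")
```

Under this model, ten times as many mounts costs roughly a hundred times as much parsing work, which is the kind of superlinear growth that can turn a burst of container starts into a stall.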

These findings underscored the need for architectural adjustments. Resolving the performance bottleneck required rethinking how mount events were processed and optimizing systemd's interactions with the container runtime.

Hardware-Level Observations on r5.metal Instances

The bottlenecks appeared disproportionately on r5.metal instances, which points to a hardware-level factor. These instances are optimized for high-performance workloads but showed limitations when handling container images with numerous layers. Hardware-specific constraints and software inefficiencies compounded each other, and this combination is what made the problem unusual.

By studying the performance characteristics of r5metal instances, engineers identified opportunities to adjust container runtime operations to better align with the underlying CPU architecture. This alignment was critical to achieving efficient scaling and maintaining responsiveness.

Implementing Solutions to Optimize Scaling

Netflix's engineering team implemented several solutions to address the identified bottlenecks. One approach involved streamlining mount operations to reduce contention for kernel-level locks. This included modifying the container runtime to handle layered images more efficiently and optimizing the interaction between systemd and containerd.

Additionally, engineers explored hardware-specific optimizations for r5.metal instances. By tailoring the container runtime to the characteristics of these instances, Netflix improved scaling efficiency and reduced system lock-ups.

Conclusion and Lessons Learned

Through rigorous analysis and targeted optimizations, Netflix overcame significant challenges in scaling containers on modern CPU architectures. The lessons learned from this experience highlight the importance of aligning software operations with hardware capabilities to achieve optimal performance. Netflix's efforts ensure that millions of users can enjoy seamless streaming experiences even during periods of high demand.
