State of Routing in Model Serving at Netflix

10 June 2026 by Suraj Barman

Netflix has developed a centralized machine learning (ML) model serving platform to enhance personalized user experiences at scale. This platform supports rapid iteration, seamless traffic routing, and model inference for various microservices. In this article, we explore the platform's architecture, focusing on its traffic routing capabilities and how it empowers innovation across Netflix's ML-driven applications.

Understanding Netflix's ML Model Serving Infrastructure

The ML model serving platform at Netflix serves as a unified entry point for model inference across multiple domains, such as title recommendations and commerce. The platform abstracts the complexities of ML inference, so client services and researchers can focus on innovation rather than on the underlying technical challenges.

This platform supports hundreds of distinct ML model types and versions, handling over 1 million requests per second as of 2025. It enables microservices at Netflix to request model inference without needing to understand the intricacies of ML workflows. This abstraction accelerates the deployment of both new and updated ML models, fostering rapid experimentation and scaling of personalized user experiences.

Core Challenges in Large-Scale ML Model Routing

One of the primary challenges in large-scale ML model serving is efficient traffic routing. Netflix's platform must direct each incoming request to the correct model instance on the appropriate cluster shard for the given user and use case. This process must balance scalability, accuracy, and simplicity to ensure optimal performance for all stakeholders.

The centralized system achieves this through domain-independent API abstractions. These APIs simplify integration for client microservices and let model researchers safely release updates or new versions without disrupting existing workflows. By providing a single API, the platform unifies traffic routing across diverse applications.
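The idea of resolving a use case to a model instance on a cluster shard can be illustrated with a minimal sketch. This is not Netflix's implementation: the `Route` and `Router` names, the use-case strings, the version labels, and the shard names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Where a use case is served. All fields are hypothetical labels."""
    model_id: str
    version: str
    shard: str


class Router:
    """Maps a use case to the model instance and shard that serve it."""

    def __init__(self) -> None:
        self._routes: dict[str, Route] = {}

    def register(self, use_case: str, route: Route) -> None:
        # Registering the same use case again replaces the previous route,
        # which is how a new model version would take over in this sketch.
        self._routes[use_case] = route

    def resolve(self, use_case: str) -> Route:
        try:
            return self._routes[use_case]
        except KeyError:
            raise LookupError(f"no route registered for use case {use_case!r}") from None


# Hypothetical routing table: clients name only the use case.
router = Router()
router.register("title-recommendations", Route("ranker", "v12", "shard-a"))
router.register("commerce", Route("plan-selector", "v3", "shard-b"))
```

In this sketch a client only ever passes a use-case name to `resolve`; the model version and shard stay internal to the router, so they can change without any client-side update.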

Netflix's Definition of Machine Learning Models

Netflix's approach to ML models is distinct from conventional definitions. At Netflix, an ML model is more than a function for scoring features; it represents a self-contained workflow. This workflow includes input transformation, feature extraction, and final scoring, encapsulating all necessary steps within a single model package.

This self-contained design simplifies the deployment and management of ML models, as each model carries all dependencies required for inference. Researchers and engineers can focus on improving the models without worrying about external dependencies or configurations, streamlining the entire lifecycle of ML development and deployment.
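A model defined as a self-contained workflow can be sketched as a single object that bundles the three stages named above. The `ModelPackage` class, the toy stage functions, and the `minutes_watched` field are assumptions made for this example, not part of the platform described here.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

RawInput = Mapping[str, object]
Features = dict[str, float]


@dataclass(frozen=True)
class ModelPackage:
    """Bundles input transformation, feature extraction, and scoring."""
    name: str
    transform_input: Callable[[RawInput], RawInput]
    extract_features: Callable[[RawInput], Features]
    score: Callable[[Features], float]

    def infer(self, raw: RawInput) -> float:
        # The caller sees one entry point; the stages run inside the package.
        cleaned = self.transform_input(raw)
        features = self.extract_features(cleaned)
        return self.score(features)


# Toy stages (hypothetical): clamp the input, derive one feature, cap the score.
def _transform(raw: RawInput) -> RawInput:
    return {"minutes_watched": max(0.0, float(raw.get("minutes_watched", 0)))}


def _extract(cleaned: RawInput) -> Features:
    return {"hours_watched": float(cleaned["minutes_watched"]) / 60.0}


def _score(features: Features) -> float:
    return min(1.0, features["hours_watched"] / 10.0)


toy_model = ModelPackage("toy-engagement", _transform, _extract, _score)
```

Because the caller only invokes `infer`, swapping in a package with different internal stages requires no change on the calling side, which is the property the self-contained design is meant to provide.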

Traffic Routing via Domain-Independent APIs

To address the challenge of traffic routing, Netflix employs a domain-independent API that acts as the gateway to the ML model serving platform. This API abstracts away complexities such as cluster management and model versioning, ensuring that client services can seamlessly request inferences for their specific use cases.

The API also facilitates controlled experimentation by allowing researchers to deploy new models or updates in a safe and scalable manner. By centralizing routing logic, Netflix ensures consistent performance and reliability across all services that depend on ML inference, regardless of their unique requirements.
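One common way to support controlled experimentation behind a central router is a deterministic traffic split. The sketch below assumes a hash-based bucketing scheme; the article does not say which mechanism Netflix uses, and the function names, the `experiment` key, and the `current`/`candidate` labels are hypothetical.

```python
import hashlib


def bucket(user_id: str, experiment: str, buckets: int = 100) -> int:
    """Deterministically map a user to a bucket in [0, buckets)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def choose_version(
    user_id: str,
    experiment: str,
    candidate_percent: int,
    current: str = "current",
    candidate: str = "candidate",
) -> str:
    """Send roughly candidate_percent of users to the candidate model."""
    if not 0 <= candidate_percent <= 100:
        raise ValueError("candidate_percent must be between 0 and 100")
    # Users whose bucket falls below the threshold get the candidate version.
    if bucket(user_id, experiment) < candidate_percent:
        return candidate
    return current
```

Hashing the experiment name together with the user ID keeps each user's assignment stable across requests while letting different experiments split the same population independently. Raising `candidate_percent` only moves users from the current version to the candidate, never the other way.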

Impact on Innovation and Scalability

The centralized ML serving platform at Netflix has significantly accelerated the pace of innovation. By simplifying traffic routing and model deployment, the platform enables researchers and engineers to iterate on new hypotheses rapidly. This capability is critical in a fast-paced environment where user preferences and behaviors continually evolve.

Additionally, the platform's ability to handle over 1 million requests per second demonstrates its scalability. This robustness ensures that Netflix can continue delivering high-quality, personalized experiences to its global user base while maintaining operational efficiency and technical simplicity.

Conclusion

Netflix's ML model serving platform exemplifies the power of centralized infrastructure in addressing complex challenges like traffic routing and model inference. By abstracting technical complexities and unifying workflows, the platform empowers innovation and ensures scalability, positioning Netflix as a leader in ML-driven personalization at scale.
