Netflix Ranker's serendipity scoring computes a "novelty" value for each candidate title by comparing its embedding against a viewer's watch history, enabling personalized row placement on the home screen.
Problem Overview
The original implementation performed a nested loop of candidate‑to‑history cosine calculations, leading to high CPU consumption and poor cache behavior, especially under mixed single‑ and batch‑request traffic.
- O(M × N) dot products per request, for M candidates and N history items, were computed one pair at a time.
- Repeated embedding fetches resulted in scattered memory accesses.
- Cache line misses amplified latency on high‑throughput nodes.
- Profiling highlighted Java dot‑product hotspots in the serendipity encoder.
- Batch requests represented roughly half of total processing volume despite being only 2 % of calls.
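The nested loop described above can be sketched as follows. This is a minimal illustration, not the production code: the class and method names are hypothetical, and the novelty formula (one minus the maximum cosine similarity to any history item) is an assumption, since the text does not define it.

```java
/** Hypothetical baseline: one cosine similarity per (candidate, history) pair. */
final class NaiveNovelty {

    /** Cosine similarity of two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Assumed definition: novelty = 1 - max similarity to any history item. */
    static double[] novelty(double[][] candidates, double[][] history) {
        double[] out = new double[candidates.length];
        for (int m = 0; m < candidates.length; m++) {        // M candidates
            double maxSim = -1.0;
            for (int n = 0; n < history.length; n++) {       // N history items
                maxSim = Math.max(maxSim, cosine(candidates[m], history[n]));
            }
            out[m] = 1.0 - maxSim;
        }
        return out;
    }
}
```

Each call to `cosine` recomputes both norms and walks two separately allocated arrays, which is where the M × N cost and the scattered memory accesses come from.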
Batching Transformation
Converting the per‑pair computation into a single matrix multiplication let the service use highly tuned linear‑algebra kernels instead of repetitive scalar operations.
- Candidate embeddings assembled into matrix A (M × D) and history embeddings into B (N × D).
- Rows normalized to unit length before multiplication to obtain cosine similarity directly.
- Matrix multiply (A × Bᵀ) produced an M × N similarity matrix in one step.
- Both encode() (single) and batchEncode() (batch) APIs retained backward compatibility.
- Initial canary runs revealed a modest performance regression, prompting deeper investigation.
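As a rough sketch of the transformation, the following normalizes the rows of A and B and then forms A × Bᵀ, so each entry of the result is a cosine similarity. The class and method names are hypothetical, and the plain triple loop stands in for whatever tuned kernel the service actually uses.

```java
/** Hypothetical sketch: cosine similarities as one matrix product A * B^T. */
final class BatchedSimilarity {

    /** Scales each row to unit length in place; assumes no all-zero rows. */
    static void normalizeRows(double[][] m) {
        for (double[] row : m) {
            double sumSq = 0.0;
            for (double v : row) sumSq += v * v;
            double inv = 1.0 / Math.sqrt(sumSq);
            for (int d = 0; d < row.length; d++) row[d] *= inv;
        }
    }

    /** Returns the M x N matrix A * B^T; with unit-length rows, entries are cosines. */
    static double[][] multiplyABt(double[][] a, double[][] b) {
        int m = a.length, n = b.length;
        double[][] sim = new double[m][n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                double dot = 0.0;
                for (int d = 0; d < a[i].length; d++) dot += a[i][d] * b[j][d];
                sim[i][j] = dot;
            }
        }
        return sim;
    }
}
```

Normalizing once per row moves the norm computation out of the inner loop: it runs M + N times instead of once for every one of the M × N pairs.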
Cache‑Friendly Memory Layout
Switching from a two‑dimensional double array to a flat row‑major buffer eliminated fragmented allocations and improved spatial locality.
- Flat double[] buffers hold candidate and history vectors contiguously.
- Thread‑local BufferHolder reuses buffers across requests, removing per‑request GC pressure.
- Buffers expand only when needed and never shrink, preserving capacity for burst traffic.
- Isolation per thread avoids contention without sacrificing memory efficiency.
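One way the thread‑local, grow‑only flat buffers could look is sketched below. Only the name BufferHolder comes from the text; its fields and the `current`, `candidates`, `history`, and `putRow` methods are assumptions made for illustration.

```java
/** Hypothetical sketch of a per-thread holder for flat, row-major buffers. */
final class BufferHolder {
    private static final ThreadLocal<BufferHolder> LOCAL =
            ThreadLocal.withInitial(BufferHolder::new);

    private double[] candidates = new double[0];
    private double[] history = new double[0];

    /** Returns the holder owned by the calling thread. */
    static BufferHolder current() {
        return LOCAL.get();
    }

    /** Buffer with at least rows * dim slots; grows when needed, never shrinks. */
    double[] candidates(int rows, int dim) {
        if (candidates.length < rows * dim) candidates = new double[rows * dim];
        return candidates;
    }

    /** Same policy for the history matrix. */
    double[] history(int rows, int dim) {
        if (history.length < rows * dim) history = new double[rows * dim];
        return history;
    }

    /** Copies one embedding into row r of a flat buffer (offset r * dim). */
    static void putRow(double[] flat, int r, int dim, double[] row) {
        System.arraycopy(row, 0, flat, r * dim, dim);
    }
}
```

Because a reused buffer may be longer than the current request needs and can hold stale values past `rows * dim`, callers in this sketch must carry the logical row count and dimension alongside the array.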
Vectorized Matrix Multiplication
Pure‑Java SIMD via the incubating JDK Vector API provided the necessary speed without native dependencies, complementing the flat‑buffer design.
- Expressed dot‑product loops as vector operations, allowing the JIT to emit SSE/AVX instructions.
- Fallback to scalar code ensures correctness on older CPUs.
- Benchmarks showed a 15 % reduction in latency compared with hand‑rolled loops.
- BLAS libraries were evaluated but rejected due to JNI overhead and integration complexity.
- For background on SIMD concepts, see the SIMD Wikipedia page.
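A dot product over two rows of a flat buffer might be written with the Vector API roughly as below. This is a sketch under stated assumptions: the class name `SimdDot` and its signature are hypothetical, and because the API is an incubator module it must be compiled and run with `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

/** Hypothetical sketch of a SIMD dot product over rows of flat buffers. */
final class SimdDot {
    // Widest vector shape the current CPU supports for doubles.
    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    /** Dot product of dim elements starting at aOff in a and bOff in b. */
    static double dot(double[] a, int aOff, double[] b, int bOff, int dim) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(dim);   // largest multiple of the lane count <= dim
        for (; i < upper; i += SPECIES.length()) {
            DoubleVector va = DoubleVector.fromArray(SPECIES, a, aOff + i);
            DoubleVector vb = DoubleVector.fromArray(SPECIES, b, bOff + i);
            acc = va.fma(vb, acc);            // acc = va * vb + acc, lane by lane
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < dim; i++) {                // scalar tail for leftover elements
            sum += a[aOff + i] * b[bOff + i];
        }
        return sum;
    }
}
```

With row‑major buffers, row i of A starts at offset `i * D` and row j of B at `j * D`, so entry (i, j) of the similarity matrix would be `SimdDot.dot(a, i * D, b, j * D, D)`.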
Deployment & Observability
After the optimizations, the service was rolled out gradually, with monitoring focused on CPU usage, GC metrics, and latency percentiles.
- CPU per node dropped from ~7.5 % to under 3 % for the Ranker service.
- Heap pressure decreased, leading to fewer GC pauses.
- 99th‑percentile latency improved by ~20 % across mixed traffic.
- Canary metrics fed into the same observability pipeline used for other Netflix microservices.
- Further reading on large‑scale deployment patterns is available in the AWS scalability guide.