How Netflix’s MediaFM Powers Multimodal Content Understanding

23 February 2026 by Suraj Barman

MediaFM is Netflix’s in‑house multimodal transformer that converts shot‑level audio, video, and text into unified embeddings for large‑scale content analysis.

Data Preparation and Shot Representation

Each movie or episode is split into shots using a shot‑boundary detector. Every shot is then embedded separately for video, audio, and text, and the three modality embeddings are combined into a single vector.

Apply shot boundary detection to segment titles.
Extract visual features with a pretrained CNN.
Generate audio embeddings via a spectrogram‑based encoder.
Produce subtitle/text embeddings using Netflix’s post‑training framework.
Concatenate and L2‑normalize to a 2304‑dimensional fused vector.
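
The fusion step above can be sketched in a few lines of NumPy. This is a minimal illustration, not Netflix's implementation: the function name fuse_shot_embeddings is hypothetical, and the per‑modality width of 768 is an assumption chosen only so that three modalities add up to the 2304 dimensions stated above. The source does not say how the 2304 dimensions are split across modalities.

```python
import numpy as np

def fuse_shot_embeddings(video, audio, text, eps=1e-12):
    """Concatenate per-shot modality embeddings and L2-normalize each row."""
    fused = np.concatenate([video, audio, text], axis=-1)
    norms = np.linalg.norm(fused, axis=-1, keepdims=True)
    # Guard against division by zero for an all-zero row.
    return fused / np.maximum(norms, eps)

rng = np.random.default_rng(0)
num_shots = 4
# ASSUMPTION: 768 dims per modality (3 x 768 = 2304); only the fused size is given.
video = rng.normal(size=(num_shots, 768))
audio = rng.normal(size=(num_shots, 768))
text = rng.normal(size=(num_shots, 768))

fused = fuse_shot_embeddings(video, audio, text)
print(fused.shape)  # (4, 2304)
```

Normalizing after concatenation gives every shot vector unit length, so a cosine‑based objective compares direction only and is not dominated by a modality that happens to produce larger raw values.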

Model Architecture

The core is a BERT‑style transformer that processes sequences of fused shot embeddings, enriched with a global title token.

Input: up to 512 ordered shot embeddings per sequence.
Special [GLOBAL] token carries title‑level metadata.
12‑layer transformer encoder with multi‑head attention.
Position embeddings capture temporal order.
Output: contextualized shot vectors ready for downstream probes.
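
To make the input layout concrete, the sketch below assembles one sequence the way the list describes: a [GLOBAL] token in front, up to 512 ordered shots, and a position embedding added at each slot. Several details are assumptions made for illustration: the helper name build_input_sequence, the choice to truncate long titles rather than window them, the [GLOBAL] token sitting outside the 512‑shot budget, and the model operating directly at the 2304‑dimensional fused width. The 12‑layer encoder itself is omitted.

```python
import numpy as np

MAX_SHOTS = 512   # sequence limit stated in the text
EMBED_DIM = 2304  # fused shot embedding size stated in the text

def build_input_sequence(shot_embeddings, global_token, position_table):
    """Prepend [GLOBAL], keep at most MAX_SHOTS shots, add position embeddings."""
    # ASSUMPTION: overly long titles are truncated; the source does not say.
    shots = shot_embeddings[:MAX_SHOTS]
    seq = np.concatenate([global_token[None, :], shots], axis=0)
    # Slot 0 belongs to [GLOBAL]; slots 1..N follow the temporal shot order.
    return seq + position_table[: seq.shape[0]]

rng = np.random.default_rng(0)
shots = rng.normal(size=(600, EMBED_DIM))        # a title with 600 shots
global_token = rng.normal(size=EMBED_DIM)        # stands in for title-level metadata
position_table = 0.02 * rng.normal(size=(MAX_SHOTS + 1, EMBED_DIM))

seq = build_input_sequence(shots, global_token, position_table)
print(seq.shape)  # (513, 2304)
```

In a trained model the global token and the position table would be learned or computed from metadata rather than sampled at random; random values are used here only to keep the example self‑contained.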

Training Objectives and Optimization

MediaFM learns via a self‑supervised Masked Shot Modeling (MSM) task, predicting masked shot embeddings.

Randomly mask 20% of shot vectors and replace with a learnable [MASK] token.
Loss: minimize cosine distance between predicted and original embeddings.
Optimizer mix: Muon for hidden layers, AdamW for the rest.
Training runs on multi‑GPU clusters with mixed‑precision arithmetic.
Early stopping based on validation MSM loss.
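
The masking and loss steps listed above can be illustrated with the following NumPy sketch. It is a simplified stand‑in, not the production training loop: the function names are hypothetical, the mask token is a fixed zero vector here although the text describes it as learnable, and masking an exact 20% of positions (rather than sampling each position independently) is an assumption.

```python
import numpy as np

def mask_shots(shots, mask_token, mask_ratio=0.2, rng=None):
    """Replace a random subset of shot vectors with the [MASK] token."""
    if rng is None:
        rng = np.random.default_rng()
    num_shots = shots.shape[0]
    # ASSUMPTION: mask an exact fraction of positions, at least one.
    num_masked = max(1, int(round(mask_ratio * num_shots)))
    masked_idx = rng.choice(num_shots, size=num_masked, replace=False)
    corrupted = shots.copy()
    corrupted[masked_idx] = mask_token
    return corrupted, masked_idx

def cosine_distance_loss(predicted, target, eps=1e-12):
    """Mean of (1 - cosine similarity) over the given rows."""
    pred_n = predicted / np.maximum(np.linalg.norm(predicted, axis=-1, keepdims=True), eps)
    targ_n = target / np.maximum(np.linalg.norm(target, axis=-1, keepdims=True), eps)
    return float(np.mean(1.0 - np.sum(pred_n * targ_n, axis=-1)))

rng = np.random.default_rng(0)
shots = rng.normal(size=(100, 64))   # 64 dims keeps the demo small
mask_token = np.zeros(64)            # placeholder for the learnable [MASK] vector

corrupted, masked_idx = mask_shots(shots, mask_token, rng=rng)
# A real model would predict from `corrupted`; here the ground truth stands in
# for a perfect prediction, so the loss over masked positions is about zero.
loss = cosine_distance_loss(shots[masked_idx], shots[masked_idx])
print(len(masked_idx))  # 20
```

Only the masked positions contribute to the loss in this sketch, which mirrors the usual masked‑modeling setup; whether MediaFM also scores unmasked positions is not stated in the text.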

Evaluation, Probes, and Business Applications

After training, frozen embeddings are assessed with linear probes on several production tasks.

Clip‑level popularity prediction using a single linear classifier.
Ad‑break relevance ranking by feeding contextual shot vectors.
Automatic tag generation for new titles.
Embedding a clip in context, that is, alongside its surrounding shots, improves probe accuracy over embedding the clip in isolation.
Results reported in internal performance dashboards.
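
As an illustration of what a linear probe on frozen embeddings involves, the sketch below fits a logistic‑regression classifier on fixed vectors using plain gradient descent. Everything in it is an assumption made for the example: the data is synthetic, the function names are hypothetical, and the probe used in production may be trained differently. The point it shows is that only the probe's weight vector and bias are updated while the embeddings stay untouched.

```python
import numpy as np

def train_linear_probe(embeddings, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe; the embeddings are never modified."""
    n, d = embeddings.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = embeddings @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels            # gradient of the mean log loss w.r.t. logits
        w -= lr * embeddings.T @ grad / n
        b -= lr * grad.mean()
    return w, b

def predict(embeddings, w, b):
    """Return hard 0/1 predictions from the probe."""
    return (embeddings @ w + b > 0).astype(int)

rng = np.random.default_rng(0)
frozen = rng.normal(size=(200, 32))      # synthetic stand-in for contextual shot vectors
true_w = rng.normal(size=32)
labels = (frozen @ true_w > 0).astype(float)  # synthetic binary task

w, b = train_linear_probe(frozen, labels)
accuracy = float((predict(frozen, w, b) == labels).mean())
print(f"train accuracy: {accuracy:.2f}")
```

Because the probe has so little capacity, its accuracy mostly reflects how much task‑relevant information is already linearly accessible in the embeddings, which is why probes are a common way to compare representations.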