MediaFM is Netflix’s in‑house multimodal transformer that converts shot‑level audio, video, and text into unified embeddings for large‑scale content analysis.
Data Preparation and Shot Representation
Each movie or episode is split into shots using a shot‑boundary detector. Visual, audio, and text embeddings are extracted for each shot, and the three modality embeddings are then combined into a single vector.
- Apply shot boundary detection to segment titles.
- Extract visual features with a pretrained CNN.
- Generate audio embeddings via a spectrogram‑based encoder.
- Produce subtitle/text embeddings using Netflix’s post‑training framework.
- Concatenate the three modality embeddings and L2‑normalize the result, yielding a 2304‑dimensional fused vector.
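
The fusion step above can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline: the per‑modality widths (768 each) are hypothetical and chosen only so that they sum to the stated 2304, and the sketch assumes normalization is applied once, after concatenation.

```python
import math

# Hypothetical per-modality widths; only the 2304 total comes from the text.
VIDEO_DIM, AUDIO_DIM, TEXT_DIM = 768, 768, 768


def l2_normalize(vec):
    """Scale vec to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse_shot(video_emb, audio_emb, text_emb):
    """Concatenate the three modality embeddings, then L2-normalize the result."""
    return l2_normalize(list(video_emb) + list(audio_emb) + list(text_emb))


# Placeholder inputs standing in for real encoder outputs.
fused = fuse_shot([0.1] * VIDEO_DIM, [0.2] * AUDIO_DIM, [0.3] * TEXT_DIM)
print(len(fused))  # 2304
```

Normalizing after concatenation keeps every fused vector on the unit sphere, which pairs naturally with a cosine‑based training loss.
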
Model Architecture
The core is a BERT‑style transformer that processes sequences of fused shot embeddings, enriched with a global title token.
- Input: up to 512 ordered shot embeddings per sequence.
- Special [GLOBAL] token carries title‑level metadata.
- 12‑layer transformer encoder with multi‑head attention.
- Position embeddings capture temporal order.
- Output: contextualized shot vectors ready for downstream probes.
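
As a rough illustration of how an input sequence might be assembled before it reaches the encoder, the sketch below prepends a [GLOBAL] vector, truncates to the 512‑shot cap, and assigns position ids. The function names are hypothetical, and two choices are assumptions rather than facts from the text: that [GLOBAL] sits at position 0 without counting toward the cap, and that longer titles are split into consecutive windows.

```python
MAX_SHOTS = 512  # sequence cap stated in the text


def build_input(shot_embeddings, global_embedding, max_shots=MAX_SHOTS):
    """Return (tokens, position_ids) for one sequence.

    Assumption: [GLOBAL] occupies position 0 and does not count toward the cap.
    """
    shots = list(shot_embeddings)[:max_shots]  # keep temporal order, truncate
    tokens = [global_embedding] + shots
    position_ids = list(range(len(tokens)))    # 0 for [GLOBAL], 1..n for shots
    return tokens, position_ids


def split_into_windows(shot_embeddings, max_shots=MAX_SHOTS):
    """Hypothetical helper: cut a long title into consecutive windows."""
    shots = list(shot_embeddings)
    return [shots[i:i + max_shots] for i in range(0, len(shots), max_shots)]


# Toy title with 600 one-dimensional "shots" and a placeholder [GLOBAL] vector.
toy_shots = [[float(i)] for i in range(600)]
tokens, position_ids = build_input(toy_shots, [-1.0])
windows = split_into_windows(toy_shots)
print(len(tokens), len(windows))  # 513 2
```
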
Training Objectives and Optimization
MediaFM is trained with a self‑supervised Masked Shot Modeling (MSM) objective: the model predicts the original embeddings of masked shots from the surrounding context.
- Randomly mask 20% of the shot vectors in each sequence, replacing each with a learnable [MASK] embedding.
- Loss: minimize cosine distance between predicted and original embeddings.
- Optimizer mix: Muon for hidden layers, AdamW for the rest.
- Training runs on multi‑GPU clusters with mixed precision.
- Early stopping based on validation MSM loss.
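
The masking and loss steps above can be sketched as follows. This is a simplified, framework‑free illustration: the helper names are hypothetical, a fixed zero vector stands in for the learnable [MASK] embedding, and the loss is assumed to be averaged over masked positions only.

```python
import math
import random

MASK_RATIO = 0.20  # masking rate stated in the text


def choose_masked_positions(num_shots, rng, ratio=MASK_RATIO):
    """Pick round(ratio * num_shots) distinct positions, at least one."""
    k = max(1, round(ratio * num_shots))
    return sorted(rng.sample(range(num_shots), k))


def apply_mask(shots, positions, mask_vector):
    """Replace the shot vectors at the given positions with mask_vector."""
    masked = set(positions)
    return [mask_vector if i in masked else s for i, s in enumerate(shots)]


def cosine_distance(a, b):
    """1 - cosine similarity, with a small floor to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm_a * norm_b, 1e-12)


def msm_loss(predicted, original, positions):
    """Mean cosine distance over masked positions (assumed reduction)."""
    total = sum(cosine_distance(predicted[i], original[i]) for i in positions)
    return total / len(positions)


rng = random.Random(0)
shots = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(10)]
positions = choose_masked_positions(len(shots), rng)
mask_vector = [0.0] * 8  # stand-in for the learnable [MASK] embedding
model_inputs = apply_mask(shots, positions, mask_vector)

# A perfect reconstruction yields a loss of (numerically) zero.
perfect_loss = msm_loss(shots, shots, positions)
```

In a real training loop, `predicted` would be the encoder's output at the masked positions, and the validation value of this loss would drive early stopping.
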
Evaluation, Probes, and Business Applications
After training, frozen embeddings are assessed with linear probes on several production tasks.
- Clip‑level popularity prediction using a single linear classifier.
- Ad‑break relevance ranking by feeding contextual shot vectors.
- Automatic tag generation for new titles.
- Embedding a clip in the context of its surrounding shots improves probe accuracy over embedding the clip in isolation.
- Results reported in internal performance dashboards.
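
To make the linear‑probe idea concrete, the sketch below fits a single logistic‑regression layer on top of fixed embeddings, which are never updated. It is a toy illustration only: the two‑dimensional vectors and binary labels are synthetic placeholders, not MediaFM outputs or Netflix data, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, clamped to avoid overflow in math.exp."""
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(embeddings, labels, epochs=200, lr=0.5):
    """Fit weights w and bias b by full-batch gradient descent on log loss.

    The embeddings are treated as frozen: only w and b are learned.
    """
    dim, n = len(embeddings[0]), len(embeddings)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the linear score is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Synthetic stand-ins for frozen embeddings and binary task labels.
frozen = [[1.0, 0.2], [0.8, -0.1], [0.9, 0.3],
          [-1.0, 0.1], [-0.7, -0.2], [-0.9, 0.4]]
labels = [1, 1, 1, 0, 0, 0]

w, b = train_linear_probe(frozen, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(frozen, labels)) / len(labels)
```

Because the probe has so little capacity, its accuracy mostly reflects how much task‑relevant information the frozen embeddings already carry.
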