What Is Backend Aggregation (BAG)?
Backend Aggregation (BAG) is a centralized, Ethernet‑based super‑spine layer that interconnects multiple spine‑fabric networks across data‑center buildings, regions, and even continents. In Meta’s Prometheus AI cluster, BAG serves as the aggregation point between regional L2 fabrics, such as Disaggregated Scheduled Fabric (DSF) and Non‑Scheduled Fabric (NSF), and the global backbone, providing petabit‑scale inter‑BAG bandwidth (on the order of 16‑48 Pbps per region pair).
Why BAG Is Critical for Gigawatt‑Scale AI Clusters
Building a cluster that can deliver a gigawatt of AI compute requires connecting tens of thousands of GPUs with predictable latency, high throughput, and strong fault tolerance. BAG addresses these needs by:
- Providing a single, high‑capacity aggregation tier that avoids oversubscribing the GPU‑to‑network path.
- Enabling modular expansion – new spine fabrics can be added without redesigning the entire network.
- Supporting resilient topologies (planar and spread) that limit the impact of link or switch failures.
- Facilitating uniform security (MACsec) and routing policies across disparate regions.
How BAG Is Designed and Deployed
Meta’s BAG implementation follows a set of repeatable design patterns that can be applied to any large‑scale AI cluster.
How BAG Connects Multiple Spine Fabrics
Each regional L2 fabric (DSF or NSF) terminates at a dedicated backend edge pod. Those edge pods attach to BAG switches using high‑speed 800 Gbps ports, creating a predictable oversubscription ratio (typically ~4.5:1 from L2 to BAG).
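The oversubscription figure is simply the ratio of L2‑facing capacity to BAG‑facing capacity on an edge pod. The sketch below shows that arithmetic; the function name and the port counts (288 down, 64 up) are hypothetical values chosen only to reproduce the ~4.5:1 ratio, not Meta's actual port allocation.

```python
def oversubscription_ratio(downlink_ports: int, uplink_ports: int,
                           downlink_gbps: float = 800.0,
                           uplink_gbps: float = 800.0) -> float:
    """Ratio of L2-facing capacity to BAG-facing capacity on an edge pod."""
    if uplink_ports <= 0:
        raise ValueError("uplink_ports must be positive")
    return (downlink_ports * downlink_gbps) / (uplink_ports * uplink_gbps)


# Hypothetical port counts, picked only to land on the ~4.5:1 figure.
ratio = oversubscription_ratio(downlink_ports=288, uplink_ports=64)
print(f"{ratio:.1f}:1")  # prints "4.5:1"
```

Adding BAG‑facing ports lowers the ratio; adding L2‑facing ports without matching uplinks raises it.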
How BAG Layers Interconnect Across Regions
Two primary topologies are used, chosen based on site size and fiber availability:
- Planar (direct‑match) topology: Each BAG switch links one‑to‑one with its counterpart in the same plane, which simplifies management but concentrates failure domains.
- Spread connection topology: Links are distributed across multiple BAG switches and planes, providing path diversity and higher resilience.
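The resilience difference between the two topologies can be illustrated with a toy model. This is a minimal sketch under simplifying assumptions: two regions named A and B, one BAG switch per plane, and a spread topology modeled as a full mesh (real deployments may spread across only a subset of planes). All names are hypothetical.

```python
from itertools import product


def planar_links(num_planes):
    """One link per plane: switch i in region A to switch i in region B."""
    return [(f"A{i}", f"B{i}") for i in range(num_planes)]


def spread_links(num_planes):
    """Every switch in region A links to every switch in region B."""
    return [(f"A{i}", f"B{j}")
            for i, j in product(range(num_planes), repeat=2)]


def reachable_peers(links, src, failed):
    """Remote switches src can still reach directly after failures."""
    return {b for a, b in links if a == src and b not in failed}


failed = {"B0"}  # assume one remote BAG switch is down
print(reachable_peers(planar_links(4), "A0", failed))          # empty set
print(sorted(reachable_peers(spread_links(4), "A0", failed)))  # B1, B2, B3
```

In the planar case the failure of B0 isolates A0 from the remote region, while the spread case keeps three direct paths at the cost of more links to cable and manage.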
Hardware and Routing Details
Key hardware components and routing mechanisms include:
- Modular chassis equipped with Jericho3 (J3) ASIC line cards, supporting up to 432 × 800 Gbps ports per chassis.
- Central “hub” BAG chassis, larger than the others, that aggregates spokes and long‑distance links of varying cable lengths while keeping buffer utilization efficient.
- eBGP with link‑bandwidth attributes for route selection, enabling Unequal Cost Multipath (UCMP) load balancing in proportion to each path's capacity.
- MACsec encryption on BAG‑to‑BAG links to meet security requirements.
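The UCMP idea can be sketched as follows: each next hop receives a share of traffic proportional to the bandwidth advertised for its path. This is an illustration only; the next‑hop names and bandwidth values are hypothetical, and a real switch would program hardware weights rather than compute fractions in software.

```python
from typing import Dict


def ucmp_weights(path_bandwidth_gbps: Dict[str, float]) -> Dict[str, float]:
    """Split traffic across next hops in proportion to advertised bandwidth."""
    total = sum(path_bandwidth_gbps.values())
    if total <= 0:
        raise ValueError("at least one path must advertise bandwidth")
    return {hop: bw / total for hop, bw in path_bandwidth_gbps.items()}


# Hypothetical next hops and advertised bandwidths (Gbps).
healthy = {"bag-1": 3200.0, "bag-2": 1600.0, "bag-3": 3200.0}
print(ucmp_weights(healthy))

# If bag-1 loses half of its links, its advertised bandwidth and share drop.
degraded = dict(healthy, **{"bag-1": 1600.0})
print(ucmp_weights(degraded))
```

With equal‑cost multipath the degraded path would still receive a third of the traffic; weighting by bandwidth shifts load toward the paths that still have capacity.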
Designing the Network for Resilience
Resilience is built into every layer of BAG:
- Port striping and IP addressing schemes that allow traffic to be shifted away from a failed switch without renumbering.
- Failure‑domain analysis at the BAG, data‑hall, and power‑distribution levels.
- Black‑hole mitigation techniques such as draining affected BAG planes and conditional route aggregation.
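Conditional route aggregation can be sketched as a simple rule: advertise a summary prefix only while enough of its more‑specific routes are still reachable, so that a switch does not attract traffic it would then drop. The sketch below is an assumption about how such a check might look; the coverage threshold, function name, and prefixes are hypothetical and do not come from Meta's implementation.

```python
import ipaddress


def should_advertise_aggregate(aggregate, reachable_prefixes,
                               min_fraction=1.0):
    """Return True if reachable more-specifics cover enough of the aggregate."""
    agg = ipaddress.ip_network(aggregate)
    inside = [net for net in map(ipaddress.ip_network, reachable_prefixes)
              if net.version == agg.version and net.subnet_of(agg)]
    # Collapse overlapping or adjacent prefixes so addresses count once.
    covered = sum(net.num_addresses
                  for net in ipaddress.collapse_addresses(inside))
    return covered / agg.num_addresses >= min_fraction


# Hypothetical addressing: four /24s behind one /22 summary.
all_up = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
print(should_advertise_aggregate("10.0.0.0/22", all_up))      # True
print(should_advertise_aggregate("10.0.0.0/22", all_up[:3]))  # False
```

When the check fails, the switch would withdraw the summary and advertise only the surviving more‑specific routes, which steers traffic for the lost prefix toward a healthy plane.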
Considerations for Long‑Distance Cabling
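Because BAG keeps the L2 edge close to the GPUs, shallow‑buffer NSF switches can be used locally. For longer BAG‑to‑BAG spans, deep‑buffer switches are required: lossless flow control such as Priority Flow Control (PFC) needs enough buffer headroom to absorb the data already in flight when a pause is signaled, and that amount grows with link distance.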
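A rough estimate shows why distance drives buffer depth. The sketch below approximates PFC headroom as the bytes in flight during one round trip plus two maximum‑size frames; it assumes about 5 µs of propagation delay per kilometer of fiber and ignores switch response time and interface latency, so treat the result as a lower bound. The function name, distances, and MTU are illustrative assumptions.

```python
FIBER_DELAY_US_PER_KM = 5.0  # approximate propagation delay in optical fiber


def pfc_headroom_bytes(link_gbps, distance_km, mtu_bytes=9216):
    """Approximate per-port, per-priority PFC headroom in bytes."""
    rtt_s = 2 * distance_km * FIBER_DELAY_US_PER_KM * 1e-6
    in_flight = link_gbps * 1e9 / 8 * rtt_s  # bytes sent during one RTT
    # Add two max-size frames: one already being sent at each end.
    return round(in_flight + 2 * mtu_bytes)


for km in (0.1, 10, 100):  # hypothetical span lengths
    mb = pfc_headroom_bytes(800, km) / 1e6
    print(f"{km:>5} km at 800 Gbps: ~{mb:.1f} MB of headroom")
```

Under these assumptions a 10 km span at 800 Gbps needs roughly 10 MB of headroom per port and lossless priority, about a hundred times more than a 100 m in‑building link, which is why shallow buffers suffice near the GPUs but not between distant BAG sites.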