RCCLX Open‑Source Collective Library for AMD GPUs – Context, Implementation, and Best Practices

24 February 2026 by

Suraj Barman

Context & History

RCCLX started as an internal project at Meta to address the growing need for high‑performance collective communication on AMD hardware. Building on the existing RCCL codebase, the team added support for AMD's MI300 and MI350 GPUs and introduced new features such as Direct Data Access (DDA) and low‑precision collectives. The library was released alongside the Torchcomms API to give developers a single interface across NVIDIA and AMD platforms, allowing rapid experimentation without changing application code. For a broader view of AMD's GPU strategy see the AMD Instinct Wikipedia page. The concept of DDA builds on ideas from Remote Direct Memory Access, enabling the GPU to read and write data directly in peer buffers.

Implementation & Best Practices

Before diving into specific features, follow this roadmap: install the compatible ROCm stack, compile RCCLX with the matching version, integrate the Torchcomms backend, enable optional features via environment variables, and validate performance with the provided benchmark suite. This sequence ensures a stable environment and lets you isolate the impact of each optimization.

Direct Data Access (DDA)

DDA reduces the number of memory copies required during AllReduce by allowing the collective to operate directly on the source buffers. When enabled, the library registers GPU memory with the transport layer, eliminating intermediate staging. To activate DDA, set RCCL_DDA_ENABLE=1 before launching the application. In practice, DDA delivers 10‑50% lower latency on decode‑phase workloads and a 10‑30% speedup on pre‑fill phases for MI300X GPUs. To maximize the benefit, developers should verify that their tensors are contiguous in memory, for example by checking tensor.is_contiguous() and calling tensor.contiguous() where needed.
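Because the flag has to be in place before the application starts, one way to apply it is to launch the workload as a child process with a modified environment. The sketch below assumes only the variable name given above; the helper names env_with_dda and run_with_dda are hypothetical, and the child process is a stand-in for a real launcher.

```python
import os
import subprocess
import sys


def env_with_dda(base_env=None):
    """Return a copy of the environment with the DDA flag switched on."""
    env = dict(os.environ if base_env is None else base_env)
    env["RCCL_DDA_ENABLE"] = "1"  # flag name taken from the text above
    return env


def run_with_dda(cmd):
    """Run `cmd` (a list of strings) in a child process that sees the flag."""
    return subprocess.run(
        cmd, env=env_with_dda(), capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    # Stand-in for a real launcher: a child Python process that echoes the flag.
    probe = [sys.executable, "-c", "import os; print(os.environ['RCCL_DDA_ENABLE'])"]
    print(run_with_dda(probe).stdout.strip())
```

Copying the environment instead of mutating os.environ keeps the parent process unchanged, which makes it easier to run a flagged and an unflagged job from the same script.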

Low‑Precision Collectives

Low‑precision (LP) collectives compress data to FP8 before transmission, achieving up to a 4:1 reduction in payload size relative to FP32 inputs. The library automatically decompresses on the receiving side while keeping the compute path in FP32 for stability. Enable LP collectives with RCCL_LOW_PRECISION_ENABLE=1. They work best for messages larger than 16 MB, where the communication cost dominates. When using BF16 tensors, the same flag provides noticeable throughput gains without sacrificing the accuracy required for most transformer inference workloads.
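The payload arithmetic behind these figures can be worked through in a few lines. This is a back-of-the-envelope sketch, not the library's logic: it ignores any scaling factors or metadata sent alongside the FP8 data, it treats 16 MB as 16 MiB, and it assumes the size guideline applies to the uncompressed message. The function name lp_payload is hypothetical.

```python
# Bytes per element for common tensor dtypes.
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1}

# Size guideline from the text above (assumed to mean MiB, applied to the source payload).
LP_THRESHOLD_BYTES = 16 * 1024 * 1024


def lp_payload(num_elements, src_dtype):
    """Estimate source and on-the-wire sizes when a message is sent as FP8."""
    source_bytes = num_elements * BYTES_PER_ELEMENT[src_dtype]
    wire_bytes = num_elements * BYTES_PER_ELEMENT["fp8"]
    return {
        "source_bytes": source_bytes,
        "wire_bytes": wire_bytes,
        "ratio": source_bytes / wire_bytes,
        "above_threshold": source_bytes > LP_THRESHOLD_BYTES,
    }


if __name__ == "__main__":
    for dtype in ("fp32", "bf16"):
        print(dtype, lp_payload(8 * 1024 * 1024, dtype))
```

Under these assumptions an FP32 message shrinks by 4:1 while a BF16 message shrinks by 2:1, which is why the headline ratio is stated as an upper bound.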

Integration with Torchcomms

Torchcomms abstracts the underlying communication library, letting you switch between NCCL, RCCL, and RCCLX with a single import. A minimal setup looks like this:

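import torch
import torchcomms

# Create a communicator on the RCCLX backend, bound to a single device.
comm = torchcomms.new_comm("rcclx", torch.device("hip"), name="my_comm")
print(f"Rank {comm.get_rank()} of {comm.get_size()}")

# Fill a tensor with this rank's ID, then sum it across all ranks.
tensor = torch.full((10, 20), comm.get_rank(), dtype=torch.float, device="hip")
comm.allreduce(tensor, torchcomms.ReduceOp.SUM)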

After confirming the basic all‑reduce works, run the benchmark script to collect baseline numbers, then repeat with the DDA and LP flags enabled to measure the improvement. For profiling, use tools that are compatible with ROCm.
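The baseline-then-flags comparison can be organized as a small configuration matrix. The sketch below uses the two environment variable names from this article and assumes that setting a flag to 0 disables the feature; the helper names flag_matrix and improvement_pct are hypothetical, and the actual benchmark command is left to the reader.

```python
import itertools

# Environment variable names as given in this article.
FLAGS = ("RCCL_DDA_ENABLE", "RCCL_LOW_PRECISION_ENABLE")


def flag_matrix():
    """Return (label, env_overrides) for every on/off combination of the flags."""
    configs = []
    for values in itertools.product("01", repeat=len(FLAGS)):
        overrides = dict(zip(FLAGS, values))
        enabled = [name for name, value in overrides.items() if value == "1"]
        label = "+".join(enabled) if enabled else "baseline"
        configs.append((label, overrides))
    return configs


def improvement_pct(baseline, candidate):
    """Latency reduction relative to the baseline, in percent (positive = faster)."""
    return (baseline - candidate) / baseline * 100.0


if __name__ == "__main__":
    for label, overrides in flag_matrix():
        print(label, overrides)
```

Running all four combinations, rather than only the baseline and the fully enabled case, makes it possible to attribute a change to DDA, to LP collectives, or to their interaction.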

Performance Evaluation

Meta’s internal tests used param‑bench rccl‑tests with 10 warm‑up and 100 measurement iterations. On MI300 with ROCm 6.4, DDA cut decode latency by up to 50%, while LP collectives increased throughput by roughly 35% for 32 MB messages. Scaling was near‑linear up to eight GPUs within a single node; beyond that point, the InfiniBand fabric became the limiting factor.
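The warm-up and measurement split can be illustrated with a generic timing harness. This is a minimal sketch of the methodology, not the benchmark tooling named above; the function name benchmark is hypothetical, and the default iteration counts mirror the 10 and 100 quoted in the text.

```python
import statistics
import time


def benchmark(fn, warmup=10, iters=100):
    """Time `fn` after discarding warm-up runs; returns summary stats in seconds."""
    # Warm-up runs absorb one-time costs such as lazy initialization and caching.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        # For GPU work, `fn` must block until the device has finished
        # (for example by synchronizing), or only launch overhead is timed.
        fn()
        samples.append(time.perf_counter() - start)

    return {
        "iters": len(samples),
        "mean_s": statistics.fmean(samples),
        "median_s": statistics.median(samples),
        "min_s": min(samples),
    }


if __name__ == "__main__":
    # CPU stand-in workload so the sketch runs without a GPU.
    print(benchmark(lambda: sum(range(10_000))))
```

Reporting the median alongside the mean helps when a few iterations are slowed by unrelated system activity.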

Key takeaways:

RCCLX adds AMD‑specific optimizations to the well‑known RCCL codebase.
DDA and low‑precision collectives are toggled via environment variables and provide measurable speedups.
Using Torchcomms ensures a consistent API across hardware vendors.
Benchmark with realistic workloads to confirm that accuracy remains within acceptable bounds.