The development of AI networks that connect tens of thousands of GPUs has become essential infrastructure for training models with hundreds of billions of parameters, such as Llama 3.1 405B. At ACM SIGCOMM 2024 in Sydney, Australia, Meta presented insights into the network architecture it has developed over recent years to support large-scale distributed AI training workloads. The paper, “RDMA over Ethernet for Distributed AI Training at Meta Scale,” details the design, implementation, and operation of one of the largest AI networks in the world.
As AI continues to proliferate, it brings forth new communication demands. In particular, distributed training places significant pressure on data center networking infrastructure. For example, a typical generative AI (GenAI) task may require tight coordination among tens of thousands of GPUs over several weeks. Building a reliable and high-performance network infrastructure capable of meeting this growing demand necessitates a fundamental reassessment of data center network design.
When Meta launched distributed GPU-based training, they opted to create specialized data center networks designed specifically for these GPU clusters. The choice fell on RDMA Over Converged Ethernet version 2 (RoCEv2) as the primary inter-node communication transport for most of their AI capacity.
Over time, Meta has successfully expanded its RoCE networks from initial prototypes to numerous clusters, each housing thousands of GPUs. These RoCE clusters facilitate a wide array of production distributed GPU training tasks, including ranking, content recommendation, content understanding, natural language processing, and GenAI model training.
Network Topology
Meta constructed a dedicated backend network tailored specifically for distributed training. This strategic decision allowed them to evolve and operate independently from the broader data center network. To support large language models (LLMs), the backend network was expanded to a data center scale, incorporating topology-awareness into the training job scheduler.
Network Separation
The training cluster utilizes two distinct networks: the frontend (FE) network handles tasks such as data ingestion, checkpointing, and logging, while the backend (BE) network is dedicated to training processes. A training rack connects to both FE and BE networks within the data center framework.
The FE comprises a hierarchy of network layers, including rack switches (RSWs), fabric switches (FSWs), and layers above, along with a storage warehouse that supplies GPUs with the input data required for training. Sufficient ingress bandwidth is provisioned at the rack switch level so that data ingestion does not hinder the training workload.
The BE features a specialized fabric that connects all RDMA NICs in a non-blocking architecture, delivering high bandwidth, low latency, and lossless transport between any two GPUs within the cluster regardless of their physical locations. This backend fabric employs the RoCEv2 protocol, encapsulating RDMA services within UDP packets for transportation across the network.
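To make the encapsulation concrete, the sketch below shows how a RoCEv2 packet nests the InfiniBand Base Transport Header (BTH) and RDMA payload inside a UDP datagram on the well-known destination port 4791, which is what lets ordinary routed switches forward the traffic. This is a simplified illustration, not Meta's code; the BTH field layout is abbreviated and not wire-accurate.

```python
# Minimal, illustrative sketch of RoCEv2 encapsulation: a simplified BTH plus
# RDMA payload carried as the payload of a UDP datagram over IP/Ethernet.
import struct

ROCEV2_UDP_DPORT = 4791  # IANA-assigned UDP destination port for RoCEv2


def build_bth(opcode: int, dest_qp: int, psn: int) -> bytes:
    """Pack a simplified 12-byte Base Transport Header."""
    flags, pkey = 0x40, 0xFFFF                 # illustrative defaults
    return struct.pack(
        "!BBHII",
        opcode,                                # operation type (value illustrative)
        flags,
        pkey,                                  # partition key
        dest_qp & 0x00FFFFFF,                  # 24-bit destination queue pair
        psn & 0x00FFFFFF,                      # 24-bit packet sequence number
    )


def rocev2_udp_payload(bth: bytes, rdma_payload: bytes) -> bytes:
    """UDP payload = BTH + RDMA payload (the trailing ICRC is omitted here)."""
    return bth + rdma_payload


packet = rocev2_udp_payload(build_bth(opcode=0x0A, dest_qp=0x12, psn=0), b"data")
print(len(packet), "bytes carried inside UDP dport", ROCEV2_UDP_DPORT)
```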
Evolution of AI Zones
The BE networks have transformed significantly over time. Initially, GPU clusters operated on a simple star topology with several AI racks linked to a central Ethernet switch utilizing the non-routable RoCEv1 protocol. This configuration presented limitations in GPU scalability and switch redundancy. Consequently, Meta transitioned swiftly to a fabric-based architecture to enhance scalability and availability.
An innovative two-stage Clos topology was designed for AI racks, referred to as an AI Zone. The rack training switch (RTSW) acts as the leaf switch, providing scale-up connectivity for the GPUs within each rack via copper-based DAC cables. The spine tier consists of modular cluster training switches (CTSWs), which provide scale-out connectivity among all racks within the cluster. RTSWs connect to CTSWs using single-mode fiber and 400G pluggable transceivers.

AI Zones are engineered to support numerous interconnected GPUs in a non-blocking manner; however, recent advancements in AI technologies such as LLMs require GPU scales larger than a single AI Zone can provide. To address this requirement, an aggregator training switch (ATSW) layer was added to link CTSWs across data center buildings, extending the RoCE domain beyond a single AI Zone.
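As a rough illustration of how a non-blocking two-stage Clos scales, the sketch below computes how many racks and GPUs a single zone could support for a given leaf and spine radix. The port counts are hypothetical and do not reflect Meta's actual RTSW/CTSW hardware.

```python
# Back-of-the-envelope sizing for a non-blocking two-stage (leaf-spine) Clos
# in the spirit of an AI Zone: RTSWs as leaves, CTSWs as spines.
# Port counts are hypothetical, not Meta's production configuration.

def non_blocking_zone(leaf_ports: int, spine_ports: int) -> dict:
    downlinks = leaf_ports // 2        # GPU-facing RTSW ports (scale-up)
    uplinks = leaf_ports - downlinks   # RTSW ports toward the CTSW tier
    # Non-blocking: per-leaf uplink bandwidth equals downlink bandwidth, with
    # one uplink to each spine, so the spine count equals the uplinks per leaf
    # and the number of leaves is bounded by the spine radix.
    return {
        "gpus_per_rack": downlinks,
        "spines_needed": uplinks,
        "max_racks": spine_ports,
        "max_gpus": spine_ports * downlinks,
    }

print(non_blocking_zone(leaf_ports=32, spine_ports=288))
```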
It is important to note that cross-AI Zone connectivity is intentionally oversubscribed by design while balancing network traffic through Equal-Cost Multi-Path (ECMP). To alleviate performance bottlenecks associated with cross-AI Zone traffic, enhancements were made to the training job scheduler to identify a “minimum cut” when distributing training nodes across different AI Zones, effectively minimizing cross-AI Zone traffic and reducing overall completion time.
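The sketch below captures that intuition with a simple greedy heuristic: pack a job's training nodes into as few AI Zones as possible so that most communication stays within a zone and the oversubscribed cross-zone links carry as little traffic as possible. It is a stand-in for the scheduler's minimum-cut logic, not Meta's implementation, and the zone names and capacities are made up.

```python
# Simplified sketch of topology-aware placement: prefer the zones with the
# most free capacity so a job spans as few AI Zones as possible, minimizing
# traffic over the oversubscribed ATSW layer. Not Meta's scheduler.

def place_job(nodes_needed: int, free_nodes_per_zone: dict[str, int]) -> dict[str, int]:
    placement: dict[str, int] = {}
    for zone, free in sorted(free_nodes_per_zone.items(), key=lambda kv: -kv[1]):
        if nodes_needed == 0:
            break
        take = min(free, nodes_needed)
        if take:
            placement[zone] = take
            nodes_needed -= take
    if nodes_needed:
        raise RuntimeError("not enough free capacity for this job")
    return placement

# Example: a 96-node job lands in two zones instead of being spread thinly.
print(place_job(96, {"zone_a": 64, "zone_b": 48, "zone_c": 16}))
```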
Routing Challenges
The scaling of compute power alongside evolving network topology raised questions regarding efficient balancing and routing of substantial training traffic. Notably, AI workloads exhibit several unique characteristics:
- Low Entropy: Compared to traditional data center workloads, AI workloads typically involve fewer flows with repetitive and predictable patterns.
- Burstiness: Flows often display an “on-and-off” nature at millisecond granularity.
- Elephant Flows: During a burst, a single flow can reach the full line rate of the NIC.
ECMP and Path Pinning
Meta initially considered ECMP, which places flows randomly across paths based on a five-tuple hash, but found it inadequate given the low flow entropy of training traffic. Instead, early deployments used a path-pinning scheme that routed packets over specific paths based on the destination slice (the index of the RTSW downlink). This worked well when each rack was fully assigned to a single job and the network was free of failures; however, partial rack allocations led to uneven traffic distribution and congestion on the uplinks.
To mitigate this temporarily, Meta doubled the RTSW uplink bandwidth, an expensive short-term fix that simply added capacity.
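The toy simulation below shows the underlying issue: with only a handful of elephant flows, a five-tuple hash can map several of them onto the same uplink while others are left lightly loaded. The addresses, ports, and uplink count are made up for illustration.

```python
# Toy simulation of why five-tuple ECMP struggles with low-entropy AI traffic:
# a rack running one job generates few, very large flows, so hash collisions
# on the uplinks are common. Numbers are illustrative only.
import zlib
from collections import Counter

NUM_UPLINKS = 8

def ecmp_uplink(five_tuple: tuple) -> int:
    """Pick an uplink by hashing the five-tuple, as a switch ECMP stage would."""
    key = ",".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) % NUM_UPLINKS

# Sixteen elephant flows between two hosts, differing only in source port.
flows = [("10.0.0.1", "10.0.1.1", 17, 49152 + i, 4791) for i in range(16)]
load = Counter(ecmp_uplink(f) for f in flows)
print(load)  # load is typically uneven: some uplinks carry several elephants
```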
Queue Pair Scaling
Meta then revisited ECMP, this time increasing the number of flows through queue pair (QP) scaling features in the collective library. The switches were configured for Enhanced ECMP (E-ECMP), which additionally hashes on the destination QP field of the RoCE packet. Together, these changes improved performance by up to 40% for collective operations such as AllReduce.
Two QP scaling strategies were evaluated: one split each message across multiple QPs, producing smaller messages and more flows, while the other posted whole messages to different queues in round-robin order. The latter proved more effective for the message sizes observed in production with NCCL.
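A minimal sketch of the round-robin idea follows; the class and method names are illustrative rather than the NCCL or verbs API. Each whole message is posted on the next QP in turn, so E-ECMP's hash over the destination QP spreads consecutive messages across different network paths.

```python
# Sketch of queue-pair scaling: instead of sending every message of a
# collective on one QP (one ECMP flow), whole messages are posted round-robin
# across several QPs. Names are illustrative, not a real RDMA API.
import itertools

class QPScaledSender:
    def __init__(self, queue_pairs: list[int]):
        # queue_pairs holds destination QP numbers established for this peer.
        self._rr = itertools.cycle(queue_pairs)

    def post_send(self, message: bytes) -> tuple[int, bytes]:
        """Post the whole message on the next QP in round-robin order."""
        qp = next(self._rr)
        return qp, message  # in real code this would be a verbs work request

sender = QPScaledSender(queue_pairs=[0x11, 0x12, 0x13, 0x14])
for i in range(8):
    qp, _ = sender.post_send(f"chunk-{i}".encode())
    print(f"message {i} -> QP {qp:#x}")
```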
Congestion Control Solutions
The transition to 400G deployments prompted efforts to retune Data Center Quantized Congestion Notification (DCQCN); however, firmware changes had introduced bugs that affected CNP counting and reduced visibility into congestion behavior, degrading performance. Meta therefore proceeded without DCQCN, relying solely on Priority Flow Control (PFC), and has observed stable collective performance over more than a year of operation in this mode.
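For context, the toy model below captures the lossless behavior that PFC provides on a switch queue: crossing an XOFF threshold pauses the upstream sender instead of dropping packets, and draining below XON resumes it. The thresholds are illustrative, not tuned production values.

```python
# Conceptual model of Priority Flow Control (PFC) on a lossless priority:
# pause upstream when the ingress buffer crosses XOFF, resume below XON.
# Threshold values are illustrative only.

class PfcQueue:
    def __init__(self, xoff_cells: int = 800, xon_cells: int = 600):
        self.depth = 0
        self.xoff, self.xon = xoff_cells, xon_cells
        self.paused = False

    def enqueue(self, cells: int):
        self.depth += cells
        if not self.paused and self.depth >= self.xoff:
            self.paused = True                 # emit PFC pause frame upstream

    def dequeue(self, cells: int):
        self.depth = max(0, self.depth - cells)
        if self.paused and self.depth <= self.xon:
            self.paused = False                # emit resume (pause time zero)

q = PfcQueue()
q.enqueue(1000)
print(q.paused)   # True: upstream is paused instead of packets being dropped
q.dequeue(500)
print(q.paused)   # False: queue drained below XON, traffic resumes
```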
In addition, a receiver-driven traffic admission mechanism was co-designed across the collective library and the RoCE transport to optimize performance by limiting the amount of in-flight traffic during congestion.
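A simplified credit-based sketch of the receiver-driven idea is shown below: the receiver grants a bounded number of credits, and the sender only puts a chunk on the wire while it holds one, which caps in-flight traffic during congestion. This is a conceptual model, not the actual collective-library or transport code.

```python
# Simplified credit-based model of receiver-driven traffic admission: the
# amount of in-flight data never exceeds the receiver's budget.
from collections import deque


class Receiver:
    def __init__(self, max_in_flight: int):
        self.credits = max_in_flight      # outstanding-chunk budget

    def grant(self) -> bool:
        """Hand a credit to the sender if the in-flight budget allows it."""
        if self.credits > 0:
            self.credits -= 1
            return True
        return False

    def chunk_delivered(self) -> None:
        self.credits += 1                 # a slot frees up once data lands


def send(chunks: deque, rx: Receiver, link_batch: int = 1):
    in_flight, delivered = deque(), []
    while chunks or in_flight:
        # Transmit only while the receiver has granted a credit.
        while chunks and rx.grant():
            in_flight.append(chunks.popleft())
        # Model the network delivering a batch of in-flight chunks.
        for _ in range(min(link_batch, len(in_flight))):
            delivered.append(in_flight.popleft())
            rx.chunk_delivered()
    return delivered


print(len(send(deque(range(10)), Receiver(max_in_flight=2))))  # -> 10
```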
Tuning this mechanism remains challenging: GPU threads contend for resources during concurrent operations, and channel buffer sizes must be balanced against the risk of under-utilization inherent in RoCE's flow-control behavior. Meta has nonetheless made progress through experimental parameter tuning and high-priority queuing at the switches.
Future Directions
As distributed AI workloads demand ever greater computational density and scale, and evolve rapidly alongside GenAI trends, Meta's design philosophy emphasizes a deep understanding of workload characteristics translated into coherent network component designs. This approach has proven instrumental in advancing its distributed AI training infrastructure while delivering reliable performance against increasingly complex requirements.
Through continued innovation and optimization of its RoCE networks, Meta aims not only to meet current demands but also to prepare proactively for future developments in artificial intelligence.