Building a GPU cluster capable of training modern large language models at enterprise scale is a complex engineering challenge that spans hardware selection, network design, storage architecture, and software orchestration. This guide walks through the critical decisions and tradeoffs involved in designing high-performance GPU infrastructure for production ML workloads.

Choosing the Right GPU Hardware

The GPU selection decision has profound implications for both training performance and total cost of ownership. The current generation of NVIDIA H100 GPUs represents a significant leap over the previous A100 generation, particularly for transformer-based models, due to their Transformer Engine with FP8 precision support and improved NVLink bandwidth.

For most enterprise training workloads in 2025, the H100 SXM5 form factor provides the best performance-per-dollar at scale. The SXM5 variant offers 900 GB/s NVLink 4.0 bandwidth — 50% more than the A100 SXM4's 600 GB/s — and 80 GB of HBM3 memory with 3.35 TB/s bandwidth. The PCIe variant trades about 15% of peak compute for lower per-server cost and greater flexibility in server selection.

However, the A100 80GB remains highly cost-effective for workloads that do not require the H100's FP8 capabilities or maximum NVLink bandwidth. Many organizations run mixed fleets, using H100 clusters for frontier model training and A100 clusters for fine-tuning and smaller-scale experiments.

Intra-Node Interconnect: NVLink and NVSwitch

Within a single server node, GPU-to-GPU communication happens over NVLink. A typical H100 DGX system contains eight H100 GPUs connected via NVSwitch, which provides an all-to-all bandwidth of 900 GB/s per GPU. This bandwidth is sufficient for most tensor parallelism configurations within a single node — enabling sharding of a model's weight matrices across 8 GPUs with minimal communication overhead.

The key insight here is that all-reduce operations within a single NVSwitch domain are dramatically faster than any inter-node communication. This means that the most performance-sensitive parallelism dimensions — those with the highest communication-to-computation ratios — should be assigned within the NVSwitch domain first. Tensor parallelism belongs within a single node; pipeline stages and data-parallel replicas can span nodes.
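This placement rule is typically encoded in how a framework maps global ranks to parallelism coordinates. The sketch below (plain Python, with illustrative tensor/pipeline/data-parallel sizes) follows the common convention, used by frameworks such as Megatron-LM, of making the tensor-parallel dimension the fastest-varying so that tensor-parallel peers always share a node's NVSwitch domain:

```python
# Sketch: map a global rank to (data, pipeline, tensor) parallel coordinates
# so that tensor-parallel peers land inside one node's NVSwitch domain.
# The tp/pp/dp sizes below are illustrative, not a recommendation.

GPUS_PER_NODE = 8

def parallel_coords(rank, tp=8, pp=4, dp=4):
    """Tensor parallelism varies fastest, so ranks 0-7 (one node) form a
    tensor-parallel group; pipeline and data parallelism span nodes,
    where the lower inter-node bandwidth is easier to tolerate."""
    assert tp <= GPUS_PER_NODE, "keep tensor parallelism inside one node"
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank

def node_of(rank):
    return rank // GPUS_PER_NODE

# All tensor-parallel peers of rank 0 (same dp and pp coordinates)
# turn out to be ranks 0-7, i.e. exactly the GPUs of node 0.
peers = [r for r in range(128) if parallel_coords(r)[:2] == parallel_coords(0)[:2]]
```

With this ordering, every all-reduce inside a tensor-parallel group stays on NVLink, while the less communication-intensive pipeline and data-parallel traffic crosses the inter-node fabric.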

Inter-Node Networking: InfiniBand vs RoCE vs Ethernet

When training jobs span multiple nodes, inter-node communication becomes the primary performance bottleneck for gradient synchronization and pipeline communications. There are three main networking options, each with distinct tradeoffs:

InfiniBand HDR/NDR provides the highest performance with native RDMA support, sub-microsecond latency, and reliable lossless transport. HDR InfiniBand delivers 200 Gb/s per port, while the newer NDR standard reaches 400 Gb/s. InfiniBand fat-tree topologies with non-blocking switches are the gold standard for distributed training at scale. The primary drawback is cost — InfiniBand infrastructure is significantly more expensive than Ethernet alternatives.

RoCE (RDMA over Converged Ethernet) v2 provides RDMA capabilities over standard Ethernet hardware, significantly reducing infrastructure cost while delivering near-InfiniBand latency when properly configured. RoCE requires careful network engineering — Priority Flow Control (PFC) must be configured to prevent packet drops that would trigger expensive RDMA retransmissions. ECN-based congestion control is also essential. When properly deployed, RoCE v2 at 400 Gb/s can approach InfiniBand performance at a lower infrastructure cost.

High-speed Ethernet (400G/800G) without RDMA can work for specific workload types, particularly when using NCCL with TCP transport. However, the additional CPU overhead and higher latency compared to RDMA options make it unsuitable for latency-sensitive, bandwidth-intensive distributed training at scale.
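On a RoCE v2 fabric, much of the "careful network engineering" surfaces as NCCL tuning on the hosts. The snippet below shows a few commonly adjusted NCCL environment variables; the variable names are real, but every value here is illustrative and site-specific, and should be validated with nccl-tests on your own hardware:

```python
import os

# Illustrative NCCL settings for a RoCE v2 fabric. The values are
# assumptions for a hypothetical deployment, not recommendations.
roce_env = {
    "NCCL_IB_HCA": "mlx5_0,mlx5_1",  # which RDMA NICs NCCL may use (site-specific)
    "NCCL_IB_GID_INDEX": "3",        # RoCE v2 commonly lives at GID index 3
    "NCCL_SOCKET_IFNAME": "eth0",    # interface for NCCL's bootstrap traffic
    "NCCL_IB_TC": "106",             # traffic class matched to the fabric's PFC/ECN QoS config
}
os.environ.update(roce_env)  # must be set before NCCL initializes
```

Because NCCL reads these at initialization time, they are usually exported by the job launcher rather than set inside the training script.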

Network Topology Considerations

The network topology has a direct impact on the all-reduce communication patterns used in data-parallel training. Fat-tree topologies built from non-blocking switches provide full bisection bandwidth — any half of the nodes can communicate with the other half at full line rate — which is ideal for all-to-all collective operations. However, they are expensive to scale beyond a few hundred nodes.
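To see why bisection bandwidth matters, consider the textbook bandwidth-only cost model for ring all-reduce, in which each GPU sends and receives 2(N-1)/N of the payload. A quick sketch (the 10 GB payload and 40 GB/s effective link rate are illustrative numbers, and the model deliberately ignores latency terms):

```python
def ring_allreduce_seconds(num_gpus, payload_bytes, bus_bandwidth_bytes_per_s):
    """Bandwidth-only cost model for ring all-reduce: each GPU moves
    2*(N-1)/N of the payload over its slowest link. Ignores latency,
    so it is a lower bound rather than a prediction."""
    n = num_gpus
    return 2 * (n - 1) / n * payload_bytes / bus_bandwidth_bytes_per_s

# Example: 10 GB of gradients across 512 GPUs at an effective 40 GB/s
# per link is bounded by roughly half a second per synchronization step.
t = ring_allreduce_seconds(512, 10e9, 40e9)
```

The model makes the engineering tradeoff concrete: the synchronization time is inversely proportional to the slowest link's bandwidth, so an oversubscribed tier that halves effective inter-node bandwidth roughly doubles every gradient synchronization.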

For clusters in the 256-2048 GPU range, a two-tier fat-tree with 400G InfiniBand provides excellent performance. Rail-optimized topologies, where specific network rails connect GPUs at the same physical position across all nodes, can further improve NCCL all-reduce performance by 15-30% for certain collective operation patterns.

Storage Architecture for ML Workloads

Storage is frequently overlooked in GPU cluster design but can become a significant bottleneck, particularly for training runs with large datasets. The key requirements are high throughput for sequential reads (data loading), moderate IOPS for checkpoint writes, and low latency for model checkpoint recovery.

Distributed parallel file systems like GPFS (IBM Spectrum Scale) and Lustre are the traditional choice for HPC and ML cluster storage. They can aggregate bandwidth across many storage nodes, providing hundreds of GB/s of sustained read throughput for data-hungry training jobs. Object storage (S3-compatible) is increasingly used for dataset storage and checkpoint archiving, with local NVMe SSDs on each compute node serving as fast scratch space for active training data.
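The tiering described above usually appears in training code as a stage-on-first-use pattern: the authoritative copy of each dataset shard lives in shared or object storage, and a local NVMe copy is created the first time a node touches it. A minimal sketch, in which the paths and the fetch mechanism are stand-ins (a real implementation would issue an S3 or parallel-file-system read rather than a local copy):

```python
import shutil
from pathlib import Path

def stage_shard(remote_path: Path, scratch_dir: Path) -> Path:
    """Return a local NVMe path for a dataset shard, fetching it on
    first use. remote_path and scratch_dir are illustrative; the copy
    below is a stand-in for an object-store or GPFS/Lustre read."""
    local = scratch_dir / remote_path.name
    if not local.exists():
        scratch_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(remote_path, local)  # fetch once, then serve from NVMe
    return local
```

Subsequent epochs then read at local NVMe speed instead of repeatedly pulling the same bytes across the storage network.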

Job Scheduling and Cluster Management

Kubernetes with GPU support (via NVIDIA GPU Operator) has become the dominant platform for ML cluster orchestration in enterprise environments. It provides a familiar API, strong multi-tenancy capabilities, and integration with cloud-native observability tools. However, vanilla Kubernetes lacks features critical for distributed ML — specifically, gang scheduling (ensuring all pods in a distributed job are scheduled simultaneously) and topology-aware placement (ensuring that GPUs within the same pipeline stage are placed on nodes with low-latency interconnects).

The Volcano scheduler and MCAD (Multi-Cluster Application Dispatcher) both add gang scheduling capabilities to Kubernetes. NVIDIA's Network Operator provides topology awareness based on InfiniBand fabric topology information. Slurm remains widely used in research and HPC environments for its mature job scheduling capabilities and tight integration with MPI-based workloads.
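The essence of gang scheduling is all-or-nothing admission: a distributed job starts only when every one of its workers can be placed, which prevents the deadlock where two jobs each hold half the GPUs the other needs. A toy illustration of that semantic (not Volcano's actual algorithm; node names and sizes are made up):

```python
def gang_schedule(free_gpus_per_node, workers, gpus_per_worker=8):
    """Toy gang scheduler: return a node assignment covering ALL workers,
    or None if the whole gang cannot be placed at once. A non-gang
    scheduler would instead start the workers that fit, stranding GPUs."""
    free = dict(free_gpus_per_node)  # work on a copy; commit only on success
    placement = []
    for _ in range(workers):
        node = next((n for n, g in free.items() if g >= gpus_per_worker), None)
        if node is None:
            return None  # all-or-nothing: admit nothing rather than a partial job
        free[node] -= gpus_per_worker
        placement.append(node)
    return placement
```

In Volcano this contract is expressed declaratively, via a PodGroup's minimum member count, rather than imperatively as here.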

Power and Cooling Requirements

A cluster of 1024 H100 GPUs draws approximately 2.8 megawatts of power under full load. This requires careful facility planning — dedicated power distribution infrastructure, high-capacity cooling systems, and backup power provisions. Modern H100 systems often deploy direct liquid cooling (DLC) to manage thermal density that would be prohibitive with air cooling alone.

Operational Best Practices

Beyond the initial hardware and software decisions, operational practices determine the long-term reliability of a GPU cluster. Automated health checks that validate GPU memory, NVLink bandwidth, and InfiniBand connectivity before assigning nodes to jobs are essential. Proactive node quarantine procedures prevent unhealthy nodes from disrupting active training runs. Comprehensive metrics collection covering GPU utilization, memory bandwidth, and network throughput enables capacity planning and early detection of hardware degradation.
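A pre-job health gate can be as simple as a script the scheduler runs before binding a node. The sketch below uses real nvidia-smi query fields, but which fields, thresholds, and follow-up checks (NVLink and InfiniBand bandwidth tests, for instance) matter is fleet-specific, and the 8-GPU expectation is an assumption:

```python
import subprocess

def query_gpu_health(run=subprocess.run):
    """Return one (index, uncorrected_ecc_errors) tuple per visible GPU.
    Raises if nvidia-smi fails, which itself signals an unhealthy node."""
    out = run(
        ["nvidia-smi",
         "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(int(v) for v in line.split(",")) for line in out.strip().splitlines()]

def node_is_healthy(gpu_stats, expected_gpus=8):
    """Quarantine the node if a GPU is missing from the bus or any GPU
    reports uncorrected ECC errors since the last reset."""
    return len(gpu_stats) == expected_gpus and all(e == 0 for _, e in gpu_stats)
```

A node failing this gate would be cordoned and ticketed rather than handed to a job, implementing the quarantine procedure described above.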