Training large language models at the scale of GPT-4, LLaMA 3, or Gemini requires distributing computation across hundreds or thousands of GPUs simultaneously. A single H100 GPU has 80 GB of HBM3 memory — nowhere near enough to hold the parameters, optimizer states, and activations for a 70-billion-parameter model, let alone a trillion-parameter one. Distributed training is not an optimization; it is a fundamental requirement.
The field has converged on several complementary parallelism strategies, each addressing a different bottleneck. Understanding when and how to combine them is one of the core engineering challenges of operating large-scale AI infrastructure. This article examines the four primary strategies — data parallelism, tensor parallelism, pipeline parallelism, and sequence parallelism — along with the practical tradeoffs that determine the optimal configuration for a given workload.
Data Parallelism: The Foundation
Data parallelism (DP) is the simplest and most widely used form of distributed training. Each worker holds a complete copy of the model and processes a different mini-batch of data. After the forward and backward passes, gradients are synchronized across all workers using an all-reduce collective operation, and each worker applies the same gradient update to its local model copy.
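The gradient-averaging step can be demonstrated on a single machine. The sketch below uses numpy arrays to stand in for per-GPU workers and a plain mean to stand in for the all-reduce; it shows that with equal-sized batch shards, the averaged local gradients equal the full-batch gradient, so every replica applies the identical update.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)                 # shared model: one linear layer's weights
X = rng.normal(size=(8, 4))            # global mini-batch of 8 samples
y = rng.normal(size=8)

def grad(w, Xb, yb):
    # Gradient of the mean-squared error 0.5 * mean((Xb @ w - yb)**2) w.r.t. w
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Data parallelism with 4 "workers": split the batch, compute local gradients,
# then "all-reduce" by averaging, so every worker ends with the same gradient.
local_grads = [grad(w, Xs, ys)
               for Xs, ys in zip(np.split(X, 4), np.split(y, 4))]
allreduced = np.mean(local_grads, axis=0)

# With equal shards, the averaged gradient matches the full-batch gradient.
assert np.allclose(allreduced, grad(w, X, y))
```

In a real framework the averaging is performed by NCCL's all-reduce across GPUs rather than a local `np.mean`, but the arithmetic is the same.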
The appeal of data parallelism is its simplicity — the model itself does not change; only the batch is split. NCCL's ring-allreduce algorithm achieves near-linear scaling efficiency for gradient synchronization, particularly when the communication can be overlapped with the backward pass computation. At 256 GPUs with a sufficiently large model and batch size, data parallelism alone can achieve 85-90% scaling efficiency.
The critical limitation is memory: every worker must hold the full model in GPU memory. For models above roughly 10-20 billion parameters, this becomes prohibitive even with the latest 80 GB GPUs. Microsoft's ZeRO (Zero Redundancy Optimizer), part of the DeepSpeed library, addresses this by partitioning optimizer states, gradients, and parameters across data-parallel workers. ZeRO Stage 3 reduces per-GPU memory requirements by a factor proportional to the number of data-parallel replicas, effectively enabling data parallelism for models far larger than any single GPU's memory capacity.
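A back-of-envelope estimate makes the savings concrete. Assuming mixed-precision Adam, a common accounting is roughly 16 bytes of model state per parameter (2 bytes fp16 parameters, 2 bytes fp16 gradients, and 12 bytes of fp32 optimizer state); ZeRO Stage 3 shards all of it across the data-parallel group. The figures below are an illustrative estimate only and exclude activations and temporary buffers:

```python
def zero3_memory_per_gpu_gb(n_params, dp_degree, bytes_per_param=16):
    """Rough per-GPU memory for model states under ZeRO Stage 3.

    Assumes mixed-precision Adam: 2 B fp16 params + 2 B fp16 grads
    + 12 B optimizer state (fp32 params, momentum, variance) = 16 B/param,
    all sharded across the data-parallel group. Activations excluded.
    """
    return n_params * bytes_per_param / dp_degree / 1024**3

# A 70B-parameter model carries roughly 1 TB of model states in total;
# sharded across 64 data-parallel replicas, that is ~16 GB per GPU.
print(round(zero3_memory_per_gpu_gb(70e9, 1)))   # unsharded total, in GB
print(round(zero3_memory_per_gpu_gb(70e9, 64)))  # per-GPU with DP=64
```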
Tensor Parallelism: Splitting Model Layers
Tensor parallelism (TP) splits individual tensor operations — specifically the large matrix multiplications in transformer attention and feed-forward layers — across multiple GPUs. Rather than replicating the entire model, each GPU in the tensor-parallel group holds a shard of each weight matrix and computes a partial result, which is then all-reduced or all-gathered to produce the full layer output.
In the Megatron-LM implementation developed by NVIDIA, self-attention layers are split along the attention head dimension, while MLP layers are split column-wise in the first linear layer and row-wise in the second. This pattern requires only two all-reduce operations per transformer layer in the forward pass (one after the attention block, one after the MLP), with two more in the backward pass: a critical optimization given the high bandwidth cost of frequent inter-GPU communication.
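The column-then-row split can be verified with plain numpy, using array slices to stand in for per-GPU weight shards and a sum to stand in for the all-reduce. The key property is that the nonlinearity after the column-parallel layer is applied locally with no communication, and a single reduction after the row-parallel layer reconstructs the exact unsharded result:

```python
import numpy as np

rng = np.random.default_rng(0)
d, ffn, tp = 8, 32, 4                       # hidden size, FFN size, TP degree
x = rng.normal(size=(2, d))                 # (tokens, hidden)
A = rng.normal(size=(d, ffn))               # first MLP weight (column-parallel)
B = rng.normal(size=(ffn, d))               # second MLP weight (row-parallel)
relu = lambda t: np.maximum(t, 0)

# Reference: the unsharded two-layer MLP.
ref = relu(x @ A) @ B

# Column-parallel first layer: each rank holds a slice of A's columns and can
# apply the nonlinearity locally. Row-parallel second layer: each rank holds
# the matching slice of B's rows and produces a partial output.
partials = [relu(x @ A_i) @ B_i
            for A_i, B_i in zip(np.split(A, tp, axis=1),
                                np.split(B, tp, axis=0))]
out = np.sum(partials, axis=0)              # the single all-reduce

assert np.allclose(out, ref)
```

Splitting the first layer row-wise instead would force a reduction before the nonlinearity, which is exactly what the Megatron layout avoids.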
Tensor parallelism is latency-sensitive: every layer requires synchronization between the TP group members. This makes tensor parallelism suitable primarily within a single NVSwitch-connected server node, where NVLink bandwidth (900 GB/s on H100 SXM) keeps synchronization costs manageable. Extending TP across nodes over InfiniBand introduces unacceptable communication overhead for all but the very largest models and most bandwidth-rich networks. Typical TP degrees are 4 or 8, matching the number of GPUs per node.
Pipeline Parallelism: Splitting Model Depth
Pipeline parallelism (PP) assigns different layers of the model to different pipeline stages, with each stage running on one or more GPUs. A micro-batch of data flows through stages sequentially: stage 1 computes the first N layers and passes activations to stage 2, which computes the next N layers, and so on. The backward pass runs in reverse.
Naive pipeline parallelism suffers from severe pipeline bubbles: periods when stages sit idle waiting for activations or gradients from other stages. GPipe addresses this with micro-batching, dividing the mini-batch into micro-batches that flow through the pipeline concurrently, while PipeDream's one-forward-one-backward (1F1B) schedule additionally interleaves forward and backward micro-batches to bound activation memory. Interleaved scheduling (Megatron-LM v2) further reduces the bubble fraction by assigning multiple non-contiguous layer chunks to each stage, trading memory for utilization.
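For a GPipe-style schedule with p stages and m micro-batches, the idle fraction follows the standard bubble formula (p - 1) / (m + p - 1), which shows directly why more micro-batches shrink the bubble:

```python
def bubble_fraction(p, m):
    """Idle fraction of a GPipe-style pipeline with p stages and
    m micro-batches: (p - 1) / (m + p - 1)."""
    return (p - 1) / (m + p - 1)

print(bubble_fraction(8, 1))    # no micro-batching: 7/8 of time is idle
print(bubble_fraction(8, 32))   # 32 micro-batches: ~18% bubble
print(bubble_fraction(8, 128))  # 128 micro-batches: ~5% bubble
```

The catch is that very large micro-batch counts require a very large global batch size, which is one reason interleaved schedules that attack p (the effective stage count) are attractive.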
Pipeline parallelism is well-suited for inter-node communication because the activations passed between pipeline stages are much smaller than the gradient tensors synchronized in data parallelism. A typical inter-stage communication volume for a 70B model might be 4 MB per micro-batch per layer boundary — manageable even over 200 Gb/s InfiniBand. Pipeline parallelism degrees of 8-64 are common for large-scale training runs.
3D Parallelism: Combining All Three Strategies
For the largest training runs, optimal hardware utilization requires combining data parallelism, tensor parallelism, and pipeline parallelism simultaneously — the approach known as 3D parallelism. Each GPU in the cluster belongs to one replica in each of the three parallelism dimensions. The total number of GPUs equals DP_degree × TP_degree × PP_degree.
The assignment of parallelism dimensions to network topology is critical for performance. Tensor parallelism goes within a single NVSwitch domain (intra-node). Pipeline parallelism goes between nodes within the same InfiniBand rail group, where the point-to-point bandwidth is highest. Data parallelism goes across rail groups, where the all-reduce traffic is more tolerant of lower bisection bandwidth.
As a concrete example: a 1024-GPU H100 training run for a 70B-parameter model might use TP=8 (within each node), PP=8 (across 8 nodes in each pipeline), and DP=16 (16 pipeline replicas). This fills a cluster of 8 × 8 × 16 = 1024 GPUs with each parallelism dimension matched to the appropriate network tier. Arriving at an optimal 3D configuration requires careful profiling and often an empirical search over the configuration space.
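One way to see how the three dimensions tile the cluster is to map a global rank to its (DP, PP, TP) coordinates. The sketch below assumes one common convention, with TP as the fastest-varying dimension so that consecutive ranks on the same node share a TP group; frameworks differ on this ordering, so treat it as illustrative:

```python
def rank_to_coords(rank, tp=8, pp=8, dp=16):
    """Map a global rank in [0, tp*pp*dp) to (dp, pp, tp) coordinates,
    assuming TP varies fastest (intra-node) and DP slowest (across rail
    groups). This ordering is a common convention, not the only one."""
    assert 0 <= rank < tp * pp * dp
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx

assert rank_to_coords(0) == (0, 0, 0)
assert rank_to_coords(7) == (0, 0, 7)    # same node, same TP group
assert rank_to_coords(8) == (0, 1, 0)    # next pipeline stage
assert rank_to_coords(64) == (1, 0, 0)   # next data-parallel replica
```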
Sequence Parallelism and Context Length Scaling
As context lengths grow from 4K to 128K tokens, activations for a single sequence no longer fit in a single GPU's memory — even within a tensor-parallel group. Sequence parallelism addresses this by distributing the sequence dimension across GPUs, partitioning the key-value cache and attention computation along the token axis.
Ring attention implementations (such as Ring Flash Attention) compute causal attention in a ring communication pattern: each GPU processes a segment of the sequence and incrementally folds in attention over remote key-value blocks as they are passed around the ring. This enables training with context lengths of 1 million tokens or more, at the cost of inter-GPU communication that grows with sequence length. For retrieval-augmented and long-context applications, sequence parallelism is becoming a fourth required dimension in distributed training configurations.
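The essential trick is that softmax attention can be accumulated one key-value block at a time using an online (running-max) softmax, so no device ever materializes the full score matrix. The sketch below simulates the ring on one machine with numpy: each "device" owns one query block, receives KV blocks as if passed around a ring, and the result matches dense causal attention exactly.

```python
import numpy as np

def ring_causal_attention(Q, K, V, n_blocks):
    """Causal attention accumulated block-by-block with an online softmax,
    simulating the per-device computation of ring attention."""
    seq, d = Q.shape
    blk = seq // n_blocks
    Qs, Ks, Vs = (np.split(M, n_blocks) for M in (Q, K, V))
    outs = []
    for i in range(n_blocks):                  # query block owned by device i
        m = np.full((blk, 1), -np.inf)         # running row-wise max
        l = np.zeros((blk, 1))                 # running softmax denominator
        acc = np.zeros((blk, d))               # running weighted value sum
        for step in range(n_blocks):           # KV blocks arriving via the ring
            j = (i - step) % n_blocks
            if j > i:                          # causal: skip future blocks
                continue
            s = Qs[i] @ Ks[j].T / np.sqrt(d)
            if j == i:                         # mask future tokens on the diagonal
                s = np.where(np.tril(np.ones((blk, blk))) == 1, s, -np.inf)
            m_new = np.maximum(m, s.max(axis=1, keepdims=True))
            p = np.exp(s - m_new)
            scale = np.exp(m - m_new)          # rescale previous partial results
            l = l * scale + p.sum(axis=1, keepdims=True)
            acc = acc * scale + p @ Vs[j]
            m = m_new
        outs.append(acc / l)
    return np.vstack(outs)

# Reference: dense causal attention with the full score matrix.
rng = np.random.default_rng(0)
seq, d = 16, 8
Q, K, V = rng.normal(size=(3, seq, d))
scores = Q @ K.T / np.sqrt(d)
scores = np.where(np.tril(np.ones((seq, seq))) == 1, scores, -np.inf)
w = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (w / w.sum(axis=1, keepdims=True)) @ V

assert np.allclose(ring_causal_attention(Q, K, V, n_blocks=4), ref)
```

On real hardware the inner loop overlaps the KV block transfer with the attention computation for the previously received block, which is what hides the ring's communication cost.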
Key Takeaways
- Data parallelism is the baseline strategy; ZeRO Stage 3 enables it for models much larger than a single GPU's memory capacity.
- Tensor parallelism splits individual matrix multiplications within a node, using NVLink for low-latency synchronization — typically limited to TP degree of 4 or 8.
- Pipeline parallelism distributes model depth across nodes, with careful micro-batch scheduling to minimize pipeline bubbles.
- 3D parallelism combines all three dimensions, with each assigned to the appropriate network tier for optimal bandwidth utilization.
- Sequence parallelism is becoming essential for long-context training, distributing the sequence dimension across GPUs in a ring pattern.
- Optimal parallelism configurations require empirical profiling — no single formula applies across all models and hardware configurations.
Conclusion
Distributed training for large language models is a multi-dimensional engineering problem. The right combination of parallelism strategies depends on model architecture, hardware topology, network bandwidth at each tier, and the specific bottlenecks of the target workload. As models continue to scale and context lengths grow, the sophistication of distributed training frameworks will continue to advance. Understanding the fundamental tradeoffs of each parallelism dimension equips ML infrastructure teams to reason systematically about training efficiency and make principled decisions as they scale from hundreds to thousands of GPUs.