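[DeepSpeed] 3D Parallelism Explained

DeepSpeed's 3D parallelism is an advanced distributed training strategy that combines data parallelism (DP), model parallelism (MP), and pipeline parallelism (PP) to train very large models (such as GPT-3, LLaMA, and Bloom) efficiently across multiple GPUs and nodes. By sharding model parameters, computation, and data across devices, it significantly lowers per-GPU memory requirements and improves training throughput, which makes it particularly suited to models at the hundred-billion-parameter scale.

What follows covers DeepSpeed 3D parallelism end to end: principles, configuration, code examples, optimization results, caveats, and practical use cases.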

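1. Principles of DeepSpeed 3D Parallelism

1.1 Basic concepts

3D parallelism trains efficiently at scale by combining three parallel strategies:

Data parallelism (DP):
- Shards the training data across GPUs so that each GPU processes a subset.
- Classic data parallelism keeps a full model replica on every GPU; DeepSpeed uses ZeRO (Zero Redundancy Optimizer) to partition optimizer states, gradients, and parameters (Stages 1, 2, and 3 respectively), removing redundant memory.
- Communication: gradients are synchronized with AllReduce or ReduceScatter.
Model parallelism (MP):
- Splits the model's computation graph across GPUs, either layer-wise (whole layers per GPU) or by tensor parallelism (TP), which shards the matrix operations inside a single layer.
- Each GPU holds only part of the model's parameters or computation, reducing per-GPU memory.
- Communication: point-to-point (Send/Recv) to pass activations, or AllReduce to combine tensor-parallel results.
Pipeline parallelism (PP):
- Splits the model into stages, each assigned to a different GPU; data flows through the stages in pipeline fashion.
- Each GPU stores only its own stage's parameters and activations, at the cost of passing intermediate results between stages.
- Communication: point-to-point transfer of activations (forward pass) and gradients (backward pass).

3D parallelism combines DP, MP, and PP, using careful sharding and communication optimization to train efficiently on large clusters.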

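1.2 How 3D parallelism works

Sharding strategy:
- DP: data is sharded across groups of GPUs (DP groups).
- TP: the computation of a single layer is sharded across GPUs (TP groups).
- PP: the model is split into stages across GPUs (PP groups).
Memory optimization:
- ZeRO (Stage 1/2/3) partitions optimizer states, gradients, and parameters, removing the memory redundancy of DP.
- TP and PP lower the per-GPU model memory through sharding.
Communication optimization:
- DP synchronizes gradients with AllReduce/ReduceScatter.
- TP combines tensor shards with AllReduce.
- PP passes activations between stages with Send/Recv.
- DeepSpeed relies on NCCL for communication and supports overlapping communication with computation.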
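
To make the "TP combines tensor shards with AllReduce" step concrete, here is a minimal sketch in plain Python. It is a toy simulation, not DeepSpeed or Megatron-LM code: the "ranks" are list indices, and the helper names (`matvec`, `all_reduce_sum`, `row_parallel_matvec`) are made up for this illustration. It assumes a row-parallel split, where each rank holds a slice of the weight rows and the matching slice of the input.

```python
# Toy simulation of row-parallel tensor parallelism on plain Python lists.
# No GPUs or DeepSpeed involved; "ranks" are just loop indices.

def matvec(x, w):
    """Compute x @ w for a vector x (length k) and a matrix w (k rows, n cols)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def all_reduce_sum(partials):
    """Element-wise sum across ranks; in a real run every rank receives this."""
    return [sum(vals) for vals in zip(*partials)]


def row_parallel_matvec(x, w, tp_size):
    """Split w by rows (and x to match) across tp_size ranks, then all-reduce."""
    k = len(x)
    assert k % tp_size == 0, "hidden size must divide evenly across TP ranks"
    shard = k // tp_size
    partials = []
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        partials.append(matvec(x[lo:hi], w[lo:hi]))  # local partial product
    return all_reduce_sum(partials)


x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, -1.0]]
full = matvec(x, w)                              # single-device reference
sharded = row_parallel_matvec(x, w, tp_size=2)   # two simulated TP ranks
print(full, sharded)  # both are [11.0, 4.0]
```

Each simulated rank stores only half of the weight rows, and the single AllReduce at the end is the communication cost that TP pays per sharded layer.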

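1.3 Advantages of 3D parallelism

Memory efficiency: with the model sharded by TP and PP, and ZeRO Stage 3 sharding the remainder across the DP group, per-GPU model memory falls from O(N) (N being the parameter count) toward O(N/(DP*TP*PP)).
Scalability: works from a single multi-GPU node up to clusters of hundreds of nodes, which suits very large models.
Flexibility: the DP, TP, and PP degrees can be tuned to match the model size and hardware.
DeepSpeed optimizations:
- Integrates with Megatron-LM's tensor-parallel implementation.
- Supports FP16/BF16 mixed-precision training.
- Offers offloading to CPU/NVMe to save further GPU memory.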
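
The memory claim above can be turned into a rough back-of-the-envelope estimate. The sketch below is an illustration under stated assumptions, not a DeepSpeed API: it assumes mixed-precision Adam with 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer state (the accounting used in the ZeRO paper), and it ignores activations, buffers, and fragmentation. The function name `model_state_bytes_per_gpu` is hypothetical.

```python
def model_state_bytes_per_gpu(num_params, dp, tp, pp, zero_stage):
    """Rough per-GPU bytes for weights, gradients, and Adam state.

    Assumes mixed-precision Adam (2 + 2 + 12 bytes per parameter) and an even
    split of parameters across TP and PP. Activations are not included.
    """
    local = num_params / (tp * pp)      # parameters held by one TP/PP shard
    params = 2.0 * local                # FP16 weights
    grads = 2.0 * local                 # FP16 gradients
    optim = 12.0 * local                # FP32 weights, momentum, variance
    if zero_stage >= 1:
        optim /= dp                     # Stage 1 partitions optimizer state
    if zero_stage >= 2:
        grads /= dp                     # Stage 2 also partitions gradients
    if zero_stage >= 3:
        params /= dp                    # Stage 3 also partitions parameters
    return params + grads + optim


# Example: a 10-billion-parameter model on 16 GPUs (DP=4, TP=2, PP=2).
for stage in (0, 1, 3):
    gb = model_state_bytes_per_gpu(10e9, dp=4, tp=2, pp=2, zero_stage=stage) / 1e9
    print(f"ZeRO stage {stage}: ~{gb:.1f} GB of model state per GPU")
```

Only at Stage 3 does the estimate reach the full N/(DP*TP*PP) scaling; at lower stages the DP dimension reduces just part of the model state.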

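1.4 GPU allocation

Total number of GPUs = DP_size * TP_size * PP_size.
For example, 8 GPUs can be configured as DP=2, TP=2, PP=2 (2 * 2 * 2 = 8).
Each GPU belongs to exactly one DP group, one TP group, and one PP group, and accordingly handles a data shard, a slice of each layer, and one pipeline stage.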
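
The group membership described above can be sketched for the 8-GPU example. This is an illustration only: the rank ordering (pipeline stage outermost, then data-parallel index, with the tensor-parallel index varying fastest) is one common layout and is an assumption here, as is the helper name `build_groups`. Real frameworks construct their process groups internally and may order the axes differently.

```python
from itertools import product


def build_groups(dp, tp, pp):
    """Return (coords, groups) for a pp x dp x tp grid, TP index varying fastest."""
    coords = {}
    for rank, (p, d, t) in enumerate(product(range(pp), range(dp), range(tp))):
        coords[rank] = {"pp": p, "dp": d, "tp": t}

    def group_by(axis):
        # Ranks that share the other two coordinates form one group along `axis`.
        buckets = {}
        for rank, c in coords.items():
            key = tuple(v for k, v in c.items() if k != axis)
            buckets.setdefault(key, []).append(rank)
        return sorted(buckets.values())

    return coords, {axis: group_by(axis) for axis in ("tp", "dp", "pp")}


coords, groups = build_groups(dp=2, tp=2, pp=2)
print("TP groups:", groups["tp"])  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print("DP groups:", groups["dp"])  # [[0, 2], [1, 3], [4, 6], [5, 7]]
print("PP groups:", groups["pp"])  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

With this ordering, TP partners are adjacent ranks, which usually places them on the same node, where the most frequent communication can use the fastest links.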

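2. Configuring DeepSpeed 3D Parallelism

DeepSpeed 3D parallelism is configured through a configuration file (ds_config.json) together with code-level APIs, usually in combination with Megatron-LM or Hugging Face Transformers. The steps are as follows.

2.1 Environment setup

Install DeepSpeed

pip install deepspeed torch transformers

Make sure the environment meets these requirements:

PyTorch (2.0+ recommended).
NVIDIA GPU (CUDA 11.0+).
NCCL (for distributed communication).
Optional: MPI (for multi-node training).

Verify the installation

ds_report

2.2 Configuration file (`ds_config.json`)

3D parallelism requires data parallelism (ZeRO), tensor parallelism, and pipeline parallelism to be configured together. A typical configuration file looks like this: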

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "zero_optimization": {
    "stage": 3,                        
    "allgather_partitions": true,
    "reduce_scatter": true,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "nvme"                
    }
  },
  "tensor_parallel": {
    "enabled": true,                   
    "tp_size": 4                      
  },
  "pipeline_parallel": {
    "enabled": true,                   
    "pp_size": 2,                     
    "micro_batches": 8                
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": [0.9, 0.999],
      "eps": 1e-8
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true
  },
  "steps_per_print": 100,
  "wall_clock_breakdown": true
}

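Configuration notes:

"zero_optimization.stage": selects the ZeRO stage (Stage 3 is recommended for very large models). DeepSpeed's pipeline engine works only with ZeRO Stage 0/1, so Stage 3 applies when pipeline parallelism is not handled by that engine.
"tensor_parallel.tp_size": the tensor-parallel degree (for example, 4 GPUs sharing each layer).
"pipeline_parallel.pp_size": the number of pipeline stages (for example, 2).
"pipeline_parallel.micro_batches": the number of micro-batches, which determines pipeline efficiency.
The "tensor_parallel" and "pipeline_parallel" blocks express the intended degrees; the exact key names depend on the DeepSpeed version and the framework in use. With DeepSpeed's own pipeline engine, for instance, the stage count is passed to PipelineModule(num_stages=...) and the micro-batch count is gradient_accumulation_steps. Check these keys against the configuration reference for your version.
"offload_optimizer" and "offload_param": offload to CPU/NVMe to save GPU memory.
"activation_checkpointing": reduces activation memory; useful for deep models.
"auto" values: these are filled in by the Hugging Face Trainer integration; when calling deepspeed.initialize directly, replace them with concrete numbers.
GPU allocation: make sure the total number of GPUs = DP_size * TP_size * PP_size.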
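
The batch-size fields in the configuration are tied to the parallel degrees: DeepSpeed requires train_batch_size to equal train_micro_batch_size_per_gpu * gradient_accumulation_steps * (number of data-parallel replicas). The sketch below checks that relation for the values in the file above, assuming a 16-GPU job; the helper names are hypothetical and this is not DeepSpeed code.

```python
def data_parallel_size(world_size, tp, pp):
    """Number of data-parallel replicas left after TP and PP take their share."""
    if world_size % (tp * pp) != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp={tp * pp}"
        )
    return world_size // (tp * pp)


def global_batch_size(micro_batch, grad_accum, world_size, tp, pp):
    """Samples consumed per optimizer step across the whole job."""
    return micro_batch * grad_accum * data_parallel_size(world_size, tp, pp)


# Values from the config above (micro-batch 4, accumulation 4, TP=4, PP=2),
# with an assumed world size of 16 GPUs, which leaves DP=2.
dp = data_parallel_size(world_size=16, tp=4, pp=2)
batch = global_batch_size(micro_batch=4, grad_accum=4, world_size=16, tp=4, pp=2)
print(f"DP={dp}, train_batch_size={batch}")  # DP=2, train_batch_size=32
```

Note that only the DP dimension multiplies the batch size: adding TP or PP GPUs makes each replica larger, not the batch.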

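2.3 Code implementation

3D parallelism requires adapting both the model and the training loop, usually with Megatron-LM or DeepSpeed's pipeline engine. The simplified example below shows the pipeline and data-parallel dimensions with DeepSpeed's PipelineModule; tensor parallelism has to come from the model's own layers (for example, a Megatron-LM model) and is not shown.

import deepspeed
import torch
import torch.nn as nn
from deepspeed.pipe import PipelineModule


class SimpleTransformerLayer(nn.Module):
    """Stand-in for a Transformer block: one linear layer followed by ReLU."""

    def __init__(self, hidden_size=512):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.linear(x))


class PipelineTransformer(PipelineModule):
    def __init__(self, num_layers=4, hidden_size=512, num_stages=2):
        layers = [SimpleTransformerLayer(hidden_size) for _ in range(num_layers)]
        super().__init__(
            layers=layers,
            num_stages=num_stages,       # pipeline degree (PP) is set here
            loss_fn=nn.MSELoss(),
            partition_method="uniform",  # same number of layers per stage
            activation_checkpoint_interval=0,
        )


# Initialize the process group before building the pipeline module.
deepspeed.init_distributed(dist_backend="nccl")

model = PipelineTransformer(num_layers=4, num_stages=2)

ds_config = {
    # train_batch_size = micro-batch size * accumulation steps * DP replicas
    "train_batch_size": 16,
    # For the pipeline engine this is also the number of micro-batches per step.
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    # The pipeline engine supports ZeRO Stage 0/1 only.
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config,
)

# Synthetic data: one micro-batch, repeated forever. FP16 to match the model.
micro_batch = model_engine.train_micro_batch_size_per_gpu()
data = torch.randn(micro_batch, 512, dtype=torch.half, device=model_engine.device)
labels = torch.randn(micro_batch, 512, dtype=torch.half, device=model_engine.device)


def repeat_batch():
    while True:
        yield data, labels


data_iter = repeat_batch()

for step in range(100):
    # train_batch pulls gradient_accumulation_steps micro-batches from the
    # iterator and runs forward, backward, and the optimizer step.
    loss = model_engine.train_batch(data_iter=data_iter)
    if step % 10 == 0 and model_engine.global_rank == 0:
        print(f"Step {step}, Loss: {loss.item():.4f}")

Key points:

Distributed initialization: deepspeed.init_distributed sets up NCCL communication.
Pipeline model: PipelineModule defines the pipeline stages and assigns layers to GPUs automatically; num_stages sets the pipeline degree.
Parallelism settings:
- zero_optimization.stage=1: ZeRO Stage 1 data parallelism, since the pipeline engine does not support Stage 2 or 3.
- num_stages=2: two-stage pipeline parallelism.
- Tensor parallelism is not configured in this example; it requires a model whose layers are themselves tensor-parallel.
Micro-batches: gradient_accumulation_steps controls how many micro-batches each train_batch call pipelines; tune it experimentally.
Communication: DeepSpeed handles the DP and PP communication (AllReduce, Send/Recv) automatically.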

2.4 Using Hugging Face Transformers

The Hugging Face Trainer integrates with DeepSpeed through its deepspeed argument. Out of the box this covers ZeRO data parallelism; pipeline parallelism requires the model to be adapted. An example:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset


model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("glue", "mrpc")


def tokenize_function(examples):
    # MRPC is a sentence-pair task, so both sentences are encoded together.
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length", max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]


training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    deepspeed="ds_config.json",  # path to the DeepSpeed config file
    logging_dir="./logs",
    logging_steps=10
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)


trainer.train()

Note: Hugging Face supports ZeRO data parallelism by default, but tensor parallelism and pipeline parallelism need model support (such as Megatron-LM or a custom PipelineModule).

2.5 Launching training

Launch with the DeepSpeed command line:

deepspeed --num_gpus 8 train.py --deepspeed_config ds_config.json

Multi-node launch

deepspeed --num_nodes 2 --num_gpus 8 --hostfile hostfile.txt train.py

Make sure the total number of GPUs = DP_size * TP_size * PP_size (for example, 16 GPUs can be configured as DP=4, TP=2, PP=2).

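3. Optimization Results of 3D Parallelism

The figures below are indicative results for DeepSpeed 3D parallelism, drawn from public cases and experiments.

3.1 Memory optimization

Scenario: training a 10-billion-parameter Transformer model.
Hardware: 16 NVIDIA A100 40GB GPUs (2 nodes).
Results:
- PyTorch DDP: out of memory (OOM) on a single GPU.
- DeepSpeed 3D parallelism (DP=4, TP=2, PP=2, ZeRO Stage 3): per-GPU memory drops to ~15GB and training succeeds.
- Memory savings: about 80%.

3.2 Training speed

Scenario: training GPT-3 (175 billion parameters).
Hardware: 128 A100 GPUs (16 nodes).
Results:
- PyTorch (no parallelism): cannot run.
- DeepSpeed 3D parallelism (DP=16, TP=4, PP=2): ~200 samples per second, cutting training time from months to weeks.
- Speedup: depends on communication optimization and on overlapping communication with computation.

3.3 Distributed scaling

Scenario: training Bloom (176 billion parameters).
Hardware: 256 A100 GPUs (32 nodes).
Results:
- Uses 3D parallelism (DP=32, TP=4, PP=2) + ZeRO Stage 3.
- Communication overhead reduced by ~50%, with training efficiency close to linear scaling.
For reference, the published BLOOM training run used 384 A100 80GB GPUs with TP=4, PP=12, DP=8, and ZeRO Stage 1.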

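4. 3D Parallelism Compared with Other Strategies

Strategy	Memory requirement	Communication overhead	Best suited for	DeepSpeed support
Data parallelism (DP)	High (full model replica)	Medium (AllReduce of gradients)	Small models, large datasets	Yes (with ZeRO)
Model parallelism (MP)	Low (sharded model)	High (activations/AllReduce)	Large models that exceed single-GPU memory	Yes (layer-wise + tensor parallelism)
Pipeline parallelism (PP)	Medium (sharded by stage)	Medium (inter-stage communication)	Deep models spanning GPUs	Yes (combined with model parallelism)
3D parallelism (DP+MP+PP)	Very low (sharded along several dimensions)	High (several kinds of communication)	Very large models, cluster training	Yes (full support)

Recommendations:

Small models (under about 100M parameters): data parallelism + ZeRO Stage 1/2.
Medium models (100M to 1B parameters): data parallelism + tensor parallelism + ZeRO Stage 2.
Very large models (over 1B parameters): 3D parallelism (DP + TP + PP + ZeRO). DeepSpeed's pipeline engine works with ZeRO Stage 0/1 only, so Stage 3 applies when pipeline parallelism is not in use.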

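5. Practical Tips for Configuring 3D Parallelism

5.1 Choosing the parallel degrees

DP_size: depends on the data volume and GPU count; 4 to 32 is recommended.
TP_size: usually 2, 4, or 8 (a power of two), matched to the compute needs of a single layer.
PP_size: depends on model depth; 2 to 8 stages is recommended.
GPU allocation: make sure DP_size * TP_size * PP_size = total number of GPUs.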
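
These constraints can be enumerated mechanically before picking a layout by hand. The sketch below lists every (DP, TP, PP) triple whose product equals the GPU count, under two assumptions that go beyond the text: TP is a power of two no larger than the GPUs in one node (a common heuristic, since TP communicates inside every layer), and PP is capped at 8 stages. The function name `candidate_layouts` is hypothetical.

```python
def candidate_layouts(world_size, gpus_per_node, max_pp=8):
    """List (dp, tp, pp) triples with dp * tp * pp == world_size.

    Assumptions: tp is a power of two that fits within one node, and pp is
    at most max_pp. These are heuristics, not DeepSpeed requirements.
    """
    layouts = []
    tp = 1
    while tp <= gpus_per_node:
        for pp in range(1, max_pp + 1):
            if world_size % (tp * pp) == 0:
                layouts.append((world_size // (tp * pp), tp, pp))
        tp *= 2
    return layouts


layouts = candidate_layouts(world_size=16, gpus_per_node=8)
for dp, tp, pp in layouts:
    print(f"DP={dp:2d}  TP={tp}  PP={pp}")
```

The enumeration only narrows the search; which candidate is fastest still depends on model shape and interconnect, so the shortlisted layouts need to be measured.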

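5.2 Optimizing communication

Enable the high-bandwidth network:

export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=ib0
export NCCL_ALGO=Tree

Communication compression (communication_data_type is a documented DeepSpeed option; confirm that your version accepts compress_communication before relying on it):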

{
  "communication_data_type": "fp16",
  "compress_communication": true
}

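NVLink: set NCCL_P2P_DISABLE=0.

5.3 Mixed-precision training

5.4 Offloading to CPU/NVMe

5.5 Micro-batch tuning

Choose a suitable micro_batches value (usually 4 to 16) to balance pipeline efficiency against memory.
Adjust experimentally to maximize GPU utilization.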
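
A quick way to reason about the trade-off is the pipeline "bubble", the share of each step that stages spend idle while the pipeline fills and drains. The sketch below uses the standard idealized estimate (p - 1) / (m + p - 1) for p stages and m micro-batches, which assumes equal-cost stages and ignores communication time; it is an approximation, not a measurement of DeepSpeed's scheduler, and `bubble_fraction` is a name made up for this example.

```python
from fractions import Fraction


def bubble_fraction(pp_size, micro_batches):
    """Idealized idle fraction of a pipeline step: (p - 1) / (m + p - 1)."""
    if pp_size < 1 or micro_batches < 1:
        raise ValueError("pp_size and micro_batches must be positive")
    return Fraction(pp_size - 1, micro_batches + pp_size - 1)


for m in (4, 8, 16):
    frac = bubble_fraction(pp_size=4, micro_batches=m)
    print(f"PP=4, micro_batches={m:2d}: bubble = {frac} (~{float(frac):.0%})")
```

More micro-batches shrink the bubble, but each one adds activations that must be kept until its backward pass, which is why the value is a balance rather than "as large as possible".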

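5.6 Debugging memory problems

Monitor memory: use deepspeed.runtime.utils.see_memory_usage() or torch.cuda.memory_allocated().
Resolving OOM:
- Increase TP_size or PP_size.
- Enable ZeRO Stage 3 and offloading (not combinable with DeepSpeed's pipeline engine).
- Lower train_micro_batch_size_per_gpu.

5.7 Communication monitoring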

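6. Caveats and Limitations

6.1 Communication overhead

3D parallelism involves several kinds of communication (AllReduce, Send/Recv), so a high-bandwidth network such as InfiniBand is essential.
Low-bandwidth environments can become a performance bottleneck.

6.2 Model adaptation

The model must support sharding and pipelining (Transformer architectures are a good fit).
Custom models need to be adapted to PipelineModule or to Megatron-LM's conventions.

6.3 Hardware requirements

A large number of GPUs (typically 16+) and a fast interconnect (InfiniBand/NVLink) are needed.
Offloading to NVMe requires high-performance storage.

6.4 Debugging complexity

Communication errors (such as a failed AllReduce) are hard to localize.
Set NCCL_DEBUG=TRACE and DEEPSPEED_LOG_LEVEL=DEBUG to investigate.

6.5 Configuration complexity

The DP, TP, and PP degrees have to be coordinated so that GPUs are allocated correctly.
Start from the official DeepSpeed examples and adjust step by step.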

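7. Frequently Asked Questions

How do I choose the DP, TP, and PP degrees?
- Base the choice on model size, GPU count, and network bandwidth.
- Example: for a 10-billion-parameter model on 16 GPUs, try DP=4, TP=2, PP=2.
What if communication overhead is high?
- Enable InfiniBand (NCCL_IB_DISABLE=0) and NVLink (NCCL_P2P_DISABLE=0).
- Set NCCL_ALGO=Tree and use communication compression.
Why am I running out of memory?
- Increase TP_size or PP_size.
- Enable ZeRO Stage 3 and offloading ("offload_param": {"device": "nvme"}); this is not combinable with DeepSpeed's pipeline engine.
- Lower train_micro_batch_size_per_gpu.
How do I debug pipeline parallelism?
- Enable wall_clock_breakdown and NCCL_DEBUG=INFO.
- Check that micro_batches and pp_size suit the model depth.
Which models is 3D parallelism suited to?
- Very large models (over 1B parameters, such as GPT and LLaMA).
- It requires a large cluster (16+ GPUs).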

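8. Advanced Usage

8.1 Using Megatron-LM

Megatron-LM models support 3D parallelism. The snippet below is schematic: the actual GPTModel constructor is driven by Megatron-LM's own arguments and configuration objects rather than these keyword arguments, and ds_config stands for a DeepSpeed configuration dictionary.

from megatron.model import GPTModel
model = GPTModel(num_layers=24, hidden_size=1024, num_attention_heads=16)
model_engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)

8.2 Varying the parallel degrees

To compare parallel degrees, sweep them from a script. The loop below is schematic: process groups and pipeline stages are fixed once training is initialized, so in practice each setting is usually launched as a separate run.

import json

with open("ds_config.json") as f:
    ds_config = json.load(f)

for tp_size, pp_size in [(2, 2), (4, 2)]:
    ds_config["tensor_parallel"]["tp_size"] = tp_size
    ds_config["pipeline_parallel"]["pp_size"] = pp_size
    model_engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)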

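8.3 Performance profiling

Use the PyTorch Profiler:

from torch.profiler import profile, ProfilerActivity

# data_iter: an iterator that yields (inputs, labels) micro-batches.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model_engine.train_batch(data_iter=data_iter)
print(prof.key_averages().table(row_limit=10))

8.4 Checkpoint management

Save and load 3D-parallel checkpoints:

model_engine.save_checkpoint("checkpoint_dir")
model_engine.load_checkpoint("checkpoint_dir")

Original article. Please credit when reposting: Author: JiangYuan, URL: https://www.icnma.com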