# 13 - Distributed Training

This module covers parallelism strategies and optimization techniques for training large-scale models.

## Module Structure

```
13-distributed-training/
├── 01-data-parallel/           # Data Parallel
├── 02-model-parallel/          # Model Parallel
├── 03-mixed-precision/         # Mixed Precision
└── 04-large-scale-training/    # Large-Scale Training
```

## Core Content

### 01 - Data Parallel

| Technology | Description | Use Case |
| --- | --- | --- |
| DDP | Distributed Data Parallel | Multi-GPU |
| FSDP | Fully Sharded Data Parallel | Large models |
| ZeRO | Zero Redundancy Optimizer | Memory optimization |
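As a minimal sketch of the DDP workflow: each rank holds a full model replica, and `backward()` transparently all-reduces gradients across ranks. The single-process `gloo` setup below is only for illustration; a real job launches one process per GPU via `torchrun`, which sets the rendezvous environment variables (the address and port here are placeholder values).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(rank: int = 0, world_size: int = 1) -> float:
    # torchrun normally sets these; hard-coded here for a single-process demo.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(8, 2)      # each rank holds a full replica
    ddp_model = DDP(model)             # hooks gradient all-reduce into backward()
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(4, 8)              # each rank trains on its own data shard
    loss = ddp_model(x).pow(2).mean()
    loss.backward()                    # gradients averaged across all ranks here
    opt.step()

    dist.destroy_process_group()
    return loss.item()
```

With `world_size=1` the all-reduce is a no-op, but the code path is identical to a multi-GPU run, which is why DDP is the usual starting point of the learning path.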

### 02 - Model Parallel

| Technology | Description |
| --- | --- |
| Tensor Parallel | Intra-layer partitioning |
| Pipeline Parallel | Inter-layer partitioning |
| Sequence Parallel | Sequence dimension partitioning |
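Intra-layer (tensor) partitioning can be illustrated without any distributed setup: split a linear layer's weight into shards, compute each shard's output features independently, and concatenate. The concatenation stands in for the all-gather collective that a real tensor-parallel implementation (e.g. Megatron-style column parallelism) would perform across devices; the function name below is our own.

```python
import torch

def column_parallel_linear(x: torch.Tensor, weight: torch.Tensor,
                           shards: int = 2) -> torch.Tensor:
    """Simulate column-wise tensor parallelism: each shard owns a slice of the
    output features; torch.cat stands in for the cross-device all-gather."""
    outputs = [x @ w.t() for w in weight.chunk(shards, dim=0)]
    return torch.cat(outputs, dim=-1)

torch.manual_seed(0)
w = torch.randn(6, 8)   # full weight of a Linear(8 -> 6)
x = torch.randn(4, 8)

# sharded computation matches the unsharded layer
assert torch.allclose(x @ w.t(), column_parallel_linear(x, w), atol=1e-6)
```

The same decomposition is why tensor parallelism adds communication inside every layer, whereas pipeline parallelism only communicates activations at stage boundaries.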

### 03 - Mixed Precision

| Technology | Description |
| --- | --- |
| AMP | Automatic Mixed Precision |
| BF16 | Brain Float 16 |
| Gradient Scaling | Prevent underflow |
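A sketch of an AMP training step tying the three rows together: `autocast` runs matmuls in reduced precision (FP16 on GPU, BF16 on CPU, where BF16's wider exponent range makes scaling unnecessary), and `GradScaler` multiplies the loss up so FP16 gradients don't underflow, unscaling before the optimizer step. The CPU fallback here is for demonstration; in practice this pattern targets CUDA.

```python
import torch

def amp_step() -> float:
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    model = torch.nn.Linear(8, 2).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    # GradScaler guards FP16 against gradient underflow; no-op when disabled.
    scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

    x = torch.randn(4, 8, device=device)
    # autocast selects reduced-precision kernels per op
    with torch.autocast(device_type=device,
                        dtype=torch.float16 if use_cuda else torch.bfloat16):
        loss = model(x).pow(2).mean()

    scaler.scale(loss).backward()  # scaled loss keeps FP16 grads representable
    scaler.step(opt)               # unscales grads; skips step on inf/NaN
    scaler.update()                # adjusts the scale factor for the next step
    return loss.item()
```

Note that with BF16 many codebases drop the scaler entirely, which is one reason BF16 has become the default on hardware that supports it.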

### 04 - Large-Scale Training

| Framework | Features |
| --- | --- |
| DeepSpeed | Microsoft, ZeRO optimization |
| Megatron-LM | NVIDIA, 3D parallelism |
| FSDP | PyTorch native |
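DeepSpeed is driven by a JSON config rather than code changes; a minimal example enabling ZeRO stage 2 (optimizer state and gradient sharding) with FP16 might look like the sketch below. The specific values are illustrative assumptions, not recommendations; consult the DeepSpeed configuration reference for the full schema.

```json
{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 2 },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 3e-4 }
  }
}
```

The config is passed to `deepspeed.initialize(...)` at startup, which is how the same training script can move between ZeRO stages without code changes.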

## Learning Path

DDP → FSDP/ZeRO → Tensor Parallel → Pipeline Parallel → Mixed Precision → DeepSpeed

Released under the MIT License.