DDP using SageMaker and PyTorch Lightning
(Note: SageMaker’s own DDP library, aka SDDP, has been benchmarked as faster than PyTorch’s native DDP. However, at the time of this writing, it only supports three instance types: ml.p3.16xlarge, ml.p3dn.24xlarge, and ml.p4d.24xlarge.)
Notebook
We only need to pass in a distribution argument to let SageMaker know that we want to run distributed training.
distribution = {
    "pytorchddp": {
        "enabled": True,
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}
The custom_mpi_options setting tells MPI to run verbosely and exports NCCL_DEBUG so that NCCL (the backend communication library) prints its debug output.
We will run training on two instances (each ml.g4dn.12xlarge has 4 GPUs, so 8 workers in total).
instance_type = "ml.g4dn.12xlarge"
instance_count = 2
Also, for distributed training, we disable the SageMaker profiler and the debugger hook; the complete estimator call is sketched after these arguments.
disable_profiler=True,
debugger_hook_config=False,
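Putting these pieces together, the estimator call looks roughly like the sketch below. The entry point name, source directory, framework and Python versions, and the S3 input path are placeholders, not values from this post; use the ones from your own project.

import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes the notebook runs with a SageMaker execution role

# Native PyTorch DDP launched by SageMaker
distribution = {
    "pytorchddp": {
        "enabled": True,
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION",
    }
}

estimator = PyTorch(
    entry_point="train.py",            # hypothetical training script name
    source_dir="src",                  # hypothetical source directory
    role=role,
    framework_version="1.12",          # example version; use one that supports pytorchddp
    py_version="py38",
    instance_type="ml.g4dn.12xlarge",
    instance_count=2,
    distribution=distribution,
    disable_profiler=True,             # turn off the SageMaker profiler for the distributed run
    debugger_hook_config=False,        # turn off the SageMaker debugger hook
    sagemaker_session=session,
)

# estimator.fit({"training": "s3://my-bucket/my-dataset/"})  # hypothetical S3 input channel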
Code
Import DDPStrategy to enable DDP.
from pytorch_lightning.strategies import DDPStrategy
For distributed training, we need to disable validation and testing; they should run only on the master node. Make sure you pass only the train dataloader and not a complete datamodule, as in the sketch below.
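As a rough sketch of the Lightning side (not the repo's actual training script), the trainer can be set up as follows. The toy model, dataset, and hyperparameters are placeholders; the SM_HOSTS and SM_NUM_GPUS environment variables are standard SageMaker training variables, used here only to size the cluster.

import json
import os

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy


class ToyModel(pl.LightningModule):
    """Minimal placeholder model; replace with your real LightningModule."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Toy train dataloader; in practice this would load your real dataset.
dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
train_loader = DataLoader(dataset, batch_size=32)

# SageMaker exposes the cluster layout through environment variables.
num_nodes = len(json.loads(os.environ.get("SM_HOSTS", '["localhost"]')))
num_gpus = int(os.environ.get("SM_NUM_GPUS", "1"))

trainer = pl.Trainer(
    accelerator="gpu",
    devices=num_gpus,
    num_nodes=num_nodes,
    strategy=DDPStrategy(),
    limit_val_batches=0,     # disable validation for the distributed run
    num_sanity_val_steps=0,  # skip the sanity validation check as well
    max_epochs=10,           # example value
)

# Pass only the train dataloader, not a complete datamodule.
trainer.fit(ToyModel(), train_dataloaders=train_loader)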
For the complete code, visit the repo.