DDP using SageMaker and PyTorch Lightning

(Note: SageMaker’s own distributed data parallel library (SMDDP) has been benchmarked as faster than PyTorch’s native DDP. However, at the time of this writing, it supports only three instance types: ml.p3.16xlarge, ml.p3dn.24xlarge, and ml.p4d.24xlarge.)

Notebook

We only need to pass a distribution argument to let SageMaker know that we want to run distributed training.

distribution = { 
    "pytorchddp": {
        "enabled": True,
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}

The second option simply tells SageMaker to print detailed debug output for NCCL (the backend communication mechanism).

We will run training on two instances:

instance_type = "ml.g4dn.12xlarge"
instance_count = 2

Also, for distributed training, we disable the SageMaker profiler and the debugger hooks.

disable_profiler=True,
debugger_hook_config=False,
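
Putting the notebook pieces together, a minimal estimator sketch could look like the following. The entry_point script name, the role lookup, and the framework and Python versions are assumptions here; adapt them to your own project (the pytorchddp distribution option requires a reasonably recent PyTorch framework version).

import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()

distribution = {
    "pytorchddp": {
        "enabled": True,
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}

# Hypothetical entry point and framework versions; adjust to your project.
estimator = PyTorch(
    entry_point="train.py",
    role=role,
    framework_version="1.12",
    py_version="py38",
    instance_type="ml.g4dn.12xlarge",
    instance_count=2,
    distribution=distribution,
    disable_profiler=True,
    debugger_hook_config=False,
)

estimator.fit()  # pass your S3 input channels here if needed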

Code

Import DDPStrategy to enable DDP.

from pytorch_lightning.strategies import DDPStrategy

For distributed training, we need to disable validation and testing; they should run only on the master node. Make sure you pass only the train dataloader, not a complete datamodule.
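
As a minimal sketch of the training-script side (assuming a LightningModule named MyLightningModule and a build_train_loader helper, both hypothetical), the cluster layout can be read from the SM_HOSTS and SM_NUM_GPUS environment variables that SageMaker sets on each instance:

import json
import os

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# SM_HOSTS is a JSON list of instance names; SM_NUM_GPUS is the GPU count per instance.
num_nodes = len(json.loads(os.environ.get("SM_HOSTS", '["algo-1"]')))
num_gpus = int(os.environ.get("SM_NUM_GPUS", 1))

model = MyLightningModule()          # hypothetical LightningModule
train_loader = build_train_loader()  # hypothetical train dataloader

trainer = pl.Trainer(
    max_epochs=10,
    accelerator="gpu",
    devices=num_gpus,
    num_nodes=num_nodes,
    strategy=DDPStrategy(),
    limit_val_batches=0,       # validation disabled for the distributed run
    num_sanity_val_steps=0,
)

# Pass only the train dataloader, not a complete datamodule.
trainer.fit(model, train_dataloaders=train_loader)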

For the complete code, visit the repo.


Author | MMG
