An additional level of debug is to add NCCL_DEBUG=INFO environment variable as follows: | |
NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py | |
This will dump a lot of NCCL-related debug information, which you can then search online if you find that some problems are reported. |