For an additional level of debugging, set the NCCL_DEBUG=INFO environment variable as follows:
NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
This dumps a lot of NCCL-related debug information. If it reports any problems, you can search online for the corresponding messages.
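Because the output is verbose, it can help to capture it and pull out only the lines worth searching for. The following is a minimal sketch, not part of the original test script: the helper names (`nccl_debug_env`, `find_nccl_warnings`, `run_and_collect_warnings`) are hypothetical, and the filter assumes that problem lines contain the substring `NCCL WARN`, which you may need to adjust for your NCCL version.

```python
import os
import subprocess


def nccl_debug_env(level="INFO"):
    """Return a copy of the current environment with NCCL_DEBUG set."""
    env = dict(os.environ)
    env["NCCL_DEBUG"] = level
    return env


def find_nccl_warnings(log_text, marker="NCCL WARN"):
    """Return the log lines that contain `marker`.

    Assumption: problem lines include the substring "NCCL WARN".
    """
    return [line for line in log_text.splitlines() if marker in line]


def run_and_collect_warnings(cmd):
    """Run `cmd` with NCCL_DEBUG=INFO and return the matching log lines."""
    proc = subprocess.run(
        cmd,
        env=nccl_debug_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so no debug output is missed
        text=True,
    )
    return find_nccl_warnings(proc.stdout)


# Example call (not executed here), mirroring the command above:
# run_and_collect_warnings([
#     "python", "-m", "torch.distributed.run",
#     "--nproc_per_node", "2", "--nnodes", "1",
#     "torch-distributed-gpu-test.py",
# ])
```

The returned lines are a reasonable starting point for an online search. The full log is still worth keeping, since the context around a warning often matters.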