Here is the full benchmark code and outputs:
```bash
# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}
# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}
```
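To make the difference concrete, here is a quick back-of-the-envelope check on the two runtimes reported above (plain Python, no dependencies; the numbers are copied from the benchmark output):

```python
# Quick arithmetic on the reported runtimes: how much does NVLink buy here?
with_nvlink = 101.9003     # train_runtime (s), DDP w/ NVLink
without_nvlink = 131.4367  # train_runtime (s), DDP w/o NVLink

# Training completes in ~23% less time with NVLink enabled...
time_saved = 1 - with_nvlink / without_nvlink
# ...which is the same as ~1.29x throughput (matching 1.963 vs. 1.522 samples/s).
speedup = without_nvlink / with_nvlink

print(f"time saved with NVLink: {time_saved:.1%}")  # -> 22.5%
print(f"throughput speedup:     {speedup:.2f}x")    # -> 1.29x
```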
Hardware: 2x TITAN RTX, 24GB each, connected by 2 NVLinks (`NV2` in `nvidia-smi topo -m`)
Software: pytorch-1.8-to-be + cuda-11.0 / transformers==4.3.0.dev0
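As a side note, `NCCL_P2P_DISABLE=1` only turns off NCCL's direct peer-to-peer path; whether the hardware supports P2P at all can be queried from PyTorch. A minimal sketch, assuming a machine with at least two visible GPUs:

```python
# Check whether the two GPUs have a direct peer-to-peer path (e.g. NVLink or
# PCIe P2P) -- the path that NCCL_P2P_DISABLE=1 forces NCCL to avoid.
import torch

if torch.cuda.device_count() >= 2:
    # True when GPU 0 can access GPU 1's memory directly, and vice versa
    print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
    print("P2P 1 -> 0:", torch.cuda.can_device_access_peer(1, 0))
```

Note that this reports the hardware capability, not what NCCL is currently using; `nvidia-smi topo -m` shows the actual link type between the devices.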