|
In the second benchmark we set `NCCL_P2P_DISABLE=1`, which disables NCCL's peer-to-peer transport so the GPUs cannot communicate directly over NVLink and must fall back to PCIe.
|
Here is the full benchmark code and outputs: |
|
```bash
# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}

# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}
```

With NVLink enabled, training completes ~23% faster (`train_runtime` of 101.9s vs. 131.4s).
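
If you want to confirm which transport NCCL actually picks on your machine, you can rerun the benchmark with `NCCL_DEBUG=INFO`, a standard NCCL environment variable that logs transport selection at startup. A minimal sketch, reusing the command above; `--max_steps 10` is an assumption here, since a short run is enough to see the startup logs, and the exact log wording varies across NCCL versions:

```bash
# Sketch: same benchmark with NCCL transport logging enabled.
# In the startup logs, look for "via P2P" (NVLink / peer-to-peer)
# vs. "via SHM" (shared-host-memory fallback); exact strings depend
# on the NCCL version.
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_DEBUG=INFO torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 10
```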
|
|
|
Hardware: 2x TITAN RTX (24GB each), connected by 2 NVLinks (`NV2` in `nvidia-smi topo -m`)
|
Software: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0`.
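
To check what the interconnect looks like on your own machine, `nvidia-smi topo -m` prints the GPU-to-GPU connectivity matrix; this is where the `NV2` reading above comes from:

```bash
# Print the interconnect matrix for all visible GPUs.
# NV2 = connected by two NVLinks; PIX/PXB/PHB = various PCIe paths;
# SYS = traversing the inter-socket interconnect.
nvidia-smi topo -m
```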