However, the auto-detected architecture may not match the actual GPU on the target machine, which is why it's better to explicitly specify the correct architecture. For training on multiple machines with the same setup, you'll need to build a binary wheel:

```bash
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \
python setup.py build_ext -j8 bdist_wheel
```

This command generates a binary wheel that'll look something like `dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`.
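If you're not sure which value to pass to `TORCH_CUDA_ARCH_LIST`, one way to check the compute capability of the card on the target machine (assuming PyTorch is installed there) is:

```bash
# Print the compute capability of GPU 0, e.g. (8, 6) for an RTX 3090
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
```

So if the target GPU reports `(8, 6)`, you'd build with `TORCH_CUDA_ARCH_LIST="8.6"` as in the command above.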
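You can then install the generated wheel locally or copy it to each target machine and install it there; for example, using the filename above:

```bash
pip install dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl
```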