Spaces:
Running
Running benchmarks on multiple GPU nodes with Pegasus
Pegasus is an SSH-based multi-node command runner. Different models have different verbosity, and benchmarking takes vastly different amounts of time. Therefore, we want an automated piece of software that drains a queue of benchmarking jobs (one job per model) on a set of GPUs.
Setup
Install Pegasus
Pegasus needs to keep SSH connections with all the nodes in order to queue up and run jobs over SSH. So you should install and run Pegasus on a computer that you can keep awake.
If you already have Rust set up:
$ cargo install pegasus-ssh
Otherwise, you can set up Rust here, or just download Pegasus release binaries here.
Necessary setup for each node
Every node must have two things:
- This repository cloned under
~/workspace/leaderboard
.
- If you want a different path, search and replace in
spawn-containers.yaml
.
- Model weights under
/data/leaderboard/weights
.
- If you want a different path, search and replace in
setupspawn-containers.yaml
andbenchmark.yaml
.
Specify node names for Pegasus
Modify hosts.yaml
with nodes. See the file for an example.
hostname
: List the hostnames you would use in order tossh
into the node, e.g.jaywonchung@gpunode01
.gpu
: We want to create one Docker container for each GPU. List the indices of the GPUs you would like to use for the hosts.
Set up Docker containers on your nodes with Pegasus
This spawns one container per GPU (named leaderboard%d
), for every node.
$ cd pegasus
$ cp spawn-containers.yaml queue.yaml
$ pegasus b
b
stands for broadcast. Every command is run once on all (hostname
, gpu
) combinations.
Benchmark
Now use Pegasus to run benchmarks for all the models across all nodes.
$ cd pegasus
$ cp benchmark.yaml queue.yaml
$ pegasus q
q
stands for queue. Each command is run once on the next available (hostname
, gpu
) combination.