# LLM Text Generation (Chat)
This suite benchmarks vLLM and TGI on the chat completion task with various models.
## Setup
### Docker images
You can pull vLLM and TGI Docker images with:
```sh
docker pull mlenergy/vllm:v0.4.2-openai
docker pull mlenergy/tgi:v2.0.2
```
### Installing benchmark script dependencies
```sh
pip install -r requirements.txt
```
### Starting the NVML container
Changing the power limit requires the `SYS_ADMIN` Linux security capability, which we delegate to a daemon Docker container running a base CUDA image.
```sh
bash ../../common/start_nvml_container.sh
```
With the `nvml` container running, you can change power limit with something like `docker exec nvml nvidia-smi -i 0 -pl 200`.
### HuggingFace cache directory
The scripts assume the HuggingFace cache directory will be under `/data/leaderboard/hfcache` on the node that runs this benchmark.
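When launching a serving container, you can mount this cache directory so model weights are downloaded once and shared across runs. A minimal sketch, assuming the image stores its HuggingFace cache at `/root/.cache/huggingface` (the default in most images); all flags other than the volume mount are illustrative:

```sh
# Mount the shared HuggingFace cache into the serving container.
# Port and GPU flags are illustrative, not the suite's exact launch command.
docker run --gpus all \
  -v /data/leaderboard/hfcache:/root/.cache/huggingface \
  -p 8000:8000 \
  mlenergy/vllm:v0.4.2-openai
```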
## Benchmarking
### Obtaining one datapoint
Export your HuggingFace Hub token as the environment variable `HF_TOKEN`.
The script `scripts/benchmark_one_datapoint.py` assumes it is run from the directory that contains `scripts`, like this:
```sh
python scripts/benchmark_one_datapoint.py --help
```
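A full invocation might look like the sketch below; the `--model` flag is hypothetical, so check the `--help` output above for the script's actual arguments:

```sh
# Run from the directory that contains `scripts`.
export HF_TOKEN=<your HuggingFace Hub token>
# --model is a hypothetical flag name; consult --help for the real interface.
python scripts/benchmark_one_datapoint.py --model meta-llama/Llama-2-7b-chat-hf
```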
### Obtaining all datapoints for a single model
Run `scripts/benchmark_one_model.py`.
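Assuming the script exposes a standard argparse interface like its one-datapoint counterpart, you can list its accepted arguments first:

```sh
# Flag names vary between versions; inspect the help text before running.
python scripts/benchmark_one_model.py --help
```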
### Running the entire suite with Pegasus
You can use [`pegasus`](https://github.com/jaywonchung/pegasus) to run the entire benchmark suite.
Queue and host files are in [`./pegasus`](./pegasus).
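Pegasus reads the queue and host files and fans jobs out over SSH. A sketch assuming Pegasus is installed via Cargo under the crate name `pegasus-ssh` and that its queue subcommand picks up the files in `./pegasus`; verify both against the Pegasus README:

```sh
# Install the Pegasus CLI (assumed crate name), then run the queued jobs.
cargo install pegasus-ssh
cd pegasus
pegasus q
```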