The goal of the ML.ENERGY Leaderboard is to give people a sense of how much **energy** LLMs consume.
## How is energy different?
The energy consumption of running inference on a model depends on factors such as its architecture, size, and the GPU model.
However, even if we run models with the exact same architecture and size on the same GPU, the average energy consumption **per prompt** differs because models have **different verbosity**.
That is, when asked the same question, different models give answers of different lengths.
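As a toy illustration of why verbosity matters (all numbers below are made up, not leaderboard results):

```python
# Two hypothetical models with the same per-token energy cost but different
# verbosity end up with very different average energy per prompt.
ENERGY_PER_TOKEN_J = 0.5  # assumed average joules per generated token (made up)

avg_response_tokens = {"terse_model": 120, "verbose_model": 480}

for name, tokens in avg_response_tokens.items():
    print(f"{name}: ~{tokens * ENERGY_PER_TOKEN_J:.0f} J per prompt")
# terse_model: ~60 J per prompt
# verbose_model: ~240 J per prompt
```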
## Metrics
- `gpu`: NVIDIA GPU model name
- `task`: Name of the task. See *Tasks* below for details.
- `throughput` (token/s): The average number of tokens generated per second.
- `response_length` (token): The average number of tokens in the model's response.
- `latency` (s): The average time it took for the model to generate a response.
- `energy` (J): The average energy consumed by the model to generate a response.
- `parameters`: The number of parameters the model has, in billions.
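As a sketch of how these averages could be computed from per-request measurements (the record fields below are illustrative, not the leaderboard's actual schema):

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    response_tokens: int  # number of tokens generated for this response
    latency_s: float      # wall-clock generation time, in seconds
    energy_j: float       # GPU energy consumed during generation, in joules

def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate per-request records into leaderboard-style averages."""
    n = len(records)
    total_tokens = sum(r.response_tokens for r in records)
    total_time = sum(r.latency_s for r in records)
    return {
        "throughput": total_tokens / total_time,         # token/s
        "response_length": total_tokens / n,             # token
        "latency": total_time / n,                       # s
        "energy": sum(r.energy_j for r in records) / n,  # J
    }
```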
## Tasks
For each task, every model uses the same system prompt, while we still account for differences in role tokens, e.g. `USER`, `HUMAN`, `ASSISTANT`, `GPT`.
| Name | System prompt |
|--|--|
| chat | A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. |
| chat-concise | A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant's answers are very concise. |
| instruct | Below is an instruction that describes a task. Write a response that appropriately completes the request. |
| instruct-concise | Below is an instruction that describes a task. Write a response that appropriately completes the request. The response should be very concise. |
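For example, with FastChat (which the benchmark uses; see *Setup* below), applying a task's system prompt while keeping each model's own role tokens might look like the sketch below. The exact attribute and method names have shifted across FastChat versions, so treat this as an approximation:

```python
from fastchat.model import get_conversation_template

SYSTEM_PROMPT = (
    "A chat between a human user (prompter) and an artificial intelligence "
    "(AI) assistant. The assistant gives helpful, detailed, and polite "
    "answers to the user's questions."
)

# The conversation template supplies the model-specific role tokens
# (USER/ASSISTANT, HUMAN/GPT, ...); the system prompt stays the same.
conv = get_conversation_template("vicuna")  # model name is illustrative
conv.set_system_message(SYSTEM_PROMPT)      # method name varies by FastChat version
conv.append_message(conv.roles[0], "Explain what a leaderboard is.")
conv.append_message(conv.roles[1], None)    # leave the assistant's turn open
prompt = conv.get_prompt()
```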
## Setup
Find our benchmark script for one model [here](https://github.com/ml-energy/leaderboard/blob/master/benchmark.py).
### Software
- PyTorch 2.0.1
- [FastChat](https://github.com/lm-sys/fastchat) -- For various model support
- [Zeus](https://ml.energy/zeus) -- For GPU energy measurement
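A minimal sketch of how Zeus wraps a generation call to measure GPU energy (this follows the `ZeusMonitor` API in recent Zeus releases; check the Zeus docs for the version you install):

```python
from zeus.monitor import ZeusMonitor

# Measure time and energy of everything between begin_window and end_window
# on the listed GPUs.
monitor = ZeusMonitor(gpu_indices=[0])

monitor.begin_window("generate")
# output = model.generate(**inputs)  # the generation call being measured
measurement = monitor.end_window("generate")

print(f"{measurement.total_energy:.1f} J consumed in {measurement.time:.2f} s")
```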
### Hardware
- NVIDIA A40 GPU
### Parameters
- Model
  - Batch size 1
  - FP16
- Sampling (decoding)
  - Random sampling from the multinomial distribution
  - Temperature 0.7
  - Repetition penalty 1.0
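Put together, the generation setup above roughly corresponds to the following Hugging Face `transformers` call (the model name and `max_new_tokens` are illustrative, not leaderboard settings):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.3"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # FP16
).to("cuda:0")

prompt = "..."  # one prompt at a time, i.e., batch size 1
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    do_sample=True,          # sample from the multinomial distribution
    temperature=0.7,
    repetition_penalty=1.0,
    max_new_tokens=512,      # illustrative cap
)
```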
## Data
We randomly sampled around 3,000 prompts from the [cleaned ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered).
See [here](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) for more detail on how we created the benchmark dataset.
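A sketch of the sampling step, assuming a ShareGPT-format JSON dump (the field names `conversations`/`from`/`value` follow that format; the file name and seed are made up):

```python
import json
import random

with open("sharegpt_clean.json") as f:
    data = json.load(f)

# Keep conversations whose first turn comes from the human side.
first_turns = [
    conv["conversations"][0]["value"]
    for conv in data
    if conv.get("conversations") and conv["conversations"][0]["from"] == "human"
]

random.seed(42)  # made-up seed, just for reproducibility
prompts = random.sample(first_turns, k=3000)
```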
We used identical system prompts for all models (while respecting their own *role* tokens):
```
A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
```
## Upcoming
- Compare energy numbers against more optimized inference runtimes, like TensorRT.
- More GPU types
- More models
- Other model/sampling parameters
# License
This leaderboard is a research preview intended for non-commercial use only.
The use of LLaMA weights is subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
Please direct inquiries and reports of potential license/copyright violations to Jae-Won Chung.
# Acknowledgements
We thank [Chameleon Cloud](https://www.chameleoncloud.org/) for the A100 GPU nodes (`gpu_a100_pcie`) and [CloudLab](https://cloudlab.us/) for the V100 GPU nodes (`r7525`).