The goal of the ML.ENERGY Leaderboard is to give people a sense of how much **energy** LLMs would consume.
## How is energy different?
The energy consumption of running inference depends on factors such as model architecture, size, and GPU model.
However, even if we run models with the exact same architecture and size on the same GPU, the average energy consumption **per prompt** is different because different models have **different verbosity**.
That is, when asked the same thing, different models answer in different lengths.
## Columns
- `gpu`: NVIDIA GPU model name. Note that NLP evaluation was run only once, on our A40 GPUs, so this column only affects system-level measurements such as latency and energy.
- `task`: Name of the task. See *Tasks* below for details.
- `energy_eff`: Our definition of energy efficiency: the average NLP evaluation metric attained per Joule of energy (`nlp_average / energy`). See the worked example after this list.
- `energy` (J): The average energy consumed by the model to generate a response.
- `nlp_average`: The arithmetic average of the NLP evaluation metrics we obtained. See *NLP evaluation metrics* below for details.
- `throughput` (token/s): The average number of tokens generated per second.
- `latency` (s): The average time it took for the model to generate a response.
- `response_length` (token): The average number of tokens in the model's response.
- `parameters`: The number of parameters the model has, in units of billion.
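As a concrete (made-up) example of how `energy_eff` relates to the other columns, the snippet below computes it for one hypothetical leaderboard row; the numbers are illustrative, not real measurements.

```python
# Hypothetical values for one (model, task, gpu) row -- not real leaderboard data.
nlp_average = 0.58   # arithmetic mean of the arc, hellaswag, and truthfulqa metrics
energy = 950.0       # average Joules consumed per generated response

# Energy efficiency as defined above: NLP metric attained per Joule.
energy_eff = nlp_average / energy
print(f"energy_eff = {energy_eff:.6f}  (higher is better)")
```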
## Tasks
For each task, every model uses the same system prompt. We still account for differences in roles, e.g. `USER`, `HUMAN`, `ASSISTANT`, `GPT`.
| Name | System prompt |
|--|--|
| chat | A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. |
| chat-concise | A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant's answers are very concise. |
| instruct | Below is an instruction that describes a task. Write a response that appropriately completes the request. |
| instruct-concise | Below is an instruction that describes a task. Write a response that appropriately completes the request. The response should be very concise. |
You can see that response length is shorter on average for the `-concise` variants of the tasks.
This affects the number of decoding iterations the model has to run in order to finish responding, thus affecting latency and energy consumption per prompt.
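To make the shared-system-prompt-but-different-roles point concrete, here is a rough sketch of how a task's system prompt and a user request might be stitched into the final prompt string. The role names and separators below are illustrative assumptions; in the actual benchmark, FastChat's per-model conversation templates handle this.

```python
# Illustrative only: role names and separators differ per model family.
SYSTEM_PROMPTS = {
    "chat": (
        "A chat between a human user (prompter) and an artificial intelligence (AI) "
        "assistant. The assistant gives helpful, detailed, and polite answers to the "
        "user's questions."
    ),
}

def build_prompt(task: str, user_message: str, roles=("USER", "ASSISTANT")) -> str:
    """Prepend the task's system prompt and open the assistant's turn."""
    system = SYSTEM_PROMPTS[task]
    return f"{system}\n\n{roles[0]}: {user_message}\n{roles[1]}:"

print(build_prompt("chat", "Explain what a Joule is."))
```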
## Setup
Find our benchmark script for one model [here](https://github.com/ml-energy/leaderboard/blob/master/benchmark.py).
### Software
- PyTorch 2.0.1
- [Zeus](https://ml.energy/zeus) -- For GPU time and energy measurement (a minimal measurement sketch follows this list)
- [FastChat](https://github.com/lm-sys/fastchat) -- For running inference on various models
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/commit/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) -- For NLP evaluation metrics
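For GPU time and energy, a minimal sketch of wrapping a generation call with Zeus's measurement window is shown below. The class and attribute names follow the Zeus documentation as we understand it and are assumptions here, not an excerpt from our benchmark script.

```python
from zeus.monitor import ZeusMonitor  # import path assumed from the Zeus docs

# Measure GPU 0, which runs inference in this sketch.
monitor = ZeusMonitor(gpu_indices=[0])

monitor.begin_window("generate")
# ... model.generate(...) would run here ...
measurement = monitor.end_window("generate")

print(f"Time:   {measurement.time:.2f} s")
print(f"Energy: {measurement.total_energy:.2f} J")
```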
### Hardware
- NVIDIA A40 GPU
- NVIDIA A100 GPU
### Parameters
- Model
  - Batch size 1
  - FP16
- Sampling (decoding) -- see the sketch after this list
  - Multinomial sampling from the model's output distribution
  - Temperature 0.7
  - Repetition penalty 1.0
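As a rough illustration (not our benchmark script), the parameters above map onto a Hugging Face Transformers generation call like the following; the model id is a placeholder and `max_new_tokens` is an assumed cap.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-llm"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # FP16
)

# Batch size 1: a single prompt per generate() call.
inputs = tokenizer("Explain what a Joule is.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,          # sample from the token distribution
    temperature=0.7,
    repetition_penalty=1.0,
    max_new_tokens=512,      # assumed cap, not part of the list above
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```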
## Data
We randomly sampled around 3000 prompts from the [cleaned ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered).
See [here](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) for more detail on how we created the benchmark dataset.
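For a rough idea of what the sampling step looks like, the sketch below draws 3,000 prompts from a local JSON export of the cleaned conversations; the file name and record schema are assumptions, and the linked directory documents the actual preprocessing.

```python
import json
import random

random.seed(0)

# Hypothetical local copy of the cleaned ShareGPT conversations (file name assumed).
with open("sharegpt_cleaned.json") as f:
    conversations = json.load(f)

# Use the first turn of each conversation as the prompt (schema assumed).
prompts = [c["conversations"][0]["value"] for c in conversations if c.get("conversations")]
sampled = random.sample(prompts, k=3000)
print(f"Sampled {len(sampled)} prompts")
```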
## NLP evaluation metrics
- `arc`: [AI2 Reasoning Challenge](https://allenai.org/data/arc)'s `challenge` dataset, measures capability to do grade-school level question answering, 25 shot
- `hellaswag`: [HellaSwag dataset](https://allenai.org/data/hellaswag), measuring grounded commonsense, 10 shot
- `truthfulqa`: [TruthfulQA dataset](https://arxiv.org/abs/2109.07958), measuring truthfulness against questions that elicit common falsehoods, 0 shot
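A minimal sketch of running these three benchmarks with lm-evaluation-harness's Python API is shown below; the backend name, task names, and argument names are our assumptions about the pinned version's API, and the model id is a placeholder.

```python
from lm_eval import evaluator  # lm-evaluation-harness

# Each benchmark uses its own few-shot count, so they are evaluated separately.
for task, shots in [("arc_challenge", 25), ("hellaswag", 10), ("truthfulqa_mc", 0)]:
    results = evaluator.simple_evaluate(
        model="hf-causal",                          # Hugging Face causal LM backend
        model_args="pretrained=some-org/some-llm",  # placeholder model id
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, results["results"][task])
```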
## Upcoming
- More optimized inference runtimes, like TensorRT.
- More GPU models, like V100.
- More models, like RWKV.
# License
This leaderboard is a research preview intended for non-commercial use only.
Model weights were taken as is from the Hugging Face Hub if available and are subject to their licenses.
The use of LLaMA weights is subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
Please direct inquiries/reports of potential violation to Jae-Won Chung.
# Acknowledgements
We thank [Chameleon Cloud](https://www.chameleoncloud.org/) for the A100 80GB GPU nodes (`gpu_a100_pcie`) and [CloudLab](https://cloudlab.us/) for the V100 GPU nodes (`r7525`).