The goal of the ML.ENERGY Leaderboard is to give people a sense of how much **energy** LLMs consume.

## How is energy different?

Even between models with the exact same architecture and size, the average energy consumption per prompt differs because models have **different verbosity**. That is, when asked the same thing, they answer in different lengths.

## Metrics

- `gpu`: NVIDIA GPU model name
- `task`: Name of the task. See *Tasks* below for details.
- `throughput` (token/s): The average number of tokens generated per second.
- `response_length` (token): The average number of tokens in the model's response.
- `latency` (s): The average time it took for the model to generate a response.
- `energy` (J): The average energy consumed by the model to generate a response.
- `parameters`: The number of parameters the model has, in billions.

See the appendix at the end of this page for a sketch of how these per-response numbers can be measured.

## Tasks

For each task, every model uses the same system prompt. We still respect each model's own role tokens, e.g., `USER`, `HUMAN`, `ASSISTANT`, `GPT`.

| Name | System prompt |
|--|--|
| chat | A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. |
| chat-concise | A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant's answers are very concise. |
| instruct | Below is an instruction that describes a task. Write a response that appropriately completes the request. |
| instruct-concise | Below is an instruction that describes a task. Write a response that appropriately completes the request. The response should be very concise. |

## Setup

Find our benchmark script for one model [here](https://github.com/ml-energy/leaderboard/blob/master/benchmark.py).

### Software

- PyTorch 2.0.1
- [FastChat](https://github.com/lm-sys/fastchat) -- for various model support
- [Zeus](https://ml.energy/zeus) -- for GPU energy measurement

### Hardware

- NVIDIA A40 GPU

### Parameters

- Model
  - Batch size 1
  - FP16
- Sampling (decoding)
  - Random sampling from the multinomial distribution
  - Temperature 0.7
  - Repetition penalty 1.0

## Data

We randomly sampled around 3,000 prompts from the [cleaned ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered). See [here](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) for more detail on how we created the benchmark dataset; a toy sketch of the sampling step is in the appendix.

We used identical system prompts for all models (while respecting their own *role* tokens):

```
A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
```

## Upcoming

- Comparisons against more optimized inference runtimes, like TensorRT
- Other GPUs
- Other model/sampling parameters
- More models
- Model quality evaluation numbers (e.g., AI2 Reasoning Challenge, HellaSwag)

# License

This leaderboard is a research preview intended for non-commercial use only. The use of LLaMA weights is subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md). Please direct inquiries and reports of potential license/copyright violations to Jae-Won Chung.
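
# Appendix: Illustrative snippets

The actual benchmark script is [benchmark.py](https://github.com/ml-energy/leaderboard/blob/master/benchmark.py); the snippets below are minimal sketches, not the leaderboard's implementation. This first sketch shows how the per-response metrics can be measured with Zeus's `ZeusMonitor`. It uses plain Hugging Face `transformers` rather than FastChat, and the model name, prompt, and `max_new_tokens` cap are illustrative stand-ins; the FP16 weights, batch size 1, temperature 0.7, and repetition penalty 1.0 mirror the *Parameters* section.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from zeus.monitor import ZeusMonitor

MODEL = "lmsys/vicuna-7b-v1.3"  # illustrative model choice

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16  # FP16, as in the Parameters section
).to("cuda:0")

# Zeus accumulates the energy of the listed GPUs between window begin/end calls.
monitor = ZeusMonitor(gpu_indices=[0])

prompt = (
    "A chat between a human user (prompter) and an artificial intelligence (AI) "
    "assistant. The assistant gives helpful, detailed, and polite answers to the "
    "user's questions. USER: What is a watt-hour? ASSISTANT:"
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")  # batch size 1

monitor.begin_window("generate")
outputs = model.generate(
    **inputs,
    do_sample=True,       # multinomial sampling
    temperature=0.7,
    repetition_penalty=1.0,
    max_new_tokens=512,   # illustrative cap, not from the benchmark
)
measurement = monitor.end_window("generate")

response_length = outputs.shape[-1] - inputs.input_ids.shape[-1]  # new tokens only
print(f"response_length (token): {response_length}")
print(f"latency (s):             {measurement.time:.2f}")
print(f"energy (J):              {measurement.total_energy:.1f}")
print(f"throughput (token/s):    {response_length / measurement.time:.1f}")
```

Averaging these four numbers over the benchmark prompts yields the `response_length`, `latency`, `energy`, and `throughput` columns in the *Metrics* section.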
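
For the *Data* section, a toy version of the prompt sampling step might look like the following. The file name and JSON layout follow the linked cleaned ShareGPT dataset, but the exact filtering and seed used for the leaderboard live in the repository's [sharegpt](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) directory, so treat this as an approximation.

```python
import json
import random

# File from the cleaned ShareGPT dataset linked in the Data section.
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    conversations = json.load(f)

# Take the first human turn of each conversation as the prompt.
prompts = [
    conv["conversations"][0]["value"]
    for conv in conversations
    if conv.get("conversations") and conv["conversations"][0]["from"] == "human"
]

random.seed(0)  # illustrative seed, not the leaderboard's
sample = random.sample(prompts, 3000)
print(f"Sampled {len(sample)} prompts from {len(prompts)} candidates.")
```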