The goal of the ML.ENERGY Leaderboard is to give people a sense of how much **energy** LLMs consume.
## How is energy different?
The energy consumption of running inference on a model depends on factors such as its architecture, size, and the GPU model.
However, even if we run models with the exact same architecture and size on the same GPU, the average energy consumption **per prompt** differs because models have **different verbosity**.
That is, when asked the same question, different models give answers of different lengths.
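As a toy illustration of why verbosity matters (all numbers below are made up, not leaderboard results):

```python
# Two hypothetical models with the same per-token energy cost but different
# verbosity end up with very different average energy per prompt.
ENERGY_PER_TOKEN_J = 0.5  # assumed average joules per generated token (made up)

avg_response_tokens = {"terse_model": 120, "verbose_model": 480}

for name, tokens in avg_response_tokens.items():
    print(f"{name}: ~{tokens * ENERGY_PER_TOKEN_J:.0f} J per prompt")
# terse_model: ~60 J per prompt
# verbose_model: ~240 J per prompt
```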
## Metrics
- `gpu`: NVIDIA GPU model name
- `task`: Name of the task. See *Tasks* below for details.
- `throughput` (token/s): The average number of tokens generated per second.
- `response_length` (token): The average number of tokens in the model's response.
- `latency` (s): The average time it took for the model to generate a response.
- `energy` (J): The average energy consumed by the model to generate a response.
- `parameters`: The number of parameters the model has, in billions.
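As a sketch of how these averages could be computed from per-request measurements (the record fields below are illustrative, not the leaderboard's actual schema):

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    response_tokens: int  # number of tokens generated for this response
    latency_s: float      # wall-clock generation time, in seconds
    energy_j: float       # GPU energy consumed during generation, in joules

def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate per-request records into leaderboard-style averages."""
    n = len(records)
    total_tokens = sum(r.response_tokens for r in records)
    total_time = sum(r.latency_s for r in records)
    return {
        "throughput": total_tokens / total_time,         # token/s
        "response_length": total_tokens / n,             # token
        "latency": total_time / n,                       # s
        "energy": sum(r.energy_j for r in records) / n,  # J
    }
```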
## Tasks
For each task, every model uses the same system prompt, while we still account for differences in role tokens, e.g. `USER`, `HUMAN`, `ASSISTANT`, `GPT`.
| Name | System prompt |
|--|--|
| chat | A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. |
| chat-concise | A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant's answers are very concise. |
| instruct | Below is an instruction that describes a task. Write a response that appropriately completes the request. |
| instruct-concise | Below is an instruction that describes a task. Write a response that appropriately completes the request. The response should be very concise. |
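For example, with FastChat (which the benchmark uses; see *Setup* below), applying a task's system prompt while keeping each model's own role tokens might look like the sketch below. The exact attribute and method names have shifted across FastChat versions, so treat this as an approximation:

```python
from fastchat.model import get_conversation_template

SYSTEM_PROMPT = (
    "A chat between a human user (prompter) and an artificial intelligence "
    "(AI) assistant. The assistant gives helpful, detailed, and polite "
    "answers to the user's questions."
)

# The conversation template supplies the model-specific role tokens
# (USER/ASSISTANT, HUMAN/GPT, ...); the system prompt stays the same.
conv = get_conversation_template("vicuna")  # model name is illustrative
conv.set_system_message(SYSTEM_PROMPT)      # method name varies by FastChat version
conv.append_message(conv.roles[0], "Explain what a leaderboard is.")
conv.append_message(conv.roles[1], None)    # leave the assistant's turn open
prompt = conv.get_prompt()
```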
## Setup
Find our benchmark script for one model [here](https://github.com/ml-energy/leaderboard/blob/master/benchmark.py).
### Software
- PyTorch 2.0.1
- [FastChat](https://github.com/lm-sys/fastchat) -- For various model support
- [Zeus](https://ml.energy/zeus) -- For GPU energy measurement
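A minimal sketch of how Zeus wraps a generation call to measure GPU energy (this follows the `ZeusMonitor` API in recent Zeus releases; check the Zeus docs for the version you install):

```python
from zeus.monitor import ZeusMonitor

# Measure time and energy of everything between begin_window and end_window
# on the listed GPUs.
monitor = ZeusMonitor(gpu_indices=[0])

monitor.begin_window("generate")
# output = model.generate(**inputs)  # the generation call being measured
measurement = monitor.end_window("generate")

print(f"{measurement.total_energy:.1f} J consumed in {measurement.time:.2f} s")
```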
### Hardware
- NVIDIA A40 GPU
### Parameters
- Model
  - Batch size 1
  - FP16
- Sampling (decoding)
  - Random sampling from the multinomial distribution
  - Temperature 0.7
  - Repetition penalty 1.0
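Put together, the generation setup above roughly corresponds to the following Hugging Face `transformers` call (the model name and `max_new_tokens` are illustrative, not leaderboard settings):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.3"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # FP16
).to("cuda:0")

prompt = "..."  # one prompt at a time, i.e., batch size 1
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    do_sample=True,          # sample from the multinomial distribution
    temperature=0.7,
    repetition_penalty=1.0,
    max_new_tokens=512,      # illustrative cap
)
```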
## Data
We randomly sampled around 3,000 prompts from the [cleaned ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered).
See [here](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) for more detail on how we created the benchmark dataset.
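A sketch of the sampling step, assuming a ShareGPT-format JSON dump (the field names `conversations`/`from`/`value` follow that format; the file name and seed are made up):

```python
import json
import random

with open("sharegpt_clean.json") as f:
    data = json.load(f)

# Keep conversations whose first turn comes from the human side.
first_turns = [
    conv["conversations"][0]["value"]
    for conv in data
    if conv.get("conversations") and conv["conversations"][0]["from"] == "human"
]

random.seed(42)  # made-up seed, just for reproducibility
prompts = random.sample(first_turns, k=3000)
```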
We used identical system prompts for all models (while respecting their own *role* tokens):
```
A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
```
## Upcoming
- Compare energy numbers against more optimized inference runtimes, like TensorRT.
- More GPU types
- More models
- Other model/sampling parameters
# License
This leaderboard is a research preview intended for non-commercial use only.
The use of LLaMA weights is subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
Please direct inquiries and reports of potential license/copyright violations to Jae-Won Chung.
# Acknowledgements
We thank [Chameleon Cloud](https://www.chameleoncloud.org/) for the A100 GPU nodes (`gpu_a100_pcie`) and [CloudLab](https://cloudlab.us/) for the V100 GPU nodes (`r7525`).