The goal of the ML.ENERGY Leaderboard is to give people a sense of how much energy LLMs consume.

How is energy different?

Even models with the exact same architecture and size consume different amounts of energy per prompt on average, because they differ in verbosity. That is, when asked the same question, they produce responses of different lengths.
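To make the verbosity effect concrete, here is a minimal sketch. It assumes, as a simplification, that energy scales linearly with the number of generated tokens; the per-token energy and the response lengths are made-up numbers for illustration, not leaderboard measurements.

```python
# Hypothetical numbers for illustration only; not leaderboard measurements.
# Assumption: two models with the same architecture and size spend roughly
# the same energy per generated token on the same GPU.
ENERGY_PER_TOKEN_J = 2.0


def energy_per_response(avg_response_tokens: float,
                        energy_per_token_j: float = ENERGY_PER_TOKEN_J) -> float:
    """Estimate average energy per response under a linear-in-tokens model."""
    return avg_response_tokens * energy_per_token_j


terse = energy_per_response(150)    # model that answers briefly
verbose = energy_per_response(400)  # model that answers at length
print(f"terse: {terse} J, verbose: {verbose} J")
```

Under these assumptions the more verbose model uses more energy per prompt purely because it generates more tokens.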

Metrics

gpu: NVIDIA GPU model name
task: Name of the task. See Tasks below for details.
throughput (token/s): The average number of tokens generated per second.
response_length (token): The average number of tokens in the model's response.
latency (s): The average time it took for the model to generate a response.
energy (J): The average energy consumed by the model to generate a response.
parameters: The number of parameters in the model, in billions.
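As a rough sketch of how the averaged metrics above could be derived from per-prompt measurements: the `Record` type and `summarize` function below are hypothetical names, and averaging per-response throughput (rather than dividing total tokens by total time) is an assumption, since the text does not say which definition is used.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Record:
    """Hypothetical measurement for one prompt/response pair."""
    response_tokens: int
    latency_s: float
    energy_j: float


def summarize(records):
    """Average per-response measurements into leaderboard-style columns."""
    return {
        # Assumption: throughput is averaged per response.
        "throughput (token/s)": mean(r.response_tokens / r.latency_s for r in records),
        "response_length (token)": mean(r.response_tokens for r in records),
        "latency (s)": mean(r.latency_s for r in records),
        "energy (J)": mean(r.energy_j for r in records),
    }


example = summarize([Record(100, 2.0, 400.0), Record(300, 4.0, 1000.0)])
print(example)
```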

Tasks

For each task, every model uses the same system prompt. Role names still follow each model's own convention, e.g. USER, HUMAN, ASSISTANT, GPT.

Name	System prompt
chat	A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
chat-concise	A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant's answers are very concise.
instruct	Below is an instruction that describes a task. Write a response that appropriately completes the request.
instruct-concise	Below is an instruction that describes a task. Write a response that appropriately completes the request. The response should be very concise.
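The sketch below shows how a shared system prompt might be combined with model-specific role names. The `build_prompt` helper and the `ROLE: message` layout are assumptions made for illustration; in practice each model's conversation template defines the exact separators and role tokens.

```python
# System prompts copied from the table above.
_CHAT = ("A chat between a human user (prompter) and an artificial intelligence (AI) "
         "assistant. The assistant gives helpful, detailed, and polite answers to the "
         "user's questions.")
_INSTRUCT = ("Below is an instruction that describes a task. Write a response that "
             "appropriately completes the request.")

SYSTEM_PROMPTS = {
    "chat": _CHAT,
    "chat-concise": _CHAT + " The assistant's answers are very concise.",
    "instruct": _INSTRUCT,
    "instruct-concise": _INSTRUCT + " The response should be very concise.",
}


def build_prompt(task, user_message, user_role="USER", assistant_role="ASSISTANT"):
    """Hypothetical helper: same system prompt per task, per-model role names."""
    return f"{SYSTEM_PROMPTS[task]}\n{user_role}: {user_message}\n{assistant_role}:"


# Two models with different role conventions share the same system prompt.
print(build_prompt("chat-concise", "What is a joule?"))
print(build_prompt("chat-concise", "What is a joule?", user_role="HUMAN", assistant_role="GPT"))
```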

Setup

Find our benchmark script for one model here.

Software

PyTorch 2.0.1
FastChat -- For supporting a variety of models
Zeus -- For GPU energy measurement
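A minimal sketch of measuring the time and energy of one generation call with a Zeus-style measurement window follows. The `measure` helper is a hypothetical name, not part of the benchmark script. It accepts any monitor object exposing `begin_window` and `end_window`, which, to our understanding, matches the interface of `zeus.monitor.ZeusMonitor`; treat the exact attribute names as an assumption to check against the Zeus version you install.

```python
def measure(generate, monitor, window="generate"):
    """Run `generate()` inside a measurement window.

    `monitor` is assumed to behave like zeus.monitor.ZeusMonitor:
    end_window() returns an object with `.time` (seconds) and
    `.total_energy` (joules).
    """
    monitor.begin_window(window)
    output = generate()
    result = monitor.end_window(window)
    return output, result.time, result.total_energy


# Usage on a machine with an NVIDIA GPU and Zeus installed (not run here):
#   from zeus.monitor import ZeusMonitor
#   monitor = ZeusMonitor(gpu_indices=[0])
#   text, latency_s, energy_j = measure(lambda: model_generate(prompt), monitor)
# `model_generate` and `prompt` are placeholders for your own inference call.
```

Passing the monitor in as an argument keeps the helper usable with a stand-in object when no GPU is available.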

Hardware

NVIDIA A40 GPU

Parameters

Model
- Batch size 1
- FP16
Sampling (decoding)
- Greedy sampling from multinomial distribution
- Temperature 0.7
- Repetition penalty 1.0
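To illustrate what these decoding parameters do, here is a standalone sketch that reads the settings as temperature-scaled sampling from a multinomial distribution over next-token logits. The `sample_token` function is hypothetical and is not the benchmark's decoding code; the repetition penalty follows the commonly used divide-or-multiply rule, which is a no-op at 1.0.

```python
import math
import random


def sample_token(logits, generated, temperature=0.7, repetition_penalty=1.0, rng=random):
    """Sample one token id from `logits` (a list of floats)."""
    adjusted = list(logits)
    # Penalize tokens that were already generated; 1.0 leaves logits unchanged.
    for tok in set(generated):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty
    # Temperature below 1 sharpens the distribution toward high-logit tokens.
    scaled = [x / temperature for x in adjusted]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalized softmax
    # Draw from the multinomial distribution defined by the weights.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_token([1.0, 2.0, 0.5], generated=[], rng=rng))
```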

Data

We randomly sampled around 3000 prompts from the cleaned ShareGPT dataset. See here for more detail on how we created the benchmark dataset.
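Reproducible random sampling of prompts can be sketched as follows. The `sample_prompts` function, the fixed seed, and the exact count of 3000 are assumptions for illustration; the text only says "around 3000", and the actual dataset construction is described in the linked write-up.

```python
import random


def sample_prompts(prompts, k=3000, seed=0):
    """Draw up to `k` distinct prompts; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(prompts, min(k, len(prompts)))


# Stand-in pool; in practice this would be prompts from the cleaned ShareGPT dataset.
pool = [f"prompt {i}" for i in range(10000)]
subset = sample_prompts(pool)
print(len(subset))
```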

We used identical system prompts for all models (while respecting their own role tokens):

A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

Upcoming

Comparison against more optimized inference runtimes, such as TensorRT
Other GPUs
Other model/sampling parameters
More models
Model quality evaluation numbers (e.g., AI2 Reasoning Challenge, HellaSwag)

License

This leaderboard is a research preview intended for non-commercial use only. The use of LLaMA weights is subject to their license. Please direct inquiries and reports of potential license/copyright violations to Jae-Won Chung.