Fix bullet point about evaluation
content.py CHANGED  +1 -3
@@ -43,9 +43,7 @@ INTRODUCTION_TEXT = f"""
 
 🤗 A key advantage of this leaderboard is that anyone from the community can submit a model for automated evaluation on the 🤗 GPU cluster, as long as it is a 🤗 Transformers model with weights on the Hub. We also support evaluation of models with delta-weights for non-commercial licensed models, such as LLaMa.
 
-📈
-
-Evaluation is performed against 4 popular benchmarks:
+📈 We evaluate models on 4 key benchmarks from the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks:
 - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
 - <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
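For readers unfamiliar with the harness referenced in the added line, below is a minimal, hypothetical sketch of what running these benchmarks through its simple_evaluate entry point could look like. The task identifiers, the "hf-causal" model type, and the placeholder repo id are assumptions (they vary across harness versions), not the leaderboard's actual evaluation code.

# Hypothetical sketch: evaluating a Hub model on the benchmarks listed above
# with the EleutherAI lm-evaluation-harness. Task names follow older harness
# releases and may differ in newer ones; MMLU is shown via a single one of its
# 57 subtasks for brevity.
from lm_eval import evaluator

BENCHMARKS = [
    ("arc_challenge", 25),                  # AI2 Reasoning Challenge, 25-shot
    ("hellaswag", 10),                      # HellaSwag, 10-shot
    ("hendrycksTest-abstract_algebra", 5),  # one MMLU subtask, 5-shot
]

results = {}
for task, n_shots in BENCHMARKS:
    out = evaluator.simple_evaluate(
        model="hf-causal",                            # Transformers causal-LM backend
        model_args="pretrained=your-org/your-model",  # placeholder Hub repo id
        tasks=[task],
        num_fewshot=n_shots,
    )
    results[task] = out["results"]

print(results)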
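The hunk's context line also mentions delta-weights for non-commercially licensed models such as LLaMa: publishing only the difference from a base checkpoint so the derivative can be rebuilt locally. A rough sketch of that reconstruction, with purely hypothetical repo ids and again not the leaderboard's actual code, might look like this:

# Hypothetical sketch: rebuilding full weights from a base checkpoint plus a
# published "delta" checkpoint that stores weight differences.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-org/base-llama-7b")      # placeholder
delta = AutoModelForCausalLM.from_pretrained("your-org/your-model-delta")  # placeholder

# Full weight = base weight + delta, parameter by parameter.
base_state = base.state_dict()
with torch.no_grad():
    for name, param in delta.state_dict().items():
        param += base_state[name]

delta.save_pretrained("./reconstructed-model")  # evaluate this rebuilt model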