Update src/about.py
src/about.py  (+7 −4)
@@ -47,11 +47,14 @@ Note: **We plan to release an evaluation framework soon in which the details and
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
 ## ABOUT
-
-
-
+For now, the only competitive open language models capable of properly speaking Persian are the multilingual ones, Meta's Llama 3.1 being the prime example.
+There are only a few multilingual LLMs capable in Persian, and even those derive their main knowledge from English. A genuinely Persian LLM is still largely hypothetical,
+since so few models are expert in Persian in the first place.
+
+Our goal is to provide a benchmark across diverse domains and tasks that shows how large the gap between today's SOTA models is in different settings.
+
+We use our own framework to evaluate the models on the following benchmarks (TO BE RELEASED SOON).
 ### Tasks
-📈 We evaluate models on 6 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
 - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
 - <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
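Until the leaderboard's own framework ships, one rough way to approximate the three tasks listed above is the EleutherAI lm-evaluation-harness that the previous revision of this text referenced. Below is a minimal sketch, assuming a recent `lm-eval` release that exposes `lm_eval.simple_evaluate` and the standard `arc_challenge`/`hellaswag`/`mmlu` task names; the checkpoint id is only a placeholder, and this is not the leaderboard's own evaluation code.

```python
# Rough stand-in for reproducing the listed benchmarks with the
# EleutherAI lm-evaluation-harness (pip install lm-eval). This is NOT
# the leaderboard's own (unreleased) framework; the task names and the
# simple_evaluate API are assumptions based on recent lm-eval releases.
import lm_eval

# Few-shot settings mirror the task list above.
TASKS = [
    ("arc_challenge", 25),  # AI2 Reasoning Challenge, 25-shot
    ("hellaswag", 10),      # HellaSwag, 10-shot
    ("mmlu", 5),            # MMLU, 5-shot
]

def evaluate(pretrained: str) -> dict:
    """Run each task at its few-shot setting and collect the results."""
    all_results = {}
    for task, num_fewshot in TASKS:
        out = lm_eval.simple_evaluate(
            model="hf",                             # Hugging Face backend
            model_args=f"pretrained={pretrained}",  # HF repo id (placeholder)
            tasks=[task],
            num_fewshot=num_fewshot,
        )
        all_results[task] = out["results"][task]
    return all_results

if __name__ == "__main__":
    # "meta-llama/Llama-3.1-8B" is just an example checkpoint.
    print(evaluate("meta-llama/Llama-3.1-8B"))
```

Scores from this stand-in will not necessarily match the numbers the leaderboard's own framework produces once it is released; it is only meant to give readers a reproducible baseline in the meantime.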