Update src/about.py
src/about.py  (+7 −4)
@@ -47,11 +47,14 @@ Note: **We plan to release an evaluation framework soon in which the details and
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
 ## ABOUT
-
-
-
+For now, the only competitive open language models capable of properly speaking Persian are the multilingual ones, Meta's Llama 3.1 being the prime example.
+There are only a few multilingual LLMs capable in Persian, and even those derive their main knowledge from English. A genuinely Persian LLM is still largely hypothetical,
+since so few models are expert in Persian in the first place.
+
+Our goal is to provide a benchmark across diverse domains and tasks that shows how large the gap between today's SOTA models is in different settings.
+
+We use our own framework to evaluate the models on the following benchmarks (TO BE RELEASED SOON).
 ### Tasks
-📈 We evaluate models on 6 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
 - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
 - <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
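Until the leaderboard's own framework ships, one rough way to approximate the three tasks listed above is the EleutherAI lm-evaluation-harness that the previous revision of this text referenced. Below is a minimal sketch, assuming a recent `lm-eval` release that exposes `lm_eval.simple_evaluate` and the standard `arc_challenge`/`hellaswag`/`mmlu` task names; the checkpoint id is only a placeholder, and this is not the leaderboard's own evaluation code.

```python
# Rough stand-in for reproducing the listed benchmarks with the
# EleutherAI lm-evaluation-harness (pip install lm-eval). This is NOT
# the leaderboard's own (unreleased) framework; the task names and the
# simple_evaluate API are assumptions based on recent lm-eval releases.
import lm_eval

# Few-shot settings mirror the task list above.
TASKS = [
    ("arc_challenge", 25),  # AI2 Reasoning Challenge, 25-shot
    ("hellaswag", 10),      # HellaSwag, 10-shot
    ("mmlu", 5),            # MMLU, 5-shot
]

def evaluate(pretrained: str) -> dict:
    """Run each task at its few-shot setting and collect the results."""
    all_results = {}
    for task, num_fewshot in TASKS:
        out = lm_eval.simple_evaluate(
            model="hf",                             # Hugging Face backend
            model_args=f"pretrained={pretrained}",  # HF repo id (placeholder)
            tasks=[task],
            num_fewshot=num_fewshot,
        )
        all_results[task] = out["results"][task]
    return all_results

if __name__ == "__main__":
    # "meta-llama/Llama-3.1-8B" is just an example checkpoint.
    print(evaluate("meta-llama/Llama-3.1-8B"))
```

Scores from this stand-in will not necessarily match the numbers the leaderboard's own framework produces once it is released; it is only meant to give readers a reproducible baseline in the meantime.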