Behnamm committed on
Commit
b821ed7
1 Parent(s): 34b4f04

Update src/about.py

Files changed (1)
  1. src/about.py +7 -4
src/about.py CHANGED
@@ -47,11 +47,14 @@ Note: **We plan to release an evaluation framework soon in which the details and
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
 ## ABOUT
-With the plethora of large language models (LLMs) and chatbots being released week upon week, often with grandiose claims of their performance, it can be hard to filter out the genuine progress that is being made by the open-source community and which model is the current state of the art.
-🤗 Submit a model for automated evaluation on the 🤗 GPU cluster on the "Submit" page!
-The leaderboard's backend runs the great [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) - read more details below!
+For now, the only competitive open language models capable of properly handling Persian are multilingual ones, with Meta's Llama 3.1 as the prime example.
+Only a few multilingual LLMs are capable in Persian, and they derive most of their knowledge from English. A dedicated Persian LLM is still out of reach,
+as very few models are expert in Persian in the first place.
+
+Our goal is to provide a benchmark across diverse domains and tasks that shows how large the gap between today's SOTA models is in different settings.
+
+We use our own framework to evaluate the models on the following benchmarks (TO BE RELEASED SOON).
 ### Tasks
-📈 We evaluate models on 6 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
 - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
 - <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
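The leaderboard's own evaluation framework is not yet released, so as an illustration only, the few-shot settings listed above can be sketched against the Eleuther AI lm-evaluation-harness that the previous version of this text referenced. The model identifier and helper function below are assumptions for the example, not part of the leaderboard's code:

```python
# Few-shot settings for the three benchmarks listed in the diff.
# Task names follow lm-evaluation-harness conventions; this is a
# sketch, not the leaderboard's (unreleased) framework.
FEWSHOT = {
    "arc_challenge": 25,  # AI2 Reasoning Challenge
    "hellaswag": 10,      # HellaSwag
    "mmlu": 5,            # MMLU
}

def harness_command(model_id: str, task: str) -> list[str]:
    """Assemble an `lm_eval` CLI invocation for one benchmark (hypothetical helper)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", task,
        "--num_fewshot", str(FEWSHOT[task]),
    ]

# Example: evaluate a multilingual model on ARC with 25-shot prompting.
print(" ".join(harness_command("meta-llama/Llama-3.1-8B", "arc_challenge")))
```

Each benchmark keeps its own few-shot count, so the gap between models can be compared under the same prompting regime across tasks.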