persian_llm_leaderboard


Behnamm committed on Aug 28

Commit b5c5474 (1 parent: f97bde5)

Update src/about.py

Files changed (1)

src/about.py

@@ -70,8 +70,8 @@ For all these evaluations, a higher score is a better score.
 We chose these benchmarks for now, but several other benchmarks are going to be added later to help us perform a more thorough examination of models.
 The last two benchmarks, ParsiNLU NLI and ParsiNLU QQP are evaluated in different few-shot settings and then the maximum score is returned as the final evaluation.
-We argue that is indeed a fair evaluation method since many light-weight models (around ~7B and less) can have a pooor in-context learning and thus they perform better
-in small shots. We wish to not hold this against the model by trying to measure performances in different settings and take the maximum score achieved .
+We argue that this is indeed a fair evaluation scheme since many light-weight models (around ~7B and less) can have a poor in-context learning and thus perform better
+in small shots (or have a small knowledge capacity and perform poorly in zero-shot). We wish to not hold this against the model by trying to measure performances in different settings and take the maximum score achieved .
 ## REPRODUCIBILITY
 The parameters used for evaluation along with instructions and prompts will be available once the framework is release. (TO BE COMPLETED)
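
Note on the changed paragraph: the max-over-settings scheme it describes can be sketched roughly as below. This is only an illustration, since the leaderboard's evaluation framework is not released yet. The function name `best_few_shot_score`, the `evaluate` callback, the shot counts, and the scores are all hypothetical and not taken from the repository.

```python
from typing import Callable, Iterable, Tuple


def best_few_shot_score(
    evaluate: Callable[[int], float],
    shot_counts: Iterable[int],
) -> Tuple[int, float]:
    """Evaluate one model at several shot counts and keep the best result.

    `evaluate` is a hypothetical callback that takes a number of in-context
    examples and returns a score where higher is better.
    """
    # Score the model once per few-shot setting.
    scores = {k: evaluate(k) for k in shot_counts}
    if not scores:
        raise ValueError("shot_counts must not be empty")
    # Report the setting with the maximum score as the final evaluation.
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]


if __name__ == "__main__":
    # Made-up scores for illustration only; not real leaderboard results.
    fake_scores = {0: 0.41, 1: 0.55, 3: 0.52, 5: 0.47}
    shots, score = best_few_shot_score(fake_scores.__getitem__, fake_scores)
    print(f"best setting: {shots}-shot, score={score}")
```

Under this scheme, a model that is weak in zero-shot and a model that degrades with longer contexts are each reported at their strongest setting, which matches the fairness argument in the added lines.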