Behnamm committed
Commit 3a5b349
1 Parent(s): b5c5474

Update src/about.py

Files changed (1)
  1. src/about.py +3 -1
src/about.py CHANGED
@@ -60,13 +60,15 @@ We use our own framework to evaluate the models on the following benchmarks (TO
  ### Tasks
  - PeKA: Persian Knowledge Assessment (0-shot) - a set of multiple-choice questions that tests the level of native knowledge in the Persian language across more than 15 domains and categories: from art to history and geography, cinema, TV, sports, law, and medicine, and much more.
  - PersBETS: Persian Bias Ethics Toxicity and Skills (0-shot) - a test of a model's capability in linguistic skills such as grammar and paraphrasing, along with questions examining the bias, ethics, and toxicity of the model.
- - <a href="https://arxiv.org/abs/2404.06644" target="_blank"> Khayyam Challenge (Persian MMLU) </a> (0-shot) - comprising 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages
+ - <a href="https://arxiv.org/abs/2404.06644" target="_blank"> Khayyam Challenge (Persian MMLU) </a> (0-shot) - comprising 20,805 four-choice questions (of which we use 20,776, removing questions that are longer than 200 words) sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages
  - <a href="https://arxiv.org/abs/2012.06154" target="_blank"> ParsiNLU MCQA </a> (0-shot) - a series of multiple-choice questions in the domains of *literature*, *math & logic*, and *common knowledge*.
  - <a href="https://arxiv.org/abs/2012.06154" target="_blank"> ParsiNLU NLI </a> (max[0,3,5,10]-shot) - a 3-way classification to determine whether a hypothesis sentence entails, contradicts, or is neutral with respect to a given premise sentence.
  - <a href="https://arxiv.org/abs/2012.06154" target="_blank"> ParsiNLU QQP </a> (max[0,2,5,10]-shot) - the task of deciding whether two given questions are paraphrases of each other or not.

  For all these evaluations, a higher score is a better score.

+ We use the given *test* subset (for those benchmarks that also have *train* and *dev* subsets) for all these evaluations.
+
  We chose these benchmarks for now, but several other benchmarks will be added later to help us perform a more thorough examination of models.

  The last two benchmarks, ParsiNLU NLI and ParsiNLU QQP, are evaluated in different few-shot settings, and the maximum score is returned as the final evaluation.
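To make that max-over-shots aggregation concrete, here is a minimal Python sketch. The shot settings mirror the task list above, but the `final_score` helper and the example accuracies are hypothetical illustrations, not the leaderboard's actual evaluation code.

```python
# Minimal sketch of the max-over-few-shot aggregation described above.
# The per-setting scores are hypothetical placeholders; in the real
# pipeline each value would come from running the benchmark at that
# shot count.

SHOT_SETTINGS = {
    "parsinlu_nli": (0, 3, 5, 10),
    "parsinlu_qqp": (0, 2, 5, 10),
}

def final_score(scores_by_shot: dict[int, float]) -> float:
    """Report the best score achieved across all few-shot settings."""
    return max(scores_by_shot.values())

# Hypothetical per-setting accuracies for ParsiNLU NLI:
nli_scores = {0: 0.41, 3: 0.47, 5: 0.52, 10: 0.50}
assert set(nli_scores) == set(SHOT_SETTINGS["parsinlu_nli"])
print(final_score(nli_scores))  # -> 0.52
```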
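The Khayyam Challenge filtering added in the diff above amounts to a single length check. A sketch follows, assuming questions carry a `question` text field and that "words" means whitespace-delimited tokens; both are assumptions, not details confirmed by the source.

```python
# Sketch of the length filter described for the Khayyam Challenge:
# keep only questions of at most 200 words. The `questions` list and
# its "question" field are hypothetical, and "words" is assumed to
# mean whitespace-delimited tokens.

def keep_question(example: dict) -> bool:
    return len(example["question"].split()) <= 200

questions = [
    {"question": "a short sample question"},  # kept (4 words)
    {"question": " ".join(["word"] * 250)},   # dropped (250 words)
]
filtered = [q for q in questions if keep_question(q)]
print(len(filtered))  # -> 1
```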