matsuo-lab committed on
Commit 1c63404
1 Parent(s): 96db70d

Update README.md

Files changed (1)
  1. README.md +15 -1
README.md CHANGED
@@ -47,7 +47,21 @@ This repository provides a Japanese-centric multilingual GPT-NeoX model of 10 bi
 
  # Benchmarking
 
- * **Japanese benchmark**
 
  - *We used [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
  - *The 4-task average accuracy is based on results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, and JSQuAD-1.1.*
 
 
  # Benchmarking
 
+ * **Japanese benchmark : JGLUE 8-task (2023-08-27)**
+
+ - *We used [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
+ - *The 8-task average accuracy is based on results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, JSQuAD-1.1, jaqket_v2-0.2, xlsum_ja-1.0, xwinograd_ja, and mgsm-1.0.*
+ - *The model is loaded in float16, and evaluation uses template version 0.3 with few-shot in-context learning.*
+ - *The number of few-shot examples per task is 3, 3, 3, 2, 1, 1, 0, 5.*
+ - *`special_tokens_map.json` was modified to avoid errors when evaluating the second-half benchmarks. As a result, the scores on the first-half benchmarks changed slightly.*
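As a rough illustration of the settings listed above, the sketch below pairs each task with a few-shot count. It assumes the counts are given in the same order as the tasks in the 8-task list, which the README does not state explicitly. The names `TASKS`, `NUM_FEWSHOT`, and `fewshot_by_task` are hypothetical and are not part of lm-evaluation-harness.

```python
# Task names as written in the 8-task description above.
TASKS = [
    "JCommonsenseQA-1.1",
    "JNLI-1.1",
    "MARC-ja-1.1",
    "JSQuAD-1.1",
    "jaqket_v2-0.2",
    "xlsum_ja-1.0",
    "xwinograd_ja",
    "mgsm-1.0",
]

# Few-shot counts as written in the README.
# ASSUMPTION: they follow the same order as TASKS.
NUM_FEWSHOT = [3, 3, 3, 2, 1, 1, 0, 5]

# Guard against a mismatch between the two lists before pairing them.
if len(TASKS) != len(NUM_FEWSHOT):
    raise ValueError("each task needs exactly one few-shot count")

fewshot_by_task = dict(zip(TASKS, NUM_FEWSHOT))

for task, shots in fewshot_by_task.items():
    print(f"{task}: {shots}-shot")
```

Under this assumed ordering, xwinograd_ja would be evaluated zero-shot and mgsm with five examples.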
+
+ | model | average | jcommonsenseqa | jnli | marc_ja | jsquad | jaqket_v2 | xlsum_ja | xwinograd_ja | mgsm |
+ | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
+ | weblab-10b-instruction-sft | 59.11 | 74.62 | 66.56 | 95.49 | 78.34 | 63.32 | 20.57 | 71.95 | 2 |
+ | weblab-10b | 50.74 | 66.58 | 53.74 | 82.07 | 62.94 | 56.19 | 10.03 | 71.95 | 2.4 |
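The `average` column appears to be the unweighted mean of the eight per-task scores. The sketch below checks that reading against the table values. Treat it as an observation drawn from the numbers, not a statement of how the harness aggregates results. `SCORES`, `REPORTED`, and `mean_score` are hypothetical names introduced for illustration.

```python
# Per-task scores copied from the table, in column order:
# jcommonsenseqa, jnli, marc_ja, jsquad, jaqket_v2, xlsum_ja, xwinograd_ja, mgsm
SCORES = {
    "weblab-10b-instruction-sft": [74.62, 66.56, 95.49, 78.34, 63.32, 20.57, 71.95, 2],
    "weblab-10b": [66.58, 53.74, 82.07, 62.94, 56.19, 10.03, 71.95, 2.4],
}

# Values of the "average" column from the same table.
REPORTED = {
    "weblab-10b-instruction-sft": 59.11,
    "weblab-10b": 50.74,
}


def mean_score(scores):
    """Return the unweighted arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)


for model, scores in SCORES.items():
    computed = mean_score(scores)
    print(f"{model}: computed {computed:.2f}, reported {REPORTED[model]:.2f}")
```

For both models the computed mean agrees with the reported average to two decimal places.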
+
+
+ * **Japanese benchmark : JGLUE 4-task (2023-08-18)**
 
  - *We used [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
  - *The 4-task average accuracy is based on results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, and JSQuAD-1.1.*