cstr committed on
Commit
68446c6
1 Parent(s): e86e51d

Update README.md

Files changed (1)
  1. README.md +81 -0
README.md CHANGED
@@ -35,6 +35,87 @@ score of 64.81, but only
  |GSM8k (5-shot) |62.70|


+ | Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
+ |--------------------------------------------------------------|------:|------:|---------:|-------:|------:|
+ |[Spaetzle-v12-7b](https://huggingface.co/cstr/Spaetzle-v12-7b)| 42.64| 74.3| 58.44| 44.44| 54.95|
+
+ ### AGIEval
+ | Task |Version| Metric |Value| |Stderr|
+ |------------------------------|------:|--------|----:|---|-----:|
+ |agieval_aqua_rat | 0|acc |24.02|± | 2.69|
+ | | |acc_norm|21.65|± | 2.59|
+ |agieval_logiqa_en | 0|acc |36.10|± | 1.88|
+ | | |acc_norm|37.63|± | 1.90|
+ |agieval_lsat_ar | 0|acc |24.35|± | 2.84|
+ | | |acc_norm|23.04|± | 2.78|
+ |agieval_lsat_lr | 0|acc |48.82|± | 2.22|
+ | | |acc_norm|47.25|± | 2.21|
+ |agieval_lsat_rc | 0|acc |60.59|± | 2.98|
+ | | |acc_norm|57.99|± | 3.01|
+ |agieval_sat_en | 0|acc |76.21|± | 2.97|
+ | | |acc_norm|74.76|± | 3.03|
+ |agieval_sat_en_without_passage| 0|acc |46.60|± | 3.48|
+ | | |acc_norm|45.63|± | 3.48|
+ |agieval_sat_math | 0|acc |37.27|± | 3.27|
+ | | |acc_norm|33.18|± | 3.18|
+
+ Average: 42.64%
+
+ ### GPT4All
+ | Task |Version| Metric |Value| |Stderr|
+ |-------------|------:|--------|----:|---|-----:|
+ |arc_challenge| 0|acc |59.13|± | 1.44|
+ | | |acc_norm|61.26|± | 1.42|
+ |arc_easy | 0|acc |83.67|± | 0.76|
+ | | |acc_norm|80.89|± | 0.81|
+ |boolq | 1|acc |87.83|± | 0.57|
+ |hellaswag | 0|acc |66.45|± | 0.47|
+ | | |acc_norm|84.63|± | 0.36|
+ |openbookqa | 0|acc |37.40|± | 2.17|
+ | | |acc_norm|45.80|± | 2.23|
+ |piqa | 0|acc |82.15|± | 0.89|
+ | | |acc_norm|83.13|± | 0.87|
+ |winogrande | 0|acc |76.56|± | 1.19|
+
+ Average: 74.3%
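
The README does not say how each suite average is formed. The reported 74.3 is consistent with taking `acc_norm` where a task reports it, falling back to `acc` otherwise (boolq and winogrande), and averaging over the seven tasks. A minimal Python sketch of that assumed rule, using the values from the GPT4All table; the dictionary layout and the `suite_average` name are illustrative and not part of any evaluation harness:

```python
# Scores copied from the GPT4All table: task -> {metric: value}.
gpt4all = {
    "arc_challenge": {"acc": 59.13, "acc_norm": 61.26},
    "arc_easy": {"acc": 83.67, "acc_norm": 80.89},
    "boolq": {"acc": 87.83},
    "hellaswag": {"acc": 66.45, "acc_norm": 84.63},
    "openbookqa": {"acc": 37.40, "acc_norm": 45.80},
    "piqa": {"acc": 82.15, "acc_norm": 83.13},
    "winogrande": {"acc": 76.56},
}


def suite_average(results, preferred="acc_norm", fallback="acc"):
    """Mean over tasks, using `preferred` when present, else `fallback` (assumed rule)."""
    picked = [metrics.get(preferred, metrics[fallback]) for metrics in results.values()]
    return sum(picked) / len(picked)


print(f"{suite_average(gpt4all):.2f}")
```

The same rule applied to the AGIEval table, where every task reports `acc_norm`, gives 42.64. The Bigbench figure matches the mean of the `multiple_choice_grade` values, and the TruthfulQA figure equals the `mc2` value.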
+
+ ### TruthfulQA
+ | Task |Version|Metric|Value| |Stderr|
+ |-------------|------:|------|----:|---|-----:|
+ |truthfulqa_mc| 1|mc1 |42.59|± | 1.73|
+ | | |mc2 |58.44|± | 1.58|
+
+ Average: 58.44%
+
+ ### Bigbench
+ | Task |Version| Metric |Value| |Stderr|
+ |------------------------------------------------|------:|---------------------|----:|---|-----:|
+ |bigbench_causal_judgement | 0|multiple_choice_grade|55.26|± | 3.62|
+ |bigbench_date_understanding | 0|multiple_choice_grade|64.77|± | 2.49|
+ |bigbench_disambiguation_qa | 0|multiple_choice_grade|37.60|± | 3.02|
+ |bigbench_geometric_shapes | 0|multiple_choice_grade|32.31|± | 2.47|
+ | | |exact_str_match |21.45|± | 2.17|
+ |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|31.00|± | 2.07|
+ |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|22.43|± | 1.58|
+ |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|53.00|± | 2.89|
+ |bigbench_movie_recommendation | 0|multiple_choice_grade|40.40|± | 2.20|
+ |bigbench_navigate | 0|multiple_choice_grade|51.30|± | 1.58|
+ |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|68.50|± | 1.04|
+ |bigbench_ruin_names | 0|multiple_choice_grade|48.66|± | 2.36|
+ |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|30.36|± | 1.46|
+ |bigbench_snarks | 0|multiple_choice_grade|70.17|± | 3.41|
+ |bigbench_sports_understanding | 0|multiple_choice_grade|70.39|± | 1.45|
+ |bigbench_temporal_sequences | 0|multiple_choice_grade|31.00|± | 1.46|
+ |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|21.44|± | 1.16|
+ |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|18.29|± | 0.92|
+ |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|53.00|± | 2.89|
+
+ Average: 44.44%
+
+ Average score: 54.95%
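
The overall score appears to be the unweighted mean of the four suite averages. That is an inference from the numbers, not something the README states. A short Python sketch of that assumed calculation, with the suite averages entered as reported:

```python
# Suite averages as reported in the sections above.
suite_scores = {
    "AGIEval": 42.64,
    "GPT4All": 74.30,
    "TruthfulQA": 58.44,
    "Bigbench": 44.44,
}

# Unweighted mean across suites (assumed aggregation rule).
overall = sum(suite_scores.values()) / len(suite_scores)
print(f"unrounded mean: {overall:.3f}")
```

The mean of the four rounded figures falls on a rounding boundary (54.955), so the reported 54.95 is consistent with averaging the suite scores before they were rounded to two decimals.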
+
+ Elapsed time: 02:50:51
+
  ## 🧩 Configuration

  ```yaml