Update README.md
README.md
CHANGED
@@ -192,7 +192,7 @@ The model training took roughly two months.
 
 ## Benchmarks
 
-We evaluate our model on all benchmarks of the leaderboard's version
+We evaluate our model on all benchmarks of the new leaderboard's version using the `lm-evaluation-harness` package, and then normalize the evaluation results with HuggingFace score normalization.
 
 
 | `model name` |`IFEval`| `BBH` |`MATH LvL5`| `GPQA`| `MUSR`|`MMLU-PRO`|`Average`|
@@ -212,6 +212,8 @@ We evaluate our model on all benchmarks of the leaderboard's version 2 using the
 | `gemma-7B` | 26.59 | 21.12 | 6.42 | 4.92 | 10.98 | 21.64 |**15.28**|
 
 
+Also, we evaluate our model on the benchmarks of the first leaderboard using `lighteval`.
+
 
 | `model name` |`ARC`|`HellaSwag` |`MMLU` |`Winogrande`|`TruthfulQA`|`GSM8K`|`Average` |
 |:-----------------------------|:------:|:---------:|:-----:|:----------:|:----------:|:-----:|:----------------:|
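The added lines name the `lm-evaluation-harness` package but do not show how the scores were produced. A minimal sketch of such a run through the package's Python API, assuming a recent `lm_eval` release that ships the Open LLM Leaderboard v2 `leaderboard` task group and using a placeholder model id (both assumptions, not taken from this commit):

```python
# Sketch only: assumes lm-evaluation-harness >= 0.4.3, which exposes
# simple_evaluate() and a "leaderboard" task group covering IFEval, BBH,
# MATH Lvl 5, GPQA, MUSR, and MMLU-PRO. The model id is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=your-org/your-model",  # placeholder model id
    tasks=["leaderboard"],                        # assumed leaderboard v2 task group name
    batch_size=8,
)

# Raw per-task scores; the leaderboard additionally normalizes them
# (rescaling above the random-guess baseline) before averaging.
for task, metrics in results["results"].items():
    print(task, metrics)
```

The first-leaderboard numbers referenced in the second hunk would come from an analogous `lighteval` run; its exact invocation is not shown in the commit, so no command is reproduced here.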