jondurbin/bagel-dpo-8x7b-v0.2

jondurbin committed on Jan 8

Commit fb9973a • 1 Parent(s): 53885cb

Update README.md

Files changed (1)

README.md +49 -0

@@ -44,6 +44,55 @@ This is the model after both SFT and DPO.  Check out the [non-DPO version here](
 Hardware kindly provided by [Massed Compute](https://massedcompute.com/?utm_source=huggingface&utm_creative_format=model_card&utm_content=creator_jon)
+## Benchmark info
+I didn't run a comprehensive set of benchmarks, but here are a couple of note:
+### MT-Bench
+| model | turn | score |
+| --- | --- | --- |
+| bagel-dpo-8x7b-v0.2 | 1 | 8.43750 |
+| bagel-8x7b-v0.2 | 1 | 8.05625 |
+| bagel-dpo-8x7b-v0.2 | 2 | 7.6000 |
+| bagel-8x7b-v0.2 | 2 | 7.1375 |
+#### Average
+| model | score |
+| --- | --- |
+| bagel-dpo-8x7b-v0.2 | 8.018750 |
+| bagel-8x7b-v0.2 | 7.596875 |
+### TruthfulQA
+| model | score |
+| --- | --- |
+| bagel-dpo-8x7b-v0.2 | 0.7242 |
+| bagel-8x7b-v0.2 | 0.5921 |
+### GSM8K
+The default GSM8K configuration seems to break because this model sometimes outputs multiple consecutive newlines, which hit the `"\n\n"` stop sequence before the final answer is produced. If you apply this patch to lm-evaluation-harness, the benchmark works properly:
+```diff
+diff --git a/lm_eval/tasks/gsm8k/gsm8k.yaml b/lm_eval/tasks/gsm8k/gsm8k.yaml
+index ccf6a5a3..df0b7422 100644
+--- a/lm_eval/tasks/gsm8k/gsm8k.yaml
++++ b/lm_eval/tasks/gsm8k/gsm8k.yaml
+@@ -21,10 +21,10 @@ metric_list:
+       - "(?s).*#### "
+ generation_kwargs:
+   until:
+-    - "\n\n"
+     - "Question:"
+   do_sample: false
+   temperature: 0.0
++  max_new_tokens: 2048
+ repeats: 1
+ num_fewshot: 5
+ filter_list:
+```
 ### Data sources
 *Yes, you will see benchmark names in the list, but this only uses the train splits, and a decontamination by cosine similarity is performed at the end as a sanity check*
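
To illustrate why dropping the `"\n\n"` stop sequence in the GSM8K patch above matters, here is a minimal sketch of stop-sequence truncation. It is not lm-evaluation-harness code: `truncate_at_stop`, `extract_answer`, the answer regex, and the sample output are all hypothetical, written only to show the effect under the assumption that generation is cut at the earliest matching stop string.

```python
import re
from typing import List, Optional


def truncate_at_stop(text: str, stop_sequences: List[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence (hypothetical helper)."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


def extract_answer(text: str) -> Optional[str]:
    """Pull the number after '#### ' (illustrative regex, not the harness's filter)."""
    match = re.search(r"#### (-?[0-9.,]+)", text)
    return match.group(1) if match else None


# Made-up model output with a blank line between reasoning steps.
sample = "Tom has 3 apples and buys 4 more.\n\n3 + 4 = 7\n#### 7\n\nQuestion: next"

default_stops = ["\n\n", "Question:"]  # stop list before the patch
patched_stops = ["Question:"]          # stop list after the patch

default_out = truncate_at_stop(sample, default_stops)
patched_out = truncate_at_stop(sample, patched_stops)

print(extract_answer(default_out))  # None: output was cut before the answer line
print(extract_answer(patched_out))  # 7
```

With the default stop list, the first blank line ends the generation before the `#### 7` line, so nothing can be extracted and the item is scored as wrong. With only `"Question:"` as a stop, the full reasoning and answer survive.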