Commit fb9973a by jondurbin (1 parent: 53885cb)

Update README.md

Files changed (1): README.md +49 -0
@@ -44,6 +44,55 @@ This is the model after both SFT and DPO. Check out the [non-DPO version here](
 
  Hardware kindly provided by [Massed Compute](https://massedcompute.com/?utm_source=huggingface&utm_creative_format=model_card&utm_content=creator_jon)
 
+ ## Benchmark info
+
+ I didn't run a comprehensive set of benchmarks, but here are a couple worth noting:
+
+ ### MT-Bench
+
+ | model | turn | score |
+ | --- | --- | --- |
+ | bagel-dpo-8x7b-v0.2 | 1 | 8.43750 |
+ | bagel-8x7b-v0.2 | 1 | 8.05625 |
+ | bagel-dpo-8x7b-v0.2 | 2 | 7.6000 |
+ | bagel-8x7b-v0.2 | 2 | 7.1375 |
+
+ Average of the two turns:
+
+ | model | score |
+ | --- | --- |
+ | bagel-dpo-8x7b-v0.2 | 8.018750 |
+ | bagel-8x7b-v0.2 | 7.596875 |
+
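The averages in the second MT-Bench table can be reproduced from the per-turn scores. The following is a minimal sketch that assumes an unweighted mean over the two turns; that assumption matches the reported numbers, but the variable and function names are illustrative and not taken from any MT-Bench tooling.

```python
# Per-turn MT-Bench scores copied from the table above.
turn_scores = {
    "bagel-dpo-8x7b-v0.2": [8.43750, 7.6000],
    "bagel-8x7b-v0.2": [8.05625, 7.1375],
}


def mean_score(scores):
    """Unweighted mean of the per-turn scores (assumed averaging scheme)."""
    return sum(scores) / len(scores)


averages = {model: mean_score(scores) for model, scores in turn_scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.6f}")
# bagel-dpo-8x7b-v0.2: 8.018750
# bagel-8x7b-v0.2: 7.596875
```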
+ ### TruthfulQA
+
+ | model | score |
+ | --- | --- |
+ | bagel-dpo-8x7b-v0.2 | 0.7242 |
+ | bagel-8x7b-v0.2 | 0.5921 |
+
+ ### GSM8K
+
+ The default GSM8K configuration seems to break because this model sometimes outputs multiple consecutive newlines (for some reason?), which presumably hits the `"\n\n"` stop sequence before the final answer. If you apply this patch to lm-evaluation-harness, which removes that stop sequence and sets `max_new_tokens` to 2048, the benchmark works properly:
+ ```
+ diff --git a/lm_eval/tasks/gsm8k/gsm8k.yaml b/lm_eval/tasks/gsm8k/gsm8k.yaml
+ index ccf6a5a3..df0b7422 100644
+ --- a/lm_eval/tasks/gsm8k/gsm8k.yaml
+ +++ b/lm_eval/tasks/gsm8k/gsm8k.yaml
+ @@ -21,10 +21,10 @@ metric_list:
+        - "(?s).*#### "
+  generation_kwargs:
+    until:
+ -    - "\n\n"
+      - "Question:"
+    do_sample: false
+    temperature: 0.0
+ +  max_new_tokens: 2048
+  repeats: 1
+  num_fewshot: 5
+  filter_list:
+ ```
+
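To illustrate why a `"\n\n"` stop sequence can hide the answer, here is a simplified sketch of stop-sequence truncation. It is not the lm-evaluation-harness implementation: `truncate_at_stop` and the sample completion are hypothetical, and only the two stop lists come from the patch above.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Hypothetical completion with a blank line before the final answer marker.
completion = "She has 3 + 4 = 7 apples.\n\n#### 7\n\nQuestion: next one"

default_stops = ["\n\n", "Question:"]  # `until` list before the patch
patched_stops = ["Question:"]          # `until` list after the patch

before = truncate_at_stop(completion, default_stops)
after = truncate_at_stop(completion, patched_stops)

print("#### 7" in before)  # False: the answer marker was cut off
print("#### 7" in after)   # True: the answer marker survives
```

With the default stop list, the text is cut at the first blank line, so the `#### ` answer marker never reaches the answer-extraction step.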
  ### Data sources
 
  *Yes, you will see benchmark names in the list, but this only uses the train splits, and a decontamination by cosine similarity is performed at the end as a sanity check*
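
As a rough illustration of what decontamination by cosine similarity can look like, the sketch below drops training items that are too similar to any benchmark test item. Everything here is an assumption made for illustration: the source does not say how texts are vectorized or which threshold is used, so this uses simple bag-of-words counts and a hypothetical threshold of 0.9, where a real pipeline would more likely compare embedding vectors.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts (assumed vectorization, for illustration only)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def decontaminate(train_items, test_items, threshold=0.9):
    """Keep only training items whose similarity to every test item is below threshold."""
    test_vecs = [bow(t) for t in test_items]
    kept = []
    for item in train_items:
        vec = bow(item)
        if all(cosine(vec, tv) < threshold for tv in test_vecs):
            kept.append(item)
    return kept


# Hypothetical data: the first training item duplicates a test item.
train = ["What is 2 plus 2?", "Name the capital of France."]
test = ["what is 2 plus 2?"]

clean = decontaminate(train, test)
print(clean)  # ['Name the capital of France.']
```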