mrzjy committed
Commit d945ed1 · verified · 1 Parent(s): 511969d

Update README.md

Files changed (1):
  README.md (+16, -2)
README.md CHANGED
@@ -191,18 +191,32 @@ This approach ensures a balanced distribution of positive and negative labels.
  We trained 2 models on the above dataset:

  - NovelWriting-Outline-Qwen2.5-7B-Instruct: The SFT LLM, trained with [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory).
+   - We trained for 2 epochs, since the validation loss began to increase.
  - [NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward](https://huggingface.co/mrzjy/NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward): The PRM for the outline generation task, trained using the TRL library ([refer to the docs](https://huggingface.co/docs/trl/prm_trainer)).
    - Note: This model is trained with the `train_on_last_step_only` flag set to `True`.
+   - We trained for 3 epochs. (The validation loss seems unstable.)

  ## 4. Usage & Performance Evaluation

  ### 4.1 Accuracy Metric

- - Case Study
+ - Classification Report

  ```
+               precision    recall  f1-score   support
+
+      label 0       0.97      0.97      0.97       216
+      label 1       0.99      0.99      0.99       476
+
+     accuracy                           0.98       692
+    macro avg       0.98      0.98      0.98       692
+ weighted avg       0.98      0.98      0.98       692
  ```

+ As noted, the accuracy metric appears inflated, likely for one of two reasons: either the constructed negative labels are too easy to distinguish, or the model is overfitting, with the test data sharing the same distribution as the training data. As a result, the metric may not accurately reflect the model's generalization capability.
+
+ Let's move on nonetheless and see how it actually performs with LLM sampling.
+
  ### 4.2 Sequential Rejection Sampling

  Without delving further into reinforcement learning, can we apply the PRM directly with our LLMs? The answer is YES!
@@ -221,7 +235,7 @@ The "Test-Time Scaling Performance" is visualized as follows:
  - There is significant room for improvement in the training data construction. For example, it could be enhanced by introducing a variety of flaws (e.g., repetitive patterns, toxic content, instruction-following failures, etc.) and incorporating outputs from more diverse LLMs.


- #### 4.3 Generalization Concerns
+ #### 4.3 Generalization Issue

  - Case Study: Format affects the results

 
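For readers who want to reproduce the PRM training step referenced in the diff above, here is a minimal sketch using TRL's `PRMTrainer` with `train_on_last_step_only=True`, following the quick-start pattern from the linked TRL documentation. The base model choice, dataset path, and hyperparameters are placeholders or assumptions, not the exact values used for the released checkpoint.

```python
# Minimal sketch of PRM training with TRL's PRMTrainer (see the linked TRL docs).
# The dataset path is a placeholder; it must provide the stepwise-supervision format
# (columns: "prompt", "completions", "labels"). Hyperparameters are illustrative.
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

base_model = "Qwen/Qwen2.5-0.5B"  # assumed base; the README only names the final PRM
model = AutoModelForTokenClassification.from_pretrained(base_model, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Placeholder: point this at the outline PRM dataset in stepwise-supervision format.
train_dataset = load_dataset("path/to/novel-writing-outline-prm-data", split="train")

training_args = PRMConfig(
    output_dir="NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward",
    num_train_epochs=3,            # the diff above mentions 3 epochs
    train_on_last_step_only=True,  # as noted in the README
)

trainer = PRMTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```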
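The report added in the diff has the shape of scikit-learn's `classification_report`. Below is a small, self-contained sketch of how such a table can be produced from gold step labels and PRM predictions; the labels here are toy values, purely for illustration.

```python
# Toy illustration of producing a report like the one in the diff with scikit-learn.
# y_true / y_pred stand in for gold step labels and PRM predictions on a held-out set.
from sklearn.metrics import classification_report

y_true = [1, 1, 0, 1, 0, 1, 1, 0]  # 1 = acceptable step, 0 = flawed step
y_pred = [1, 1, 0, 1, 1, 1, 1, 0]  # hypothetical PRM decisions
print(classification_report(y_true, y_pred, target_names=["label 0", "label 1"]))
```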
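The "Sequential Rejection Sampling" section pairs the SFT LLM with the PRM at inference time. The sketch below shows one plausible reading of that idea (generate a step, score it with the PRM, resample until the score clears a threshold); the helper functions are toy stand-ins, and the threshold, retry budget, and acceptance rule are assumptions, not the exact procedure from the README.

```python
# Illustrative sketch of sequential rejection sampling guided by a PRM.
# Both helpers are toy stand-ins: generate_next_step would call the SFT LLM
# (NovelWriting-Outline-Qwen2.5-7B-Instruct) and score_step would call the PRM
# (NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward). Threshold/retry values are assumed.
import random


def generate_next_step(prompt: str, steps: list[str]) -> str:
    # Toy stand-in for sampling the next outline step from the SFT LLM.
    return f"(chapter {len(steps) + 1} outline for: {prompt})"


def score_step(prompt: str, steps: list[str], candidate: str) -> float:
    # Toy stand-in for the PRM's probability that the candidate step is good.
    return random.random()


def sequential_rejection_sampling(prompt: str, n_steps: int,
                                  threshold: float = 0.5,
                                  max_retries: int = 8) -> list[str]:
    steps: list[str] = []
    for _ in range(n_steps):
        best_step, best_score = None, float("-inf")
        for _ in range(max_retries):
            candidate = generate_next_step(prompt, steps)
            score = score_step(prompt, steps, candidate)
            if score > best_score:
                best_step, best_score = candidate, score
            if score >= threshold:  # accept the first candidate that clears the bar
                break
        steps.append(best_step)  # otherwise fall back to the best-scoring candidate
    return steps


if __name__ == "__main__":
    outline = sequential_rejection_sampling("A heist story set in a floating city", n_steps=5)
    print("\n".join(outline))
```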