mrzjy committed
Commit d945ed1 · verified · 1 Parent(s): 511969d

Update README.md

Files changed (1):
  README.md (+16, -2)
README.md CHANGED
@@ -191,18 +191,32 @@ This approach ensures a balanced distribution of positive and negative labels.
  We trained 2 models on the above dataset:

  - NovelWriting-Outline-Qwen2.5-7B-Instruct: The SFT LLM, trained with [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory).
+   - We trained for 2 epochs, since the validation loss began to increase.
  - [NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward](https://huggingface.co/mrzjy/NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward): The PRM for the outline generation task, trained using the TRL library ([refer to the docs](https://huggingface.co/docs/trl/prm_trainer)).
    - Note: This model is trained with the `train_on_last_step_only` flag set to `True`.
+   - We trained for 3 epochs. (The validation loss seems unstable.)

  ## 4. Usage & Performance Evaluation

  ### 4.1 Accuracy Metric

- - Case Study
+ - Classification Report

  ```
+               precision    recall  f1-score   support
+
+      label 0       0.97      0.97      0.97       216
+      label 1       0.99      0.99      0.99       476
+
+     accuracy                           0.98       692
+    macro avg       0.98      0.98      0.98       692
+ weighted avg       0.98      0.98      0.98       692
  ```

+ As noted, the accuracy metric appears inflated, likely for one of two reasons: either the constructed negative labels are too easy to distinguish, or the model is overfitting, with the test data sharing the same distribution as the training data. As a result, the metric may not accurately reflect the model's generalization capability.
+
+ Let's move on nonetheless and see how it actually performs with LLM sampling.
+
  ### 4.2 Sequential Rejection Sampling

  Without delving further into reinforcement learning, can we apply the PRM directly with our LLMs? The answer is YES!
@@ -221,7 +235,7 @@ The "Test-Time Scaling Performance" is visualized as follows:
  - There is significant room for improvement in the training data construction. For example, it could be enhanced by introducing a variety of flaws (e.g., repetitive patterns, toxic content, instruction-following failures, etc.) and incorporating outputs from more diverse LLMs.


- #### 4.3 Generalization Concerns
+ #### 4.3 Generalization Issue

  - Case Study: Format affects the results

 
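For readers who want to reproduce the PRM training step referenced in the diff above, here is a minimal sketch using TRL's `PRMTrainer` with `train_on_last_step_only=True`, following the quick-start pattern from the linked TRL documentation. The base model choice, dataset path, and hyperparameters are placeholders or assumptions, not the exact values used for the released checkpoint.

```python
# Minimal sketch of PRM training with TRL's PRMTrainer (see the linked TRL docs).
# The dataset path is a placeholder; it must provide the stepwise-supervision format
# (columns: "prompt", "completions", "labels"). Hyperparameters are illustrative.
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

base_model = "Qwen/Qwen2.5-0.5B"  # assumed base; the README only names the final PRM
model = AutoModelForTokenClassification.from_pretrained(base_model, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Placeholder: point this at the outline PRM dataset in stepwise-supervision format.
train_dataset = load_dataset("path/to/novel-writing-outline-prm-data", split="train")

training_args = PRMConfig(
    output_dir="NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward",
    num_train_epochs=3,            # the diff above mentions 3 epochs
    train_on_last_step_only=True,  # as noted in the README
)

trainer = PRMTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```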
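The report added in the diff has the shape of scikit-learn's `classification_report`. Below is a small, self-contained sketch of how such a table can be produced from gold step labels and PRM predictions; the labels here are toy values, purely for illustration.

```python
# Toy illustration of producing a report like the one in the diff with scikit-learn.
# y_true / y_pred stand in for gold step labels and PRM predictions on a held-out set.
from sklearn.metrics import classification_report

y_true = [1, 1, 0, 1, 0, 1, 1, 0]  # 1 = acceptable step, 0 = flawed step
y_pred = [1, 1, 0, 1, 1, 1, 1, 0]  # hypothetical PRM decisions
print(classification_report(y_true, y_pred, target_names=["label 0", "label 1"]))
```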
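The "Sequential Rejection Sampling" section pairs the SFT LLM with the PRM at inference time. The sketch below shows one plausible reading of that idea (generate a step, score it with the PRM, resample until the score clears a threshold); the helper functions are toy stand-ins, and the threshold, retry budget, and acceptance rule are assumptions, not the exact procedure from the README.

```python
# Illustrative sketch of sequential rejection sampling guided by a PRM.
# Both helpers are toy stand-ins: generate_next_step would call the SFT LLM
# (NovelWriting-Outline-Qwen2.5-7B-Instruct) and score_step would call the PRM
# (NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward). Threshold/retry values are assumed.
import random


def generate_next_step(prompt: str, steps: list[str]) -> str:
    # Toy stand-in for sampling the next outline step from the SFT LLM.
    return f"(chapter {len(steps) + 1} outline for: {prompt})"


def score_step(prompt: str, steps: list[str], candidate: str) -> float:
    # Toy stand-in for the PRM's probability that the candidate step is good.
    return random.random()


def sequential_rejection_sampling(prompt: str, n_steps: int,
                                  threshold: float = 0.5,
                                  max_retries: int = 8) -> list[str]:
    steps: list[str] = []
    for _ in range(n_steps):
        best_step, best_score = None, float("-inf")
        for _ in range(max_retries):
            candidate = generate_next_step(prompt, steps)
            score = score_step(prompt, steps, candidate)
            if score > best_score:
                best_step, best_score = candidate, score
            if score >= threshold:  # accept the first candidate that clears the bar
                break
        steps.append(best_step)  # otherwise fall back to the best-scoring candidate
    return steps


if __name__ == "__main__":
    outline = sequential_rejection_sampling("A heist story set in a floating city", n_steps=5)
    print("\n".join(outline))
```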