Update README.md

README.md CHANGED

@@ -191,18 +191,32 @@ This approach ensures a balanced distribution of positive and negative labels.

We trained 2 models on the above dataset:

- NovelWriting-Outline-Qwen2.5-7B-Instruct: the SFT LLM, trained with [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory).
  - We trained for 2 epochs, since the validation loss began to increase.
- [NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward](https://huggingface.co/mrzjy/NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward): the PRM for the outline generation task, trained using the TRL library ([refer to the docs](https://huggingface.co/docs/trl/prm_trainer)).
  - Note: this model is trained with the `train_on_last_step_only` flag set to `True` (a minimal training sketch is shown after this list).
  - We trained for 3 epochs (the validation loss seems unstable).

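For reference, a minimal sketch of what this PRM training could look like with TRL's `PRMTrainer` (the base checkpoint, dataset path, column layout, and hyperparameters below are illustrative assumptions, not the repo's actual script):

```python
# Sketch only: train a small PRM with TRL's PRMTrainer.
# Assumed stepwise-supervision data: one JSON record per sample with
#   "prompt" (str), "completions" (list of step strings), "labels" (list of bools),
# stored at the hypothetical path "data/prm_train.jsonl".
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

base_model = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed base checkpoint
model = AutoModelForTokenClassification.from_pretrained(base_model, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(base_model)

train_dataset = load_dataset("json", data_files="data/prm_train.jsonl", split="train")

training_args = PRMConfig(
    output_dir="NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward",
    num_train_epochs=3,            # per the note above
    train_on_last_step_only=True,  # supervise only the label of the final step
)

trainer = PRMTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```
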
## 4. Usage & Performance Evaluation

### 4.1 Accuracy Metric

- Classification Report

```
              precision    recall  f1-score   support

     label 0       0.97      0.97      0.97       216
     label 1       0.99      0.99      0.99       476

    accuracy                           0.98       692
   macro avg       0.98      0.98      0.98       692
weighted avg       0.98      0.98      0.98       692
```
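
The table above matches the output format of scikit-learn's `classification_report`. A sketch of how such a report could be produced from the PRM's step-level predictions on a held-out split (the `y_true`/`y_pred` arrays are placeholders, not the repo's actual evaluation code):

```python
# Sketch only: compare gold step labels against labels predicted by the PRM.
# In practice y_pred would come from running the trained PRM over held-out
# outlines and thresholding its per-step scores.
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1]  # gold labels (0 = flawed step, 1 = good step) -- placeholder
y_pred = [0, 1, 1, 0, 1]  # PRM-predicted labels -- placeholder

print(classification_report(y_true, y_pred, target_names=["label 0", "label 1"]))
```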

As noted, the accuracy metric appears inflated, likely for one of two reasons: either the constructed negative labels are too easy to distinguish, or the model is overfitting, since the test data shares the same distribution as the training data. Either way, the metric may fail to accurately reflect the model's generalization capability.

Let's move on nonetheless to see how it actually performs with LLM sampling.

### 4.2 Sequential Rejection Sampling

Without delving into further reinforcement learning, can we directly apply the PRM with our LLMs? The answer is YES!
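
The exact procedure isn't shown in this hunk, but the general idea of sequential rejection sampling with a PRM can be sketched as follows; `generate_next_step` and `prm_score` are hypothetical stand-ins for the SFT LLM sampling call and the PRM forward pass:

```python
# Generic sketch of sequential rejection sampling (best-of-N per outline step).
from typing import Callable, List

def sequential_rejection_sampling(
    prompt: str,
    generate_next_step: Callable[[str, List[str]], str],  # (prompt, steps so far) -> candidate step
    prm_score: Callable[[str, List[str]], float],         # (prompt, steps incl. candidate) -> score
    num_steps: int = 5,
    num_candidates: int = 8,
) -> List[str]:
    steps: List[str] = []
    for _ in range(num_steps):
        # Sample several candidate continuations for the current step,
        # score each partial outline with the PRM, and keep the best one.
        candidates = [generate_next_step(prompt, steps) for _ in range(num_candidates)]
        best = max(candidates, key=lambda c: prm_score(prompt, steps + [c]))
        steps.append(best)
    return steps
```

In this sketch, increasing `num_candidates` is the natural test-time scaling knob.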

@@ -221,7 +235,7 @@ The "Test-Time Scaling Performance" is visualized as follows:

- There is significant room for improvement in the training data construction. For example, it could be enhanced by introducing a variety of flaws (e.g., repetitive patterns, toxic content, instruction-following failures, etc.) and incorporating outputs from more diverse LLMs.

### 4.3 Generalization Issue

- Case Study: Format affects the results