The "Test-Time Scaling Performance" is visualized as follows:
![test_time_scaling.png](image%2Ftest_time_scaling.png)
- Human Evaluation

Despite the small scale of the experiment, we presented the generation results to an expert writer friend and gathered qualitative feedback. The models compared were:

- Model 1: SFT model with top-p sampling
- Model 2: SFT model with sequential rejection sampling (size = 4)
- Model 3: "Ground-truth" outlines summarized from real novels

The expert preferred Model 2 and Model 3 over Model 1. Specifically, Model 2 performed better at capturing the logic of time/world traversal and at shaping the protagonist and main storyline, while Model 3 excelled in narrative techniques for depicting love triangles. (Note that these preferences may differ from those of typical readers.)
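A minimal sketch of what the Model 2 setup could look like, assuming a best-of-N reading of "sequential rejection sampling (size = 4)": draw candidates one at a time and keep the outline the PRM scores highest. `generate` and `prm_score` below are hypothetical stand-ins, not this repo's API; the actual procedure may differ.

```python
# Sketch of sequential rejection sampling (size = 4).
# `generate` and `prm_score` are hypothetical stand-ins for the
# SFT model (top-p sampling) and the 0.5B PRM described above.
import random


def generate(prompt: str) -> str:
    # Stand-in: one sampled outline draft for the prompt.
    return f"{prompt} -> outline variant {random.randint(0, 9999)}"


def prm_score(outline: str) -> float:
    # Stand-in: the PRM's scalar quality score for an outline.
    return random.random()


def sequential_rejection_sampling(prompt: str, size: int = 4) -> str:
    """Draw `size` candidate outlines sequentially; keep the highest-scoring one."""
    best_outline, best_score = "", float("-inf")
    for _ in range(size):
        candidate = generate(prompt)
        score = prm_score(candidate)
        if score > best_score:
            best_outline, best_score = candidate, score
    return best_outline


outline = sequential_rejection_sampling("A time-travel romance novel", size=4)
print(outline)
```

With `size = 1` this reduces to plain top-p sampling (Model 1); larger `size` trades more inference compute for outlines the PRM rates higher, which is the test-time scaling axis plotted above.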
## 5. Limitation

- Since this PRM has a relatively small size of 0.5B, we do **not** expect it to generalize well. It is likely that this PRM can only differentiate between low-quality and high-quality outlines generated by Qwen-7B. Outlines produced by other high-performing LLMs or those crafted with careful prompt engineering may be able to bypass or "fool" our PRM.