The "Test-Time Scaling Performance" is visualized as follows:
![test_time_scaling.png](image%2Ftest_time_scaling.png)
- Human Evaluation

Despite the small scale of the experiment, we presented the generation results to an expert writer friend and gathered qualitative feedback. The models compared were:

- Model 1: SFT model with top-p sampling
- Model 2: SFT model with sequential rejection sampling (size = 4)
- Model 3: "Ground-truth" outlines summarized from real novels

The expert preferred Model 2 and Model 3 over Model 1. Specifically, Model 2 performed better at capturing the logic of time/world traversal and at shaping the protagonist and main storyline, while Model 3 excelled in narrative techniques for depicting love triangles. (Note that these preferences may differ from those of typical readers.)
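A minimal sketch of what the Model 2 setup could look like, assuming a best-of-N reading of "sequential rejection sampling (size = 4)": draw candidates one at a time and keep the outline the PRM scores highest. `generate` and `prm_score` below are hypothetical stand-ins, not this repo's API; the actual procedure may differ.

```python
# Sketch of sequential rejection sampling (size = 4).
# `generate` and `prm_score` are hypothetical stand-ins for the
# SFT model (top-p sampling) and the 0.5B PRM described above.
import random


def generate(prompt: str) -> str:
    # Stand-in: one sampled outline draft for the prompt.
    return f"{prompt} -> outline variant {random.randint(0, 9999)}"


def prm_score(outline: str) -> float:
    # Stand-in: the PRM's scalar quality score for an outline.
    return random.random()


def sequential_rejection_sampling(prompt: str, size: int = 4) -> str:
    """Draw `size` candidate outlines sequentially; keep the highest-scoring one."""
    best_outline, best_score = "", float("-inf")
    for _ in range(size):
        candidate = generate(prompt)
        score = prm_score(candidate)
        if score > best_score:
            best_outline, best_score = candidate, score
    return best_outline


outline = sequential_rejection_sampling("A time-travel romance novel", size=4)
print(outline)
```

With `size = 1` this reduces to plain top-p sampling (Model 1); larger `size` trades more inference compute for outlines the PRM rates higher, which is the test-time scaling axis plotted above.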
## 5. Limitation

- Since this PRM has a relatively small size of 0.5B, we do **not** expect it to generalize well. It is likely that this PRM can only differentiate between low-quality and high-quality outlines generated by Qwen-7B. Outlines produced by other high-performing LLMs or those crafted with careful prompt engineering may be able to bypass or "fool" our PRM.