mrzjy committed · Commit 9df8bae · verified · Parent(s): 11676a6

Update README.md

Files changed (1): README.md +10 -0
README.md CHANGED
@@ -412,6 +412,16 @@ The "Test-Time Scaling Performance" is visualized as follows:
 
 ![test_time_scaling.png](image%2Ftest_time_scaling.png)
 
+- Human Evaluation
+
+  Despite the small scale of the experiment, we presented the generation results to an expert writer friend and gathered qualitative feedback. The models compared were:
+
+  - Model 1: SFT model with top-p sampling
+  - Model 2: SFT model with sequential rejection sampling (size = 4)
+  - Model 3: "Ground-truth" outlines summarized from real novels
+
+  The expert preferred both Model 2 and Model 3. Specifically, Model 2 performed better in understanding the logic of time/world traversal and in shaping the protagonist and main storyline, while Model 3 excelled in narrative techniques for depicting love triangles. (Note that these preferences may differ from those of typical readers.)
+
 ## 5. Limitation
 
 - Since this PRM has a relatively small size of 0.5B, we do **not** expect it to generalize well. It is likely that this PRM can only differentiate between low-quality and high-quality outlines generated by Qwen-7B. Outlines produced by other high-performing LLMs or those crafted with careful prompt engineering may be able to bypass or "fool" our PRM.
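
The "sequential rejection sampling (size = 4)" used for Model 2 is not spelled out in this commit. Below is a minimal sketch of one plausible reading, assuming the SFT model draws top-p samples one at a time and the 0.5B PRM accepts or rejects each candidate; `generate_outline`, `prm_score`, and the `threshold` value are hypothetical names and parameters introduced here for illustration, not code from this repository.

```python
# Hypothetical sketch of "sequential rejection sampling (size = 4)".
# Assumptions (not from this commit): generate_outline() draws one top-p
# sample from the SFT model, prm_score() returns a scalar quality score
# from the 0.5B PRM, and scores below `threshold` count as rejections.

def sequential_rejection_sampling(prompt, generate_outline, prm_score,
                                  size=4, threshold=0.5):
    """Draw up to `size` candidates one at a time; return the first one
    the PRM accepts, falling back to the best-scoring rejected one."""
    best_outline, best_score = None, float("-inf")
    for _ in range(size):
        outline = generate_outline(prompt)   # one top-p sample
        score = prm_score(prompt, outline)   # PRM quality estimate
        if score >= threshold:
            return outline                   # accepted: stop sampling
        if score > best_score:               # track the best rejection
            best_outline, best_score = outline, score
    return best_outline                      # all rejected: best effort
```

Under this reading, the sequential variant differs from plain best-of-4 reranking in that it can stop at the first accepted sample, trading a little selection quality for lower average latency.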