mrzjy
/

NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward


mrzjy committed 3 days ago

Commit

990485d

·

verified ·

1 Parent(s): f790a9e

Update README.md

Files changed (1)

README.md +15 -1

README.md CHANGED

@@ -197,6 +197,20 @@ Without delving into further reinforcement learning or policy updates, can we di
 ```
 ```
-#### 4.3 Generalization Issues
 - Case Study: Format affects the results

 ```
 ```
+#### 4.3 Generalization Concerns
 - Case Study: Format affects the results
+## 5. Discussion
+There are many PRM-related papers to refer to; [A Roadmap to Reproduce o1](https://arxiv.org/pdf/2412.14135) is a good starting point for understanding the current state of o1 reproduction work.
+The main difference between a PRM for o1-like models and the PRM in this project is that this project involves no reasoning process at all. Each step is defined directly as one line of the final result (rather than a line of CoT).
+This difference arises because obtaining step-wise reward signals for reasoning in creative writing is inherently challenging: the judgments are frequently ambiguous and subjective. Annotators may struggle to determine whether a particular reasoning step in the creative process is good or bad. Unlike math problems, where correctness is well-defined, creative writing allows for **open** valid paths, each leading to a unique outcome; there is no single correct route, much as "all roads lead to Rome."
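To make the step definition above concrete, here is a minimal sketch of treating each non-empty line of a final outline as one step and pairing it with a per-step score. The function names (`split_into_steps`, `pair_steps_with_scores`), the sample outline, and the scores are all hypothetical placeholders for illustration; they are not taken from this project's code, and the scores are not model outputs.

```python
from typing import List, Tuple


def split_into_steps(outline: str) -> List[str]:
    """Treat every non-empty line of the final outline as one step.

    Assumption: blank lines carry no content and are skipped.
    """
    return [line.strip() for line in outline.splitlines() if line.strip()]


def pair_steps_with_scores(
    steps: List[str], scores: List[float]
) -> List[Tuple[str, float]]:
    """Attach one reward score to each step (one score per outline line)."""
    if len(steps) != len(scores):
        raise ValueError("expected exactly one score per step")
    return list(zip(steps, scores))


# Hypothetical outline; each line is a step, with no chain-of-thought involved.
outline = """Chapter 1: The heroine leaves her village.

Chapter 2: She meets a rival on the road.
Chapter 3: The two form an uneasy alliance."""

steps = split_into_steps(outline)

# Placeholder scores standing in for whatever a PRM would assign per line.
placeholder_scores = [0.9, 0.4, 0.7]

for step, score in pair_steps_with_scores(steps, placeholder_scores):
    print(f"{score:.1f}  {step}")
```

Under this framing, a real PRM would replace `placeholder_scores` with one predicted value per line; how those values are produced is outside the scope of this sketch.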
+## 6. Conclusion
+This project provides some minimal hands-on experience with PRMs in a specific domain. It is far from perfect in terms of training data, model design, evaluation, and insights.