mrzjy committed on
Commit 990485d · verified · 1 Parent(s): f790a9e

Update README.md

Files changed (1)
  1. README.md +15 -1
README.md CHANGED
@@ -197,6 +197,20 @@ Without delving into further reinforcement learning or policy updates, can we di
  ```
  ```
 
- #### 4.3 Generalization Issues
+ #### 4.3 Generalization Concerns
 
  - Case Study: Format affects the results
+
+ ## 5. Discussion
+
+ There are many PRM-related papers to consult; [A Roadmap to Reproduce o1](https://arxiv.org/pdf/2412.14135) is a good starting point for understanding the current state of o1 reproduction efforts.
+
+ The main difference between a PRM for o1-like models and the PRM in this project is that this project involves no reasoning process at all. Each step is defined directly as one line of the final output, rather than one line of a chain of thought (CoT).
+
+ This difference arises because step-wise reward signals for reasoning in creative writing are inherently hard to obtain: the judgments are often ambiguous and subjective. Annotators may struggle to decide whether a particular reasoning step in the creative process is good or bad. Unlike math problems, where correctness is well-defined, creative writing admits an **open** set of valid paths, each leading to its own acceptable outcome; in that sense, "all roads lead to Rome."
+
+ ## 6. Conclusion
+
+ This project offers only minimal hands-on experience with PRMs in one specific domain. It is far from perfect in terms of training data, model design, evaluation, and insights.
+
+