Update README.md
README.md
CHANGED
@@ -197,6 +197,20 @@ Without delving into further reinforcement learning or policy updates, can we di
 ```
 ```

-#### 4.3 Generalization
+#### 4.3 Generalization Concerns

 - Case Study: Format affects the results
+
+## 5. Discussion
+
+Many PRM-related papers are available; [A Roadmap to Reproduce o1](https://arxiv.org/pdf/2412.14135) is a good starting point for understanding the current state of o1 reproduction work.
+
+The main difference between a PRM for o1-like models and the PRM in this project is that this project has no reasoning process at all. Each step is defined directly as one line of the final output, rather than as a line of chain-of-thought (CoT).
+
+This difference arises because step-wise reward signals for reasoning in creative writing are inherently hard to obtain: the judgments are often ambiguous and subjective. Annotators may struggle to decide whether a particular reasoning step in the creative process is good or bad. Unlike math problems, where correctness is well-defined, creative writing is **open**-ended: many different paths are valid, and each leads to its own acceptable outcome, in the spirit of "all roads lead to Rome."
+
+## 6. Conclusion
+
+This project offers minimal hands-on experience with a PRM in one specific domain. It is far from perfect in terms of training data, model design, evaluation, and insights.
+
+
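The Discussion text in this change defines a step as each line of the final output rather than a CoT line. The following is a minimal Python sketch of that step definition, not the project's implementation. It rests on several assumptions: blank lines are skipped, each line is scored given the prompt and the lines before it, and the names `split_into_steps`, `score_steps`, and `toy_reward` are hypothetical. `toy_reward` is only a placeholder standing in for a trained PRM.

```python
from typing import Callable, List, Tuple


def split_into_steps(text: str) -> List[str]:
    """Treat each non-empty line of the final output as one step (assumption)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def score_steps(
    prompt: str,
    text: str,
    reward_fn: Callable[[str, List[str], str], float],
) -> List[Tuple[str, float]]:
    """Score every line given the prompt and the lines that precede it."""
    steps = split_into_steps(text)
    scored = []
    for i, step in enumerate(steps):
        # steps[:i] is the context the (hypothetical) reward model sees.
        scored.append((step, reward_fn(prompt, steps[:i], step)))
    return scored


def toy_reward(prompt: str, previous: List[str], step: str) -> float:
    """Hypothetical stand-in for a trained PRM: penalize exact repeated lines."""
    return 0.0 if step in previous else 1.0


poem = "The moon is bright\n\nThe moon is bright\nFrost on the ground"
for step, reward in score_steps("Write a short poem", poem, toy_reward):
    print(f"{reward:.1f}  {step}")
```

In practice a trained model would replace `toy_reward`; the line-level splitting is the only part taken from the README text.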