Update README.md
README.md
@@ -74,7 +74,7 @@ These datasets were combined to form our bilingual training data.
 
 ### 2.2 SFT Training Data
 
-For each novel, we used Qwen2.5-7B-Instruct to generate outline summaries for each chapter independently. Subsequently, we applied Qwen2.5-32B-Instruct to refine these outlines, ensuring smoother and more natural sequencing.
+For each novel, we used Qwen2.5-7B-Instruct to generate outline summaries for each chapter independently. Subsequently, we applied Qwen2.5-32B-Instruct to refine these outlines, ensuring smoother and more natural sequencing. We call the result the ground-truth `outline`.
 
 Additionally, a brief `synopsis` and `characters` are summarized as required for the outline generation tasks.
 
@@ -82,9 +82,9 @@ As a result, we can build an SFT training dataset for LLMs, which also serves as
 
 ### 2.3 PRM Training Data
 
 The training data for the outline-PRM is constructed as follows:
 
-We assume that Qwen2.5-7B-generated outlines under such a simple prompt are **ALWAYS** inferior to
+We assume that Qwen2.5-7B-generated outlines under such a simple prompt are **ALWAYS** inferior to the ground-truth outlines, and can therefore be regarded as **LOW** quality.
 
 Starting from the SFT dataset, we generate rollouts of each outline by providing the same prompt and the preceding ground-truth outlines. Each rollout is prompted to contain a similar number of words to the ground truth, and every rollout is then treated as a negative sample.
 
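The two-stage outline construction described in §2.2 can be sketched as a small pipeline. This is a minimal illustration, not the repository's actual code: `summarize` stands in for a Qwen2.5-7B-Instruct call on a single chapter, and `refine` for the Qwen2.5-32B-Instruct pass over all drafts; both names and the record layout are assumptions.

```python
from typing import Callable, Dict, List

def build_sft_record(
    chapters: List[str],
    summarize: Callable[[str], str],          # hypothetical: 7B call, chapter -> draft outline
    refine: Callable[[List[str]], List[str]], # hypothetical: 32B call, drafts -> smoothed outlines
) -> Dict[str, List[str]]:
    # Stage 1: summarize each chapter independently.
    drafts = [summarize(ch) for ch in chapters]
    # Stage 2: refine the drafts jointly so the sequence reads smoothly;
    # the refined outlines serve as the ground-truth `outline`.
    outlines = refine(drafts)
    return {"chapters": chapters, "outline": outlines}
```

Injecting the model calls as plain callables keeps the sketch runnable with stubs and makes the two-stage structure explicit.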
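The negative-sample construction in §2.3 can likewise be sketched. Again this is an assumed shape, not the project's API: `rollout` stands in for sampling Qwen2.5-7B-Instruct with the same prompt, the preceding ground-truth outlines, and a target word count matching the ground truth; the field names are illustrative.

```python
from typing import Callable, Dict, List

def build_prm_pairs(
    prompt: str,
    gt_outlines: List[str],
    rollout: Callable[[str, List[str], int], str],  # hypothetical: (prompt, prefix, target_words) -> text
) -> List[Dict[str, object]]:
    pairs = []
    for i, gt in enumerate(gt_outlines):
        prefix = gt_outlines[:i]        # preceding ground-truth outlines as context
        target_words = len(gt.split())  # ask for roughly the ground-truth length
        neg = rollout(prompt, prefix, target_words)
        # Under the LOW-quality assumption, every rollout is a negative sample
        # and the ground-truth outline is the positive.
        pairs.append({"prefix": prefix, "positive": gt, "negative": neg})
    return pairs
```

Matching the rollout length to the ground truth keeps the PRM from learning a trivial length cue instead of a quality signal.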