---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: token-classification
library_name: transformers
tags:
- novel-writing
- PRM
- outline
---
# PRM for Simplistic Novel Outline Generation
This is a small project driven by personal interest, focused on developing a Process-Level Reward Model (PRM) for a specific task: generating outlines for novels.
The aim is to explore how PRMs can provide quality signals for the process of structured outline creation.
## 1. Task Definition
### 1.1 Novel Outline Generation
In practice, creating a novel outline typically involves a far more complex reflective process.
However, for the purposes of this experiment, the task is simplified as follows:
- Given a **story idea** and **character designs**, generate **outlines** for the first *n* chapters (*n* ranges from 1 to 10, matching how the training data is constructed).
Training data construction uses a system prompt template, provided in both English and Chinese variants.
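For illustration, here is a minimal sketch of how such a prompt might be assembled into chat messages. The system-prompt wording, the `build_messages` helper, and the input fields are hypothetical, not the exact template used for training:

```python
# Hypothetical sketch: assembling the outline-generation task as chat messages.
# The system prompt wording and field layout are illustrative only; they are
# NOT the exact templates used to build the training data.

def build_messages(story_idea: str, character_designs: str, n_chapters: int) -> list[dict]:
    system_prompt = (
        "You are a novelist's assistant. Given a story idea and character "
        f"designs, write a one-line outline for each of the first {n_chapters} "
        "chapters, one chapter per line."
    )
    user_content = (
        f"Story idea:\n{story_idea}\n\nCharacter designs:\n{character_designs}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]

messages = build_messages(
    "A lighthouse keeper's daughter finds a message in a bottle.",
    "Mira: stubborn, curious. Tomas: her estranged brother.",
    n_chapters=3,
)
```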
### 1.2 PRM Definition
A PRM is designed to provide process-level reward signals for generation tasks. In this context, each step is a one-line outline representing a single chapter of the novel.
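Since this card declares a token-classification pipeline, scoring might look like the sketch below. It assumes (this is an assumption, not documented behavior) that the PRM follows TRL's `PRMTrainer` convention of predicting a quality label at each step-separator token (`"\n"` by default); adjust if the actual training used a different separator.

```python
# Minimal scoring sketch, assuming each outline line is scored at its trailing
# "\n" separator (TRL PRMTrainer convention).
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward"  # prepend the owner namespace
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

prompt = "Story idea and character designs go here."
outlines = [
    "Chapter 1: Mira finds a sealed letter in the lighthouse archive.",
    "Chapter 2: The letter points to a shipwreck her father never mentioned.",
]

# Tokenize the prompt and steps separately so only step separators are scored.
ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
prompt_len = len(ids)
for line in outlines:
    ids += tokenizer(line + "\n", add_special_tokens=False)["input_ids"]

input_ids = torch.tensor([ids])
with torch.no_grad():
    logits = model(input_ids=input_ids).logits  # (1, seq_len, num_labels)

sep_id = tokenizer.encode("\n", add_special_tokens=False)[-1]
is_sep = input_ids[0] == sep_id
is_sep[:prompt_len] = False  # ignore any separators inside the prompt itself
step_scores = logits[0, is_sep].softmax(dim=-1)[:, 1]  # P(step is good)
for line, score in zip(outlines, step_scores.tolist()):
    print(f"{score:.3f}  {line}")
```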
## 2. Training Data
### 2.1 Preparation
We collected data from two sources:
- 番茄小说 (Fanqie Novel; Chinese dataset): ~1k novels, limited to the first several chapters.
- GoodNovel (English dataset): ~3k novels, limited to the first several chapters.
These datasets were combined to form our bilingual training data.
### 2.2 SFT Training Data
For each novel, we used Qwen2.5-7B-Instruct to generate outline summaries for each chapter independently. Subsequently, we applied Qwen2.5-32B-Instruct to refine these outlines, ensuring smoother and more natural sequencing.
Additionally, a brief **synopsis** and the **character designs** are summarized from each novel, since the outline generation task requires them as inputs.
As a result, we can build an SFT training dataset for LLMs, which also serves as the foundation for creating the PRM training dataset.
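For concreteness, one SFT example might look like the following sketch. The field names are illustrative, not the released data schema:

```python
# Illustrative shape of one SFT example (field names are hypothetical,
# not the released data schema).
sft_example = {
    "synopsis": "A lighthouse keeper's daughter untangles her father's past.",
    "characters": "Mira: stubborn, curious. Tomas: her estranged brother.",
    "n_chapters": 3,
    "outlines": [  # one refined, one-line outline per chapter
        "Chapter 1: Mira finds a sealed letter in the lighthouse archive.",
        "Chapter 2: The letter points to a shipwreck her father never mentioned.",
        "Chapter 3: Tomas returns, carrying the other half of the story.",
    ],
}
```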
### 2.3 PRM Training Data
The training data for the outline-PRM is constructed as follows (a construction sketch follows this list):
- We assume that outlines generated by Qwen2.5-7B-Instruct under such a simple prompt are **always** inferior to the human-derived ones, and can therefore be regarded as **low** quality.
- Starting from the SFT dataset, we generate rollouts of each outline by providing the same prompt together with the preceding ground-truth outlines.
- Each rollout is treated as a negative sample, while the human-derived outline at the same position serves as a positive sample.
- This yields a balanced distribution of positive and negative labels.
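A sketch of this construction, using TRL's stepwise-supervision layout (`prompt` / `completions` / `labels`); `generate_outline(prompt, context)` is a hypothetical stand-in for sampling Qwen2.5-7B-Instruct:

```python
# Sketch of PRM data construction under the assumptions above: the full
# human-derived trajectory is all-positive, and for every chapter position i
# a fresh 7B rollout (given the ground-truth prefix) is the negative step.
# `generate_outline` is a hypothetical stand-in for sampling the 7B model.

def build_prm_examples(prompt, gt_outlines, generate_outline):
    examples = []
    # Full human-derived trajectory: every step labeled positive.
    examples.append({
        "prompt": prompt,
        "completions": list(gt_outlines),
        "labels": [True] * len(gt_outlines),
    })
    # One negative rollout per chapter position.
    for i in range(len(gt_outlines)):
        context = gt_outlines[:i]                 # preceding ground-truth outlines
        rollout = generate_outline(prompt, context)
        examples.append({
            "prompt": prompt,
            "completions": context + [rollout],   # prefix is good, rollout is not
            "labels": [True] * i + [False],
        })
    return examples
```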
## 3. Model Training
We trained two models on the datasets above (a minimal PRM training sketch follows the list):
- NovelWriting-Outline-Qwen2.5-7B-Instruct: the SFT LLM, trained with LLaMA-Factory.
- NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward: the PRM for the outline generation task, trained with the TRL library (refer to the TRL documentation).
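The sketch below uses TRL's `PRMTrainer` with the stepwise-supervision dataset format; the toy dataset and hyperparameters are placeholders, not the actual training configuration:

```python
# Minimal PRM training sketch with TRL's PRMTrainer. The toy dataset and
# hyperparameters are placeholders, not the actual training configuration.
from datasets import Dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

base = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=2)

# Stepwise-supervision format: one boolean label per outline step.
train_dataset = Dataset.from_dict({
    "prompt": ["Story idea and character designs go here."],
    "completions": [[
        "Chapter 1: Mira finds a sealed letter in the lighthouse archive.",
        "Chapter 2: A generic filler chapter with no plot progression.",
    ]],
    "labels": [[True, False]],
})

training_args = PRMConfig(output_dir="outline-prm", per_device_train_batch_size=2)
trainer = PRMTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```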
## 4. Performance Evaluation
### 4.1 Accuracy Metric
- Case Study
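A plausible instantiation of this metric, assumed here for illustration, is step-level accuracy: the fraction of steps whose predicted label matches the reference label.

```python
# Hypothetical metric sketch: step-level accuracy, i.e. how often the PRM's
# thresholded prediction (P(good) > 0.5) matches the reference label.
def step_accuracy(pred_probs: list[float], ref_labels: list[bool]) -> float:
    hits = sum((p > 0.5) == y for p, y in zip(pred_probs, ref_labels))
    return hits / len(ref_labels)
```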
### 4.2 LLM Sampling with PRM
Without delving into reinforcement learning or policy updates, can we directly apply the PRM to our LLM's outputs at inference time? The answer is YES!
#### 4.2.1 Test-Time Scaling
#### 4.2.2 Sequential Rejection Sampling
- Case Study
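Both strategies use PRM scores to select among sampled candidates. The sketch below illustrates one way the sequential variant could work: sample several candidates for the next chapter outline, score each with the PRM, keep the best, and resample if every candidate falls below a threshold. `sample_next_outline` (one LLM call) and `score_step` (PRM scoring, as in Section 1.2) are hypothetical stand-ins, not released APIs:

```python
# Hedged sketch of sequential rejection sampling with a PRM.
# `sample_next_outline(prompt, accepted)` draws one candidate next-chapter
# outline from the LLM; `score_step(prompt, accepted, cand)` returns the PRM
# score of that candidate given the accepted prefix. Both are hypothetical.

def sequential_rejection_sampling(prompt, n_chapters, sample_next_outline,
                                  score_step, k=4, threshold=0.5, max_retries=3):
    accepted = []
    for _ in range(n_chapters):
        best, best_score = None, float("-inf")
        for _ in range(max_retries):
            # Draw k candidates for the next chapter; keep the best-scoring one.
            for _ in range(k):
                cand = sample_next_outline(prompt, accepted)
                score = score_step(prompt, accepted, cand)
                if score > best_score:
                    best, best_score = cand, score
            if best_score >= threshold:
                break  # good enough; stop re-sampling this chapter
        accepted.append(best)  # fall back to the best seen if threshold unmet
    return accepted
```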