---
license: apache-2.0
language:
  - zh
  - en
base_model:
  - Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: token-classification
library_name: transformers
tags:
  - novel-writing
  - PRM
  - outline
---

# PRM for Simplistic Novel Outline Generation

This is a small project driven by personal interest, focused on developing a Process-Level Reward Model (PRM) for a specific task: generating outlines for novels.

The aim is to explore how PRMs can provide quality signals for the process of structured outline creation.

## 1. Task Definition

### 1.1 Novel Outline Generation

In practice, creating a novel outline typically involves a far more complex reflective process.

However, for the purposes of this experiment, the task is simplified as follows:

  • Given a story idea and character designs, generate outlines for the first n chapters (n can range from 1 to 10, as for the construction of the training data).

Below are the system prompt templates (English and Chinese) used for training data construction:

  • English

  • Chinese

### 1.2 PRM Definition

A PRM is designed to provide process-level reward signals for generation tasks. In this context, each process or step refers specifically to a one-line outline representing a single chapter of a novel.

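Since the model exposes a token-classification head, it can be queried directly with `transformers`. The sketch below is only an illustration of one plausible interface, not this repo's documented API: it assumes that the reward for a step is read as the probability of a "good" label at the newline token closing each outline line, and the repo id, prompt layout, and label convention are all placeholders.

```python
# Minimal sketch: score each outline line with a token-classification PRM.
# Assumptions (not documented here): label index 1 = "good step", and the
# step score is read at the "\n" token that terminates each outline line.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "path/to/this-prm-checkpoint"  # placeholder: use the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()

prompt = "<system prompt + story idea + character designs>"  # formatted as in training
outline = "Chapter 1: ...\nChapter 2: ...\n"

enc = tokenizer(prompt + outline, return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    logits = model(**enc).logits                 # (1, seq_len, num_labels)
probs = logits.softmax(-1)

# Read scores at newline positions, skipping any newlines inside the prompt.
newline_id = tokenizer("\n", add_special_tokens=False).input_ids[-1]
prompt_len = len(tokenizer(prompt, add_special_tokens=False).input_ids)
step_positions = [
    i for i in range(prompt_len, enc.input_ids.shape[1])
    if enc.input_ids[0, i].item() == newline_id
]
step_rewards = probs[0, step_positions, 1]
print(step_rewards)                              # one reward per outline line
```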

## 2. Training Data

### 2.1 Preparation

We collected data from two sources:

  • η•ͺθŒ„ε°θ―΄ (Chinese dataset): ~1k novels, limited to the first several chapters.
  • GoodNovel (English dataset): ~3k novels, limited to the first several chapters.

These datasets were combined to form our bilingual training data.

### 2.2 SFT Training Data

For each novel, we used Qwen2.5-7B-Instruct to generate outline summaries for each chapter independently. Subsequently, we applied Qwen2.5-32B-Instruct to refine these outlines, ensuring smoother and more natural sequencing.

In addition, a brief synopsis and character descriptions are summarized, as required by the outline generation task.

As a result, we can build an SFT training dataset for LLMs, which also serves as the foundation for creating the PRM training dataset.
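For illustration, one SFT example might be laid out roughly as follows; the field names below are hypothetical and are not the dataset's actual schema:

```python
# Hypothetical layout of a single SFT example (illustrative field names only):
# the model is prompted with the synopsis and character designs and trained to
# produce the outlines of the first n chapters, one line per chapter.
sft_example = {
    "synopsis": "A one-paragraph story idea summarized from the novel.",
    "characters": ["Protagonist: ...", "Antagonist: ..."],
    "n_chapters": 3,
    "outline": [
        "Chapter 1: ...",
        "Chapter 2: ...",
        "Chapter 3: ...",
    ],
}
```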

### 2.3 PRM Training Data

The training data for the outline-PRM is constructed as follows:

- We assume that outlines generated by Qwen2.5-7B-Instruct under such a simple prompt are ALWAYS inferior to human-derived ones, and can therefore be regarded as LOW quality.
- Starting from the SFT dataset, we generate rollouts of each outline step by providing the same prompt together with the preceding ground-truth outlines.
- Each ground-truth outline line serves as a positive sample, while each rollout is treated as a negative sample.

This one-to-one pairing of positives and negatives ensures a balanced distribution of labels.
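Under those assumptions, the pair construction can be sketched as follows; `rollout_fn` is a hypothetical helper that samples one outline line from Qwen2.5-7B-Instruct given the prompt and the accepted prefix:

```python
# Sketch of PRM pair construction from one SFT example (inferred from the
# description above; `rollout_fn` is a hypothetical sampling helper).
def build_prm_examples(sft_example, rollout_fn):
    examples = []
    prefix = []  # ground-truth outline lines accepted so far
    for gold_line in sft_example["outline"]:
        # Positive: the refined, human-derived outline line.
        examples.append({"context": list(prefix), "step": gold_line, "label": 1})
        # Negative: a rollout from the weaker model under the same prompt
        # and the same ground-truth prefix.
        sampled_line = rollout_fn(sft_example, prefix)
        examples.append({"context": list(prefix), "step": sampled_line, "label": 0})
        prefix.append(gold_line)  # always continue from the ground truth
    return examples
```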

## 3. Model Training

We trained 2 models on the above dataset:

## 4. Performance Evaluation

### 4.1 Accuracy Metric

  • Case Study

### 4.2 LLM Sampling with PRM

Without delving into further reinforcement learning or policy updates, can we directly apply the PRM to our LLMs? The answer is YES!

#### 4.2.1 Test-Time Scaling
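The exact procedure is not spelled out here, but a common instantiation of test-time scaling with a PRM is best-of-n selection: sample several complete outlines and keep the one with the highest aggregated step rewards. A minimal sketch, with `generate_outline` (an LLM sampler) and `score_steps` (a PRM scorer such as the one sketched in 1.2) as hypothetical helpers:

```python
# Hedged best-of-n sketch: rank n sampled outlines by mean PRM step reward.
def best_of_n(prompt, n=8):
    best_score, best_outline = float("-inf"), None
    for _ in range(n):
        outline = generate_outline(prompt)      # sample one full candidate outline
        rewards = score_steps(prompt, outline)  # one PRM reward per outline line
        score = sum(rewards) / len(rewards)     # mean step reward as the ranking score
        if score > best_score:
            best_score, best_outline = score, outline
    return best_outline
```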

#### 4.2.2 Sequential Rejection Sampling
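As a sketch of one plausible reading of sequential rejection sampling (the threshold and retry budget below are assumptions): grow the outline one line at a time and resample any line whose PRM reward falls below a threshold. `generate_next_line` and `score_step` are hypothetical helpers.

```python
# Hedged sketch of sequential rejection sampling with a step-level PRM.
def sequential_rejection_sampling(prompt, n_chapters, threshold=0.5, max_tries=8):
    outline = []
    for _ in range(n_chapters):
        best_line, best_reward = None, float("-inf")
        for _ in range(max_tries):
            line = generate_next_line(prompt, outline)   # condition on accepted lines
            reward = score_step(prompt, outline, line)   # PRM reward for this step
            if reward >= threshold:
                best_line, best_reward = line, reward
                break                                    # accept the first passing line
            if reward > best_reward:                     # remember the best rejected line
                best_line, best_reward = line, reward
        outline.append(best_line)                        # fall back to best attempt
    return outline
```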

  • Case Study