---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: token-classification
library_name: transformers
tags:
- novel-writing
- PRM
- outline
---

# PRM for Simplistic Novel Outline Generation

This is a small project driven by personal interest, focused on developing a Process-Level Reward Model (PRM) for a specific task: generating outlines for novels. The aim is to explore how PRMs can provide quality signals during structured outline creation.

## 1. Task Definition

### 1.1 Novel Outline Generation

In practice, creating a novel outline involves a far more complex reflective process. For the purposes of this experiment, however, the task is simplified as follows:

- Given a `story idea` and `character designs`, generate `outlines` for the first `n` chapters (`n` ranges from 1 to 10 in the training-data construction).

Below are the system prompt templates used for training-data construction:

- English

```
```

- Chinese

```
```

### 1.2 PRM Definition

A PRM provides process-level reward signals for generation tasks. In this context, each process step is a one-line outline representing a single chapter of the novel.

## 2. Training Data

### 2.1 Preparation

We collected data from two sources:

- 番茄小说 (Fanqie Novel, Chinese dataset): ~1k novels, limited to the first several chapters.
- GoodNovel (English dataset): ~3k novels, limited to the first several chapters.

These datasets were combined to form our bilingual training data.

### 2.2 SFT Training Data

For each novel, we used Qwen2.5-7B-Instruct to generate an outline summary for each chapter independently. We then applied Qwen2.5-32B-Instruct to refine these outlines so that consecutive chapters read more smoothly and naturally. In addition, a brief `synopsis` and the `characters` were summarized, as required by the outline generation task.

The result is an SFT training dataset for the LLM, which also serves as the foundation for the PRM training dataset.

### 2.3 PRM Training Data

The training data for the outline PRM is constructed as follows:

We assume that outlines generated by Qwen2.5-7B-Instruct under such a simple prompt are **ALWAYS** inferior to the ground-truth outlines derived from human-written chapters and can be regarded as **LOW** quality. Starting from the SFT dataset, we generate rollouts of each outline step by providing the same prompt together with the preceding ground-truth outlines. Each ground-truth step is labeled as a positive sample, while each rollout is treated as a negative sample; this keeps the distribution of positive and negative labels balanced.

## 3. Model Training

We trained two models on the above dataset:

- NovelWriting-Outline-Qwen2.5-7B-Instruct: the SFT LLM, trained with [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
- [NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward](https://huggingface.co/mrzjy/NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward): the PRM for the outline generation task, trained with the TRL library ([PRM Trainer documentation](https://huggingface.co/docs/trl/prm_trainer)).

## 4. Performance Evaluation

### 4.1 Accuracy Metric

- Case Study

```
```

### 4.2 LLM Sampling with PRM

Without delving into further reinforcement learning or policy updates, can we apply the PRM directly with our LLM? The answer is YES!

#### 4.2.1 Test-Time Scaling

#### 4.2.2 Sequential Rejection Sampling

- Case Study

```
```
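As a concrete illustration of how the PRM can be queried at inference time (Section 4.2), here is a minimal sketch in Python. It assumes the PRM was trained with TRL's default `"\n"` step separator and that label index 1 of the token-classification head is the positive ("good step") class; both assumptions, as well as the exact prompt formatting, should be checked against the actual training configuration.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_ID = "mrzjy/NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
model.eval()

SEPARATOR = "\n"  # assumed step separator between one-line chapter outlines


def score_outline_steps(prompt: str, steps: list[str]) -> list[float]:
    """Return one reward (positive-class probability) per one-line outline step."""
    # Tokenize the prompt and each step separately so we know exactly which
    # token position closes each step.
    input_ids = tokenizer(prompt + SEPARATOR, add_special_tokens=False)["input_ids"]
    step_end_positions = []
    for step in steps:
        step_ids = tokenizer(step + SEPARATOR, add_special_tokens=False)["input_ids"]
        input_ids += step_ids
        step_end_positions.append(len(input_ids) - 1)  # separator token of this step

    with torch.no_grad():
        logits = model(input_ids=torch.tensor([input_ids])).logits[0]  # (seq_len, num_labels)

    # Assumption: label index 1 corresponds to a "good step".
    probs = torch.softmax(logits[step_end_positions], dim=-1)[:, 1]
    return probs.tolist()
```

For meaningful scores, `prompt` should be formatted the same way as the prompts used to build the PRM training data (story idea, character designs, and the instruction to produce one outline line per chapter).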
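Building on the scoring helper above, the following sketch shows one simple way to realize the step-wise sampling of Section 4.2.2: sample `k` candidate outlines for the next chapter, score each with the PRM given the already accepted prefix, and keep the best one. `generate_candidates` is a hypothetical stand-in for sampling from the SFT LLM and is not part of this repository; a stricter variant would resample until a candidate clears a score threshold, which is closer to literal rejection sampling.

```python
def sample_outline_with_prm(prompt: str, num_chapters: int, k: int = 8) -> list[str]:
    """Best-of-k per chapter: one simple instantiation of sequential rejection sampling."""
    accepted: list[str] = []
    for _ in range(num_chapters):
        # Hypothetical helper: sample k candidate one-line outlines for the next
        # chapter from the SFT LLM, conditioned on the prompt and accepted prefix.
        candidates = generate_candidates(prompt, accepted, k)
        # Score each candidate as the latest step appended to the accepted prefix.
        scores = [score_outline_steps(prompt, accepted + [c])[-1] for c in candidates]
        accepted.append(candidates[scores.index(max(scores))])
    return accepted
```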