license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: token-classification
library_name: transformers
tags:
- novel-writing
- PRM
- outline
PRM for Novel Outline Generation
This is a small project driven by personal interest, focused on developing a Process-Level Reward Model (PRM) for a specific and simplified task: generating outlines for novels (in both Chinese and English).
The goal is to explore how PRMs can directly provide quality signals for the final stage of creative writing, without relying on CoT steps.
Use Case
PRM Scoring
Evaluate the generated outlines by providing scores.
Sequential Rejection Sampling
Perform rejection sampling at each step until the outline is complete.
1. Task Definition
1.1 Novel Outline Generation
In practice, creating a novel outline typically involves a far more complex reflective process.
However, for the sake of simplicity in this experiment, the task is simplified as follows:
- Given a brief story idea and character designs, generate outlines for the first n chapters (where n can range from 1 to 10).
Below is a system prompt template used for training data construction:
- English
Act as a novel writer. Your task is to craft novel outlines based on the following story idea:
{story_idea}
Here's your character design:
{character}
Please create the story outlines for the first {n} chapters, with each chapter outline in {n_word} words.
- Chinese
你是一位专业小说写手。你的任务是基于以下故事灵感进行小说大纲创作。
{story_idea}
以下是你设计的角色:
{character}
请基于以上信息为小说前{n}章设计故事大纲,每个大纲大概在{n_word}字左右。
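As a minimal sketch, the English template above could be filled in as follows. The helper name `build_prompt` and the exact blank-line layout are assumptions (the layout is inferred from the example prompts shown later in this card), not the author's actual code.

```python
# Hypothetical helper; the template text is copied from the card,
# the newline layout is an assumption based on the example prompts.
EN_TEMPLATE = (
    "Act as a novel writer. Your task is to craft novel outlines "
    "based on the following story idea:\n"
    "{story_idea}\n\n"
    "Here's your character design:\n"
    "{character}\n\n"
    "Please create the story outlines for the first {n} chapters, "
    "with each chapter outline in {n_word} words."
)


def build_prompt(story_idea, character, n, n_word, template=EN_TEMPLATE):
    """Fill the system prompt template for an outline generation request."""
    # The task definition limits n to the range 1..10.
    if not 1 <= n <= 10:
        raise ValueError("n is expected to be between 1 and 10")
    return template.format(
        story_idea=story_idea, character=character, n=n, n_word=n_word
    )
```

The Chinese template can be passed through the same helper via the `template` argument, since it uses the same four placeholders.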
1.2 PRM Definition
A PRM is designed to provide process-level reward signals for generation tasks, and it is mainly used to guide the reasoning steps when training O1-like models.
In our context, however, each process or step refers specifically to a one-line outline representing a single chapter of a novel. There are no CoT steps in this task.
2. Training Data
2.1 Preparation
We collected data from two sources:
- 番茄小说 (Chinese dataset): ~1k novels, limited to the first several chapters.
- GoodNovel (English dataset): ~3k novels, limited to the first several chapters.
These datasets were combined to form our bilingual training data.
2.2 SFT Training Data
For each novel, we used Qwen2.5-7B-Instruct to generate outline summaries for each chapter independently. Subsequently, we applied Qwen2.5-32B-Instruct to refine these outlines, ensuring smoother and more natural sequencing. We call it the ground-truth outline.
Additionally, a brief synopsis and characters are summarized as required for the outline generation tasks.
As a result, we can build an SFT training dataset for LLMs, which also serves as the foundation for creating the PRM training dataset.
- Example assistant responses:
第1章:陈玄意外穿越至另一个时空,继承了一座古旧道观,并发现自己拥有堪舆算命的超凡技能。他在直播算命时遭遇了一系列质疑,但当他展示付款码后,人气迅速飙升,引起了包括九家军在内的众多观众的注意。
第2章:在直播中,陈玄为阴九进行了算命。起初,他的预测遭到嘲笑和质疑。陈玄直接指出阴九将有大凶之兆,这一预言引发了激烈的争议。最终,陈玄凭借问及阴九奶奶的具体去世时间和病因,揭露了事实,证明了自己算命的准确性。
第3章:阴九在直播中质问陈玄,得知家人的安危后,他慌乱中不断打赏。陈玄揭示了阴九家人的凶兆,阴九接到妻子的电话,得知女儿病危。在绝望中,他跪地求救,但陈玄只能告知他已经发生,并警告阴九。
第4章:陈玄通过直播为阴九解难,揭示其面临的危机源自何处,并指导其进行一系列仪式以渡过劫难。直播间观众因为参与怂恿而面临巨大的阴德损失威胁。
第5章:陈玄通过直播算命帮助阴九等人解决了问题,面对大量粉丝的求助请求,他通过发放福袋的方式筛选出第一位求助者杨晨。陈玄成功诊断其为中蛊状态,并提供了解决方案。
第6章:杨晨因梦遗问题向陈玄求助,他描述了梦中被高大披头散发男人折磨的情景。陈玄通过观察宿舍布局和询问室友关系,判断杨晨中了西川巴蜀秘蛊,并指出是室友王浩所为,这一诊断引起了直播间观众的轩然大波。
第7章:杨晨在陈玄的帮助下揭露了室友王浩下蛊的事实,通过直播揭开了真相。王浩试图掩盖事实,但最终真相大白于天下。
第8章:杨晨发现王浩对自己下了蛊,通过陈玄的帮助成功驱除蛊毒,并揭露了王浩的恶行。陈玄通过占卜确认王浩将变成智障,杨晨感激之余继续向陈玄求助。
第9章:陈玄直播解答张伟的梦遗问题,揭示其为仙家后裔纠缠所致,这一解释引起了直播间观众的震惊与好奇。
Chapter 1: Lily Christian, battling a headache and insomnia, overhears Nathaniel and his fiancé Melanie discussing an affair. Devastated, she recalls their years together and Nathaniel's betrayal through his business partnership with Melanie. Seeking validation, Lily receives an unexpected call from Alexander Russell, CEO of La Beauté Group, offering a meeting to discuss a business proposal. Rushing to the café, she boards a limousine, unsure of Russell's intentions.
Chapter 2: Lily arrives at the clerk's office, expecting to discuss a business proposal with Alexander Russell. Instead, he proposes marriage. Reluctantly, she accepts, and they quickly get married. Alexander directs her to pass on perfume information to Edward and schedules a meeting at La Beauté Group. Back at MN Inc., Lily encounters Nathaniel's secretary, Anthony, who informs her that Nathaniel is looking for her. In Nathaniel's office, she overhears his angry outburst at his assistant, Olivia, for not knowing her whereabouts.
Chapter 3: At MN Inc., Nathaniel and Melanie are agitated over missing documents. Nathaniel accuses Lily of being absent from the lab, but she explains she was preparing for a competition. Melanie reveals Lily's past reluctance to participate in such events. Nathaniel checks the documents in a bag Lily holds, and they discuss the upcoming talent competition. Nathaniel insists Lily won't participate, but Lily feels betrayed. She calls Olivia, her assistant, who reports MN Inc. is well-prepared. At La Beauté Group, Edward briefs her on the situation. Alexander notices Lily's injury and lifts her, eliciting a mix of concern and tension.
Chapter 4: Alexander tends to Lily's wound, showing a new level of care she has never seen from Nathaniel. At La Beauté Group, Lily watches Nathaniel and Melanie's confident performance, feeling a mix of resentment and determination. During the competition, the host reveals a scandal involving identical perfumes from MN Inc. and Rebirth, potentially implicating MN Inc. in plagiarism. Lily's resolve hardens as she realizes her past work could be jeopardized.
Chapter 5: The competition host announces a delay in awarding results due to identical perfumes submitted by MN Inc. and Rebirth. Nathaniel protests the postponement, while Melanie eagerly speculates about the other company. The host reveals both companies are suspected of plagiarism, and Rebirth’s representative confirms submission data. Nathaniel asserts that Mel is the sole creator of First Love, but the host asks Rebirth’s perfumer to step forward, undermining Nathaniel’s claim. Alexander watches, impressed by Lily’s growing confidence and determination.
Chapter 6: Lily steps onto the stage, surprising Nathaniel and the audience, claiming to be the creator of First Love. Melanie tries to salvage the situation, but a foul odor from Lily causes confusion and disgust. Nathaniel accuses Lily of betrayal, deepening the tension between them. The host struggles to maintain order, throwing the competition into chaos.
Chapter 7: Lily faces a hostile crowd, with Nathaniel accusing her of betrayal and stealing MN Inc.’s product. Nathaniel tries to salvage the situation by claiming she never worked for MN Inc., but Lily counters by questioning the existence of any contract or paychecks. Melanie demands Lily leave. Nathaniel argues that the research data and samples are identical, causing tension. The host announces Rebirth’s First Love as the winner, shocking Nathaniel who accuses the organizing committee of unfairness. Lily reveals two identical bottles from Rebirth, further complicating the situation.
Chapter 8: Lily confronts Nathaniel and Melanie with evidence of a scent difference between samples from MN Inc. and Rebirth, challenging Nathaniel’s claim of plagiarism. She reveals Melanie’s sample was tampered with, causing a foul odor. The crowd is outraged, believing the competition was rigged. Alexander watches from the VIP room, impressed by Lily’s composure and determination, and considers trusting her more.
Chapter 9: Lily confronts Melanie’s accusations in front of the competition audience, revealing the tampered formula. Nathaniel tries to protect Melanie, but Lily exposes her and claims First Love as her own creation. A scandal erupts when Melanie fakes a fainting spell, and MN Inc. must leave the stage. Lily leaves the venue to reporters’ questions, maintaining composure. In the limousine, Alexander’s warm jacket offers comfort, and Lily reflects on her past and future. Olivia calls, excited about the events and offering support, but Lily declines to stay at her place. The call hints at the challenges ahead.
Chapter 10: Lily and Alexander arrive at a spa club for their wedding night, where Alexander reveals he had reserved the entire restaurant. After a romantic dinner, they enter a suite with a private hot tub. Lily, feeling nervous and out of her comfort zone, downs a glass of wine, which tips her into a state of uncertainty. Alexander, patient and understanding, gives her the chance to prepare. Lily undresses and joins Alexander in the hot tub, where she must confront her vulnerability and past traumas. Despite her hesitation, she leans into Alexander's embrace, finding a new kind of warmth and security. Their intimate connection deepens, but Lily remains guarded, aware that their relationship is built on mutual benefit rather than genuine affection.
Note: All outlines are normalized as the above format.
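The normalization mentioned in the note could look roughly like the sketch below. The function name `normalize_outline`, the regular expression, and the choice to drop lines that do not look like chapter outlines are all assumptions for illustration; the card does not describe the actual normalization code.

```python
import re

# Accepts "Chapter 3: ..." or "第3章:..." with an ASCII or full-width colon.
CHAPTER_RE = re.compile(
    r"^\s*(?:Chapter\s+(\d+)|第\s*(\d+)\s*章)\s*[::]\s*(.+?)\s*$",
    re.IGNORECASE,
)


def normalize_outline(text, lang="en"):
    """Return one normalized line per chapter, skipping non-outline lines."""
    steps = []
    for line in text.splitlines():
        match = CHAPTER_RE.match(line)
        if match is None:
            continue  # assumption: blank lines and commentary are dropped
        number = int(match.group(1) or match.group(2))
        body = match.group(3)
        if lang == "zh":
            steps.append(f"第{number}章:{body}")
        else:
            steps.append(f"Chapter {number}: {body}")
    return steps
```

Each returned string then corresponds to one PRM step, matching the one-line-per-chapter definition in Section 1.2.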
2.3 PRM Training Data
The training data for the outline-PRM is basically constructed as follows:
We assume that Qwen2.5-7B-generated outlines under such a simple prompt are ALWAYS inferior to ground-truth outlines, and can be regarded as LOW quality.
Starting from the SFT dataset, we generate rollouts for each outline step by providing the same prompt and the preceding ground-truth outlines. Each rollout is prompted to contain a similar number of words to the ground-truth outline, and every rollout is then treated as a negative sample.
This approach ensures a balanced distribution of positive and negative labels.
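The construction described above can be sketched as follows. The function `build_prm_samples` and the `rollouts` mapping are hypothetical names, and the way positive samples are emitted (one ground-truth prefix per step) is an assumption; only the shape of the negative samples (`prompt`, `completions`, `labels`) is taken from the examples in this card.

```python
def build_prm_samples(prompt, ground_truth, rollouts):
    """Build stepwise PRM samples.

    ground_truth: list of ground-truth chapter outlines, in order.
    rollouts: dict mapping a step index to a list of generated
        alternatives for that step (hypothetical input format).
    """
    samples = []
    for i, step in enumerate(ground_truth):
        prefix = ground_truth[:i]
        # Assumed positive sample: the ground-truth prefix up to step i.
        samples.append({
            "prompt": prompt,
            "completions": prefix + [step],
            "labels": [True] * (i + 1),
        })
        # Negative samples: same prefix, but the last step is a rollout.
        for rollout in rollouts.get(i, []):
            samples.append({
                "prompt": prompt,
                "completions": prefix + [rollout],
                "labels": [True] * i + [False],
            })
    return samples
```

With one rollout per step, every positive sample has exactly one negative counterpart that differs only in its final step.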
- Example negative samples:
{
"prompt":"你是一位专业小说写手。你的任务是基于以下故事灵感进行小说大纲创作。\n一个关于陈玄通过直播算命解决他人问题,展示超凡堪舆技能的故事。\n\n以下是你设计的角色:\n:\n\n角色1:陈玄,主角,一个18岁的道士,穿越后继承道观,拥有登峰造极的堪舆算命技能,通过直播算命帮助他人逆天改命。\n\n角色2:龙夏,配角,一个中年男子,直播间粉丝之一,对陈玄算命持怀疑态度,后被家人出事逼迫承认错误。\n\n角色3:汤巫山,配角,陈玄的道观所在之地,位于豫中,是陈玄施展法术的地方。\n\n角色4:飞仙观,配角,陈玄继承的道观,位于汤巫山,是陈玄进行算命活动的场所。\n\n角色5:九家军,配角,一个户外探险直播团队,成员包括“九家军神马东西”、“九家军先锋”等,因直播算命事件受到关注。\n\n角色6:阴九,配角,九家军成员之一,因直播间算命事件被家人出事逼迫承认错误,最终求助陈玄。\n\n角色7:杨晨,配角,一个高三学生,因梦遗问题求助陈玄,最终被诊断为中蛊,通过陈玄的帮助解决了问题。\n\n角色8:王浩,配角,杨晨的室友,因嫉妒杨晨而下蛊,最终被杨晨发现并处理。\n\n角色9:黄圆圆,配角,一个自称来自山河五仙的老奶奶,因杨晨的问题而威胁陈玄,最终被陈玄化解。\n\n角色10:宿管大叔,配角,杨晨所在宿舍楼的管理员,因杨晨的问题而参与处理,最终协助杨晨解决困扰。\n\n\n请基于以上信息为小说前9章设计故事大纲,每个大纲大概在75字左右。",
"completions":[
"第1章:陈玄意外穿越至另一个时空,继承了一座古旧道观,并发现自己拥有堪舆算命的超凡技能。他在直播算命时遭遇了一系列质疑,但当他展示付款码后,人气迅速飙升,引起了包括九家军在内的众多观众的注意。",
"第2章:在直播中,陈玄为阴九进行了算命。起初,他的预测遭到嘲笑和质疑。陈玄直接指出阴九将有大凶之兆,这一预言引发了激烈的争议。最终,陈玄凭借问及阴九奶奶的具体去世时间和病因,揭露了事实,证明了自己算命的准确性。",
"第3章:阴九在直播中质问陈玄,得知家人的安危后,他慌乱中不断打赏。陈玄揭示了阴九家人的凶兆,阴九接到妻子的电话,得知女儿病危。在绝望中,他跪地求救,但陈玄只能告知他已经发生,并警告阴九。",
"第4章:陈玄通过直播为阴九解难,揭示其面临的危机源自何处,并指导其进行一系列仪式以渡过劫难。直播间观众因为参与怂恿而面临巨大的阴德损失威胁。",
"第5章:陈玄通过直播算命帮助阴九等人解决了问题,面对大量粉丝的求助请求,他通过发放福袋的方式筛选出第一位求助者杨晨。陈玄成功诊断其为中蛊状态,并提供了解决方案。",
"第6章:杨晨在直播中向陈玄求助,陈玄通过直播为他解梦。杨晨因梦遗问题求助陈玄,陈玄发现他中蛊,通过直播向观众展示了解决方案,最终成功治愈杨晨。"
],
"labels":[
true,
true,
true,
true,
true,
false
]
}
{
"prompt":"Act as a novel writer. Your task is to craft novel outlines based on the following story idea:\nA story about Lily, who discovers her fiancé Nathaniel's affair, then gets unexpectedly married to Alexander Russell for a business deal. She confronts Nathaniel and Melanie's scandal at a perfume competition, proving her own creation and facing betrayal. The wedding night offers a moment of vulnerability and connection, but their relationship remains complex.\n\nHere's your character design:\nLily Christian, protagonist, a determined and skilled perfumer who faces betrayal and competition to reclaim her rightful place.\nNathaniel Hall, antagonist, a selfish businessman who uses and betrays Lily, only to find his plans backfiring.\nMelanie Thayer, antagonist, Nathaniel's unfaithful lover who steals Lily's work, unaware of the consequences.\nAlexander Russell, protagonist, a savvy businessman who marries Lily to help her and takes on MN Inc., though his motives are complex.\nAnthony Moore, supporting, Nathaniel's loyal secretary who helps hide the truth about Lily's whereabouts.\nOlivia Hart, supporting, Lily's assistant who supports her and helps orchestrate the plan to expose Nathaniel and Melanie.\n\nPlease create the story outlines for the first 10 chapters, with each chapter outline in 80 words",
"completions":[
"Chapter 1: Lily Christian, battling a headache and insomnia, overhears Nathaniel and his fiancé Melanie discussing an affair. Devastated, she recalls their years together and Nathaniel's betrayal through his business partnership with Melanie. Seeking validation, Lily receives an unexpected call from Alexander Russell, CEO of La Beauté Group, offering a meeting to discuss a business proposal. Rushing to the café, she boards a limousine, unsure of Russell's intentions.",
"Chapter 2: Lily arrives at the clerk's office, expecting to discuss a business proposal with Alexander Russell. Instead, he proposes marriage. Reluctantly, she accepts, and they quickly get married. Alexander directs her to pass on perfume information to Edward and schedules a meeting at La Beauté Group. Back at MN Inc., Lily encounters Nathaniel's secretary, Anthony, who informs her that Nathaniel is looking for her. In Nathaniel's office, she overhears his angry outburst at his assistant, Olivia, for not knowing her whereabouts.",
"Chapter 3: At MN Inc., Nathaniel and Melanie are agitated over missing documents. Nathaniel accuses Lily of being absent from the lab, but she explains she was preparing for a competition. Melanie reveals Lily's past reluctance to participate in such events. Nathaniel checks the documents in a bag Lily holds, and they discuss the upcoming talent competition. Nathaniel insists Lily won't participate, but Lily feels betrayed. She calls Olivia, her assistant, who reports MN Inc. is well-prepared. At La Beauté Group, Edward briefs her on the situation. Alexander notices Lily's injury and lifts her, eliciting a mix of concern and tension.",
"Chapter 4: Alexander tends to Lily's wound, showing a new level of care she has never seen from Nathaniel. At La Beauté Group, Lily watches Nathaniel and Melanie's confident performance, feeling a mix of resentment and determination. During the competition, the host reveals a scandal involving identical perfumes from MN Inc. and Rebirth, potentially implicating MN Inc. in plagiarism. Lily's resolve hardens as she realizes her past work could be jeopardized.",
"Chapter 5: The competition host announces a delay in awarding results due to identical perfumes submitted by MN Inc. and Rebirth. Nathaniel protests the postponement, while Melanie eagerly speculates about the other company. The host reveals both companies are suspected of plagiarism, and Rebirth’s representative confirms submission data. Nathaniel asserts that Mel is the sole creator of First Love, but the host asks Rebirth’s perfumer to step forward, undermining Nathaniel’s claim. Alexander watches, impressed by Lily’s growing confidence and determination.",
"Chapter 6: At MN Inc., Nathaniel confronts Melanie about the competition scandal. Melanie insists she’s the creator of First Love, but Nathaniel remains skeptical. Anthony, Nathaniel’s loyal secretary, tries to smooth things over, but Nathaniel’s anger grows. Lily returns to MN Inc., finding the lab in chaos. She discovers a note from Nathaniel, indicating he knows about her affair. Nathaniel accuses her of betrayal, but Lily denies any wrongdoing, insisting she’s focused on the competition. Alexander arrives, offering support and reassurance. Lily feels a mix of vulnerability and determination."
],
"labels":[
true,
true,
true,
true,
true,
false
]
}
Note: Use the train_on_last_step_only flag so that training uses balanced positive and negative labels.
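To see why this flag balances the labels, consider the simplified illustration below. It is not TRL's actual code: `step_labels` is a hypothetical helper, and `-100` is used as the ignore value on the assumption that the loss skips it, as PyTorch's cross-entropy does by default.

```python
IGNORE_INDEX = -100  # assumption: label value skipped by the loss


def step_labels(labels, train_on_last_step_only=True):
    """Convert boolean step labels into training targets.

    With train_on_last_step_only, every step except the last is masked,
    so each sample contributes exactly one positive or negative target.
    """
    targets = [int(label) for label in labels]
    if train_on_last_step_only and targets:
        return [IGNORE_INDEX] * (len(targets) - 1) + targets[-1:]
    return targets
```

Without the flag, a negative sample such as `[True, True, True, True, True, False]` would contribute five positive targets for every negative one; with it, only the final `False` is trained on.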
3. Model Training
We trained 2 models on the above dataset:
- NovelWriting-Outline-Qwen2.5-7B-Instruct: The SFT LLM, trained by Llama-Factory.
- NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward: The PRM for outline generation task, trained by using TRL library (Refer to Doc).
- Note: This model is trained with the train_on_last_step_only flag set to True.
4. Usage & Performance Evaluation
4.1 Accuracy Metric
- Case Study
4.2 Sequential Rejection Sampling
Without delving into further reinforcement learning, can we directly apply PRM with our LLMs? The answer is YES!
- Test-Time Scaling
Since this experiment does not aim to achieve O1-like reasoning behavior, the test-time compute here can be defined simply as a function of rejection_sampling_size. Increasing the sampling size during inference leads to higher computational cost, but as expected, it also improves performance according to our PRM.
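A minimal sketch of the sequential procedure is given below, implemented as best-of-N selection at each chapter. The `generate` and `score` callables are placeholders for the SFT LLM and the PRM respectively; their signatures are assumptions, and the card does not specify the exact selection rule used in the experiments.

```python
from typing import Callable, List


def sequential_rejection_sampling(
    prompt: str,
    n_chapters: int,
    generate: Callable[[str, List[str]], str],
    score: Callable[[str, List[str]], float],
    rejection_sampling_size: int = 4,
) -> List[str]:
    """Build an outline chapter by chapter, keeping the best-scored candidate.

    generate(prompt, prefix) -> one candidate outline for the next chapter.
    score(prompt, steps) -> PRM score for the last step in `steps`.
    """
    if rejection_sampling_size < 1:
        raise ValueError("rejection_sampling_size must be at least 1")
    outline: List[str] = []
    for _ in range(n_chapters):
        # Sample several candidates for the next chapter from the same prefix.
        candidates = [
            generate(prompt, outline) for _ in range(rejection_sampling_size)
        ]
        # Keep the candidate the PRM scores highest; discard the rest.
        best = max(candidates, key=lambda c: score(prompt, outline + [c]))
        outline.append(best)
    return outline
```

The generator is called `n_chapters * rejection_sampling_size` times, which is the sense in which test-time compute grows with the sampling size.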
The "Test-Time Scaling Performance" is visualized as follows:
Note:
- Since this PRM has a relatively small size of 0.5B, we do not expect it to generalize well. It is likely that this PRM can only differentiate between low-quality and high-quality outlines generated by Qwen-7B. Outlines produced by other high-performing LLMs or those crafted with careful prompt engineering may be able to bypass or "fool" our PRM.
- There is significant room for improvement in the training data construction. For example, it could be enhanced by introducing a variety of flaws (e.g., repetitive patterns, toxic content, instruction-following failures, etc.) and incorporating outputs from more diverse LLMs.
4.3 Generalization Concerns
- Case Study: Format affects the results
5. Discussion
There are many PRM-related papers one can refer to. A Roadmap to Reproduce o1 is a good starting point for understanding the current status of O1 reproduction work.
The main difference between a PRM for O1-like models and the PRM in this project is that there is no reasoning process in this project at all. The process or step is defined directly as each line of the final result (with no CoT process).
This design choice arises because obtaining step-wise reward signals for reasoning in creative writing is inherently challenging, with frequent ambiguity and subjectivity. Annotators may struggle to determine whether a particular reasoning step in the creative process is good or bad. Unlike math problems, where correctness is well-defined, creative writing allows for many valid paths, each leading to a unique outcome, as "all roads lead to Rome."
On the other hand, it is relatively simple to automatically construct negative outlines for training an outline PRM, which makes for a fast hands-on experiment. Why not give it a shot?
Note: There are automatic ways of turning ORM into a PRM (e.g., Free Process Rewards without Process Labels), but it's beyond our discussion now.
6. Conclusion
This project provides some minimal hands-on experience with PRMs in a specific domain. However, it is important to note that it is far from perfect in terms of training data, model design, evaluation, and the insights gained.