license: apache-2.0
language:
  - zh
  - en
base_model:
  - Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: token-classification
library_name: transformers
tags:
  - novel-writing
  - PRM
  - outline

PRM for Novel Outline Generation

This is a small project driven by personal interest, focused on developing a Process-Level Reward Model (PRM) for a specific and simplified task: generating outlines for novels (in both Chinese and English).

The goal is to explore how PRMs can directly provide quality signals for the final stage of creative writing, without relying on CoT steps.

Use Case

PRM Scoring

Evaluate the generated outlines by providing scores.

Scoring Chinese Outlines: score_zh.png
Scoring English Outlines: score_en.png

Sequential Rejection Sampling

Perform rejection sampling at each step until the outline is complete.

Prompt: sequential_rejection_sampling_zh_prompt.png
Outline Generation with Sequential Rejection Sampling: sequential_rejection_sampling_zh.png

Basic Usage

from transformers import pipeline

# load reward model
pipe = pipeline("token-classification", model="mrzjy/NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward", device="cuda")
separator = "\n"  # It's important to use the same separator as the one used during training

def evaluate_reward(batch_prompt, batch_steps) -> list:
    """Score every step of every outline.

    Returns one list per prompt, each holding {"step", "score"} dicts.
    """
    rewards = []
    assert len(batch_prompt) == len(batch_steps)
    for prompt, steps in zip(batch_prompt, batch_steps):
        step_rewards = []
        for idx in range(1, len(steps) + 1):
            current_steps = steps[0:idx]
            # Join the prompt and the steps so far with the separator, ending with a trailing separator
            text = separator.join((prompt, *current_steps)) + separator
            # The last token's prediction is taken as the score of the most recent step
            pred = pipe(text)[-1]
            score, pred_entity = pred["score"], pred["entity"]
            # The pipeline returns the probability of the predicted label;
            # always convert it to the probability of the positive label
            if pred_entity == "LABEL_0":
                score = 1 - score
            step_rewards.append({"step": steps[idx-1], "score": score})
        rewards.append(step_rewards)
    return rewards

batch_prompt = [
    """你是一位专业小说写手。你的任务是基于以下故事灵感进行小说大纲创作。\n一个关于李向东穿越回1979年,面对家庭压力和挑战,最终与家人共度温馨时光的故事。\n\n以下是你设计的角色:\n:\n\n角色1:李向东,主角,年轻时下乡插队,现在回城后无所事事,渴望改变现状。\n角色2:李母,配角,李向东的母亲,唠叨且疼爱儿子,希望儿子能有份好工作。\n角色3:李父,配角,李向东的父亲,为人温和,对儿子寄予厚望。\n角色4:周玉琴,配角,李向东的妻子,性格温和,对丈夫和孩子充满关爱。\n角色5:李晓江,配角,李向东的大侄子,性格顽皮,喜欢和三叔玩闹。\n角色6:李晓涛,配角,李向东的二侄子,性格憨厚,喜欢和三叔一起干活。\n角色7:李晓梅,配角,李向东的侄女,活泼好动,喜欢和哥哥姐姐一起玩耍。\n角色8:李晓兰,配角,李向东的侄女,与妹妹一起,喜欢和哥哥姐姐一起玩耍。\n角色9:李晓海,配角,李向东的儿子,年仅三岁,性格活泼,喜欢和哥哥姐姐一起玩耍。\n角色10:张姨,配角,李母的老姐妹,经常在供销社上班,对李向东关心有加。\n角色11:李卫国,配角,李向东的大哥,为人老实,喜欢和三叔一起干活。\n角色12:李卫民,配角,李向东的二哥,为人憨厚,喜欢和三叔一起干活。\n角色13:李小竹,配角,李向东的女儿,活泼好动,喜欢和哥哥姐姐一起玩耍。\n角色14:刘大娘,配角,住在隔壁的邻居,经常和李母一起打叶子牌。\n\n\n请基于以上信息为小说前10章设计故事大纲,每个大纲大概在65字左右。""",
    """Act as a novel writer. Your task is to craft novel outlines based on the following story idea:\nA story about Apari, who awakens in a serene room after being trapped in an underground shelter during a violent storm. The stark contrast between the harsh outside and the cozy inside heightens his confusion and unease, as he grapples with recent vows and the nature of his current situation.\n\nHere's your character design:\nApari, supporting, A young man hiding in an underground shelter, screaming in fear, his presence unknown to the reader.\nNowra, supporting, A companion silent in the narrative, her absence a mystery to Apari.\nTheo, protagonist, A man sharing a room with the narrator, his face serious and changed, unsure if their vow to the moon goddess will be accepted.\nIsland Resident, supporting, An anonymous figure, possibly a guard, whose disappearance adds to the tension, unseen but felt.\n\nPlease create the story outlines for the first 5 chapters, with each chapter outline in 65 words."""
]

batch_steps = [
    """第1章:李向东重返1979年,面对母亲的责备与邻居的流言蜚语,他内心回想家族往事,努力调整心态,试图摆脱母亲喋喋不休的困扰,最后以去厕所为由脱身。
第2章:李向东在胡同里闲逛,偶遇妻子周玉琴带着女儿玩耍,周玉琴提醒他回家吃饭。李向东将女儿带回家,通过与女儿的互动,表达对家庭的责任感。
第3章:在共进晚餐时,家庭氛围略显紧张,父亲批评他无所作为,母亲则坚定地支持他。李向东感受着来自家庭的压力,决心改变现状。父亲则要求他去街道办事处报到。
第4章:在一次家庭会议上,李向东与家人讨论工作机会,得知火车站正在招聘。母亲担心老人健康,早早起来买肉。李向东关心奶奶,接过装满肉的布袋,奶奶感激地留下煮好的鸡蛋给孙子。
第5章:李向东回家探望家人,被母亲发现偷吃鸡蛋,引发了一次充满亲情的对话。李老太讲述偏爱李向东的缘由,并以过去煤票的故事教育孙子,展现了家庭的温情和长辈的智慧。
第6章:李向东在家中发现了祖传的明朝红酸枝八仙桌,得知其来历后,他决定将桌子搬回西厢房,却被爷爷质疑其价值。
第7章:李向东发现桌子背面沾有鼻涕,责问孩子们后,许诺给他们奶油雪糕,促使孩子们帮忙擦拭干净,整个过程中,李向东与孩子们互动,展现了家庭的温暖与幽默。
第8章:在一个夏日的午后,李向东带着孩子们去买冰棍。在买冰棍的过程中,他与大侄子李晓江交谈,了解他成长过程中的情感世界。
第9章:李向东回城后,首次前往供销社买冰棍,受到了张姨的关心和唠叨。回到家后,他将冰棍分给孩子们,却引发了母亲的责备和担忧。
第10章:李向东为爷爷奶奶买了冰棍,却发现儿子因为惹祸被罚站。他与妻子分享冰棍,孩子们争抢剩下的小木棒,为的是换取冰棍。""".splitlines(),
    """Chapter 1: Strong winds, heavy rain, and crashing waves set a gloomy and foreboding atmosphere. Overcast skies and biting cold in February create an eerie mood, with clouds blanketing the island. The natural elements hint at an impending storm and an ominous feeling that hangs over the scene.
Chapter 2: Trapped in an underground shelter, Apari's screams echo in the damp, echoing space. A violent storm rages outside, with howling winds and torrential rain battering the island. Thoughts of the guards and the absence of Nowra's comforting voice heighten his sense of dread, leaving him isolated and vulnerable in the midst of chaos.
Chapter 3: Apari awakens in a starkly contrasting environment, surrounded by the comforting scent of clean sheets and lavender. The soft, cozy bed contrasts with the harsh reality of his previous isolation. He sits up, the blankets pooling around him, to find a sleeping man beside him, half-naked and peaceful. This serene and unexpected intimacy heightens Apari’s confusion and unease, leaving him to wonder about the circumstances that brought him here and the nature of the person sharing the bed.
Chapter 4: Staring at the face of the man beside him, Apari notes the seriousness in his expression. Doubts arise about the recent vow made, wondering if the moon goddess will accept it. The stark contrast between the harsh, stormy environment outside and the serene, intimate setting inside adds to Apari’s confusion and unease, leaving him questioning the nature of their current situation and the reliability of the vow.
Chapter 5: Theo's gentle, chaste kiss introduces a moment of tender intimacy amidst the stark contrast between the ominous storm outside and the serene, intimate setting of their room. Apari's confusion and unease deepen as he grapples with the recent vow and the moon goddess's acceptance. The sudden shift in atmosphere heightens the sense of mystery and tension, leaving Apari questioning the nature of their current situation and the reliability of their surroundings.""".splitlines()
]

rewards = evaluate_reward(batch_prompt, batch_steps)
  • The output rewards should be as follows:
[[{'step': '第1章:李向东重返1979年,面对母亲的责备与邻居的流言蜚语,他内心回想家族往事,努力调整心态,试图摆脱母亲喋喋不休的困扰,最后以去厕所为由脱身。',
   'score': 1.0},
  {'step': '第2章:李向东在胡同里闲逛,偶遇妻子周玉琴带着女儿玩耍,周玉琴提醒他回家吃饭。李向东将女儿带回家,通过与女儿的互动,表达对家庭的责任感。',
   'score': 0.99999845},
  {'step': '第3章:在共进晚餐时,家庭氛围略显紧张,父亲批评他无所作为,母亲则坚定地支持他。李向东感受着来自家庭的压力,决心改变现状。父亲则要求他去街道办事处报到。',
   'score': 0.99999905},
  {'step': '第4章:在一次家庭会议上,李向东与家人讨论工作机会,得知火车站正在招聘。母亲担心老人健康,早早起来买肉。李向东关心奶奶,接过装满肉的布袋,奶奶感激地留下煮好的鸡蛋给孙子。',
   'score': 1.0},
  {'step': '第5章:李向东回家探望家人,被母亲发现偷吃鸡蛋,引发了一次充满亲情的对话。李老太讲述偏爱李向东的缘由,并以过去煤票的故事教育孙子,展现了家庭的温情和长辈的智慧。',
   'score': 1.0},
  {'step': '第6章:李向东在家中发现了祖传的明朝红酸枝八仙桌,得知其来历后,他决定将桌子搬回西厢房,却被爷爷质疑其价值。',
   'score': 1.0},
  {'step': '第7章:李向东发现桌子背面沾有鼻涕,责问孩子们后,许诺给他们奶油雪糕,促使孩子们帮忙擦拭干净,整个过程中,李向东与孩子们互动,展现了家庭的温暖与幽默。',
   'score': 0.9999999},
  {'step': '第8章:在一个夏日的午后,李向东带着孩子们去买冰棍。在买冰棍的过程中,他与大侄子李晓江交谈,了解他成长过程中的情感世界。',
   'score': 0.03910404443740845},
  {'step': '第9章:李向东回城后,首次前往供销社买冰棍,受到了张姨的关心和唠叨。回到家后,他将冰棍分给孩子们,却引发了母亲的责备和担忧。',
   'score': 0.99996746},
  {'step': '第10章:李向东为爷爷奶奶买了冰棍,却发现儿子因为惹祸被罚站。他与妻子分享冰棍,孩子们争抢剩下的小木棒,为的是换取冰棍。',
   'score': 1.0}],
 [{'step': 'Chapter 1: Strong winds, heavy rain, and crashing waves set a gloomy and foreboding atmosphere. Overcast skies and biting cold in February create an eerie mood, with clouds blanketing the island. The natural elements hint at an impending storm and an ominous feeling that hangs over the scene.',
   'score': 1.0},
  {'step': "Chapter 2: Trapped in an underground shelter, Apari's screams echo in the damp, echoing space. A violent storm rages outside, with howling winds and torrential rain battering the island. Thoughts of the guards and the absence of Nowra's comforting voice heighten his sense of dread, leaving him isolated and vulnerable in the midst of chaos.",
   'score': 0.99999964},
  {'step': 'Chapter 3: Apari awakens in a starkly contrasting environment, surrounded by the comforting scent of clean sheets and lavender. The soft, cozy bed contrasts with the harsh reality of his previous isolation. He sits up, the blankets pooling around him, to find a sleeping man beside him, half-naked and peaceful. This serene and unexpected intimacy heightens Apari’s confusion and unease, leaving him to wonder about the circumstances that brought him here and the nature of the person sharing the bed.',
   'score': 1.0},
  {'step': 'Chapter 4: Staring at the face of the man beside him, Apari notes the seriousness in his expression. Doubts arise about the recent vow made, wondering if the moon goddess will accept it. The stark contrast between the harsh, stormy environment outside and the serene, intimate setting inside adds to Apari’s confusion and unease, leaving him questioning the nature of their current situation and the reliability of the vow.',
   'score': 0.99999964},
  {'step': "Chapter 5: Theo's gentle, chaste kiss introduces a moment of tender intimacy amidst the stark contrast between the ominous storm outside and the serene, intimate setting of their room. Apari's confusion and unease deepen as he grapples with the recent vow and the moon goddess's acceptance. The sudden shift in atmosphere heightens the sense of mystery and tension, leaving Apari questioning the nature of their current situation and the reliability of their surroundings.",
   'score': 1.0}]]
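
In the Chinese example above, chapter 8 stands out with a score of about 0.04 while every other step scores close to 1. A minimal sketch of how such steps could be picked out of the `rewards` structure is shown below. The helper name `find_weak_steps` and the 0.5 threshold are our own assumptions for illustration, not part of the model or the pipeline.

```python
from typing import Dict, List


def find_weak_steps(step_rewards: List[Dict], threshold: float = 0.5) -> List[int]:
    """Return the 1-based positions of steps whose PRM score is below `threshold`."""
    return [
        position
        for position, item in enumerate(step_rewards, start=1)
        if item["score"] < threshold
    ]


# Scores copied from the Chinese example output above (step texts replaced by placeholders).
zh_scores = [1.0, 0.99999845, 0.99999905, 1.0, 1.0, 1.0,
             0.9999999, 0.03910404443740845, 0.99996746, 1.0]
zh_rewards = [{"step": f"step {i}", "score": s} for i, s in enumerate(zh_scores, start=1)]

print(find_weak_steps(zh_rewards))  # [8]
```

With the real output, `find_weak_steps(rewards[0])` would be called directly on the list returned by `evaluate_reward`.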

1. Task Definition

1.1 Novel Outline Generation

In practice, creating a novel outline typically involves a far more complex reflective process.

However, for the sake of simplicity in this experiment, the task is simplified as follows:

  • Given a brief story idea and character designs, generate outlines for the first n chapters (where n can range from 1 to 10).

Below is the system prompt template used for training data construction:

  • English
Act as a novel writer. Your task is to craft novel outlines based on the following story idea:
{story_idea}

Here's your character design:
{character}

Please create the story outlines for the first {n} chapters, with each chapter outline in {n_word} words.
  • Chinese
你是一位专业小说写手。你的任务是基于以下故事灵感进行小说大纲创作。
{story_idea}

以下是你设计的角色:
{character}

请基于以上信息为小说前{n}章设计故事大纲,每个大纲大概在{n_word}字左右。

Note: Due to the limited capacity of the 0.5B model, please adhere strictly to the specified prompt template for your use case; otherwise, performance cannot be guaranteed.
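
Since the prompt has to follow the template closely, it may help to fill it in programmatically. The sketch below copies the two templates as they appear above and fills in the placeholders. The function name `build_prompt` and the `lang` argument are hypothetical, and how the `{character}` block is formatted is left to the caller.

```python
EN_TEMPLATE = (
    "Act as a novel writer. Your task is to craft novel outlines based on the following story idea:\n"
    "{story_idea}\n\n"
    "Here's your character design:\n"
    "{character}\n\n"
    "Please create the story outlines for the first {n} chapters, "
    "with each chapter outline in {n_word} words."
)

ZH_TEMPLATE = (
    "你是一位专业小说写手。你的任务是基于以下故事灵感进行小说大纲创作。\n"
    "{story_idea}\n\n"
    "以下是你设计的角色:\n"
    "{character}\n\n"
    "请基于以上信息为小说前{n}章设计故事大纲,每个大纲大概在{n_word}字左右。"
)


def build_prompt(story_idea: str, character: str, n: int, n_word: int, lang: str = "en") -> str:
    """Fill the training-time prompt template for `lang` ("en" or "zh")."""
    if not 1 <= n <= 10:
        # The task definition above covers the first 1 to 10 chapters only.
        raise ValueError("n is expected to be between 1 and 10")
    template = ZH_TEMPLATE if lang == "zh" else EN_TEMPLATE
    return template.format(story_idea=story_idea, character=character, n=n, n_word=n_word)


example_prompt = build_prompt(
    story_idea="A story about a lighthouse keeper who finds a message in a bottle.",
    character="Mara, protagonist, a solitary lighthouse keeper.",
    n=5,
    n_word=65,
)
print(example_prompt)
```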

1.2 PRM Definition

A PRM provides process-level reward signals for generation tasks and is mainly used to guide the reasoning steps when training O1-like models.

In our context, however, each process step is a one-line outline for a single chapter of a novel. There are no CoT steps in this task.

2. Training Data

2.1 Preparation

We collected data from two sources:

  • 番茄小说 (Fanqie Novel; Chinese dataset): ~1k novels, limited to the first several (free-access) chapters.
  • GoodNovel (English dataset): ~3k novels, limited to the first several (free-access) chapters.

These datasets were combined to form our bilingual training data.

2.2 SFT Training Data

For each novel, we used Qwen2.5-7B-Instruct to summarize each chapter into an outline independently. We then applied Qwen2.5-32B-Instruct to refine these outlines so that they read as a smoother, more natural sequence. We call the result the ground-truth outline.

In addition, a brief synopsis and a character list are summarized for each novel, as required by the outline generation task.

The result is an SFT training dataset for LLMs, which also serves as the foundation for the PRM training dataset.

  • Example assistant responses:
第1章:陈玄意外穿越至另一个时空,继承了一座古旧道观,并发现自己拥有堪舆算命的超凡技能。他在直播算命时遭遇了一系列质疑,但当他展示付款码后,人气迅速飙升,引起了包括九家军在内的众多观众的注意。
第2章:在直播中,陈玄为阴九进行了算命。起初,他的预测遭到嘲笑和质疑。陈玄直接指出阴九将有大凶之兆,这一预言引发了激烈的争议。最终,陈玄凭借问及阴九奶奶的具体去世时间和病因,揭露了事实,证明了自己算命的准确性。
第3章:阴九在直播中质问陈玄,得知家人的安危后,他慌乱中不断打赏。陈玄揭示了阴九家人的凶兆,阴九接到妻子的电话,得知女儿病危。在绝望中,他跪地求救,但陈玄只能告知他已经发生,并警告阴九。
第4章:陈玄通过直播为阴九解难,揭示其面临的危机源自何处,并指导其进行一系列仪式以渡过劫难。直播间观众因为参与怂恿而面临巨大的阴德损失威胁。
第5章:陈玄通过直播算命帮助阴九等人解决了问题,面对大量粉丝的求助请求,他通过发放福袋的方式筛选出第一位求助者杨晨。陈玄成功诊断其为中蛊状态,并提供了解决方案。
第6章:杨晨因梦遗问题向陈玄求助,他描述了梦中被高大披头散发男人折磨的情景。陈玄通过观察宿舍布局和询问室友关系,判断杨晨中了西川巴蜀秘蛊,并指出是室友王浩所为,这一诊断引起了直播间观众的轩然大波。
第7章:杨晨在陈玄的帮助下揭露了室友王浩下蛊的事实,通过直播揭开了真相。王浩试图掩盖事实,但最终真相大白于天下。
第8章:杨晨发现王浩对自己下了蛊,通过陈玄的帮助成功驱除蛊毒,并揭露了王浩的恶行。陈玄通过占卜确认王浩将变成智障,杨晨感激之余继续向陈玄求助。
第9章:陈玄直播解答张伟的梦遗问题,揭示其为仙家后裔纠缠所致,这一解释引起了直播间观众的震惊与好奇。
Chapter 1: Lily Christian, battling a headache and insomnia, overhears Nathaniel and his fiancé Melanie discussing an affair. Devastated, she recalls their years together and Nathaniel's betrayal through his business partnership with Melanie. Seeking validation, Lily receives an unexpected call from Alexander Russell, CEO of La Beauté Group, offering a meeting to discuss a business proposal. Rushing to the café, she boards a limousine, unsure of Russell's intentions.
Chapter 2: Lily arrives at the clerk's office, expecting to discuss a business proposal with Alexander Russell. Instead, he proposes marriage. Reluctantly, she accepts, and they quickly get married. Alexander directs her to pass on perfume information to Edward and schedules a meeting at La Beauté Group. Back at MN Inc., Lily encounters Nathaniel's secretary, Anthony, who informs her that Nathaniel is looking for her. In Nathaniel's office, she overhears his angry outburst at his assistant, Olivia, for not knowing her whereabouts.
Chapter 3: At MN Inc., Nathaniel and Melanie are agitated over missing documents. Nathaniel accuses Lily of being absent from the lab, but she explains she was preparing for a competition. Melanie reveals Lily's past reluctance to participate in such events. Nathaniel checks the documents in a bag Lily holds, and they discuss the upcoming talent competition. Nathaniel insists Lily won't participate, but Lily feels betrayed. She calls Olivia, her assistant, who reports MN Inc. is well-prepared. At La Beauté Group, Edward briefs her on the situation. Alexander notices Lily's injury and lifts her, eliciting a mix of concern and tension.
Chapter 4: Alexander tends to Lily's wound, showing a new level of care she has never seen from Nathaniel. At La Beauté Group, Lily watches Nathaniel and Melanie's confident performance, feeling a mix of resentment and determination. During the competition, the host reveals a scandal involving identical perfumes from MN Inc. and Rebirth, potentially implicating MN Inc. in plagiarism. Lily's resolve hardens as she realizes her past work could be jeopardized.
Chapter 5: The competition host announces a delay in awarding results due to identical perfumes submitted by MN Inc. and Rebirth. Nathaniel protests the postponement, while Melanie eagerly speculates about the other company. The host reveals both companies are suspected of plagiarism, and Rebirth’s representative confirms submission data. Nathaniel asserts that Mel is the sole creator of First Love, but the host asks Rebirth’s perfumer to step forward, undermining Nathaniel’s claim. Alexander watches, impressed by Lily’s growing confidence and determination.
Chapter 6: Lily steps onto the stage, surprising Nathaniel and the audience, claiming to be the creator of First Love. Melanie tries to salvage the situation, but a foul odor from Lily causes confusion and disgust. Nathaniel accuses Lily of betrayal, deepening the tension between them. The host struggles to maintain order, throwing the competition into chaos.
Chapter 7: Lily faces a hostile crowd, with Nathaniel accusing her of betrayal and stealing MN Inc.’s product. Nathaniel tries to salvage the situation by claiming she never worked for MN Inc., but Lily counters by questioning the existence of any contract or paychecks. Melanie demands Lily leave. Nathaniel argues that the research data and samples are identical, causing tension. The host announces Rebirth’s First Love as the winner, shocking Nathaniel who accuses the organizing committee of unfairness. Lily reveals two identical bottles from Rebirth, further complicating the situation.
Chapter 8: Lily confronts Nathaniel and Melanie with evidence of a scent difference between samples from MN Inc. and Rebirth, challenging Nathaniel’s claim of plagiarism. She reveals Melanie’s sample was tampered with, causing a foul odor. The crowd is outraged, believing the competition was rigged. Alexander watches from the VIP room, impressed by Lily’s composure and determination, and considers trusting her more.
Chapter 9: Lily confronts Melanie’s accusations in front of the competition audience, revealing the tampered formula. Nathaniel tries to protect Melanie, but Lily exposes her and claims First Love as her own creation. A scandal erupts when Melanie fakes a fainting spell, and MN Inc. must leave the stage. Lily leaves the venue to reporters’ questions, maintaining composure. In the limousine, Alexander’s warm jacket offers comfort, and Lily reflects on her past and future. Olivia calls, excited about the events and offering support, but Lily declines to stay at her place. The call hints at the challenges ahead.
Chapter 10: Lily and Alexander arrive at a spa club for their wedding night, where Alexander reveals he had reserved the entire restaurant. After a romantic dinner, they enter a suite with a private hot tub. Lily, feeling nervous and out of her comfort zone, downs a glass of wine, which tips her into a state of uncertainty. Alexander, patient and understanding, gives her the chance to prepare. Lily undresses and joins Alexander in the hot tub, where she must confront her vulnerability and past traumas. Despite her hesitation, she leans into Alexander's embrace, finding a new kind of warmth and security. Their intimate connection deepens, but Lily remains guarded, aware that their relationship is built on mutual benefit rather than genuine affection.

Note: All outlines are normalized to the format shown above.
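
A small check of this format can be useful before scoring or training. The sketch below assumes the normalized format means one chapter per line, each starting with `第N章:` or `Chapter N:`, numbered consecutively from 1, as in the examples above. The function name `is_normalized` is hypothetical, and the pattern accepts both the ASCII and the full-width colon after `章`.

```python
import re

# One line per chapter, prefixed with "第N章:" / "第N章:" or "Chapter N:" and followed by text.
CHAPTER_PREFIX = re.compile(r"^(?:第(\d+)章[::]|Chapter (\d+):)\s*\S")


def is_normalized(outline_text: str) -> bool:
    """Return True if every line has a chapter prefix and chapters count up from 1."""
    lines = outline_text.splitlines()
    if not lines:
        return False
    for expected_number, line in enumerate(lines, start=1):
        match = CHAPTER_PREFIX.match(line)
        if match is None:
            return False
        number = int(match.group(1) or match.group(2))
        if number != expected_number:
            return False
    return True


print(is_normalized("Chapter 1: Lily overhears a secret.\nChapter 2: Lily gets married."))  # True
print(is_normalized("Chapter 1: Lily overhears a secret.\nChapter 3: A chapter is missing."))  # False
```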

2.3 PRM Training Data

The training data for the outline PRM is constructed as follows:

We assume that outlines generated by Qwen2.5-7B under such a simple prompt are ALWAYS inferior to the ground-truth outlines and can be regarded as LOW quality.

Starting from the SFT dataset, we generate a rollout for each outline by providing the same prompt and the preceding ground-truth outlines. Each rollout is prompted to contain a similar number of words to the ground truth, and every rollout is treated as a negative sample.

This approach yields a balanced distribution of positive and negative labels when only the last step of each sample is trained on (see the note on train_on_last_step_only below).
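
The construction can be sketched roughly as follows. This is an illustration under assumptions rather than the exact script used: `build_prm_samples` is a hypothetical name, `rollouts[k]` stands for a rollout already generated from the prompt plus the ground-truth steps before position `k`, and we assume each ground-truth prefix is also kept as a positive sample.

```python
from typing import Dict, List


def build_prm_samples(prompt: str, gt_outlines: List[str], rollouts: List[str]) -> List[Dict]:
    """Pair every ground-truth prefix with one positive and one negative last step."""
    if len(gt_outlines) != len(rollouts):
        raise ValueError("expected exactly one rollout per ground-truth outline")
    samples = []
    for k, rollout in enumerate(rollouts):
        prefix = gt_outlines[:k]
        # Positive sample (assumption): the ground-truth prefix followed by the ground-truth step.
        samples.append({
            "prompt": prompt,
            "completions": prefix + [gt_outlines[k]],
            "labels": [True] * (k + 1),
        })
        # Negative sample: the same prefix, but the last step is the model rollout.
        samples.append({
            "prompt": prompt,
            "completions": prefix + [rollout],
            "labels": [True] * k + [False],
        })
    return samples


samples = build_prm_samples(
    prompt="(prompt text)",
    gt_outlines=["Chapter 1: ground truth.", "Chapter 2: ground truth."],
    rollouts=["Chapter 1: rollout.", "Chapter 2: rollout."],
)
print(samples[3]["completions"], samples[3]["labels"])
```

Counting only the last label of each sample, positives and negatives appear in equal numbers, which is the balance referred to above.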

  • Example negative samples:
{
   "prompt":"你是一位专业小说写手。你的任务是基于以下故事灵感进行小说大纲创作。\n一个关于陈玄通过直播算命解决他人问题,展示超凡堪舆技能的故事。\n\n以下是你设计的角色:\n:\n\n角色1:陈玄,主角,一个18岁的道士,穿越后继承道观,拥有登峰造极的堪舆算命技能,通过直播算命帮助他人逆天改命。\n\n角色2:龙夏,配角,一个中年男子,直播间粉丝之一,对陈玄算命持怀疑态度,后被家人出事逼迫承认错误。\n\n角色3:汤巫山,配角,陈玄的道观所在之地,位于豫中,是陈玄施展法术的地方。\n\n角色4:飞仙观,配角,陈玄继承的道观,位于汤巫山,是陈玄进行算命活动的场所。\n\n角色5:九家军,配角,一个户外探险直播团队,成员包括“九家军神马东西”、“九家军先锋”等,因直播算命事件受到关注。\n\n角色6:阴九,配角,九家军成员之一,因直播间算命事件被家人出事逼迫承认错误,最终求助陈玄。\n\n角色7:杨晨,配角,一个高三学生,因梦遗问题求助陈玄,最终被诊断为中蛊,通过陈玄的帮助解决了问题。\n\n角色8:王浩,配角,杨晨的室友,因嫉妒杨晨而下蛊,最终被杨晨发现并处理。\n\n角色9:黄圆圆,配角,一个自称来自山河五仙的老奶奶,因杨晨的问题而威胁陈玄,最终被陈玄化解。\n\n角色10:宿管大叔,配角,杨晨所在宿舍楼的管理员,因杨晨的问题而参与处理,最终协助杨晨解决困扰。\n\n\n请基于以上信息为小说前9章设计故事大纲,每个大纲大概在75字左右。",
   "completions":[
      "第1章:陈玄意外穿越至另一个时空,继承了一座古旧道观,并发现自己拥有堪舆算命的超凡技能。他在直播算命时遭遇了一系列质疑,但当他展示付款码后,人气迅速飙升,引起了包括九家军在内的众多观众的注意。",
      "第2章:在直播中,陈玄为阴九进行了算命。起初,他的预测遭到嘲笑和质疑。陈玄直接指出阴九将有大凶之兆,这一预言引发了激烈的争议。最终,陈玄凭借问及阴九奶奶的具体去世时间和病因,揭露了事实,证明了自己算命的准确性。",
      "第3章:阴九在直播中质问陈玄,得知家人的安危后,他慌乱中不断打赏。陈玄揭示了阴九家人的凶兆,阴九接到妻子的电话,得知女儿病危。在绝望中,他跪地求救,但陈玄只能告知他已经发生,并警告阴九。",
      "第4章:陈玄通过直播为阴九解难,揭示其面临的危机源自何处,并指导其进行一系列仪式以渡过劫难。直播间观众因为参与怂恿而面临巨大的阴德损失威胁。",
      "第5章:陈玄通过直播算命帮助阴九等人解决了问题,面对大量粉丝的求助请求,他通过发放福袋的方式筛选出第一位求助者杨晨。陈玄成功诊断其为中蛊状态,并提供了解决方案。",
      "第6章:杨晨在直播中向陈玄求助,陈玄通过直播为他解梦。杨晨因梦遗问题求助陈玄,陈玄发现他中蛊,通过直播向观众展示了解决方案,最终成功治愈杨晨。"
   ],
   "labels":[
      true,
      true,
      true,
      true,
      true,
      false
   ]
}
{
   "prompt":"Act as a novel writer. Your task is to craft novel outlines based on the following story idea:\nA story about Lily, who discovers her fiancé Nathaniel's affair, then gets unexpectedly married to Alexander Russell for a business deal. She confronts Nathaniel and Melanie's scandal at a perfume competition, proving her own creation and facing betrayal. The wedding night offers a moment of vulnerability and connection, but their relationship remains complex.\n\nHere's your character design:\nLily Christian, protagonist, a determined and skilled perfumer who faces betrayal and competition to reclaim her rightful place.\nNathaniel Hall, antagonist, a selfish businessman who uses and betrays Lily, only to find his plans backfiring.\nMelanie Thayer, antagonist, Nathaniel's unfaithful lover who steals Lily's work, unaware of the consequences.\nAlexander Russell, protagonist, a savvy businessman who marries Lily to help her and takes on MN Inc., though his motives are complex.\nAnthony Moore, supporting, Nathaniel's loyal secretary who helps hide the truth about Lily's whereabouts.\nOlivia Hart, supporting, Lily's assistant who supports her and helps orchestrate the plan to expose Nathaniel and Melanie.\n\nPlease create the story outlines for the first 10 chapters, with each chapter outline in 80 words",
   "completions":[
      "Chapter 1: Lily Christian, battling a headache and insomnia, overhears Nathaniel and his fiancé Melanie discussing an affair. Devastated, she recalls their years together and Nathaniel's betrayal through his business partnership with Melanie. Seeking validation, Lily receives an unexpected call from Alexander Russell, CEO of La Beauté Group, offering a meeting to discuss a business proposal. Rushing to the café, she boards a limousine, unsure of Russell's intentions.",
      "Chapter 2: Lily arrives at the clerk's office, expecting to discuss a business proposal with Alexander Russell. Instead, he proposes marriage. Reluctantly, she accepts, and they quickly get married. Alexander directs her to pass on perfume information to Edward and schedules a meeting at La Beauté Group. Back at MN Inc., Lily encounters Nathaniel's secretary, Anthony, who informs her that Nathaniel is looking for her. In Nathaniel's office, she overhears his angry outburst at his assistant, Olivia, for not knowing her whereabouts.",
      "Chapter 3: At MN Inc., Nathaniel and Melanie are agitated over missing documents. Nathaniel accuses Lily of being absent from the lab, but she explains she was preparing for a competition. Melanie reveals Lily's past reluctance to participate in such events. Nathaniel checks the documents in a bag Lily holds, and they discuss the upcoming talent competition. Nathaniel insists Lily won't participate, but Lily feels betrayed. She calls Olivia, her assistant, who reports MN Inc. is well-prepared. At La Beauté Group, Edward briefs her on the situation. Alexander notices Lily's injury and lifts her, eliciting a mix of concern and tension.",
      "Chapter 4: Alexander tends to Lily's wound, showing a new level of care she has never seen from Nathaniel. At La Beauté Group, Lily watches Nathaniel and Melanie's confident performance, feeling a mix of resentment and determination. During the competition, the host reveals a scandal involving identical perfumes from MN Inc. and Rebirth, potentially implicating MN Inc. in plagiarism. Lily's resolve hardens as she realizes her past work could be jeopardized.",
      "Chapter 5: The competition host announces a delay in awarding results due to identical perfumes submitted by MN Inc. and Rebirth. Nathaniel protests the postponement, while Melanie eagerly speculates about the other company. The host reveals both companies are suspected of plagiarism, and Rebirth’s representative confirms submission data. Nathaniel asserts that Mel is the sole creator of First Love, but the host asks Rebirth’s perfumer to step forward, undermining Nathaniel’s claim. Alexander watches, impressed by Lily’s growing confidence and determination.",
      "Chapter 6: At MN Inc., Nathaniel confronts Melanie about the competition scandal. Melanie insists she’s the creator of First Love, but Nathaniel remains skeptical. Anthony, Nathaniel’s loyal secretary, tries to smooth things over, but Nathaniel’s anger grows. Lily returns to MN Inc., finding the lab in chaos. She discovers a note from Nathaniel, indicating he knows about her affair. Nathaniel accuses her of betrayal, but Lily denies any wrongdoing, insisting she’s focused on the competition. Alexander arrives, offering support and reassurance. Lily feels a mix of vulnerability and determination."
   ],
   "labels":[
      true,
      true,
      true,
      true,
      true,
      false
   ]
}

Note: Set the train_on_last_step_only flag so that training uses balanced positive and negative labels.
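
To see why the flag matters: in the negative samples above, five steps are labeled true and only the last is labeled false, so training on every step would count far more positive than negative labels. The sketch below is a simplified illustration of the idea and not TRL's actual implementation. It assumes one label sits on the final token of each step and that ignored positions use -100; the token counts are made up.

```python
from typing import List

IGNORE_INDEX = -100  # conventional "ignore this position" label for cross-entropy loss


def step_labels(step_token_counts: List[int], labels: List[bool],
                train_on_last_step_only: bool) -> List[int]:
    """Return one label per token; only the final token of a step can carry a label."""
    if train_on_last_step_only:
        # Drop every step label except the last one.
        effective = [None] * (len(labels) - 1) + [labels[-1]]
    else:
        effective = list(labels)
    token_labels: List[int] = []
    for count, label in zip(step_token_counts, effective):
        token_labels.extend([IGNORE_INDEX] * (count - 1))
        token_labels.append(IGNORE_INDEX if label is None else int(label))
    return token_labels


# Two steps with 3 and 2 tokens (hypothetical counts), labeled [True, False].
print(step_labels([3, 2], [True, False], train_on_last_step_only=False))
# [-100, -100, 1, -100, 0]
print(step_labels([3, 2], [True, False], train_on_last_step_only=True))
# [-100, -100, -100, -100, 0]
```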

3. Model Training

We trained 2 models on the above dataset:

  • NovelWriting-Outline-Qwen2.5-7B-Instruct: The SFT LLM, trained with LLaMA-Factory.
    • We trained for 2 epochs, stopping once validation loss began to increase.
  • NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward: The PRM for the outline generation task, trained with the TRL library (Refer to Doc).
    • Note: This model is trained with the train_on_last_step_only flag set to True.
    • We trained for 3 epochs. (The validation loss appears to be unstable.)

4. Usage & Performance Evaluation

4.1 Accuracy Metric

  • Classification Report
              precision    recall  f1-score   support

     label 0       0.97      0.97      0.97       216
     label 1       0.99      0.99      0.99       476

    accuracy                           0.98       692
   macro avg       0.98      0.98      0.98       692
weighted avg       0.98      0.98      0.98       692

The accuracy metric appears inflated, likely for one of two reasons: either the constructed negative samples are too easy to distinguish, or the model is overfitting, since the test data shares the same distribution as the training data. As a result, the metric may fail to reflect the model's generalization capability.
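
For readers who want to reproduce this kind of report on their own held-out data, here is a minimal plain-Python sketch of the per-class precision, recall, and F1 computation. The labels and scores are made up for illustration, and the 0.5 threshold for turning a PRM score into a label is an assumption.

```python
from typing import Dict, List


def scores_to_labels(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Turn positive-class scores into 0/1 predictions."""
    return [int(score >= threshold) for score in scores]


def per_class_report(y_true: List[int], y_pred: List[int]) -> Dict[int, Dict[str, float]]:
    """Compute precision, recall, F1, and support for every class."""
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[cls] = {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
    return report


# Made-up example: 3 positive and 2 negative last steps with hypothetical PRM scores.
y_true = [1, 1, 1, 0, 0]
y_pred = scores_to_labels([0.9, 0.8, 0.2, 0.1, 0.7])
report = per_class_report(y_true, y_pred)
print(report[0]["precision"], report[0]["support"])  # 0.5 2
```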

Let's move on nonetheless to see how it actually performs with LLM sampling.

4.2 Sequential Rejection Sampling

Without delving into further reinforcement learning, can we directly apply the PRM to our LLMs? The answer is YES!

Here we experiment with a simplistic variant of MCTS sampling, namely sequential rejection sampling (only one path is fully explored in the end), with the help of the continue_final_message feature and a vLLM server. The code snippet is as follows:

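import asyncio
import random
import time


def direct_proba(x):
    # Normalize raw scores into sampling weights that sum to 1
    s = sum(x)
    if s <= 0:
        # Fall back to uniform weights if every score is zero
        return [1 / len(x)] * len(x)
    return [e/s for e in x]


async def _guided_generation(sample, sampling_size: int):
    # Note: a_request_vllm_chat and model_name are assumed to be defined elsewhere
    start_time = time.time()
    outlines = []
    # Search trace: one node per chapter, starting with a placeholder root
    mcts = [{"children": [], "scores": [0], "chosen": 0}]
    prompt = sample["messages"][0]["content"]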

    async def request_single_response(messages):
        response = await a_request_vllm_chat(
            messages, model_name, temperature=0.7,
            stop="\n",
            logit_bias={9: -1e4, 353: -1e4, 334: -1e4, 3070: -1e4},  # prevent some unexpected tokens
            max_tokens=200,
            extra_body={
                "continue_final_message": True,
                "add_generation_prompt": False,
                "min_tokens": 5
            },
        )
        return response

    # sequential rejection sampling
    for i in range(sample["n_chapter"]):
        history = "\n".join(outlines)
        if i > 0:
            history += "\n"
        if sample["lang"] == "zh":
            assistant_prefix = f"第{i+1}章:"
        else:
            assistant_prefix = f"Chapter {i+1}:"
        messages = [
            {"role": "system", "content": sample["messages"][0]["content"]},
            {"role": "assistant", "content": history + assistant_prefix}
        ]

        # Perform parallel requests (didn't apply n/best_of parameter to prevent server OOM)
        responses = await asyncio.gather(*[request_single_response(messages) for _ in range(sampling_size)])
        responses_content = [response["content"] for response in responses]

        # sampling based on rewards
        batch_steps = [outlines + [assistant_prefix + res] for res in responses_content]
        batch_prompt = [prompt] * len(batch_steps)
        raw_scores = evaluate_reward(batch_prompt, batch_steps)
        scores = direct_proba(raw_scores)
        chosen = random.choices(
            population=[assistant_prefix + res for res in responses_content],
            weights=scores,
            k=1
        )[0]
        mcts.append(
            {
                "children": [assistant_prefix + res for res in responses_content],
                "scores": scores,
                "raw_scores": raw_scores,
                "chosen": chosen
            }
        )
        current_outline = chosen
        outlines.append(current_outline)

    return outlines, mcts, time.time() - start_time


def evaluate_reward(batch_prompt, batch_steps, separator="\n"):
    """Return one PRM score per outline (the score of its last step).

    Assumes `pipe` is the PRM pipeline already loaded from the model checkpoint.
    """
    assert len(batch_prompt) == len(batch_steps)
    # Join the prompt and the steps with the separator, ending with a trailing separator
    batch_text = [separator.join((prompt, *steps)) + separator for prompt, steps in zip(batch_prompt, batch_steps)]
    # Keep only the prediction for the last token of each text
    preds = [res[-1] for res in pipe(batch_text)]
    scores = []
    for pred in preds:
        score, pred_entity = pred["score"], pred["entity"]
        # The returned score is the probability of the predicted class,
        # so convert it to the probability of the positive class
        if pred_entity == "LABEL_0":
            score = 1 - score
        scores.append(score)
    return scores
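
Note that the snippet above does not take the top-scoring candidate. It samples one candidate with probability proportional to its PRM score, so a low-scoring step is rarely but not never chosen. The selection rule can be tried in isolation with the sketch below, which needs neither the LLM nor the PRM. The `greedy` switch and the helper names are our own additions for comparison and are not part of the original code.

```python
import random
from typing import List, Sequence


def normalize(scores: Sequence[float]) -> List[float]:
    """Scale scores to sum to 1, falling back to uniform weights if they sum to 0."""
    total = sum(scores)
    if total <= 0:
        return [1.0 / len(scores)] * len(scores)
    return [score / total for score in scores]


def choose_candidate(candidates: Sequence[str], raw_scores: Sequence[float],
                     rng: random.Random, greedy: bool = False) -> str:
    """Pick one candidate, either by highest score or by score-proportional sampling."""
    if greedy:
        best_index = max(range(len(raw_scores)), key=lambda i: raw_scores[i])
        return candidates[best_index]
    return rng.choices(population=list(candidates), weights=normalize(raw_scores), k=1)[0]


rng = random.Random(0)
candidates = ["Chapter 1: option A", "Chapter 1: option B"]
print(normalize([1.0, 3.0]))  # [0.25, 0.75]
print(choose_candidate(candidates, [0.04, 0.99], rng, greedy=True))  # Chapter 1: option B
```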
  • Case
Prompt: sequential_rejection_sampling_zh_prompt.png
Outline Generation with Sequential Rejection Sampling: sequential_rejection_sampling_zh.png
  • Test-Time Scaling

Since this experiment does not aim to achieve O1-like reasoning behavior, the test-time compute here can be defined simply as a function of rejection_sampling_size. Increasing the sampling size during inference leads to higher computational cost, but as expected, it also improves performance according to our PRM.

The "Test-Time Scaling Performance" is visualized as follows:

test_time_scaling.png
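
One way to get a single number per run for such a plot is to average the raw PRM score of the chosen step over all chapters, using the `mcts` trace returned by `_guided_generation`. The sketch below assumes the trace layout from the snippet above (`children`, `raw_scores`, `chosen`). It is only an illustration and not necessarily how the figure was produced, and the toy trace is made up.

```python
from typing import Dict, List


def mean_chosen_raw_score(trace: List[Dict]) -> float:
    """Average the raw PRM score of the chosen child over all non-root nodes."""
    picked = []
    for node in trace:
        if not node.get("children"):
            continue  # skip the placeholder root, which has no candidates
        chosen_index = node["children"].index(node["chosen"])
        picked.append(node["raw_scores"][chosen_index])
    if not picked:
        raise ValueError("the trace contains no sampled steps")
    return sum(picked) / len(picked)


# Toy trace with made-up scores, shaped like the one built in _guided_generation.
toy_trace = [
    {"children": [], "scores": [0], "chosen": 0},
    {"children": ["a", "b"], "scores": [0.25, 0.75], "raw_scores": [0.3, 0.9], "chosen": "b"},
    {"children": ["c", "d"], "scores": [0.5, 0.5], "raw_scores": [0.8, 0.8], "chosen": "c"},
]
print(mean_chosen_raw_score(toy_trace))
```

Averaging this value over many prompts for each rejection_sampling_size would give one point per sampling size.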

  • Human Evaluation

Despite the small scale of the experiment, we presented the generation results to an expert writer friend and gathered qualitative feedback. The models compared were:

  • Model 1: SFT model with top-p sampling
  • Model 2: SFT model with sequential rejection sampling (size = 4)
  • Model 3: "Ground-truth" outlines summarized from real novels

The expert preferred both Model 2 and Model 3. Specifically, Model 2 performed better in understanding the logic of time/world traversal and in shaping the protagonist and main storyline, while Model 3 excelled in narrative techniques for depicting love triangles. (Note that these preferences may differ from those of typical readers.)

5. Limitations

  • With only 0.5B parameters, this PRM is not expected to generalize well. It can likely only differentiate between low-quality and high-quality outlines generated by Qwen2.5-7B. Outlines produced by other high-performing LLMs, or those crafted with careful prompt engineering, may be able to bypass or "fool" our PRM.
  • There is significant room for improvement in the training data construction. For example, it could be enhanced by introducing a variety of flaws (e.g., repetitive patterns, toxic content, instruction-following failures, etc.) and incorporating outputs from more diverse LLMs.
  • Regarding model interpretability, although the PRM offers improved reward signals at a finer granularity compared with ORM, a simple binary classification task remains insufficient for explaining why a particular outline is classified as "bad".
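
As an example of the second point, a repetition flaw could be injected without any LLM call by reusing the previous chapter's text under the next chapter number. The sketch below is a hypothetical illustration of that idea. It was not used to train this PRM, and the function names are our own.

```python
import re
from typing import Dict, List, Tuple

# Splits "Chapter N: text" or "第N章:text" into the chapter prefix and the body.
_PREFIX = re.compile(r"^((?:第\d+章[::]|Chapter \d+:)\s*)(.*)$", re.DOTALL)


def split_prefix(line: str) -> Tuple[str, str]:
    """Return (chapter prefix, body) for one normalized outline line."""
    match = _PREFIX.match(line)
    if match is None:
        raise ValueError(f"not a normalized outline line: {line!r}")
    return match.group(1), match.group(2)


def make_repetition_negative(prompt: str, gt_outlines: List[str], k: int) -> Dict:
    """Build a negative sample whose step at index k repeats the body of step k-1."""
    if not 1 <= k < len(gt_outlines):
        raise ValueError("k must point at a step that has a predecessor")
    prefix_k, _ = split_prefix(gt_outlines[k])
    _, previous_body = split_prefix(gt_outlines[k - 1])
    flawed_step = prefix_k + previous_body
    return {
        "prompt": prompt,
        "completions": gt_outlines[:k] + [flawed_step],
        "labels": [True] * k + [False],
    }


negative = make_repetition_negative(
    "(prompt text)",
    ["Chapter 1: A happens.", "Chapter 2: B happens."],
    k=1,
)
print(negative["completions"])  # ['Chapter 1: A happens.', 'Chapter 2: A happens.']
```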

6. Discussion

There are many PRM-related papers one can refer to, and A Roadmap to Reproduce o1 is a good starting point for understanding the current status of O1 reproduction work.

The main difference between a PRM for O1-like models and the PRM in this project is that this project involves no reasoning process at all. Each process step is defined directly as one line of the final result (with no CoT process).

This difference of PRM design choice arises because:

  • Obtaining step-wise reward signals for reasoning in creative writing is inherently challenging because of frequent ambiguity and subjectivity. Annotators may struggle to determine whether a particular reasoning step in the creative process is good or bad. Unlike math problems, where correctness is well-defined, creative writing allows many valid paths, each leading to a different but acceptable outcome.
  • While the final answer to a math problem is often a single number, the final output of creative writing involves much longer and more complex content.

On the other hand, it is relatively simple to automatically construct negative outlines for training an outline PRM, which makes for a fast hands-on experiment. Why not give it a shot?

Note: There are automatic ways of turning an ORM into a PRM (e.g., Free Process Rewards without Process Labels), but they are beyond the scope of this discussion.

7. Conclusion

This project provides some minimal hands-on experience with PRMs in a specific creative writing domain. However, it is far from perfect in terms of training data, model design, evaluation, and the insights gained.