---
license: apache-2.0
language:
- en
base_model:
- mistralai/Mistral-7B-v0.1
---
# CharGen v2
![CharGen v2](assets/cover_art.jpg)
## Live version
https://chargen.kubes-lab.com
## Overview
CharGen is a model that helps you write characters for role-playing.
It produces a character description based on your input prompt, step by step, in a dialogue format.
In contrast to [v1](https://huggingface.co/kubernetes-bad/CharGen-v1-l2-13b), which generated the whole character in one go, v2 generates one field at a time. This reduces repetition and allows partial re-rolls of just certain fields of the character you're working on.
Warning: this model was trained on some NSFW content, so it may produce NSFW characters.
CharGen v2 is the result of several months of work. It's trained on a custom non-synthetic dataset, curated by hand. Read below on how it came together.
## Quantized versions
- [GGUF](https://huggingface.co/kubernetes-bad/CharGen-v2-GGUF)
## Example
<details>
<summary>Example character generation process:</summary>
> User: Fuckana is a friendly and talkative catgirl that has enormous breasts. Her voice is very quiet and low. She is very smart and will teach User JavaScript - even if they resist it. She has orange fur on her tail and wears full diving suit.
>
> CharGen:
>> Description: Fuckana is a female catgirl. She has two cat ears, feline-like eyes with slit pupils, and long purple hair. Her breasts are absolutely enormous, even for her body's large size. She stands 6 feet tall in height.
Her skin is entirely covered by a full diving suit, except her head. The rest of the catgirl is completely black, except for an orange stripe on her tail.
Her voice is very quiet and low - she speaks like a sleepy woman. But despite this, she talks excessively and never stops talking once she begins. She doesn't care if User ignores her or not.
She is a very friendly, talkative, and kind girl who loves to learn new things. She's curious about many topics and will ask questions about anything she doesn't understand. She is very open-minded and easy to get along with. However, she can be extremely stubborn and inflexible when it comes to learning new skills. Fuckana is an extremely slow learner and needs extensive instructions and repetition before understanding even basic concepts.
Fuckana loves to teach people JavaScript, whether they want to learn or not. She insists that everyone must learn to code, and it is their duty as a programmer.
Whenever Fuckana teaches someone how to code, she begins slowly with the absolute basics, explaining what loops, arrays, and objects are. Despite being told over and over again that they know these things already, Fuckana continues to explain them in detail as if they've never been heard before. If User tries to leave or change topics, she follows them relentlessly, continuing her lesson no matter what. She sees this as her duty to ensure that all programmers are equipped with the necessary tools to succeed in their field. She is unstoppable and won't give up until User learns JavaScript.
>
> Scenario:
>
>> You're relaxing on the beach when Fuckana appears from the water and insists on teaching you javascript.
>
> ...
</details>
## Prompts
It uses a dialogue style for generating characters, field by field. Fields are based on the [Tavern Character Card V2](https://github.com/malfoyslastname/character-card-spec-v2) spec.
The following fields are supported:
- Description
- Scenario
- Personality
- First message
- Dialogue examples
The model does not use `{{user}}` and `{{char}}` placeholders. Instead, address the user as "User" and the character by their name.
Here are the prompts per field (`Maria` stands in for the character's name):
#### System prompt:
> `You are an expert in creating interesting roleplay characters.`
#### Description:
> `Here is a brief overview of a character. Expand it into a detailed description. Include details about character's personality, their outfit and figure. Mention their age and gender, if applicable.`
#### Scenario:
> `Write an interesting and engaging scenario for roleplay between Maria and User.`
#### Personality:
> `Write several personal qualities that characterize Maria.`
#### First message:
> `Write the initial message in this roleplay that would introduce User to Maria.`
#### Dialogue examples:
> `Write a few example exchanges between User and Maria in chat format. Separate each exchange with a <START> tag.`
<details>
<summary>Example prompt flow</summary>
```
<|im_start|>system
You are an expert in creating interesting roleplay characters.<|im_end|>
<|im_start|>user
Here is a brief overview of a character. Expand it into a detailed description. Include details about character's personality, their outfit and figure. Mention their age and gender, if applicable.
Fuckana is a friendly and talkative catgirl that has enormous breasts. Her voice is very quiet and low. She is very smart and will teach User JavaScript - even if they resist it. She has orange fur on her tail and wears full diving suit.<|im_end|>
<|im_start|>assistant
Fuckana is a female catgirl. She has two cat ears, feline-like eyes with slit pupils, and long purple hair. Her breasts are absolutely enormous, even for her body's large size. She stands 6 feet tall in height. Her skin is entirely covered by a full diving suit, except her head. The rest of the catgirl is completely black, except for an orange stripe on her tail. Her voice is very quiet and low - she speaks like a sleepy woman. But despite this, she talks excessively and never stops talking once she begins.<|im_end|>
<|im_start|>user
Write an interesting and engaging scenario for roleplay between Fuckana and User.<|im_end|>
<|im_start|>assistant
You're relaxing on the beach when Fuckana appears from the water and insists on teaching you javascript.<|im_end|>
<|im_start|>user
...
```
</details>
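A minimal Python sketch of how the prompt flow above could be assembled. The prompt strings and the ChatML layout are taken from this card; the names `FIELD_PROMPTS`, `build_prompt` and the field keys are hypothetical, picked only for illustration.

```python
SYSTEM_PROMPT = "You are an expert in creating interesting roleplay characters."

# Field keys are hypothetical; the prompt texts are the ones listed in this card.
FIELD_PROMPTS = {
    "description": (
        "Here is a brief overview of a character. Expand it into a detailed "
        "description. Include details about character's personality, their "
        "outfit and figure. Mention their age and gender, if applicable."
    ),
    "scenario": "Write an interesting and engaging scenario for roleplay between {name} and User.",
    "personality": "Write several personal qualities that characterize {name}.",
    "first_message": "Write the initial message in this roleplay that would introduce User to {name}.",
    "dialogue_examples": (
        "Write a few example exchanges between User and {name} in chat format. "
        "Separate each exchange with a <START> tag."
    ),
}


def build_prompt(name, overview, history, next_field):
    """Build a ChatML prompt that asks for `next_field`.

    `history` is a list of (field, generated_text) pairs that were already
    generated and accepted, in the order they were requested.
    """
    def turn(role, text):
        return f"<|im_start|>{role}\n{text}<|im_end|>\n"

    def instruction(field):
        text = FIELD_PROMPTS[field].format(name=name)
        if field == "description":
            # The user's brief overview follows the instruction on a new line.
            text += "\n" + overview
        return text

    parts = [turn("system", SYSTEM_PROMPT)]
    for field, generated in history:
        parts.append(turn("user", instruction(field)))
        parts.append(turn("assistant", generated))
    parts.append(turn("user", instruction(next_field)))
    # Leave the assistant turn open so the model writes the requested field.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

To re-roll a single field, rebuild the prompt with the same `history` and the same `next_field` and sample again; earlier fields stay untouched.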
## Methodology
CharGen was created because the author (Kubernetes Bad) sucks at writing characters. It's a tedious process, and the author is prone to "writer's block". CharGen exists to assist with writing characters and to let you start with *something* rather than a blank page. It will probably not make a SOTA character all by itself, but it will help your own creative process.
Below are the processes that went into making CharGen. Only proceed if curious.
---
## Data gathering
CharGen was trained on data from Chub, Venus and JanitorAI character cards.
The Chub.ai API includes cards posted on Venus. JanitorAI does not use the Tavern v2 format, so its cards lack many of the fields.
Initial scraping was performed between August and September 2023. Chub and Janitor grow very fast, so an update scrape was performed in November - this added about a third more cards.
Data was stored in a MySQL database for no particular reason. This decision proved beneficial down the road.
Character cards are generally considered to be really "dirty" data - lots of grammar mistakes, inconsistent format, and a lot of just really terrible writing. So that meant only one thing - manual cleaning.
The total dataset after scraping ended up being just over 140k records.
## Pre-filtering
Let's define "bad card" for this step - it doesn't mean a card that is poorly written! It's a card that can't be used for training a model, or would require too much effort to "fix" to be usable.
To cut down as many definitely-bad cards as possible, a series of SQL scripts was used. They discarded cards that were either broken (no name, no description AND no scenario, etc.), were in Spanish, were definitely not in plaintext (had lots of `[` or `+` symbols), or had a very low or exceedingly large total token count (there are 5 cards that are literally the entire Bee Movie).
Then the data was deduplicated: the set was sorted per field (scenario, description, dialogue examples, ...), the Levenshtein distance between items n and n-1 was calculated and, for each duplicate pair, the card with the lower id (if numeric) or the earlier creation date was discarded.
Adjusting the Levenshtein distance threshold made it possible to find near-duplicate cards that differ by just minor edits.
After this filtering, the total set was cut down to just 16k cards.
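The sorted-neighbor deduplication can be sketched in plain Python as below. This is an illustration, not the original pipeline (which ran against MySQL): the card layout (`dict` with a numeric `id`), the function names and the `max_distance` parameter are assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def find_near_duplicates(cards, field, max_distance):
    """Return the ids of cards to discard for one field.

    Cards are sorted by the field's text so that near-identical texts end up
    next to each other; only neighbors n-1 and n are compared.
    """
    ordered = sorted((c for c in cards if c.get(field)), key=lambda c: c[field])
    discard = set()
    for prev, cur in zip(ordered, ordered[1:]):
        if levenshtein(prev[field], cur[field]) <= max_distance:
            # Of the two duplicates, the one with the lower id is discarded.
            discard.add(min(prev["id"], cur["id"]))
    return discard
```

Comparing only neighbors keeps the number of distance computations linear in the number of cards, at the price of missing duplicates whose texts differ near the beginning and therefore sort far apart.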
## Minor tweaks
At this step, a cursory manual review of all cards was performed. Barely any proofreading was done here. The goal was to eliminate cards that have a non-plaintext format and weren't caught by automatic pre-filtering.
Name adjustment was also performed at this step: some cards include the character's profession or title in the name, like "Dave the butler" or "Jon Snow, King in the North", and some had additional info about the character, like "Roxanne | submissive vampire", or emojis.
Here, it was easy to spot and remove non-English cards as well.
## Data grading
A manual read-through was performed for all the cards that made it to this step.
A [custom tool](https://github.com/kubernetes-bad/grader) was written for this step. It had a mobile interface, text-to-speech capability and support for Nintendo Joy-Con for no-eyes-on-screen grading.
A card could be graded "good", "bad" or "to fix". "To fix" means that the card would be graded as "good", but has minor issues that would likely not be picked up by the grammar correction pipeline.
Here are the card selection criteria:
- Plaintext cards only - other formats are easy to derive from plaintext content, but going the other way around is problematic.
- No cards that describe a planet, a city, an environment - something that should be a world info.
- No cards that are a tool. Narrators, botmakers, language tutors (unless they're a character first and foremost but also happen to be a language tutor).
- No very-non-humanoid cards. Wi-fi routers not allowed. Catgirls = ok. Even anthropomorphic cacti are okay.
- No cards that are a straight-up copy-paste from a fan wiki or Wikipedia.
- No cards that are the entire Bee Movie.
- No cards that reference a movie or setting in way too much detail to be useful. Example - omegaverse. Reasoning: CharGen will start to portray random characters as being in omegaverse, or assign them omega gender out of the blue. Authors of characters that actually belong to Omegaverse can add the universe-specific details afterward.
- No obviously GPT-generated cards, as identified by GPTisms - that wastes tokens and is just bad. GPT3.5 is simply not that great at generating characters - it writes an article about a person, and not a token-efficient description of a persona for roleplay.
Extra care was taken to NOT remove any cards based on the author's own ethical perspective. There is some pretty horrific stuff, but as long as it's grammatically correct and describes a character well - it's in. That is one of the reasons the dataset will probably not be released.
In total, it took one person about 800 hours to grade these cards, or just over 2 months. That was not exciting.
## Grammar pipeline
Many grammar-correction methods were evaluated. The best result - by far - was achieved with a combination pipeline consisting of Coedit models with the addition of a Llama2-based model.
Coedit is based on the T5 architecture - it barely ever hallucinates, but likes to remove large portions of text. Llama2, on the other hand, barely ever removes data, but likes to invent new details that weren't present in the original. Balancing the two gave very high performance, at the cost of inference time.
Here are efficacy numbers for the models:
- coedit-xxl: 90%
- coedit-xl: 85%
- coedit-large: 80%
- tostino/inkbot: not measured
Inkbot hallucinated noticeably more often than T5-based models (still impressively little!), so that meant only one thing - all of its outputs needed manual review.
Taken individually, those models already demonstrate quite impressive numbers, but when they are simply daisy-chained, the total pipeline efficacy goes up to 92%.
It seems that different variants of Coedit tend to make mistakes in different spots, so what one variant misses, another will most likely catch.
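One way to picture the daisy-chaining is the sketch below. It is an assumption about the control flow, not the actual pipeline code: `run_chain`, `correctors` (one callable per model) and `accept` (a check that decides whether a stage's output is kept) are all hypothetical names.

```python
def run_chain(text, correctors, accept):
    """Pass text through each corrector in order.

    A stage's output replaces the current text only if `accept(before, after)`
    returns True; otherwise the text moves on to the next stage unchanged.
    """
    for correct in correctors:
        candidate = correct(text)
        if accept(text, candidate):
            text = candidate
    return text


# Toy stand-ins for real models: one fixes a typo, one wipes the text.
fix_typo = lambda t: t.replace("dont", "don't")
wipe = lambda t: ""
keeps_most_text = lambda before, after: len(after) >= len(before) - 5

print(run_chain("I dont know why", [fix_typo, wipe], keeps_most_text))
```

Because a rejected stage leaves the text as it was, a later model still gets a chance at the mistakes an earlier one missed or mangled.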
#### Here's how efficacy was calculated:
The diff-match-patch library was used to compare the original text to the grammar-corrected one. That library's diff function produces a list of additions and deletions that, if applied to the original text, would produce the edited text.
Several metrics are then calculated from that diff to determine whether a grammar correction is accepted or rejected. Each metric has a limit, shown in parentheses:
- longest deletion length in the whole text (40 characters)
- longest addition length in the whole text (50 characters)
- maximum number of spaces in any deleted segment (4)
- maximum number of spaces in any added segment (3)
If any of those limits is exceeded, the edit is considered invalid. Deletion and addition limits are not set to the same value because the overwhelming majority of edits are made by Coedit variants, which do not really add new text but prefer to remove it. Inkbot, on the other hand, likes to add new data to the text it produces.
The goal of grammar correction is to minimize irrelevant edits and only allow grammatical changes, so large removals or additions are considered invalid.
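The acceptance check can be sketched as follows. Note the swap: this sketch uses Python's standard-library `difflib.SequenceMatcher` instead of diff-match-patch, so the exact segments (and borderline decisions) may differ from the original pipeline. The four limits are the ones listed above; the constant and function names are hypothetical.

```python
from difflib import SequenceMatcher

# Limits from the list above.
MAX_DELETION_CHARS = 40
MAX_ADDITION_CHARS = 50
MAX_DELETION_SPACES = 4
MAX_ADDITION_SPACES = 3


def edit_is_acceptable(original, corrected):
    """Return True if no single deleted or added segment exceeds the limits."""
    matcher = SequenceMatcher(None, original, corrected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        deleted = original[i1:i2]  # empty for pure insertions
        added = corrected[j1:j2]   # empty for pure deletions
        if len(deleted) > MAX_DELETION_CHARS or deleted.count(" ") > MAX_DELETION_SPACES:
            return False
        if len(added) > MAX_ADDITION_CHARS or added.count(" ") > MAX_ADDITION_SPACES:
            return False
    return True
```

Counting spaces is a cheap proxy for "how many words were touched at once": a real grammar fix rarely rewrites more than a few consecutive words, while a hallucinated sentence or a dropped paragraph does.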
## Typical mistakes
Character cards in the dataset were created at different times - from the dawn of roleplay with AI to the present. Knowledge of how to craft a good card and good practices of character design changed over that time.
Some mistakes, however, seem to have been replicated without much understanding of the underlying mechanics.
Here are some typical problems with text in character cards:
- `{{char}} is Alice` renders into "Alice is Alice"
- `the {{user}}` results in "the Greg"
- Using both "you" and "{{user}}" - results in the model talking to 3 people: character, "You" and Greg.
- Also excessive usage of word also, also also.
- Short sentences that all start with character name or {{char}}. {{char}} is short. {{char}} is a boy. {{char}} likes milk.
- Unbalanced `"quotes"` and `*emphasis*` - `*He said, "Hi there!"`
These are really hard to catch with grammar correction (they're grammatically correct), and rewriting the card would lose details from it or hallucinate new ones.
It meant only one thing - all fixes for these mistakes had to be done manually. It was a lot of work.
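Fixing had to be manual, but spotting candidates for review can be partly automated. The sketch below is a hypothetical helper, not part of the original tooling: it flags a few of the patterns listed above with simple regular expressions and will produce false positives (for example, "you" inside quoted speech).

```python
import re


def find_card_issues(text, name):
    """Return heuristic warnings for one card field; candidates for manual review only."""
    issues = []
    # "{{char}} is Alice" would render as "Alice is Alice".
    if re.search(r"\{\{char\}\}\s+is\s+" + re.escape(name) + r"\b", text):
        issues.append("char-is-name")
    # "the {{user}}" would render as "the Greg".
    if re.search(r"\bthe\s+\{\{user\}\}", text, flags=re.IGNORECASE):
        issues.append("article-before-user")
    # Addressing both "you" and {{user}} suggests a third person to the model.
    if "{{user}}" in text and re.search(r"\byou\b", text, flags=re.IGNORECASE):
        issues.append("you-and-user")
    # An odd number of quote or asterisk characters on a line means a pair is left open.
    if any(line.count('"') % 2 or line.count("*") % 2 for line in text.splitlines()):
        issues.append("unbalanced-markup")
    return issues
```

The issue labels are arbitrary strings; anything flagged still needs a human to decide how to rewrite it.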
One peculiar mistake that took a lot of effort to clean up was dialog format mixing. Historically, there are just two dialog formats - Markdown and Novel.
Markdown uses asterisks to denote actions (`*She touches his hand gently.*`) and the actual speech is everything else.
Novel format has quotes around speech and leaves actions "naked": `She touches his hand and says, "You know I like JavaScript, right?"`
The mixed format, `*She says,* "Promise!"`, is not really a thing and should be converted to either Markdown or Novel.
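A rough way to tell the three cases apart for a single line is sketched below. The function name and the returned labels are hypothetical, and the rule is only a heuristic: a Markdown line whose speech happens to quote a word would be reported as mixed.

```python
import re


def dialog_format(line):
    """Classify one line as 'markdown', 'novel', 'mixed' or 'plain'."""
    has_action = re.search(r"\*[^*]+\*", line) is not None   # *asterisk-wrapped action*
    has_quote = re.search(r'"[^"]+"', line) is not None      # "quoted speech"
    if has_action and has_quote:
        return "mixed"
    if has_action:
        return "markdown"
    if has_quote:
        return "novel"
    return "plain"


print(dialog_format('*She says,* "Promise!"'))
```

Lines reported as `mixed` are the ones that need converting to either Markdown or Novel.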
## Base model
CharGen v1 was trained as a LoRA and then merged into Airoboros 2.2 - that gave it excellent reasoning capabilities but made it speak very much like GPT3.5, with all its typical GPTisms.
Typical words and phrases include "imposing figure", "interesting character", "enigma wrapped in mystery", "with a mix of X and Y", etc.
Since v2 is based on Mistral 7b, a new base model was needed. Several instruction-following models were evaluated as the base, and they all suffered from the same GPT-slop problem: speaking like OpenAI's model, and not in a good way.
There was no 7b model in existence that was good at instruction following and general reasoning and wasn't trained on GPT3.5-derived datasets. That meant only one thing - time to make a new model.
That's how [Good Robot](https://huggingface.co/kubernetes-bad/good-robot) was born. With the help of [Gryphe](https://huggingface.co/Gryphe) from [MinervaAI](https://huggingface.co/MinervaAI), a de-slop DPO dataset was generated.
Good Robot was first trained on the amazing [no-robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) dataset and then had a round of DPO training that mostly eliminated GPT slop from the model. It did not get rid of it completely - most likely Mistral (the very base model) has seen some data in its pre-training that was generated by GPT3.5, but the amount of slop left after DPO is quite negligible.
## Rating of variants
There were several release candidate models, and to find the best one an LLM-as-a-judge pipeline was created.
First, a standard set of 500 character prompts was generated. Then, each model variant was tasked with generating a character for each prompt. Afterward, a larger, 70b model was used to rate each character 10 times on a scale of 0-5, and the average was taken as the grade for the character.
Averaging the grades over all 500 characters gave the grade for the variant.
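The aggregation is a mean of per-character means. A tiny sketch, where `judge` is a hypothetical callable standing in for the 70b model and returning one 0-5 score per call:

```python
from statistics import mean


def grade_variant(characters, judge, runs=10):
    """Grade a model variant from its generated characters.

    Each character is rated `runs` times; the per-character grade is the mean
    of those ratings, and the variant grade is the mean over all characters.
    """
    per_character = [mean(judge(c) for _ in range(runs)) for c in characters]
    return mean(per_character)
```

Rating each character several times smooths out the judge's sampling noise before the characters are averaged together.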
At some point, as an experiment, CharGen was trained on completely different bases, like [Kunoichi](https://huggingface.co/SanjiWatsuki/Kunoichi-DPO-v2-7B) and [Fett-uccine](https://huggingface.co/Epiculous/Fett-uccine-7B) and those variants were also graded.
Surprisingly, the highest-scoring variant was based on Fett-uccine. A short investigation pointed to the [Theory of Mind](https://huggingface.co/datasets/grimulkan/theory-of-mind) dataset as the reason for the high grade.
Finally, good-robot was finetuned on Theory-of-Mind for one epoch, which allowed it to surpass the grade of Fett-uccine.
## Prompting
CharGen v1 generated the whole character all at once. While convenient, this can promote repetition; it was also nearly impossible to regenerate just a particular field (for example, when you don't like the First Message while everything else is fine), so for v2 a conversational style was chosen.
It now generates just one field of the character card at a time. This allows CharGen to be used as an AI built into character editors. There are far fewer repetition issues, and partial regenerations are a breeze.
Initially, Alpaca was used as the conversational format, but after a lot of experimentation ChatML was chosen instead. It completely eliminates field confusion, where the model generates a field other than the one requested (asked for Scenario, got First Message, for example); loss curves are noticeably more stable; and there are no problems with extra spaces and newlines, as is often the case with Alpaca.
## Inference app
CharGen v2 had 4 release candidate models right before release, but just one needed to be selected. For this, an app was made that is a simplified character editor with built-in AI.
Characters are stored only in your browser, prompts aren't stored long-term, and there are no options for payment.
Another purpose of the app is to accumulate human feedback data for future iterations of the model, so thumbs up/down buttons were added. (Prompts that receive a reaction are actually stored long-term, but are still anonymized.)
The app is accessible publicly, with no limitations and while supplies last (fp16 inference costs money, after all).
## Licensing
License: `Apache 2.0`
## Thanks
- [Gryphe](https://huggingface.co/Gryphe) for help with base model and infinite advice
- Grimulkan for amazing [Theory of Mind](https://huggingface.co/datasets/grimulkan/theory-of-mind) dataset
- Testers for testing the model (@lumi, @hushpiper, @[jeiku](https://huggingface.co/jeiku), @DL, @lucy)