639 Downloads Updated 1 year ago
Updated 1 year ago
1 year ago
357b24281f25 · 7.7GB
CharGen is a model that helps you to write characters for role playing with.
It produces character description based on your input prompt, step-by-step, in a dialogue format.
In contrast to v1 which was generating the whole character all in one go, v2 generates one field at a time. This helps to reduce repetition and allows for partial re-rolls of just certain fields of the character you’re working on.
Warning: this model was trained on some NSFW content, so it may produce NSFW characters.
CharGen v2 is a project of several months of work. It’s trained on a custom non-synthetic dataset, manually curated by hand. Read below on how it came together.
It uses dialogue style for generating characters, field-by field. Fields are based on Tavern Character Card V2 spec. Following fields are supported: - Description - Scenario - Personality - First message - Dialogue examples
Model does not use {{user}}
and {{char}}
placeholders. Instead, address user as “User” and character - by their name.
Here are the prompts per field:
You are an expert in creating interesting roleplay characters.
Description:
Here is a brief overview of a character. Expand it into a detailed description. Include details about character's personality, their outfit and figure. Mention their age and gender, if applicable.
Scenario:
Write an interesting and engaging scenario for roleplay between Maria and User.
Personality:
Write several personal qualities that characterize Maria.
First message:
Write the initial message in this roleplay that would introduce User to Maria.
Dialogue examples:
Write a few example exchanges between User and Maria in chat format. Separate each exchange with a <START> tag.
CharGen was created because author (Kubernetes Bad) sucks at writing characters. It’s a tedious process and author is prone to “writer’s block”. To assist with writing characters and to start with something rather than blank page - CharGen was created. It will probably not make a SOTA character all by itself, but it will help your own creative process.
Below is the processes that went into making CharGen. Only proceed if curious.
CharGen was trained on data from Chub, Venus and JanitorAI character cards.
Chub.ai API includes cards posted on Venus. JanitorAI is not using Tavern v2 format, so does not have a lot of fields. Initial scraping performed between August and September 2023. Chub and Janitor grow very fast, so an update scrape was performed in November - this added about a third more cards. Data was stored in MySQL database for no particular reason. This decision has proven to be beneficial down the road.
Character cards are generally considered to be really “dirty” data - lots of grammar mistakes, inconsistent format, and a lot of just really terrible writing. So that meant only one thing - manual cleaning.
Total dataset after scraping ended up being just over 140k records.
Let’s define “bad card” for this step - it doesn’t mean a card that is poorly written! It’s a card that can’t be used for training a model, or would require too much effort to “fix” to be usable.
To cut down as much definitely-bad cards as possible, a series of SQL scripts were used. Those discarded cards that were either broken (no name, no description AND no scenario, etc…), were in Spanish, were definitely not in plaintext (had lots of [
or +
symbols), or had very low or exceedingly large total token count (there are 5 cards that are literally entire bee movie).
Then data was deduplicated by sorting a set per field (scenario, description, dialog example, …) and calculating a Levenshtein distance between items n and n-1 and, for each duplicate, discarding one with lower id (if numeric) or lower creation date.
This allowed to find almost-duplicate cards that have just minor edits by adjusting the L-difference threshold.
After such filtering, total set was cut down to just 16k cards.
At this step, a cursory manual review of all cards was performed. Barely any read-proofing was done at this step. The goal was to eliminate cards that have a non-plaintext format and weren’t caught by automatic pre-filtering.
Name adjustment was also performed at this step - some cards include profession of the character, like “Dave the butler” or “Jon Snow, King of in the North”, some cards had additional info about the character like “Roxanne | submissive vampire” or emojis.
Here, it was easy to spot and remove non-english cards as well.
Manual read-through was performed for all the cards that passed to this step.
A custom tool was written for this step. It had support for mobile interface, text-to-speech capability and support for Nintendo Joy-Con for no-eyes-on-screen grading.
Card could be graded “good”, “bad” or “to fix”. “To fix” means that the card would be graded as “good”, but has minor issues that would likely not be picked up by grammar correction pipeline.
Here is the card selection criteria:
Extra care was taken to NOT remove any cards based on author’s own ethical perspective. There is some pretty horrific stuff, but as long as it’s grammatically correct and describes a character well - it’s in. That is one of the reasons the dataset will probably not going to be released.
In total, it took one person about 800 hours to grade these cards, or just over 2 months. That was not exciting.
Many grammar-correction methods were evaluated. Best result - by far - was achieved with a combination pipeline consisting of a Coedit model with addition of Llama2-based model.
Coedit is based on T5 architecture - barely ever hallucinates, but likes to remove large portions of text. Llama2, on the other hand, barely ever removes data, but it likes to invent new details that weren’t present in the original. Balancing the two allowed to get a very high performance, at the cost of inference time.
Here are efficacy numbers for the models: - coedit-xxl: 90% - coedit-xl: 85% - coedit-large: 80% - tostino/inkbot: not measured
Inkbot hallucinated noticeably more often than T5-based models (still impressively little!), so that meant only one thing - all of its outputs needed manual review.
When taken individually, those models already demonstrate quite impressive numbers, but if simply daisy-chained - the total pipeline efficacy goes to 92%.
Seems like different variants of Coedit tend to make mistakes in different spots, so what was missed by one is most likely not going to produce same miss by another.
Diff-match-patch library was used to compare original text to grammar-corrected one. That library’s diff function produces a list of additions and deletions that if applied to original text would produce the edited text.
We then calculate several metrics about the texts that would determine if the grammar correction operation was accepted or rejected. - longest deletion length in the whole text (40 characters) - longest addition length in the whole text (50 characters) - maximum number of spaces in any deleted segment (4) - maximum number of spaces in any added segment (3)
If any of those metrics are exceeded - the edit is considered invalid. Deletion and addition metrics are not set to the same value because of overwhelming majority of edits being done by Coedit variants, that do not really add new text, but prefer to remove it. Inkbot, on the other hand, likes to add new data to the text produced.
Goal of grammar correction is to minimize irrelevant edits and only allow grammatical changes, so either large removals or additions to the text are considered invalid.
Character cards in the dataset were created in different times - from the dawn of roleplay with AI to current date. Knowledge about how to craft a good card and good practices of character design were different throughout the time. Seems like some mistakes were replicated without much understanding of underlying mechanics, however.
Here are some typical problems with text in character cards:
- {{char}} is Alice
renders into “Alice is Alice”
- the {{user}}
results in “the Greg”
- Using both “you” and “{{user}}” - results in model talking with 3 people: character, “You” and Greg.
- Also excessive usage of word also, also also.
- Short sentences that all start with character name or {{char}}. {{char}} is short. {{char}} is a boy. {{char}} likes milk.
- Unbalanced "quotes"
and *emphasis*
- *He said, "Hi there!"
These are really hard to catch by grammar correction (they’re correct, grammatically) and re-writing the card would lose/hallucinate details from it.
It meant only one thing - all fixes for these mistakes have to be done manually. It was a lot of work.
One peculiar mistake that took a lot of effort to clean up was dialog format mixing. Historically, there are just two dialog formats - Markdown and Novel.
Markdown uses asterisk to denote actions (*She touches his hand gently.*
) and the actual speech is everything else.
Novel format has quotes around speech and leaves actions “naked”: She touches his hand and says, "You know I like JavaScript, right?"
The mixed format, *She says,* "Promise!"
, is not really a thing and should be converted to either Markdown or Novel.
CharGen v1 was trained as a lora and then merged into Airoboros 2.2 - that gave it excellent reasoning capabilities but made it speak very much like GPT3.5 with all its typical GPTisms.
Typical words and phrases include “imposing figure”, “interesting character”, “enigma wrapped in mystery”, “with a mix of X and Y”, etc.
Since v2 was based on Mistral 7b, a need for a new base model arises. Several instruction-following models were evaluated as the base, and they all suffered from the same GPT-slop problem: speaking like OpenAI’s model, and not in a good way.
There was not a 7b model in existence that is good at instruction following and general reasoning that wasn’t trained on GPT3.5-derived datasets. That meant only one thing - time to make a new model.
That’s how Good Robot was born. With the help of Gryphe from MinervaAI, a de-slop DPO dataset was generated.
Good Robot was first trained on amazing no-robots dataset and then had a round of DPO training that mostly eliminated GPT slop from the model. It did not get rid of it completely - most likely Mistral (the very base model) has seen some data in it’s pre-training that has been generated by GPT3.5, but the amount of slop that was left after DPO is quite negligible.
There were several release candidate models in existence, and to find out the best an LLM-as-a-judge pipeline was created.
First, a standard set of 500 character prompts was generated. Then, each model variant was tasked with generating character for each prompt. Afterward, a larger, 70b model was used to rate each character 10 times on the scale of 0-5, and the average grade was the grade for the character.
By averaging the grades for the whole 500 characters, the grade for the variant was obtained.
At some point, as an experiment, CharGen was trained on completely different bases, like Kunoichi and Fett-uccine and those variants were also graded.
Surprisingly, the highest scoring variant was based on Fett-uccine. A short investigation led to Theory of Mind dataset as a culprit of high grade.
Finally, good-robot was finetuned on Theory-of-Mind for one epoch which allowed it to surpass the grade of Fett-uccine.
CharGen v1 was a model that generated the whole character all at once. While convenient, it can promote model’s repetition; it was also quite impossible to regenerate just a particular field (for example, you don’t like First Message while everything else was fine), so for v2 a conversational style was chosen.
It now generates just one field of character card at a time. This allows CharGen to be used as an AI built into character editors. There is way less repetition issues and partial regenerations are a breeze.
Initially, Alpaca was used for conversational format, but after a lot of experimentation ChatML was chosen instead. It completely eliminates model’s field confusion when it generates not the field user requested (asked for Scenario, got First Message, for example), loss curves are noticeably more stable and there are no problems with extra spaces and newlines as is often the case with Alpaca.
CharGen v2 had 4 release candidate models right before release, but just one needed to be selected. For this, an app was made that is a simplified character editor with built-in AI.
Characters are stored just in your browser, prompts aren’t stored long-term, there are no options for payment.
Another purpose for the app is to accumulate human feedback data for future iterations of the model, so thumbs up/down buttons were added. (Prompts that are reacted upon are actually stored long-term, but still anonymized)
App is accessible publicly, with no limitations and while supplies last (fp16 inference costs money, after all).
License: cc-by-nc-4.0
TL;DR: Free for you, unless you make money on it.
If you would like to use CharGen (or derivative) commercially - contact author to make a charitable donation with receipt.
CharGen is at heart an open-source project, but it is based on data that is not owned by CharGen’s author. Dataset cannot be released without data owners’ permission. Scripts, methods, tools and everything else will eventually be released under open-source licenses.
Model CharGen is available for you to do whatever you want with it, as long as you’re not using it for commercial purposes.