---
license: apache-2.0
datasets:
- lars1234/story_writing_benchmark
base_model:
- mistralai/Mistral-Small-24B-Instruct-2501
---
Mistral-Small-24B-Instruct-2501-writer is a fine-tuned version of mistralai/Mistral-Small-24B-Instruct-2501, optimized specifically for creative writing tasks.
The following table was generated by creating 568 stories based on the same prompts as in the lars1234/story_writing_benchmark dataset and then evaluating them using the benchmark’s evaluator models.
| Metric | Mistral-2501 | Mistral-Writer | Gemma-Ataraxy |
|---|---|---|---|
| Grammar & Spelling | 82.1% | 83.3% | 88.8% |
| Clarity | 63.0% | 64.1% | 65.8% |
| Logical Connection | 57.7% | 64.1% | 66.0% |
| Scene Construction | 56.1% | 62.0% | 64.1% |
| Internal Consistency | 67.2% | 73.1% | 75.1% |
| Character Consistency | 50.7% | 54.0% | 54.3% |
| Character Motivation | 44.6% | 49.8% | 49.2% |
| Sentence Variety | 57.7% | 64.4% | 64.0% |
| Avoiding Clichés | 24.6% | 33.3% | 31.2% |
| Natural Dialogue | 42.9% | 51.9% | 48.3% |
| Avoiding Tropes | 28.6% | 37.4% | 40.0% |
| Character Depth | 35.7% | 46.4% | 45.4% |
| Character Interactions | 45.0% | 52.0% | 51.7% |
| Reader Interest | 54.1% | 63.1% | 63.0% |
| Plot Resolution | 35.3% | 45.3% | 44.9% |
| Average | 49.3% | 56.5% | 56.1% |
Mistral-Small-24B-Instruct-2501-writer outperforms the base Mistral model across all metrics. Gemma-2-Ataraxy still scores higher in some creativity-related categories, for example “Avoiding Tropes.”
The model was fine-tuned using Direct Preference Optimization (DPO), which requires pairs of responses where one is preferred over the other. The pairs were created from the lars1234/story_writing_benchmark dataset using two approaches:
The final JSONL dataset contained these pairs in the format:

```json
{"prompt": "Write a story about...", "chosen": "High quality story text...", "rejected": "Lower quality story text..."}
```
See this script for the code.
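As a rough illustration (not the author's actual script), pairs in this format can be assembled by ranking stories per prompt with their evaluator scores and keeping best-vs-worst pairs; the `stories_by_prompt` structure, the scores, and the `min_gap` threshold below are all hypothetical:

```python
import json

# Hypothetical input: stories grouped by prompt, each with an evaluator score.
stories_by_prompt = {
    "Write a story about a lighthouse keeper.": [
        {"text": "High quality story text...", "score": 0.72},
        {"text": "Lower quality story text...", "score": 0.41},
    ],
}

def build_dpo_pairs(stories_by_prompt, min_gap=0.1):
    """Pair the best and worst story per prompt when the score gap is large enough."""
    pairs = []
    for prompt, stories in stories_by_prompt.items():
        ranked = sorted(stories, key=lambda s: s["score"], reverse=True)
        best, worst = ranked[0], ranked[-1]
        if best["score"] - worst["score"] >= min_gap:
            pairs.append({"prompt": prompt,
                          "chosen": best["text"],
                          "rejected": worst["text"]})
    return pairs

# Write one JSON object per line, matching the JSONL format above.
with open("dpo_pairs.jsonl", "w") as f:
    for pair in build_dpo_pairs(stories_by_prompt):
        f.write(json.dumps(pair) + "\n")
```

The score-gap filter is one plausible way to avoid near-tie pairs, which give DPO a weak preference signal.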
The model was fine-tuned using Axolotl with the following parameters:
A grid search was performed over inference parameters to find optimal generation settings:
- `min_p`: 0.05 (fixed)
- `temperature`: 0.5, 0.75, 1.0, 1.25
The largest quality improvement was observed when increasing temperature from 0.5 to 0.75; beyond that point, other quality aspects began to suffer.
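The sweep above can be sketched as follows. This is a minimal outline, not the actual benchmark code: `generate_story` and `evaluate_story` are hypothetical stand-ins for the real generation and evaluator-model calls.

```python
# Hypothetical stand-ins for the actual generation and evaluation calls.
def generate_story(prompt, temperature, min_p):
    return f"story for {prompt!r} at T={temperature}"

def evaluate_story(story):
    return 0.5  # evaluator-model score, stubbed for illustration

def grid_search(prompts, temperatures, min_p=0.05):
    """Average evaluator score per temperature, with min_p held fixed."""
    results = {}
    for temp in temperatures:
        scores = [evaluate_story(generate_story(p, temp, min_p)) for p in prompts]
        results[temp] = sum(scores) / len(scores)
    return results

results = grid_search(["Write a story about..."], [0.5, 0.75, 1.0, 1.25])
best_temperature = max(results, key=results.get)
```

With real generation and scoring plugged in, `best_temperature` would surface the kind of 0.5-to-0.75 gain described above.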