Meta Llama 3 SimPO: The most powerful <10B LLM to date on chatbot leaderboards, from Princeton-NLP


CLI

Open a terminal and run: ollama run r3m8/llama3-simpo
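If you prefer to call the model from code instead of the terminal, here is a minimal sketch using the official ollama Python client (this assumes `pip install ollama` and a locally running Ollama server; the prompt text is only an example):

```python
import ollama

# Chat with the locally served model through the Ollama API.
response = ollama.chat(
    model="r3m8/llama3-simpo",
    messages=[{"role": "user", "content": "Summarize SimPO in two sentences."}],
)

# The reply text is found under message.content in the response.
print(response["message"]["content"])
```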

Model quantizations

Q4_K_M, Q5_K_S and Q5_K_M are recommended by llama.cpp.
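If the registry publishes these quantizations as separate tags, a specific one can be pulled explicitly. A minimal sketch with the Python client, assuming a hypothetical q5_k_m tag for this model (check the model's tags page for the names that actually exist):

```python
import ollama

# Pull a specific quantization. The ":q5_k_m" tag is hypothetical here;
# use a tag that is actually listed for r3m8/llama3-simpo.
ollama.pull("r3m8/llama3-simpo:q5_k_m")
```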

SimPO: Simple Preference Optimization with a Reference-Free Reward

Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability.

In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further enhancing the algorithm’s performance. We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models like Mistral and Llama3.
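To make the design concrete, the sketch below shows how the length-normalized implicit reward and the target margin combine into the SimPO objective; the tensor names and the beta/gamma values are illustrative assumptions, not the paper's exact hyperparameters:

```python
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens,
               beta=2.0, gamma=1.0):
    # chosen_logps / rejected_logps: summed token log-probabilities of the
    # winning and losing responses under the policy model (shape: [batch]).
    # chosen_lens / rejected_lens: response lengths in tokens.
    # Implicit, reference-free reward: average log probability scaled by beta.
    chosen_reward = beta * chosen_logps / chosen_lens
    rejected_reward = beta * rejected_logps / rejected_lens
    # Bradley-Terry objective with a target reward margin gamma, encouraging
    # the winning response to score higher than the losing one by at least gamma.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```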

We evaluate SimPO on extensive instruction-following benchmarks, including AlpacaEval 2, MT-Bench, and the recent, challenging Arena-Hard benchmark. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Llama3-8B-Instruct, achieves a remarkable 44.7 length-controlled win rate on AlpacaEval 2 (surpassing Claude 3 Opus on the leaderboard) and a 33.8 win rate on Arena-Hard, making it the strongest 8B open-source model.

References

HuggingFace Repository

GitHub Repository

Twitter/X Announcement

WildBench Leaderboard