Model Card for Zephyr 141B-A35B

Zephyr

Zephyr is a series of language models trained to act as helpful assistants. Zephyr 141B-A35B is the latest model in the series: a fine-tuned version of mistral-community/Mixtral-8x22B-v0.1, trained with a novel alignment algorithm called Odds Ratio Preference Optimization (ORPO) on 7k instances for 1.3 hours across 4 nodes of 8 x H100 GPUs. ORPO does not require a separate SFT step to achieve high performance and is therefore much more computationally efficient than methods like DPO and PPO. Zephyr-141B-A35B was trained on the argilla/distilabel-capybara-dpo-7k-binarized preference dataset, which consists of synthetic, high-quality, multi-turn preferences scored by LLMs.
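To make the ORPO idea above concrete, here is a minimal sketch of its per-pair objective: a supervised NLL term on the chosen response plus a weighted odds-ratio penalty that pushes the chosen response's odds above the rejected one's. The function name, inputs (average per-token log-probabilities), and the `lam` default are illustrative assumptions, not the exact hyperparameters used for this model.

```python
import math


def orpo_loss(logp_chosen: float, logp_rejected: float, lam: float = 0.1) -> float:
    """Sketch of the ORPO objective for a single preference pair.

    logp_chosen / logp_rejected: the model's average per-token
    log-probabilities of the chosen and rejected responses.
    lam: weight on the odds-ratio term (illustrative default).
    """
    def log_odds(logp: float) -> float:
        # odds(y) = p / (1 - p), kept in log space for stability
        p = math.exp(logp)
        return logp - math.log(1.0 - p)

    # Log odds ratio between chosen and rejected responses
    log_or = log_odds(logp_chosen) - log_odds(logp_rejected)

    # Odds-ratio term: -log sigmoid(log odds ratio)
    l_or = -math.log(1.0 / (1.0 + math.exp(-log_or)))

    # Supervised NLL on the chosen response, plus the weighted OR term
    l_sft = -logp_chosen
    return l_sft + lam * l_or
```

Because the SFT term is folded into the same loss, a single training pass both fits the chosen responses and separates them from the rejected ones, which is why no standalone SFT stage is needed.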

Converted from: https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1