Eurus-2-7B-PRIME is trained using the PRIME (Process Reinforcement through IMplicit rEward) method, an open-source solution for online reinforcement learning (RL) with process rewards, to advance the reasoning abilities of language models.
Readme
- Quantization from fp32
- Using i-matrix calibration_datav3.txt
Eurus-2-7B-PRIME is trained using the PRIME (Process Reinforcement through IMplicit rEward) method, an open-source solution for online reinforcement learning (RL) with process rewards, to advance the reasoning abilities of language models beyond imitation or distillation. It starts from Eurus-2-7B-SFT and is trained on Eurus-2-RL-Data.
System Prompt
When tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.
[ASSESS]
[ADVANCE]
[VERIFY]
[SIMPLIFY]
[SYNTHESIZE]
[PIVOT]
[OUTPUT]
You should strictly follow the format below:
[ACTION NAME]
# Your action step 1
# Your action step 2
# Your action step 3
...
Next action: [NEXT ACTION NAME]
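If the model follows the format in the system prompt above, its response can be split into action blocks for inspection or logging. The following is a minimal sketch and not part of the model's tooling: the `parse_actions` helper and the sample response are hypothetical, and real output may deviate from the format.

```python
import re

# Action names listed in the system prompt above.
ACTIONS = {"ASSESS", "ADVANCE", "VERIFY", "SIMPLIFY", "SYNTHESIZE", "PIVOT", "OUTPUT"}

HEADER = re.compile(r"^\[([A-Z]+)\]\s*$")
NEXT = re.compile(r"^Next action:\s*\[([A-Z]+)\]\s*$")


def parse_actions(text):
    """Split a response into a list of {"action", "steps", "next"} dicts."""
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        nxt = NEXT.match(line)
        if header and header.group(1) in ACTIONS:
            # A known action header starts a new block.
            current = {"action": header.group(1), "steps": [], "next": None}
            blocks.append(current)
        elif current is None:
            continue  # ignore any text before the first action header
        elif nxt:
            current["next"] = nxt.group(1)
        elif line:
            current["steps"].append(line)
    return blocks


# Hypothetical response, written by hand to match the prompt's format.
sample = """[ASSESS]
# The problem asks for 2 + 2.
Next action: [ADVANCE]

[ADVANCE]
# 2 + 2 = 4.
Next action: [OUTPUT]

[OUTPUT]
The answer is 4.
"""

blocks = parse_actions(sample)
for block in blocks:
    print(block["action"], "->", block["next"])
```

The parser keeps each block's lines in order and records the declared next action, so a caller can check that the chain of actions ends in `[OUTPUT]`.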