134 Downloads Updated 8 months ago
096bf7f628e0 · 3.1GB
fp32
calibration_datav3.txt
Eurus-2-7B-PRIME is trained with the PRIME (Process Reinforcement through IMplicit rEward) method, an open-source solution for online reinforcement learning (RL) with process rewards that aims to advance the reasoning abilities of language models beyond imitation or distillation. Training starts from Eurus-2-7B-SFT and uses Eurus-2-RL-Data.
System Prompt
When tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.
[ASSESS]
[ADVANCE]
[VERIFY]
[SIMPLIFY]
[SYNTHESIZE]
[PIVOT]
[OUTPUT]
You should strictly follow the format below:
[ACTION NAME]
# Your action step 1
# Your action step 2
# Your action step 3
...
Next action: [NEXT ACTION NAME]
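The format above lends itself to simple post-processing. The following is a minimal sketch, not part of the model's tooling, of how a response that follows this format might be split into action blocks. The function name `parse_actions` and the returned dictionary layout are hypothetical, and it assumes each action header and each "Next action:" line sits on its own line; real model output may deviate from this.

```python
import re

# Action names taken from the system prompt above.
ACTIONS = ["ASSESS", "ADVANCE", "VERIFY", "SIMPLIFY", "SYNTHESIZE", "PIVOT", "OUTPUT"]
_NAMES = "|".join(ACTIONS)

# A header line such as "[ASSESS]" and a trailer such as "Next action: [ADVANCE]".
HEADER_RE = re.compile(r"^\[(" + _NAMES + r")\]$")
NEXT_RE = re.compile(r"^Next action:\s*\[(" + _NAMES + r")\]$")


def parse_actions(text):
    """Split a response into blocks (hypothetical helper, assumed format).

    Returns a list of dicts with keys "action", "steps" and "next".
    Lines that appear before the first action header are ignored.
    """
    blocks = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = HEADER_RE.match(line)
        trailer = NEXT_RE.match(line)
        if header:
            # Start a new block for this action.
            current = {"action": header.group(1), "steps": [], "next": None}
            blocks.append(current)
        elif trailer and current is not None:
            # Record which action the model announced next.
            current["next"] = trailer.group(1)
        elif current is not None and line:
            # Any other non-empty line is treated as a step of the current action.
            current["steps"].append(line)
    return blocks


if __name__ == "__main__":
    # Hand-written sample in the documented format, not real model output.
    sample = (
        "[ASSESS]\n"
        "# Identify the unknown\n"
        "# Note the constraints\n"
        "Next action: [ADVANCE]\n"
        "\n"
        "[ADVANCE]\n"
        "# Solve for x\n"
        "Next action: [OUTPUT]\n"
        "\n"
        "[OUTPUT]\n"
        "x = 2\n"
    )
    for block in parse_actions(sample):
        print(block["action"], len(block["steps"]), block["next"])
```

Checking that each block's `next` value matches the following block's `action` is one way to detect responses that drift from the required format.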