135 Downloads Updated 9 months ago
bbdb258071c3 · 2.6GB
fp32calibration_datav3.txt
Eurus-2-7B-PRIME is trained with PRIME (Process Reinforcement through IMplicit rEward), an open-source method for online reinforcement learning (RL) with process rewards that aims to advance the reasoning abilities of language models beyond imitation or distillation. It starts from Eurus-2-7B-SFT and is trained on Eurus-2-RL-Data.
System Prompt
When tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.
[ASSESS]
[ADVANCE]
[VERIFY]
[SIMPLIFY]
[SYNTHESIZE]
[PIVOT]
[OUTPUT]
You should strictly follow the format below:
[ACTION NAME]
# Your action step 1
# Your action step 2
# Your action step 3
...
Next action: [NEXT ACTION NAME]
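
Because the system prompt asks for a fixed layout (an action name in brackets, a few step lines, then a "Next action:" line), a response that follows it can be split back into its actions with a few lines of Python. The sketch below is only an illustration under that assumption: the function name `parse_actions` and the returned dictionary keys are hypothetical and not part of the model or any library, and real responses may deviate from the format.

```python
import re

# Action names taken from the system prompt above.
ACTIONS = {"ASSESS", "ADVANCE", "VERIFY", "SIMPLIFY", "SYNTHESIZE", "PIVOT", "OUTPUT"}

# A line consisting only of "[ACTION NAME]".
HEADER = re.compile(r"^\[([A-Z]+)\]\s*$")
# A line of the form "Next action: [NEXT ACTION NAME]".
NEXT = re.compile(r"^Next action:\s*\[([A-Z]+)\]\s*$")


def parse_actions(text):
    """Split a response into blocks: action name, step lines, next action.

    Hypothetical helper: assumes the response follows the prompt's format.
    Text before the first recognized action header is ignored.
    """
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        nxt = NEXT.match(line)
        if header and header.group(1) in ACTIONS:
            # Start a new action block.
            current = {"action": header.group(1), "steps": [], "next": None}
            blocks.append(current)
        elif current is None:
            continue
        elif nxt:
            # Record the announced follow-up action.
            current["next"] = nxt.group(1)
        elif line:
            # Any other non-empty line is treated as a step of the action.
            current["steps"].append(line)
    return blocks


if __name__ == "__main__":
    sample = (
        "[ASSESS]\n"
        "# Identify the unknowns\n"
        "# List the constraints\n"
        "Next action: [ADVANCE]\n"
    )
    for block in parse_actions(sample):
        print(block["action"], len(block["steps"]), block["next"])
```

A parser like this can help when logging or checking which actions a response used; it does not validate that the steps themselves are correct.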