Eurus-2-7B-PRIME is trained with the PRIME (Process Reinforcement through IMplicit rEward) method, an open-source solution for online reinforcement learning (RL) with process rewards, to advance the reasoning abilities of language models.
111 Pulls · Updated 3 months ago
3ab2510cf376 · 4.7GB
model · arch qwen2 · parameters 7.62B · quantization Q4_K_M · 4.7GB
template · 255B (truncated preview)
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
<|im_start|>{{ .R
system · 384B (truncated preview)
When tackling complex reasoning tasks, you have access to the following actions. Use them as needed
license · 7.4kB
Qwen RESEARCH LICENSE AGREEMENT (Release Date: September 19, 2024)
Readme
- Quantization from fp32
- Using i-matrix calibration_datav3.txt
Eurus-2-7B-PRIME is trained with the PRIME (Process Reinforcement through IMplicit rEward) method, an open-source solution for online reinforcement learning (RL) with process rewards, to advance the reasoning abilities of language models beyond imitation or distillation. Training starts from Eurus-2-7B-SFT and uses Eurus-2-RL-Data.
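As a rough sketch of local usage, the model can be queried through the Ollama Python client once it has been pulled. The model tag and prompt below are placeholders rather than values taken from this page, and the packaged system layer (shown under System Prompt below) should apply automatically, so only a user message is sent.

```python
# Minimal sketch, assuming the `ollama` Python client is installed
# (`pip install ollama`) and a local Ollama server is running.
# "eurus-2-7b-prime" is a placeholder tag; substitute the tag you
# actually pulled this model under.
import ollama

response = ollama.chat(
    model="eurus-2-7b-prime",  # placeholder tag
    messages=[
        {"role": "user", "content": "Prove that the sum of two even integers is even."},
    ],
)

# Recent client releases return a response that supports subscript access.
print(response["message"]["content"])
```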
System Prompt
When tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.
[ASSESS]
[ADVANCE]
[VERIFY]
[SIMPLIFY]
[SYNTHESIZE]
[PIVOT]
[OUTPUT]
You should strictly follow the format below:
[ACTION NAME]
# Your action step 1
# Your action step 2
# Your action step 3
...
Next action: [NEXT ACTION NAME]
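Because replies follow this bracketed action format, they can be post-processed mechanically. Below is a small, hypothetical Python sketch (the regular expression and the `parse_actions` helper are illustrations, not part of the model card) that splits a reply into (action, steps) pairs.

```python
import re

# Hypothetical helper: split a reply that follows the
# "[ACTION NAME]" / "# step" / "Next action: [...]" format
# into a list of (action, steps) tuples.
ACTION_RE = re.compile(r"^\[([A-Z]+)\]$")

def parse_actions(reply: str):
    actions, current, steps = [], None, []
    for raw in reply.splitlines():
        line = raw.strip()
        m = ACTION_RE.match(line)
        if m:
            # New action block starts; close out the previous one.
            if current is not None:
                actions.append((current, steps))
            current, steps = m.group(1), []
        elif line.startswith("#") and current is not None:
            steps.append(line.lstrip("#").strip())
        elif line.startswith("Next action:") and current is not None:
            actions.append((current, steps))
            current, steps = None, []
    if current is not None:
        actions.append((current, steps))
    return actions

# Example reply with one [VERIFY] block followed by a next-action hint.
example = "[VERIFY]\n# Check the parity argument\nNext action: [OUTPUT]"
print(parse_actions(example))  # [('VERIFY', ['Check the parity argument'])]
```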