
Smaug-Qwen2-72B-Instruct


  • Quantized from fp32 weights
  • i-matrix quantization, calibrated with calibration_datav3.txt

Introduction

We introduce the latest model in the Smaug series, a finetune of Qwen2-72B-Instruct.

Compared to Qwen2-72B-Instruct, Smaug has better BBH, LiveCodeBench, and Arena-Hard scores (see evaluation results below).

How to use

The prompt format is unchanged from Qwen2-72B-Instruct.
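
Qwen2-72B-Instruct uses the ChatML format, so the rendered prompt for the example messages in the snippet below looks like this:

<|im_start|>system
You are a pirate chatbot who always responds in pirate speak!<|im_end|>
<|im_start|>user
Who are you?<|im_end|>
<|im_start|>assistant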

Use with transformers

See the snippet below for usage with Transformers:

import transformers
import torch

model_id = "abacusai/Smaug-Qwen2-72B-Instruct"

# Load the model in bfloat16 and shard it across all available GPUs.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Stop at Qwen2's end-of-turn token; <|endoftext|> is a conservative fallback.
terminators = [
    pipeline.tokenizer.eos_token_id,  # <|im_end|> for Qwen2 Instruct models
    pipeline.tokenizer.convert_tokens_to_ids("<|endoftext|>"),
]

# Sample a completion; the pipeline echoes the prompt, so slice it off.
outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])
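
In bfloat16 the 72B model needs roughly 145 GB of accelerator memory. Where that is impractical, one option is on-the-fly 4-bit quantization through Transformers' bitsandbytes integration. The sketch below is illustrative only; these settings are unrelated to the i-matrix GGUF quantization noted at the top of this card.

import transformers
import torch

model_id = "abacusai/Smaug-Qwen2-72B-Instruct"

# Illustrative NF4 4-bit settings; memory/quality trade-offs vary.
quant_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"quantization_config": quant_config},
    device_map="auto",
)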

Evaluation Results

Big-Bench Hard (BBH)

Note: these results use the corrected BBH answer parsing from Eleuther's lm-evaluation-harness (see this PR). A reproduction sketch with the harness follows the breakdown tables below.

Overall:

| Model | Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|---|
| Smaug-Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± 0.0042 |
| Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8036 | ± 0.0044 |

Breakdown:

Smaug-Qwen2-72B-Instruct:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bbh | N/A | get-answer | 3 | exact_match | 0.8241 | 0.0042 |
| - bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6578 | 0.0348 |
| - bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
| - bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8280 | 0.0239 |
| - bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3360 | 0.0299 |
| - bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7120 | 0.0287 |
| - bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.5320 | 0.0316 |
| - bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9880 | 0.0069 |
| - bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.7680 | 0.0268 |
| - bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.5360 | 0.0316 |
| - bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
| - bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
| - bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
| - bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
| - bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.8493 | 0.0297 |
| - bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.7560 | 0.0272 |
| - bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8520 | 0.0225 |
| - bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5920 | 0.0311 |
| - bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.9101 | 0.0215 |
| - bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
| - bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9800 | 0.0089 |
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9560 | 0.0130 |
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6560 | 0.0301 |

Qwen2-72B-Instruct:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bbh | N/A | get-answer | 3 | exact_match | 0.8036 | 0.0044 |
| - bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6684 | 0.0345 |
| - bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
| - bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
| - bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3040 | 0.0292 |
| - bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7480 | 0.0275 |
| - bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.4960 | 0.0317 |
| - bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
| - bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.6800 | 0.0296 |
| - bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.4720 | 0.0316 |
| - bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
| - bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.7800 | 0.0263 |
| - bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9760 | 0.0097 |
| - bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9520 | 0.0135 |
| - bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9480 | 0.0141 |
| - bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.5753 | 0.0410 |
| - bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.8120 | 0.0248 |
| - bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8760 | 0.0209 |
| - bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5880 | 0.0312 |
| - bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.8764 | 0.0247 |
| - bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9080 | 0.0183 |
| - bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 0.9960 | 0.0040 |
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9160 | 0.0176 |
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9400 | 0.0151 |
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
| - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6680 | 0.0298 |
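
For reference, a run along these lines can be reproduced with the lm-evaluation-harness Python API. This is a minimal sketch: the bbh_cot_fewshot group name is an assumption and, like the filters applied, depends on the harness version.

import lm_eval

# 3-shot chain-of-thought BBH; per-task exact_match scores are returned
# in results["results"], keyed by task name as in the tables above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=abacusai/Smaug-Qwen2-72B-Instruct,dtype=bfloat16",
    tasks=["bbh_cot_fewshot"],
    num_fewshot=3,
)
print(results["results"])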

LiveCodeBench

| Model | Pass@1 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
|---|---|---|---|---|
| Smaug-Qwen2-72B-Instruct | 0.3357 | 0.7286 | 0.1633 | 0.0000 |
| Qwen2-72B-Instruct | 0.3139 | 0.6810 | 0.1531 | 0.0000 |
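
Pass@1 is assumed here to be the standard unbiased pass@k estimator (Chen et al., 2021) evaluated at k = 1 and averaged over problems; a minimal sketch:

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k: the probability that at least one of k samples
    # drawn (without replacement) from n generations is correct, given
    # that c of the n generations pass the tests.
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

At k = 1 this reduces to c / n, the fraction of a problem's samples that pass.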

Arena-Hard

Score vs. selected others, sourced from the Arena-Hard leaderboard (https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, so we produced those numbers from a local run using the same methodology.

| Model | Score | 95% Confidence Interval | Average Tokens |
|---|---|---|---|
| GPT-4-Turbo-2024-04-09 | 82.6 | (-1.8, 1.6) | 662 |
| GPT-4o | 78.3 | (-2.4, 2.1) | 685 |
| Gemini-1.5-pro-latest | 72.1 | (-2.3, 2.2) | 630 |
| Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
| Smaug-Llama-3-70B-Instruct | 56.7 | (-2.2, 2.6) | 661 |
| GPT-4-0314 | 50.0 | (-0.0, 0.0) | 423 |
| Smaug-Qwen2-72B-Instruct | 48.0 | (-1.8, 2.1) | 628 |
| Claude-3-Sonnet-20240229 | 46.8 | (-2.1, 2.2) | 552 |
| Qwen2-72B-Instruct | 43.5 | (-2.6, 2.7) | 531 |
| Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
| GPT-4-0613 | 37.9 | (-2.2, 2.0) | 354 |
| Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
| Mixtral-8x22B-Instruct-v0.1 | 36.4 | (-2.7, 2.9) | 430 |
| Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2) | 474 |
| Command-R-Plus | 33.1 | (-2.1, 2.2) | 541 |
| Mistral-Medium | 31.9 | (-2.3, 2.4) | 485 |
| GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0) | 401 |

MT-Bench

First turn

| Model | Turn | Score |
|---|---|---|
| Qwen2-72B-Instruct | 1 | 9.18125 |
| Smaug-Qwen2-72B-Instruct | 1 | 9.05625 |

Second turn

| Model | Turn | Score |
|---|---|---|
| Qwen2-72B-Instruct | 2 | 8.74684 |
| Smaug-Qwen2-72B-Instruct | 2 | 8.67500 |

Average

| Model | Score |
|---|---|
| Qwen2-72B-Instruct | 8.96541 |
| Smaug-Qwen2-72B-Instruct | 8.86563 |