qwen3-next:80b-a3b-thinking-fp16

369.8K Downloads Updated 3 months ago

The first installment in the Qwen3-Next series with strong performance in terms of both parameter efficiency and inference speed.

tools thinking cloud 80b

ollama run qwen3-next:80b-a3b-thinking-fp16

curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3-next:80b-a3b-thinking-fp16",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

from ollama import chat

response = chat(
    model='qwen3-next:80b-a3b-thinking-fp16',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)
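
For scripts that cannot install the `ollama` package, the same chat request can be sketched with the Python standard library alone. This is a minimal sketch, not the official client: the helper names `build_request`, `split_reply`, and `ask` are hypothetical, and it assumes the local server accepts the `stream` and `think` fields on `/api/chat` and returns the reasoning trace under `message.thinking` (older server versions may not).

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl example
MODEL = "qwen3-next:80b-a3b-thinking-fp16"


def build_request(prompt, think=True):
    """Build a non-streaming /api/chat request for a single user message."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        "think": think,   # assumption: the server version supports this field
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reply(body):
    """Return (thinking, content) from a decoded /api/chat response body."""
    message = body.get("message", {})
    return message.get("thinking", ""), message.get("content", "")


def ask(prompt):
    """Send the prompt to a locally running server and return (thinking, content)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return split_reply(json.load(resp))
```

With a server running locally, `thinking, answer = ask("Hello!")` would return the reasoning trace and the final answer as separate strings.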

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'qwen3-next:80b-a3b-thinking-fp16',
  messages: [{role: 'user', content: 'Hello!'}],
})
console.log(response.message.content)

Details

e98f2371bd81 · 159GB

model

arch qwen3next

parameters 79.7B

quantization F16

159GB
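
The listed size is consistent with the parameter count and quantization above. As a rough check, assuming every parameter is stored as a 2-byte F16 value, that sizes are quoted in decimal gigabytes, and that file metadata is negligible:

```python
PARAMS = 79.7e9          # "parameters 79.7B" from the details listing
BYTES_PER_PARAM_F16 = 2  # a 16-bit float occupies two bytes


def weight_size_gb(params, bytes_per_param):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


size = weight_size_gb(PARAMS, BYTES_PER_PARAM_F16)
print(f"{size:.1f} GB")  # prints "159.4 GB"
```

This is only an estimate of the download and storage footprint; memory needed at run time also depends on context length and runtime overhead.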

template

{{- $lastUserIdx := -1 -}} {{- range $idx, $msg := .Messages -}} {{- if eq $msg.Role "user" }}{{ $la

1.5kB

params

{ "repeat_penalty": 1, "stop": [ "<|im_start|>", "<|im_end|>" ], "te

120B

license

Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR US

11kB

Readme

Qwen3-Next-80B-A3B is the first installment in the Qwen3-Next series and features the following key enhancements:

Hybrid Attention: Replaces standard attention with a combination of Gated DeltaNet and Gated Attention, enabling efficient modeling of ultra-long contexts.
High-Sparsity Mixture-of-Experts (MoE): Achieves an extremely low activation ratio in MoE layers, drastically reducing FLOPs per token while preserving model capacity.
Stability Optimizations: Includes techniques such as zero-centered and weight-decayed layernorm, and other stabilizing enhancements for robust pre-training and post-training.
Multi-Token Prediction (MTP): Boosts pretraining model performance and accelerates inference.
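
To make the high-sparsity MoE point concrete, the sketch below illustrates generic top-k expert routing and the resulting activation ratio. It is a toy illustration, not the model's actual router: the expert count, `k`, the logits, and the helper names are hypothetical, and the 3B-of-80B figures are read from the "80b-a3b" name on the assumption that "a3b" denotes roughly 3B activated parameters.

```python
import math


def activation_ratio(active_params, total_params):
    """Fraction of parameters that take part in processing one token."""
    return active_params / total_params


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of only the k selected experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Toy setup: 8 experts that each scale their input by a different factor.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.3, 1.2]

chosen = route_top_k(logits, k=2)       # only experts 4 and 1 are evaluated
output = moe_forward(1.0, experts, logits, k=2)
ratio = activation_ratio(3e9, 80e9)     # assumed from the "80b-a3b" name
print(chosen, output, f"{ratio:.2%}")
```

Because only the selected experts run, compute per token scales with the activated parameters rather than the full parameter count, which is the efficiency the README attributes to the low activation ratio.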