
Qwen3-Max-Thinking: a state-of-the-art reasoning model at your fingertips!


Qwen/Qwen3-Max-Thinking

$1.20 input / $6.00 output / $0.24 cached, per 1M tokens

The latest flagship reasoning model in the Qwen3 family, further enhanced by innovations such as adaptive tool use and advanced test-time scaling techniques.

Partner · Public · 256,000-token context · JSON output · Function calling

Model Information

We present Qwen3-Max-Thinking, our latest flagship reasoning model. By scaling up model parameters and leveraging substantial computational resources for reinforcement learning, Qwen3-Max-Thinking achieves significant performance improvements across multiple dimensions, including factual knowledge, complex reasoning, instruction following, alignment with human preferences, and agent capabilities. On 19 established benchmarks, it demonstrates performance comparable to leading models such as GPT-5.2-Thinking, Claude-Opus-4.5, and Gemini 3 Pro.

We further enhance Qwen3-Max-Thinking with two key innovations: (1) adaptive tool-use capabilities that enable on-demand retrieval and code interpreter invocation; and (2) advanced test-time scaling techniques that significantly boost reasoning performance, surpassing Gemini 3 Pro on key reasoning benchmarks.

Adaptive Tool-Use Capabilities

Unlike earlier approaches that required users to manually select tools before each task, Qwen3-Max-Thinking autonomously selects and leverages its built-in Search, Memory, and Code Interpreter capabilities during conversations. This capability emerges from a focused training process: after initial fine-tuning for tool use, the model underwent further training on diverse tasks using both rule-based and model-based feedback. Empirically, we observe that the Search and Memory tools effectively mitigate hallucinations, provide access to real-time information, and enable more personalized responses. The Code Interpreter allows users to execute code snippets and apply computational reasoning to solve complex problems. Together, these features deliver a seamless and capable conversational experience.
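The autonomous tool-selection behavior described above can be pictured as an agent loop in which the model, at each step, either invokes a built-in tool or emits a final answer. The sketch below is purely illustrative: the real Qwen3-Max-Thinking tool interface is not public, and `search`, `code_interpreter`, and `toy_policy` are hypothetical stand-ins.

```python
# Minimal sketch of an adaptive tool-use loop. All tool implementations
# here are stubs; in the real system the model's own outputs drive the
# choice of Search, Memory, or Code Interpreter.

def search(query):
    # Stand-in for the built-in Search tool.
    return f"search results for: {query}"

def code_interpreter(snippet):
    # Stand-in for the built-in Code Interpreter (sandboxed in practice).
    return eval(snippet, {"__builtins__": {}})

TOOLS = {"search": search, "code_interpreter": code_interpreter}

def run_agent(model_step, prompt, max_steps=5):
    """Loop until the policy emits a final answer instead of a tool call."""
    context = [prompt]
    for _ in range(max_steps):
        action = model_step(context)           # model picks the next action
        if action["type"] == "answer":
            return action["content"]
        tool = TOOLS[action["tool"]]
        context.append(tool(action["input"]))  # feed the tool output back in
    return context[-1]

# Toy policy: route arithmetic to the code interpreter, then answer.
def toy_policy(context):
    if len(context) == 1:
        return {"type": "tool", "tool": "code_interpreter", "input": "17 * 24"}
    return {"type": "answer", "content": f"The result is {context[-1]}."}

print(run_agent(toy_policy, "What is 17 * 24?"))  # The result is 408.
```

The key point the loop illustrates is that tool choice happens per step inside the conversation, not as an upfront user configuration.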

Test-time Scaling Strategy

Test-time scaling refers to techniques that allocate additional computation during inference to improve model performance. We propose an experience-cumulative, multi-round test-time scaling strategy for the heavy mode. Instead of simply increasing parallel trajectories N, which often yields redundant reasoning, we limit N and redirect saved computation to iterative self-reflection guided by a “take-experience” mechanism. This mechanism distills key insights from past rounds, allowing the model to avoid re-deriving known conclusions and focus on unresolved uncertainties. Crucially, it achieves higher context efficiency than naively referencing raw trajectories, enabling richer integration of historical information within the same context window. This approach consistently outperforms standard parallel sampling and aggregation with roughly the same token consumption: GPQA (90.3 → 92.8), HLE (34.1 → 36.5), LiveCodeBench v6 (88.0 → 91.4), IMO-AnswerBench (89.5 → 91.5), and HLE (w/ tools) (55.8 → 58.3).
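The control flow of this strategy can be sketched as follows. Everything here is schematic: the names `multi_round_solve`, `toy_sample`, `toy_distill`, and `toy_aggregate` are hypothetical, and the real "take-experience" distillation is performed by the model itself, not by a fixed function. The essential structure is that small parallel batches run over several rounds, with each round conditioned on a compact experience summary rather than on raw past trajectories.

```python
# Schematic of the experience-cumulative, multi-round test-time scaling
# strategy: limit the parallel width per round and carry forward a
# distilled experience string instead of full trajectories.

import random

def multi_round_solve(sample, distill, aggregate, rounds=3, n_parallel=4):
    experience = ""                      # distilled insights, not raw text
    all_answers = []
    for _ in range(rounds):
        # Small parallel batch, conditioned on accumulated experience.
        trajectories = [sample(experience) for _ in range(n_parallel)]
        all_answers.extend(t["answer"] for t in trajectories)
        # "Take-experience": compress this round into key insights so the
        # next round skips re-deriving known conclusions.
        experience = distill(experience, trajectories)
    return aggregate(all_answers)

# Toy instantiation: sampling becomes more reliable as experience grows.
def toy_sample(experience):
    bias = 0.3 + 0.2 * experience.count("insight")
    answer = 42 if random.random() < min(bias, 0.95) else random.randint(0, 41)
    return {"answer": answer, "reasoning": "..."}

def toy_distill(experience, trajectories):
    return experience + "insight;"       # one new distilled insight per round

def toy_aggregate(answers):
    return max(set(answers), key=answers.count)   # majority vote

random.seed(0)
print(multi_round_solve(toy_sample, toy_distill, toy_aggregate))
```

Compared with widening a single parallel batch, this layout spends the same token budget on rounds that can react to what earlier rounds already established, which is the source of the benchmark gains reported above.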