QVQ-72B-Preview
QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities.
| Benchmark | QVQ-72B-Preview | o1-2024-12-17 | gpt-4o-2024-05-13 | Claude 3.5 Sonnet-20241022 | Qwen2-VL-72B |
|---|---|---|---|---|---|
| MMMU (val) | 70.3 | 77.3 | 69.1 | 70.4 | 64.5 |
| MathVista (mini) | 71.4 | 71.0 | 63.8 | 65.3 | 70.5 |
| MathVision (full) | 35.9 | – | 30.4 | 35.6 | 25.9 |
| OlympiadBench | 20.4 | – | 25.9 | – | 11.2 |
QVQ-72B-Preview achieves strong results across these benchmarks. It scores 70.3% on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark, demonstrating solid multidisciplinary understanding and reasoning. The substantial gains on MathVision highlight its progress in mathematical reasoning, and its OlympiadBench score reflects an improved ability to tackle challenging olympiad-level problems.
But It's Not All Perfect: Acknowledging the Limitations
While QVQ-72B-Preview exhibits promising performance that surpasses expectations, it’s important to acknowledge several limitations:
Note: Currently, the model only supports single-round dialogues and image outputs. It does not support video inputs.
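For reference, here is a minimal sketch of such a single-round, image-plus-text request. It assumes QVQ-72B-Preview is loaded through the Qwen2-VL classes in Hugging Face Transformers with the qwen-vl-utils helper package installed; the image URL and prompt below are placeholders.

```python
# Minimal single-round image + text inference sketch.
# Assumes: pip install transformers qwen-vl-utils accelerate
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/QVQ-72B-Preview"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A single user turn containing one image and one question (placeholder values).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/problem.png"},
            {"type": "text", "text": "Solve the problem in the image, reasoning step by step."},
        ],
    }
]

# Build the chat prompt and gather the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)  # video_inputs stays empty: video is unsupported
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Long reasoning chains need a generous token budget.
output_ids = model.generate(**inputs, max_new_tokens=8192)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

Because only single-round dialogues are supported, the reply is not appended to `messages` for a follow-up turn; each question should be sent as a fresh request.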