Qwen/QVQ-72B-Preview cover image

Qwen/QVQ-72B-Preview

QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities. QVQ-72B-Preview has achieved remarkable performance on various benchmarks. It scored a remarkable 70.3% on the Multimodal Massive Multi-task Understanding (MMMU) benchmark

QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities. QVQ-72B-Preview has achieved remarkable performance on various benchmarks. It scored a remarkable 70.3% on the Multimodal Massive Multi-task Understanding (MMMU) benchmark

Public
$0.25/$0.50 in/out Mtoken
bfloat16
128,000
ProjectLicense
Qwen/QVQ-72B-Preview cover image

QVQ-72B-Preview

Ask me anything

0.00s

QVQ-72B-Preview

Introduction

QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities.

Performance

QVQ-72B-Previewo1-2024-12-17gpt-4o-2024-05-13Claude3.5 Sonnet-20241022Qwen2VL-72B
MMMU(val)70.377.369.170.464.5
MathVista(mini)71.471.063.865.370.5
MathVision(full)35.930.435.625.9
OlympiadBench20.425.911.2

QVQ-72B-Preview has achieved remarkable performance on various benchmarks. It scored a remarkable 70.3% on the Multimodal Massive Multi-task Understanding (MMMU) benchmark, showcasing QVQ's powerful ability in multidisciplinary understanding and reasoning. Furthermore, the significant improvements on MathVision highlight the model's progress in mathematical reasoning tasks. OlympiadBench also demonstrates the model's enhanced ability to tackle challenging problems.

But It's Not All Perfect: Acknowledging the Limitations

While QVQ-72B-Preview exhibits promising performance that surpasses expectations, it’s important to acknowledge several limitations:

  1. Language Mixing and Code-Switching: The model might occasionally mix different languages or unexpectedly switch between them, potentially affecting the clarity of its responses.
  2. Recursive Reasoning Loops: There's a risk of the model getting caught in recursive reasoning loops, leading to lengthy responses that may not even arrive at a final answer.
  3. Safety and Ethical Considerations: Robust safety measures are needed to ensure reliable and safe performance. Users should exercise caution when deploying this model.
  4. Performance and Benchmark Limitations: Despite the improvements in visual reasoning, QVQ doesn’t entirely replace the capabilities of Qwen2-VL-72B. During multi-step visual reasoning, the model might gradually lose focus on the image content, leading to hallucinations. Moreover, QVQ doesn’t show significant improvement over Qwen2-VL-72B in basic recognition tasks like identifying people, animals, or plants.

Note: Currently, the model only supports single-round dialogues and image outputs. It does not support video inputs.