

Qwen3.5 122B A10B API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Qwen3.5 122B A10B

Qwen3.5 122B A10B is Alibaba Cloud’s mid-tier multimodal foundation model, released in February 2026. It is a vision-language Mixture-of-Experts model that accepts text, image, and video inputs, designed for native multimodal agent applications. It features 122 billion total parameters with 10 billion activated per token, using a hybrid architecture that combines Gated Delta Networks with a sparse Mixture-of-Experts layer of 256 experts to deliver high-throughput inference with minimal latency and cost overhead.

The model supports a 262k token context window (extensible to 1M via YaRN), operates in both thinking and non-thinking modes, and offers expanded support for 201 languages and dialects. Qwen3.5 122B A10B scores 42 on the Artificial Analysis Intelligence Index — well above average among comparable models — and is released under the Apache 2.0 license, enabling commercial use and third-party hosting.

Key Architectural Innovations

  • Unified Vision-Language Foundation: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.
  • Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
  • Scalable RL Generalization: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.
  • Global Linguistic Coverage: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.

Benchmark Performance

  • MMLU-Pro: 86.1%
  • GPQA Diamond: 85.5% (vs GPT-5-mini at 82.8%)
  • SWE-bench Verified: 72.0%
  • Terminal-Bench 2.0: 49.4%
  • TAU2-Bench: 79.5% (vs GPT-5-mini at 69.8%)
  • BrowseComp: 63.8%

Qwen3.5 122B A10B is now available across multiple inference providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
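All of the providers compared below expose OpenAI-compatible chat endpoints, so switching between them is largely a matter of changing the base URL and model slug. A minimal request-building sketch (the model slug shown is our guess, not confirmed by this benchmark; verify it against your provider's model catalog before use):

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Assemble an OpenAI-compatible /chat/completions request (not sent here)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    return req, payload

# Hypothetical model slug -- check the provider's model list for the real one.
req, payload = build_chat_request(
    "https://api.deepinfra.com/v1/openai", "YOUR_API_KEY",
    "Qwen/Qwen3.5-122B-A10B", "Summarize TTFT vs throughput trade-offs.")
print(req.full_url)
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) is left out so the sketch stays side-effect free.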

Qwen3.5 122B A10B (Reasoning) API Review Summary

  • DeepInfra (FP8) is the overall leader: #1 in speed (155.5 t/s), latency (0.59s TTFT), and blended price ($0.94/1M) among all 4 tracked providers.
  • Fastest output speed: DeepInfra (FP8) at 155.5 t/s — approximately 1.9x faster than the slowest provider (Novita at 83.8 t/s).
  • Lowest latency: DeepInfra (FP8) at 0.59s TTFT — nearly 3x faster than the next best option (Novita at 1.72s).
  • Lowest blended price: DeepInfra (FP8) at $0.94/1M tokens; every other provider charges $1.10, about 17% more.
  • Lowest token prices: DeepInfra (FP8) at $0.29/1M input and $2.90/1M output (next best: $0.40 input, $3.20 output).
  • Feature note: All 4 providers support Function Calling. JSON mode is supported by 3 of 4 providers — DeepInfra (FP8) does not currently support JSON mode.

Qwen3.5 122B A10B — Best APIs

| Provider | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Speed (t/s) | Latency (TTFT) | E2E Response (s) | Func | JSON | Context |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepInfra (FP8) | $0.94 | $0.29 | $2.90 | 155.5 | 0.59s | 16.67 / 12.86 | Yes | No | 262k |
| Alibaba Cloud | $1.10 | $0.40 | $3.20 | 137.9 | 2.32s | 20.44 / 14.50 | Yes | Yes | 262k |
| Novita | $1.10 | $0.40 | $3.20 | 83.8 | 1.72s | 31.56 / 23.87 | Yes | Yes | 262k |
| GMI (FP8) | $1.10 | $0.40 | $3.20 | 90.7 | 2.42s | 29.98 / 22.04 | Yes | Yes | 262k |
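As a sanity check on the pricing column, the blended figures above are consistent with a 3:1 input-to-output token weighting (our assumption; the benchmark does not state the blend ratio explicitly):

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_weight: int = 3, output_weight: int = 1) -> float:
    """Weighted-average price per 1M tokens, assuming a 3:1 input:output mix."""
    total = input_weight + output_weight
    return round((input_weight * input_per_m + output_weight * output_per_m) / total, 2)

# DeepInfra (FP8): $0.29 input, $2.90 output
print(blended_price(0.29, 2.90))  # → 0.94

# Alibaba Cloud / Novita / GMI: $0.40 input, $3.20 output
print(blended_price(0.40, 3.20))  # → 1.1
```

Both results match the table, so the blended column is reproducible from the per-token prices under this weighting.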

Quick Verdict: Which Qwen3.5 122B A10B Provider is Best?

Based on benchmarks across 4 tracked providers, DeepInfra (FP8) is the recommended API for production-scale Qwen3.5 122B A10B deployment. It ranks #1 across all three primary metrics (speed, latency, and cost) while undercutting the $1.10 market rate with a $0.94 blended price. The only trade-off is the absence of JSON mode, which is worth noting for structured output workflows.

Overall Winner: DeepInfra (FP8)

DeepInfra’s FP8 implementation dominates across all key performance and pricing metrics, making it the clear recommendation for the vast majority of production use cases.

  • Blended Price: $0.94 / 1M tokens (cheapest on the market)
  • Input Price: $0.29 / 1M tokens
  • Output Price: $2.90 / 1M tokens
  • Output Speed: 155.5 t/s (#1 — approximately 1.9x faster than Novita)
  • Latency (TTFT): 0.59s (#1 — nearly 3x faster than the next best)
  • Context Window: 262k tokens
  • API Features: Function Calling supported; JSON mode not currently available

At $0.94 per 1M blended tokens, DeepInfra undercuts every other provider in the benchmark, all of which charge $1.10 (about 17% more). Combined with the fastest output speed and the lowest TTFT in the field, it is the only provider that wins across all three critical dimensions simultaneously.
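To see what the price gap means at volume, a quick back-of-envelope estimate (the token volumes are hypothetical; the per-token prices come from the table above):

```python
def monthly_cost(input_tokens_m: float, output_tokens_m: float,
                 input_price: float, output_price: float) -> float:
    """Monthly spend in USD for a workload measured in millions of tokens."""
    return round(input_tokens_m * input_price + output_tokens_m * output_price, 2)

# Hypothetical workload: 900M input + 300M output tokens per month.
deepinfra = monthly_cost(900, 300, 0.29, 2.90)  # DeepInfra (FP8)
others = monthly_cost(900, 300, 0.40, 3.20)     # Alibaba Cloud / Novita / GMI
print(deepinfra, others, others - deepinfra)
```

At this (assumed) volume the per-token difference compounds into a three-figure monthly saving, which is where the blended-price gap starts to matter.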

The one trade-off worth flagging: DeepInfra (FP8) does not currently support JSON mode. Developers requiring deterministic structured outputs should either use prompt engineering to enforce JSON structure or consider Alibaba Cloud as an alternative.
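One common way to work around a missing JSON mode is to instruct the model to wrap its answer in a fenced JSON block, then extract and validate it client-side. A minimal sketch (the helper and its regex are our own illustration, not a DeepInfra API feature):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Pull the first JSON object out of a model response.

    Handles both bare JSON and JSON wrapped in a fenced ```json block,
    raising ValueError if nothing parseable is found.
    """
    fence = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    candidate = fence.group(1) if fence else text
    try:
        return json.loads(candidate)
    except json.JSONDecodeError as exc:
        raise ValueError(f"no valid JSON in response: {exc}") from exc

# A response shaped like what a prompted model might return:
reply = 'Sure! Here is the result:\n```json\n{"sentiment": "positive", "score": 0.92}\n```'
print(extract_json(reply)["sentiment"])  # → positive
```

This is weaker than a native JSON mode (the model can still emit malformed output, so retry logic is advisable), but it keeps structured-output pipelines workable on DeepInfra.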

Official Provider: Alibaba Cloud

As the model’s creator, Alibaba Cloud offers a solid balance of performance and full feature support, making it the natural fallback for teams requiring JSON mode.

  • Output Speed: 137.9 t/s (#2 overall)
  • Latency (TTFT): 2.32s
  • Blended Price: $1.10 / 1M tokens
  • Context Window: 262k tokens
  • API Features: Function Calling + JSON Mode

Alibaba Cloud delivers competitive throughput (137.9 t/s) with complete feature support including JSON mode. Its latency (2.32s TTFT) is notably higher than DeepInfra, making it less suitable for real-time interactive applications. For batch workloads or structured output pipelines where JSON mode is required, it is the recommended alternative to DeepInfra.

Alternative Providers: Novita and GMI (FP8)

Both Novita and GMI are priced identically at $1.10/1M blended and offer full feature support (Function Calling + JSON Mode), but neither matches DeepInfra on performance.

  • Novita: 83.8 t/s output speed, 1.72s TTFT — better latency than GMI but slower throughput.
  • GMI (FP8): 90.7 t/s output speed, 2.42s TTFT — marginally faster throughput than Novita but higher latency.

For developers already integrated into either ecosystem, or with specific regional availability requirements, both are viable options. However, given that DeepInfra outperforms both on every metric at a lower blended price ($0.94 vs $1.10), neither represents the optimal choice for new deployments.

Conclusion

For the vast majority of Qwen3.5 122B A10B deployments, DeepInfra (FP8) is the clear recommendation. It ranks #1 in speed, latency, and cost simultaneously — a rare combination in inference provider benchmarks.

  • Choose DeepInfra (FP8) for the best overall value — lowest cost, fastest speed, lowest latency, and Function Calling support.
  • Choose Alibaba Cloud if JSON mode is a hard requirement, or for teams preferring the first-party provider.
  • Choose Novita or GMI for ecosystem-specific integrations where DeepInfra is not an option.

Related articles

  • Power the Next Era of Image Generation with FLUX.2 Visual Intelligence on DeepInfra
  • GLM-4.6 vs DeepSeek-V3.2: Performance, Benchmarks & DeepInfra Results
  • Qwen API Pricing Guide 2026: Max Performance on a Budget