

Qwen3.5 397B A17B API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Qwen3.5 397B A17B

Qwen3.5 397B A17B is Alibaba Cloud’s largest and most capable multimodal foundation model, released in February 2026. It features a hybrid Mixture-of-Experts (MoE) architecture with 397 billion total parameters and 17 billion active parameters per inference pass, with a router selecting a subset of its 512 experts for each token. This sparse activation design delivers high-throughput inference at a fraction of the latency and cost of dense models of equivalent capability.

Qwen3.5 397B is the first Qwen open-weights model with native vision input, supporting image and video through early fusion training on multimodal tokens. It unifies the previously separate Qwen3 (text) and Qwen3-VL (vision) model lines into a single architecture. The model supports both reasoning and non-reasoning modes, a 262k token context window (extendable to 1M in the hosted Qwen3.5-Plus version), and 201 languages and dialects. It scores 45 on the Artificial Analysis Intelligence Index, ranking #3 among open-weights models, with benchmark scores of 87.8% on MMLU-Pro, 88.4% on GPQA Diamond, and 76.4% on SWE-Bench Verified.

Qwen3.5 397B A17B is now available across multiple inference providers, but not all of them are created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

Qwen3.5 397B A17B (Reasoning) API Review Summary

  • DeepInfra (FP8) is the best overall value: lowest blended price ($1.25/1M tokens) combined with lowest latency (0.67s TTFT) among all 9 benchmarked providers.
  • DeepInfra (FP8) ranks #2 in output speed: 137.9 t/s while holding the #1 position in both price and latency.
  • DeepInfra (FP8) has the lowest token rates: $0.54/1M input and $3.40/1M output — the cheapest in the benchmark.
  • Clarifai leads on throughput: 268.4 t/s — nearly double DeepInfra and 4.8x faster than the slowest provider.
  • Together.ai has a significant latency issue: 46.50s TTFT makes it unsuitable for any interactive application.
  • JSON mode caveat: DeepInfra (FP8) is the only provider among the 9 that does not currently support JSON mode.
  • Benchmarks reflect sustained performance: median (P50) over the past 72 hours using a 10,000 input token workload.
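TTFT is measured client-side as the gap between sending a request and receiving the first streamed token. A minimal sketch of that measurement; the token stream here is simulated with a generator, whereas in practice you would iterate over a provider's streaming chat-completions response:

```python
import time

def time_to_first_token(stream):
    """Seconds elapsed until the stream yields its first token."""
    start = time.perf_counter()
    for _ in stream:
        return time.perf_counter() - start  # stop at the first token
    raise RuntimeError("stream produced no tokens")

def fake_stream(ttft_s, n_tokens):
    """Simulated provider stream: waits ttft_s, then yields tokens."""
    time.sleep(ttft_s)
    for i in range(n_tokens):
        yield f"tok{i}"

print(f"TTFT: {time_to_first_token(fake_stream(0.05, 10)):.3f}s")
```

The same loop, continued to exhaustion with a token counter, gives the output-speed (t/s) figure.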

Qwen3.5 397B A17B — Best APIs

| Provider | Why Notable | Blended ($/1M) | Latency (TTFT) | Speed (t/s) | Context | Func | JSON | E2E (s) |
|---|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | Best cost + lowest latency; strong overall for production workloads | $1.25 | 0.67s | 138 | 262k | Yes | No | 27.39 / 23.10 |
| Clarifai | Absolute fastest throughput; throughput-heavy workloads | $1.35 | 0.74s | 268 | 256k | Yes | Yes | 14.47 / 11.87 |
| Alibaba Cloud | First-party availability; full feature support | $1.35 | 2.31s | 94 | 262k | Yes | Yes | 41.72 / 34.06 |
| Eigen AI | Balanced speed; high-speed structured data extraction | $1.35 | 1.66s | 136 | 262k | No | Yes | 28.70 / 23.37 |
| Novita | Balanced agentic workflows; full feature support | $1.35 | 1.49s | 98 | 262k | Yes | Yes | 39.13 / 32.53 |

Quick Verdict: Which Qwen3.5 397B A17B Provider is Best?

Based on benchmarks across 9 tracked providers, DeepInfra (FP8) is the recommended API for production-scale Qwen3.5 397B A17B deployment. It offers the lowest blended price ($1.25/1M), the lowest TTFT (0.67s), and ranks #2 in output speed (137.9 t/s) — a combination no other provider matches. For maximum throughput, Clarifai leads at 268.4 t/s. For teams requiring JSON mode, Clarifai or Alibaba Cloud are the recommended alternatives.

Overall Winner: DeepInfra (FP8)

DeepInfra is the most well-rounded and recommended API provider for Qwen3.5 397B A17B, combining the market’s lowest latency, top-tier throughput, and most aggressive pricing.

  • Blended Price: $1.25 / 1M tokens (#1 cheapest)
  • Input Price: $0.54 / 1M tokens
  • Output Price: $3.40 / 1M tokens
  • Output Speed: 137.9 t/s (#2 overall)
  • Latency (TTFT): 0.67s (#1 lowest)
  • Context Window: 262k tokens
  • API Features: Function Calling supported; JSON mode not currently available

DeepInfra’s FP8 quantization delivers the lowest latency on the market (0.67s) while maintaining near-top throughput (137.9 t/s, #2). Its blended price is 7.4% below the $1.35 rate charged by every other provider — a meaningful advantage at production scale.
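The blended figure can be reproduced from the per-token rates. A common convention, assumed here rather than stated by the benchmark, is a 3:1 input-to-output token weighting:

```python
def blended_price(input_per_1m, output_per_1m, input_ratio=3, output_ratio=1):
    """Token-weighted average of input and output prices per 1M tokens."""
    total = input_ratio + output_ratio
    return (input_per_1m * input_ratio + output_per_1m * output_ratio) / total

# DeepInfra (FP8) rates from this benchmark: $0.54 input / $3.40 output
print(blended_price(0.54, 3.40))  # ~1.255, consistent with the quoted $1.25/1M
```

Workloads with longer outputs (heavy reasoning, long-form generation) shift the effective blend toward the output rate, so the relative rankings can move with your input/output mix.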

The one trade-off is the absence of JSON mode, which makes DeepInfra less suitable for applications requiring deterministic structured outputs. For those use cases, Clarifai or Alibaba Cloud are the recommended alternatives.

Best for Throughput: Clarifai

For throughput-intensive applications requiring generation of large volumes of text, Clarifai is the undisputed leader.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 268.4 t/s (#1 — nearly double DeepInfra)
  • Latency (TTFT): 0.74s (#2 lowest)
  • Context Window: 256k tokens (slightly capped vs standard 262k)
  • API Features: Function Calling + JSON Mode
  • E2E (500 tokens): 14.47s (fastest in the benchmark)

Clarifai’s 268.4 t/s throughput is 4.8x faster than the slowest provider, resulting in the fastest end-to-end response time for a 500-token output (14.47s). It also maintains excellent latency (0.74s) and full feature support. The only limitations are the slightly smaller context window (256k vs 262k) and the higher price point compared to DeepInfra.
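A useful sanity check: end-to-end time for a fixed-length reply is bounded below by TTFT plus visible output tokens divided by throughput. A sketch using the benchmark numbers:

```python
def e2e_lower_bound(ttft_s, speed_tps, output_tokens=500):
    """Minimum wall time: first-token latency plus visible-token generation."""
    return ttft_s + output_tokens / speed_tps

for name, ttft, tps in [("Clarifai", 0.74, 268.4), ("DeepInfra (FP8)", 0.67, 137.9)]:
    print(f"{name}: >= {e2e_lower_bound(ttft, tps):.1f}s for 500 tokens")
```

Clarifai's bound works out to roughly 2.6s, far below its measured 14.47s, which suggests most of the measured wall time goes to hidden reasoning tokens generated before the visible answer — another reason raw throughput matters so much for this model.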

Best for Structured Data: Eigen AI

Eigen AI is a strong option for developers requiring JSON mode alongside competitive throughput.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 136.3 t/s (#3 overall)
  • Latency (TTFT): 1.66s
  • API Features: JSON Mode supported; Function Calling not supported

Eigen AI closely trails DeepInfra in output speed (136.3 t/s vs 137.9 t/s) and fully supports JSON mode — making it the go-to choice for structured data extraction and parsing pipelines that need DeepInfra-level throughput with guaranteed JSON output. The notable caveat is that Eigen AI is the only provider in the benchmark that does not support function calling.

Balanced Option: Novita

Novita offers a stable, well-rounded technical profile for complex agentic applications requiring both tool calling and structured outputs.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 97.9 t/s
  • Latency (TTFT): 1.49s
  • API Features: Function Calling + JSON Mode

Novita supports both JSON mode and function calling at the standard market price, making it a versatile fallback for complex agentic workflows requiring tool orchestration. Its throughput (97.9 t/s) and latency (1.49s) are mid-tier — acceptable for non-interactive workloads but not competitive with DeepInfra or Clarifai for interactive use cases.
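For agentic workflows like these, function calling is exposed through a `tools` array in the OpenAI-compatible request schema. A sketch of a minimal tool definition; `get_weather` is a made-up tool and the model identifier is an assumption:

```python
import json

# Minimal tool definition in the OpenAI-compatible schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, purely for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "Qwen/Qwen3.5-397B-A17B",  # illustrative identifier
    "messages": [{"role": "user", "content": "What's the weather in Hangzhou?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
print(json.dumps(payload)[:60] + "...")
```

When the model decides to call the tool, the response contains a `tool_calls` entry with JSON arguments; your application executes the tool and sends the result back as a `tool` role message.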

Low-Latency Alternative: Parasail

Parasail has a unique performance profile suited for short-turn conversational applications.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 56.3 t/s (#9 — the slowest in the benchmark)
  • Latency (TTFT): 0.95s (#3 lowest)
  • API Features: Function Calling + JSON Mode

Parasail delivers sub-second TTFT (0.95s), placing it third in the latency rankings. However, its output speed of 56.3 t/s is the lowest in the benchmark — making it viable for short-turn conversational AI where immediate feedback matters, but not suitable for long-form generation or heavy reasoning tasks.

First-Party Provider: Alibaba Cloud

As the creator of the Qwen model family, Alibaba Cloud provides a reliable first-party hosting solution with full feature support.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 94.0 t/s
  • Latency (TTFT): 2.31s
  • Context Window: 262k tokens (1M via Qwen3.5-Plus hosted version)
  • API Features: Function Calling + JSON Mode

Alibaba Cloud guarantees high compatibility, the full 262k context window, and native support for both JSON mode and tool calling. Its latency (2.31s) and throughput (94.0 t/s) trail third-party inference providers, but its first-party status and access to Qwen3.5-Plus production features (1M context, built-in tools) make it the natural choice for teams in the Alibaba Cloud ecosystem.

Non-Interactive Background Processing: Together.ai

Together.ai is a well-known provider, but its current infrastructure for this model shows a significant latency bottleneck.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 100.6 t/s
  • Latency (TTFT): 46.50s — by far the highest in the benchmark
  • API Features: Function Calling + JSON Mode

A TTFT of 46.50 seconds makes Together.ai completely unsuitable for any user-facing, interactive, or real-time application. Once input processing completes, it maintains reasonable throughput (100.6 t/s), making it viable only for batch processing and background tasks where latency is not a constraint.

Low-VRAM Alternative: Nebius (Base, FP4)

Nebius utilizes FP4 quantization to reduce memory overhead, but the performance trade-offs are significant at the standard market price.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 69.0 t/s
  • Latency (TTFT): 1.93s
  • API Features: Function Calling + JSON Mode

Despite aggressive FP4 quantization, Nebius sits near the bottom for output speed and offers mid-tier latency. At the same $1.35 price point as Clarifai, Eigen AI, and Novita — all of which offer significantly better performance — it is difficult to recommend unless a specific low-VRAM deployment constraint applies.

Redundancy Option: GMI (FP8)

Like DeepInfra, GMI runs an FP8 version of the model, but the performance difference between the two is substantial.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 78.0 t/s
  • Latency (TTFT): 2.40s — roughly 3.6x slower than DeepInfra
  • API Features: Function Calling + JSON Mode

GMI’s latency is roughly 3.6x slower than DeepInfra’s (2.40s vs 0.67s) and its throughput is roughly half (78.0 t/s vs 137.9 t/s), at an 8% higher price point ($1.35 vs $1.25). It does not differentiate meaningfully against faster FP8 competitors. It may serve as a geographic redundancy option for FP8 deployments, but is not recommended for primary production traffic.

Frequently Asked Questions

Which Qwen3.5 397B A17B provider has the lowest latency?

DeepInfra (FP8) has the lowest TTFT at 0.67 seconds.

Which provider has the highest throughput?

Clarifai leads with 268.4 tokens per second — nearly double DeepInfra and 4.8x faster than the slowest provider.

Which provider is the cheapest?

DeepInfra (FP8) offers the lowest blended price at $1.25 per 1M tokens ($0.54 input / $3.40 output).

Which providers support JSON mode?

All providers except DeepInfra (FP8) support JSON mode: Clarifai, Eigen AI, Novita, Parasail, Alibaba Cloud, Together.ai, Nebius, and GMI.

Which providers support function calling?

All providers except Eigen AI support function calling.

What is the context window for Qwen3.5 397B A17B?

Most providers offer a 262k token context window. Clarifai is slightly capped at 256k. The Alibaba Cloud hosted Qwen3.5-Plus version supports up to 1M tokens.

What quantization formats are available?

DeepInfra and GMI use FP8 quantization. Nebius uses FP4 quantization. Other providers have not disclosed their quantization implementations.
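The practical effect of these formats on memory footprint can be approximated from the parameter count alone. This back-of-the-envelope sketch covers weights only, ignoring KV cache and activations:

```python
PARAMS = 397e9  # total parameters in Qwen3.5 397B A17B

def weight_memory_gb(params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

for fmt, bits in [("FP16/BF16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{fmt}: ~{weight_memory_gb(PARAMS, bits):.0f} GB of weights")
```

Halving the bits per weight roughly halves the GPU memory and bandwidth needed to serve the model, which is why FP8/FP4 providers can price aggressively — at some accuracy risk relative to full precision.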

Conclusion

For the vast majority of Qwen3.5 397B A17B deployments, DeepInfra (FP8) is the recommended provider. By combining the lowest latency (0.67s), top-tier throughput (137.9 t/s, #2), and the most competitive pricing ($1.25/1M blended), it provides a highly optimized foundation for deploying Alibaba’s flagship reasoning model at scale.

  • Choose DeepInfra (FP8) for the best overall value — lowest cost, lowest latency, and strong throughput.
  • Choose Clarifai for maximum throughput (268.4 t/s) or when JSON mode is required alongside high generation speed.
  • Choose Eigen AI for structured data extraction requiring JSON mode at near-DeepInfra throughput.
  • Choose Alibaba Cloud for first-party support or access to extended production features via Qwen3.5-Plus.
  • Avoid Together.ai for any interactive application — its 46.50s TTFT is prohibitive for user-facing workloads.
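To make the price gap concrete, a quick back-of-the-envelope calculation; the 5-billion-token monthly volume is a made-up example:

```python
def monthly_saving(tokens_per_month, price_a, price_b):
    """Cost difference per month between two blended prices ($/1M tokens)."""
    return tokens_per_month / 1e6 * (price_b - price_a)

# e.g. 5 billion blended tokens per month at $1.25 vs $1.35 per 1M
print(round(monthly_saving(5e9, 1.25, 1.35), 2))  # -> 500.0
```

Modest per-million differences compound linearly with volume, so the blended-price ranking matters most for high-throughput production deployments.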
Related articles

  • Step 3.5 Flash API Benchmarks: Latency, Throughput & Cost
  • GLM-4.6 vs DeepSeek-V3.2: Performance, Benchmarks & DeepInfra Results
  • Building Efficient AI Inference on NVIDIA Blackwell Platform