

Qwen3.5 397B A17B API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Qwen3.5 397B A17B

Qwen3.5 397B A17B is Alibaba Cloud’s largest and most capable multimodal foundation model, released in February 2026. It features a hybrid Mixture-of-Experts (MoE) architecture with 397 billion total parameters and 17 billion active parameters per inference pass, with a router selecting a subset of its 512 experts for each token. This sparse activation design delivers high-throughput inference at a fraction of the latency and cost of dense models of equivalent capability.

Qwen3.5 397B is the first Qwen open-weights model with native vision input, supporting image and video through early fusion training on multimodal tokens. It unifies the previously separate Qwen3 (text) and Qwen3-VL (vision) model lines into a single architecture. The model supports both reasoning and non-reasoning modes, a 262k token context window (extendable to 1M in the hosted Qwen3.5-Plus version), and 201 languages and dialects. It scores 45 on the Artificial Analysis Intelligence Index, ranking #3 among open-weights models, with benchmark scores of 87.8% on MMLU-Pro, 88.4% on GPQA Diamond, and 76.4% on SWE-Bench Verified.

Qwen3.5 397B A17B is now available across multiple inference providers, but not all of them are created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

Qwen3.5 397B A17B (Reasoning) API Review Summary

  • DeepInfra (FP8) is the best overall value: lowest blended price ($1.25/1M tokens) combined with lowest latency (0.67s TTFT) among all 9 benchmarked providers.
  • DeepInfra (FP8) ranks #2 in output speed: 137.9 t/s while holding the #1 position in both price and latency.
  • DeepInfra (FP8) has the lowest token rates: $0.54/1M input and $3.40/1M output — the cheapest in the benchmark.
  • Clarifai leads on throughput: 268.4 t/s — nearly double DeepInfra and 4.8x faster than the slowest provider.
  • Together.ai has a significant latency issue: 46.50s TTFT makes it unsuitable for any interactive application.
  • JSON mode caveat: DeepInfra (FP8) is the only provider among the 9 that does not currently support JSON mode.
  • Benchmarks reflect sustained performance: median (P50) over the past 72 hours using a 10,000 input token workload.
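TTFT is measured client-side as the gap between sending a request and receiving the first streamed token. A minimal sketch of that measurement; the token stream here is simulated with a generator, whereas in practice you would iterate over a provider's streaming chat-completions response:

```python
import time

def time_to_first_token(stream):
    """Seconds elapsed until the stream yields its first token."""
    start = time.perf_counter()
    for _ in stream:
        return time.perf_counter() - start  # stop at the first token
    raise RuntimeError("stream produced no tokens")

def fake_stream(ttft_s, n_tokens):
    """Simulated provider stream: waits ttft_s, then yields tokens."""
    time.sleep(ttft_s)
    for i in range(n_tokens):
        yield f"tok{i}"

print(f"TTFT: {time_to_first_token(fake_stream(0.05, 10)):.3f}s")
```

The same loop, continued to exhaustion with a token counter, gives the output-speed (t/s) figure.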

Qwen3.5 397B A17B — Best APIs

| Provider | Why Notable | Blended ($/1M) | Latency (TTFT) | Speed (t/s) | Context | Func | JSON | E2E (s) |
|---|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | Best cost + lowest latency; strong overall for production workloads | $1.25 | 0.67s | 138 | 262k | Yes | No | 27.39 / 23.10 |
| Clarifai | Absolute fastest throughput; throughput-heavy workloads | $1.35 | 0.74s | 268 | 256k | Yes | Yes | 14.47 / 11.87 |
| Alibaba Cloud | First-party availability; full feature support | $1.35 | 2.31s | 94 | 262k | Yes | Yes | 41.72 / 34.06 |
| Eigen AI | Balanced speed; high-speed structured data extraction | $1.35 | 1.66s | 136 | 262k | No | Yes | 28.70 / 23.37 |
| Novita | Balanced agentic workflows; full feature support | $1.35 | 1.49s | 98 | 262k | Yes | Yes | 39.13 / 32.53 |

Quick Verdict: Which Qwen3.5 397B A17B Provider is Best?

Based on benchmarks across 9 tracked providers, DeepInfra (FP8) is the recommended API for production-scale Qwen3.5 397B A17B deployment. It offers the lowest blended price ($1.25/1M), the lowest TTFT (0.67s), and ranks #2 in output speed (137.9 t/s) — a combination no other provider matches. For maximum throughput, Clarifai leads at 268.4 t/s. For teams requiring JSON mode, Clarifai or Alibaba Cloud are the recommended alternatives.

Overall Winner: DeepInfra (FP8)

DeepInfra is the most well-rounded and recommended API provider for Qwen3.5 397B A17B, combining the market’s lowest latency, top-tier throughput, and most aggressive pricing.

  • Blended Price: $1.25 / 1M tokens (#1 cheapest)
  • Input Price: $0.54 / 1M tokens
  • Output Price: $3.40 / 1M tokens
  • Output Speed: 137.9 t/s (#2 overall)
  • Latency (TTFT): 0.67s (#1 lowest)
  • Context Window: 262k tokens
  • API Features: Function Calling supported; JSON mode not currently available

DeepInfra’s FP8 quantization delivers the lowest latency on the market (0.67s) while maintaining near-top throughput (137.9 t/s, #2). Its blended price is 7.4% below the $1.35 rate charged by every other provider — a meaningful advantage at production scale.
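The blended figure can be reproduced from the per-token rates. A common convention, assumed here rather than stated by the benchmark, is a 3:1 input-to-output token weighting:

```python
def blended_price(input_per_1m, output_per_1m, input_ratio=3, output_ratio=1):
    """Token-weighted average of input and output prices per 1M tokens."""
    total = input_ratio + output_ratio
    return (input_per_1m * input_ratio + output_per_1m * output_ratio) / total

# DeepInfra (FP8) rates from this benchmark: $0.54 input / $3.40 output
print(blended_price(0.54, 3.40))  # ~1.255, consistent with the quoted $1.25/1M
```

Workloads with longer outputs (heavy reasoning, long-form generation) shift the effective blend toward the output rate, so the relative rankings can move with your input/output mix.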

The one trade-off is the absence of JSON mode, which makes DeepInfra less suitable for applications requiring deterministic structured outputs. For those use cases, Clarifai or Alibaba Cloud are the recommended alternatives.

Best for Throughput: Clarifai

For throughput-intensive applications requiring generation of large volumes of text, Clarifai is the undisputed leader.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 268.4 t/s (#1 — nearly double DeepInfra)
  • Latency (TTFT): 0.74s (#2 lowest)
  • Context Window: 256k tokens (slightly capped vs standard 262k)
  • API Features: Function Calling + JSON Mode
  • E2E (500 tokens): 14.47s (fastest in the benchmark)

Clarifai’s 268.4 t/s throughput is 4.8x faster than the slowest provider, resulting in the fastest end-to-end response time for a 500-token output (14.47s). It also maintains excellent latency (0.74s) and full feature support. The only limitations are the slightly smaller context window (256k vs 262k) and the higher price point compared to DeepInfra.
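A useful sanity check: end-to-end time for a fixed-length reply is bounded below by TTFT plus visible output tokens divided by throughput. A sketch using the benchmark numbers:

```python
def e2e_lower_bound(ttft_s, speed_tps, output_tokens=500):
    """Minimum wall time: first-token latency plus visible-token generation."""
    return ttft_s + output_tokens / speed_tps

for name, ttft, tps in [("Clarifai", 0.74, 268.4), ("DeepInfra (FP8)", 0.67, 137.9)]:
    print(f"{name}: >= {e2e_lower_bound(ttft, tps):.1f}s for 500 tokens")
```

Clarifai's bound works out to roughly 2.6s, far below its measured 14.47s, which suggests most of the measured wall time goes to hidden reasoning tokens generated before the visible answer — another reason raw throughput matters so much for this model.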

Best for Structured Data: Eigen AI

Eigen AI is a strong option for developers requiring JSON mode alongside competitive throughput.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 136.3 t/s (#3 overall)
  • Latency (TTFT): 1.66s
  • API Features: JSON Mode supported; Function Calling not supported

Eigen AI closely trails DeepInfra in output speed (136.3 t/s vs 137.9 t/s) and fully supports JSON mode — making it the go-to choice for structured data extraction and parsing pipelines that need DeepInfra-level throughput with guaranteed JSON output. The notable caveat is that Eigen AI is the only provider in the benchmark that does not support function calling.

Balanced Option: Novita

Novita offers a stable, well-rounded technical profile for complex agentic applications requiring both tool calling and structured outputs.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 97.9 t/s
  • Latency (TTFT): 1.49s
  • API Features: Function Calling + JSON Mode

Novita supports both JSON mode and function calling at the standard market price, making it a versatile fallback for complex agentic workflows requiring tool orchestration. Its throughput (97.9 t/s) and latency (1.49s) are mid-tier — acceptable for non-interactive workloads but not competitive with DeepInfra or Clarifai for interactive use cases.
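For agentic workflows like these, function calling is exposed through a `tools` array in the OpenAI-compatible request schema. A sketch of a minimal tool definition; `get_weather` is a made-up tool and the model identifier is an assumption:

```python
import json

# Minimal tool definition in the OpenAI-compatible schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, purely for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "Qwen/Qwen3.5-397B-A17B",  # illustrative identifier
    "messages": [{"role": "user", "content": "What's the weather in Hangzhou?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
print(json.dumps(payload)[:60] + "...")
```

When the model decides to call the tool, the response contains a `tool_calls` entry with JSON arguments; your application executes the tool and sends the result back as a `tool` role message.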

Low-Latency Alternative: Parasail

Parasail has a unique performance profile suited for short-turn conversational applications.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 56.3 t/s (#9 — the slowest in the benchmark)
  • Latency (TTFT): 0.95s (#3 lowest)
  • API Features: Function Calling + JSON Mode

Parasail delivers sub-second TTFT (0.95s), placing it third in the latency rankings. However, its output speed of 56.3 t/s is the lowest in the benchmark — making it viable for short-turn conversational AI where immediate feedback matters, but not suitable for long-form generation or heavy reasoning tasks.

First-Party Provider: Alibaba Cloud

As the creator of the Qwen model family, Alibaba Cloud provides a reliable first-party hosting solution with full feature support.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 94.0 t/s
  • Latency (TTFT): 2.31s
  • Context Window: 262k tokens (1M via Qwen3.5-Plus hosted version)
  • API Features: Function Calling + JSON Mode

Alibaba Cloud guarantees high compatibility, the full 262k context window, and native support for both JSON mode and tool calling. Its latency (2.31s) and throughput (94.0 t/s) trail third-party inference providers, but its first-party status and access to Qwen3.5-Plus production features (1M context, built-in tools) make it the natural choice for teams in the Alibaba Cloud ecosystem.

Non-Interactive Background Processing: Together.ai

Together.ai is a well-known provider, but its current infrastructure for this model shows a significant latency bottleneck.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 100.6 t/s
  • Latency (TTFT): 46.50s — by far the highest in the benchmark
  • API Features: Function Calling + JSON Mode

A TTFT of 46.50 seconds makes Together.ai completely unsuitable for any user-facing, interactive, or real-time application. Once input processing completes, it maintains reasonable throughput (100.6 t/s), making it viable only for batch processing and background tasks where latency is not a constraint.

Low-VRAM Alternative: Nebius (Base, FP4)

Nebius utilizes FP4 quantization to reduce memory overhead, but the performance trade-offs are significant at the standard market price.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 69.0 t/s
  • Latency (TTFT): 1.93s
  • API Features: Function Calling + JSON Mode

Despite aggressive FP4 quantization, Nebius sits near the bottom for output speed and offers mid-tier latency. At the same $1.35 price point as Clarifai, Eigen AI, and Novita — all of which offer significantly better performance — it is difficult to recommend unless a specific low-VRAM deployment constraint applies.

Redundancy Option: GMI (FP8)

Like DeepInfra, GMI runs an FP8 version of the model, but the performance difference between the two is substantial.

  • Blended Price: $1.35 / 1M tokens
  • Output Speed: 78.0 t/s
  • Latency (TTFT): 2.40s — roughly 3.6x slower than DeepInfra
  • API Features: Function Calling + JSON Mode

GMI’s latency is roughly 3.6x slower than DeepInfra’s (2.40s vs 0.67s) and its throughput is roughly half (78.0 t/s vs 137.9 t/s), at an 8% higher price point ($1.35 vs $1.25). It does not differentiate meaningfully against faster FP8 competitors. It may serve as a geographic redundancy option for FP8 deployments, but is not recommended for primary production traffic.

Frequently Asked Questions

Which Qwen3.5 397B A17B provider has the lowest latency?

DeepInfra (FP8) has the lowest TTFT at 0.67 seconds.

Which provider has the highest throughput?

Clarifai leads with 268.4 tokens per second — nearly double DeepInfra and 4.8x faster than the slowest provider.

Which provider is the cheapest?

DeepInfra (FP8) offers the lowest blended price at $1.25 per 1M tokens ($0.54 input / $3.40 output).

Which providers support JSON mode?

All providers except DeepInfra (FP8) support JSON mode: Clarifai, Eigen AI, Novita, Parasail, Alibaba Cloud, Together.ai, Nebius, and GMI.

Which providers support function calling?

All providers except Eigen AI support function calling.

What is the context window for Qwen3.5 397B A17B?

Most providers offer a 262k token context window. Clarifai is slightly capped at 256k. The Alibaba Cloud hosted Qwen3.5-Plus version supports up to 1M tokens.

What quantization formats are available?

DeepInfra and GMI use FP8 quantization. Nebius uses FP4 quantization. Other providers have not disclosed their quantization implementations.
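The practical effect of these formats on memory footprint can be approximated from the parameter count alone. This back-of-the-envelope sketch covers weights only, ignoring KV cache and activations:

```python
PARAMS = 397e9  # total parameters in Qwen3.5 397B A17B

def weight_memory_gb(params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

for fmt, bits in [("FP16/BF16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{fmt}: ~{weight_memory_gb(PARAMS, bits):.0f} GB of weights")
```

Halving the bits per weight roughly halves the GPU memory and bandwidth needed to serve the model, which is why FP8/FP4 providers can price aggressively — at some accuracy risk relative to full precision.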

Conclusion

For the vast majority of Qwen3.5 397B A17B deployments, DeepInfra (FP8) is the recommended provider. By combining the lowest latency (0.67s), top-tier throughput (137.9 t/s, #2), and the most competitive pricing ($1.25/1M blended), it provides a highly optimized foundation for deploying Alibaba’s flagship reasoning model at scale.

  • Choose DeepInfra (FP8) for the best overall value — lowest cost, lowest latency, and strong throughput.
  • Choose Clarifai for maximum throughput (268.4 t/s) or when JSON mode is required alongside high generation speed.
  • Choose Eigen AI for structured data extraction requiring JSON mode at near-DeepInfra throughput.
  • Choose Alibaba Cloud for first-party support or access to extended production features via Qwen3.5-Plus.
  • Avoid Together.ai for any interactive application — its 46.50s TTFT is prohibitive for user-facing workloads.
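To make the price gap concrete, a quick back-of-the-envelope calculation; the 5-billion-token monthly volume is a made-up example:

```python
def monthly_saving(tokens_per_month, price_a, price_b):
    """Cost difference per month between two blended prices ($/1M tokens)."""
    return tokens_per_month / 1e6 * (price_b - price_a)

# e.g. 5 billion blended tokens per month at $1.25 vs $1.35 per 1M
print(round(monthly_saving(5e9, 1.25, 1.35), 2))  # -> 500.0
```

Modest per-million differences compound linearly with volume, so the blended-price ranking matters most for high-throughput production deployments.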
Related articles

  • Step 3.5 Flash API Benchmarks: Latency, Throughput & Cost
  • GLM-4.6 vs DeepSeek-V3.2: Performance, Benchmarks & DeepInfra Results
  • Building Efficient AI Inference on NVIDIA Blackwell Platform