
Qwen3.5 35B A3B is a native vision-language model released by Alibaba Cloud in February 2026. It uses a hybrid architecture that combines Gated Delta Networks with a sparse Mixture-of-Experts (MoE) design to reduce the compute spent per token. Of its 35 billion total parameters, only about 3 billion are activated per token: the MoE layers hold 256 experts, with 8 routed experts and 1 shared expert used for each token. It outperforms previous-generation models more than 6x its size.
The model supports a 262k token context window (extensible to 1M via YaRN), dual thinking and non-thinking modes, tool calling, and 201 languages and dialects. Qwen3.5-Flash is the hosted API version corresponding to Qwen3.5-35B-A3B, offering additional production features including 1M context length by default and official built-in tools. The model is released under the Apache 2.0 license.
Qwen3.5 35B A3B is being offered by multiple providers, but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Provider | Blended price ($/1M tokens) | Output speed (tokens/s) | Latency (TTFT) | E2E response time, 500 tokens |
|---|---|---|---|---|
| DeepInfra (FP8) | $0.71 | 175 | 0.60s | 14.86s |
| Novita | $0.69 | 184 | 1.73s | 15.35s |
| GMI (FP8) | $0.69 | 190 | 2.39s | 15.57s |
| Alibaba Cloud | $0.69 | 162 | 2.04s | 17.48s |
Based on benchmarks across the four tracked providers, DeepInfra is the recommended API for production-scale Qwen3.5 35B A3B deployment. Its time to first token (TTFT) of 0.60s is the lowest measured, roughly a third of the next-best 1.73s, and its end-to-end response time of 14.86s is also the fastest, which makes it the top choice for interactive and user-facing applications. For workloads prioritising maximum raw throughput, GMI (FP8) leads at 190 t/s. For the most cost-sensitive deployments, Novita, GMI, and Alibaba Cloud are all tied at $0.69/1M blended.
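The rankings quoted here can be checked directly against the table. The following is a minimal Python sketch: the figures are copied from the table, while the `PROVIDERS` dictionary layout and the `leaders` helper are illustrative names, not part of any provider's API.

```python
# Figures copied from the provider table; the structure is illustrative.
PROVIDERS = {
    "DeepInfra (FP8)": {"price": 0.71, "speed": 175, "ttft": 0.60, "e2e": 14.86},
    "Novita":          {"price": 0.69, "speed": 184, "ttft": 1.73, "e2e": 15.35},
    "GMI (FP8)":       {"price": 0.69, "speed": 190, "ttft": 2.39, "e2e": 15.57},
    "Alibaba Cloud":   {"price": 0.69, "speed": 162, "ttft": 2.04, "e2e": 17.48},
}


def leaders(metric, higher_is_better=False):
    """Return every provider tied for the best value of `metric`."""
    pick = max if higher_is_better else min
    best = pick(row[metric] for row in PROVIDERS.values())
    return [name for name, row in PROVIDERS.items() if row[metric] == best]


# Speed is the only metric where a higher value is better.
for metric, higher in [("ttft", False), ("e2e", False), ("speed", True), ("price", False)]:
    print(metric, "->", leaders(metric, higher_is_better=higher))
```

Returning a list rather than a single name keeps ties visible, which matters for price: three providers share the $0.69 floor.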
DeepInfra stands out as the overall recommended provider, delivering the lowest latency and the fastest end-to-end response time among all evaluated options.
Its output speed of 175 t/s ranks third of the four providers, so the advantage comes from the 0.60s TTFT, which makes DeepInfra the stronger choice for interactive applications such as real-time assistants, conversational AI, and coding tools. Its blended price of $0.71/1M is $0.02 above the $0.69 floor, a small premium for the latency advantage in user-facing workloads.
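To put the price gap in concrete terms, the sketch below compares the two blended prices at a given token volume. The monthly volume is a hypothetical figure chosen for illustration, and the blended price is taken at face value; real cost depends on your own mix of input and output tokens.

```python
def blended_cost(tokens, price_per_million):
    """Cost in USD for `tokens` tokens at a blended $/1M-token price."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical monthly volume, not taken from the benchmark.
MONTHLY_TOKENS = 1_000_000_000

deepinfra = blended_cost(MONTHLY_TOKENS, 0.71)  # price from the table
floor = blended_cost(MONTHLY_TOKENS, 0.69)      # lowest price in the table
premium = deepinfra - floor

print(f"DeepInfra: ${deepinfra:,.2f}  floor: ${floor:,.2f}  premium: ${premium:,.2f}")
print(f"relative premium: {premium / floor:.1%}")
```

At one billion tokens the difference works out to about $20, or roughly 2.9% over the floor price.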
For workloads where the primary requirement is the highest volume of tokens generated per second, GMI (FP8) leads the benchmark.
GMI’s 190 t/s throughput is the highest measured, making it the natural choice for batch processing, offline data generation, or summarization tasks where the initial latency is not a critical constraint. Its TTFT of 2.39s, however, makes it less suitable for real-time user-facing applications where perceived responsiveness matters.
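The trade-off becomes clearer if each end-to-end figure is split into waiting time and generation time. The sketch below assumes that the E2E figure is simply TTFT plus generation time; the source does not state how E2E is measured, so treat the breakdown as an approximation. All numbers come from the table, and `breakdown` is an illustrative helper.

```python
# (provider, output speed in t/s, TTFT in s, E2E in s), copied from the table.
ROWS = [
    ("DeepInfra (FP8)", 175, 0.60, 14.86),
    ("Novita",          184, 1.73, 15.35),
    ("GMI (FP8)",       190, 2.39, 15.57),
    ("Alibaba Cloud",   162, 2.04, 17.48),
]


def breakdown(rows=ROWS):
    """Split E2E into TTFT and generation time (assumes E2E = TTFT + generation)."""
    result = {}
    for name, speed, ttft, e2e in rows:
        generation_s = e2e - ttft
        result[name] = {
            "generation_s": round(generation_s, 2),
            # Tokens implied by the generation time at the measured speed.
            "implied_tokens": round(generation_s * speed),
        }
    return result


for name, parts in breakdown().items():
    print(f"{name:16s} generation {parts['generation_s']:5.2f}s  "
          f"~{parts['implied_tokens']} tokens")
```

Under this assumption GMI has the shortest generation phase, and it is the 2.39s TTFT that pushes its end-to-end time behind DeepInfra and Novita. The implied token counts come out near 2,500 for every provider, which would be consistent with the E2E figure covering thinking tokens as well as the 500 answer tokens, although the source does not confirm this.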
Novita offers a compelling middle ground between generation speed and initial responsiveness, making it a versatile option for mixed workloads.
Novita ranks second in throughput (184 t/s), TTFT (1.73s), and end-to-end response time (15.35s). Combined with the joint-lowest blended price in the benchmark ($0.69/1M, shared with GMI and Alibaba Cloud), it is a good choice for developers who need a middle ground between DeepInfra’s interactivity and GMI’s raw throughput without paying a premium.
As the model creator, Alibaba Cloud offers a reliable first-party baseline with full production support via the Qwen3.5-Flash hosted API.
Alibaba Cloud’s pricing matches the market floor at $0.69/1M, but its throughput (162 t/s, lowest in the benchmark) and end-to-end response time (17.48s, slowest) make it the least performant option for pure inference speed. It remains the natural fallback for teams already in the Alibaba Cloud ecosystem or needing the extended production features of Qwen3.5-Flash.
Selecting the right provider for Qwen3.5 35B A3B comes down to your application’s primary bottleneck. For interactive, user-facing applications where every millisecond counts, DeepInfra’s lowest-in-benchmark TTFT (0.60s) and end-to-end response time (14.86s) make it the standout choice. For batch workloads requiring maximum sustained throughput, GMI (FP8) leads at 190 t/s. For teams needing a cost-efficient balance of both, Novita places second on every performance metric at the joint-lowest price among the tracked providers.
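This decision rule can be written as a small constraint-based selector. The sketch below is one possible encoding, not an official tool: the `pick` function and its parameters are hypothetical, and the data is the benchmark table as published.

```python
from typing import Optional

# Benchmark figures copied from the table.
PROVIDERS = [
    {"name": "DeepInfra (FP8)", "price": 0.71, "speed": 175, "ttft": 0.60, "e2e": 14.86},
    {"name": "Novita",          "price": 0.69, "speed": 184, "ttft": 1.73, "e2e": 15.35},
    {"name": "GMI (FP8)",       "price": 0.69, "speed": 190, "ttft": 2.39, "e2e": 15.57},
    {"name": "Alibaba Cloud",   "price": 0.69, "speed": 162, "ttft": 2.04, "e2e": 17.48},
]


def pick(optimize: str,
         max_ttft: Optional[float] = None,
         max_price: Optional[float] = None) -> str:
    """Return the best provider for `optimize` among those meeting the limits.

    `optimize` is one of "ttft", "e2e", "price" (lower is better) or
    "speed" (higher is better). Ties go to the earlier row in PROVIDERS.
    """
    if optimize not in {"ttft", "e2e", "price", "speed"}:
        raise ValueError(f"unknown metric: {optimize}")
    candidates = [
        p for p in PROVIDERS
        if (max_ttft is None or p["ttft"] <= max_ttft)
        and (max_price is None or p["price"] <= max_price)
    ]
    if not candidates:
        raise ValueError("no provider satisfies the constraints")
    sign = -1 if optimize == "speed" else 1  # flip so min() always picks the best
    return min(candidates, key=lambda p: sign * p[optimize])["name"]


print(pick("e2e"))                   # interactive workloads
print(pick("speed"))                 # batch workloads
print(pick("e2e", max_price=0.69))   # best response time at the price floor
```

Expressing the choice as hard limits plus one metric to optimise mirrors the article’s framing of a single primary bottleneck; a weighted score would be an alternative, but it would require weights the benchmark does not provide.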
© 2026 Deep Infra. All rights reserved.