

Qwen3.5 9B API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Qwen3.5 9B

Qwen3.5 9B is the flagship of Alibaba’s Qwen3.5 Small Model Series, released on March 2, 2026. It is a multimodal model that combines Gated Delta Networks (a form of linear attention) with a sparse Mixture-of-Experts system, enabling higher throughput and lower latency during inference than traditional dense architectures. The architecture uses a 3:1 ratio of linear attention to full attention, maintaining a 262,144-token context window while remaining efficient enough to run on standard hardware.

Unlike previous generations that added vision capabilities post-hoc, Qwen3.5 9B was trained using early fusion on multimodal tokens, allowing the model to process visual and textual tokens within the same latent space from the start of training. This results in better spatial reasoning, improved OCR accuracy, and more cohesive visual-grounded responses. The model’s performance is largely attributed to Scaled Reinforcement Learning, which optimizes for correct reasoning paths rather than mimicking high-quality text — producing improved instruction following, fewer hallucinations, and higher reliability in fact-retrieval and mathematical reasoning.

Qwen3.5 9B is released under the Apache 2.0 license, enabling commercial use and fine-tuning. It is now available from multiple API providers. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

Qwen3.5 9B (Reasoning) API Review Summary

  • DeepInfra (FP8) is the fastest provider: 205.7 t/s vs Together.ai at 92.3 t/s — approximately 2.2x higher throughput.
  • DeepInfra (FP8) is the lowest-cost option: $0.08 blended / 1M tokens vs $0.11, a ~1.4x price spread.
  • DeepInfra has the cheapest input pricing: $0.04 / 1M input tokens vs Together.ai’s $0.10 — especially beneficial for long-context (10k input token) workloads.
  • DeepInfra has the fastest end-to-end response time: 13.19s vs Together.ai’s 27.84s for a 500-token output.
  • Together.ai wins on TTFT: 0.75s vs DeepInfra’s 1.04s — the only metric where Together.ai leads.
  • Both providers support function/tool calling and the full 262k context window.

Qwen3.5 9B (Reasoning) — Best APIs

| Provider | Quant. | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Speed (t/s) | TTFT (s) | E2E (s) | Context | Why Notable |
|---|---|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | FP8 | $0.08 | $0.04 | $0.20 | 205.7 | 1.04 | 13.19 / 9.72 | 262k | Best throughput + blended cost; best for long inputs and fastest generation |
| Together.ai (FP8) | FP8 | $0.11 | $0.10 | $0.15 | 92.3 | 0.75 | 27.84 / 21.67 | 262k | Best TTFT latency; slower throughput and higher blended cost |

Quick Verdict: Which Qwen3.5 9B Provider is Best?

Based on benchmarks across 2 tracked providers, DeepInfra is the recommended API for production-scale Qwen3.5 9B deployment. It delivers 2.2x faster output speed, the lowest blended price ($0.08/1M), and resolves tasks in less than half the end-to-end time of Together.ai. Together.ai remains a viable alternative for highly interactive, conversational applications where sub-second TTFT (0.75s) is the primary requirement.

Output Speed: DeepInfra Leads by 2.2x

Output speed measures how quickly tokens are generated after the model begins its response — the primary metric for throughput-intensive tasks.

  • DeepInfra: 205.7 t/s
  • Together.ai: 92.3 t/s

DeepInfra operates at approximately 2.2x the speed of Together.ai. For applications generating long-form content, analyzing large datasets, or requiring rapid data extraction, this throughput advantage translates directly into reduced wait times. The gap is large enough to be decisive for any workload where generation volume is the primary bottleneck.
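The throughput numbers translate directly into wall-clock generation time. A minimal sketch (the helper name is ours; this counts only visible output tokens, not any hidden reasoning tokens):

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response once generation has started."""
    return output_tokens / tokens_per_second

# Time to generate a 2,000-token report at each provider's benchmarked speed
deepinfra = generation_seconds(2000, 205.7)   # ~9.7 s
together = generation_seconds(2000, 92.3)     # ~21.7 s
print(f"DeepInfra: {deepinfra:.1f}s, Together.ai: {together:.1f}s")
```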

Latency: Together.ai Has the Edge

TTFT measures the initial responsiveness of an application. For reasoning models like Qwen3.5 9B, this includes the model’s internal thinking time before outputting the first user-facing answer token.

  • Together.ai: 0.75s (sub-second)
  • DeepInfra: 1.04s

Together.ai wins the latency category with a sub-second TTFT of 0.75s. For highly interactive applications — real-time chatbots or voice-to-text assistants — this edge creates a snappier perceived experience. DeepInfra at 1.04s is still highly performant and will be imperceptible to most users in practice, but the 290ms gap is measurable and relevant for latency-critical applications.

Cost Efficiency: DeepInfra Is Cheaper Across the Board

Pricing is evaluated per 1 million tokens, with the blended rate assuming a standard 3:1 input-to-output ratio.

  • Blended Price: DeepInfra $0.08 vs Together.ai $0.11 — DeepInfra is 27% cheaper overall.
  • Input Price: DeepInfra $0.04 vs Together.ai $0.10 — DeepInfra is 60% cheaper on input tokens.
  • Output Price: Together.ai $0.15 vs DeepInfra $0.20 — Together.ai is cheaper on output tokens only.

Because most reasoning and RAG workloads are heavily weighted toward input tokens (large system prompts, document context, retrieval results), DeepInfra’s aggressively priced input tier ($0.04/1M) makes it the more cost-effective choice for the vast majority of real-world usage patterns. Together.ai’s cheaper output pricing ($0.15 vs $0.20) only becomes advantageous for workloads with very short inputs and very long outputs — a less common pattern for reasoning models.
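The blended rate and per-request cost can be reproduced with simple arithmetic. A sketch (function names are ours) using the 3:1 blend and the benchmark workload of 10,000 input / 500 output tokens:

```python
def blended_price(input_per_m: float, output_per_m: float, ratio: float = 3.0) -> float:
    """Blended $/1M tokens, assuming `ratio` input tokens per output token."""
    return (ratio * input_per_m + output_per_m) / (ratio + 1)

def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of a single request at per-million-token rates."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

print(round(blended_price(0.04, 0.20), 4))   # DeepInfra:   0.08
print(round(blended_price(0.10, 0.15), 4))   # Together.ai: 0.1125 (listed as $0.11)
# Benchmark workload: 10,000 input tokens, 500 output tokens
print(round(request_cost(10_000, 500, 0.04, 0.20), 6))  # DeepInfra:   $0.0005
print(round(request_cost(10_000, 500, 0.10, 0.15), 6))  # Together.ai: $0.001075
```

Note how the input-heavy workload roughly doubles the per-request gap relative to the blended rate: cheap input tokens dominate the bill.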

End-to-End Response Time: DeepInfra Is More Than 2x Faster

End-to-end response time combines initial latency, reasoning time, and output speed to measure the complete lifecycle of a request: specifically, how long it takes to deliver a 500-token response from a 10,000-token input prompt.

  • DeepInfra: 13.19s
  • Together.ai: 27.84s

DeepInfra resolves tasks in less than half the time of Together.ai. Despite Together.ai’s slight TTFT advantage, DeepInfra’s 2.2x throughput lead entirely eclipses that edge when measuring total task completion time. For any workload beyond a single short exchange, DeepInfra delivers a substantially faster experience end-to-end.
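Since the measured E2E figures include hidden reasoning-token generation, TTFT plus visible-token streaming gives only a floor on total time; the remainder is the model's internal thinking. A sketch (the helper name and decomposition are ours, not the benchmark's methodology):

```python
def e2e_floor(ttft_s: float, output_tokens: int, tokens_per_second: float) -> float:
    """Lower bound on end-to-end time: first-token latency plus
    streaming time for the visible output tokens. Reasoning tokens
    generated before the answer add on top of this floor."""
    return ttft_s + output_tokens / tokens_per_second

print(e2e_floor(1.04, 500, 205.7))  # ~3.47s floor vs 13.19s measured
print(e2e_floor(0.75, 500, 92.3))   # ~6.17s floor vs 27.84s measured
```

The large gap between floor and measurement on both providers suggests most of the wall-clock time goes to reasoning, which is why raw throughput, not TTFT, dominates total task time.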

Context Window and API Features

Both providers support the full 262,144-token (262k) context window natively available to Qwen3.5 9B, and both fully support Function (Tool) Calling. This means provider selection can rest entirely on performance and pricing metrics — neither provider imposes a technical ceiling on what you can build.
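Both providers expose OpenAI-compatible chat-completion endpoints, so a tool-calling request has the same shape on either one. The sketch below builds such a request body; the model id "Qwen/Qwen3.5-9B" and the `get_weather` tool are illustrative assumptions, not taken from either provider's catalog:

```python
import json

def build_tool_call_request(model: str, user_message: str) -> dict:
    """Assemble an OpenAI-format chat request with one tool definition.
    Tool parameters are described with a JSON Schema object."""
    return {
        "model": model,  # illustrative id; check the provider's model catalog
        "messages": [{"role": "user", "content": user_message}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

payload = build_tool_call_request("Qwen/Qwen3.5-9B", "What's the weather in Oslo?")
print(json.dumps(payload, indent=2))
```

Because the payload is identical for both providers, switching between them is a matter of changing the base URL and API key.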

Conclusion

For the vast majority of Qwen3.5 9B deployments, DeepInfra is the recommended provider. With 205.7 t/s output speed, an end-to-end response time of just 13.19s, and the lowest blended price among tracked providers at $0.08 per million tokens, DeepInfra delivers an unmatched combination of speed and cost-effectiveness.

  • Choose DeepInfra for the best overall value — fastest throughput, lowest cost, and best end-to-end response times.
  • Choose Together.ai strictly for highly interactive applications where sub-second TTFT (0.75s) is the primary architectural requirement.