

Qwen3.5 0.8B API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by han

About Qwen3.5 0.8B (Reasoning)

Qwen3.5 0.8B is part of Alibaba Cloud’s Qwen3.5 Small Model Series, released on March 2, 2026. Designed under the philosophy of “More Intelligence, Less Compute,” it targets edge devices, mobile phones, and low-latency applications where battery life and memory constraints are critical. It employs an Efficient Hybrid Architecture combining Gated Delta Networks (a 3:1 ratio of linear to full attention layers) with sparse Mixture-of-Experts, enabling high output quality while controlling memory growth — supporting a 262,000-token context window despite its compact footprint.

Unlike earlier small models that added vision capabilities post-hoc, Qwen3.5 0.8B features native multimodal capabilities through early fusion training on multimodal tokens. The model supports 201 languages and dialects, uses extended chain-of-thought reasoning to work through complex problems before providing an answer, and supports function calling for agentic workflows. It can run on devices with as little as 2–3 GB of RAM using GGUF quantized formats, and is released under the Apache 2.0 license enabling commercial use and fine-tuning.

Qwen3.5 0.8B is now available via DeepInfra — this analysis breaks down the key performance metrics developers need to evaluate before deploying.

Qwen3.5 0.8B (Reasoning) API Review Summary

  • DeepInfra (FP8) is the only benchmarked provider for Qwen3.5 0.8B (Reasoning), leading across all key metrics: speed, latency, and price.
  • Fastest output speed: 403.5 tokens/sec (P50 over the past 72 hours on a 10,000-input-token workload).
  • Lowest latency: 0.37s TTFT — sub-half-second initial response.
  • Lowest blended price: $0.02 per 1M tokens (3:1 input:output blend).
  • Lowest token rates: $0.01 / 1M input tokens and $0.05 / 1M output tokens.
  • End-to-end response time: 6.56s for a 500-token output (thinking time: 4.96s; answer generation: ~1.60s).
  • Context window: 262k tokens.
  • Function Calling and JSON Mode: Both supported.

Quick Summary of DeepInfra

DeepInfra is the only provider for Qwen3.5 0.8B deployment. It delivers 403.5 t/s output speed, a 0.37s TTFT, and a blended price of $0.02/1M tokens. The combination of sub-half-second latency, high throughput, and native JSON mode and function calling support makes it well suited for both real-time and batch workloads.

Latency: 0.37s Time to First Token

For interactive AI applications, chatbots, and real-time agentic workflows, TTFT is the most critical user-facing metric. DeepInfra records a median TTFT of 0.37 seconds — measured after processing a 10,000 input token workload, which for a reasoning model includes initial input processing and generation of the first reasoning token.

A sub-half-second TTFT effectively eliminates perceptible startup delay in real-time applications. This makes DeepInfra the recommended inference choice for workloads requiring immediate perceived responsiveness, from conversational interfaces to coding assistants.
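As a rough illustration of how a client-observed TTFT figure is measured, the sketch below times the arrival of the first non-empty chunk from a streamed response. The helper is generic; the simulated stream is a stand-in for a real streaming response from an OpenAI-compatible endpoint such as DeepInfra's, and the delay value is purely for the demo, not a benchmark result.

```python
import time

def time_to_first_token(chunks):
    """Return seconds elapsed from the start of iteration until the
    first non-empty chunk arrives (the client-observed TTFT)."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:
            return time.perf_counter() - start
    return None  # stream ended without producing a token

def simulated_stream(delay_s, tokens):
    """Stand-in for a streaming API response: waits, then yields tokens."""
    time.sleep(delay_s)
    for tok in tokens:
        yield tok

ttft = time_to_first_token(simulated_stream(0.05, ["Hello", ",", " world"]))
print(f"TTFT: {ttft:.3f}s")
```

In a real client you would pass the chunk iterator returned by a streaming chat-completions call instead of the simulated generator.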

Output Speed: 403.5 Tokens per Second

Inference output speed dictates how quickly a model can stream its generated response after the first token is received. DeepInfra achieves 403.5 tokens per second — a sustained P50 measurement over a 72-hour period.

At 403.5 t/s, a standard 500-token response is generated in approximately 1.2 seconds. For throughput-intensive tasks such as bulk summarization, automated report generation, long-form content creation, or complex programmatic reasoning, this generation speed ensures token output never becomes a pipeline bottleneck.
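The arithmetic behind that figure is simple; the small helper below converts a measured throughput into an expected streaming time for a given output length (excluding TTFT).

```python
def generation_seconds(output_tokens, tokens_per_second):
    """Expected streaming time for a response, excluding TTFT."""
    return output_tokens / tokens_per_second

# 500-token response at DeepInfra's measured 403.5 t/s:
print(round(generation_seconds(500, 403.5), 2))  # 1.24
```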

End-to-End Response Time: 6.56 Seconds for 500 Tokens

End-to-end response time provides the most accurate view of total API transaction duration. DeepInfra completes a full 500-token output in 6.56 seconds: 4.96 seconds of internal reasoning (a phase that begins with the 0.37s TTFT, since the first token streamed is a reasoning token) followed by approximately 1.60 seconds of answer generation.

This predictable and stable E2E latency prevents client-side request timeouts during multi-step prompt executions and makes it well suited for complex, multi-step agentic workflows.
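Taking the measured breakdown above at face value, a client can budget its request timeout from those numbers rather than guessing. The 2x safety factor below is an arbitrary illustrative choice, not a benchmark recommendation.

```python
def timeout_budget(e2e_s, safety_factor=2.0):
    """Client-side timeout with headroom for provider-side variance."""
    return e2e_s * safety_factor

# Measured components from the benchmark: 4.96 s thinking + ~1.60 s answer
e2e = 4.96 + 1.60  # 6.56 s total
print(f"timeout: {timeout_budget(e2e):.1f}s")  # timeout: 13.1s
```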

Cost Efficiency: $0.02 Blended Price per 1M Tokens

DeepInfra offers the following pricing for Qwen3.5 0.8B inference:

  • Input Price: $0.01 per 1M tokens
  • Output Price: $0.05 per 1M tokens
  • Blended Price: $0.02 per 1M tokens (3:1 input:output ratio)

The heavily discounted input pricing ($0.01/1M) makes it particularly cost-effective for RAG architectures, where large context payloads are sent to the API prior to generation. For high-volume deployments processing millions of tokens per day, this pricing structure delivers strong operational economics.
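The blended figure follows directly from the per-token rates. The helper below reproduces it and estimates spend for an input-heavy workload; the daily volumes in the second example are hypothetical, chosen only to illustrate a RAG-style traffic mix.

```python
def cost_usd(input_tokens, output_tokens,
             input_price_per_m=0.01, output_price_per_m=0.05):
    """Inference cost at DeepInfra's listed per-1M-token rates."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Blended price check: 3:1 input:output mix over 1M total tokens
blended = cost_usd(750_000, 250_000)
print(round(blended, 4))  # 0.02

# Hypothetical RAG workload: 5M input + 0.5M output tokens per day
print(round(cost_usd(5_000_000, 500_000), 3))  # 0.075 (USD/day)
```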

Context Window and API Features

DeepInfra’s deployment of Qwen3.5 0.8B supports a 262k token context window alongside native Function Calling (Tool Use) and JSON Mode. A 262k context limit allows developers to pass hundreds of pages of documentation, extensive codebases, or long conversation histories in a single API request. Native function calling and JSON mode support enables the model to reliably trigger external APIs, return structured outputs, and interact with complex agentic workflows.
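As a sketch of what a tool-calling request could look like against an OpenAI-compatible endpoint like DeepInfra's, the snippet below only builds the request payload (no network call). The payload shape follows the standard OpenAI chat-completions convention; the model identifier and the `get_weather` tool are illustrative assumptions, not values from the benchmark, so check the provider's model page for the exact id.

```python
import json

# Hypothetical model id and tool schema, for illustration only.
payload = {
    "model": "Qwen/Qwen3.5-0.8B",  # assumed identifier; verify with provider
    "messages": [
        {"role": "user", "content": "What's the weather in Hangzhou?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    # For JSON Mode instead of tool calling, you would send:
    # "response_format": {"type": "json_object"},
}

print(json.dumps(payload, indent=2)[:80] + "...")
```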

Frequently Asked Questions

What is the cheapest API for Qwen3.5 0.8B?

DeepInfra (FP8) offers the lowest pricing at $0.01 per 1M input tokens and $0.05 per 1M output tokens, with a blended rate of $0.02 per 1M tokens.

How fast is Qwen3.5 0.8B’s time-to-first-token?

On DeepInfra (FP8), the median TTFT is 0.37 seconds on a 10,000 input token workload, measured as P50 over 72 hours.

What is the context window for Qwen3.5 0.8B?

The model supports a 262,000-token (262k) context window, enabling extensive RAG use cases and processing of large documents or codebases.

Does Qwen3.5 0.8B support function calling?

Yes. DeepInfra’s API provides native support for both function (tool) calling and JSON mode, making it suitable for autonomous agent development.

What is the output speed of Qwen3.5 0.8B on DeepInfra?

DeepInfra (FP8) delivers 403.5 tokens per second, allowing a standard 500-token response to be generated in approximately 1.2 seconds.

Can I run Qwen3.5 0.8B locally?

Yes. The model is available under the Apache 2.0 license on Hugging Face and ModelScope. It can run on devices with as little as 2–3 GB of RAM using GGUF quantized formats via llama.cpp or Ollama.

Conclusion

For developers deploying Qwen3.5 0.8B (Reasoning), DeepInfra (FP8) is the way to go. It combines a sub-half-second TTFT (0.37s), high output throughput (403.5 t/s), and a blended price of just $0.02 per million tokens — delivering strong performance for both latency-sensitive and throughput-intensive production workloads, with native JSON mode and function calling support included.
