
Qwen3.5 0.8B is part of Alibaba Cloud’s Qwen3.5 Small Model Series, released on March 2, 2026. Designed under the philosophy of “More Intelligence, Less Compute,” it targets edge devices, mobile phones, and low-latency applications where battery life and memory constraints are critical. It employs an Efficient Hybrid Architecture combining Gated Delta Networks (a 3:1 ratio of linear to full attention layers) with sparse Mixture-of-Experts, enabling high output quality while controlling memory growth — supporting a 262,000-token context window despite its compact footprint.
Unlike earlier small models that added vision capabilities post-hoc, Qwen3.5 0.8B features native multimodal capabilities through early fusion training on multimodal tokens. The model supports 201 languages and dialects, uses extended chain-of-thought reasoning to work through complex problems before providing an answer, and supports function calling for agentic workflows. It can run on devices with as little as 2–3 GB of RAM using GGUF quantized formats, and is released under the Apache 2.0 license enabling commercial use and fine-tuning.
Qwen3.5 0.8B is now available via DeepInfra — this analysis breaks down the key performance metrics developers need to evaluate before deploying.
DeepInfra is the only provider for Qwen3.5 0.8B deployment. It delivers 403.5 t/s output speed, a 0.37s time to first token (TTFT), and a blended price of $0.02/1M tokens. The combination of sub-half-second latency, high throughput, and native JSON mode and function calling support makes it well suited for both real-time and batch workloads.
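As a concrete starting point, the endpoint can be called over plain HTTP with the OpenAI-compatible chat schema. This is a minimal sketch using only the Python standard library; the model ID `Qwen/Qwen3.5-0.8B` is an assumption and should be confirmed against DeepInfra's model page.

```python
import json
import os
import urllib.request

# DeepInfra's OpenAI-compatible chat-completions endpoint.
API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL = "Qwen/Qwen3.5-0.8B"  # assumed model ID; verify on DeepInfra's model page


def build_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 500,
    }


def complete(prompt: str, api_key: str) -> str:
    """Send one chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Only hits the network when a key is configured.
if os.environ.get("DEEPINFRA_API_KEY"):
    print(complete("Explain gated delta networks in one sentence.",
                   os.environ["DEEPINFRA_API_KEY"]))
```

The same payload shape works with any OpenAI-compatible client library by pointing its base URL at DeepInfra.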
For interactive AI applications, chatbots, and real-time agentic workflows, TTFT is the most critical user-facing metric. DeepInfra records a median TTFT of 0.37 seconds — measured after processing a 10,000 input token workload, which for a reasoning model includes initial input processing and generation of the first reasoning token.
A sub-half-second TTFT effectively eliminates cold start delays for real-time applications. It is the recommended inference choice for applications requiring immediate perceived responsiveness, from conversational interfaces to coding assistants.
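TTFT is easy to verify from the client side: stream the response and time the gap between sending the request and receiving the first token. A minimal, network-agnostic harness is sketched below; `fake_stream` is a stand-in for a real SSE token stream from the API.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[Optional[float], list]:
    """Return (seconds until first token, all tokens) for a token stream."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in stream:
        if ttft is None:
            # First token arrived: record the elapsed time once.
            ttft = time.perf_counter() - start
        tokens.append(tok)
    return ttft, tokens


def fake_stream() -> Iterator[str]:
    """Simulated stream: the 'server' pauses 50 ms before the first token."""
    time.sleep(0.05)
    yield "Hello"
    yield " world"


ttft, toks = measure_ttft(fake_stream())
print(f"TTFT: {ttft:.3f}s over {len(toks)} tokens")
```

In production the same function can wrap the streamed chunks of a real chat-completions call, giving a client-side TTFT measurement to compare against the published 0.37s median.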
Inference output speed dictates how quickly a model can stream its generated response after the first token is received. DeepInfra achieves 403.5 tokens per second — a sustained P50 measurement over a 72-hour period.
At 403.5 t/s, a standard 500-token response is generated in approximately 1.2 seconds. For throughput-intensive tasks such as bulk summarization, automated report generation, long-form content creation, or complex programmatic reasoning, this generation speed ensures token output never becomes a pipeline bottleneck.
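The 1.2-second figure follows directly from the throughput: decode time is simply output tokens divided by sustained tokens per second.

```python
# Back-of-the-envelope decode time at a given sustained output speed.
def generation_time(output_tokens: int, tokens_per_second: float) -> float:
    return output_tokens / tokens_per_second


print(f"{generation_time(500, 403.5):.2f}s")  # ≈ 1.24s for a 500-token response
```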
End-to-end response time provides the most accurate view of total API transaction duration. DeepInfra completes a full 500-token output in 6.56 seconds, composed of the 0.37s TTFT, a 4.96-second internal reasoning time, and approximately 1.23 seconds of pure output time.
This predictable and stable E2E latency prevents client-side request timeouts during multi-step prompt executions and makes it well suited for complex, multi-step agentic workflows.
DeepInfra offers the following pricing for Qwen3.5 0.8B inference: $0.01 per 1M input tokens, $0.05 per 1M output tokens, for a blended rate of $0.02 per 1M tokens.
The heavily discounted input pricing ($0.01/1M) makes it particularly cost-effective for RAG architectures, where large context payloads are sent to the API prior to generation. For high-volume deployments processing millions of tokens per day, this pricing structure delivers strong operational economics.
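At these rates, per-request cost is straightforward to budget. The sketch below uses the published FP8 prices; the 10k-input / 500-output workload is an illustrative RAG-style request, not a benchmark figure.

```python
INPUT_PRICE = 0.01 / 1_000_000   # $ per input token (DeepInfra FP8)
OUTPUT_PRICE = 0.05 / 1_000_000  # $ per output token (DeepInfra FP8)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the published per-token prices."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE


# A typical RAG call: 10k-token context, 500-token answer.
print(f"${request_cost(10_000, 500):.6f} per request")  # → $0.000125 per request
```

At that rate, a million such requests per day costs roughly $125 — the kind of arithmetic that makes the discounted input pricing matter for context-heavy workloads.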
DeepInfra’s deployment of Qwen3.5 0.8B supports a 262k token context window alongside native Function Calling (Tool Use) and JSON Mode. A 262k context limit allows developers to pass hundreds of pages of documentation, extensive codebases, or long conversation histories in a single API request. Native function calling and JSON mode support enables the model to reliably trigger external APIs, return structured outputs, and interact with complex agentic workflows.
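Tool use follows the standard OpenAI-style request schema. Below is a sketch of a function-calling request payload; the `get_weather` tool and the model ID are illustrative assumptions, not part of DeepInfra's documented catalog.

```python
import json

# OpenAI-style tool definition, as accepted by OpenAI-compatible chat APIs.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "Qwen/Qwen3.5-0.8B",  # assumed model ID
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
print(json.dumps(payload, indent=2))
```

When the model elects to call the tool, the response contains a `tool_calls` entry with JSON arguments matching the declared schema, which the client executes and feeds back as a `tool` message.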
Which provider offers the lowest price for Qwen3.5 0.8B?
DeepInfra (FP8) offers the lowest pricing at $0.01 per 1M input tokens and $0.05 per 1M output tokens, with a blended rate of $0.02 per 1M tokens.

What is the time to first token?
On DeepInfra (FP8), the median TTFT is 0.37 seconds on a 10,000 input token workload, measured as P50 over 72 hours.

What context window does Qwen3.5 0.8B support?
The model supports a 262,000-token (262k) context window, enabling extensive RAG use cases and processing of large documents or codebases.

Does it support function calling and JSON mode?
Yes. DeepInfra’s API provides native support for both function (tool) calling and JSON mode, making it suitable for autonomous agent development.

How fast is output generation?
DeepInfra (FP8) delivers 403.5 tokens per second, allowing a standard 500-token response to be generated in approximately 1.2 seconds.

Can Qwen3.5 0.8B run locally?
Yes. The model is available under the Apache 2.0 license on Hugging Face and ModelScope. It can run on devices with as little as 2–3 GB of RAM using GGUF quantized formats via llama.cpp or Ollama.
For developers deploying Qwen3.5 0.8B (Reasoning), DeepInfra (FP8) is the way to go. It combines a sub-half-second TTFT (0.37s), high output throughput (403.5 t/s), and a blended price of just $0.02 per million tokens — delivering strong performance for both latency-sensitive and throughput-intensive production workloads, with native JSON mode and function calling support included.