
Qwen3.5 4B is a compact 4-billion-parameter open-weights model released in March 2026 as part of Alibaba Cloud's Qwen3.5 Small Model Series. It employs an Efficient Hybrid Architecture combining Gated Delta Networks (a form of linear attention) with sparse Mixture-of-Experts, delivering high-throughput inference with minimal latency overhead. This is a significant architectural departure from standard Transformers.
Unlike earlier small models that added vision capabilities post-hoc, Qwen3.5 4B features native multimodal capabilities through early fusion training on multimodal tokens. This allows the model to process text, image, and video inputs within the same latent space, resulting in superior spatial reasoning, improved OCR accuracy, and more cohesive visual-grounded responses. The model supports 201 languages and dialects, features a 262,144-token native context window (extensible to 1M via YaRN), and uses extended chain-of-thought reasoning to work through complex problems before providing an answer.
All Qwen3.5 open-weight models are released under the Apache 2.0 license, enabling commercial use and fine-tuning. Qwen3.5 4B is now available via DeepInfra — this analysis breaks down the key performance metrics developers need to evaluate before deploying.
DeepInfra is currently the sole provider serving Qwen3.5 4B. It delivers 250.0 t/s output speed, a 0.45s TTFT, and a blended price of $0.06 per 1M tokens. The combination of sub-half-second latency and high throughput makes it well suited to both interactive and batch workloads.
For interactive AI applications, chatbots, and real-time agentic workflows, time to first token (TTFT) is the most critical user-facing metric. DeepInfra records a median TTFT of 0.45 seconds, measured after processing a 10,000-token input workload; for a reasoning model this includes initial input processing and generation of the first reasoning token.
A sub-half-second TTFT effectively eliminates cold start delays for real-time applications. It is the recommended inference choice for applications requiring immediate perceived responsiveness, from conversational interfaces to coding assistants.
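To see how TTFT shows up in practice, the sketch below times the first streamed token from an OpenAI-compatible chat-completions endpoint. The model id and base URL are illustrative assumptions, not confirmed values from DeepInfra's docs; the actual network call is left commented out.

```python
import time

# Assumptions: model id and base URL are illustrative, not confirmed.
BASE_URL = "https://api.deepinfra.com/v1/openai"
MODEL = "Qwen/Qwen3.5-4B"

def build_stream_request(prompt: str) -> dict:
    """Chat-completions payload with streaming enabled so the
    first token can be timed as it arrives."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }

def measure_ttft(start: float, first_chunk_at: float) -> float:
    """TTFT = wall-clock delay between sending the request and
    receiving the first streamed chunk."""
    return first_chunk_at - start

# Timing harness sketch (requires an API key; not executed here):
# client = openai.OpenAI(base_url=BASE_URL, api_key="...")
# t0 = time.monotonic()
# for chunk in client.chat.completions.create(**build_stream_request("hi")):
#     ttft = measure_ttft(t0, time.monotonic())
#     break
```

Measuring against `time.monotonic()` rather than `time.time()` avoids wall-clock adjustments skewing sub-second latency numbers.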
Inference output speed dictates how quickly a model can stream its generated response after the first token is received. DeepInfra achieves 250.0 tokens per second — a sustained P50 measurement over a 72-hour period.
At 250 t/s, a 4-billion parameter model can generate extensive reasoning chains and final answers rapidly. For throughput-intensive tasks such as bulk summarization, automated report generation, long-form content creation, or complex programmatic reasoning, this generation speed ensures token output never becomes a pipeline bottleneck.
End-to-end response time provides the most accurate view of total API transaction duration. DeepInfra completes a full 500-token output generation in 10.45 seconds, composed of the 0.45s TTFT, the model’s standardized internal reasoning time, and an 8.00-second pure output time.
This predictable and stable E2E latency prevents client-side request timeouts during multi-step prompt executions and makes it well suited for complex, multi-step agentic workflows.
DeepInfra prices Qwen3.5 4B inference at $0.03 per 1M input tokens, blending to $0.06 per 1M tokens overall.
The heavily discounted input pricing ($0.03/1M) makes it particularly cost-effective for RAG architectures, where large context payloads are sent to the API before generation begins. For high-volume deployments processing millions of tokens per day, this pricing structure delivers strong operational economics.
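A quick back-of-envelope cost model, using only the published rates (the input:output blend ratio is not published, so no per-output-token price is derived here):

```python
INPUT_PRICE_PER_M = 0.03   # USD per 1M input tokens (published)
BLENDED_PER_M = 0.06       # USD per 1M tokens, blended (published)

def request_input_cost(input_tokens: int) -> float:
    """Input-side cost of a single RAG-style request."""
    return input_tokens / 1_000_000 * INPUT_PRICE_PER_M

def daily_cost_blended(tokens_per_day: int) -> float:
    """Rough daily spend at the blended rate."""
    return tokens_per_day / 1_000_000 * BLENDED_PER_M

# A 100k-token context payload costs $0.003 on the input side;
# 50M blended tokens/day runs about $3.00/day.
```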
DeepInfra’s deployment of Qwen3.5 4B supports a 262k token context window alongside native Function Calling (Tool Use). A 262k context limit allows developers to pass hundreds of pages of documentation, extensive codebases, or long conversation histories in a single API request. Native function calling support enables the model to reliably trigger external APIs, query databases, and interact with structured workflows — making it a practical foundation for autonomous agents.
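The function-calling support described above follows the standard OpenAI-compatible tools schema. The sketch below builds such a request; the tool name, its parameters, and the model id are illustrative assumptions, not values from DeepInfra's documentation.

```python
def weather_tool() -> dict:
    """Example tool definition in the OpenAI-compatible tools schema.
    The get_weather tool is hypothetical, for illustration only."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }

def build_tool_request(prompt: str) -> dict:
    """Chat-completions payload that lets the model decide whether
    to emit a tool call instead of plain text."""
    return {
        "model": "Qwen/Qwen3.5-4B",  # assumed model id
        "messages": [{"role": "user", "content": prompt}],
        "tools": [weather_tool()],
        "tool_choice": "auto",
    }
```

With `tool_choice` set to `"auto"`, the model returns either a normal completion or a structured tool call the agent loop can dispatch.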
For developers deploying Qwen3.5 4B (Reasoning), DeepInfra's FP8 endpoint is the clear choice. It combines a sub-half-second TTFT (0.45s), high output throughput (250.0 t/s), and a market-competitive blended price of $0.06 per million tokens, delivering strong performance for both latency-sensitive and throughput-intensive production workloads.