
Qwen3.5 4B is a compact 4-billion-parameter open-weights model released in March 2026 as part of Alibaba Cloud's Qwen3.5 Small Model Series. It employs an Efficient Hybrid Architecture combining Gated Delta Networks (a form of linear attention) with sparse Mixture-of-Experts, delivering high-throughput inference with minimal latency overhead. This is a significant architectural departure from standard Transformers.
Unlike earlier small models that added vision capabilities post-hoc, Qwen3.5 4B features native multimodal capabilities through early fusion training on multimodal tokens. This allows the model to process text, image, and video inputs within the same latent space, resulting in superior spatial reasoning, improved OCR accuracy, and more cohesive visual-grounded responses. The model supports 201 languages and dialects, features a 262,144-token native context window (extensible to 1M via YaRN), and uses extended chain-of-thought reasoning to work through complex problems before providing an answer.
All Qwen3.5 open-weight models are released under the Apache 2.0 license, enabling commercial use and fine-tuning. Qwen3.5 4B is now available via DeepInfra — this analysis breaks down the key performance metrics developers need to evaluate before deploying.
DeepInfra is currently the sole provider serving Qwen3.5 4B. It delivers 250.0 t/s output speed, a 0.45s TTFT, and a blended price of $0.06 per 1M tokens. The combination of sub-half-second latency and high throughput makes it well suited to both interactive and batch workloads.
For interactive AI applications, chatbots, and real-time agentic workflows, time to first token (TTFT) is the most critical user-facing metric. DeepInfra records a median TTFT of 0.45 seconds, measured after processing a 10,000-token input workload; for a reasoning model this includes initial input processing and generation of the first reasoning token.
A sub-half-second TTFT effectively eliminates cold start delays for real-time applications. It is the recommended inference choice for applications requiring immediate perceived responsiveness, from conversational interfaces to coding assistants.
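To see how TTFT shows up in practice, the sketch below times the first streamed token from an OpenAI-compatible chat-completions endpoint. The model id and base URL are illustrative assumptions, not confirmed values from DeepInfra's docs; the actual network call is left commented out.

```python
import time

# Assumptions: model id and base URL are illustrative, not confirmed.
BASE_URL = "https://api.deepinfra.com/v1/openai"
MODEL = "Qwen/Qwen3.5-4B"

def build_stream_request(prompt: str) -> dict:
    """Chat-completions payload with streaming enabled so the
    first token can be timed as it arrives."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }

def measure_ttft(start: float, first_chunk_at: float) -> float:
    """TTFT = wall-clock delay between sending the request and
    receiving the first streamed chunk."""
    return first_chunk_at - start

# Timing harness sketch (requires an API key; not executed here):
# client = openai.OpenAI(base_url=BASE_URL, api_key="...")
# t0 = time.monotonic()
# for chunk in client.chat.completions.create(**build_stream_request("hi")):
#     ttft = measure_ttft(t0, time.monotonic())
#     break
```

Measuring against `time.monotonic()` rather than `time.time()` avoids wall-clock adjustments skewing sub-second latency numbers.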
Inference output speed dictates how quickly a model can stream its generated response after the first token is received. DeepInfra achieves 250.0 tokens per second — a sustained P50 measurement over a 72-hour period.
At 250 t/s, a 4-billion parameter model can generate extensive reasoning chains and final answers rapidly. For throughput-intensive tasks such as bulk summarization, automated report generation, long-form content creation, or complex programmatic reasoning, this generation speed ensures token output never becomes a pipeline bottleneck.
End-to-end response time provides the most accurate view of total API transaction duration. DeepInfra completes a full 500-token output generation in 10.45 seconds, composed of the 0.45s TTFT, the model’s standardized internal reasoning time, and an 8.00-second pure output time.
This predictable and stable E2E latency prevents client-side request timeouts during multi-step prompt executions and makes it well suited for complex, multi-step agentic workflows.
DeepInfra prices Qwen3.5 4B inference at $0.03 per 1M input tokens, blending to $0.06 per 1M tokens overall.
The heavily discounted input pricing ($0.03/1M) makes it particularly cost-effective for RAG architectures, where large context payloads are sent to the API before generation begins. For high-volume deployments processing millions of tokens per day, this pricing structure delivers strong operational economics.
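A quick back-of-envelope cost model, using only the published rates (the input:output blend ratio is not published, so no per-output-token price is derived here):

```python
INPUT_PRICE_PER_M = 0.03   # USD per 1M input tokens (published)
BLENDED_PER_M = 0.06       # USD per 1M tokens, blended (published)

def request_input_cost(input_tokens: int) -> float:
    """Input-side cost of a single RAG-style request."""
    return input_tokens / 1_000_000 * INPUT_PRICE_PER_M

def daily_cost_blended(tokens_per_day: int) -> float:
    """Rough daily spend at the blended rate."""
    return tokens_per_day / 1_000_000 * BLENDED_PER_M

# A 100k-token context payload costs $0.003 on the input side;
# 50M blended tokens/day runs about $3.00/day.
```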
DeepInfra’s deployment of Qwen3.5 4B supports a 262k token context window alongside native Function Calling (Tool Use). A 262k context limit allows developers to pass hundreds of pages of documentation, extensive codebases, or long conversation histories in a single API request. Native function calling support enables the model to reliably trigger external APIs, query databases, and interact with structured workflows — making it a practical foundation for autonomous agents.
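The function-calling support described above follows the standard OpenAI-compatible tools schema. The sketch below builds such a request; the tool name, its parameters, and the model id are illustrative assumptions, not values from DeepInfra's documentation.

```python
def weather_tool() -> dict:
    """Example tool definition in the OpenAI-compatible tools schema.
    The get_weather tool is hypothetical, for illustration only."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }

def build_tool_request(prompt: str) -> dict:
    """Chat-completions payload that lets the model decide whether
    to emit a tool call instead of plain text."""
    return {
        "model": "Qwen/Qwen3.5-4B",  # assumed model id
        "messages": [{"role": "user", "content": prompt}],
        "tools": [weather_tool()],
        "tool_choice": "auto",
    }
```

With `tool_choice` set to `"auto"`, the model returns either a normal completion or a structured tool call the agent loop can dispatch.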
For developers deploying Qwen3.5 4B (Reasoning), DeepInfra's FP8 endpoint is the clear choice. It combines a sub-half-second TTFT (0.45s), high output throughput (250.0 t/s), and a market-competitive blended price of $0.06 per million tokens, delivering strong performance for both latency-sensitive and throughput-intensive production workloads.