
Qwen3.5 0.8B is part of Alibaba Cloud’s Qwen3.5 Small Model Series, released on March 2, 2026. Designed under the philosophy of “More Intelligence, Less Compute,” it targets edge devices, mobile phones, and low-latency applications where battery life and memory constraints are critical. It employs an Efficient Hybrid Architecture combining Gated Delta Networks (a 3:1 ratio of linear to full attention layers) with sparse Mixture-of-Experts, enabling high output quality while controlling memory growth — supporting a 262,000-token context window despite its compact footprint.
Unlike earlier small models that added vision capabilities post-hoc, Qwen3.5 0.8B features native multimodal capabilities through early fusion training on multimodal tokens. The model supports 201 languages and dialects, uses extended chain-of-thought reasoning to work through complex problems before providing an answer, and supports function calling for agentic workflows. It can run on devices with as little as 2–3 GB of RAM using GGUF quantized formats, and is released under the Apache 2.0 license enabling commercial use and fine-tuning.
Qwen3.5 0.8B is now available via DeepInfra — this analysis breaks down the key performance metrics developers need to evaluate before deploying.
DeepInfra is the only provider for Qwen3.5 0.8B deployment. It delivers 403.5 t/s output speed, a 0.37s time to first token (TTFT), and a blended price of $0.02/1M tokens. The combination of sub-half-second latency, high throughput, and native JSON mode and function calling support makes it well suited for both real-time and batch workloads.
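As a concrete starting point, the endpoint can be called over plain HTTP with the OpenAI-compatible chat schema. This is a minimal sketch using only the Python standard library; the model ID `Qwen/Qwen3.5-0.8B` is an assumption and should be confirmed against DeepInfra's model page.

```python
import json
import os
import urllib.request

# DeepInfra's OpenAI-compatible chat-completions endpoint.
API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL = "Qwen/Qwen3.5-0.8B"  # assumed model ID; verify on DeepInfra's model page


def build_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 500,
    }


def complete(prompt: str, api_key: str) -> str:
    """Send one chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Only hits the network when a key is configured.
if os.environ.get("DEEPINFRA_API_KEY"):
    print(complete("Explain gated delta networks in one sentence.",
                   os.environ["DEEPINFRA_API_KEY"]))
```

The same payload shape works with any OpenAI-compatible client library by pointing its base URL at DeepInfra.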
For interactive AI applications, chatbots, and real-time agentic workflows, TTFT is the most critical user-facing metric. DeepInfra records a median TTFT of 0.37 seconds — measured after processing a 10,000 input token workload, which for a reasoning model includes initial input processing and generation of the first reasoning token.
A sub-half-second TTFT effectively eliminates cold start delays for real-time applications. It is the recommended inference choice for applications requiring immediate perceived responsiveness, from conversational interfaces to coding assistants.
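TTFT is easy to verify from the client side: stream the response and time the gap between sending the request and receiving the first token. A minimal, network-agnostic harness is sketched below; `fake_stream` is a stand-in for a real SSE token stream from the API.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[Optional[float], list]:
    """Return (seconds until first token, all tokens) for a token stream."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in stream:
        if ttft is None:
            # First token arrived: record the elapsed time once.
            ttft = time.perf_counter() - start
        tokens.append(tok)
    return ttft, tokens


def fake_stream() -> Iterator[str]:
    """Simulated stream: the 'server' pauses 50 ms before the first token."""
    time.sleep(0.05)
    yield "Hello"
    yield " world"


ttft, toks = measure_ttft(fake_stream())
print(f"TTFT: {ttft:.3f}s over {len(toks)} tokens")
```

In production the same function can wrap the streamed chunks of a real chat-completions call, giving a client-side TTFT measurement to compare against the published 0.37s median.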
Inference output speed dictates how quickly a model can stream its generated response after the first token is received. DeepInfra achieves 403.5 tokens per second — a sustained P50 measurement over a 72-hour period.
At 403.5 t/s, a standard 500-token response is generated in approximately 1.2 seconds. For throughput-intensive tasks such as bulk summarization, automated report generation, long-form content creation, or complex programmatic reasoning, this generation speed ensures token output never becomes a pipeline bottleneck.
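The 1.2-second figure follows directly from the throughput: decode time is simply output tokens divided by sustained tokens per second.

```python
# Back-of-the-envelope decode time at a given sustained output speed.
def generation_time(output_tokens: int, tokens_per_second: float) -> float:
    return output_tokens / tokens_per_second


print(f"{generation_time(500, 403.5):.2f}s")  # ≈ 1.24s for a 500-token response
```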
End-to-end response time provides the most accurate view of total API transaction duration. DeepInfra completes a full 500-token output in 6.56 seconds, composed of the 0.37s TTFT, a 4.96-second internal reasoning time, and approximately 1.23 seconds of pure output time.
This predictable and stable E2E latency prevents client-side request timeouts during multi-step prompt executions and makes it well suited for complex, multi-step agentic workflows.
DeepInfra offers the following pricing for Qwen3.5 0.8B inference: $0.01 per 1M input tokens, $0.05 per 1M output tokens, for a blended rate of $0.02 per 1M tokens.
The heavily discounted input pricing ($0.01/1M) makes it particularly cost-effective for RAG architectures, where large context payloads are sent to the API prior to generation. For high-volume deployments processing millions of tokens per day, this pricing structure delivers strong operational economics.
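At these rates, per-request cost is straightforward to budget. The sketch below uses the published FP8 prices; the 10k-input / 500-output workload is an illustrative RAG-style request, not a benchmark figure.

```python
INPUT_PRICE = 0.01 / 1_000_000   # $ per input token (DeepInfra FP8)
OUTPUT_PRICE = 0.05 / 1_000_000  # $ per output token (DeepInfra FP8)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the published per-token prices."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE


# A typical RAG call: 10k-token context, 500-token answer.
print(f"${request_cost(10_000, 500):.6f} per request")  # → $0.000125 per request
```

At that rate, a million such requests per day costs roughly $125 — the kind of arithmetic that makes the discounted input pricing matter for context-heavy workloads.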
DeepInfra’s deployment of Qwen3.5 0.8B supports a 262k token context window alongside native Function Calling (Tool Use) and JSON Mode. A 262k context limit allows developers to pass hundreds of pages of documentation, extensive codebases, or long conversation histories in a single API request. Native function calling and JSON mode support enables the model to reliably trigger external APIs, return structured outputs, and interact with complex agentic workflows.
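Tool use follows the standard OpenAI-style request schema. Below is a sketch of a function-calling request payload; the `get_weather` tool and the model ID are illustrative assumptions, not part of DeepInfra's documented catalog.

```python
import json

# OpenAI-style tool definition, as accepted by OpenAI-compatible chat APIs.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "Qwen/Qwen3.5-0.8B",  # assumed model ID
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
print(json.dumps(payload, indent=2))
```

When the model elects to call the tool, the response contains a `tool_calls` entry with JSON arguments matching the declared schema, which the client executes and feeds back as a `tool` message.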
Which provider offers the lowest price for Qwen3.5 0.8B?
DeepInfra (FP8) offers the lowest pricing at $0.01 per 1M input tokens and $0.05 per 1M output tokens, with a blended rate of $0.02 per 1M tokens.

What is the time to first token?
On DeepInfra (FP8), the median TTFT is 0.37 seconds on a 10,000 input token workload, measured as P50 over 72 hours.

What context window does Qwen3.5 0.8B support?
The model supports a 262,000-token (262k) context window, enabling extensive RAG use cases and processing of large documents or codebases.

Does it support function calling and JSON mode?
Yes. DeepInfra’s API provides native support for both function (tool) calling and JSON mode, making it suitable for autonomous agent development.

How fast is output generation?
DeepInfra (FP8) delivers 403.5 tokens per second, allowing a standard 500-token response to be generated in approximately 1.2 seconds.

Can Qwen3.5 0.8B run locally?
Yes. The model is available under the Apache 2.0 license on Hugging Face and ModelScope. It can run on devices with as little as 2–3 GB of RAM using GGUF quantized formats via llama.cpp or Ollama.
For developers deploying Qwen3.5 0.8B (Reasoning), DeepInfra (FP8) is the way to go. It combines a sub-half-second TTFT (0.37s), high output throughput (403.5 t/s), and a blended price of just $0.02 per million tokens — delivering strong performance for both latency-sensitive and throughput-intensive production workloads, with native JSON mode and function calling support included.