
Step 3.5 Flash is an open-weights reasoning model released in February 2026 by StepFun. It leverages a sparse Mixture of Experts (MoE) architecture with 196 billion total parameters and only 11 billion active parameters per token during inference — delivering state-of-the-art performance at a fraction of the cost of dense models.
Scoring 38 on the Artificial Analysis Intelligence Index — well above the comparable open-weights median of 27 — Step 3.5 Flash features a 256k token context window (roughly 384 A4 pages), extended chain-of-thought reasoning controllable via a reasoning_effort parameter, native tool calling with parallel function invocation, and JSON mode for structured output. The model is released under the Apache 2.0 license, enabling commercial use and third-party hosting on platforms like DeepInfra.
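In practice, these features are typically exposed through an OpenAI-compatible chat completions request. The sketch below shows the general shape of such a request body; the model slug and the exact spelling of the `reasoning_effort` field are assumptions for illustration, not verified API details.

```python
# Sketch of an OpenAI-compatible request exercising the features above:
# reasoning effort control and JSON mode. The model slug is hypothetical;
# check your provider's model page for the real identifier.
import json

payload = {
    "model": "stepfun/step-3.5-flash",  # hypothetical slug
    "messages": [
        {"role": "user", "content": "Summarize the MoE trade-off in one sentence."}
    ],
    "reasoning_effort": "medium",       # controls chain-of-thought length (per the article)
    "response_format": {"type": "json_object"},  # JSON mode for structured output
}

body = json.dumps(payload)  # ready to POST to a chat completions endpoint
```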
It’s a highly verbose model during reasoning — generating an average of 200M tokens during intelligence evaluations versus a median of 17M for comparable models — which makes cost efficiency a critical factor when selecting an inference provider.
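To see why verbosity dominates the bill, here is a back-of-envelope comparison using the token counts above and the $0.30 per 1M output tokens rate quoted in the pricing table:

```python
# Rough cost impact of reasoning verbosity, using the article's figures:
# 200M output tokens for Step 3.5 Flash vs. a 17M-token median for
# comparable models, at $0.30 per 1M output tokens.
PRICE_PER_M_OUTPUT = 0.30  # USD per 1M output tokens

def eval_cost(tokens_millions: float) -> float:
    """Output-token cost in USD for a run of the given size."""
    return tokens_millions * PRICE_PER_M_OUTPUT

step_flash_cost = eval_cost(200)  # ~$60 for a full evaluation run
median_cost = eval_cost(17)       # ~$5 for a typical comparable model
print(round(step_flash_cost, 2), round(median_cost, 2))
```

At the same per-token price, the verbose model costs over ten times more per evaluation run, which is why provider pricing and throughput matter so much here.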
Step 3.5 Flash is now available across multiple API providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Provider | Why Notable | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | Best Use Case |
|---|---|---|---|---|---|---|
| DeepInfra | Industry-leading TTFT (~0.32s) with competitive pricing; JSON mode + function calling | $0.10 | $0.30 | ~0.32s | 77–88 | Real-time applications, conversational agents |
| SiliconFlow (FP8) | Highest raw throughput at 100.4 t/s for batch workloads | ~$0.15 blended | ~$0.15 blended | 2.17s | 100.4 | High-volume generation, batch processing |
| StepFun (first-party) | Primary reference API from the model creator; high throughput baseline | $0.10 | $0.30 | 3.19s | 95.2 | Batch workloads, non-interactive applications |
| OpenRouter | API aggregator routing across providers for maximum uptime and redundancy | $0.10 | $0.30 | Varies | Varies | Enterprise uptime requirements, API routing |
Based on benchmarks across tracked providers, DeepInfra is the recommended API for production-scale Step 3.5 Flash deployment. It offers an industry-leading TTFT of ~0.32 seconds (nearly 10x faster than StepFun's first-party API) while matching the competitive baseline pricing of $0.10 input / $0.30 output. For maximum raw throughput, SiliconFlow leads at 100.4 t/s. For enterprise uptime requirements, OpenRouter provides routing redundancy across providers.
DeepInfra stands out as the overall recommended provider for Step 3.5 Flash, striking the optimal balance between ultra-low latency, competitive pricing, and full feature support.
Reasoning models like Step 3.5 Flash require thinking time before outputting an answer, which inherently increases end-to-end response times. DeepInfra mitigates this with a TTFT of ~0.32 seconds — compared to the 2–3 second averages seen at other providers. Given the model’s verbose reasoning behavior, this latency advantage compounds significantly for interactive applications where users are waiting for the first token.
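TTFT is straightforward to measure yourself from any streaming response: time the gap between sending the request and the first chunk arriving. The helper below is provider-agnostic; the fake generator stands in for a real SSE stream from a provider SDK.

```python
# Minimal TTFT measurement over any iterable of streamed chunks.
# The fake_stream generator simulates a provider's streaming response;
# swap in your SDK's stream iterator for real measurements.
import time

def measure_ttft(stream):
    """Return (ttft_seconds, chunks) for an iterable of streamed chunks."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token
        chunks.append(chunk)
    return ttft, chunks

def fake_stream():
    time.sleep(0.05)  # simulated first-token delay
    yield "Hello"
    yield " world"

ttft, chunks = measure_ttft(fake_stream())
print(f"TTFT: {ttft:.3f}s over {len(chunks)} chunks")
```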
DeepInfra also matches the baseline competitive pricing of $0.10/$0.30 for input/output tokens while adding full JSON Mode and Function Calling support — making it the most cost-efficient and responsive choice for developers building real-time agentic applications.
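A function-calling request generally takes the OpenAI-style shape below. The tool name, schema, and model slug are made-up placeholders for illustration, not part of any provider's actual catalog.

```python
# Sketch of an OpenAI-style function-calling request body.
# "get_weather" is a hypothetical tool; the model slug is an assumption.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request_body = {
    "model": "stepfun/step-3.5-flash",  # hypothetical slug
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
```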
For workloads where raw output speed is prioritized over initial response time, SiliconFlow running FP8 quantization is the leading alternative.
At 100.4 tokens/sec, SiliconFlow surpasses the Step 3.5 Flash baseline average of 82.2 t/s. For workloads involving large-scale code generation, long-context reasoning tasks, or batch document processing where the 2.17-second initial latency is acceptable, SiliconFlow provides the highest throughput available. For conversational agents requiring immediate user feedback, the higher TTFT makes it a less optimal choice than DeepInfra.
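A simple way to reason about the TTFT-versus-throughput trade-off is the estimate total time ≈ TTFT + tokens / throughput. The sketch below uses the table's figures, taking the midpoint of DeepInfra's 77–88 t/s range as our own simplification:

```python
# Back-of-envelope end-to-end latency model: TTFT + generation time.
# DeepInfra: 0.32s TTFT at ~82.5 t/s (midpoint of 77-88, our assumption).
# SiliconFlow: 2.17s TTFT at 100.4 t/s.
def total_time(ttft_s: float, speed_tps: float, n_tokens: int) -> float:
    return ttft_s + n_tokens / speed_tps

for n in (100, 1000, 5000):
    deepinfra = total_time(0.32, 82.5, n)
    siliconflow = total_time(2.17, 100.4, n)
    winner = "DeepInfra" if deepinfra < siliconflow else "SiliconFlow"
    print(f"{n:>5} tokens: DeepInfra {deepinfra:6.2f}s  SiliconFlow {siliconflow:6.2f}s  -> {winner}")
```

By this rough model, SiliconFlow only pulls ahead once a response runs past roughly 850 output tokens, which matches the guidance above: batch and long-generation workloads favor SiliconFlow, short interactive turns favor DeepInfra.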
Using the model creator’s first-party API is a standard route for enterprise developers prioritizing reliability and direct vendor support.
The StepFun API offers solid throughput at 95.2 t/s and competitive pricing that matches DeepInfra. The primary drawback is latency: a TTFT of roughly 3.2 seconds means end-users will experience a noticeable delay before the model begins generating. For batch workloads or non-interactive applications, StepFun remains a solid choice as the authoritative first-party provider. For interactive applications, DeepInfra's roughly 10x latency advantage is decisive.
For enterprise applications with strict uptime requirements, OpenRouter serves as a routing layer rather than a standalone inference host.
OpenRouter does not host Step 3.5 Flash directly but routes API requests to the best available providers — including DeepInfra and StepFun — to maintain operational redundancy. It passes through the standard $0.10/$0.30 pricing structure while natively supporting the model’s full context window. For production environments where API redundancy is a strict requirement, OpenRouter is a practical choice.
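Provider preference can be expressed in the OpenRouter request itself. The sketch below reflects OpenRouter's provider-routing preferences as we understand them; treat the model slug and provider labels as assumptions to verify against OpenRouter's model page.

```python
# Sketch: preferring the lowest-TTFT host while keeping fallbacks enabled
# when routing through OpenRouter. Slug and provider labels are assumptions.
request_body = {
    "model": "stepfun/step-3.5-flash",  # hypothetical slug
    "messages": [{"role": "user", "content": "ping"}],
    "provider": {
        "order": ["DeepInfra", "StepFun"],  # try the fastest-TTFT host first
        "allow_fallbacks": True,            # fail over to other hosts for redundancy
    },
}
```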
**How large is Step 3.5 Flash's context window?**

Step 3.5 Flash features a 256k token context window, equivalent to processing approximately 384 standard A4 pages of text in a single prompt.
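The page estimate follows from common rules of thumb (roughly 0.75 English words per token and about 500 words per A4 page; both are approximations, and actual density varies with content):

```python
# Converting a 256k-token context window into an A4-page estimate,
# using approximate conversion factors (assumptions, not exact figures).
tokens = 256_000
words = tokens * 0.75   # ~0.75 English words per token
pages = words / 500     # ~500 words per A4 page
print(pages)  # 384.0
```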
**Why choose DeepInfra over StepFun's first-party API?**

While StepFun is the model creator, DeepInfra offers a significantly lower TTFT (~0.32 seconds vs. StepFun's 3.19 seconds) at the same price point, making it far better suited for real-time and conversational applications. DeepInfra also supports both JSON Mode and Function Calling.
**Does Step 3.5 Flash support image input?**

No. Step 3.5 Flash is a text-only model supporting text input and text output. It does not support image input or other multimodal capabilities.
**Can Step 3.5 Flash be used commercially?**

Step 3.5 Flash is released under the Apache 2.0 license, which permits commercial use and enables third-party hosting on platforms like DeepInfra.
**What architecture does Step 3.5 Flash use?**

Step 3.5 Flash uses a Mixture of Experts (MoE) architecture with 196 billion total parameters and approximately 11 billion active parameters per token during inference.
Step 3.5 Flash is a highly capable open-weights reasoning model that competes aggressively on both intelligence metrics and operational cost. Scoring 38 on the Artificial Analysis Intelligence Index — well above the open-weights median of 27 — it delivers enterprise-grade reasoning at a fraction of the cost of comparable closed-source models.
For the vast majority of Step 3.5 Flash deployments, DeepInfra is the clear overall recommendation. Its unmatched TTFT of ~0.32 seconds combined with competitive pricing ($0.10 input / $0.30 output) and full JSON Mode and Function Calling support makes it the optimal infrastructure for real-time agentic applications.
© 2026 Deep Infra. All rights reserved.