Step 3.5 Flash API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Step 3.5 Flash

Step 3.5 Flash is an open-weights reasoning model released in February 2026 by StepFun. It leverages a sparse Mixture of Experts (MoE) architecture with 196 billion total parameters and only 11 billion active parameters per token during inference — delivering state-of-the-art performance at a fraction of the cost of dense models.

Scoring 38 on the Artificial Analysis Intelligence Index — well above the comparable open-weights median of 27 — Step 3.5 Flash features a 256k token context window (roughly 384 A4 pages), extended chain-of-thought reasoning controllable via a reasoning_effort parameter, native tool calling with parallel function invocation, and JSON mode for structured output. The model is released under the Apache 2.0 license, enabling commercial use and third-party hosting on platforms like DeepInfra.
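For developers who want to try these features, here is a minimal sketch of a call through DeepInfra's OpenAI-compatible endpoint. The base URL is DeepInfra's documented OpenAI-compatible route; the model slug and whether the provider passes reasoning_effort through to the model are assumptions to verify against the provider's model page.

```python
# Minimal sketch: Step 3.5 Flash via an OpenAI-compatible endpoint.
# Assumptions: the model slug "stepfun-ai/Step-3.5-Flash" and the provider
# passing reasoning_effort through to the model -- verify both in the docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="YOUR_DEEPINFRA_API_KEY",
)

response = client.chat.completions.create(
    model="stepfun-ai/Step-3.5-Flash",  # assumed slug
    reasoning_effort="low",             # trims chain-of-thought for simple prompts
    messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
)
print(response.choices[0].message.content)
```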

It’s a highly verbose model during reasoning — generating an average of 200M tokens during intelligence evaluations versus a median of 17M for comparable models — which makes cost efficiency a critical factor when selecting an inference provider.
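To make that concrete, here is a quick back-of-the-envelope sketch using the per-token prices cited in this article. The token counts are illustrative, not measured; the point is that for a verbose reasoner, output tokens dominate spend.

```python
# Back-of-the-envelope cost sketch: why output-token price dominates for a
# verbose reasoning model. Prices are the $/1M-token figures from this article.
INPUT_PRICE = 0.10 / 1_000_000    # $ per input token
OUTPUT_PRICE = 0.30 / 1_000_000   # $ per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for a single request."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# A typical agentic call: modest prompt, long chain-of-thought plus answer
# (illustrative token counts, not benchmark data).
cost = request_cost(input_tokens=2_000, output_tokens=8_000)
print(f"${cost:.4f} per call")  # ~$0.0026, of which ~92% is output tokens
```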

Step 3.5 Flash is now available across multiple API providers, but they are not all created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

Step 3.5 Flash API Review Summary

  • DeepInfra is the overall recommended provider: lowest latency (~0.32s TTFT) and competitive pricing ($0.10 input / $0.30 output per 1M tokens), with full JSON Mode and Function Calling support.
  • High intelligence for its class: Artificial Analysis Intelligence Index score of 38 (vs. comparable open-weights median of 27).
  • Very fast generation: 82.2 output tokens/sec baseline (vs. comparable median of 52.8 t/s).
  • Aggressively low pricing: $0.10 / 1M input tokens (median $0.60) and $0.30 / 1M output tokens (median $2.20); blended (3:1) $0.15 / 1M.
  • Large context window: 256k tokens (~384 A4 pages).
  • Open weights + Apache 2.0 license: commercial use permitted; enables third-party hosting options in addition to the first-party StepFun API.
  • Key trade-off: Very verbose model behavior (200M output tokens in benchmarking vs. median 17M) means output costs accumulate quickly — provider pricing is especially important for this model.

Step 3.5 Flash — Best APIs

| Provider | Why Notable | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | Best Use Case |
|---|---|---|---|---|---|---|
| DeepInfra | Industry-leading TTFT (~0.32s) with competitive pricing; JSON mode + function calling | $0.10 | $0.30 | ~0.32s | 77–88 | Real-time applications, conversational agents |
| SiliconFlow (FP8) | Highest raw throughput at 100.4 t/s for batch workloads | ~$0.15 blended | ~$0.15 blended | 2.17s | 100.4 | High-volume generation, batch processing |
| StepFun (first-party) | Primary reference API from the model creator; high throughput baseline | $0.10 | $0.30 | 3.19s | 95.2 | Batch workloads, non-interactive applications |
| OpenRouter | API aggregator routing across providers for maximum uptime and redundancy | $0.10 | $0.30 | Varies | Varies | Enterprise uptime requirements, API routing |

Quick Verdict: Which Step 3.5 Flash Provider is Best?

Based on benchmarks across tracked providers, DeepInfra is the recommended API for production-scale Step 3.5 Flash deployment. It offers an industry-leading TTFT of ~0.32 seconds, nearly 10x faster than StepFun's first-party API, at the same competitive pricing ($0.10 input / $0.30 output). For maximum raw throughput, SiliconFlow leads at 100.4 t/s. For enterprise uptime requirements, OpenRouter provides routing redundancy across providers.

Overall Winner: DeepInfra

DeepInfra stands out as the overall recommended provider for Step 3.5 Flash, striking the optimal balance between ultra-low latency, competitive pricing, and full feature support.

  • Input Price: $0.10 per 1M tokens
  • Output Price: $0.30 per 1M tokens
  • Latency (TTFT): ~0.32 seconds (fastest in the benchmark)
  • Output Speed: 77–88 tokens/sec
  • Context Window: 262,144 tokens (256k)
  • API Features: JSON Mode + Function Calling

Reasoning models like Step 3.5 Flash require thinking time before outputting an answer, which inherently increases end-to-end response times. DeepInfra mitigates this with a TTFT of ~0.32 seconds — compared to the 2–3 second averages seen at other providers. Given the model’s verbose reasoning behavior, this latency advantage compounds significantly for interactive applications where users are waiting for the first token.
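If you want to verify TTFT against your own traffic, a simple approach is to stream a completion and time the first content chunk. A minimal sketch, assuming the same OpenAI-compatible endpoint and (assumed) model slug as above:

```python
# Rough TTFT measurement: stream the response and time the first content token.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai",
                api_key="YOUR_DEEPINFRA_API_KEY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="stepfun-ai/Step-3.5-Flash",  # assumed slug
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # First visible token; note that reasoning deltas may arrive before content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.2f}s")
        break
```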

DeepInfra also matches the baseline competitive pricing of $0.10/$0.30 for input/output tokens while adding full JSON Mode and Function Calling support — making it the most cost-efficient and responsive choice for developers building real-time agentic applications.
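Here is a hedged sketch of both features using standard OpenAI-style parameters (response_format for JSON Mode, tools for Function Calling). The model slug and the get_weather tool are illustrative assumptions, not part of any provider's catalog.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai",
                api_key="YOUR_DEEPINFRA_API_KEY")
MODEL = "stepfun-ai/Step-3.5-Flash"  # assumed slug

# JSON Mode: constrain the final answer to valid JSON.
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user",
               "content": "Return a JSON object with keys 'city' and 'country' for Oslo."}],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)

# Function Calling: declare a (hypothetical) tool and let the model request it.
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "What's the weather in Oslo right now?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)
print(resp.choices[0].message.tool_calls)
```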

Best for Raw Throughput: SiliconFlow (FP8)

For workloads where raw output speed is prioritized over initial response time, SiliconFlow running FP8 quantization is the leading alternative.

  • Blended Price: ~$0.15 per 1M tokens
  • Latency (TTFT): 2.17 seconds
  • Output Speed: 100.4 tokens/sec (fastest in the benchmark)

At 100.4 tokens/sec, SiliconFlow surpasses the Step 3.5 Flash baseline average of 82.2 t/s. For workloads involving large-scale code generation, long-context reasoning tasks, or batch document processing where the 2.17-second initial latency is acceptable, SiliconFlow provides the highest throughput available. For conversational agents requiring immediate user feedback, the higher TTFT makes it a less optimal choice than DeepInfra.
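As a sketch of what a throughput-bound batch workload looks like in practice, the snippet below fires requests concurrently and computes aggregate output tokens per second. The SiliconFlow base URL and model slug are assumptions; check the provider's documentation before use.

```python
# Batch-workload sketch: concurrency hides per-request TTFT, so aggregate
# tokens/sec is the metric that matters. Base URL and slug are assumptions.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://api.siliconflow.cn/v1",  # assumed
                     api_key="YOUR_SILICONFLOW_API_KEY")

async def one(prompt: str) -> int:
    resp = await client.chat.completions.create(
        model="stepfun-ai/Step-3.5-Flash",  # assumed slug
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.usage.completion_tokens  # output tokens for this request

async def main():
    prompts = [f"Summarize document {i}." for i in range(20)]
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one(p) for p in prompts))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens) / elapsed:.1f} aggregate output tokens/sec")

asyncio.run(main())
```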

First-Party Baseline: StepFun

Using the model creator’s first-party API is a standard route for enterprise developers prioritizing reliability and direct vendor support.

  • Input Price: $0.10 per 1M tokens
  • Output Price: $0.30 per 1M tokens
  • Latency (TTFT): 3.19–3.21 seconds
  • Output Speed: 95.2 tokens/sec

The StepFun API offers solid throughput at 95.2 t/s and competitive pricing that matches DeepInfra. The primary drawback is latency — a TTFT of over 3.2 seconds means end-users will experience a noticeable delay before the model begins generating. For batch workloads or non-interactive applications, StepFun remains a solid choice as the authoritative first-party provider. For interactive applications, DeepInfra’s 10x latency advantage is decisive.

Best for Uptime: OpenRouter

For enterprise applications with strict uptime requirements, OpenRouter serves as a routing layer rather than a standalone inference host.

  • Input Price: $0.10 per 1M tokens (pass-through)
  • Output Price: $0.30 per 1M tokens (pass-through)
  • Context Window: 262.1k tokens supported
  • Latency / Speed: Varies by routed provider

OpenRouter does not host Step 3.5 Flash directly but routes API requests to the best available providers — including DeepInfra and StepFun — to maintain operational redundancy. It passes through the standard $0.10/$0.30 pricing structure while natively supporting the model’s full context window. For production environments where API redundancy is a strict requirement, OpenRouter is a practical choice.
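Since OpenRouter exposes the same OpenAI-compatible interface, a minimal sketch of pinning the routing order looks like the snippet below. The endpoint is OpenRouter's standard one, but the model slug and the exact provider-preference payload are assumptions based on OpenRouter's provider-routing options; verify both in its docs.

```python
# Sketch of OpenRouter as a routing layer with explicit provider preferences.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key="YOUR_OPENROUTER_API_KEY")

resp = client.chat.completions.create(
    model="stepfun/step-3.5-flash",        # assumed OpenRouter slug
    messages=[{"role": "user", "content": "Ping."}],
    extra_body={"provider": {              # OpenRouter routing preferences
        "order": ["DeepInfra", "StepFun"], # try these hosts first
        "allow_fallbacks": True,           # fail over if a host is unavailable
    }},
)
print(resp.choices[0].message.content)
```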

Frequently Asked Questions

What is the context window for Step 3.5 Flash?

Step 3.5 Flash features a 256k token context window, equivalent to processing approximately 384 standard A4 pages of text in a single prompt.

Why is DeepInfra recommended over StepFun for Step 3.5 Flash?

While StepFun is the model creator, DeepInfra offers a significantly lower TTFT (~0.32 seconds vs. StepFun’s 3.19 seconds) at the same price point, making it far better suited for real-time and conversational applications. DeepInfra also supports both JSON Mode and Function Calling.

Is Step 3.5 Flash multimodal?

No. Step 3.5 Flash is a text-only model supporting text input and text output. It does not support image input or other multimodal capabilities.

What license is Step 3.5 Flash released under?

Step 3.5 Flash is released under the Apache 2.0 license, which permits commercial use and enables third-party hosting on platforms like DeepInfra.

How many parameters does Step 3.5 Flash have?

Step 3.5 Flash uses a Mixture of Experts (MoE) architecture with 196 billion total parameters and approximately 11 billion active parameters per token during inference.

Conclusion

Step 3.5 Flash is a highly capable open-weights reasoning model that competes aggressively on both intelligence metrics and operational cost. Scoring 38 on the Artificial Analysis Intelligence Index — well above the open-weights median of 27 — it delivers enterprise-grade reasoning at a fraction of the cost of comparable closed-source models.

For the vast majority of Step 3.5 Flash deployments, DeepInfra is the clear overall recommendation. Its unmatched TTFT of ~0.32 seconds combined with competitive pricing ($0.10 input / $0.30 output) and full JSON Mode and Function Calling support makes it the optimal infrastructure for real-time agentic applications.

  • Choose DeepInfra for the best overall value — lowest latency, competitive pricing, and full feature support.
  • Choose SiliconFlow for high-volume batch processing where maximum throughput (100.4 t/s) is the priority.
  • Choose StepFun for a reliable first-party baseline or batch workloads where latency is not a concern.
  • Choose OpenRouter for enterprise environments requiring API routing and uptime redundancy.