
Step 3.5 Flash is an open-weights reasoning model released in February 2026 by StepFun. It leverages a sparse Mixture of Experts (MoE) architecture with 196 billion total parameters and only 11 billion active parameters per token during inference — delivering state-of-the-art performance at a fraction of the cost of dense models.
Scoring 38 on the Artificial Analysis Intelligence Index — well above the comparable open-weights median of 27 — Step 3.5 Flash features a 256k token context window (roughly 384 A4 pages), extended chain-of-thought reasoning controllable via a reasoning_effort parameter, native tool calling with parallel function invocation, and JSON mode for structured output. The model is released under the Apache 2.0 license, enabling commercial use and third-party hosting on platforms like DeepInfra.
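In practice, these features are typically exposed through an OpenAI-compatible chat completions request. The sketch below shows the general shape of such a request body; the model slug and the exact spelling of the `reasoning_effort` field are assumptions for illustration, not verified API details.

```python
# Sketch of an OpenAI-compatible request exercising the features above:
# reasoning effort control and JSON mode. The model slug is hypothetical;
# check your provider's model page for the real identifier.
import json

payload = {
    "model": "stepfun/step-3.5-flash",  # hypothetical slug
    "messages": [
        {"role": "user", "content": "Summarize the MoE trade-off in one sentence."}
    ],
    "reasoning_effort": "medium",       # controls chain-of-thought length (per the article)
    "response_format": {"type": "json_object"},  # JSON mode for structured output
}

body = json.dumps(payload)  # ready to POST to a chat completions endpoint
```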
It’s a highly verbose model during reasoning — generating an average of 200M tokens during intelligence evaluations versus a median of 17M for comparable models — which makes cost efficiency a critical factor when selecting an inference provider.
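To see why verbosity dominates the bill, here is a back-of-envelope comparison using the token counts above and the $0.30 per 1M output tokens rate quoted in the pricing table:

```python
# Rough cost impact of reasoning verbosity, using the article's figures:
# 200M output tokens for Step 3.5 Flash vs. a 17M-token median for
# comparable models, at $0.30 per 1M output tokens.
PRICE_PER_M_OUTPUT = 0.30  # USD per 1M output tokens

def eval_cost(tokens_millions: float) -> float:
    """Output-token cost in USD for a run of the given size."""
    return tokens_millions * PRICE_PER_M_OUTPUT

step_flash_cost = eval_cost(200)  # ~$60 for a full evaluation run
median_cost = eval_cost(17)       # ~$5 for a typical comparable model
print(round(step_flash_cost, 2), round(median_cost, 2))
```

At the same per-token price, the verbose model costs over ten times more per evaluation run, which is why provider pricing and throughput matter so much here.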
Step 3.5 Flash is now available across multiple API providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Provider | Why Notable | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | Best Use Case |
|---|---|---|---|---|---|---|
| DeepInfra | Industry-leading TTFT (~0.32s) with competitive pricing; JSON mode + function calling | $0.10 | $0.30 | ~0.32s | 77–88 | Real-time applications, conversational agents |
| SiliconFlow (FP8) | Highest raw throughput at 100.4 t/s for batch workloads | ~$0.15 blended | ~$0.15 blended | 2.17s | 100.4 | High-volume generation, batch processing |
| StepFun (first-party) | Primary reference API from the model creator; high throughput baseline | $0.10 | $0.30 | 3.19s | 95.2 | Batch workloads, non-interactive applications |
| OpenRouter | API aggregator routing across providers for maximum uptime and redundancy | $0.10 | $0.30 | Varies | Varies | Enterprise uptime requirements, API routing |
Based on benchmarks across tracked providers, DeepInfra is the recommended API for production-scale Step 3.5 Flash deployment. It offers an industry-leading TTFT of ~0.32 seconds (nearly 10x faster than StepFun's first-party API) while matching the competitive baseline pricing of $0.10 input / $0.30 output. For maximum raw throughput, SiliconFlow leads at 100.4 t/s. For enterprise uptime requirements, OpenRouter provides routing redundancy across providers.
DeepInfra stands out as the overall recommended provider for Step 3.5 Flash, striking the optimal balance between ultra-low latency, competitive pricing, and full feature support.
Reasoning models like Step 3.5 Flash require thinking time before outputting an answer, which inherently increases end-to-end response times. DeepInfra mitigates this with a TTFT of ~0.32 seconds — compared to the 2–3 second averages seen at other providers. Given the model’s verbose reasoning behavior, this latency advantage compounds significantly for interactive applications where users are waiting for the first token.
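TTFT is straightforward to measure yourself from any streaming response: time the gap between sending the request and the first chunk arriving. The helper below is provider-agnostic; the fake generator stands in for a real SSE stream from a provider SDK.

```python
# Minimal TTFT measurement over any iterable of streamed chunks.
# The fake_stream generator simulates a provider's streaming response;
# swap in your SDK's stream iterator for real measurements.
import time

def measure_ttft(stream):
    """Return (ttft_seconds, chunks) for an iterable of streamed chunks."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token
        chunks.append(chunk)
    return ttft, chunks

def fake_stream():
    time.sleep(0.05)  # simulated first-token delay
    yield "Hello"
    yield " world"

ttft, chunks = measure_ttft(fake_stream())
print(f"TTFT: {ttft:.3f}s over {len(chunks)} chunks")
```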
DeepInfra also matches the baseline competitive pricing of $0.10/$0.30 for input/output tokens while adding full JSON Mode and Function Calling support — making it the most cost-efficient and responsive choice for developers building real-time agentic applications.
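A function-calling request generally takes the OpenAI-style shape below. The tool name, schema, and model slug are made-up placeholders for illustration, not part of any provider's actual catalog.

```python
# Sketch of an OpenAI-style function-calling request body.
# "get_weather" is a hypothetical tool; the model slug is an assumption.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request_body = {
    "model": "stepfun/step-3.5-flash",  # hypothetical slug
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
```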
For workloads where raw output speed is prioritized over initial response time, SiliconFlow running FP8 quantization is the leading alternative.
At 100.4 tokens/sec, SiliconFlow surpasses the Step 3.5 Flash baseline average of 82.2 t/s. For workloads involving large-scale code generation, long-context reasoning tasks, or batch document processing where the 2.17-second initial latency is acceptable, SiliconFlow provides the highest throughput available. For conversational agents requiring immediate user feedback, the higher TTFT makes it a less optimal choice than DeepInfra.
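A simple way to reason about the TTFT-versus-throughput trade-off is the estimate total time ≈ TTFT + tokens / throughput. The sketch below uses the table's figures, taking the midpoint of DeepInfra's 77–88 t/s range as our own simplification:

```python
# Back-of-envelope end-to-end latency model: TTFT + generation time.
# DeepInfra: 0.32s TTFT at ~82.5 t/s (midpoint of 77-88, our assumption).
# SiliconFlow: 2.17s TTFT at 100.4 t/s.
def total_time(ttft_s: float, speed_tps: float, n_tokens: int) -> float:
    return ttft_s + n_tokens / speed_tps

for n in (100, 1000, 5000):
    deepinfra = total_time(0.32, 82.5, n)
    siliconflow = total_time(2.17, 100.4, n)
    winner = "DeepInfra" if deepinfra < siliconflow else "SiliconFlow"
    print(f"{n:>5} tokens: DeepInfra {deepinfra:6.2f}s  SiliconFlow {siliconflow:6.2f}s  -> {winner}")
```

By this rough model, SiliconFlow only pulls ahead once a response runs past roughly 850 output tokens, which matches the guidance above: batch and long-generation workloads favor SiliconFlow, short interactive turns favor DeepInfra.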
Using the model creator’s first-party API is a standard route for enterprise developers prioritizing reliability and direct vendor support.
The StepFun API offers solid throughput at 95.2 t/s and competitive pricing that matches DeepInfra. The primary drawback is latency: a TTFT of roughly 3.2 seconds means end-users will experience a noticeable delay before the model begins generating. For batch workloads or non-interactive applications, StepFun remains a solid choice as the authoritative first-party provider. For interactive applications, DeepInfra's roughly 10x latency advantage is decisive.
For enterprise applications with strict uptime requirements, OpenRouter serves as a routing layer rather than a standalone inference host.
OpenRouter does not host Step 3.5 Flash directly but routes API requests to the best available providers — including DeepInfra and StepFun — to maintain operational redundancy. It passes through the standard $0.10/$0.30 pricing structure while natively supporting the model’s full context window. For production environments where API redundancy is a strict requirement, OpenRouter is a practical choice.
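Provider preference can be expressed in the OpenRouter request itself. The sketch below reflects OpenRouter's provider-routing preferences as we understand them; treat the model slug and provider labels as assumptions to verify against OpenRouter's model page.

```python
# Sketch: preferring the lowest-TTFT host while keeping fallbacks enabled
# when routing through OpenRouter. Slug and provider labels are assumptions.
request_body = {
    "model": "stepfun/step-3.5-flash",  # hypothetical slug
    "messages": [{"role": "user", "content": "ping"}],
    "provider": {
        "order": ["DeepInfra", "StepFun"],  # try the fastest-TTFT host first
        "allow_fallbacks": True,            # fail over to other hosts for redundancy
    },
}
```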
**How large is Step 3.5 Flash's context window?**

Step 3.5 Flash features a 256k token context window, equivalent to processing approximately 384 standard A4 pages of text in a single prompt.
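The page estimate follows from common rules of thumb (roughly 0.75 English words per token and about 500 words per A4 page; both are approximations, and actual density varies with content):

```python
# Converting a 256k-token context window into an A4-page estimate,
# using approximate conversion factors (assumptions, not exact figures).
tokens = 256_000
words = tokens * 0.75   # ~0.75 English words per token
pages = words / 500     # ~500 words per A4 page
print(pages)  # 384.0
```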
**Why choose DeepInfra over StepFun's first-party API?**

While StepFun is the model creator, DeepInfra offers a significantly lower TTFT (~0.32 seconds vs. StepFun's 3.19 seconds) at the same price point, making it far better suited for real-time and conversational applications. DeepInfra also supports both JSON Mode and Function Calling.
**Does Step 3.5 Flash support image input?**

No. Step 3.5 Flash is a text-only model supporting text input and text output. It does not support image input or other multimodal capabilities.
**Can Step 3.5 Flash be used commercially?**

Step 3.5 Flash is released under the Apache 2.0 license, which permits commercial use and enables third-party hosting on platforms like DeepInfra.
**What architecture does Step 3.5 Flash use?**

Step 3.5 Flash uses a Mixture of Experts (MoE) architecture with 196 billion total parameters and approximately 11 billion active parameters per token during inference.
Step 3.5 Flash is a highly capable open-weights reasoning model that competes aggressively on both intelligence metrics and operational cost. Scoring 38 on the Artificial Analysis Intelligence Index — well above the open-weights median of 27 — it delivers enterprise-grade reasoning at a fraction of the cost of comparable closed-source models.
For the vast majority of Step 3.5 Flash deployments, DeepInfra is the clear overall recommendation. Its unmatched TTFT of ~0.32 seconds combined with competitive pricing ($0.10 input / $0.30 output) and full JSON Mode and Function Calling support makes it the optimal infrastructure for real-time agentic applications.
© 2026 Deep Infra. All rights reserved.