
Qwen3.5 9B is the flagship of Alibaba’s Qwen3.5 Small Model Series, released on March 2, 2026. It is a dense multimodal model combining Gated Delta Networks (a form of linear attention) with a sparse Mixture-of-Experts system, enabling higher throughput and lower latency during inference compared to traditional dense architectures. The architecture utilizes a 3:1 ratio of linear attention to full attention, maintaining a 262,144-token context window while remaining efficient enough to run on standard hardware.
Unlike previous generations that added vision capabilities post-hoc, Qwen3.5 9B was trained using early fusion on multimodal tokens, allowing the model to process visual and textual tokens within the same latent space from the start of training. This results in better spatial reasoning, improved OCR accuracy, and more cohesive visual-grounded responses. The model’s performance is largely attributed to Scaled Reinforcement Learning, which optimizes for correct reasoning paths rather than mimicking high-quality text — producing improved instruction following, fewer hallucinations, and higher reliability in fact-retrieval and mathematical reasoning.
Qwen3.5 9B is released under the Apache 2.0 license, enabling commercial use and fine-tuning. It is now offered by multiple API providers. This analysis compares the two tracked here, DeepInfra and Together.ai, on performance, cost, and response time so you can pick the best fit for your use case.
| Provider | Quant. | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Speed (t/s) | TTFT (s) | E2E (s) | Context | Why Notable |
|---|---|---|---|---|---|---|---|---|---|
| DeepInfra | FP8 | $0.08 | $0.04 | $0.20 | 205.7 | 1.04 | 13.19 / 9.72 | 262k | Best throughput and blended cost; best for long inputs and fastest generation |
| Together.ai | FP8 | $0.11 | $0.10 | $0.15 | 92.3 | 0.75 | 27.84 / 21.67 | 262k | Best TTFT latency; slower throughput and higher blended cost |
Based on benchmarks across 2 tracked providers, DeepInfra is the recommended API for production-scale Qwen3.5 9B deployment. It delivers 2.2x faster output speed, the lowest blended price ($0.08/1M), and resolves tasks in less than half the end-to-end time of Together.ai. Together.ai remains a viable alternative for highly interactive, conversational applications where sub-second TTFT (0.75s) is the primary requirement.
Output speed measures how quickly tokens are generated after the model begins its response — the primary metric for throughput-intensive tasks.
DeepInfra operates at approximately 2.2x the speed of Together.ai. For applications generating long-form content, analyzing large datasets, or requiring rapid data extraction, this throughput advantage translates directly into reduced wait times. The gap is large enough to be decisive for any workload where generation volume is the primary bottleneck.
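To make the throughput gap concrete, here is a minimal sketch that converts the table's output speeds into pure generation time for a few response lengths. The response lengths are illustrative assumptions, and the calculation ignores TTFT and any reasoning time.

```python
# Output speeds (tokens/second) taken from the comparison table.
SPEED_TPS = {"DeepInfra": 205.7, "Together.ai": 92.3}


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent streaming `output_tokens` once generation has started."""
    return output_tokens / tokens_per_second


# Ratio of the two output speeds.
speed_ratio = SPEED_TPS["DeepInfra"] / SPEED_TPS["Together.ai"]
print(f"Speed ratio: {speed_ratio:.1f}x")

# Hypothetical response lengths, chosen only for illustration.
for n_tokens in (500, 2_000, 8_000):
    for provider, tps in SPEED_TPS.items():
        seconds = generation_seconds(n_tokens, tps)
        print(f"{provider:12s} {n_tokens:>5d} tokens: {seconds:6.2f}s")
```

Because generation time scales linearly with output length, the absolute difference between the two providers widens as responses get longer.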
Time to first token (TTFT) measures the initial responsiveness of an application. For reasoning models like Qwen3.5 9B, this includes the model’s internal thinking time before outputting the first user-facing answer token.
Together.ai wins the latency category with a sub-second TTFT of 0.75s. For highly interactive applications such as real-time chatbots or voice-to-text assistants, this edge creates a snappier perceived experience. DeepInfra’s 1.04s is still fast, and most users will not notice the difference in practice, but the 290ms gap is measurable and relevant for latency-critical applications.
Pricing is evaluated per 1 million tokens, with the blended rate assuming a standard 3:1 input-to-output ratio.
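The blended rate can be reproduced from the listed input and output prices. The sketch below assumes the blend is a simple weighted average at the stated 3:1 ratio; the function name and defaults are illustrative, not part of any provider's API.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: int = 3, output_parts: int = 1) -> float:
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# (input $/1M, output $/1M) as listed in the comparison table.
PRICES = {"DeepInfra": (0.04, 0.20), "Together.ai": (0.10, 0.15)}

for provider, (inp, out) in PRICES.items():
    print(f"{provider}: ${blended_price(inp, out):.4f} per 1M tokens (3:1 blend)")
```

Under this assumption DeepInfra's blend works out to $0.08 and Together.ai's to $0.1125, which matches the table's $0.11 after rounding.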
Because most reasoning and RAG workloads are heavily weighted toward input tokens (large system prompts, document context, retrieval results), DeepInfra’s aggressively priced input tier ($0.04/1M) makes it the more cost-effective choice for the vast majority of real-world usage patterns. Together.ai’s cheaper output pricing ($0.15 vs $0.20) only becomes advantageous for workloads with very short inputs and very long outputs — a less common pattern for reasoning models.
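How short the input and how long the output must be can be worked out from the listed prices. The following is a rough sketch, with hypothetical token counts, that compares per-request cost and derives the output-to-input ratio at which the two providers cost the same.

```python
# (input $/1M, output $/1M) as listed in the comparison table.
PRICES = {"DeepInfra": (0.04, 0.20), "Together.ai": (0.10, 0.15)}


def request_cost(provider: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token prices."""
    input_price, output_price = PRICES[provider]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def breakeven_output_per_input() -> float:
    """Output/input token ratio at which both providers cost the same.

    Solves 0.04*i + 0.20*o == 0.10*i + 0.15*o for o/i.
    """
    di_in, di_out = PRICES["DeepInfra"]
    tg_in, tg_out = PRICES["Together.ai"]
    return (tg_in - di_in) / (di_out - tg_out)


print(f"Break-even output/input ratio: {breakeven_output_per_input():.2f}")

# Hypothetical workloads: an input-heavy RAG call and an output-heavy call.
for label, i_tok, o_tok in (("RAG-style", 10_000, 500), ("output-heavy", 1_000, 2_000)):
    for provider in PRICES:
        print(f"{label:12s} {provider:12s} ${request_cost(provider, i_tok, o_tok):.6f}")
```

At these prices the break-even sits at roughly 1.2 output tokens per input token, so Together.ai is only cheaper when a request generates more tokens than it consumes by about that margin.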
End-to-end response time combines initial latency, reasoning time, and output speed to measure the complete lifecycle of a request — specifically, how long it takes to deliver a 500-token response from a 10,000 input token prompt.
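As a quick arithmetic check on the table, the sketch below subtracts TTFT and the time needed to stream 500 tokens from the first E2E figure. The residual lines up with the second figure in the E2E column. Reading that second figure as the reasoning-time component is an assumption, since the table does not label it.

```python
OUTPUT_TOKENS = 500  # response length used in the end-to-end measurement

# Figures copied from the comparison table. "second_figure" is the unlabeled
# value after the slash in the E2E column; treating it as reasoning time is
# an assumption, not something the source states.
TABLE = {
    "DeepInfra": {"ttft": 1.04, "speed": 205.7, "e2e": 13.19, "second_figure": 9.72},
    "Together.ai": {"ttft": 0.75, "speed": 92.3, "e2e": 27.84, "second_figure": 21.67},
}


def residual_seconds(ttft: float, speed: float, e2e: float,
                     output_tokens: int = OUTPUT_TOKENS) -> float:
    """E2E time left over after removing TTFT and output streaming time."""
    return e2e - ttft - output_tokens / speed


for provider, row in TABLE.items():
    residual = residual_seconds(row["ttft"], row["speed"], row["e2e"])
    print(f"{provider}: residual {residual:.2f}s vs. table {row['second_figure']:.2f}s")
```

If that reading holds, most of the end-to-end time for both providers is spent before the answer tokens stream, which is why output speed has such a large effect on total completion time for a reasoning model.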
DeepInfra resolves tasks in less than half the time of Together.ai. Despite Together.ai’s slight TTFT advantage, DeepInfra’s 2.2x throughput lead entirely eclipses that edge when measuring total task completion time. For any workload beyond a single short exchange, DeepInfra delivers a substantially faster experience end-to-end.
Both providers support the full 262,144-token (262k) context window natively available to Qwen3.5 9B, and both fully support Function (Tool) Calling. This means provider selection can rest entirely on performance and pricing metrics — neither provider imposes a technical ceiling on what you can build.
For the vast majority of Qwen3.5 9B deployments, DeepInfra is the recommended provider. With 205.7 t/s output speed, an end-to-end response time of 13.19s, and the lowest blended price of the two tracked providers at $0.08 per million tokens, it offers the stronger combination of speed and cost-effectiveness.