DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Qwen3.5 27B is part of Alibaba Cloud’s latest-generation foundation model family, released in February 2026. Unlike the Mixture-of-Experts variants in the Qwen3.5 series, the 27B model uses a dense architecture combining Gated Delta Networks and Feed Forward Networks. It achieves strong benchmark scores including MMLU-Pro (86.1%), GPQA Diamond (85.5%), and SWE-bench Verified (72.4%).
The model features a 262,144-token native context window (extensible to 1M via YaRN), support for 201 languages, both thinking and non-thinking modes, tool calling, and multimodal input processing through early fusion training. It is released under the Apache 2.0 license, enabling commercial use and third-party hosting.
Qwen3.5 27B is now available across multiple inference providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Provider | Speed (t/s) | Latency (TTFT) | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Context | Func | JSON | Positioning |
|---|---|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | 153.3 | 0.91s | $0.84 | $0.26 | — | 262k | Yes | No | Best overall performance — fastest speed + lowest latency |
| Alibaba Cloud | 86.3 | 5.57s | $0.82 | $0.30 | $2.40 | 262k | Yes | Yes | Lowest blended price (tied); slower speed and higher TTFT |
| Novita | 66.7 | 5.08s | $0.82 | $0.30 | $2.40 | 262k | Yes | Yes | Low blended price (tied); mid-tier speed; ~5s TTFT |
| GMI (FP8) | 57.9 | 5.51s | $0.82 | — | $2.40 | 262k | Yes | Yes | Low blended price (tied); slowest output speed in benchmark |
Based on benchmarks across 4 tracked providers, DeepInfra (FP8) is the recommended API for production-scale Qwen3.5 27B deployment. It delivers the fastest output speed (153.3 t/s), the lowest latency (0.91s TTFT), and the lowest input token price ($0.26/1M) — all at a blended cost of just $0.84/1M, only 2.4% above the market floor. For teams requiring JSON mode, Alibaba Cloud or Novita are the recommended alternatives at identical pricing.
DeepInfra is the clear performance leader for Qwen3.5 27B, delivering industry-leading speed and latency at a near-floor price point.
DeepInfra’s FP8 quantization delivers a sub-second TTFT (0.91s) that is 5-6x lower than competitors hovering around 5+ seconds — a decisive advantage for interactive applications where user experience depends on perceived responsiveness. The platform uses Multi-Token Prediction and Eagle speculative decoding to accelerate generation throughput, providing an OpenAI-compatible API for straightforward migration.
The one trade-off is the absence of JSON mode. Developers requiring deterministic structured outputs should use Alibaba Cloud or Novita, or rely on prompt engineering to enforce JSON structure when using DeepInfra.
As the creator of the Qwen model family, Alibaba Cloud offers native hosting with full feature support at the lowest blended price in the benchmark.
Alibaba Cloud delivers the second-fastest throughput (86.3 t/s) among the four providers and full JSON mode support, making it the best cost-optimized option for structured output workloads. Its TTFT of 5.57s reflects the trade-off between cost optimisation and raw speed — acceptable for batch processing but not for real-time interactive applications.
Novita offers the lowest blended price (tied) with full feature support, making it a solid option for cost-sensitive deployments that can tolerate moderate latency.
Novita matches Alibaba Cloud on price and features, with slightly better TTFT (5.08s vs 5.57s) but lower throughput (66.7 t/s vs 86.3 t/s). It is a viable choice for teams seeking the lowest blended cost with full feature support who are running non-interactive, batch-oriented workloads.
GMI offers FP8 quantization at the market floor price, but its performance metrics trail the other three providers significantly.
GMI delivers the lowest throughput in the benchmark (57.9 t/s) and high latency (5.51s TTFT) at the same $0.82/1M floor price as Alibaba Cloud and Novita. It is difficult to recommend over those alternatives at the same price point, unless GMI offers a specific regional availability or redundancy benefit for a particular deployment.
For most production deployments of Qwen3.5 27B (Reasoning), DeepInfra (FP8) is the recommended provider. Its combination of industry-leading speed (153.3 t/s), sub-second latency (0.91s TTFT), and lowest input token pricing ($0.26/1M) delivers the strongest overall value proposition — at a blended cost only 2.4% above the market floor.
Nemotron 3 Nano vs GPT-OSS-20B: Performance, Benchmarks & DeepInfra Results<p>The open-source LLM landscape is becoming increasingly diverse, with models optimized for reasoning, throughput, cost-efficiency, and real-world agentic applications. Two models that stand out in this new generation are NVIDIA’s Nemotron 3 Nano and OpenAI’s GPT-OSS-20B, both of which offer strong performance while remaining openly available and deployable across cloud and edge systems. Although both […]</p>
Best API for Kimi K2.5: Why DeepInfra Leads in Speed, TTFT, and Scalability<p>Kimi K2.5 is positioned as Moonshot AI’s “do-it-all” model for modern product workflows: native multimodality (text + vision/video), Instant vs. Thinking modes, and support for agentic / multi-agent (“swarm”) execution patterns. In real applications, though, model capability is only half the story. The provider’s inference stack determines the things your users actually feel: time-to-first-token (TTFT), […]</p>
Langchain improvements: async and streamingStarting from langchain
v0.0.322 you
can make efficient async generation and streaming tokens with deepinfra.
Async generation
The deepinfra wrapper now supports native async calls, so you can expect more
performance (no more t...© 2026 DeepInfra. All rights reserved.