Nemotron 3 Nano Omni — the first multimodal model in the Nemotron 3 family, now on DeepInfra!
Qwen3.5 35B A3B API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra
About Qwen3.5 35B A3B: Qwen3.5 35B A3B is a native vision-language model released by Alibaba Cloud in February 2026. It uses a hybrid architecture that integrates Gated Delta Networks with a sparse Mixture-of-Experts (MoE) design, achieving higher inference efficiency. It has 35 billion total parameters with only 3 billion activated per token through 256 experts (8 routed […]
Qwen3.5 122B A10B API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra
About Qwen3.5 122B A10B: Qwen3.5 122B A10B is Alibaba Cloud's mid-tier multimodal foundation model, released in February 2026. It is a vision-language Mixture-of-Experts model supporting text, image, and video inputs, designed for native multimodal agent applications. It features 122 billion total parameters with 10 billion activated per token through a hybrid architecture that integrates […]
Qwen3.5 397B A17B API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra
About Qwen3.5 397B A17B: Qwen3.5 397B A17B is Alibaba Cloud's largest and most capable multimodal foundation model, released in February 2026. It features a hybrid Mixture-of-Experts (MoE) architecture with 397 billion total parameters and 17 billion active parameters per inference pass, using 512 experts with a routing mechanism that selects a subset per token. This sparse […]
Step 3.5 Flash API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra
About Step 3.5 Flash: Step 3.5 Flash is an open-weights reasoning model released in February 2026 by StepFun. It leverages a sparse Mixture-of-Experts (MoE) architecture with 196 billion total parameters and only 11 billion active parameters per token during inference, delivering state-of-the-art performance at a fraction of the cost of dense models. […]
NVIDIA Nemotron 3 Super 120B API Benchmarks: Latency & Cost
Published on 2026.04.03 by DeepInfra
About NVIDIA Nemotron 3 Super 120B A12B: NVIDIA's Nemotron 3 Super 120B A12B is an open-weight large language model released on March 11, 2026. It features 120B total parameters with only 12B active per forward pass, delivering exceptional compute efficiency for complex multi-agent applications such as software development and cybersecurity triaging. The model uses a […]
GLM-4.7-Flash API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra
About GLM-4.7-Flash: GLM-4.7-Flash is Z.AI's open-weights reasoning model released in January 2026. Built on a Mixture-of-Experts (MoE) Transformer architecture, it features 30 billion total parameters with only ~3 billion active per inference, making it exceptionally efficient for its capability class. The model is designed as a lightweight, cost-effective alternative to Z.AI's flagship GLM-4.7, optimized […]
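Every model above is a sparse MoE, so the figure that drives per-token inference cost is the ratio of active to total parameters. As a rough illustration (not DeepInfra benchmark code), the following sketch computes that ratio from the total/active figures quoted in each teaser:

```python
# Active-parameter ratios for the sparse MoE models listed above.
# Figures (total B, active B) are taken from the teaser text; this is
# illustrative arithmetic only, not a latency or cost benchmark.
models = {
    "Qwen3.5 35B A3B": (35, 3),
    "Qwen3.5 122B A10B": (122, 10),
    "Qwen3.5 397B A17B": (397, 17),
    "Step 3.5 Flash": (196, 11),
    "NVIDIA Nemotron 3 Super 120B A12B": (120, 12),
    "GLM-4.7-Flash": (30, 3),
}

for name, (total_b, active_b) in models.items():
    ratio = active_b / total_b
    print(f"{name}: {active_b}B of {total_b}B active ({ratio:.1%} per token)")
```

The ratios cluster between roughly 4% and 10%, which is why these models can approach dense-model quality while doing only a small fraction of a dense model's per-token compute.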
© 2026 Deep Infra. All rights reserved.