
GLM-4.7-Flash API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About GLM-4.7-Flash

GLM-4.7-Flash is Z.AI’s open-weights reasoning model released in January 2026. Built on a Mixture-of-Experts (MoE) Transformer architecture, it features 30 billion total parameters with only ~3 billion active per inference — making it exceptionally efficient for its capability class. The model is designed as a lightweight, cost-effective alternative to Z.AI’s flagship GLM-4.7, optimized for coding, agentic workflows, and multi-step reasoning tasks.

GLM-4.7-Flash supports up to 200K context tokens and achieves state-of-the-art performance among open-source models in its size category on benchmarks like SWE-Bench and GPQA. Its efficient architecture enables deployment on consumer hardware while still delivering competitive performance against larger proprietary models.

GLM-4.7-Flash is now available across multiple inference providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

GLM-4.7-Flash (Reasoning) API Review Summary

  • DeepInfra is the overall value leader: lowest latency (0.75s TTFT) and lowest blended price ($0.14 / 1M tokens) among tracked providers.
  • DeepInfra has the lowest input token price: $0.06 / 1M input tokens (vs $0.07 for Amazon Bedrock and Novita).
  • DeepInfra supports key API features: Function Calling and JSON Mode — only 2 of 3 tracked providers support JSON mode.
  • DeepInfra offers the largest context window: 203k tokens (vs 200k for Amazon Bedrock and Novita).
  • Amazon Bedrock leads on throughput: 228.6 tokens/sec output speed, though at a higher blended price than DeepInfra.

GLM-4.7-Flash (Reasoning) — Best APIs

| Provider | Why Notable | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | Context | JSON | Func | E2E (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepInfra | Best overall: lowest cost + lowest latency; JSON mode supported | $0.14 | $0.06 | $0.40 | 0.75s | 74.6 | 203k | Yes | Yes | 34.28 / 26.82 |
| Amazon Bedrock | Best for throughput-intensive workloads; fastest generation speed | $0.15 | $0.07 | $0.40 | 0.90s | 228.6 | 200k | No | Yes | 11.83 / 8.75 |
| Novita | JSON mode + function calling; not recommended due to high latency | $0.15 | $0.07 | $0.40 | 9.49s | 50.1 | 200k | Yes | Yes | 59.35 / 39.90 |

Quick Verdict: Which GLM-4.7-Flash Provider is Best?

Based on benchmarks across tracked providers, DeepInfra is the recommended API for production-scale GLM-4.7-Flash deployment. It offers the lowest latency (0.75s TTFT), the lowest blended cost ($0.14/1M tokens), and full support for both JSON Mode and Function Calling. For use cases requiring maximum throughput, Amazon Bedrock leads at 228.6 t/s.

Overall Winner: DeepInfra

DeepInfra holds the top spot for developers prioritizing interactivity and cost-efficiency. In the context of reasoning models — which inherently require thinking time — minimizing network and processing latency is critical. DeepInfra excels here, offering the snappiest response times in the benchmark.

  • Time to First Token (TTFT): 0.75s (#1 fastest)
  • Output Speed: 74.6 tokens/sec
  • Blended Price: $0.14 per 1M tokens
  • Context Window: 203k tokens
  • API Features: Function Calling, JSON Mode

With a TTFT of 0.75s, DeepInfra delivers the fastest first-token response of the three tracked providers, a decisive advantage for chatbots and interactive applications where user retention drops as wait times increase. It also offers the lowest total cost of ownership, with an input price of $0.06/1M tokens and a blended rate of $0.14.

Unlike Amazon Bedrock, DeepInfra supports JSON Mode, enabling reliable structured output for agentic workflows. Combined with Function Calling, this makes it a solid foundation for tool-using and data-extraction pipelines. While its throughput (74.6 t/s) trails Amazon's, it is fast enough for most reading-speed applications and well ahead of Novita.
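As a concrete illustration, the sketch below builds an OpenAI-style chat-completion payload with JSON Mode enabled. The base URL and the model identifier `zai-org/GLM-4.7-Flash` are assumptions for illustration; check the provider's model page for the exact values.

```python
import json

# Assumed values for illustration; verify against the provider's documentation.
BASE_URL = "https://api.deepinfra.com/v1/openai"  # OpenAI-compatible endpoint (assumed)
MODEL_ID = "zai-org/GLM-4.7-Flash"                # hypothetical model identifier

def build_chat_request(user_prompt: str) -> dict:
    """Build an OpenAI-style chat-completion payload with JSON Mode enabled."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": "Reply with a JSON object only."},
            {"role": "user", "content": user_prompt},
        ],
        # response_format is the standard OpenAI-style switch for JSON Mode
        "response_format": {"type": "json_object"},
        "max_tokens": 512,
    }

payload = build_chat_request("Extract the city and country from: 'Berlin, Germany'.")
print(json.dumps(payload, indent=2))
```

Sending it is then a plain POST to `{BASE_URL}/chat/completions` with a bearer token, for example via `requests` or the official `openai` client pointed at the custom base URL.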

Best for Throughput: Amazon Bedrock

Amazon Bedrock demonstrates superior raw computational power, making it the ideal choice for offline tasks where immediate latency is less critical than total completion time.

  • Time to First Token (TTFT): 0.90s (#2)
  • Output Speed: 228.6 tokens/sec (#1)
  • Blended Price: $0.15 per 1M tokens
  • Context Window: 200k tokens
  • API Features: Function Calling (no JSON Mode)

At 228.6 tokens per second, Bedrock is 3x faster than DeepInfra and 4.5x faster than Novita. Its end-to-end response time for a 500-token output is just 11.83 seconds. For applications requiring bulk text generation where the user is not waiting for the first word to appear — such as background report generation — Amazon’s infrastructure is unmatched.

The key trade-off is the lack of native JSON Mode support. This requires additional prompt engineering and increases the risk of malformed outputs in structured workflows, making it less ideal for complex agentic integrations despite the raw speed advantage.
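When a provider lacks native JSON Mode, the usual workaround is to request JSON in the prompt and then validate the reply before using it, retrying on failure. A minimal, provider-agnostic validation helper (the function name is my own, not from any SDK):

```python
import json

def extract_json_object(reply: str) -> dict:
    """Pull the first top-level JSON object out of a model reply.

    Models without JSON Mode often wrap the object in prose or code fences,
    so we scan for the outermost {...} span before parsing.
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])

# A caller would loop: request -> extract_json_object -> on error, retry with feedback.
clean = extract_json_object('Sure! Here you go:\n```json\n{"city": "Berlin"}\n```')
print(clean)  # → {'city': 'Berlin'}
```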

Not Recommended: Novita

While Novita offers a feature set comparable to DeepInfra — supporting both JSON Mode and Function Calling — it currently struggles with infrastructure performance.

  • Time to First Token (TTFT): 9.49s (#3 — last place)
  • Output Speed: 50.1 tokens/sec
  • Blended Price: $0.15 per 1M tokens
  • Context Window: 200k tokens
  • API Features: Function Calling, JSON Mode

A TTFT of 9.49 seconds makes this provider unusable for real-time user-facing applications — nearly 13x slower than DeepInfra. It also ranks last in generation speed (50.1 t/s) while matching Amazon’s higher price point ($0.15 blended). Novita is best reserved as a backup provider or for specific non-time-sensitive workloads where other providers are unavailable.
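The "backup provider" role can be wired up as a simple ordered-failover loop: try the primary, fall back to the next provider only when the call raises. A sketch with hypothetical callables standing in for real API clients:

```python
from typing import Callable, Sequence

def complete_with_failover(providers: Sequence[tuple[str, Callable[[str], str]]],
                           prompt: str) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, reply) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in production, catch the client's specific error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Stand-ins for real clients: the primary fails, the backup answers.
def deepinfra_call(prompt: str) -> str:
    raise TimeoutError("simulated outage")

def novita_call(prompt: str) -> str:
    return "ok from backup"

name, reply = complete_with_failover(
    [("deepinfra", deepinfra_call), ("novita", novita_call)], "hello")
print(name, reply)  # → novita ok from backup
```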

Technical Metric Comparison

| Metric | DeepInfra | Amazon Bedrock | Novita |
|---|---|---|---|
| Latency (TTFT) | 0.75s ✓ Winner | 0.90s | 9.49s |
| Output Speed (t/s) | 74.6 | 228.6 ✓ Winner | 50.1 |
| Blended Price (/1M) | $0.14 ✓ Winner | $0.15 | $0.15 |
| JSON Mode | Yes ✓ | No | Yes |
| Function Calling | Yes | Yes | Yes |
| Context Window | 203k ✓ | 200k | 200k |

Frequently Asked Questions

Which GLM-4.7-Flash provider is the cheapest?

DeepInfra is the cheapest provider with a blended price of $0.14 per million tokens and an input price of $0.06 per million tokens.
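To translate per-token prices into an actual bill, multiply token volumes by the per-million rates. A quick sketch using DeepInfra's listed prices and an assumed workload of 10,000 requests at 800 input / 300 output tokens each:

```python
INPUT_PRICE = 0.06   # $ per 1M input tokens (DeepInfra, per the benchmark)
OUTPUT_PRICE = 0.40  # $ per 1M output tokens

def workload_cost(requests: int, in_tokens: int, out_tokens: int) -> float:
    """Total cost in dollars for a fixed per-request token profile."""
    total_in = requests * in_tokens / 1_000_000   # millions of input tokens
    total_out = requests * out_tokens / 1_000_000  # millions of output tokens
    return total_in * INPUT_PRICE + total_out * OUTPUT_PRICE

cost = workload_cost(10_000, 800, 300)  # 8M input + 3M output tokens
print(f"${cost:.2f}")  # → $1.68
```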

Does Amazon Bedrock support JSON mode for GLM-4.7-Flash?

No. As of the current benchmark, Amazon Bedrock supports Function Calling but does not list native JSON Mode support for this model, unlike DeepInfra and Novita.

What is the fastest provider for GLM-4.7-Flash?

It depends on the metric. DeepInfra is the fastest to start (lowest latency/TTFT at 0.75s), making it best for chat and interactive applications. Amazon Bedrock is the fastest to finish (highest throughput at 228.6 t/s), making it best for generating long documents or code blocks.
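Under a simplified latency model (total time ≈ TTFT + output_tokens / throughput, which ignores reasoning tokens and network variance), you can estimate the output length at which Bedrock's throughput advantage overtakes DeepInfra's head start:

```python
def total_time(ttft: float, speed: float, n_tokens: int) -> float:
    """Simplified end-to-end estimate: first-token latency plus generation time."""
    return ttft + n_tokens / speed

# Benchmark figures from the comparison table: (TTFT seconds, tokens/sec).
DEEPINFRA = (0.75, 74.6)
BEDROCK = (0.90, 228.6)

# Crossover point: solve 0.75 + n/74.6 == 0.90 + n/228.6 for n.
crossover = (BEDROCK[0] - DEEPINFRA[0]) / (1 / DEEPINFRA[1] - 1 / BEDROCK[1])
print(round(crossover))  # → 17; beyond roughly 17 output tokens, Bedrock finishes first
```

This simplified model is why Bedrock posts much lower end-to-end times in the table despite its slower start: for any non-trivial output length, throughput dominates.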

What makes GLM-4.7-Flash efficient?

GLM-4.7-Flash uses a Mixture-of-Experts (MoE) architecture with 30B total parameters but only ~3B active per inference. This design allows it to deliver strong performance while requiring significantly less compute than dense models of comparable capability.
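A rough way to see the efficiency gain: per-token forward-pass compute scales with the number of *active* parameters (a common approximation is ~2 FLOPs per active parameter per token), so a 30B-total MoE with ~3B active does about a tenth of the per-token work of a 30B dense model. Back-of-envelope, treating the article's parameter counts as exact:

```python
TOTAL_PARAMS = 30e9   # all experts must still be resident in memory
ACTIVE_PARAMS = 3e9   # parameters actually exercised per token (~10%)

# Rough rule of thumb: ~2 FLOPs per active parameter per generated token.
flops_per_token_moe = 2 * ACTIVE_PARAMS
flops_per_token_dense = 2 * TOTAL_PARAMS

ratio = flops_per_token_moe / flops_per_token_dense
print(f"MoE uses {ratio:.0%} of the dense model's per-token compute")  # → 10%
```

Note the asymmetry: compute drops ~10x, but memory does not, since all 30B parameters must be loaded, which is why MoE models trade VRAM for speed.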

Conclusion

For the vast majority of GLM-4.7-Flash use cases, DeepInfra is the recommended provider. It successfully balances the key pillars of API performance: fastest to start (0.75s latency), cheapest to run ($0.14/1M tokens), and full support for JSON Mode and Function Calling.

  • Choose DeepInfra for the best overall value — lowest cost, lowest latency, and full feature support.
  • Choose Amazon Bedrock for high-volume batch processing where latency is not a concern and maximum throughput (228.6 t/s) is the priority.
  • Avoid Novita for production use cases until its latency issues are resolved.