
Real-time inference on DeepInfra is fast — but when a popular model is under heavy load, requests queue up and some get shed with an HTTP 429. The new Priority service tier lets your latency-critical traffic jump to the front of that queue and stay admitted through contention, for 1.5× the real-time price. It's a single OpenAI-compatible field on the request — no separate endpoint, no new API to learn.
Most traffic is happy to retry a 429 and move on. Some isn't. Priority is built for the workloads where waiting in line is the problem:
You opt in per request by setting service_tier to "priority". Leave it off and your request runs at the standard real-time rate, exactly as it does today — nothing changes for traffic that doesn't need to skip the line.
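For traffic that stays on the default tier, the retry-on-429 pattern mentioned above might look like the following minimal sketch. The `retry_on_429` helper, the `send` callable, and the backoff parameters are hypothetical names chosen for illustration; they are not part of the DeepInfra API, and a real client would wrap an actual HTTP call in `send`.

```python
import time
from typing import Callable, Tuple


def retry_on_429(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-429 status or attempts run out.

    send is a hypothetical callable returning (http_status, body).
    The delay doubles after every shed request (exponential backoff).
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send()
    return status, body


# Simulated endpoint: sheds the first two requests, then succeeds.
responses = iter([(429, ""), (429, ""), (200, "ok")])
delays = []
result = retry_on_429(lambda: next(responses), sleep=delays.append)
print(result, delays)  # (200, 'ok') [0.5, 1.0]
```

Workloads that cannot absorb those backoff delays are the ones Priority is aimed at.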
One field. Add "service_tier": "priority" to any chat or completions request. Priority requests:
The response's service_tier field comes back "priority" when (and only when) priority was actually applied.
When the model is idle, priority and normal requests look the same: there's no queue to jump. The difference shows up exactly when it matters: under load.
/v1/chat/completions
/v1/completions
Priority-rated billing also applies to embeddings. Everything works through the standard OpenAI-compatible API you're already using.
Priority is billed at 1.5× the corresponding real-time price — applied automatically, with no extra configuration.
Here's the part worth reading twice: you only pay the priority rate when priority is actually delivered. If you request priority on a model that doesn't support it, the request is served normally, billed at the normal rate, and the response's service_tier comes back "default". What's billed always equals what's echoed — so you can verify exactly what you paid for by reading the service_tier field on the response. No silent upcharge for a tier you didn't get.
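To make the billing rule concrete, here is a small sketch of the arithmetic. The 1.5× multiplier and the "priority" / "default" tier values come from the text above; the function name and the example cost are made up for illustration and are not real DeepInfra prices.

```python
PRIORITY_MULTIPLIER = 1.5  # from the announcement: priority is 1.5x real-time


def billed_cost(realtime_cost: float, echoed_tier: str) -> float:
    """Return the cost of a request given the tier echoed in the response.

    realtime_cost is what the request would cost at the standard rate
    (a hypothetical figure here). The premium applies only when the
    response's service_tier is "priority".
    """
    if echoed_tier == "priority":
        return realtime_cost * PRIORITY_MULTIPLIER
    return realtime_cost


# Hypothetical request that would cost 0.50 at the real-time rate.
print(billed_cost(0.50, "priority"))  # 0.75
print(billed_cost(0.50, "default"))   # 0.5
```

The input is the echoed tier, not the requested one, which mirrors the rule that what is billed always equals what is echoed.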
Priority is live now on models served on our vLLM stack, which covers the bulk of our text-generation catalog. Support for models served on SGLang and TensorRT-LLM is rolling out.
Priority is enabled on a per-model basis, and the set of priority-enabled models is growing continuously. You don't have to guess which ones qualify: every model that supports priority already carries a Priority tag on its model page, so you can see at a glance whether a model honors the tier before you send a request.
And you can always confirm it programmatically: send a request with service_tier="priority" and check the echoed service_tier on the response. If it comes back "priority", you're at the front of the queue; if it comes back "default", the model isn't priority-enabled yet and you weren't charged the premium.
Using the OpenAI Python client — just add service_tier="priority" and read it back off the response:
import os

from openai import OpenAI

# Read the API token from the DEEPINFRA_TOKEN environment variable.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
    service_tier="priority",  # opt in to the Priority tier for this request
)

print(resp.choices[0].message.content)
print("served as:", resp.service_tier)  # "priority" when priority was applied
Or with curl — service_tier is just another field in the JSON body:
curl https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Summarize this support ticket: ..."}],
    "service_tier": "priority"
  }'
The service_tier field on the response tells you which tier actually served the request.
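When working with raw HTTP rather than the client library, the same check can be made on the decoded JSON body. The sketch below assumes the response carries a top-level service_tier field, as described above. The sample bodies are abbreviated, hand-written stand-ins rather than captured API output, `priority_was_applied` is a hypothetical helper name, and treating a missing field as "default" is an assumption of this sketch.

```python
import json


def priority_was_applied(response_body: str) -> bool:
    """Return True only if the response echoes service_tier == "priority"."""
    payload = json.loads(response_body)
    # Assumption: a missing field is treated like "default" (no premium).
    return payload.get("service_tier", "default") == "priority"


# Abbreviated, hand-written sample bodies (not real API output).
served_priority = '{"id": "example-1", "service_tier": "priority"}'
served_default = '{"id": "example-2", "service_tier": "default"}'

print(priority_was_applied(served_priority))  # True
print(priority_was_applied(served_default))   # False
```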
See the Service Tier documentation for the full reference, and start sending your latency-critical traffic to the front of the queue.
© 2026 DeepInfra. All rights reserved.