At DeepInfra we host the best open-source LLMs, and we are always working to make our APIs simple and easy to use.
Today we are excited to announce an easy way to try our models, such as Llama 2 70B and Mistral 7B, and compare them to OpenAI's models. You only need to change the API endpoint URL and the model name to see whether these models are a good fit for your application.
Here is a quick example of how to use the OpenAI Python client with our models:
import openai  # this example uses the pre-1.0 openai package (openai<1.0)
# Point OpenAI client to our endpoint
openai.api_base = "https://api.deepinfra.com/v1/openai"
# Just leave the API key empty. You don't need it to try our models.
openai.api_key = ""
# Your chosen model here
MODEL_DI = "meta-llama/Llama-2-70b-chat-hf"
chat_completion = openai.ChatCompletion.create(
    model=MODEL_DI,
    messages=[{"role": "user", "content": "Hello world"}],
    stream=True,
)
# Print each streamed chunk of the chat completion
for event in chat_completion:
    print(event.choices)
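Since only the endpoint URL and the model name differ between providers, a comparison can be driven from a small lookup table. The following is a minimal sketch, not part of our API: the `ENDPOINTS` table and the `request_for` helper are hypothetical names, and the OpenAI entry (base URL and `gpt-3.5-turbo`) is an illustrative assumption you should replace with whatever you are comparing against.

```python
from typing import Dict, Tuple

# DeepInfra values are taken from the example above.
# The "openai" entry is an assumption used only for illustration.
ENDPOINTS: Dict[str, Dict[str, str]] = {
    "deepinfra": {
        "api_base": "https://api.deepinfra.com/v1/openai",
        "model": "meta-llama/Llama-2-70b-chat-hf",
    },
    "openai": {
        "api_base": "https://api.openai.com/v1",
        "model": "gpt-3.5-turbo",
    },
}


def request_for(provider: str, prompt: str) -> Tuple[str, dict]:
    """Return (api_base, chat-completion parameters) for one provider."""
    endpoint = ENDPOINTS[provider]
    params = {
        "model": endpoint["model"],
        "messages": [{"role": "user", "content": prompt}],
    }
    return endpoint["api_base"], params


# Intended use with the pre-1.0 openai client (not executed here):
#   api_base, params = request_for("deepinfra", "Hello world")
#   openai.api_base = api_base
#   openai.ChatCompletion.create(**params)
```

Sending the same prompt through both entries keeps the rest of your application code unchanged, which is the point of the comparison.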
To make it as simple as possible, you don't even have to create an account with DeepInfra to try our models. Just pass an empty string as api_key and you are good to go. We rate limit unauthenticated requests by IP address.
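Because unauthenticated requests are rate limited, a client that experiments heavily may want to retry with a growing delay. The sketch below is one generic way to do that, under the assumption that a rate-limited call surfaces as a Python exception; `call_with_backoff` is a hypothetical helper, and the attempt count and delays are arbitrary illustrative values rather than documented limits.

```python
import time


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds, then retry.

    Re-raises the last exception once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Assumption: a rate-limited request raises an exception.
            # In real code, catch only your client's rate-limit error type.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

You would wrap the request in a zero-argument callable, for example `call_with_backoff(lambda: openai.ChatCompletion.create(...))`. The `sleep` parameter exists only so the delay can be replaced in tests.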
When you are ready to use our models in production, you can create an account at DeepInfra and get an API key. We offer the best pricing for the Llama 2 70B model at just $1 per 1M tokens. If you need any help, just reach out to us on our Discord server.
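Once you have an API key, one common pattern is to read it from the environment instead of hard-coding it, and to fall back to the empty string for unauthenticated trials. This is a minimal sketch: the `DEEPINFRA_API_KEY` variable name and the `resolve_api_key` helper are assumptions made for illustration, not names defined by our API.

```python
import os


def resolve_api_key(env_var="DEEPINFRA_API_KEY", environ=os.environ):
    """Return the API key from the environment, or "" if it is not set.

    An empty string corresponds to the unauthenticated, rate-limited mode.
    """
    return environ.get(env_var, "").strip()


# Intended use with the pre-1.0 openai client (not executed here):
#   openai.api_key = resolve_api_key()
```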
© 2026 DeepInfra. All rights reserved.