
Flan-UL2 is probably the best open source model available right now for chatbots. In this post we will show you how to get started with it very easily. Flan-UL2 is large: 20B parameters. It is a fine-tuned version of the UL2 model, trained on the Flan dataset. Because it is such a large model, it is not easy to deploy on your own machine.
If you rent a GPU on AWS, it will cost you around $1.50 per hour, or about $1,080 per month. With DeepInfra model deployments you pay only for inference time, and we do not charge for cold starts. Our pricing is $0.0005 per second of running inference on an Nvidia A100, which translates to about $0.0001 per token generated by Flan-UL2.
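To make the pricing comparison concrete, here is a small sketch of the arithmetic in Python. The dollar figures are the ones quoted above; the 30-day month, the variable names, and the break-even framing are our own assumptions for illustration, not an official pricing calculator.

```python
# Figures quoted in this post (USD). Everything else is an illustrative assumption.
GPU_RENTAL_PER_HOUR = 1.5       # rented AWS GPU, billed around the clock
INFERENCE_PER_SECOND = 0.0005   # DeepInfra, billed only while inference runs
COST_PER_TOKEN = 0.0001         # approximate cost per generated token

HOURS_PER_MONTH = 24 * 30       # assumption: a 30-day month

# Cost of keeping a rented GPU up for the whole month.
monthly_rental = GPU_RENTAL_PER_HOUR * HOURS_PER_MONTH

# Generation speed implied by the per-second and per-token prices.
implied_tokens_per_second = INFERENCE_PER_SECOND / COST_PER_TOKEN

# Hours of actual inference per month at which pay-per-second
# billing would cost the same as the always-on rental.
break_even_hours = monthly_rental / INFERENCE_PER_SECOND / 3600

print(f"Monthly rental: ${monthly_rental:.0f}")
print(f"Implied speed: {implied_tokens_per_second:.1f} tokens/second")
print(f"Break-even: {break_even_hours:.0f} inference hours/month")
```

Under these assumptions, paying per second only costs more than the rental once you run inference for roughly 600 of the 720 hours in a month.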
Also check out the model page at https://deepinfra.com/google/flan-ul2. There you can run inferences and check the docs/API for running inferences via curl.
First, you'll need to get an API key from the DeepInfra dashboard.
You can deploy the google/flan-ul2 model easily through the web dashboard or API. The model will be automatically deployed when you first make an inference request.
You can use it with our REST API. Here's how to use it with curl:
curl -X POST \
-d '{"prompt": "Hello, how are you?"}' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer YOUR_API_KEY" \
'https://api.deepinfra.com/v1/inference/google/flan-ul2'
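If you would rather call the endpoint from Python, the same request can be sketched with only the standard library. The URL, headers, and JSON body mirror the curl command above; the helper names `build_request` and `run_inference` and the `DEEPINFRA_API_KEY` environment variable are hypothetical names chosen for this example, and the response is returned as parsed JSON without assuming any particular fields.

```python
import json
import os
import urllib.request

# Same endpoint as the curl command in this post.
API_URL = "https://api.deepinfra.com/v1/inference/google/flan-ul2"


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build the POST request (hypothetical helper mirroring the curl call)."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def run_inference(prompt: str, api_key: str, timeout: float = 60.0) -> dict:
    """Send the request and return the parsed JSON response as-is."""
    request = build_request(prompt, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Assumption: the key is stored in an environment variable of this name.
    key = os.environ.get("DEEPINFRA_API_KEY")
    if key:
        print(run_inference("Hello, how are you?", key))
```

Keeping request construction separate from sending makes it easy to inspect the payload without making a network call.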
To see the full documentation of how to call this model, check out the model page on the DeepInfra website or the API documentation.
If you want a list of all the models you can use on DeepInfra, you can visit the models page on our website or use the API to get a list of available models.
There is no easier way to get started with arguably one of the best open source LLMs. That was quite easy, right? You did not have to deal with Docker, transformers, PyTorch, etc. If you have any questions, just reach out to us on our Discord server.
© 2026 Deep Infra. All rights reserved.