
Llama Model Family

Developed by Meta, Llama (Large Language Model Meta AI) is a family of state-of-the-art open-weight models designed for efficiency and performance. The latest versions feature Mixture-of-Experts (MoE) architectures, enabling cost-effective inference by dynamically activating subsets of parameters. With support for multimodal inputs (text + images) and extended context windows (up to 10M tokens), Llama excels in tasks like code generation, multilingual understanding, and long-form reasoning. The models support FP8 quantization and batch inference, ensuring low-latency, high-throughput performance for production workloads. With permissive licensing and robust tooling (e.g., Llama Guard for safety), Llama is ideal for developers seeking powerful, customizable AI with minimal overhead.

Featured Model: meta-llama/Llama-4-Scout-17B-16E-Instruct

The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Scout is a 17-billion-parameter model with 16 experts.

Price per 1M input tokens: $0.08
Price per 1M output tokens: $0.30
Release Date: 04/05/2025
Context Size: 327,680 tokens
Quantization: bfloat16


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
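
The endpoint also supports the SDK's streaming mode, which is useful for interactive applications where tokens should appear as they are generated. A minimal sketch, assuming the standard OpenAI-compatible stream=True behavior:

# Assume openai>=1.0.0
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Request a streamed response instead of waiting for the full completion
stream = openai.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta; content is None on the final chunk
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()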

Featured Model: meta-llama/Llama-4-Maverick-17B-128E-Instruct-Turbo

The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Maverick is a 17-billion-parameter model with 128 experts.

Price per 1M input tokens: $0.50
Price per 1M output tokens: $0.50
Release Date: 05/16/2025
Context Size: 8,192 tokens
Quantization: fp8


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25

Available Llama 4 Models

The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences.

Model                              Context    $ per 1M input tokens    $ per 1M output tokens
Llama-4-Scout-17B-16E              320k       $0.08                    $0.30
Llama-4-Maverick-17B-128E          1024k      $0.15                    $0.60
Llama-4-Maverick-17B-128E-Turbo    8k         $0.50                    $0.50
Llama-Guard-4-12B                  160k       $0.05                    $0.05
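
Per-request cost follows directly from these rates: tokens divided by one million, multiplied by the per-million price. A small illustration using the usage fields returned by the API, with the Scout rates from the table above:

# Estimate the cost of a single request from its reported token usage.
# Rates are the Llama-4-Scout-17B-16E prices listed in the table above.
PRICE_PER_1M_INPUT = 0.08   # USD per 1M input tokens
PRICE_PER_1M_OUTPUT = 0.30  # USD per 1M output tokens

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the USD cost of one request at the configured rates."""
    return (prompt_tokens / 1_000_000) * PRICE_PER_1M_INPUT \
        + (completion_tokens / 1_000_000) * PRICE_PER_1M_OUTPUT

# The earlier example reported 11 prompt tokens and 25 completion tokens:
print(f"${request_cost(11, 25):.8f}")  # -> $0.00000838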

Available Llama 3 Models

Meta Llama 3 is a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.

FAQ

What is LLaMA AI?

LLaMA AI is Meta’s family of open-weight foundation language models, spanning text, chat, code, and multimodal (image + text) understanding. The latest generation, LLaMA 4, delivers state-of-the-art performance with an efficient, scalable MoE architecture and context windows that span from thousands up to millions of tokens. Meta provides extensive documentation, responsible-use guidelines, and a developer ecosystem, including cookbooks, tutorials, and a “Llama Everywhere” deployment guide, to support a wide range of use cases.

What tasks are LLaMA models best suited for?

LLaMA models are powerful general-purpose LLMs ideal for tasks like natural language generation, multilingual dialogue, programming assistance, document summarization, and image-language tasks. They also excel in enterprise applications like search augmentation, AI copilots, and automated reasoning systems.

Are the LLaMA models on DeepInfra optimized for low latency?

DeepInfra’s infrastructure deploys LLaMA models on high-performance GPUs (A100, H100, B200) with regional autoscaling, ensuring ultra-low latency and high throughput. This makes DeepInfra suitable for real-time applications where responsiveness and uptime are critical.

How large are the context windows for LLaMA models?

LLaMA models on DeepInfra support extended context windows ranging from 8k up to 1,024k tokens, depending on the model variant (see the table above). This makes them ideal for processing long documents, handling large conversation histories, or powering advanced RAG systems without truncation.
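
As a concrete sketch, a long document can be passed inline in a single message to a long-context variant. The file name below is illustrative; Scout's 320k-token context accommodates very long inputs:

# Assume openai>=1.0.0
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Hypothetical input file; its full text is sent as part of one message
with open("annual_report.txt") as f:
    document = f.read()

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "user",
         "content": f"Summarize the key points of this document:\n\n{document}"},
    ],
)

print(chat_completion.choices[0].message.content)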

Can I use LLaMA models for multimodal (image + text) applications?

Yes. Several LLaMA models on DeepInfra include multimodal support, allowing them to process images and text in the same input. This enables use cases like visual Q&A, document intelligence, and screenshot analysis—all via a single API endpoint.
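
A minimal sketch of an image + text request, assuming the OpenAI-style content-parts format (text and image_url parts) on the same chat completions endpoint; the image URL is a placeholder:

# Assume openai>=1.0.0
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            # Placeholder URL; a base64 data: URL can also be supplied here
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)

print(chat_completion.choices[0].message.content)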

How do I integrate Llama models into my application?

You can integrate Llama models seamlessly using DeepInfra’s OpenAI-compatible API. Just replace your existing base URL with DeepInfra’s endpoint and use your DeepInfra API key—no infrastructure setup required. DeepInfra also supports integration through libraries like openai, litellm, and other SDKs, making it easy to switch or scale your workloads instantly.
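
For example, switching via litellm amounts to a model prefix and an API key. A sketch assuming litellm's DeepInfra provider, which reads the DEEPINFRA_API_KEY environment variable:

import os
import litellm

# litellm routes "deepinfra/..." model names to DeepInfra's API
os.environ["DEEPINFRA_API_KEY"] = "$DEEPINFRA_TOKEN"

response = litellm.completion(
    model="deepinfra/meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)

print(response.choices[0].message.content)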

What are the pricing details for using Llama models on DeepInfra?

Pricing is usage-based:
  • Input Tokens: between $0.003 and $0.80 per million
  • Output Tokens: between $0.006 and $0.80 per million
Prices vary slightly by model. There are no upfront fees, and you only pay for what you use.

How do I get started using Llama on DeepInfra?

  • Sign in with GitHub at deepinfra.com
  • Get your API key
  • Test models directly from the browser, cURL, or SDKs
  • Review pricing on your usage dashboard
Within minutes, you can deploy apps using Llama models without any infrastructure setup.
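
For a first call without any SDK, the OpenAI-compatible endpoint can also be hit directly over HTTP. A minimal sketch using the requests library, assuming the standard OpenAI-style chat completions route under DeepInfra's base URL:

import requests

resp = requests.post(
    "https://api.deepinfra.com/v1/openai/chat/completions",
    headers={"Authorization": "Bearer $DEEPINFRA_TOKEN"},
    json={
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
resp.raise_for_status()

data = resp.json()
print(data["choices"][0]["message"]["content"])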