
Llama Model Family

Developed by Meta, Llama (Large Language Model Meta AI) is a family of state-of-the-art open-weight models designed for efficiency and performance. The latest versions feature Mixture-of-Experts (MoE) architectures, enabling cost-effective inference by dynamically activating subsets of parameters. With support for multimodal inputs (text + images) and extended context windows (up to 10M tokens), Llama excels in tasks like code generation, multilingual understanding, and long-form reasoning. The models support FP8 quantization and batch inference, ensuring low-latency, high-throughput performance for production workloads. With permissive licensing and robust tooling (e.g., Llama Guard for safety), Llama is ideal for developers seeking powerful, customizable AI with minimal overhead.

Featured Model: meta-llama/Llama-4-Scout-17B-16E-Instruct

The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Scout is a 17-billion-parameter model with 16 experts.

Price per 1M input tokens: $0.08
Price per 1M output tokens: $0.30
Release Date: 04/05/2025
Context Size: 327,680 tokens
Quantization: bfloat16


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
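
The endpoint also supports the SDK's streaming mode, which is useful for interactive applications where tokens should appear as they are generated. A minimal sketch, assuming the standard OpenAI-compatible stream=True behavior:

# Assume openai>=1.0.0
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Request a streamed response instead of waiting for the full completion
stream = openai.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta; content is None on the final chunk
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()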

Featured Model: meta-llama/Llama-4-Maverick-17B-128E-Instruct-Turbo

The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Maverick is a 17-billion-parameter model with 128 experts.

Price per 1M input tokens: $0.50
Price per 1M output tokens: $0.50
Release Date: 05/16/2025
Context Size: 8,192 tokens
Quantization: fp8


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25

Available Llama 4 Models

The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences.

Model                              Context    $ per 1M input tokens    $ per 1M output tokens
Llama-4-Scout-17B-16E              320k       $0.08                    $0.30
Llama-4-Maverick-17B-128E          1024k      $0.15                    $0.60
Llama-4-Maverick-17B-128E-Turbo    8k         $0.50                    $0.50
Llama-Guard-4-12B                  160k       $0.05                    $0.05
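
Per-request cost follows directly from these rates: tokens divided by one million, multiplied by the per-million price. A small illustration using the usage fields returned by the API, with the Scout rates from the table above:

# Estimate the cost of a single request from its reported token usage.
# Rates are the Llama-4-Scout-17B-16E prices listed in the table above.
PRICE_PER_1M_INPUT = 0.08   # USD per 1M input tokens
PRICE_PER_1M_OUTPUT = 0.30  # USD per 1M output tokens

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the USD cost of one request at the configured rates."""
    return (prompt_tokens / 1_000_000) * PRICE_PER_1M_INPUT \
        + (completion_tokens / 1_000_000) * PRICE_PER_1M_OUTPUT

# The earlier example reported 11 prompt tokens and 25 completion tokens:
print(f"${request_cost(11, 25):.8f}")  # -> $0.00000838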

Available Llama 3 Models

Meta Llama 3 is a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.

FAQ

What is LLaMA AI?

LLaMA AI is Meta’s family of open-weight foundation language models, spanning text, chat, code, and multimodal (image + text) understanding. The latest generation, LLaMA 4, delivers state-of-the-art performance with an efficient, scalable MoE architecture and context windows that span from thousands up to millions of tokens. Meta provides extensive documentation, responsible-use guidelines, and a developer ecosystem, including cookbooks, tutorials, and a “Llama Everywhere” deployment guide, to support a wide range of use cases.

What tasks are LLaMA models best suited for?

LLaMA models are powerful general-purpose LLMs ideal for tasks like natural language generation, multilingual dialogue, programming assistance, document summarization, and image-language tasks. They also excel in enterprise applications like search augmentation, AI copilots, and automated reasoning systems.

Are the LLaMA models on DeepInfra optimized for low latency?

DeepInfra’s infrastructure deploys LLaMA models on high-performance GPUs (A100, H100, B200) with regional autoscaling, ensuring ultra-low latency and high throughput. This makes DeepInfra suitable for real-time applications where responsiveness and uptime are critical.

How large are the context windows for LLaMA models?

LLaMA models on DeepInfra support extended context windows ranging from 8k up to 1,024k tokens, depending on the model variant (see the table above). This makes them ideal for processing long documents, handling large conversation histories, or powering advanced RAG systems without truncation.
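
As a concrete sketch, a long document can be passed inline in a single message to a long-context variant. The file name below is illustrative; Scout's 320k-token context accommodates very long inputs:

# Assume openai>=1.0.0
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Hypothetical input file; its full text is sent as part of one message
with open("annual_report.txt") as f:
    document = f.read()

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "user",
         "content": f"Summarize the key points of this document:\n\n{document}"},
    ],
)

print(chat_completion.choices[0].message.content)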

Can I use LLaMA models for multimodal (image + text) applications?

Yes. Several LLaMA models on DeepInfra include multimodal support, allowing them to process images and text in the same input. This enables use cases like visual Q&A, document intelligence, and screenshot analysis—all via a single API endpoint.
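
A minimal sketch of an image + text request, assuming the OpenAI-style content-parts format (text and image_url parts) on the same chat completions endpoint; the image URL is a placeholder:

# Assume openai>=1.0.0
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            # Placeholder URL; a base64 data: URL can also be supplied here
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)

print(chat_completion.choices[0].message.content)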

How do I integrate Llama models into my application?

You can integrate Llama models seamlessly using DeepInfra’s OpenAI-compatible API. Just replace your existing base URL with DeepInfra’s endpoint and use your DeepInfra API key—no infrastructure setup required. DeepInfra also supports integration through libraries like openai, litellm, and other SDKs, making it easy to switch or scale your workloads instantly.
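
For example, switching via litellm amounts to a model prefix and an API key. A sketch assuming litellm's DeepInfra provider, which reads the DEEPINFRA_API_KEY environment variable:

import os
import litellm

# litellm routes "deepinfra/..." model names to DeepInfra's API
os.environ["DEEPINFRA_API_KEY"] = "$DEEPINFRA_TOKEN"

response = litellm.completion(
    model="deepinfra/meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)

print(response.choices[0].message.content)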

What are the pricing details for using Llama models on DeepInfra?

Pricing is usage-based:
  • Input Tokens: between $0.003 and $0.80 per million
  • Output Tokens: between $0.006 and $0.80 per million
Prices vary slightly by model. There are no upfront fees, and you only pay for what you use.

How do I get started using Llama on DeepInfra?

  • Sign in with GitHub at deepinfra.com
  • Get your API key
  • Test models directly from the browser, cURL, or SDKs
  • Review pricing on your usage dashboard
Within minutes, you can deploy apps using Llama models without any infrastructure setup.
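
For a first call without any SDK, the OpenAI-compatible endpoint can also be hit directly over HTTP. A minimal sketch using the requests library, assuming the standard OpenAI-style chat completions route under DeepInfra's base URL:

import requests

resp = requests.post(
    "https://api.deepinfra.com/v1/openai/chat/completions",
    headers={"Authorization": "Bearer $DEEPINFRA_TOKEN"},
    json={
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
resp.raise_for_status()

data = resp.json()
print(data["choices"][0]["message"]["content"])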