nvidia/Cosmos3-Nano

$0.0108 / second (480p)

Cosmos3 is a world foundation model that unifies understanding and generation within a single Mixture-of-Transformers (MoT) architecture. Two tightly coupled towers, a Reasoner (vision-language model) and a Generator (world simulator), share latent representations so that structured perception directly grounds realistic, temporally consistent simulation.

Input

Prompt

Text prompt describing the desired content

Output Type

Output type: 'video' for a video clip, 'image' for a single image.

Resolution

Output resolution. Pricing per second: 256p (~$0.003), 480p (~$0.01), 720p (~$0.02).

Aspect Ratio

Output aspect ratio.

Duration Seconds

Video duration in seconds (1–8). Ignored when output_type is 'image'. (Default: 5, 1 ≤ duration_seconds ≤ 8)
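
The per-second prices in the Resolution field make it easy to estimate what a clip costs. The following is a minimal sketch, assuming the rounded prices listed above scale linearly with duration_seconds. The function and dictionary names are illustrative and not part of any API, and image pricing is not stated here, so it is left out.

```python
# Approximate per-second prices copied from the Resolution field.
# The page marks them with "~", so treat results as rough estimates.
APPROX_PRICE_PER_SECOND = {"256p": 0.003, "480p": 0.01, "720p": 0.02}


def estimate_video_cost(resolution: str, duration_seconds: int = 5) -> float:
    """Return a rough USD cost for one video clip (hypothetical helper)."""
    if resolution not in APPROX_PRICE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution!r}")
    # The form limits duration to 1-8 seconds.
    if not 1 <= duration_seconds <= 8:
        raise ValueError("duration_seconds must be between 1 and 8")
    return APPROX_PRICE_PER_SECOND[resolution] * duration_seconds


if __name__ == "__main__":
    for tier in APPROX_PRICE_PER_SECOND:
        print(tier, round(estimate_video_cost(tier, 5), 4))
```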

Settings

Image Url

First-frame image for image-to-video: URL or base64-encoded image data. Omit for text-to-video. Mutually exclusive with video_url. (Default: empty)

Video Url

Conditioning video for video-to-video: URL or base64-encoded video data. Omit for text-to-video. Mutually exclusive with image_url. (Default: empty)

Seed

Random seed for reproducible output (Default: empty, 0 ≤ seed)
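
The constraints in the fields above (duration range, image_url and video_url being mutually exclusive, non-negative seed) can be checked client-side before sending a request. This is a sketch under assumptions: output_type, duration_seconds, image_url, and video_url appear in the field descriptions, while prompt and seed as JSON keys, and the helper itself, are guesses based on the form labels.

```python
from typing import Any, Optional


def build_input(
    prompt: str,
    output_type: str = "video",
    duration_seconds: int = 5,
    image_url: Optional[str] = None,
    video_url: Optional[str] = None,
    seed: Optional[int] = None,
) -> dict[str, Any]:
    """Validate the documented constraints and return a request dict."""
    if output_type not in ("video", "image"):
        raise ValueError("output_type must be 'video' or 'image'")
    if image_url and video_url:
        raise ValueError("image_url and video_url are mutually exclusive")
    if seed is not None and seed < 0:
        raise ValueError("seed must be >= 0")

    payload: dict[str, Any] = {"prompt": prompt, "output_type": output_type}
    # duration_seconds is ignored for images, so only send it for video.
    if output_type == "video":
        if not 1 <= duration_seconds <= 8:
            raise ValueError("duration_seconds must be between 1 and 8")
        payload["duration_seconds"] = duration_seconds
    # Optional fields are omitted when unset (their default is empty).
    if image_url:
        payload["image_url"] = image_url
    if video_url:
        payload["video_url"] = video_url
    if seed is not None:
        payload["seed"] = seed
    return payload
```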

Output

Model Information

NVIDIA Cosmos

Website | Framework

Introduction

NVIDIA Cosmos is an open platform of world models, datasets, and tools that enables developers to build Physical AI for robots, autonomous vehicles, smart infrastructure, and more.

Cosmos 3

Cosmos 3 is our newest model family [Report] [Website]. It is a suite of omnimodal world models designed to jointly process and generate language, images, video, audio, and action sequences within a unified Mixture-of-Transformers architecture. By supporting highly flexible input-output configurations, it seamlessly unifies critical modalities for Physical AI — effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework.

Cosmos 3 exposes two runtime surfaces:

| Surface | Inputs | Outputs | Use Cases |
|---|---|---|---|
| Reasoner | Text, vision | Text | World understanding, grounding, physical reasoning, task planning, action forecasting, embodied agent reasoning, and autonomous system decision making |
| Generator | Text, vision, sound, action | Vision, sound, action | World generation, world simulation, future prediction, synthetic data generation, policy learning, and robot training |

Key Capabilities

  • World understanding: Analyze videos and images for captions, temporal events, next actions, spatial grounding, physical plausibility, and causal outcomes.
  • World generation: Produce images, videos, synchronized sound, and action-conditioned rollouts from text, image, video, or action inputs.
  • Action modeling: Predict policy actions, inverse dynamics, and forward dynamics for robotics, camera motion, egocentric motion, and autonomous-driving settings.
  • Research and production paths: Use Diffusers and Transformers for Python-first development, then vLLM-Omni and vLLM for OpenAI-compatible serving.
  • Post-training recipes: Adapt vision, action, and reasoner workflows with Cosmos Framework training recipes and task-specific evaluation [Coming Soon].

Model Architecture

Cosmos 3 model architecture

Cosmos 3 is an omnimodal world model built on a unified Mixture-of-Transformers (MoT) architecture that combines an autoregressive (AR) transformer for reasoning with a diffusion transformer (DM) for multimodal generation. In Reasoner Mode, language and visual understanding tokens are processed through causal self-attention, enabling next-token prediction for tasks such as perception, planning, and world reasoning. In Generator Mode, noisy image, video, audio, and action tokens are denoised through full attention, allowing the model to jointly generate coherent multimodal outputs. Both modes share the same transformer architecture, multimodal attention layers, and a unified 3D multi-dimensional rotary position embedding (mRoPE) representation that encodes spatial and temporal structure across modalities, enabling consistent reasoning over images, videos, audio streams, and action trajectories.

Model Family

| Model | Size | Primary Capability |
|---|---|---|
| Cosmos3-Nano | 16B | Compact omnimodal world model for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI. |
| Cosmos3-Super | 64B | Frontier-scale omnimodal world model for advanced multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI. |
| Cosmos3-Super-Text2Image | 64B | High-fidelity text-to-image generation. |
| Cosmos3-Super-Image2Video | 64B | Temporally coherent image-to-video generation. |
| Cosmos3-Nano-Policy-DROID | 16B | Vision-language robot policy for DROID manipulation and control. |

Supported Generation Settings

| Setting | Supported values |
|---|---|
| Resolution tiers | 256p, 480p, 720p, default=480p |
| Aspect ratios | 16:9, 4:3, 1:1, 3:4, 9:16, default=16:9 |
| Frame rates | 10, 16, 24, and 30 FPS, default=24 |
| Frame count | 5 to 300 frames, default=189 |
| Precision | BF16 tested |
| Operating system | Linux |
| GPU architectures | NVIDIA Ampere, Hopper, and Blackwell |
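
Frame count and frame rate together determine clip length: the default of 189 frames at 24 FPS is 189 / 24 = 7.875 seconds. A small sketch of that arithmetic, using the supported ranges from the settings above. The helper name is illustrative only.

```python
SUPPORTED_FPS = (10, 16, 24, 30)
MIN_FRAMES, MAX_FRAMES = 5, 300


def clip_seconds(num_frames: int = 189, fps: int = 24) -> float:
    """Return the clip duration in seconds for a frame count and frame rate."""
    if fps not in SUPPORTED_FPS:
        raise ValueError(f"fps must be one of {SUPPORTED_FPS}")
    if not MIN_FRAMES <= num_frames <= MAX_FRAMES:
        raise ValueError(f"num_frames must be in [{MIN_FRAMES}, {MAX_FRAMES}]")
    return num_frames / fps


if __name__ == "__main__":
    print(clip_seconds())         # default settings
    print(clip_seconds(300, 10))  # longest clip the ranges allow
```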

Input and Output

| Spec | Value |
|---|---|
| Input types | Text, text + image, text + video, text + image + action |
| Input formats | Text string, JPG/PNG/JPEG/WEBP image, MP4 video, JSON action array |
| Vision conditioning | 720p uses 1280x720, 480p uses 832x480, and 256p uses 320x192. Video conditioning uses 5 frames at the matching resolution. |
| Action conditioning | Supported action dimensions depend on the embodiment, including camera motion (9D), autonomous vehicle (9D), egocentric motion (57D), single-arm robot (10D, DROID/UR/Fractal/Bridge/UMI), dual-arm robot (20D, dual DROID arms), humanoid robot (29D, AgiBot). |
| Output types | Image, video, sound, action state, text |
| Output formats | JPG image, MP4 video, AAC sound stream muxed into MP4, JSON action values, text string |
| Prompt length | Fewer than 300 words is recommended for world-generation prompts |
| Sound output | Stereo AAC at 48 kHz when generated with video |
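
The vision-conditioning sizes and the prompt-length recommendation can be captured in a small lookup. A sketch, assuming the listed sizes are the only ones needed (the spec gives one size per tier and does not say how other aspect ratios map). The helper names are hypothetical.

```python
# (width, height) per resolution tier, as listed under "Vision conditioning".
CONDITIONING_SIZE = {"720p": (1280, 720), "480p": (832, 480), "256p": (320, 192)}
# Video conditioning uses this many frames at the matching resolution.
VIDEO_CONDITIONING_FRAMES = 5
# "Fewer than 300 words is recommended" for world-generation prompts.
RECOMMENDED_MAX_WORDS = 300


def conditioning_size(tier: str) -> tuple[int, int]:
    """Return the (width, height) used for vision conditioning at a tier."""
    try:
        return CONDITIONING_SIZE[tier]
    except KeyError:
        raise ValueError(f"unknown resolution tier: {tier!r}") from None


def prompt_within_recommendation(prompt: str) -> bool:
    """True when the prompt has fewer words than the recommended limit."""
    return len(prompt.split()) < RECOMMENDED_MAX_WORDS
```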

Use Cases

Generator

Generator examples produce non-text outputs conditioned by text, vision, and action inputs.

| Workflow | Inputs | Outputs | What it demonstrates |
|---|---|---|---|
| Text-to-image | Text | Vision | Robotics laboratory scene generation from a text prompt |
| Text-to-video | Text | Vision | Industrial video generation from a dense scene description |
| Text-to-video with sound | Text | Vision, sound | Synchronized visual and audio generation |
| Image-to-video | Text, image | Vision | Robot manipulation animation from a starting image and prompt |
| Image-to-video with sound | Text, image | Vision, sound | Image-conditioned motion with synchronized audio |
| Video-to-video | Text, video | Vision | Prompt-guided transformation of a robot manipulation video |
| Video-to-video with sound | Text, video, sound | Vision, sound | Prompt-guided transformation of a robot manipulation video |
| Forward dynamics | Text, vision, action | Vision | Future-state rollout from action and visual context |
| Action policy | Text, vision | Action, vision | Action trajectories and rollout video from context |

Generator prompt upsampling expands short scene descriptions into dense structured prompts. The current examples use these sampling defaults:

| Parameter | Value |
|---|---|
| max_tokens | 20000 |
| temperature | 0.7 |
| top_p | 0.8 |
| top_k | 20 |
| repetition_penalty | 1.0 |
| presence_penalty | 1.5 |
| seed | 3407 |

Reasoner

Reasoner examples produce text outputs from text and vision inputs. They follow Qwen3-VL-compatible message conventions for image and video inputs.

| Workflow | Inputs | Outputs | What it demonstrates |
|---|---|---|---|
| Caption | Video | Text | Detailed video captioning |
| Temporal localization | Video, query | Text or JSON | Event detection, timestamp query, and interval question answering |
| Embodied reasoning | Video, question | Text | Next-action prediction for robotics and assisted-task settings |
| Common-sense reasoning | Video, question | Text | Physical common-sense judgment with visible context |
| 2D grounding | Image, prompt | JSON boxes | Bounding-box localization from an image prompt |
| Describe anything | Image, marked subjects | JSON or text | Attribute captioning for marked subjects |
| Action CoT | Image or video, prompt | Text or JSON | Trajectory prediction and driving-scene chain-of-thought |
| Physical Plausibility Analysis | Video, prompt | Label | Physical plausibility classification |
| Situation Understanding | Video, question | Text | Situation understanding and likely-next-action prediction |

Reasoner examples use the following sampling settings:

| Parameter | Without reasoning | With reasoning |
|---|---|---|
| top_p | 0.8 | 0.95 |
| top_k | 20 | 20 |
| repetition_penalty | 1.0 | 1.0 |
| presence_penalty | 1.5 | 0.0 |
| temperature | 0.7 | 0.6 |
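
The two sampling presets can be kept in one place and merged into a chat-completions request body. This is a sketch only: it assumes the serving endpoint accepts top_k and repetition_penalty as top-level body fields (vLLM's OpenAI-compatible server generally does, but verify against your build), and the max_tokens value is a placeholder that the sampling settings above do not specify.

```python
from typing import Any

# Values copied from the Reasoner sampling settings above.
REASONER_SAMPLING: dict[str, dict[str, float]] = {
    "without_reasoning": {
        "top_p": 0.8, "top_k": 20, "repetition_penalty": 1.0,
        "presence_penalty": 1.5, "temperature": 0.7,
    },
    "with_reasoning": {
        "top_p": 0.95, "top_k": 20, "repetition_penalty": 1.0,
        "presence_penalty": 0.0, "temperature": 0.6,
    },
}


def chat_request(
    messages: list[dict[str, Any]],
    reasoning: bool = False,
    model: str = "nvidia/Cosmos3-Nano",
    max_tokens: int = 1024,  # placeholder; not specified by the settings above
) -> dict[str, Any]:
    """Build a request body with the preset for the chosen mode."""
    preset = "with_reasoning" if reasoning else "without_reasoning"
    body: dict[str, Any] = {"model": model, "messages": messages, "max_tokens": max_tokens}
    body.update(REASONER_SAMPLING[preset])
    return body
```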

Use this basic message shape for text + vision requests:

[
  {
    "role": "system",
    "content": [{"type": "text", "text": "You are a helpful assistant."}]
  },
  {
    "role": "user",
    "content": [
      {"type": "video_url", "video_url": "https://example.com/video.mp4"},
      {"type": "text", "text": "List the notable events with approximate timestamps."}
    ]
  }
]
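
A small helper can produce the message shape above for any video and question. This sketch mirrors the JSON exactly as shown, including video_url as a plain string. The function name is illustrative, and only the video case is covered because that is the only shape shown above.

```python
from typing import Any


def video_question_messages(
    video_url: str,
    question: str,
    system_prompt: str = "You are a helpful assistant.",
) -> list[dict[str, Any]]:
    """Return a system + user message pair for a text + video request."""
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [
                # Media entry first, then the text instruction, as in the example.
                {"type": "video_url", "video_url": video_url},
                {"type": "text", "text": question},
            ],
        },
    ]
```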

For explicit reasoning, append this format instruction to the user prompt:

Answer the question using the following format:

<think>
Your reasoning.
</think>

Write your final answer immediately after the </think> tag.
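
On the client side, the format instruction can be appended to the prompt and the response split at the closing tag. A minimal sketch, assuming the model follows the format. The helper names are hypothetical, and a response without tags is treated as an answer with no reasoning.

```python
import re

REASONING_FORMAT = (
    "Answer the question using the following format:\n\n"
    "<think>\nYour reasoning.\n</think>\n\n"
    "Write your final answer immediately after the </think> tag."
)

_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def with_reasoning(prompt: str) -> str:
    """Append the explicit-reasoning format instruction to a user prompt."""
    return f"{prompt}\n\n{REASONING_FORMAT}"


def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning, final_answer) parsed from a model response."""
    match = _THINK_BLOCK.search(response)
    if match is None:
        # No tags found: treat the whole response as the answer.
        return "", response.strip()
    return match.group(1).strip(), response[match.end():].strip()
```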

Quickstart

Before running examples, create a Hugging Face access token and then authenticate locally:

uvx hf@latest auth login

Set HF_HOME if you want to use a shared cache or a disk with more space.

Generator with Diffusers

Use Hugging Face Diffusers for Cosmos 3 Generator research, training, and model development. This path loads the full Cosmos 3 checkpoint, including the reasoner path, diffusion generation path, and media tokenizers.

uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=auto \
  "diffusers @ git+https://github.com/huggingface/diffusers.git" \
  accelerate \
  av \
  cosmos_guardrail \
  huggingface_hub \
  imageio \
  imageio-ffmpeg \
  torch \
  torchvision \
  transformers

--torch-backend=auto lets uv detect your NVIDIA driver and install a matching CUDA build of torch/torchvision. Without it, uv pulls the newest CUDA wheel (currently cu130), which fails on pre-CUDA-13 drivers: torch reports "The NVIDIA driver on your system is too old" and torch.cuda.is_available() returns False. To pin an explicit backend instead, use for example --torch-backend=cu128 for a CUDA 12.8 driver.

A text-to-video run takes a while: the first run downloads Cosmos3-Nano, and diffusion is compute-heavy, running through every inference step before producing output. Long step times are expected, not a hang.

import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=10.0)

result = pipe(
    prompt="A mobile robot navigates a warehouse aisle and stops at a shelf.",
    negative_prompt="",
    image=None,
    num_frames=189,
    height=720,
    width=1280,
    fps=24,
    num_inference_steps=35,
    guidance_scale=6.0,
    enable_sound=False,
    add_resolution_template=False,
    add_duration_template=False,
    generator=torch.Generator(device="cuda").manual_seed(1234),
)

export_to_video(result.video, "cosmos3_t2v.mp4", fps=24, macro_block_size=1)

Diffusers modes:

| Mode | Use |
|---|---|
| text-to-image | Single-frame image generation with num_frames=1; returns a PIL image |
| text-to-video | Video generation; 189 frames is about 7.9 seconds at 24 FPS |
| image-to-video | Video generation conditioned on an input image |
| text-to-video-with-sound | Video generation with sound for checkpoints that include sound modules |

See the Cosmos 3 Diffusers documentation for runnable examples of each mode.

Generator with vLLM-Omni

Use vLLM-Omni for Generator production inference behind an OpenAI-compatible API. This integration loads the full Cosmos 3 checkpoint, including the Qwen3-VL-based reasoner path and the diffusion generation path. For understanding-only tasks that return text, use Reasoner with vLLM instead, which loads only the reasoner.

Compatibility status: Cosmos 3 Generator support is being upstreamed in vllm-project/vllm-omni#3454, which adds text-to-image, text-to-video, and image-to-video; follow-up PRs add video-to-video, video-with-sound, and action. Until they merge, the vllm/vllm-omni:cosmos3 Docker image is the official build with every modality supported; the PR-branch install below covers only the three modes in that PR.

Start the server from the Docker image (all modalities). Mount any directory that contains local media or action files you want the server to read.

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000 \
  --init-timeout 1800

Cosmos 3 checkpoints can exceed the default server init timeout; use --init-timeout 1800 on every vllm serve command in this section.

vLLM-Omni prints "Application startup complete." when the API is ready.

For nvidia/Cosmos3-Super (the larger 64B model), split weights across GPUs and optionally offload layers to reduce peak memory: --tensor-parallel-size splits model weights across multiple GPUs, and --enable-layerwise-offload offloads transformer blocks between CPU and GPU with a latency tradeoff and extra CPU RAM use. For example, on four GPUs, add --tensor-parallel-size 4 --enable-layerwise-offload --init-timeout 1800 to the vllm serve command.

Additional parallelism options:

| Option | Use |
|---|---|
| --cfg-parallel-size 2 | Runs the positive and negative CFG branches in parallel on two GPUs. Set CFG strength with the request-level guidance_scale; do not use true_cfg_scale. |
| --ulysses-degree 2 | Enables Ulysses sequence parallelism, splitting the sequence dimension across GPUs. |

When combining parallelism options, ensure the server has enough GPUs for the product of the enabled degrees (tensor_parallel_size × cfg_parallel_size × ulysses_degree).
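
The GPU requirement is the product of the enabled degrees. A short sketch of that check, with the function name chosen for illustration:

```python
def required_gpus(
    tensor_parallel_size: int = 1,
    cfg_parallel_size: int = 1,
    ulysses_degree: int = 1,
) -> int:
    """Return tensor_parallel_size x cfg_parallel_size x ulysses_degree."""
    degrees = {
        "tensor_parallel_size": tensor_parallel_size,
        "cfg_parallel_size": cfg_parallel_size,
        "ulysses_degree": ulysses_degree,
    }
    for name, value in degrees.items():
        if value < 1:
            raise ValueError(f"{name} must be >= 1")
    return tensor_parallel_size * cfg_parallel_size * ulysses_degree


if __name__ == "__main__":
    # Tensor parallel 4 combined with CFG parallel 2 needs 8 GPUs.
    print(required_gpus(tensor_parallel_size=4, cfg_parallel_size=2))
```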

To install only the three modes covered by the upstreaming PR (text-to-image, text-to-video, image-to-video) from its branch instead of using the Docker image, create a venv and install vLLM-Omni from the PR ref, choosing the CUDA build that matches your driver:

uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
# CUDA 13 driver:
uv pip install --torch-backend=cu130 \
  "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@refs/pull/3454/head"
# CUDA 12.8 driver:
# uv pip install --torch-backend=cu128 \
#   "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@refs/pull/3454/head"

Then run vllm serve nvidia/Cosmos3-Nano --omni --model-class-name Cosmos3OmniDiffusersPipeline --allowed-local-media-path / --port 8000 --init-timeout 1800 directly, without the docker run ... vllm/vllm-omni:cosmos3 wrapper.

Vision endpoints:

| Mode | Endpoint | Notes |
|---|---|---|
| Text to image | POST /v1/images/generations | Returns a base64-encoded PNG |
| Text to video | POST /v1/videos/sync | Blocks and returns the MP4 bytes directly |
| Image to video | POST /v1/videos/sync | Upload the conditioning image with input_reference |
| Video to video | POST /v1/videos/sync | Upload a source video and choose which frames stay as clean conditioning |
| Video with sound | POST /v1/videos/sync | Add generate_sound=true to produce a soundtrack alongside the video |

Action modes use Cosmos 3 as a world model: they condition on an embodiment (domain_name) and exchange video and action sequences. Policy and inverse dynamics return a predicted action chunk, so send those through the asynchronous POST /v1/videos job and read the action data from the completed result; forward dynamics returns only video and can use synchronous POST /v1/videos/sync.

| Mode | action_mode | Input | Output |
|---|---|---|---|
| Policy | policy | Image + instruction | Video + predicted action chunk |
| Inverse dynamics | inverse_dynamics | Video + instruction | Video + predicted action chunk |
| Forward dynamics | forward_dynamics | Image + action chunk | Video |

Pass embodiment settings through extra_params: action_mode, domain_name (for example bridge_orig_lerobot, av, or camera_pose), raw_action_dim, and action_chunk_size. Forward dynamics also takes an action_path pointing at an action file the server can read, so start the server with --allowed-local-media-path covering that file (for Docker, mount the file and pass the container-visible path). For the full set of robot, autonomous-vehicle, and camera-pose variants, see the Cosmos 3 online-serving examples.
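
The action settings travel as a JSON string in extra_params, and the endpoint depends on the mode. The following is a sketch under assumptions: the field names and mode values come from the text above, but the helpers are hypothetical, and appropriate raw_action_dim and action_chunk_size values depend on the embodiment, so they are required arguments with no defaults.

```python
import json
from typing import Optional

ACTION_MODES = ("policy", "inverse_dynamics", "forward_dynamics")


def action_extra_params(
    action_mode: str,
    domain_name: str,
    raw_action_dim: int,
    action_chunk_size: int,
    action_path: Optional[str] = None,
) -> str:
    """Return the JSON string to send as the extra_params form field."""
    if action_mode not in ACTION_MODES:
        raise ValueError(f"action_mode must be one of {ACTION_MODES}")
    # Forward dynamics reads its action chunk from a server-visible file.
    if action_mode == "forward_dynamics" and not action_path:
        raise ValueError("forward_dynamics requires action_path")
    params = {
        "action_mode": action_mode,
        "domain_name": domain_name,
        "raw_action_dim": raw_action_dim,
        "action_chunk_size": action_chunk_size,
    }
    if action_path:
        params["action_path"] = action_path
    return json.dumps(params)


def endpoint_for(action_mode: str) -> str:
    """Forward dynamics can use the sync route; the others use the async job."""
    if action_mode not in ACTION_MODES:
        raise ValueError(f"action_mode must be one of {ACTION_MODES}")
    return "/v1/videos/sync" if action_mode == "forward_dynamics" else "/v1/videos"
```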

Example video request:

curl -sS -X POST http://localhost:8000/v1/videos/sync \
  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
  --form-string "negative_prompt=blurry, distorted, low quality" \
  --form-string "size=1280x720" \
  --form-string "num_frames=189" \
  --form-string "fps=24" \
  --form-string "num_inference_steps=35" \
  --form-string "guidance_scale=6.0" \
  --form-string "flow_shift=10.0" \
  --form-string "seed=0" \
  --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
  -o cosmos3_t2v_output.mp4

Use --form-string for text fields (prompt, negative_prompt, extra_params) rather than -F: with -F, curl treats ; as a content-type separator and silently truncates any value that contains one.
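
From Python, the same form fields can be assembled as strings, with extra_params JSON-encoded, which sidesteps the curl quoting issue entirely. A sketch that mirrors the curl example's values. The helper name is hypothetical, and it only builds the fields rather than sending them.

```python
import json


def video_form_fields(
    prompt: str,
    *,
    negative_prompt: str = "blurry, distorted, low quality",
    size: str = "1280x720",
    num_frames: int = 189,
    fps: int = 24,
    num_inference_steps: int = 35,
    guidance_scale: float = 6.0,
    flow_shift: float = 10.0,
    seed: int = 0,
    guardrails: bool = True,
) -> dict[str, str]:
    """Return multipart form fields (all strings) for POST /v1/videos/sync."""
    extra_params = {
        "use_resolution_template": False,
        "use_duration_template": False,
        "guardrails": guardrails,
    }
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "size": size,
        "num_frames": str(num_frames),
        "fps": str(fps),
        "num_inference_steps": str(num_inference_steps),
        "guidance_scale": str(guidance_scale),
        "flow_shift": str(flow_shift),
        "seed": str(seed),
        # Nested options are sent as one JSON-encoded string field.
        "extra_params": json.dumps(extra_params),
    }
```

With the third-party requests library, one way to send these as multipart fields is requests.post(url, files={k: (None, v) for k, v in fields.items()}) and then writing response.content to an .mp4 file. That usage is untested against this server and is shown only as a starting point.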

Common request fields (the image endpoint follows the Image Generation API, and the video endpoints follow the Videos API):

| Field | Purpose |
|---|---|
| prompt | Positive text prompt |
| negative_prompt | Concepts or artifacts to avoid |
| size | Output resolution as <width>x<height> |
| num_frames, fps | Video length and frame rate (video endpoints only) |
| num_inference_steps | Diffusion denoising steps |
| guidance_scale | Classifier-free guidance scale (use this for Cosmos 3 CFG; do not use true_cfg_scale) |
| flow_shift | Scheduler flow-shift value |
| seed | Reproducibility seed |
| max_sequence_length | Maximum number of prompt tokens kept for conditioning (Cosmos 3 default 512); longer prompts are truncated with a warning, shorter ones padded |
| input_reference | Uploaded image or video for image-to-video, video-to-video, and action requests |
| extra_params | JSON-encoded Cosmos 3-specific options: action settings (action_mode, domain_name, raw_action_dim, action_chunk_size, action_path), video-to-video conditioning (condition_frame_indexes_vision, condition_video_keep), prompt-template toggles (use_resolution_template, use_duration_template), and the per-request guardrails toggle |
| extra_args | JSON object for Cosmos 3-specific image-endpoint options such as use_resolution_template |

Disabling guardrails: Cosmos 3 ships safety guardrails that screen prompts and blur faces in generated output. Disable them per request by adding guardrails: false to extra_params:

curl -sS -X POST http://localhost:8000/v1/videos/sync \
  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
  --form-string 'extra_params={"guardrails":false,"use_resolution_template":false,"use_duration_template":false}' \
  -o cosmos3_t2v.mp4

To disable guardrails server-wide so the guardrail models are never loaded (per-request overrides then cannot turn them back on), pass a deploy config — a future release replaces this with a dedicated --cosmos3-no-guardrails flag:

# no_guardrails.yaml
async_chunk: false
stages:
  - stage_id: 0
    max_num_seqs: 1
    enforce_eager: true
    trust_remote_code: true
    model_class_name: Cosmos3OmniDiffusersPipeline
    model_config:
      guardrails: false
      offload_guardrail_models: false

Reasoner with Transformers

Coming soon!

Reasoner with vLLM

Use vLLM for Reasoner production inference behind an OpenAI-compatible chat-completions API.

uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=cu130 "vllm==0.21.0" \
  "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"

The vLLM version and the torch backend are paired: use --torch-backend=cu130 "vllm==0.21.0" for a CUDA 13 driver, or --torch-backend=cu128 "vllm==0.19.1" for CUDA 12.8. vLLM does not publish wheels for every CUDA minor version, so --torch-backend=auto is not reliable here — pick the pair that matches your driver.
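
The pairing can be written as a small lookup so install scripts fail loudly on an unsupported driver. A sketch using only the two pairs stated above. The function names are illustrative, and treating any CUDA 13.x driver as the cu130 case is an assumption.

```python
def vllm_pair(cuda_version: str) -> tuple[str, str]:
    """Map a driver CUDA version to (torch backend, vLLM version)."""
    parts = cuda_version.strip().split(".")
    major = parts[0]
    minor = parts[1] if len(parts) > 1 else "0"
    if major == "13":  # assumption: any CUDA 13.x driver uses the cu130 pair
        return ("cu130", "0.21.0")
    if (major, minor) == ("12", "8"):
        return ("cu128", "0.19.1")
    raise ValueError(f"no documented vLLM pairing for CUDA {cuda_version}")


def pip_args(cuda_version: str) -> str:
    """Return the paired arguments for `uv pip install`."""
    backend, vllm_version = vllm_pair(cuda_version)
    return f'--torch-backend={backend} "vllm=={vllm_version}"'


if __name__ == "__main__":
    print("uv pip install", pip_args("12.8"))
```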

vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --async-scheduling \
  --allowed-local-media-path / \
  --port 8000

For notebook launch commands (Cosmos3-Super on four GPUs, media-path defaults, and full flag sets), see cookbooks/cosmos3/README.md — Start the server.

If your vLLM build reports that DeepGEMM is unavailable, disable it before starting the server:

export VLLM_USE_DEEP_GEMM=0

Configuration notes:

| Option | Use |
|---|---|
| --tensor-parallel-size | Number of GPUs used for tensor parallel inference |
| --mm-encoder-tp-mode data | Data parallelism for the visual encoder in multimodal workloads |
| --media-io-kwargs '{"video": {"num_frames": -1}}' | Allows the processor to consider all available frames before downstream frame sampling |
| --allowed-local-media-path | Required when requests pass local file:// media paths |

Troubleshooting

Which CUDA version should I use?

CUDA 13 (recommended) or 12.8. Your system CUDA and PyTorch's CUDA major version must match — check with nvidia-smi and python -c "import torch; print(torch.version.cuda)".
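
The major-version comparison can be scripted. This sketch parses the "CUDA Version: X.Y" field that nvidia-smi prints in its header and compares it with the string from torch.version.cuda. It works on plain strings, so it does not import torch itself, and the helper names are hypothetical.

```python
import re
from typing import Optional


def driver_cuda_version(nvidia_smi_output: str) -> Optional[str]:
    """Extract 'X.Y' from the 'CUDA Version: X.Y' field, if present."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", nvidia_smi_output)
    return match.group(1) if match else None


def cuda_majors_match(system_cuda: str, torch_cuda: str) -> bool:
    """True when both version strings share the same major version."""
    return system_cuda.split(".")[0] == torch_cuda.split(".")[0]


if __name__ == "__main__":
    header = "| NVIDIA-SMI 570.00    Driver Version: 570.00    CUDA Version: 12.8 |"
    system = driver_cuda_version(header)
    print(system, cuda_majors_match(system, "13.0"))
```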

Which base container should I use?

NVIDIA NGC PyTorch: nvcr.io/nvidia/pytorch:25.09-py3 for CUDA 13, or nvcr.io/nvidia/pytorch:25.06-py3 for CUDA 12.

torch.cuda.is_available() is False ("The NVIDIA driver on your system is too old")

The installed torch build targets a newer CUDA version than your driver supports; uv pip install torch defaults to CUDA 13 (cu130). Install a matching build: uv pip install --torch-backend=auto torch torchvision (or pin, e.g. --torch-backend=cu128). For uv sync notebooks use COSMOS3_UV_GROUP=cu128-train; for vLLM pair cu128 with vllm==0.19.1.

Import fails with libxcb.so.1: cannot open shared object file

On headless servers and minimal containers, importing or running a pipeline can fail with libxcb.so.1: cannot open shared object file (or another missing graphics library), because a dependency links against system X11/graphics libraries that are not installed. Install them:

apt-get install -y libxcb1 libgl1 libglib2.0-0

uv errors on install or sync

The Cosmos Framework requires uv >= 0.11.3 (enforced via its pyproject.toml). Older versions fail to parse the project config (for example the [tool.uv.audit] section) and do not recognize newer --torch-backend values such as cu130 (you may see "a value is required for '--torch-backend'" or an accepted-values list that stops at cu129). Upgrade with uv self update (or reinstall from https://astral.sh/uv).
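
A setup script can check the uv version before attempting an install. A minimal sketch, assuming `uv --version` prints a line containing a dotted version such as "uv 0.11.3 (...)". The helper names are illustrative, and the check operates on the captured output string.

```python
import re

MIN_UV_VERSION = (0, 11, 3)


def parse_uv_version(version_output: str) -> tuple[int, ...]:
    """Extract (major, minor, patch) from the output of `uv --version`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version found in: {version_output!r}")
    return tuple(int(part) for part in match.groups())


def uv_is_new_enough(version_output: str) -> bool:
    """True when the reported uv version is at least 0.11.3."""
    # Tuples compare element by element, so (0, 9, 30) < (0, 11, 3).
    return parse_uv_version(version_output) >= MIN_UV_VERSION
```

The output string can be captured with subprocess.run(["uv", "--version"], capture_output=True, text=True).stdout.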

Choosing an Integration

| Goal | Use | Notes |
|---|---|---|
| Generator research or model development | Diffusers | Python-first path for inspecting and modifying generator behavior |
| Generator production inference | vLLM-Omni | API path for image, video, sound, and action outputs |
| Reasoner research or model development | Transformers (coming soon) | Python-first path for prompts, processors, and model behavior |
| Reasoner production inference | vLLM | OpenAI-compatible endpoint for text outputs from text and vision inputs |
| Runnable setup, training, or evaluation | Cosmos Framework | Full workflow docs for setup, inference, omni-model training, and evaluation |

Examples

We are building examples that show Cosmos 3 capabilities end to end, including world generation, world understanding, captioning, temporal localization, grounding, and physical reasoning. Each example is a self-contained script or notebook you can run from this repository.

| Example | Surface | Workflows demonstrated | Open | nbviewer |
|---|---|---|---|---|
| Generator (audiovisual) with Diffusers | Generator | Text-to-image, plus text-to-video and image-to-video each with or without synchronized sound, via Cosmos3OmniPipeline. | Notebook | Render with nbviewer |
| Generator (audiovisual) with Cosmos Framework | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, through the cosmos_framework.scripts.inference entrypoint. | Notebook | Render with nbviewer |
| Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, against an OpenAI-compatible vLLM-Omni server. | Notebook | Render with nbviewer |
| Forward dynamics with Cosmos Framework | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, through the cosmos_framework.scripts.inference entrypoint. | Notebook | Render with nbviewer |
| Forward dynamics with vLLM-Omni | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, against an OpenAI-compatible vLLM-Omni server. | Notebook | Render with nbviewer |
| Inverse dynamics with Cosmos Framework | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, through the cosmos_framework.scripts.inference entrypoint. | Notebook | Render with nbviewer |
| Inverse dynamics with vLLM-Omni | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, against an OpenAI-compatible vLLM-Omni server. | Notebook | Render with nbviewer |
| Reasoner with Cosmos Framework | Reasoner | Text and image reasoning: detailed captioning, robot task planning, 2D grounding, describe-anything, and action-trajectory prompts, through the cosmos_framework.scripts.inference entrypoint. | Notebook | Render with nbviewer |
| Reasoner with vLLM | Reasoner | Image and video reasoning: captioning, temporal localization, embodied reasoning, common-sense reasoning, 2D grounding, describe-anything, action CoT, driving scenes, physical-plausibility, and situation understanding, against an OpenAI-compatible vLLM server (Cosmos3-Super on 4 GPUs by default; switch to Nano per the cookbook README). | Notebook | Render with nbviewer |

Inference Benchmarks

Cosmos 3 latency and serving numbers live in inference_benchmarks.md. Generator sections report diffusion-path latency (seconds) by GPU, engine, resolution, and tensor-parallel width; the Reasoner section reports vLLM serving metrics under concurrent load. Empty cells mean a combination has not been measured yet, not that it is unsupported.

| Benchmark | Surface | Model | What it covers |
|---|---|---|---|
| Cosmos3-Nano generator | Generator | Cosmos3-Nano | Text-to-image, text-to-video, and image-to-video latency across PyTorch, vLLM-Omni, and Diffusers |
| Cosmos3-Super generator | Generator | Cosmos3-Super | The same modalities and engines at the larger checkpoint scale |
| Cosmos3-Nano reasoner | Reasoner | Cosmos3-Nano | vLLM serving metrics: TTFT, request latency, and throughput at concurrency 1/64/128/256 |

Limitations

Cosmos 3 can produce artifacts in long, high-resolution, or physically complex outputs. Common failure modes include temporal inconsistency, unstable camera or object motion, inaccurate sound-video alignment, imperfect action-state consistency, object morphing, inaccurate 3D structure, and implausible physical dynamics. Applications that require physically grounded simulation, safety-critical control, or complex multi-agent behavior need additional validation, guardrails, and system-level safety analysis before deployment.

Ecosystem

| Project | Purpose |
|---|---|
| Cosmos Framework | End-to-end Physical AI framework for training and serving world models, including setup, inference, and training |
| Cosmos Curator | Distributed Physical AI data curation system covering processing, annotation, filtering, and deduplication |
| Cosmos Evaluator | Automated Physical AI evaluation system for world generation and world reasoning outputs |

License and Contact

This project may download and install additional third-party open source software projects. Review the license terms of those projects before use.

NVIDIA Cosmos source code and models are released under, and subject to the terms of, the OpenMDW-1.1 License. For a custom license, contact cosmos-license@nvidia.com.