Paper Link: https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf

Model Architecture

Cosmos 3 processes multimodal inputs and generates multimodal outputs

Cosmos 3 treats action as a core modality: action tokens bridge the physical world with language-based reasoning and video-based world modeling

Cosmos 3 uses modality-specific encoders to project different modalities into a unified representation space, which is then processed by a Mixture-of-Transformers (MoT) backbone

Language tokens are generated autoregressively, while other modalities are generated through iterative denoising

Encoders

Cosmos 3 adopts two separate encoders for visual input

For visual understanding, a ViT encoder (16 x 16 patches) pre-trained with vision-language alignment is used

For visual generation, the VAE encoder from Wan2.2-TI2V-5B is used

The ViT encoder uses a 16 x 16 patch size, followed by a two-layer MLP that merges 2 x 2 tokens and projects them into the latent space of the transformer
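
The patch-and-merge arithmetic described here can be sketched in NumPy. This is a minimal illustration, not the model's code: the function names, the toy dimensions, the ReLU activation, and the choice to concatenate the four neighbouring tokens before the MLP are all assumptions. The notes only state the 16 x 16 patch size, the 2 x 2 merge, and the two-layer MLP.

```python
import numpy as np

PATCH = 16  # ViT patch size (from the notes)
MERGE = 2   # 2 x 2 neighbouring tokens are merged (from the notes)


def merge_2x2(tokens):
    """Concatenate each 2 x 2 block of patch tokens along the channel axis.

    tokens: (grid_h, grid_w, dim) -> (grid_h // 2, grid_w // 2, 4 * dim)
    Concatenation is an assumed merge rule, used here for illustration.
    """
    gh, gw, d = tokens.shape
    if gh % MERGE or gw % MERGE:
        raise ValueError("patch grid must be divisible by 2 in both directions")
    x = tokens.reshape(gh // MERGE, MERGE, gw // MERGE, MERGE, d)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh/2, gw/2, 2, 2, d)
    return x.reshape(gh // MERGE, gw // MERGE, MERGE * MERGE * d)


def two_layer_mlp(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between (the activation is a placeholder)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
vit_dim, hidden_dim, model_dim = 8, 16, 12
grid = 224 // PATCH                                    # 14 patches per side
patch_tokens = rng.normal(size=(grid, grid, vit_dim))  # 14 * 14 = 196 tokens

merged = merge_2x2(patch_tokens)                       # 7 x 7 grid, 4 * vit_dim channels
w1 = rng.normal(size=(4 * vit_dim, hidden_dim))
w2 = rng.normal(size=(hidden_dim, model_dim))
projected = two_layer_mlp(merged, w1, np.zeros(hidden_dim), w2, np.zeros(model_dim))

print(grid * grid, "->", projected.shape[0] * projected.shape[1])
```

For a 224 x 224 input, the 14 x 14 = 196 patch tokens shrink to 7 x 7 = 49 tokens after the merge, a 4x reduction in sequence length for the transformer.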

For audio generation, an audio VAE architecture is adopted

Action modeling is supported across diverse embodiments. Since each domain exposes its own native control space, actions are mapped into a unified action interface that enables consistent multimodal reasoning, generation, and policy learning across domains

Action Representation

Cosmos 3 treats action as the change from one frame to the next

Instead of seeing only the before and after frames, the model gets an action token that says what movement caused that change

Cosmos 3 splits action into reusable pieces:

  1. Ego pose - e.g. car / camera motion and head-camera motion
  2. Effector pose - e.g. for a robot this would be the gripper
  3. Grasp state - e.g. manipulation state of a robot gripper: open, half-closed, closed
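
The three pieces above can be pictured as optional fields of one record, where each embodiment fills in only the parts it has. The sketch below is purely illustrative: the class name, the field types, the placeholder numbers, and the 0.0 to 1.0 grasp scale are assumptions, not the paper's data format.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ActionComponents:
    """Hypothetical container; a component an embodiment lacks stays None."""
    ego_pose: Optional[Sequence[float]] = None       # car / camera / head-camera motion
    effector_pose: Optional[Sequence[float]] = None  # e.g. robot gripper pose
    grasp_state: Optional[float] = None              # assumed scale: 0.0 open .. 1.0 closed

    def present(self):
        """Names of the components this embodiment actually provides."""
        names = ("ego_pose", "effector_pose", "grasp_state")
        return [n for n in names if getattr(self, n) is not None]


# Hypothetical embodiments with placeholder numbers.
car = ActionComponents(ego_pose=[0.5, 0.0, 0.0])
arm = ActionComponents(effector_pose=[0.1, 0.2, 0.3], grasp_state=0.5)

print(car.present())
print(arm.present())
```

A driving sample uses only the ego pose, while a fixed-base arm uses the effector pose and grasp state, yet both fit the same record.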

Cosmos 3 avoids learning embodiment-specific controller details like PID parameters

SE(3) - math notation for a 3D pose (3D position + 3D orientation)
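
An SE(3) pose is commonly stored as a 4 x 4 homogeneous matrix, and the change between two poses is itself an SE(3) element. The sketch below illustrates this under the assumption that a frame-to-frame action is the relative pose between consecutive frames. The notes do not state the exact convention Cosmos 3 uses, and all function names are illustrative.

```python
import numpy as np


def rot_z(theta):
    """3 x 3 rotation about the z axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])


def make_pose(rotation, translation):
    """Pack a 3 x 3 rotation and a 3-vector translation into a 4 x 4 SE(3) matrix."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def relative_pose(T_prev, T_next):
    """Pose of the next frame expressed in the previous frame's coordinates."""
    return np.linalg.inv(T_prev) @ T_next


# Two consecutive poses (placeholder values).
T_prev = make_pose(rot_z(0.0), [1.0, 0.0, 0.0])
T_next = make_pose(rot_z(np.pi / 2), [1.0, 2.0, 0.0])

delta = relative_pose(T_prev, T_next)
print(np.round(delta, 3))
```

Composing the previous pose with the relative pose recovers the next pose (`T_prev @ delta`), which is what makes the relative transform usable as a description of the motion between frames.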

The 6D representation encodes orientation as six numbers arranged as two 3D direction vectors
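
Two 3D direction vectors are enough to pin down a full 3D orientation, because the third axis follows from a cross product. The sketch below assumes the common continuous 6D rotation parameterization (first two columns of the rotation matrix, re-orthonormalized with Gram-Schmidt). Whether Cosmos 3 uses columns or rows, and its exact normalization, is not stated in these notes, so treat the convention and the function names as assumptions.

```python
import numpy as np


def rotation_to_6d(R):
    """Keep the first two columns of a 3 x 3 rotation matrix as a 6-vector."""
    return np.concatenate([R[:, 0], R[:, 1]])


def rotation_from_6d(v):
    """Rebuild a rotation matrix from two (possibly noisy) 3D direction vectors."""
    a1, a2 = np.asarray(v[:3], dtype=float), np.asarray(v[3:], dtype=float)
    b1 = a1 / np.linalg.norm(a1)           # first axis: normalize
    a2_orth = a2 - np.dot(b1, a2) * b1     # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)  # second axis: orthonormal to b1
    b3 = np.cross(b1, b2)                  # third axis completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)


# Round trip on a rotation of 0.3 rad about the z axis.
c, s = np.cos(0.3), np.sin(0.3)
R = np.array([[c, -s, 0.0],
              [s,  c, 0.0],
              [0.0, 0.0, 1.0]])
R_back = rotation_from_6d(rotation_to_6d(R))

# An unnormalized, non-orthogonal 6-vector still maps to a valid rotation.
R_noisy = rotation_from_6d([2.0, 0.1, 0.0, 0.3, 1.5, 0.0])
```

The appeal of this form for a learned model is that any six numbers (short of degenerate, parallel vectors) decode to a valid rotation, so the network output needs no extra constraint.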

Action Tokenization

The purpose is to map embodiment data into a shared action space while preserving embodiment-specific structure and semantics

Cosmos 3 uses domain-aware input and output projection layers with separate weight matrices for each embodiment domain
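
One way to picture domain-aware projections is a lookup of weight matrices keyed by domain, so every embodiment lands in the same hidden width while keeping its own parameters. The sketch below is an assumption-heavy illustration, not the paper's formulation: the class name, the domain names, the action widths, the random initialization, and the omission of bias terms are all hypothetical.

```python
import numpy as np


class DomainAwareProjection:
    """One input and one output weight matrix per embodiment domain (illustrative)."""

    def __init__(self, action_dims, model_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = {}
        self.w_out = {}
        for domain, dim in action_dims.items():
            self.w_in[domain] = rng.normal(scale=0.02, size=(dim, model_dim))
            self.w_out[domain] = rng.normal(scale=0.02, size=(model_dim, dim))

    def encode(self, domain, actions):
        """(steps, action_dim) -> (steps, model_dim) using that domain's input weights."""
        return actions @ self.w_in[domain]

    def decode(self, domain, hidden):
        """(steps, model_dim) -> (steps, action_dim) using that domain's output weights."""
        return hidden @ self.w_out[domain]


# Hypothetical domains and action widths.
proj = DomainAwareProjection({"driving": 9, "manipulation": 10}, model_dim=32)

drive_tokens = proj.encode("driving", np.zeros((4, 9)))
arm_tokens = proj.encode("manipulation", np.ones((4, 10)))
arm_actions = proj.decode("manipulation", arm_tokens)

print(drive_tokens.shape, arm_tokens.shape, arm_actions.shape)
```

Both domains produce tokens of the same width, so a shared backbone can consume them, while decoding returns to each domain's own action width.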

Input projection:

Output projection: