Paper Link: https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf
Model Architecture
Cosmos 3 is capable of processing multimodal inputs and generating multimodal outputs
Cosmos 3 treats action as a core modality: action tokens bridge the physical world with language-based reasoning and video-based world modeling
Cosmos 3 uses modality-specific encoders to project different modalities into a unified representation space, which is then processed by a Mixture-of-Transformers (MoT) backbone
Language tokens are generated autoregressively, while other modalities are generated through iterative denoising
Encoders
Cosmos 3 adopts 2 separate encoders for visual input
For visual understanding, a ViT encoder (16 x 16 patches) pre-trained with vision-language alignment is used
For visual generation, the VAE encoder from Wan2.2-TI2V-5B is used
ViT encoder uses 16 x 16 patch size followed by a two-layer MLP that merges 2 x 2 tokens and projects them into the latent space of the transformer
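To make the token arithmetic concrete, here is a small sketch of how many tokens reach the transformer after 16 x 16 patching and 2 x 2 merging. The function name and the 448 x 448 example resolution are hypothetical, and the sketch assumes the image size divides evenly with no padding:

```python
def vit_token_count(height, width, patch=16, merge=2):
    """Return (patch_tokens, merged_tokens) for an image of the given size.

    Assumes height and width are exact multiples of patch * merge;
    how the real encoder pads or resizes is not covered here.
    """
    block = patch * merge
    if height % block or width % block:
        raise ValueError("image size must be a multiple of patch * merge")
    rows, cols = height // patch, width // patch
    patch_tokens = rows * cols
    # Each merge x merge group of neighbouring patch tokens becomes one token.
    merged_tokens = (rows // merge) * (cols // merge)
    return patch_tokens, merged_tokens


# 448 / 16 = 28 patches per side, 28 / 2 = 14 merged tokens per side
print(vit_token_count(448, 448))  # (784, 196)
```

The 2 x 2 merge cuts the visual sequence length by a factor of 4 before the tokens enter the backbone.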
For audio generation, they adopt an audio VAE architecture
Action modeling is supported across diverse embodiments. Since each domain exposes its own native control space, actions are mapped into a unified action interface that enables consistent multimodal reasoning, generation, and policy learning across domains
Action Representation
Cosmos 3 treats action as the change from one frame to the next
Instead of only seeing the before and after frames, the model gets an action token that says what movement caused that change
Cosmos 3 splits action into reusable pieces:
- Ego pose - e.g. car / camera motion and head-camera motion
- Effector pose - e.g. for a robot this would be the gripper
- Grasp state - e.g. manipulation state, robot gripper: open, half-closed, closed
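A rough sketch of how the three pieces above could be held in one record, with each embodiment filling in only the parts it has. All names here (`ActionStep`, `GraspState`, `active_components`) are hypothetical, the pose vector layout is left open, and treating grasp state as a discrete enum is an assumption based on the open / half-closed / closed example:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

# Hypothetical: a flattened pose vector; its exact layout is not specified here.
Pose = Tuple[float, ...]


class GraspState(Enum):
    OPEN = "open"
    HALF_CLOSED = "half-closed"
    CLOSED = "closed"


@dataclass
class ActionStep:
    """One frame-to-frame action, split into reusable components."""
    ego_pose: Optional[Pose] = None
    effector_pose: Optional[Pose] = None
    grasp_state: Optional[GraspState] = None

    def active_components(self):
        # Names of the components this embodiment actually provides.
        return [name for name, value in vars(self).items() if value is not None]


# A car only moves its own body; a fixed robot arm moves its gripper and grasps.
car = ActionStep(ego_pose=(0.5, 0.0, 0.0))
arm = ActionStep(effector_pose=(0.1, 0.0, -0.05), grasp_state=GraspState.CLOSED)
print(car.active_components())  # ['ego_pose']
print(arm.active_components())  # ['effector_pose', 'grasp_state']
```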
Cosmos 3 avoids learning embodiment-specific controller details like PID parameters
SE(3) - math notation for a 3D pose (3D position + 3D orientation)
The 6D rotation representation is six numbers arranged as two 3D direction vectors
Action tokenization
The purpose is to map embodiment-specific data into a shared action space while preserving each embodiment's structure and semantics
Cosmos 3 uses domain-aware input and output projection layers with separate weight matrices for each embodiment domain
Input projection:
Output projection:
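A minimal sketch of domain-aware projections, assuming plain linear maps without bias terms. The class name, the domain names, and all dimensions are hypothetical and not taken from the paper; the point is only that each domain owns its own pair of weight matrices while the hidden width is shared:

```python
import random


class DomainAwareProjection:
    """Hypothetical per-domain linear maps into and out of a shared width."""

    def __init__(self, domain_dims, model_dim, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        self.model_dim = model_dim
        # One input and one output matrix per embodiment domain.
        self.w_in = {name: matrix(model_dim, dim) for name, dim in domain_dims.items()}
        self.w_out = {name: matrix(dim, model_dim) for name, dim in domain_dims.items()}

    @staticmethod
    def _matvec(m, v):
        if len(v) != len(m[0]):
            raise ValueError("vector length does not match matrix width")
        return [sum(w * x for w, x in zip(row, v)) for row in m]

    def encode(self, domain, action):
        # Native action vector -> shared model width.
        return self._matvec(self.w_in[domain], action)

    def decode(self, domain, hidden):
        # Shared model width -> native action vector of that domain.
        return self._matvec(self.w_out[domain], hidden)


proj = DomainAwareProjection({"driving": 3, "robot_arm": 10}, model_dim=8)
token = proj.encode("driving", [0.1, 0.0, 0.2])
arm_action = proj.decode("robot_arm", token)
print(len(token), len(arm_action))  # 8 10
```

Domains with different native action sizes all land in the same hidden width, so the backbone never needs to know which embodiment produced a token.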