Cosmos Cookbook

Overview

The NVIDIA Cosmos ecosystem is a suite of World Foundation Models (WFMs) for real-world, domain-specific applications. This cookbook provides step-by-step workflows, technical recipes, and concrete examples across robotics, simulation, autonomous systems, and physical scene understanding. It serves as a technical reference for reproducing successful Cosmos model deployments across different domains.

The Cosmos ecosystem covers the complete AI development lifecycle: from inference with pre-trained models to custom post-training for domain-specific adaptation. The cookbook includes quick-start inference examples, advanced post-training workflows, and proven recipes for successful model deployment and customization.

Open Source Community Platform

The Cosmos Cookbook is designed as an open-source platform where NVIDIA shares practical knowledge and proven techniques with the broader AI community. This collaborative approach enables researchers, developers, and practitioners to contribute their own workflows, improvements, and domain-specific adaptations.

Repository: https://github.com/nvidia-cosmos/cosmos-cookbook

We encourage community contributions including new examples, workflow improvements, bug fixes, and documentation enhancements. The open-source nature ensures that the collective knowledge and best practices around Cosmos models continue to evolve and benefit the entire ecosystem.

Post-Training Examples

The cookbook includes comprehensive case studies demonstrating real-world post-training applications across the Cosmos ecosystem.

Cosmos Predict

Future state prediction and generation

| Workflow | Description | Link |
| --- | --- | --- |
| Training | Traffic anomaly generation with improved realism and prompt alignment | Traffic Anomaly Generation |
| Training | Synthetic trajectory data generation for humanoid robot learning | GR00T-Dreams |

Cosmos Transfer

Multi-control video generation and augmentation

| Workflow | Description | Link |
| --- | --- | --- |
| Inference | Weather augmentation pipeline for simulation data using multi-modal controls | Weather Augmentation |
| Inference | CG-to-real conversion for multi-view warehouse environments | Warehouse Simulation |
| Inference | Synthetic manipulation motion generation for humanoid robots | GR00T-Mimic |
| Inference | CARLA simulator-to-real augmentation for traffic anomaly scenarios | CARLA Sim2Real |

Cosmos Reason

Vision-language reasoning and quality control

| Workflow | Description | Link |
| --- | --- | --- |
| Training | Physical plausibility check for video quality assessment | Video Rewards |
| Training | Spatial AI understanding for warehouse environments | Spatial AI Warehouse |
| Training | Intelligent transportation scene understanding and analysis | Intelligent Transportation |

Cosmos Model Ecosystem

The Cosmos architecture consists of multiple model families, each targeting specific capabilities in the AI development workflow:

Cosmos Curator - Video Curation Pipeline

Cosmos Curator - A GPU-accelerated video curation pipeline built on Ray. Supports multi-model analysis, content filtering, annotation, and deduplication for both inference and training data preparation.
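
The sketch below shows the general shape of a GPU-accelerated curation stage expressed as a Ray actor: clips are fanned out to workers, scored, and filtered. It is a minimal illustration of the pipeline style only; the `ClipFilter` actor, its scoring logic, and the file paths are hypothetical and not part of the Cosmos Curator API.

```python
# Minimal sketch of a Ray-based, GPU-accelerated curation stage.
# NOT the Cosmos Curator API: the actor, scoring logic, and paths are
# placeholders that stand in for a real filtering model.
import ray

ray.init()

@ray.remote(num_gpus=1)  # requires a GPU in the cluster; drop for CPU-only tests
class ClipFilter:
    """Hypothetical actor that scores video clips and keeps the good ones."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        # A real stage would load an aesthetic/motion/caption model here.

    def keep(self, clip_path: str) -> bool:
        # Placeholder score; replace with GPU inference over decoded frames.
        score = 0.9 if clip_path.endswith(".mp4") else 0.0
        return score >= self.threshold

clips = ["data/clip_0001.mp4", "data/clip_0002.webm"]  # hypothetical inputs
workers = [ClipFilter.remote() for _ in range(2)]
futures = [workers[i % len(workers)].keep.remote(c) for i, c in enumerate(clips)]
kept = [c for c, ok in zip(clips, ray.get(futures)) if ok]
print(kept)
```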

Cosmos Predict - Future State Prediction Models

Cosmos Predict 2.5 (Latest) - A flow-based model that unifies Text2World, Image2World, and Video2World into a single architecture. Uses Cosmos Reason 1 as the text encoder and significantly improves upon Predict 2 in both quality and prompt alignment. Provides specialized variants for robotics, autonomous vehicles (multiview), and simulation, with support for custom post-training for domain-specific prediction tasks.

Cosmos Predict 2 - A diffusion transformer for future state prediction. Provides text-to-image and video-to-world generation capabilities, with specialized variants for robotics and simulation. Supports custom training for domain-specific prediction tasks.

Cosmos Transfer - Multi-Control Video Generation Models

Cosmos Transfer 2.5 (Latest) - Enhanced multi-control video generation system with improved quality and control precision. Features ControlNet and MultiControlNet conditioning (including depth, segmentation, LiDAR, and HDMap) and 4K upscaling, and supports training for custom control modalities and domain adaptation.

Cosmos Transfer 1 - A multi-control video generation system with ControlNet and MultiControlNet conditioning (including depth, segmentation, LiDAR, and HDMap). Includes 4K upscaling capabilities and supports training for custom control modalities and domain adaptation.

Cosmos Reason - Vision-Language Reasoning Models

Cosmos Reason 1 - A 7B vision-language model for physically grounded reasoning. Handles spatial/temporal understanding and chain-of-thought tasks, with fine-tuning support for embodied AI applications and domain-specific reasoning.
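
A rough sketch of how a vision-language reasoning model like this is commonly queried is shown below, using the generic Hugging Face transformers image-text-to-text pipeline. The model id `nvidia/Cosmos-Reason1-7B`, the suitability of this pipeline for the checkpoint, and the example image URL are assumptions; consult the published model card for the supported loading path.

```python
# Hedged sketch: querying a vision-language reasoning model via the generic
# Hugging Face "image-text-to-text" pipeline. The model id and pipeline
# compatibility are assumptions; verify against the model card.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="nvidia/Cosmos-Reason1-7B",  # assumed checkpoint name
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/warehouse_frame.jpg"},  # placeholder image
            {"type": "text", "text": "Is it physically plausible for the forklift to make this turn? Explain step by step."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=256)
print(out[0]["generated_text"])
```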

Cosmos RL - Training Framework

Cosmos RL - A distributed training framework supporting both supervised fine-tuning (SFT) and reinforcement learning approaches. Features elastic policy rollout, FP8/FP4 precision support, and optimization for large-scale VLM and LLM training.
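
To make the "SFT and RL" distinction concrete, the snippet below is a generic, framework-agnostic sketch of the two training signals in plain PyTorch (it does not use the Cosmos RL API): supervised fine-tuning minimizes cross-entropy against reference tokens, while a simple REINFORCE-style update weights token log-probabilities by a scalar reward, e.g. one produced by a reward model.

```python
# Generic illustration of SFT vs. policy-gradient RL losses in plain PyTorch.
# This is a conceptual sketch, not the Cosmos RL API.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Supervised fine-tuning: cross-entropy against reference tokens.
    logits: (batch, seq, vocab); target_ids: (batch, seq)."""
    return F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())

def reinforce_loss(logits: torch.Tensor, sampled_ids: torch.Tensor,
                   rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style update: reward-weighted negative log-likelihood.
    rewards: (batch,) scalar reward per sampled rollout."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)  # (batch, seq)
    return -(rewards.unsqueeze(-1) * token_logp).mean()

# Tiny shape check with random tensors.
B, T, V = 2, 8, 32
logits = torch.randn(B, T, V)
tokens = torch.randint(0, V, (B, T))
rewards = torch.rand(B)
print(sft_loss(logits, tokens).item(), reinforce_loss(logits, tokens, rewards).item())
```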

All models include pre-trained checkpoints and support custom training for domain-specific adaptation. The diagram below illustrates component interactions across inference and training workflows.

Figure: Cosmos Overview (component interactions across inference and training workflows)

Cosmos Workflows

The cookbook is organized around key workflows spanning inference and training use cases:

1. Data Curation - Use Cosmos Curator to prepare your datasets with modular, scalable processing pipelines. This includes splitting, captioning, filtering, deduplication, task-specific sampling, and cloud-native or local execution.

2. Model Post-Training - Fine-tune foundation models using your curated data. This covers domain adaptation for Predict (2 and 2.5), Transfer (1 and 2.5), and Reason 1; setup for supervised fine-tuning, LoRA, or reinforcement learning; and use of Cosmos RL for large-scale distributed rollout.

3. Evaluation and Quality Control - Ensure your post-trained models are aligned and robust through metrics, visualization, and qualitative inspection. Leverage Cosmos Reason 1 as a quality filter, e.g. for synthetic data rejection sampling (see the sketch after this list).

4. Model Distillation - Accelerate diffusion models by distilling them into a more efficient variant while preserving output quality. This covers single-step distillation techniques, including Knowledge Distillation (KD) and Improved Distribution Matching Distillation (DMD2).
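
As referenced in step 3, the sketch below shows the shape of a rejection-sampling quality filter: generate candidate clips, score each with a plausibility critic, and keep only those above a threshold. Both `generate_candidates` and `plausibility_score` are hypothetical placeholders standing in for a generation pipeline and a Cosmos Reason 1-style critic; they are not real Cosmos API calls.

```python
# Hypothetical rejection-sampling loop for synthetic-data quality control.
# generate_candidates() and plausibility_score() are placeholders, not Cosmos APIs.
from typing import Callable, List

def rejection_sample(
    generate_candidates: Callable[[str, int], List[str]],
    plausibility_score: Callable[[str], float],
    prompt: str,
    num_candidates: int = 8,
    threshold: float = 0.7,
) -> List[str]:
    """Generate candidate clips for a prompt and keep only the plausible ones."""
    candidates = generate_candidates(prompt, num_candidates)
    return [clip for clip in candidates if plausibility_score(clip) >= threshold]

# Example usage with trivial stand-ins for the generator and the critic.
fake_generate = lambda prompt, n: [f"{prompt}_{i}.mp4" for i in range(n)]
fake_score = lambda clip: 0.9 if clip.endswith("0.mp4") else 0.3
print(rejection_sample(fake_generate, fake_score, "forklift_turn"))
```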

Quick Start Paths

This cookbook provides flexible entry points for both inference and training workflows. Each section contains runnable scripts, technical recipes, and complete examples.
