Glossary
C
Chain of Thought (CoT) : A reasoning technique where models generate step-by-step explanations of their thought process before arriving at a final answer, improving transparency and accuracy.
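A minimal sketch of a chain-of-thought prompt; the question, the "think step by step" instruction, and the sample completion are all illustrative, not tied to any particular model.

```python
# Minimal sketch of chain-of-thought prompting: the instruction asks the
# model to lay out intermediate reasoning before the final answer.
prompt = (
    "Q: A vehicle travels at 15 m/s for 8 seconds. How far does it go?\n"
    "Let's think step by step."
)
# A typical CoT completion: "The vehicle covers 15 m each second. Over
# 8 seconds that is 15 * 8 = 120 m. Final answer: 120 m."
```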
Checkpoint : A saved snapshot of a model's weights and training state at a specific point during training, allowing training to be resumed or models to be evaluated at different stages.
Context Parallelism (CP) : A parallelization strategy that splits the sequence/context dimension across multiple devices to handle longer sequences.
ControlNet : A neural network architecture that adds conditional control to diffusion models, allowing them to be guided by additional inputs like depth maps, edge maps, or segmentation masks.
Cosmos Curator : A GPU-accelerated video curation pipeline built on Ray for multi-model analysis, content filtering, annotation, and deduplication of inference and training data.
Cosmos Predict : A diffusion transformer model for future state prediction and video-to-world generation, with specialized variants for robotics and simulation.
Cosmos Reason : A 7B vision-language model for physically grounded reasoning, handling spatial/temporal understanding and chain-of-thought tasks for embodied AI applications.
Cosmos RL : A distributed training framework supporting supervised fine-tuning (SFT) and reinforcement learning approaches with elastic policy rollout and FP8/FP4 precision support.
Cosmos Transfer : A multi-control video generation system with ControlNet and MultiControlNet conditioning (depth, segmentation, LiDAR, HDMap) including 4K upscaling capabilities.
D
Data Parallel (DP) : A training parallelization strategy where the model is replicated across multiple devices, and each device processes different data batches.
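A minimal sketch of data parallelism using PyTorch's DistributedDataParallel; it assumes a multi-process launch (e.g., torchrun) so that the process group environment variables are already set.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Each rank holds a full replica of the model and processes a different
# slice of the batch; gradients are averaged across ranks during backward.
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda()
model = DDP(model, device_ids=[local_rank])
```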
Data Parallelism Shard Size : The number of devices across which gradients are synchronized in distributed training.
Deduplication : The process of identifying and removing duplicate or near-duplicate samples from a dataset to improve data quality and training efficiency.
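A minimal sketch of exact deduplication by content hashing; the byte strings are placeholders, and near-duplicate removal in practice relies on embeddings or perceptual hashes rather than exact matches.

```python
import hashlib

# Keep only the first occurrence of each distinct sample, keyed by its
# SHA-256 digest.
seen, unique = set(), []
for sample in [b"clip-a", b"clip-b", b"clip-a"]:
    digest = hashlib.sha256(sample).hexdigest()
    if digest not in seen:
        seen.add(digest)
        unique.append(sample)
print(len(unique))  # 2
```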
Diffusion Model : A generative model that learns to create data by iteratively denoising samples, starting from pure noise and gradually refining them into coherent outputs.
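A minimal sketch of the iterative denoising loop; `denoiser` stands in for any trained noise-prediction network, and the update rule is illustrative rather than a specific sampler.

```python
import torch

def sample(denoiser, shape, num_steps=50):
    x = torch.randn(shape)  # start from pure Gaussian noise
    for step in reversed(range(num_steps)):
        t = torch.full((shape[0],), step, dtype=torch.long)
        predicted_noise = denoiser(x, t)     # model predicts the noise
        x = x - predicted_noise / num_steps  # illustrative denoising step
    return x
```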
E
Embodied AI : AI systems that interact with the physical world through sensors and actuators, such as robots and autonomous vehicles.
Epoch : A complete pass through the entire training dataset during model training.
F
Fine-Tuning : The process of adapting a pre-trained model to a specific task or domain by training it on task-specific data.
FP4/FP8 : 4-bit and 8-bit floating-point number formats that reduce memory usage and increase training speed while maintaining acceptable model performance.
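A minimal sketch of the FP8 storage format using PyTorch's `float8_e4m3fn` dtype; production FP8/FP4 training additionally requires scaling recipes (e.g., via NVIDIA Transformer Engine), which this does not show.

```python
import torch

x = torch.randn(1024, 1024)        # fp32: 4 bytes per element
x_fp8 = x.to(torch.float8_e4m3fn)  # fp8: 1 byte per element
print(x.element_size(), x_fp8.element_size())  # 4 1
```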
FPS (Frames Per Second) : The number of video frames processed or generated per second.
FSDP (Fully Sharded Data Parallel) : A memory-efficient distributed training strategy that shards model parameters, gradients, and optimizer states across multiple devices.
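A minimal sketch of wrapping a model with PyTorch FSDP; it assumes a multi-process launch (e.g., torchrun) so that the default process group can be initialized.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(4096, 4096).cuda()
model = FSDP(model)  # each rank now holds only a shard of the parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```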
G
Gradient Checkpointing : A memory-saving technique that trades computation for memory by recomputing intermediate activations during backpropagation instead of storing them.
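A minimal sketch using `torch.utils.checkpoint`: the block's intermediate activations are discarded in the forward pass and recomputed during backward.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
x = torch.randn(8, 1024, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # recomputed in backward
y.sum().backward()
```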
Gradient Clipping : A technique to prevent exploding gradients by capping the gradient norm at a maximum value during training.
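A minimal sketch using PyTorch's `clip_grad_norm_`, applied after the backward pass and before the optimizer step; the max norm of 1.0 is illustrative.

```python
import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(4, 10)).sum()
loss.backward()
# Rescale gradients so their global L2 norm is at most 1.0 before stepping.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```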
H
HDMap (High-Definition Map) : A detailed, lane-level map representation used in autonomous driving that includes precise road geometry, lane markings, and traffic rules.
I
Inference : The process of using a trained model to make predictions or generate outputs on new, unseen data.
Interactive Meta-Action : In autonomous driving, a driving behavior that involves interaction with other traffic participants, such as yielding, following, or overtaking.
ITS (Intelligent Transportation Systems) : Systems that apply sensing, communication, and information technologies to transport networks and traffic management, improving safety and efficiency across different modes of transport.
L
LLM (Large Language Model) : A neural network model trained on vast amounts of text data, capable of understanding and generating human-like text.
LoRA (Low-Rank Adaptation) : An efficient fine-tuning method that adds trainable low-rank matrices to pre-trained model weights, reducing the number of trainable parameters.
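A minimal sketch of a LoRA-adapted linear layer; the rank `r` and scaling factor `alpha` follow the conventions of the original LoRA paper, and the default values here are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        # Low-rank factors: B @ A has the same shape as the base weight.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```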
M
Max Pixels : The maximum number of pixels to process in an image or video frame, often used to control computational requirements.
Model Checkpoint : See Checkpoint.
Multi-Control : The ability to condition a generative model on multiple types of control signals simultaneously (e.g., depth, segmentation, and HDMap).
MultiControlNet : An extension of ControlNet that combines multiple conditional control signals to guide video generation with greater precision.
N
Non-Interactive Meta-Action : In autonomous driving, a driving behavior that doesn't directly involve other traffic participants, such as lane merging or turning at an empty intersection.
O
Optimizer : An algorithm that adjusts model weights during training to minimize the loss function. Common optimizers include Adam, AdamW, and SGD.
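A minimal sketch of the standard PyTorch optimizer loop, here with AdamW; the hyperparameter values are illustrative.

```python
import torch

model = torch.nn.Linear(128, 10)
# AdamW: Adam with decoupled weight decay (see Weight Decay below).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

loss = model(torch.randn(32, 128)).sum()
loss.backward()
optimizer.step()       # update weights using the gradients
optimizer.zero_grad()  # clear gradients for the next step
```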
P
Parallelism : The strategy of distributing computation across multiple devices to speed up training or inference. See also Data Parallel, Tensor Parallel, Pipeline Parallel.
Physical Plausibility : The degree to which generated or predicted content adheres to the laws and constraints of real-world physics.
Pipeline Parallel (PP) : A training strategy that splits a model into sequential stages across multiple devices, with different devices processing different layers.
Post-Training : Training applied to a model after pre-training to adapt it to specific tasks or domains, for example supervised fine-tuning or reinforcement learning.
R
Reinforcement Learning (RL) : A machine learning approach where models learn by receiving rewards or penalties based on their actions, optimizing for long-term cumulative reward.
Reward Model : A model trained to score or evaluate outputs, used in reinforcement learning to provide feedback signals for training.
S
Scene Understanding : The ability of a model to interpret and comprehend the contents, context, and relationships within a visual scene.
SFT (Supervised Fine-Tuning) : A training method that fine-tunes a pre-trained model using labeled examples with supervised learning objectives.
Sim-to-Real (Sim2Real) : The process of transferring knowledge or models trained in simulation environments to real-world applications.
Spatial AI : AI systems that understand and reason about spatial relationships, positions, and interactions in physical environments.
System Prompt : Initial instructions or context provided to a language model that define its role, behavior, or constraints for a conversation or task.
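A minimal sketch of a system prompt in the chat-message format used by most LLM APIs; the role/content schema is common but provider-specific details vary, and the prompt text is illustrative.

```python
messages = [
    {"role": "system", "content": "You are a driving-scene analyst. "
                                  "Answer concisely, citing visual evidence."},
    {"role": "user", "content": "Is the pedestrian about to cross?"},
]
```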
T
Tensor Parallel (TP) : A parallelization strategy that splits individual tensors (model layers) across multiple devices.
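A conceptual single-machine sketch of tensor parallelism: the weight matrix is split along its output dimension into two shards, and concatenating the shard outputs recovers the full result. Real implementations (e.g., Megatron-style TP) place the shards on separate devices and add communication collectives.

```python
import torch

W = torch.randn(256, 128)   # full weight: out_features x in_features
W0, W1 = W.chunk(2, dim=0)  # each shard holds half the output rows
x = torch.randn(4, 128)

y_full = x @ W.T
y_sharded = torch.cat([x @ W0.T, x @ W1.T], dim=1)
assert torch.allclose(y_full, y_sharded, atol=1e-5)
```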
Traffic Participant : Any entity in a traffic environment, including vehicles, pedestrians, cyclists, and other road users.
Training Configuration : A set of hyperparameters and settings that define how a model is trained, including learning rate, batch size, and optimization strategy.
Transfer Learning : The technique of applying knowledge learned from one task or domain to improve performance on a different but related task or domain.
U
Upscaling : The process of increasing the resolution of an image or video while attempting to preserve or enhance quality.
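A minimal sketch of naive bicubic upscaling with `torch.nn.functional.interpolate`; learned upscalers, such as the 4K upscaling in Cosmos Transfer, replace this simple interpolation with a generative model.

```python
import torch
import torch.nn.functional as F

frames = torch.rand(16, 3, 540, 960)  # (frames, channels, height, width)
frames_4k = F.interpolate(
    frames, size=(2160, 3840), mode="bicubic", align_corners=False)
print(frames_4k.shape)  # torch.Size([16, 3, 2160, 3840])
```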
V
Video Augmentation : Techniques for modifying or enhancing video data, such as changing weather conditions, lighting, or style, to increase dataset diversity.
Video-Language Model (VLM) : A neural network model that processes both video and language inputs, capable of understanding visual content and generating or responding to text.
Visual Question Answering (VQA) : A task where models answer questions about the content of images or videos.
VRU (Vulnerable Road User) : Traffic participants who are not protected by a vehicle structure, including pedestrians, cyclists, and motorcyclists.
W
Warmup Steps : An initial training period where the learning rate gradually increases from a small value to the target learning rate, helping stabilize early training.
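A minimal sketch of linear warmup with PyTorch's `LambdaLR`; the step count and learning rate are illustrative, and real schedules typically decay after warmup rather than staying constant.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-3)
warmup_steps = 1000  # illustrative value

def lr_lambda(step):
    # Ramp linearly from 0 to the target LR, then hold it constant.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step.
```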
Weight Decay : A regularization technique that adds a penalty proportional to the magnitude of model weights to the loss function, helping prevent overfitting.
WFM (World Foundation Model) : A large-scale foundation model designed to understand and generate representations of the physical world. WFMs form the basis of the Cosmos ecosystem.
Z
Zero-Shot : The ability of a model to perform a task without having been explicitly trained on examples of that specific task, relying only on pre-training knowledge.