Intelligent Transportation Post-Training with Cosmos Reason 2¶
This notebook demonstrates how to fine-tune NVIDIA Cosmos Reason 2 for intelligent transportation scene understanding.
Overview¶
Supervised Fine-Tuning (SFT) aligns pre-trained models to specific tasks by showing clear input-output pairs. In this notebook, we fine-tune Cosmos Reason 2 to understand traffic scenes — including road attributes, pedestrian situations, and vehicle behavior.
Table of Contents¶
- Environment Setup
- 1.1. Install with uv (Recommended)
- 1.2. Switch to Cosmos Reason 2 Kernel
- 1.3. Alternative: Docker Container
- 1.4. Verify Installation
- Configuration
- 2.1. Paths and Settings
- 2.2. Hugging Face Setup
- Dataset Preparation
- 3.1. Video Helper Utilities
- 3.2. Labels and Annotations
- Zero-Shot Inference
- 4.1. Inference Helper Class
- 4.2. Run Zero-Shot Inference
- Training
- 5.1. Training Configuration
- 5.2. Update Config Paths
- 5.3. Vision Token Analysis
- 5.4. Run Training
- Evaluation
- 6.1. Run Evaluation
- Inference on Fine-Tuned Checkpoints
- 7.1. Clean Up GPU Memory
- Deployment
- 8.1. FP8 Quantization
- 8.2. Deploy with NIM
- 8.3. Test NIM API
Prerequisites¶
- Refer to the Cosmos Reason 2 GitHub repository for detailed setup instructions.
Reference: NVIDIA Cosmos Cookbook — Intelligent Transportation Post-Training
1. Environment Setup¶
Set up Cosmos Reason 2 repo and its dependencies using one of the options below.
1.1. Install with uv (Recommended)¶
Use uv to set up Cosmos Reason 2 and Cosmos-RL quickly. This step can take several minutes and requires sufficient disk space.
Before running the cell below, ensure the following system dependencies are installed:
sudo apt-get update && sudo apt-get install -y ffmpeg redis-server
import os
import sys
# Install uv and update PATH
!curl -LsSf https://astral.sh/uv/install.sh | sh
os.environ["PATH"] = f"{os.path.expanduser('~/.local/bin')}:{os.environ['PATH']}"
# Clone repositories
for repo in ["cosmos-reason2", "cosmos-cookbook"]:
if not os.path.exists(repo):
!git clone https://github.com/nvidia-cosmos/{repo}.git
# Sync environments
print("\nInstalling prerequisites...")
!uv add --project cosmos-reason2 opencv-python decord ipykernel
!uv sync --project cosmos-reason2 --extra cu128
!uv sync --project cosmos-reason2/examples/cosmos_rl
!uv run --project cosmos-reason2 python -m ipykernel install --user --name cosmos-reason2-venv --display-name "Cosmos-Reason2"
print("\n✅ Setup complete!")
1.2. Switch to Cosmos Reason 2 Kernel¶
After installing dependencies, switch to the Cosmos-Reason2 venv kernel so the inference cells run in the newly created environment.
Switch the notebook kernel:
- In JupyterLab: Kernel → Change Kernel…
- In Classic Notebook: Kernel → Change kernel
- Select Cosmos-Reason2
Verify in a cell:
import sys
print(sys.executable)
1.3. Alternative: Docker Container (Optional)¶
If you prefer running in a containerized environment, you can build and run the Cosmos Reason 2 Docker container. This requires Docker and the NVIDIA Container Toolkit.
Build the container: The build command tags the image for reuse.
CUDA Variants:
- CUDA 12.8: --build-arg=CUDA_VERSION=12.8.1 (default; requires a compatible NVIDIA driver)
- CUDA 13.0: --build-arg=CUDA_VERSION=13.0.0 (required for DGX Spark and Jetson AGX)
# Docker Container Build (Optional)
# Uncomment and run if using Docker instead of uv/pip.
# Replace /path/to/cosmos-reason2 with your local clone path.
# Build the container (run from the cosmos-reason2 repo directory):
# !cd /path/to/cosmos-reason2 && docker build -f Dockerfile --build-arg=CUDA_VERSION=12.8.1 -t cosmos-reason2:cu128 .
# For CUDA 13.0 (DGX Spark / Jetson AGX):
# !cd /path/to/cosmos-reason2 && docker build -f Dockerfile --build-arg=CUDA_VERSION=13.0.0 -t cosmos-reason2:cu130 .
print("Docker build commands (uncomment to run):")
print(" cd /path/to/cosmos-reason2")
print(" docker build -f Dockerfile --build-arg=CUDA_VERSION=12.8.1 -t cosmos-reason2:cu128 .")
Run the container:
The container mounts the current directory to /workspace and preserves venv and cache directories.
# Docker Container Run (Optional)
# Uncomment and customize before running
# docker run -it --gpus all --ipc=host --rm \
# -v .:/workspace \
# -v /workspace/.venv \
# -v /workspace/examples/cosmos_rl/.venv \
# -v /root/.cache:/root/.cache \
# -e HF_TOKEN="$HF_TOKEN" \
# cosmos-reason2:cu128
print("Docker run command (uncomment to run):")
print("""docker run -it --gpus all --ipc=host --rm \\
-v .:/workspace \\
-v /workspace/.venv \\
-v /workspace/examples/cosmos_rl/.venv \\
-v /root/.cache:/root/.cache \\
-e HF_TOKEN="$HF_TOKEN" \\
cosmos-reason2:cu128""")
print("\nOptional arguments:")
print(" --ipc=host Use host shared memory (torchrun needs this)")
print(" -v /root/.cache Mount host cache to avoid re-downloads")
print(" -e HF_TOKEN Pass Hugging Face token to container")
1.4. Verify Installation¶
Confirm that core dependencies and the cosmos-rl CLI are available before proceeding.
# Verify installations
import sys
import os
COSMOS_REASON2_REPO = os.path.join(os.getcwd(), "cosmos-reason2") # default clone location
COSMOS_RL_PATH = f"{COSMOS_REASON2_REPO}/examples/cosmos_rl"
COSMOS_RL_BIN = f"{COSMOS_RL_PATH}/.venv/bin/cosmos-rl"
print("Verifying installations:\n")
print("Checking cosmos-rl venv:")
!ls -la {COSMOS_RL_PATH}/.venv/bin/ 2>/dev/null | grep -E "cosmos|python" || echo " venv not found"
if os.path.exists(COSMOS_RL_BIN):
print(f"\n✅ cosmos-rl found at {COSMOS_RL_BIN}")
print("\ncosmos-rl --help:")
!{COSMOS_RL_BIN} --help 2>&1 | head -15
else:
print(f"\n❌ cosmos-rl not found at {COSMOS_RL_BIN}")
print("\nTry running uv sync manually:")
!cd {COSMOS_RL_PATH} && {sys.executable} -m uv sync 2>&1 | tail -30
2. Configuration¶
Set the dataset, model, and repo paths once here. The rest of the notebook references these variables.
2.1. Paths and Settings¶
Update the paths below to match your local environment. All subsequent cells reference these variables.
# Setup and Imports
import os
import json
from pathlib import Path
from IPython.display import display, Video, HTML
import numpy as np
# ==============================================================================
# CONFIGURATION — Update these paths before running the notebook
# ==============================================================================
# --- Repository Paths ---
# Path to the cloned cosmos-reason2 repository
COSMOS_REASON2_REPO = "/path/to/cosmos-reason2"
# Path to the cloned cosmos-cookbook repository (contains training scripts)
COSMOS_COOKBOOK_REPO = "/path/to/cosmos-cookbook"
# --- Dataset Paths ---
# Training dataset directory (should contain videos/ and annotations.json)
TRAIN_DATA_PATH = "/path/to/wts_data_train"
# Validation dataset directory (should contain videos/ and annotations.json)
VAL_DATA_PATH = "/path/to/wts_data_val"
# --- Model Paths ---
# Local path to the base model. If not available locally, use the
# Hugging Face ID "nvidia/Cosmos-Reason2-8B" (requires authentication — see next cell).
BASE_MODEL_PATH = "/path/to/Cosmos-Reason2-8B"
# Output directory for fine-tuned model checkpoints
FINETUNED_MODEL_PATH = "/path/to/Cosmos-Reason2-8B_FT"
# Example video for quick testing
EXAMPLE_VIDEO_PATH = "/path/to/example_video.mp4"
# --- Derived Paths (computed from above, usually no need to edit) ---
TRAIN_VIDEOS_PATH = f"{TRAIN_DATA_PATH}/videos"
TRAIN_ANNOTATIONS_PATH = f"{TRAIN_DATA_PATH}/annotations.json"
VAL_VIDEOS_PATH = f"{VAL_DATA_PATH}/videos"
VAL_ANNOTATIONS_PATH = f"{VAL_DATA_PATH}/annotations.json"
# Cosmos-RL directory (inside the cosmos-reason2 repo)
COSMOS_RL_PATH = f"{COSMOS_REASON2_REPO}/examples/cosmos_rl"
print("Configuration:")
print(f" Train Dataset Path: {TRAIN_DATA_PATH}")
print(f" Validation Dataset: {VAL_DATA_PATH}")
print(f" Base Model Path: {BASE_MODEL_PATH}")
print(f" Fine-Tuned Model Path: {FINETUNED_MODEL_PATH}")
print(f" Example Video Path: {EXAMPLE_VIDEO_PATH}")
print(f" Cosmos Reason2 Repo: {COSMOS_REASON2_REPO}")
print(f" Cosmos Cookbook Repo: {COSMOS_COOKBOOK_REPO}")
2.2. Hugging Face Setup (Optional)¶
If you are downloading the Cosmos Reason 2 model from Hugging Face (e.g., nvidia/Cosmos-Reason2-8B), you need to authenticate. This cell prompts for your HF token and performs authentication.
import subprocess
import getpass
from IPython.display import display, HTML
import time
display(HTML('<a href="https://huggingface.co/settings/tokens" target="_blank" style="font-size:16px;">🔑 Get Hugging Face Token</a>'))
time.sleep(2)
hf_token = getpass.getpass("Hugging Face Token (leave blank to skip): ").strip()
if hf_token:
result = subprocess.run(
["uvx", "hf", "auth", "login", "--token", hf_token],
capture_output=True, text=True
)
print("✅ Hugging Face login successful" if result.returncode == 0 else f"❌ Failed: {result.stderr}")
else:
print("⏭️ Skipped Hugging Face authentication")
3. Dataset Preparation¶
Before post-training a vision-language model, it helps to inspect a few samples to understand clip length, camera viewpoints, and the kinds of questions and answers available. This quick check also confirms your dataset paths are correct and that annotations align with videos.
For this notebook, we use the Woven Traffic Safety (WTS) Dataset (Environment VQA subset) as the example. It includes:
- 255 traffic scenarios
- 1,200+ video segments
- 341 videos with ~5.6k MCQ question-answer pairs
- Average video length is ~75 seconds.
Let's load and display a sample video from the dataset.
3.1. Video Helper Utilities¶
These helpers list videos, display metadata, and sample frames so you can quickly validate the dataset contents.
def list_videos(video_dir, num_samples=5):
"""List available videos in the dataset directory."""
video_extensions = ['.mp4', '.avi', '.mov', '.mkv']
videos = []
video_path = Path(video_dir)
if video_path.exists():
for ext in video_extensions:
videos.extend(list(video_path.rglob(f"*{ext}")))
return videos[:num_samples] if videos else []
def display_video_with_info(video_path, width=640):
"""Display a video with metadata information."""
import cv2
cap = cv2.VideoCapture(str(video_path))
fps = cap.get(cv2.CAP_PROP_FPS)
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
width_px = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height_px = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
duration = frame_count / fps if fps > 0 else 0
cap.release()
print(f"📹 Video: {video_path.name}")
print(f" Resolution: {width_px} x {height_px}")
print(f" FPS: {fps:.2f}")
print(f" Duration: {duration:.2f} seconds")
print(f" Total Frames: {frame_count}")
return Video(str(video_path), embed=True, width=width)
# List and display sample videos
print("🔍 Searching for videos in WTS dataset...\n")
train_videos_path = TRAIN_VIDEOS_PATH  # resolved in the configuration cell
sample_videos = list_videos(train_videos_path)
if sample_videos:
print(f"Found {len(sample_videos)} sample videos:\n")
for i, v in enumerate(sample_videos):
print(f" {i+1}. {v.name}")
# Display the first video
print("\n" + "="*60)
print("Displaying first video:")
print("="*60 + "\n")
display(display_video_with_info(sample_videos[0]))
else:
print("⚠️ No videos found. Please update TRAIN_VIDEOS_PATH to your dataset location.")
print(f" Current path: {train_videos_path}")
3.2. Labels and Annotations¶
The WTS dataset provides rich annotations including:
- Textual descriptions of pedestrian and vehicle behavior
- Traffic VQA with multiple-choice questions (MCQ)
The WTS VQA annotations are pre-processed into the LLaVA dataset format using cosmos-cookbook/scripts/examples/reason2/intelligent-transportation/data_preprocess.py. This JSON-based format is widely used for visual SFT on VLMs, including the LLaVA and Qwen-VL families. Each entry contains an id, a media reference (video or image), and a conversation pairing a human query with the expected VLM answer. Here is an example:
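If you have not downloaded the dataset yet, the sketch below shows the shape of a single entry. The field names follow the format described above; the id, file name, question, and options are invented for illustration.

```python
# Hypothetical LLaVA-format entry (structure per the description above;
# all field values are made up for illustration)
example_entry = {
    "id": "scene_0001_view1",
    "video": "videos/scene_0001_view1.mp4",
    "conversations": [
        {
            "from": "human",
            "value": "<video>\nWhat is the weather condition in the video?\n"
                     "A: Sunny\nB: Rainy\nC: Snowy\nD: Foggy",
        },
        # The expected answer is just the option letter
        {"from": "gpt", "value": "A"},
    ],
}
```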
def display_llava_format(example):
"""Pretty print a Llava-format example from the dataset."""
print("📋 Llava Dataset Format Example (from WTS):")
print("="*60)
print(json.dumps(example, indent=2))
print("="*60)
def parse_mcq_text(text):
"""Parse MCQ question/options from the WTS Llava-format prompt."""
cleaned = text.replace("<video>", " ").strip()
lines = [line.strip() for line in cleaned.splitlines() if line.strip()]
question = lines[0] if lines else ""
options = lines[1:] if len(lines) > 1 else []
return question, options
def is_correct_option(option, answer):
"""Mark the correct option based on the answer token (e.g., 'A')."""
opt = option.strip()
ans = answer.strip()
if not ans:
return False
prefixes = [f"{ans}:", f"{ans})", f"{ans}.", f"{ans} "]
return opt == ans or any(opt.startswith(prefix) for prefix in prefixes)
# Load actual MCQ examples from the training annotations
annotations_path = TRAIN_ANNOTATIONS_PATH
if not os.path.exists(annotations_path):
print("⚠️ annotations.json not found. Update TRAIN_DATA_PATH to your dataset location.")
print(f" Current path: {annotations_path}")
else:
with open(annotations_path, "r") as f:
annotations = json.load(f)
# Display a real Llava-format entry
if annotations:
display_llava_format(annotations[0])
# Display a few actual MCQ questions
print("\n\n📝 Sample MCQ Questions from the Training Set:")
print("="*60)
for i, ann in enumerate(annotations[:4], 1):
question_text, options = parse_mcq_text(ann["conversations"][0]["value"])
answer = ann["conversations"][1]["value"]
print(f"\nQ{i}: {question_text}")
for opt in options:
marker = "✓" if is_correct_option(opt, answer) else " "
print(f" [{marker}] {opt}")
print("\n" + "="*60)
4. Zero-Shot Inference¶
4.1. Inference Helper Class¶
The helper class below wraps vLLM model loading and video question answering, and works with any Cosmos Reason 2 checkpoint (base or fine-tuned).
# Inference Class for Cosmos Reason 2
class CosmosReason2Inference:
"""
Inference wrapper for fine-tuned Cosmos Reason 2 model.
"""
def __init__(self, model_path, nframes=8, max_tokens=512):
"""
Initialize the inference engine.
Args:
model_path: Path to the model checkpoint (base or fine-tuned)
nframes: Number of frames to sample from videos
max_tokens: Maximum tokens to generate
"""
self.model_path = model_path
self.nframes = nframes
self.max_tokens = max_tokens
self.llm = None
self.processor = None
self.sampling_params = None
def load_model(self):
"""Load the model using vLLM."""
try:
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
import torch
import gc
torch.cuda.empty_cache()
gc.collect()
print(f"🔄 Loading model from: {self.model_path}")
self.llm = LLM(
model=self.model_path,
tensor_parallel_size=1,
max_model_len=32768,
trust_remote_code=True,
limit_mm_per_prompt={"video": 1, "image": 0}
)
# Load processor for chat template
self.processor = AutoProcessor.from_pretrained(
self.model_path,
trust_remote_code=True
)
self.sampling_params = SamplingParams(
max_tokens=self.max_tokens,
temperature=0.0
)
print("✅ Model loaded successfully!")
return True
except ImportError:
print("⚠️ vLLM not installed. Install with: pip install vllm")
return False
except Exception as e:
print(f"❌ Error loading model: {e}")
return False
def query(self, video_path, question, system_prompt="You are a helpful assistant."):
"""
Query the model with a video and question.
Args:
video_path: Path to the video file
question: Question to ask about the video
system_prompt: System prompt for the model
Returns:
Model's response as string
"""
if self.llm is None or not hasattr(self, 'processor') or self.processor is None:
print("⚠️ Model not loaded. Call load_model() first.")
return None
try:
from qwen_vl_utils import process_vision_info
# Prepare messages with video
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": [
{"type": "video", "video": str(video_path), "nframes": self.nframes},
{"type": "text", "text": question}
]}
]
# Apply chat template to get text prompt
text_prompt = self.processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
# Extract video data using process_vision_info
image_inputs, video_inputs, video_kwargs = process_vision_info(
messages,
image_patch_size=16,
return_video_kwargs=True,
return_video_metadata=True
)
# Prepare input for vLLM generate
model_input = {
"prompt": text_prompt,
"multi_modal_data": {"video": video_inputs},
"mm_processor_kwargs": video_kwargs
}
# Run inference using generate (not chat)
outputs = self.llm.generate([model_input], self.sampling_params)
response = outputs[0].outputs[0].text
return response
except ImportError as ie:
print(f"⚠️ Import error: {ie}")
print(" Install with: pip install qwen-vl-utils")
return None
except Exception as e:
print(f"❌ Error during inference: {e}")
import traceback
traceback.print_exc()
return None
4.2. Run Zero-Shot Inference¶
Before fine-tuning, run a quick zero-shot evaluation with the base model to establish a baseline for comparison. For more ways to perform inference, refer to the official GitHub repository.
# Zero-shot inference with base model
print("="*70)
print("🔍 ZERO-SHOT INFERENCE (Base Model)")
print("="*70)
inference_base = CosmosReason2Inference(
model_path=BASE_MODEL_PATH, # Base model path
nframes=8,
max_tokens=512
)
inference_base.load_model()
# Test video
zero_shot_video = EXAMPLE_VIDEO_PATH
print(f"\nVideo: {zero_shot_video}\n")
# Sample question
question = "What is the pedestrian doing in this video?"
print("📝 Question:")
print(question)
print("-"*70)
response = inference_base.query(zero_shot_video, question)
print(f"\n✅ ANSWER: {response}")
print("="*70)
# Clean up GPU memory before training
try:
del inference_base
import torch, gc
torch.cuda.empty_cache()
gc.collect()
except Exception:
pass
5. Training¶
Configure and run supervised fine-tuning for the WTS dataset.
5.1. Training Configuration¶
The training configuration is specified in a TOML file. Key hyperparameters are optimized for training on 8x A100 GPUs. Adjust the parameters according to your hardware.
Key Configuration Highlights¶
| Parameter | Value |
|---|---|
| Learning Rate | 2e-5 with cosine decay |
| Batch Size | 32 per replica |
| Model | nvidia/Cosmos-Reason2-2B (or 8B) |
| Max Length | 32,768 tokens |
| Vision | 8 frames uniformly sampled (nframes=8) |
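For orientation, these values map onto the TOML roughly as sketched below. The section and key names match the fields this notebook parses (policy.model_name_or_path, train.optm_lr, train.train_batch_per_replica, policy.model_max_length); the real sft_config.toml contains many more settings.

```toml
# Sketch of the key fields only — not a complete, runnable config
[policy]
model_name_or_path = "nvidia/Cosmos-Reason2-2B"
model_max_length = 32768

[train]
optm_lr = 2e-5
train_batch_per_replica = 32
```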
# Use the official training config from the cosmos-cookbook repo
CONFIG_FILE = "scripts/examples/reason2/intelligent-transportation/sft_config.toml"
CONFIG_PATH = f"{COSMOS_COOKBOOK_REPO}/{CONFIG_FILE}"
# Display the raw config file
print("📄 Official Training Config from cosmos-cookbook")
print(f" Source: github.com/nvidia-cosmos/cosmos-cookbook/{CONFIG_FILE}\n")
print("="*70)
!cat {CONFIG_PATH}
print("="*70)
# Parse and show key parameters
try:
import tomllib
except ImportError:
try:
import tomli as tomllib
except ImportError:
import pip._vendor.tomli as tomllib
with open(CONFIG_PATH, "rb") as f:
config = tomllib.load(f)
print("\n🔑 Key Training Parameters:\n")
print(f" Model: {config['policy']['model_name_or_path']}")
print(f" Learning Rate: {config['train']['optm_lr']}")
print(f" Batch Size: {config['train']['train_batch_per_replica']} per GPU")
print(f" Max Seq Length: {config['policy']['model_max_length']}")
5.2. Update Config Paths¶
Patch the sft_config.toml file with your local dataset and output paths. This keeps the training script aligned with your environment.
Also update dp_shard_size under [policy.parallelism] to match your GPU count (for this setup, use dp_shard_size = 4), and tune train_batch_per_replica (batch size) and model_max_length (context length) based on available GPU memory. If you encounter OOM errors, reduce these values.
# Update sft_config.toml with actual paths
import os
import subprocess
from pathlib import Path
CONFIG_PATH = f"{COSMOS_COOKBOOK_REPO}/scripts/examples/reason2/intelligent-transportation/sft_config.toml"
print("📝 Updating sft_config.toml with actual paths...\n")
# Use sed to update the config file directly (in-place)
subprocess.run(["sed", "-i", f's|annotation_path = .*|annotation_path = "{TRAIN_ANNOTATIONS_PATH}"|', CONFIG_PATH])
subprocess.run(["sed", "-i", f's|media_path = .*|media_path = "{TRAIN_VIDEOS_PATH}"|', CONFIG_PATH])
subprocess.run(["sed", "-i", f's|output_dir = .*|output_dir = "{FINETUNED_MODEL_PATH}"|', CONFIG_PATH])
print(f" annotation_path: {TRAIN_ANNOTATIONS_PATH}")
print(f" media_path: {TRAIN_VIDEOS_PATH}")
print(f" output_dir: {FINETUNED_MODEL_PATH}")
# Verify paths exist
print("\n🔍 Verifying paths:")
ann_path = TRAIN_ANNOTATIONS_PATH
media_path = TRAIN_VIDEOS_PATH
if os.path.exists(ann_path):
print(f" ✅ annotations.json exists")
else:
print(f" ❌ annotations.json NOT found at {ann_path}")
if os.path.exists(media_path):
print(f" ✅ videos directory exists")
video_files = list(Path(media_path).rglob("*.mp4"))
print(f" Found {len(video_files)} video files")
else:
print(f" ❌ videos directory NOT found at {media_path}")
# Show updated config section
print("\n📄 Updated [custom.dataset] section:")
print("="*50)
!grep -A3 "\[custom.dataset\]" {CONFIG_PATH}
print("="*50)
print("\n📄 Updated [policy.parallelism] section:")
print("="*50)
!grep -A6 "\[policy.parallelism\]" {CONFIG_PATH}
print("="*50)
5.3. Vision Token Analysis¶
Understanding how vision tokens are calculated is crucial for optimizing training. Qwen3-VL (the backbone of Cosmos Reason 2) compresses input videos in both space and time:
Compression Factors¶
- Spatial Compression: Effective patch size = 32 (16 patch × 2 spatial merge)
- Temporal Compression: Effective temporal step = 2 (2 frames merge into 1)
Ablation Configurations¶
- nframes=8 (~3k tokens) — Fewer frames, higher resolution per frame
- fps=1, 8M pixels (~8k tokens) — More frames, lower resolution per frame
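The compression factors above can be turned into a rough token estimator. This is a simplified sketch that ignores the model's internal smart-resize step, so treat the numbers as approximations; the 1280×720 sample resolution is an assumption.

```python
def estimate_vision_tokens(width, height, nframes,
                           patch=16, spatial_merge=2, temporal_merge=2):
    """Rough vision-token count for Qwen3-VL-style video encoding."""
    eff_patch = patch * spatial_merge                    # effective patch size = 32
    tokens_per_frame = (width // eff_patch) * (height // eff_patch)
    temporal_groups = max(1, nframes // temporal_merge)  # 2 frames merge into 1
    return temporal_groups * tokens_per_frame

# 8 uniformly sampled frames at 1280x720:
print(estimate_vision_tokens(1280, 720, 8))  # → 3520 (~3k tokens, matching nframes=8)
```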
# OPTIONAL ablation config: fps=1, total_pixels=8M
# Uncomment and run the following cell to enable the ablation config
# import re
# from pathlib import Path
# config_path = Path(CONFIG_PATH)
# config_text = config_path.read_text(encoding="utf-8")
# def upsert(text, key, value, anchor_pattern):
# key_pattern = rf"(?m)^{key}\s*=.*$"
# if re.search(key_pattern, text):
# return re.sub(key_pattern, f"{key} = {value}", text, count=1)
# return re.sub(anchor_pattern, lambda m: f"{m.group(0)}\n{key} = {value}", text, count=1)
# config_text = upsert(config_text, "fps", 1, r"(?m)^\[custom\.vision\]\s*$")
# config_text = upsert(config_text, "total_pixels", 8388608, r"(?m)^fps\s*=.*$")
# config_path.write_text(config_text, encoding="utf-8")
# !grep -A4 "\[custom.vision\]" {CONFIG_PATH}
# print("\nYou can now train with this config for the fps=1 / 8M-pixels ablation study.")
5.4. Run Training¶
Now we launch the SFT training using the Cosmos-RL framework. The training uses:
- Supervised Fine-Tuning (SFT) on MCQ data
Please refer to Cosmos-RL docs for system requirements.
Training time: ~1 hour 16 minutes for the 3k-vision-token configuration (nframes=8) on 8× A100s.
Troubleshooting
- CUDA out of memory: reduce the batch size, lower nframes, or decrease model_max_length; restart the kernel to clear GPU memory.
# Run Training with Cosmos-RL (using cosmos-rl's own venv)
import os
import sys
COSMOS_RL_VENV = f"{COSMOS_RL_PATH}/.venv"
TRAINING_DIR = f"{COSMOS_COOKBOOK_REPO}/scripts/examples/reason2/intelligent-transportation"
print("🚀 Running Training with Cosmos-RL")
print("="*70)
print(f" Working Dir: {TRAINING_DIR}")
print(f" Config: sft_config.toml")
print(f" Script: custom_sft.py")
print("="*70)
# Check Redis package installed
try:
import redis
except ImportError as exc:
raise ImportError(
"Redis is required for training. Install the system service "
"(e.g., sudo apt-get install redis-server) and ensure the 'redis' "
"Python package is available in this environment."
) from exc
print("\n⏱️ Expected training time (8× A100):")
print(" - 3k tokens (nframes=8): ~1h 16m for 3 epochs")
# Setup cosmos-rl venv if needed
if not os.path.exists(f"{COSMOS_RL_VENV}/bin/cosmos-rl"):
print("\n📦 Setting up cosmos-rl venv with uv sync...")
!cd {COSMOS_RL_PATH} && pip install -q uv && uv sync
# Run training - activate the venv so subprocesses get the right python.
# Note: '!' cells run under /bin/sh, where 'source' may be unavailable;
# use the POSIX '.' command instead.
print("\n🔄 Starting training...\n")
!. {COSMOS_RL_VENV}/bin/activate && cd {TRAINING_DIR} && cosmos-rl --config sft_config.toml custom_sft.py
6. Evaluation¶
Measure performance on the validation set using the official evaluation script.
6.1. Run Evaluation¶
After training, we evaluate the model on the validation set of the WTS Environment VQA dataset:
- 171 videos with 2.6k MCQ questions (unseen during training)
- Evaluation uses vLLM inference engine for efficient batch processing
- Metrics: Accuracy on multiple-choice questions. After the evaluation is finished, you can find the accuracy in the results.json under the results folder.
Before running evaluation, you need to set FINETUNED_CHECKPOINT_PATH to the actual checkpoint folder (for example: {FINETUNED_MODEL_PATH}/<timestamp>/safetensors/step_<n>).
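To avoid typing the timestamp by hand, a small helper can locate the newest checkpoint, assuming the {FINETUNED_MODEL_PATH}/<timestamp>/safetensors/step_<n> layout described above (the helper name is ours, not part of the cookbook):

```python
import glob
import os

def latest_checkpoint(base_dir):
    """Return the most recently modified step_* checkpoint folder, or None."""
    pattern = os.path.join(base_dir, "*", "safetensors", "step_*")
    candidates = [p for p in glob.glob(pattern) if os.path.isdir(p)]
    return max(candidates, key=os.path.getmtime) if candidates else None

# Usage (assumes training has produced at least one checkpoint):
# FINETUNED_CHECKPOINT_PATH = latest_checkpoint(FINETUNED_MODEL_PATH)
```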
# Set this to the fine-tuned checkpoint folder you want to evaluate/use
# Example: {FINETUNED_MODEL_PATH}/20260210003314/safetensors/step_1
FINETUNED_CHECKPOINT_PATH = f"{FINETUNED_MODEL_PATH}/<timestamp>/safetensors/step_<n>"
if "<" in FINETUNED_CHECKPOINT_PATH or FINETUNED_CHECKPOINT_PATH.startswith("/path/to"):
raise ValueError("Please set FINETUNED_CHECKPOINT_PATH to your actual checkpoint folder before continuing.")
print(f"Using checkpoint: {FINETUNED_CHECKPOINT_PATH}")
# Run Evaluation using cosmos-cookbook script
import sys
import os
import subprocess
import logging
EVAL_DIR = f"{COSMOS_COOKBOOK_REPO}/scripts/examples/reason2/intelligent-transportation"
EVAL_CONFIG = f"{EVAL_DIR}/eval_config.yaml"
# Suppress INFO logs (notebook + child processes)
EVAL_LOG_LEVEL = "WARNING" # evaluate.py/logger level
logging.getLogger().setLevel(logging.WARNING)
os.environ["LOGLEVEL"] = EVAL_LOG_LEVEL
# Update paths in eval_config.yaml (preserve rest of config)
print("\n📝 Updating paths in eval_config.yaml...")
subprocess.run(["sed", "-i", f's|annotation_path:.*|annotation_path: {VAL_ANNOTATIONS_PATH}|', EVAL_CONFIG])
subprocess.run(["sed", "-i", f's|media_dir:.*|media_dir: {VAL_VIDEOS_PATH}|', EVAL_CONFIG])
subprocess.run(["sed", "-i", f's|model_name:.*|model_name: {FINETUNED_CHECKPOINT_PATH}|', EVAL_CONFIG])
print(f" annotation_path: {VAL_ANNOTATIONS_PATH}")
print(f" media_dir: {VAL_VIDEOS_PATH}")
print(f" model_name: {FINETUNED_CHECKPOINT_PATH}")
# Show updated config
print("\n📄 Evaluation Config:")
print("="*70)
!cat {EVAL_CONFIG}
print("="*70)
# Remove any existing eval run results
!rm -rf "{EVAL_DIR}/results/post_trained_cr2"
# Run evaluation
print("\n🔄 Starting evaluation...\n")
!cd {EVAL_DIR} && LOGLEVEL={EVAL_LOG_LEVEL} {sys.executable} evaluate.py --config eval_config.yaml
# Load and display results from evaluate.py
import json
import os
import glob
EVAL_DIR = f"{COSMOS_COOKBOOK_REPO}/scripts/examples/reason2/intelligent-transportation"
RESULTS_BASE = os.path.join(EVAL_DIR, "results")
# Find all results.json files from evaluate.py output
result_files = glob.glob(os.path.join(RESULTS_BASE, "**/results.json"), recursive=True)
if result_files:
# Use the most recent results file
result_file = max(result_files, key=os.path.getmtime)
with open(result_file, 'r') as f:
metrics = json.load(f)
print("="*60)
print("📊 EVALUATION RESULTS")
print("="*60)
print(f"\n Accuracy: {metrics['accuracy']*100:.2f}%")
print(f" Correct: {metrics['total_correct']} / {metrics['total_questions']}")
print(f"\n Results: {result_file}")
print("\n" + "="*60)
else:
print(f"ℹ️ No results found in {RESULTS_BASE}")
print(" Run the evaluation cell above first.")
7. Inference on Fine-Tuned Checkpoints¶
Run inference on custom traffic videos using the fine-tuned Cosmos Reason 2 model. The model can answer both multiple-choice and open-ended questions about traffic scenes.
# Fine-tuned inference demo
inference = CosmosReason2Inference(model_path=FINETUNED_CHECKPOINT_PATH, nframes=8)
inference.load_model()
# Sample questions for traffic scene understanding
ANSWER_STYLE_SUFFIX = "Answer with a sentence."
SAMPLE_QUESTIONS = [
"What type of road is shown in this video?",
"How many vehicles can you see in the scene?",
"Is there any pedestrian in the video? If yes, what are they doing?",
"What potential traffic hazards do you observe?",
"Describe the overall traffic flow and density.",
]
SAMPLE_QUESTIONS = [f"{q} {ANSWER_STYLE_SUFFIX}" for q in SAMPLE_QUESTIONS]
test_video = EXAMPLE_VIDEO_PATH
print("\n" + "="*70)
print("🎬 TESTING INFERENCE ON TRAFFIC VIDEO (FINE-TUNED)")
print("="*70)
print(f"Video: {test_video}")
print(f"Total Questions: {len(SAMPLE_QUESTIONS)}")
print("="*70 + "\n")
for i, question in enumerate(SAMPLE_QUESTIONS, 1):
print(f"📝 Question {i}/{len(SAMPLE_QUESTIONS)}")
print("-"*70)
print(question)
print("-"*70)
response = inference.query(test_video, question)
print(f"✅ ANSWER: {response}")
print("="*70 + "\n")
print("✅ All questions processed successfully!")
7.1. Clean Up GPU Memory¶
Before proceeding with quantization and deployment, terminate any remaining vLLM processes to free GPU memory.
import os
import signal
import subprocess
out = subprocess.check_output(
["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv,noheader"]
).decode().strip()
terminated = 0
for line in out.splitlines():
if not line.strip():
continue
pid, name = [x.strip() for x in line.split(",", 1)]
if "VLLM" in name or "EngineCore" in name:
os.kill(int(pid), signal.SIGTERM)
terminated += 1
if terminated:
print(f"✅ Terminated {terminated} vLLM GPU worker process(es).")
else:
print("✅ No vLLM GPU worker processes found.")
Once inference results look satisfactory, the next step is to prepare the model for production deployment.
8. Deployment¶
Prepare the fine-tuned model for production use and serve it with NIM.
8.1. FP8 Quantization¶
Quantize the fine-tuned model to FP8 precision to reduce memory footprint and improve inference throughput. The quantization script is included in the Cosmos Reason 2 repository.
# FP8 Quantization Configuration
FP8_MODEL_OUTPUT_PATH = f"{FINETUNED_MODEL_PATH}_fp8"
QUANTIZATION_CONFIG = {
"model_path": FINETUNED_CHECKPOINT_PATH,  # quantize the evaluated fine-tuned checkpoint
"output_path": FP8_MODEL_OUTPUT_PATH,
"precision": "fp8",
}
quantize_script = f"{COSMOS_REASON2_REPO}/scripts/quantize.py"
print("Quantization Setup")
print("="*70)
print(f" Input Model: {QUANTIZATION_CONFIG['model_path']}")
print(f" Output Path: {QUANTIZATION_CONFIG['output_path']}")
print(f" Precision: {QUANTIZATION_CONFIG['precision']}")
print("="*70)
QUANTIZE_CMD = f"""\
python {quantize_script} \\
--model "{QUANTIZATION_CONFIG['model_path']}" \\
-o "{QUANTIZATION_CONFIG['output_path']}" \\
--precision {QUANTIZATION_CONFIG['precision']}
"""
print("\nQuantization Command:")
print("-"*70)
print(QUANTIZE_CMD)
print("-"*70)
Run the FP8 quantization command below. This requires the Cosmos-RL virtual environment to be set up (see Section 1). The quantized model will be saved to the output path shown above.
import os
COSMOS_RL_VENV = f"{COSMOS_RL_PATH}/.venv"
# Run FP8 Quantization (Shell Command)
!. {COSMOS_RL_VENV}/bin/activate && pip install -q uv && {COSMOS_REASON2_REPO}/scripts/quantize.py \
--model "{QUANTIZATION_CONFIG['model_path']}" \
-o "{QUANTIZATION_CONFIG['output_path']}" \
--precision fp8
8.2. Deploy with NIM¶
You need an NGC API key to pull the Cosmos Reason 2 NIM image. This cell prompts for your key and performs a Docker login.
# NGC Login
import subprocess
import getpass
from IPython.display import display, HTML
import time
display(HTML('<a href="https://org.ngc.nvidia.com/setup/api-key" target="_blank" style="font-size:16px;">🔑 Get NGC API Key</a>'))
time.sleep(2)
ngc_api_key = getpass.getpass("NGC API Key: ").strip()
if ngc_api_key:
result = subprocess.run(
["docker", "login", "nvcr.io", "-u", "$oauthtoken", "--password-stdin"],
input=ngc_api_key, text=True, capture_output=True
)
print("✅ Login successful" if result.returncode == 0 else f"❌ Failed: {result.stderr}")
else:
print("❌ No key provided")
NIM Deployment Configuration¶
Define the model path, NIM image, and runtime parameters before launching the container.
Ensure GPU memory is free before deployment and adjust the max_model_len parameter below based on available GPU memory. Reduce it if you encounter CUDA out-of-memory errors.
All necessary tags and parameters are included in the Docker command below. For additional NIM configuration options, refer to the official NIM Configuration page.
By default, the nvidia-container-runtime mounts only a minimal set of driver libraries for security and efficiency. The default NVIDIA_DRIVER_CAPABILITIES value is compute,utility, which includes:
- compute: CUDA libraries (libcuda.so, libnvcuvid.so, etc.)
- utility: management tools (nvidia-smi, nvidia-debugdump)
Not included by default:
- video: NVENC/NVDEC video codec libraries (libnvidia-encode.so, libnvidia-decode.so)
- graphics: OpenGL libraries
- display: X11 libraries
Please refer to NVIDIA Container Toolkit guide for more help.
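If the container needs GPU video decode (NVDEC) for reading videos, the video capability can be requested explicitly when launching the container. A sketch of building the extra `-e` flag; the capability names follow the NVIDIA Container Toolkit conventions, and the default set chosen here is an assumption to adjust for your workload:

```python
def driver_capability_flags(capabilities=("compute", "utility", "video")):
    """Build the docker run flag that requests extra NVIDIA driver capabilities."""
    return ["-e", f"NVIDIA_DRIVER_CAPABILITIES={','.join(capabilities)}"]

# Append the returned flag to the docker run command in the next cell,
# e.g. -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
```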
# NVIDIA NIM Deployment Configuration
NIM_CONFIG = {
"model_path": f"{QUANTIZATION_CONFIG['output_path']}/model_fp8", # FP8 quantized model
"nim_image": "nvcr.io/nim/nvidia/cosmos-reason2-8b:latest",
"model_name": "cosmos-reason2-wts",
"port": 8000,
"shm_size": "32GB",
"max_model_len": 131072, # 128k tokens context (supports up to 256k); reduce for lower memory usage
"allowed_local_media_path": "/path/to/media" # UPDATE THIS PATH TO YOUR LOCAL VIDEO PATH
}
print("🚀 NVIDIA NIM Deployment")
print("="*70)
print(f"Model: {NIM_CONFIG['model_path']}")
print(f"Max Context Length: {NIM_CONFIG['max_model_len']:,} tokens")
print(f"Port: {NIM_CONFIG['port']}")
print("="*70)
# Shell Command for NIM Deployment
NIM_DEPLOY_CMD = f"""
# Set environment variables
export CUSTOM_WEIGHTS="{NIM_CONFIG['model_path']}"
export NIM_IMAGE="{NIM_CONFIG['nim_image']}"
# Launch NIM container
docker run -d --name=cosmos-reason2-wts \\
--gpus all \\
--shm-size={NIM_CONFIG['shm_size']} \\
-e NIM_MODEL_NAME=$CUSTOM_WEIGHTS \\
-e NIM_SERVED_MODEL_NAME="{NIM_CONFIG['model_name']}" \\
-e NIM_MAX_MODEL_LEN={NIM_CONFIG['max_model_len']} \\
-e NIM_ALLOWED_LOCAL_MEDIA_PATH="{NIM_CONFIG['allowed_local_media_path']}" \\
-e NVIDIA_VISIBLE_DEVICES=all \\
-v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \\
-v {NIM_CONFIG['allowed_local_media_path']}:{NIM_CONFIG['allowed_local_media_path']}:ro \\
-u $(id -u) \\
-p {NIM_CONFIG['port']}:8000 \\
$NIM_IMAGE \\
/opt/nim/start_server.sh --allowed-local-media-path {NIM_CONFIG['allowed_local_media_path']}
# Wait for startup (takes ~2-3 minutes)
# Check the deployment status using
docker logs -f cosmos-reason2-wts
# Health check
curl http://localhost:{NIM_CONFIG['port']}/v1/health/ready | jq .
"""
print("\n💡 Steps:")
print(" 1. Run the commands below")
print(" 2. Monitor with: docker logs -f cosmos-reason2-wts")
print(" 3. Stop with: docker stop cosmos-reason2-wts \n")
print("-"*70)
print("📝 NIM Deployment Commands:")
print("-"*70)
print(NIM_DEPLOY_CMD)
print("-"*70)
8.3. Test NIM API¶
Send a sample request to the local NIM endpoint to confirm the deployment is responding correctly. The examples below run inference on both a remote video URL and a local video file; replace them with your own videos as needed.
import os
import requests
from typing import Tuple, Dict, Any
# NIM Client Settings
NIM_ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "cosmos-reason2-wts"
def to_file_url(path: str) -> str:
if not os.path.isabs(path):
raise ValueError("Local video must be an absolute path")
return f"file://{path}"
def nim_video_chat(
prompt: str,
video: str,
fps: float = 1.0,
timeout: int = 120,
) -> Tuple[str, Dict[str, Any]]:
if video.startswith(("http://", "https://", "data:", "file://")):
video_url = video
else:
video_url = to_file_url(video)
payload = {
"model": MODEL_NAME,
"messages": [{
"role": "user",
"content": [
{"type": "video_url", "video_url": {"url": video_url}},
{"type": "text", "text": prompt},
],
}],
"media_io_kwargs": {"video": {"fps": fps}},
"stream": False,
}
resp = requests.post(NIM_ENDPOINT, json=payload, timeout=timeout)
if not resp.ok:
print(f"[HTTP {resp.status_code}] {resp.text}")
resp.raise_for_status()
data = resp.json()
return data["choices"][0]["message"]["content"], data
# --- Example: Remote + Local Video ---
print("▶ Remote video")
remote_answer, _ = nim_video_chat(
prompt="What is in this video?",
video="https://download.samplelib.com/mp4/sample-5s.mp4",
fps=4.0,
)
print("Answer:", remote_answer)
print("\n" + "-" * 60 + "\n")
print("▶ Local video")
local_answer, _ = nim_video_chat(
prompt="What is in this video?",
video=EXAMPLE_VIDEO_PATH,
fps=4.0,
)
print("Answer:", local_answer)