Intelligent Transportation Post-Training with Cosmos Reason 2¶
This notebook demonstrates how to fine-tune NVIDIA Cosmos Reason 2 for intelligent transportation scene understanding.
Overview¶
Supervised Fine-Tuning (SFT) aligns pre-trained models to specific tasks by showing clear input-output pairs. In this notebook, we fine-tune Cosmos Reason 2 to understand traffic scenes — including road attributes, pedestrian situations, and vehicle behavior.
Table of Contents¶
- Environment Setup
- 1.1. Install with uv (Recommended)
- 1.2. Switch to Cosmos Reason 2 Kernel
- 1.3. Alternative: Docker Container
- 1.4. Verify Installation
- Configuration
- 2.1. Paths and Settings
- 2.2. Hugging Face Setup
- Dataset Preparation
- 3.1. Video Helper Utilities
- 3.2. Labels and Annotations
- Zero-Shot Inference
- 4.1. Inference Helper Class
- 4.2. Run Zero-Shot Inference
- Training
- 5.1. Training Configuration
- 5.2. Update Config Paths
- 5.3. Vision Token Analysis
- 5.4. Run Training
- Evaluation
- 6.1. Run Evaluation
- Inference on Fine-Tuned Checkpoints
- 7.1. Clean Up GPU Memory
- Deployment
- 8.1. FP8 Quantization
- 8.2. Deploy with NIM
- 8.3. Test NIM API
Prerequisites¶
- Refer to the Cosmos Reason 2 GitHub repository for detailed setup instructions.
Reference: NVIDIA Cosmos Cookbook — Intelligent Transportation Post-Training
1. Environment Setup¶
Set up Cosmos Reason 2 repo and its dependencies using one of the options below.
1.1. Install with uv (Recommended)¶
Use uv to set up Cosmos Reason 2 and Cosmos-RL quickly. This step can take several minutes and requires sufficient disk space.
Before running the cell below, ensure the following system dependencies are installed:
sudo apt-get update && sudo apt-get install -y ffmpeg redis-server
import os
import sys
# Install uv and update PATH
!curl -LsSf https://astral.sh/uv/install.sh | sh
os.environ["PATH"] = f"{os.path.expanduser('~/.local/bin')}:{os.environ['PATH']}"
# Clone repositories
for repo in ["cosmos-reason2", "cosmos-cookbook"]:
if not os.path.exists(repo):
!git clone https://github.com/nvidia-cosmos/{repo}.git
# Sync environments
print("\nInstalling prerequisites...")
!uv add --project cosmos-reason2 opencv-python decord ipykernel
!uv sync --project cosmos-reason2 --extra cu128
!uv sync --project cosmos-reason2/examples/cosmos_rl
!uv run --project cosmos-reason2 python -m ipykernel install --user --name cosmos-reason2-venv --display-name "Cosmos-Reason2"
print("\n✅ Setup complete!")
1.2. Switch to Cosmos Reason 2 Kernel¶
After installing dependencies, switch to the Cosmos-Reason2 venv kernel so the inference cells run in the newly created environment.
Switch the notebook kernel:
- In JupyterLab: Kernel → Change Kernel…
- In Classic Notebook: Kernel → Change kernel
- Select Cosmos-Reason2
Verify in a cell:
import sys
print(sys.executable)
1.3. Alternative: Docker Container (Optional)¶
If you prefer running in a containerized environment, you can build and run the Cosmos Reason 2 Docker container. This requires Docker and the NVIDIA Container Toolkit.
Build the container: The build command tags the image for reuse.
CUDA Variants:
- CUDA 12.8: --build-arg=CUDA_VERSION=12.8.1 (default; requires a compatible NVIDIA driver)
- CUDA 13.0: --build-arg=CUDA_VERSION=13.0.0 (required for DGX Spark and Jetson AGX)
# Docker Container Build (Optional)
# Uncomment and run if using Docker instead of uv/pip.
# Replace /path/to/cosmos-reason2 with your local clone path.
# Build the container (run from the cosmos-reason2 repo directory):
# !cd /path/to/cosmos-reason2 && docker build -f Dockerfile --build-arg=CUDA_VERSION=12.8.1 -t cosmos-reason2:cu128 .
# For CUDA 13.0 (DGX Spark / Jetson AGX):
# !cd /path/to/cosmos-reason2 && docker build -f Dockerfile --build-arg=CUDA_VERSION=13.0.0 -t cosmos-reason2:cu130 .
print("Docker build commands (uncomment to run):")
print(" cd /path/to/cosmos-reason2")
print(" docker build -f Dockerfile --build-arg=CUDA_VERSION=12.8.1 -t cosmos-reason2:cu128 .")
Run the container:
The container mounts the current directory to /workspace and preserves venv and cache directories.
# Docker Container Run (Optional)
# Uncomment and customize before running
# docker run -it --gpus all --ipc=host --rm \
# -v .:/workspace \
# -v /workspace/.venv \
# -v /workspace/examples/cosmos_rl/.venv \
# -v /root/.cache:/root/.cache \
# -e HF_TOKEN="$HF_TOKEN" \
# cosmos-reason2:cu128
print("Docker run command (uncomment to run):")
print("""docker run -it --gpus all --ipc=host --rm \\
-v .:/workspace \\
-v /workspace/.venv \\
-v /workspace/examples/cosmos_rl/.venv \\
-v /root/.cache:/root/.cache \\
-e HF_TOKEN="$HF_TOKEN" \\
cosmos-reason2:cu128""")
print("\nOptional arguments:")
print(" --ipc=host Use host shared memory (torchrun needs this)")
print(" -v /root/.cache Mount host cache to avoid re-downloads")
print(" -e HF_TOKEN Pass Hugging Face token to container")
1.4. Verify Installation¶
Confirm that core dependencies and the cosmos-rl CLI are available before proceeding.
# Verify installations
import sys
import os
COSMOS_REASON2_REPO = os.path.join(os.getcwd(), "cosmos-reason2") # default clone location
COSMOS_RL_PATH = f"{COSMOS_REASON2_REPO}/examples/cosmos_rl"
COSMOS_RL_BIN = f"{COSMOS_RL_PATH}/.venv/bin/cosmos-rl"
print("Verifying installations:\n")
print("Checking cosmos-rl venv:")
!ls -la {COSMOS_RL_PATH}/.venv/bin/ 2>/dev/null | grep -E "cosmos|python" || echo " venv not found"
if os.path.exists(COSMOS_RL_BIN):
print(f"\n✅ cosmos-rl found at {COSMOS_RL_BIN}")
print("\ncosmos-rl --help:")
!{COSMOS_RL_BIN} --help 2>&1 | head -15
else:
print(f"\n❌ cosmos-rl not found at {COSMOS_RL_BIN}")
print("\nTry running uv sync manually:")
!cd {COSMOS_RL_PATH} && {sys.executable} -m uv sync 2>&1 | tail -30
2. Configuration¶
Set the dataset, model, and repo paths once here. The rest of the notebook references these variables.
2.1. Paths and Settings¶
Update the paths below to match your local environment. All subsequent cells reference these variables.
# Setup and Imports
import os
import json
from pathlib import Path
from IPython.display import display, Video, HTML
import numpy as np
# ==============================================================================
# CONFIGURATION — Update these paths before running the notebook
# ==============================================================================
# --- Repository Paths ---
# Path to the cloned cosmos-reason2 repository
COSMOS_REASON2_REPO = "/path/to/cosmos-reason2"
# Path to the cloned cosmos-cookbook repository (contains training scripts)
COSMOS_COOKBOOK_REPO = "/path/to/cosmos-cookbook"
# --- Dataset Paths ---
# Training dataset directory (should contain videos/ and annotations.json)
TRAIN_DATA_PATH = "/path/to/wts_data_train"
# Validation dataset directory (should contain videos/ and annotations.json)
VAL_DATA_PATH = "/path/to/wts_data_val"
# --- Model Paths ---
# Local path to the base model. If not available locally, use the
# Hugging Face ID "nvidia/Cosmos-Reason2-8B" (requires authentication — see next cell).
BASE_MODEL_PATH = "/path/to/Cosmos-Reason2-8B"
# Output directory for fine-tuned model checkpoints
FINETUNED_MODEL_PATH = "/path/to/Cosmos-Reason2-8B_FT"
# Example video for quick testing
EXAMPLE_VIDEO_PATH = "/path/to/example_video.mp4"
# --- Derived Paths (computed from above, usually no need to edit) ---
TRAIN_VIDEOS_PATH = f"{TRAIN_DATA_PATH}/videos"
TRAIN_ANNOTATIONS_PATH = f"{TRAIN_DATA_PATH}/annotations.json"
VAL_VIDEOS_PATH = f"{VAL_DATA_PATH}/videos"
VAL_ANNOTATIONS_PATH = f"{VAL_DATA_PATH}/annotations.json"
# Cosmos-RL directory (inside the cosmos-reason2 repo)
COSMOS_RL_PATH = f"{COSMOS_REASON2_REPO}/examples/cosmos_rl"
print("Configuration:")
print(f" Train Dataset Path: {TRAIN_DATA_PATH}")
print(f" Validation Dataset: {VAL_DATA_PATH}")
print(f" Base Model Path: {BASE_MODEL_PATH}")
print(f" Fine-Tuned Model Path: {FINETUNED_MODEL_PATH}")
print(f" Example Video Path: {EXAMPLE_VIDEO_PATH}")
print(f" Cosmos Reason2 Repo: {COSMOS_REASON2_REPO}")
print(f" Cosmos Cookbook Repo: {COSMOS_COOKBOOK_REPO}")
2.2. Hugging Face Setup (Optional)¶
If you are downloading the Cosmos Reason 2 model from Hugging Face (e.g., nvidia/Cosmos-Reason2-8B), you need to authenticate. This cell prompts for your HF token and performs authentication.
import subprocess
import getpass
from IPython.display import display, HTML
import time
display(HTML('<a href="https://huggingface.co/settings/tokens" target="_blank" style="font-size:16px;">🔑 Get Hugging Face Token</a>'))
time.sleep(2)
hf_token = getpass.getpass("Hugging Face Token (leave blank to skip): ").strip()
if hf_token:
result = subprocess.run(
["uvx", "hf", "auth", "login", "--token", hf_token],
capture_output=True, text=True
)
print("✅ Hugging Face login successful" if result.returncode == 0 else f"❌ Failed: {result.stderr}")
else:
print("⏭️ Skipped Hugging Face authentication")
3. Dataset Preparation¶
Before post-training a vision-language model, it helps to inspect a few samples to understand clip length, camera viewpoints, and the kinds of questions and answers available. This quick check also confirms your dataset paths are correct and that annotations align with videos.
For this notebook, we use the Woven Traffic Safety (WTS) Dataset (Environment VQA subset) as the example. It includes:
- 255 traffic scenarios
- 1,200+ video segments
- 341 videos with ~5.6k MCQ question-answer pairs
- Average video length is ~75 seconds.
Let's load and display a sample video from the dataset.
3.1. Video Helper Utilities¶
These helpers list videos, display metadata, and sample frames so you can quickly validate the dataset contents.
def list_videos(video_dir, num_samples=5):
"""List available videos in the dataset directory."""
video_extensions = ['.mp4', '.avi', '.mov', '.mkv']
videos = []
video_path = Path(video_dir)
if video_path.exists():
for ext in video_extensions:
videos.extend(list(video_path.rglob(f"*{ext}")))
return videos[:num_samples] if videos else []
def display_video_with_info(video_path, width=640):
"""Display a video with metadata information."""
import cv2
cap = cv2.VideoCapture(str(video_path))
fps = cap.get(cv2.CAP_PROP_FPS)
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
width_px = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height_px = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
duration = frame_count / fps if fps > 0 else 0
cap.release()
print(f"📹 Video: {video_path.name}")
print(f" Resolution: {width_px} x {height_px}")
print(f" FPS: {fps:.2f}")
print(f" Duration: {duration:.2f} seconds")
print(f" Total Frames: {frame_count}")
return Video(str(video_path), embed=True, width=width)
# List and display sample videos
print("🔍 Searching for videos in WTS dataset...\n")
train_videos_path = TRAIN_VIDEOS_PATH  # resolved in the configuration cell
sample_videos = list_videos(train_videos_path)
if sample_videos:
print(f"Found {len(sample_videos)} sample videos:\n")
for i, v in enumerate(sample_videos):
print(f" {i+1}. {v.name}")
# Display the first video
print("\n" + "="*60)
print("Displaying first video:")
print("="*60 + "\n")
display(display_video_with_info(sample_videos[0]))
else:
print("⚠️ No videos found. Please update TRAIN_VIDEOS_PATH to your dataset location.")
print(f" Current path: {train_videos_path}")
3.2. Labels and Annotations¶
The WTS dataset provides rich annotations including:
- Textual descriptions of pedestrian and vehicle behavior
- Traffic VQA with multiple-choice questions (MCQ)
The WTS VQA annotations are pre-processed into the LLaVA dataset format using cosmos-cookbook/scripts/examples/reason2/intelligent-transportation/data_preprocess.py. This JSON-based format is widely used for visual SFT on VLMs, including the LLaVA and Qwen-VL families. Each entry contains an id, a media reference (video or image), and a conversation pairing a human query with the expected VLM answer. Here is an example:
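If you have not downloaded the dataset yet, the sketch below shows the shape of a single entry. The field names follow the format described above; the id, file name, question, and options are invented for illustration.

```python
# Hypothetical LLaVA-format entry (structure per the description above;
# all field values are made up for illustration)
example_entry = {
    "id": "scene_0001_view1",
    "video": "videos/scene_0001_view1.mp4",
    "conversations": [
        {
            "from": "human",
            "value": "<video>\nWhat is the weather condition in the video?\n"
                     "A: Sunny\nB: Rainy\nC: Snowy\nD: Foggy",
        },
        # The expected answer is just the option letter
        {"from": "gpt", "value": "A"},
    ],
}
```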
def display_llava_format(example):
"""Pretty print a Llava-format example from the dataset."""
print("📋 Llava Dataset Format Example (from WTS):")
print("="*60)
print(json.dumps(example, indent=2))
print("="*60)
def parse_mcq_text(text):
"""Parse MCQ question/options from the WTS Llava-format prompt."""
cleaned = text.replace("<video>", " ").strip()
lines = [line.strip() for line in cleaned.splitlines() if line.strip()]
question = lines[0] if lines else ""
options = lines[1:] if len(lines) > 1 else []
return question, options
def is_correct_option(option, answer):
"""Mark the correct option based on the answer token (e.g., 'A')."""
opt = option.strip()
ans = answer.strip()
if not ans:
return False
prefixes = [f"{ans}:", f"{ans})", f"{ans}.", f"{ans} "]
return opt == ans or any(opt.startswith(prefix) for prefix in prefixes)
# Load actual MCQ examples from the training annotations
annotations_path = TRAIN_ANNOTATIONS_PATH
if not os.path.exists(annotations_path):
print("⚠️ annotations.json not found. Update TRAIN_DATA_PATH to your dataset location.")
print(f" Current path: {annotations_path}")
else:
with open(annotations_path, "r") as f:
annotations = json.load(f)
# Display a real Llava-format entry
if annotations:
display_llava_format(annotations[0])
# Display a few actual MCQ questions
print("\n\n📝 Sample MCQ Questions from the Training Set:")
print("="*60)
for i, ann in enumerate(annotations[:4], 1):
question_text, options = parse_mcq_text(ann["conversations"][0]["value"])
answer = ann["conversations"][1]["value"]
print(f"\nQ{i}: {question_text}")
for opt in options:
marker = "✓" if is_correct_option(opt, answer) else " "
print(f" [{marker}] {opt}")
print("\n" + "="*60)
4. Zero-Shot Inference¶
4.1. Inference Helper Class¶
The helper class below wraps vLLM model loading and video question answering, and works with any Cosmos Reason 2 checkpoint (base or fine-tuned).
# Inference Class for Cosmos Reason 2
class CosmosReason2Inference:
"""
Inference wrapper for fine-tuned Cosmos Reason 2 model.
"""
def __init__(self, model_path, nframes=8, max_tokens=512):
"""
Initialize the inference engine.
Args:
model_path: Path to the model checkpoint (base or fine-tuned)
nframes: Number of frames to sample from videos
max_tokens: Maximum tokens to generate
"""
self.model_path = model_path
self.nframes = nframes
self.max_tokens = max_tokens
self.llm = None
self.processor = None
self.sampling_params = None
def load_model(self):
"""Load the model using vLLM."""
try:
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
import torch
import gc
torch.cuda.empty_cache()
gc.collect()
print(f"🔄 Loading model from: {self.model_path}")
self.llm = LLM(
model=self.model_path,
tensor_parallel_size=1,
max_model_len=32768,
trust_remote_code=True,
limit_mm_per_prompt={"video": 1, "image": 0}
)
# Load processor for chat template
self.processor = AutoProcessor.from_pretrained(
self.model_path,
trust_remote_code=True
)
self.sampling_params = SamplingParams(
max_tokens=self.max_tokens,
temperature=0.0
)
print("✅ Model loaded successfully!")
return True
except ImportError:
print("⚠️ vLLM not installed. Install with: pip install vllm")
return False
except Exception as e:
print(f"❌ Error loading model: {e}")
return False
def query(self, video_path, question, system_prompt="You are a helpful assistant."):
"""
Query the model with a video and question.
Args:
video_path: Path to the video file
question: Question to ask about the video
system_prompt: System prompt for the model
Returns:
Model's response as string
"""
if self.llm is None or not hasattr(self, 'processor') or self.processor is None:
print("⚠️ Model not loaded. Call load_model() first.")
return None
try:
from qwen_vl_utils import process_vision_info
# Prepare messages with video
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": [
{"type": "video", "video": str(video_path), "nframes": self.nframes},
{"type": "text", "text": question}
]}
]
# Apply chat template to get text prompt
text_prompt = self.processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
# Extract video data using process_vision_info
image_inputs, video_inputs, video_kwargs = process_vision_info(
messages,
image_patch_size=16,
return_video_kwargs=True,
return_video_metadata=True
)
# Prepare input for vLLM generate
model_input = {
"prompt": text_prompt,
"multi_modal_data": {"video": video_inputs},
"mm_processor_kwargs": video_kwargs
}
# Run inference using generate (not chat)
outputs = self.llm.generate([model_input], self.sampling_params)
response = outputs[0].outputs[0].text
return response
except ImportError as ie:
print(f"⚠️ Import error: {ie}")
print(" Install with: pip install qwen-vl-utils")
return None
except Exception as e:
print(f"❌ Error during inference: {e}")
import traceback
traceback.print_exc()
return None
4.2. Run Zero-Shot Inference¶
Before fine-tuning, run a quick zero-shot evaluation with the base model to establish a baseline for comparison. For more ways to perform inference, refer to the official GitHub repository.
# Zero-shot inference with base model
print("="*70)
print("🔍 ZERO-SHOT INFERENCE (Base Model)")
print("="*70)
inference_base = CosmosReason2Inference(
model_path=BASE_MODEL_PATH, # Base model path
nframes=8,
max_tokens=512
)
inference_base.load_model()
# Test video
zero_shot_video = EXAMPLE_VIDEO_PATH
print(f"\nVideo: {zero_shot_video}\n")
# Sample question
question = "What is the pedestrian doing in this video?"
print("📝 Question:")
print(question)
print("-"*70)
response = inference_base.query(zero_shot_video, question)
print(f"\n✅ ANSWER: {response}")
print("="*70)
# Clean up GPU memory before training
try:
del inference_base
import torch, gc
torch.cuda.empty_cache()
gc.collect()
except Exception:
pass
5. Training¶
Configure and run supervised fine-tuning for the WTS dataset.
5.1. Training Configuration¶
The training configuration is specified in a TOML file. Key hyperparameters are optimized for training on 8x A100 GPUs. Adjust the parameters according to your hardware.
Key Configuration Highlights¶
| Parameter | Value |
|---|---|
| Learning Rate | 2e-5 with cosine decay |
| Batch Size | 32 per replica |
| Model | nvidia/Cosmos-Reason2-2B (or 8B) |
| Max Length | 32,768 tokens |
| Vision | 8 frames uniformly sampled (nframes=8) |
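For orientation, these values map onto the TOML roughly as sketched below. The section and key names match the fields this notebook parses (policy.model_name_or_path, train.optm_lr, train.train_batch_per_replica, policy.model_max_length); the real sft_config.toml contains many more settings.

```toml
# Sketch of the key fields only — not a complete, runnable config
[policy]
model_name_or_path = "nvidia/Cosmos-Reason2-2B"
model_max_length = 32768

[train]
optm_lr = 2e-5
train_batch_per_replica = 32
```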
# Use the official training config from the cosmos-cookbook repo
CONFIG_FILE = "scripts/examples/reason2/intelligent-transportation/sft_config.toml"
CONFIG_PATH = f"{COSMOS_COOKBOOK_REPO}/{CONFIG_FILE}"
# Display the raw config file
print("📄 Official Training Config from cosmos-cookbook")
print(f" Source: github.com/nvidia-cosmos/cosmos-cookbook/{CONFIG_FILE}\n")
print("="*70)
!cat {CONFIG_PATH}
print("="*70)
# Parse and show key parameters
try:
import tomllib
except ImportError:
try:
import tomli as tomllib
except ImportError:
import pip._vendor.tomli as tomllib
with open(CONFIG_PATH, "rb") as f:
config = tomllib.load(f)
print("\n🔑 Key Training Parameters:\n")
print(f" Model: {config['policy']['model_name_or_path']}")
print(f" Learning Rate: {config['train']['optm_lr']}")
print(f" Batch Size: {config['train']['train_batch_per_replica']} per GPU")
print(f" Max Seq Length: {config['policy']['model_max_length']}")
5.2. Update Config Paths¶
Patch the sft_config.toml file with your local dataset and output paths. This keeps the training script aligned with your environment.
Also update dp_shard_size under [policy.parallelism] to match your GPU count (for this setup, use dp_shard_size = 4), and tune train_batch_per_replica (batch size) and model_max_length (context length) based on available GPU memory. If you encounter OOM errors, reduce these values.
# Update sft_config.toml with actual paths
import os
import subprocess
from pathlib import Path
CONFIG_PATH = f"{COSMOS_COOKBOOK_REPO}/scripts/examples/reason2/intelligent-transportation/sft_config.toml"
print("📝 Updating sft_config.toml with actual paths...\n")
# Use sed to update the config file directly (in-place)
subprocess.run(["sed", "-i", f's|annotation_path = .*|annotation_path = "{TRAIN_ANNOTATIONS_PATH}"|', CONFIG_PATH])
subprocess.run(["sed", "-i", f's|media_path = .*|media_path = "{TRAIN_VIDEOS_PATH}"|', CONFIG_PATH])
subprocess.run(["sed", "-i", f's|output_dir = .*|output_dir = "{FINETUNED_MODEL_PATH}"|', CONFIG_PATH])
print(f" annotation_path: {TRAIN_ANNOTATIONS_PATH}")
print(f" media_path: {TRAIN_VIDEOS_PATH}")
print(f" output_dir: {FINETUNED_MODEL_PATH}")
# Verify paths exist
print("\n🔍 Verifying paths:")
ann_path = TRAIN_ANNOTATIONS_PATH
media_path = TRAIN_VIDEOS_PATH
if os.path.exists(ann_path):
print(f" ✅ annotations.json exists")
else:
print(f" ❌ annotations.json NOT found at {ann_path}")
if os.path.exists(media_path):
print(f" ✅ videos directory exists")
video_files = list(Path(media_path).rglob("*.mp4"))
print(f" Found {len(video_files)} video files")
else:
print(f" ❌ videos directory NOT found at {media_path}")
# Show updated config section
print("\n📄 Updated [custom.dataset] section:")
print("="*50)
!grep -A3 "\[custom.dataset\]" {CONFIG_PATH}
print("="*50)
print("\n📄 Updated [policy.parallelism] section:")
print("="*50)
!grep -A6 "\[policy.parallelism\]" {CONFIG_PATH}
print("="*50)
5.3. Vision Token Analysis¶
Understanding how vision tokens are calculated is crucial for optimizing training. Qwen3-VL (the backbone of Cosmos Reason 2) compresses input videos in both space and time:
Compression Factors¶
- Spatial Compression: Effective patch size = 32 (16 patch × 2 spatial merge)
- Temporal Compression: Effective temporal step = 2 (2 frames merge into 1)
Ablation Configurations¶
- nframes=8 (~3k tokens) — Fewer frames, higher resolution per frame
- fps=1, 8M pixels (~8k tokens) — More frames, lower resolution per frame
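The compression factors above can be turned into a rough token estimator. This is a simplified sketch that ignores the model's internal smart-resize step, so treat the numbers as approximations; the 1280×720 sample resolution is an assumption.

```python
def estimate_vision_tokens(width, height, nframes,
                           patch=16, spatial_merge=2, temporal_merge=2):
    """Rough vision-token count for Qwen3-VL-style video encoding."""
    eff_patch = patch * spatial_merge                    # effective patch size = 32
    tokens_per_frame = (width // eff_patch) * (height // eff_patch)
    temporal_groups = max(1, nframes // temporal_merge)  # 2 frames merge into 1
    return temporal_groups * tokens_per_frame

# 8 uniformly sampled frames at 1280x720:
print(estimate_vision_tokens(1280, 720, 8))  # → 3520 (~3k tokens, matching nframes=8)
```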
# OPTIONAL ablation config: fps=1, total_pixels=8M
# Uncomment and run the following cell to enable the ablation config
# import re
# from pathlib import Path
# config_path = Path(CONFIG_PATH)
# config_text = config_path.read_text(encoding="utf-8")
# def upsert(text, key, value, anchor_pattern):
# key_pattern = rf"(?m)^{key}\s*=.*$"
# if re.search(key_pattern, text):
# return re.sub(key_pattern, f"{key} = {value}", text, count=1)
# return re.sub(anchor_pattern, lambda m: f"{m.group(0)}\n{key} = {value}", text, count=1)
# config_text = upsert(config_text, "fps", 1, r"(?m)^\[custom\.vision\]\s*$")
# config_text = upsert(config_text, "total_pixels", 8388608, r"(?m)^fps\s*=.*$")
# config_path.write_text(config_text, encoding="utf-8")
# !grep -A4 "\[custom.vision\]" {CONFIG_PATH}
# print("\nYou can now train with this config for the fps=1 / 8M-pixels ablation study.")
5.4. Run Training¶
Now we launch the SFT training using the Cosmos-RL framework. The training uses:
- Supervised Fine-Tuning (SFT) on MCQ data
Please refer to Cosmos-RL docs for system requirements.
Training time: ~1 hour 16 minutes for the 3k-vision-token configuration (nframes=8) on 8× A100s.
Troubleshooting
- CUDA out of memory: reduce the batch size, lower nframes, or decrease model_max_length; restart the kernel to clear GPU memory.
# Run Training with Cosmos-RL (using cosmos-rl's own venv)
import os
import sys
COSMOS_RL_VENV = f"{COSMOS_RL_PATH}/.venv"
TRAINING_DIR = f"{COSMOS_COOKBOOK_REPO}/scripts/examples/reason2/intelligent-transportation"
print("🚀 Running Training with Cosmos-RL")
print("="*70)
print(f" Working Dir: {TRAINING_DIR}")
print(f" Config: sft_config.toml")
print(f" Script: custom_sft.py")
print("="*70)
# Check Redis package installed
try:
import redis
except ImportError as exc:
raise ImportError(
"Redis is required for training. Install the system service "
"(e.g., sudo apt-get install redis-server) and ensure the 'redis' "
"Python package is available in this environment."
) from exc
print("\n⏱️ Expected training time (8× A100):")
print(" - 3k tokens (nframes=8): ~1h 16m for 3 epochs")
# Setup cosmos-rl venv if needed
if not os.path.exists(f"{COSMOS_RL_VENV}/bin/cosmos-rl"):
print("\n📦 Setting up cosmos-rl venv with uv sync...")
!cd {COSMOS_RL_PATH} && pip install -q uv && uv sync
# Run training - activate the venv so subprocesses get the right python.
# Note: '!' cells run under /bin/sh, where 'source' may be unavailable;
# use the POSIX '.' command instead.
print("\n🔄 Starting training...\n")
!. {COSMOS_RL_VENV}/bin/activate && cd {TRAINING_DIR} && cosmos-rl --config sft_config.toml custom_sft.py
6. Evaluation¶
Measure performance on the validation set using the official evaluation script.
6.1. Run Evaluation¶
After training, we evaluate the model on the validation set of the WTS Environment VQA dataset:
- 171 videos with 2.6k MCQ questions (unseen during training)
- Evaluation uses vLLM inference engine for efficient batch processing
- Metrics: Accuracy on multiple-choice questions. After the evaluation is finished, you can find the accuracy in the results.json under the results folder.
Before running evaluation, you need to set FINETUNED_CHECKPOINT_PATH to the actual checkpoint folder (for example: {FINETUNED_MODEL_PATH}/<timestamp>/safetensors/step_<n>).
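To avoid typing the timestamp by hand, a small helper can locate the newest checkpoint, assuming the {FINETUNED_MODEL_PATH}/<timestamp>/safetensors/step_<n> layout described above (the helper name is ours, not part of the cookbook):

```python
import glob
import os

def latest_checkpoint(base_dir):
    """Return the most recently modified step_* checkpoint folder, or None."""
    pattern = os.path.join(base_dir, "*", "safetensors", "step_*")
    candidates = [p for p in glob.glob(pattern) if os.path.isdir(p)]
    return max(candidates, key=os.path.getmtime) if candidates else None

# Usage (assumes training has produced at least one checkpoint):
# FINETUNED_CHECKPOINT_PATH = latest_checkpoint(FINETUNED_MODEL_PATH)
```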
# Set this to the fine-tuned checkpoint folder you want to evaluate/use
# Example: {FINETUNED_MODEL_PATH}/20260210003314/safetensors/step_1
FINETUNED_CHECKPOINT_PATH = f"{FINETUNED_MODEL_PATH}/<timestamp>/safetensors/step_<n>"
if "<" in FINETUNED_CHECKPOINT_PATH or FINETUNED_CHECKPOINT_PATH.startswith("/path/to"):
raise ValueError("Please set FINETUNED_CHECKPOINT_PATH to your actual checkpoint folder before continuing.")
print(f"Using checkpoint: {FINETUNED_CHECKPOINT_PATH}")
# Run Evaluation using cosmos-cookbook script
import sys
import os
import subprocess
import logging
EVAL_DIR = f"{COSMOS_COOKBOOK_REPO}/scripts/examples/reason2/intelligent-transportation"
EVAL_CONFIG = f"{EVAL_DIR}/eval_config.yaml"
# Suppress INFO logs (notebook + child processes)
EVAL_LOG_LEVEL = "WARNING" # evaluate.py/logger level
logging.getLogger().setLevel(logging.WARNING)
os.environ["LOGLEVEL"] = EVAL_LOG_LEVEL
# Update paths in eval_config.yaml (preserve rest of config)
print("\n📝 Updating paths in eval_config.yaml...")
subprocess.run(["sed", "-i", f's|annotation_path:.*|annotation_path: {VAL_ANNOTATIONS_PATH}|', EVAL_CONFIG])
subprocess.run(["sed", "-i", f's|media_dir:.*|media_dir: {VAL_VIDEOS_PATH}|', EVAL_CONFIG])
subprocess.run(["sed", "-i", f's|model_name:.*|model_name: {FINETUNED_CHECKPOINT_PATH}|', EVAL_CONFIG])
print(f" annotation_path: {VAL_ANNOTATIONS_PATH}")
print(f" media_dir: {VAL_VIDEOS_PATH}")
print(f" model_name: {FINETUNED_CHECKPOINT_PATH}")
# Show updated config
print("\n📄 Evaluation Config:")
print("="*70)
!cat {EVAL_CONFIG}
print("="*70)
# Remove any existing eval run results
!rm -rf "{EVAL_DIR}/results/post_trained_cr2"
# Run evaluation
print("\n🔄 Starting evaluation...\n")
!cd {EVAL_DIR} && LOGLEVEL={EVAL_LOG_LEVEL} {sys.executable} evaluate.py --config eval_config.yaml
# Load and display results from evaluate.py
import json
import os
import glob
EVAL_DIR = f"{COSMOS_COOKBOOK_REPO}/scripts/examples/reason2/intelligent-transportation"
RESULTS_BASE = os.path.join(EVAL_DIR, "results")
# Find all results.json files from evaluate.py output
result_files = glob.glob(os.path.join(RESULTS_BASE, "**/results.json"), recursive=True)
if result_files:
# Use the most recent results file
result_file = max(result_files, key=os.path.getmtime)
with open(result_file, 'r') as f:
metrics = json.load(f)
print("="*60)
print("📊 EVALUATION RESULTS")
print("="*60)
print(f"\n Accuracy: {metrics['accuracy']*100:.2f}%")
print(f" Correct: {metrics['total_correct']} / {metrics['total_questions']}")
print(f"\n Results: {result_file}")
print("\n" + "="*60)
else:
print(f"ℹ️ No results found in {RESULTS_BASE}")
print(" Run the evaluation cell above first.")
7. Inference on Fine-Tuned Checkpoints¶
Run inference on custom traffic videos using the fine-tuned Cosmos Reason 2 model. The model can answer both multiple-choice and open-ended questions about traffic scenes.
# Fine-tuned inference demo
inference = CosmosReason2Inference(model_path=FINETUNED_CHECKPOINT_PATH, nframes=8)
inference.load_model()
# Sample questions for traffic scene understanding
ANSWER_STYLE_SUFFIX = "Answer with a sentence."
SAMPLE_QUESTIONS = [
"What type of road is shown in this video?",
"How many vehicles can you see in the scene?",
"Is there any pedestrian in the video? If yes, what are they doing?",
"What potential traffic hazards do you observe?",
"Describe the overall traffic flow and density.",
]
SAMPLE_QUESTIONS = [f"{q} {ANSWER_STYLE_SUFFIX}" for q in SAMPLE_QUESTIONS]
test_video = EXAMPLE_VIDEO_PATH
print("\n" + "="*70)
print("🎬 TESTING INFERENCE ON TRAFFIC VIDEO (FINE-TUNED)")
print("="*70)
print(f"Video: {test_video}")
print(f"Total Questions: {len(SAMPLE_QUESTIONS)}")
print("="*70 + "\n")
for i, question in enumerate(SAMPLE_QUESTIONS, 1):
print(f"📝 Question {i}/{len(SAMPLE_QUESTIONS)}")
print("-"*70)
print(question)
print("-"*70)
response = inference.query(test_video, question)
print(f"✅ ANSWER: {response}")
print("="*70 + "\n")
print("✅ All questions processed successfully!")
7.1. Clean Up GPU Memory¶
Before proceeding with quantization and deployment, terminate any remaining vLLM processes to free GPU memory.
import os
import signal
import subprocess
out = subprocess.check_output(
["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv,noheader"]
).decode().strip()
terminated = 0
for line in out.splitlines():
if not line.strip():
continue
pid, name = [x.strip() for x in line.split(",", 1)]
if "VLLM" in name or "EngineCore" in name:
os.kill(int(pid), signal.SIGTERM)
terminated += 1
if terminated:
print(f"✅ Terminated {terminated} vLLM GPU worker process(es).")
else:
print("✅ No vLLM GPU worker processes found.")
Once inference results look satisfactory, the next step is to prepare the model for production deployment.
8. Deployment¶
Prepare the fine-tuned model for production use and serve it with NIM.
8.1. FP8 Quantization¶
Quantize the fine-tuned model to FP8 precision to reduce memory footprint and improve inference throughput. The quantization script is included in the Cosmos Reason 2 repository.
# FP8 Quantization Configuration
FP8_MODEL_OUTPUT_PATH = f"{FINETUNED_MODEL_PATH}_fp8"
QUANTIZATION_CONFIG = {
"model_path": FINETUNED_CHECKPOINT_PATH,  # quantize the evaluated fine-tuned checkpoint
"output_path": FP8_MODEL_OUTPUT_PATH,
"precision": "fp8",
}
quantize_script = f"{COSMOS_REASON2_REPO}/scripts/quantize.py"
print("Quantization Setup")
print("="*70)
print(f" Input Model: {QUANTIZATION_CONFIG['model_path']}")
print(f" Output Path: {QUANTIZATION_CONFIG['output_path']}")
print(f" Precision: {QUANTIZATION_CONFIG['precision']}")
print("="*70)
QUANTIZE_CMD = f"""\
python {quantize_script} \\
--model "{QUANTIZATION_CONFIG['model_path']}" \\
-o "{QUANTIZATION_CONFIG['output_path']}" \\
--precision {QUANTIZATION_CONFIG['precision']}
"""
print("\nQuantization Command:")
print("-"*70)
print(QUANTIZE_CMD)
print("-"*70)
Run the FP8 quantization command below. This requires the Cosmos-RL virtual environment to be set up (see Section 1). The quantized model will be saved to the output path shown above.
import os
COSMOS_RL_VENV = f"{COSMOS_RL_PATH}/.venv"
# Run FP8 Quantization (Shell Command)
!. {COSMOS_RL_VENV}/bin/activate && pip install -q uv && {COSMOS_REASON2_REPO}/scripts/quantize.py \
--model "{QUANTIZATION_CONFIG['model_path']}" \
-o "{QUANTIZATION_CONFIG['output_path']}" \
--precision fp8
8.2. Deploy with NIM¶
You need an NGC API key to pull the Cosmos Reason 2 NIM image. This cell prompts for your key and performs a Docker login.
# NGC Login
import subprocess
import getpass
from IPython.display import display, HTML
import time
display(HTML('<a href="https://org.ngc.nvidia.com/setup/api-key" target="_blank" style="font-size:16px;">🔑 Get NGC API Key</a>'))
time.sleep(2)
ngc_api_key = getpass.getpass("NGC API Key: ").strip()
if ngc_api_key:
result = subprocess.run(
["docker", "login", "nvcr.io", "-u", "$oauthtoken", "--password-stdin"],
input=ngc_api_key, text=True, capture_output=True
)
print("✅ Login successful" if result.returncode == 0 else f"❌ Failed: {result.stderr}")
else:
print("❌ No key provided")
NIM Deployment Configuration¶
Define the model path, NIM image, and runtime parameters before launching the container.
Ensure GPU memory is free before deployment and adjust the max_model_len parameter below based on available GPU memory. Reduce it if you encounter CUDA out-of-memory errors.
All necessary tags and parameters are included in the Docker command below. For additional NIM configuration options, refer to the official NIM Configuration page.
By default, the nvidia-container-runtime mounts only a minimal set of driver libraries for security and efficiency. The default NVIDIA_DRIVER_CAPABILITIES value is compute,utility, which includes:
- compute: CUDA libraries (libcuda.so, libnvcuvid.so, etc.)
- utility: management tools (nvidia-smi, nvidia-debugdump)
Not included by default:
- video: NVENC/NVDEC video codec libraries (libnvidia-encode.so, libnvidia-decode.so)
- graphics: OpenGL libraries
- display: X11 libraries
Please refer to NVIDIA Container Toolkit guide for more help.
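If the container needs GPU video decode (NVDEC) for reading videos, the video capability can be requested explicitly when launching the container. A sketch of building the extra `-e` flag; the capability names follow the NVIDIA Container Toolkit conventions, and the default set chosen here is an assumption to adjust for your workload:

```python
def driver_capability_flags(capabilities=("compute", "utility", "video")):
    """Build the docker run flag that requests extra NVIDIA driver capabilities."""
    return ["-e", f"NVIDIA_DRIVER_CAPABILITIES={','.join(capabilities)}"]

# Append the returned flag to the docker run command in the next cell,
# e.g. -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
```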
# NVIDIA NIM Deployment Configuration
NIM_CONFIG = {
"model_path": f"{QUANTIZATION_CONFIG['output_path']}/model_fp8", # FP8 quantized model
"nim_image": "nvcr.io/nim/nvidia/cosmos-reason2-8b:latest",
"model_name": "cosmos-reason2-wts",
"port": 8000,
"shm_size": "32GB",
"max_model_len": 131072, # 128k tokens context (supports up to 256k); reduce for lower memory usage
"allowed_local_media_path": "/path/to/media" # UPDATE THIS PATH TO YOUR LOCAL VIDEO PATH
}
print("🚀 NVIDIA NIM Deployment")
print("="*70)
print(f"Model: {NIM_CONFIG['model_path']}")
print(f"Max Context Length: {NIM_CONFIG['max_model_len']:,} tokens")
print(f"Port: {NIM_CONFIG['port']}")
print("="*70)
# Shell Command for NIM Deployment
NIM_DEPLOY_CMD = f"""
# Set environment variables
export CUSTOM_WEIGHTS="{NIM_CONFIG['model_path']}"
export NIM_IMAGE="{NIM_CONFIG['nim_image']}"
# Launch NIM container
docker run -d --name=cosmos-reason2-wts \\
--gpus all \\
--shm-size={NIM_CONFIG['shm_size']} \\
-e NIM_MODEL_NAME=$CUSTOM_WEIGHTS \\
-e NIM_SERVED_MODEL_NAME="{NIM_CONFIG['model_name']}" \\
-e NIM_MAX_MODEL_LEN={NIM_CONFIG['max_model_len']} \\
-e NIM_ALLOWED_LOCAL_MEDIA_PATH="{NIM_CONFIG['allowed_local_media_path']}" \\
-e NVIDIA_VISIBLE_DEVICES=all \\
-v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \\
-v {NIM_CONFIG['allowed_local_media_path']}:{NIM_CONFIG['allowed_local_media_path']}:ro \\
-u $(id -u) \\
-p {NIM_CONFIG['port']}:8000 \\
$NIM_IMAGE \\
/opt/nim/start_server.sh --allowed-local-media-path {NIM_CONFIG['allowed_local_media_path']}
# Wait for startup (takes ~2-3 minutes)
# Check the deployment status using
docker logs -f cosmos-reason2-wts
# Health check
curl http://localhost:{NIM_CONFIG['port']}/v1/health/ready | jq .
"""
print("\n💡 Steps:")
print(" 1. Run the commands below")
print(" 2. Monitor with: docker logs -f cosmos-reason2-wts")
print(" 3. Stop with: docker stop cosmos-reason2-wts \n")
print("-"*70)
print("📝 NIM Deployment Commands:")
print("-"*70)
print(NIM_DEPLOY_CMD)
print("-"*70)
8.3. Test NIM API¶
Send a sample request to the local NIM endpoint to confirm the deployment is responding correctly. The examples below run inference on both a remote video URL and a local video file; replace them with your own videos as needed.
import os
import requests
from typing import Tuple, Dict, Any
# NIM Client Settings
NIM_ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "cosmos-reason2-wts"
def to_file_url(path: str) -> str:
if not os.path.isabs(path):
raise ValueError("Local video must be an absolute path")
return f"file://{path}"
def nim_video_chat(
prompt: str,
video: str,
fps: float = 1.0,
timeout: int = 120,
) -> Tuple[str, Dict[str, Any]]:
if video.startswith(("http://", "https://", "data:", "file://")):
video_url = video
else:
video_url = to_file_url(video)
payload = {
"model": MODEL_NAME,
"messages": [{
"role": "user",
"content": [
{"type": "video_url", "video_url": {"url": video_url}},
{"type": "text", "text": prompt},
],
}],
"media_io_kwargs": {"video": {"fps": fps}},
"stream": False,
}
resp = requests.post(NIM_ENDPOINT, json=payload, timeout=timeout)
if not resp.ok:
print(f"[HTTP {resp.status_code}] {resp.text}")
resp.raise_for_status()
data = resp.json()
return data["choices"][0]["message"]["content"], data
# --- Example: Remote + Local Video ---
print("▶ Remote video")
remote_answer, _ = nim_video_chat(
prompt="What is in this video?",
video="https://download.samplelib.com/mp4/sample-5s.mp4",
fps=4.0,
)
print("Answer:", remote_answer)
print("\n" + "-" * 60 + "\n")
print("▶ Local video")
local_answer, _ = nim_video_chat(
prompt="What is in this video?",
video=EXAMPLE_VIDEO_PATH,
fps=4.0,
)
print("Answer:", local_answer)