
Cosmos Reason for MimicGen temporal localization

Authors: Aigul Dzhumamuratova, Alexander Efitorov, Hesam Rabeti, Jingyi Jin

Organization: NVIDIA

Overview

| Model | Workload | Use case |
|---|---|---|
| Cosmos Reason 1 | Post-training | Temporal localization for MimicGen data generation |

MimicGen is a system for automatically synthesizing large-scale, diverse robot-learning datasets from only a small number of human demonstrations.

It works by dividing each demonstration into object-centric subtasks annotated with timestamp boundaries.

Given a new scene with different object configurations, MimicGen:

  • selects an existing demonstration
  • spatially transforms each subtask to match the new context and stitches them together
  • executes the resulting trajectory to collect a new demonstration

This allows a small set of manually recorded demonstrations to be expanded into a large dataset with varied contexts and trajectories.

By leveraging the Cosmos Reason temporal localization capability, timestamp annotations for subtasks can be automatically generated from short simulation videos, reducing manual effort and improving scalability.

Benchmark Selection

For each specific task, we identify a set of subtasks that require timestamp boundary annotations. We overlay timestamps on the video and provide it as input to the model along with the following system and user prompts.

Isaac Lab Cube Stacking Events Timeline

System prompt:

You are a specialized behavior analyst. Your task is to analyze the video and identify MULTIPLE discrete events with precise timestamps. At each frame, the timestamp is embedded at the bottom of the video. You need to extract the timestamp and answer the user question
CRITICAL REQUIREMENTS:
1. Extract timestamps from the bottom of each frame
2. Extract timestamps for USER-DEFINED events

Answer the question in the following format:
<think>
I will analyze the video systematically:
1. First, identify ALL visible timestamps throughout the video
2. Identify USER-DEFINED events
3. Extract timestamps for identified USER-DEFINED events. There will be different timestamps for each video.
4. Always answer in English

Event 1: <start time> - <end time> - Event | reasoning
Event 2: <start time> - <end time> - Event | reasoning
Event 3: <start time> - <end time> - Event | reasoning

[Continue for all events identified]
</think>

<answer>
Event 1: <start time> - <end time> Specific Event | detailed explanation.
Event 2: <start time> - <end time> Specific Event | detailed explanation.
Event 3: <start time> - <end time> Specific Event | detailed explanation.
[Continue for all events identified]
</answer>

User prompt:

You should find the following 3 events in the input video
Event 1: grasping the red cube.
Event 2: releasing the red cube.
Event 3: grasping the green cube.
Extract the exact timestamps for each event.
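
Downstream code needs to parse the `<answer>` block into (start, end, description) triples. A minimal parsing sketch, assuming the `Event N: <start> <end> description | explanation` layout shown above (the regex and function name are illustrative and not taken from the released scripts):

```python
import re

# Accepts "Event 1: <2.2> <5.6> grasping the pod | ..." and the "-"-separated variant.
EVENT_RE = re.compile(
    r"Event\s+(\d+):\s*<?(\d+(?:\.\d+)?)>?\s*-?\s*<?(\d+(?:\.\d+)?)>?\s*-?\s*(.*)"
)

def parse_events(model_output: str) -> list[dict]:
    """Extract event number, start/end timestamps, and description from the <answer> block."""
    answer = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    block = answer.group(1) if answer else model_output
    events = []
    for line in block.splitlines():
        m = EVENT_RE.match(line.strip())
        if m:
            events.append({
                "event": int(m.group(1)),
                "start": float(m.group(2)),
                "end": float(m.group(3)),
                "description": m.group(4).split("|")[0].strip(),
            })
    return events
```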

Zero-shot Evaluation

Run Baseline Evaluation

For zero-shot evaluation we used a test dataset of 60 videos across 7 tasks. It includes both simulation environments from Isaac Lab Mimic (Cube Stacking, Humanoid Nut Pouring) and real-world datasets (AgiBot, BridgeData V2), and covers both robot-arm and humanoid scenarios. The fine-tuning dataset, introduced later, contains only robot-arm demonstrations, so the humanoid scenarios in the test dataset are fully out-of-distribution and particularly useful for evaluating how well the model generalizes during re-evaluation.

| Task | Video | Frame count | Number of videos | Caption |
|---|---|---|---|---|
| Isaac Lab: Cube Stacking | Cube Stacking | 179-247 | 10 | Event 1: grasping the red cube \| The robotic arm closes its grippers around the red cube, securing it firmly.<br>Event 2: releasing the red cube \| The robotic arm releases the red cube, allowing it to rest atop the blue cube.<br>Event 3: grasping the green cube \| The robotic arm closes its grippers around the green cube, picking it up from the surface. |
| Isaac Lab: Humanoid Nut Pouring | Nut Pouring | 288-313 | 5 | Event 1: Picking up the red cylinder from the table.<br>Event 2: Placing the red cylinder in the blue tray.<br>Event 3: Picking up the yellow bowl from the table. |
| BridgeData V2: Cube Stacking | Bridge Cube Stacking | 38-47 | 5 | Event 1: grasping the cube.<br>Event 2: releasing the cube. |
| AgiBot: task 358, Toaster | Toaster | 451 | 10 | Event 1: grasping bread.<br>Event 2: releasing bread.<br>Event 3: pushing the toaster.<br>Event 4: releasing the toaster. |
| AgiBot: task 366, Chips | Chips | 397-451 | 10 | Event 1: grasping a bag of chips.<br>Event 2: releasing a bag of chips. |
| AgiBot: task 378, Fork | Fork | 451 | 10 | Event 1: grasping a fork.<br>Event 2: releasing a fork.<br>Event 3: grasping a bowl.<br>Event 4: releasing a bowl. |
| AgiBot: task 412, Cup | Cup | 451 | 10 | Event 1: grasping a cup.<br>Event 2: grasping a rag.<br>Event 3: releasing a rag. |

Timestamps were overlaid on the videos with add_timestamps_to_all_videos_adaptive.py, a custom adaptive script that handles different FPS values:

python add_timestamps_to_all_videos_adaptive.py \
-i /path/to/videos \
-o /path/to/videos_with_ts
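
The overlay script itself is not reproduced here; a minimal OpenCV-based sketch of the same idea (burning the running timestamp into the bottom of each frame, scaled to the source FPS) could look like this — the file names and font settings are illustrative:

```python
import cv2

def add_timestamps(src_path: str, dst_path: str) -> None:
    """Burn the running timestamp (seconds) into the bottom of every frame."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t = frame_idx / fps  # timestamp in seconds, adaptive to the source FPS
        cv2.putText(frame, f"{t:.2f}s", (10, h - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
        out.write(frame)
        frame_idx += 1
    cap.release()
    out.release()
```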

We used the public checkpoint of Cosmos Reason 1 for evaluation with the following configuration:

  • max_tokens = 4096
  • fps = 4, 8, 12
  • temperature = 0.6

We tested the model using process_video_vllm_uni.py:

python process_video_vllm_uni.py \
  --model_path nvidia/Cosmos-Reason1-7B \
  --prompt cube \
  --video_dir /path/to/videos_with_ts \
  --output_dir /path/to/results \
  --fps_list 8
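
The evaluation script is not reproduced here; the sketch below shows the underlying vLLM call for a single timestamped video, following the standard Qwen2.5-VL-style vLLM workflow that Cosmos Reason 1 supports. The prompt variables and the video path are placeholders, and the actual script may differ in its details:

```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info  # helper used by Qwen2.5-VL-style models

MODEL = "nvidia/Cosmos-Reason1-7B"
SYSTEM_PROMPT = "You are a specialized behavior analyst. ..."   # full system prompt from above
USER_PROMPT = "You should find the following 3 events ..."      # full user prompt from above

llm = LLM(model=MODEL, limit_mm_per_prompt={"video": 1})
sampling = SamplingParams(temperature=0.6, max_tokens=4096)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "video", "video": "file:///path/to/videos_with_ts/demo.mp4", "fps": 8},
        {"type": "text", "text": USER_PROMPT},
    ]},
]

processor = AutoProcessor.from_pretrained(MODEL)
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

outputs = llm.generate(
    [{"prompt": prompt,
      "multi_modal_data": {"video": video_inputs},
      "mm_processor_kwargs": video_kwargs}],
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)
```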

Optimal setting: fps = 8.

The target accuracy for timestamp annotation was defined as < 30% relative error (relative to subtask duration). Beyond this threshold, MimicGen’s trajectory generation time increased significantly.
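
The metric reported in the tables below can be reproduced with a few lines. This is a sketch of how we interpret it — each boundary error is normalized by the ground-truth subtask duration — and the exact definition in postprocess.py may differ slightly:

```python
def timestamp_errors(pred: list[tuple[float, float]],
                     gt: list[tuple[float, float]]) -> tuple[float, float]:
    """Return (mean absolute error in seconds, mean relative error in %).

    Each event contributes its start- and end-boundary errors; the relative
    error normalizes by the ground-truth subtask duration.
    """
    abs_errors, rel_errors = [], []
    for (ps, pe), (gs, ge) in zip(pred, gt):
        duration = ge - gs
        for p, g in ((ps, gs), (pe, ge)):
            err = abs(p - g)
            abs_errors.append(err)
            rel_errors.append(100.0 * err / duration)
    return sum(abs_errors) / len(abs_errors), sum(rel_errors) / len(rel_errors)

# Example with the nut-pouring ground truth and a hypothetical prediction:
gt = [(1.7, 6.2), (6.2, 8.3), (8.3, 10.6)]
pred = [(1.7, 5.8), (5.8, 7.9), (7.9, 10.6)]
print(timestamp_errors(pred, gt))
```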

We also compared Cosmos Reason against the Qwen3-VL model family. Among them, Qwen/Qwen3-VL-235B-A22B-Instruct achieved accuracy close to manual annotation. We evaluated the Qwen3-VL models using process_video_vllm_qwen3_uni.py:

python process_video_vllm_qwen3_uni.py \
--model qwen-2b \
--prompt cube \
--num_trials 10 \
--video_dir /path/to/videos_with_ts \
--output_dir /path/to/results

Models tested: Qwen/Qwen3-VL-2B-Instruct, Qwen/Qwen3-VL-8B-Instruct, Qwen/Qwen3-VL-30B-A3B-Instruct.

For Qwen/Qwen3-VL-235B-A22B-Instruct, we served the model with vLLM following its multi-GPU serving tutorial:

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--tensor-parallel-size 8 \
--mm-encoder-tp-mode data \
--quantization fp8 \
--mm-processor-cache-type shm \
--async-scheduling \
--allowed-local-media-path /mnt/pvc/datasets

Videos were then processed through the OpenAI API-compatible interface using process_video_openai_api.py:

python process_video_openai_api.py \
--model Qwen/Qwen3-VL-235B-A22B-Instruct \
--prompt cube \
--num_trials 10 \
--video_dir /path/to/videos_with_ts \
--output_dir /path/to/results
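
The request pattern behind this step is a standard OpenAI-compatible chat call against the vLLM server started above. A hedged sketch is shown below; the `video_url` content type is vLLM's multimodal extension, the video sub-path under /mnt/pvc/datasets is hypothetical, and the actual script arguments may differ:

```python
from openai import OpenAI

SYSTEM_PROMPT = "You are a specialized behavior analyst. ..."   # full system prompt from above
USER_PROMPT = "You should find the following 3 events ..."      # full user prompt from above

# Point the client at the vLLM server started with `vllm serve` above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    temperature=0.6,
    max_tokens=4096,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            # vLLM resolves local file:// paths under --allowed-local-media-path
            {"type": "video_url",
             "video_url": {"url": "file:///mnt/pvc/datasets/videos_with_ts/demo.mp4"}},
            {"type": "text", "text": USER_PROMPT},
        ]},
    ],
)
print(response.choices[0].message.content)
```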

Example Baseline Results

| Task | VLM (max_tokens = 4096, fps = 8) | Mean absolute error ⇩ (s) | Mean relative error ⇩ (%) | Notes |
|---|---|---|---|---|
| #1 Isaac Lab Cube Stacking (10 demos) | Cosmos-Reason 1 7B | 0.769 | 52.0 | A 52% error for Cosmos Reason 1 reduces the MimicGen success rate by 3x (from 41% to 14.8%) and increases generation time by 3x. |
| | Qwen3 VL 8B | 0.508 | 32.4 | |
| | Qwen3 VL 30B A3B | 0.459 | 30.4 | |
| | Qwen3 VL 235B A22B | 0.260 | 17.3 | |
| #2 Isaac Lab Humanoid Nut Pouring (5 demos) | Cosmos-Reason 1 7B | 1.487 | 61.3 | Error is higher than on the Cube Stacking task. |
| | Qwen3 VL 8B | 1.001 | 38.1 | |
| | Qwen3 VL 30B A3B | 1.407 | 56.4 | |
| | Qwen3 VL 235B A22B | 0.310 | 13.7 | |
| #3 Real-world BridgeData V2 Cube Stacking (10 demos) | Cosmos-Reason 1 7B | 0.560 | 109.7 | Performance on real-world data is worse than on simulated data. |
| | Qwen3 VL 8B | 0.489 | 94.4 | |
| | Qwen3 VL 30B A3B | 0.560 | 109.7 | |
| | Qwen3 VL 235B A22B | 0.158 | 26.5 | |
| #4 Real-world AgiBot dataset (40 demos) | Cosmos-Reason 1 7B | 2.381 | 68.45 | Accuracy is lower than on the humanoid tasks and worse than on the simulated datasets. |
| | Qwen3 VL 2B | 2.965 | 93.2 | |
| | Qwen3 VL 8B | 1.769 | 50.33 | |
| | Qwen3 VL 30B A3B | 2.127 | 58.98 | |
| | Qwen3 VL 235B A22B | 1.608 | 45.45 | |

Problem Analysis

The zero-shot evaluation demonstrated that the current accuracy level is insufficient to automate the MimicGen data preparation pipeline.

Key Gaps Identified:

  • Gap 1: Inaccurate timestamps (mean relative error > 50%)
  • Gap 2: Inconsistent output formatting - different structures between runs.
  • Gap 3: Missed user-defined subtasks in some cases.

Success Criteria:

  • Mean relative error < 30%
  • All user-defined events are correctly identified and labeled.

Data Curation

For this project, we utilized the public MimicGen dataset, focusing on 6 representative manipulation tasks.

To improve visual quality and ensure consistency across samples, all videos were re-rendered with NVIDIA Omniverse. Re-rendering is required because the default MuJoCo renders produced by the robomimic/robosuite environments are of poor visual quality and degrade downstream video-based reasoning. To reproduce the rendering process or adapt it to your own tasks, refer to the robosuite documentation for enabling the Isaac rendering backends to re-render previously collected demonstrations with either ray tracing or path tracing.

The videos were generated at a resolution of 512x512 pixels and a frame rate of 30 FPS. Ground-truth subtask timestamps were extracted semi-automatically based on:

  • Object position trajectories
  • Robot arm kinematics

The automatically extracted timestamps were then manually refined to ensure accurate alignment with task transitions and key manipulation events.
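
As an illustration of the semi-automatic extraction, the sketch below derives a grasp interval from gripper aperture and object height logged by the simulator. The array names and thresholds are hypothetical, and the real extraction was refined by hand:

```python
import numpy as np

def grasp_interval(gripper_aperture: np.ndarray,
                   object_z: np.ndarray,
                   fps: float = 30.0,
                   close_thresh: float = 0.01,
                   lift_thresh: float = 0.02) -> tuple[float, float]:
    """Return (start, end) in seconds for a 'grasping' subtask.

    Start: first frame where the gripper closes below `close_thresh`.
    End:   first subsequent frame where the object has been lifted by `lift_thresh`.
    """
    closed = np.flatnonzero(gripper_aperture < close_thresh)
    start_idx = int(closed[0])
    lifted = np.flatnonzero(object_z[start_idx:] > object_z[start_idx] + lift_thresh)
    end_idx = start_idx + int(lifted[0])
    return start_idx / fps, end_idx / fps
```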

| Task | Video | Frame count | Number of actions | Caption |
|---|---|---|---|---|
| coffee, d0-d2 | coffee | 70 | 1 | Event 1: <2.2> <5.6> grasping the pod |
| nut assembly, d0 | nut_assembly | 308 | 3 | Event 1: <2.7> <5.5> grasping the square nut<br>Event 2: <5.5> <8.1> inserting the square nut<br>Event 3: <8.1> <10.6> grasping the round nut |
| square, d0 | square | 56 | 1 | Event 1: <2.1> <4.3> grasping the square nut |
| stack, d1 | stack | 49 | 1 | Event 1: <1.9> <3.3> grasping the red cube |
| threading, d0-d2 | threading | 56 | 1 | Event 1: <1.9> <6.5> grasping the mallet |
| three piece assembly, d0 d1 | three_piece_assembly | 232 | 3 | Event 1: <2.2> <5.4> grasping the first piece<br>Event 2: <5.5> <7.7> inserting the first piece<br>Event 3: <7.7> <10.5> grasping the second piece |

Preprocessing

  • Add timestamps using add_timestamps_to_all_videos_adaptive.py:

    python add_timestamps_to_all_videos_adaptive.py \
      -i mimicgen_dataset/videos \
      -o mimicgen_dataset/videos_ts

  • Convert to the Hugging Face dataset format using create_dataset_from_local.py:

    python create_dataset_from_local.py \
      --output mimicgen_dataset_hf \
      --prompts_pickle mimicgen_dataset/formatted_prompts_and_responses.pkl \
      --video_dir mimicgen_dataset/videos_ts

Each training example is stored as a conversation consisting of a list of messages. A sample annotation is shown below:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a specialized behavior analyst. Your task is to analyze the video and identify MULTIPLE discrete events with precise timestamps. At each frame, the timestamp is embedded at the bottom of the video. You need to extract the timestamp and answer the user question\nCRITICAL REQUIREMENTS:\n1. Extract timestamps from the bottom of each frame\n 2. Extract timestamps for USER-DEFINED events\n\nAnswer the question in the following format:\n<think>\nI will analyze the video systematically:\n. First, identify ALL visible timestamps throughout the video\n2. Identify USER-DEFINED events\n3. Extract timestamps for identified USER-DEFINED events. There will be different timestamps for each video.\n4. Always answer in English\n\nEvent 1: <start time> - <end time> - Event | reasoning\nEvent 2: <start time> - <end time> - Event | reasoning \nEvent 3: <start time> - <end time> - Event | reasoning\n\n[Continue for all events identified]\n</think>\n\n<answer>\nEvent 1: <start time> - <end time> Specific Event | detailed explanation.\nEvent 2: <start time> - <end time> Specific Event | detailed explanation.\nEvent 3: <start time> - <end time> Specific Event | detailed explanation.\n[Continue for all events identified]\n</answer>"
    },
    {
      "role": "user",
      "content": [
        {
          "type": "video",
          "video": "mimicgen_dataset/videos_ts/coffee_d0_demo2_agentview.mp4"
        },
        {
          "type": "text",
          "text": "Event 1: grasping the pod"
        }
      ]
    },
    {
      "role": "assistant",
      "content": "<answer>\nEvent 1: <2.9> <5.5> grasping the pod\n\n</answer>"
    }
  ]
}
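
create_dataset_from_local.py is not reproduced here. A minimal sketch of converting such conversation records into a Hugging Face dataset is shown below, assuming a list of dicts shaped like the sample above; the column name and the JSON-string encoding are illustrative choices, not necessarily those of the released script:

```python
import json
from datasets import Dataset

def build_dataset(records: list[dict], output_dir: str) -> Dataset:
    """Wrap conversation records into a Hugging Face dataset and save it to disk."""
    # Store each conversation as a JSON string so the column has a uniform type.
    rows = {"conversations": [json.dumps(r["messages"]) for r in records]}
    ds = Dataset.from_dict(rows)
    ds.save_to_disk(output_dir)
    return ds

# Usage: build_dataset(list_of_samples, "mimicgen_dataset_hf")
```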

Dataset configuration:

  • Input Type: Video
  • Resolution: 512×512
  • Encoding: H.264
  • Training Set Size: 3644 videos (MimicGen dataset)
  • Testing Set Size: 60 videos (Isaac Lab + BridgeData V2 + AgiBot)
  • FPS: 30
  • Caption or annotation: see table above

Dynamic Resolution and Pixel Budget Concept

Cosmos Reason 1 uses dynamic resolution: it automatically trades off frame rate and spatial resolution to fit a fixed per‑video total pixel budget:

  • budget = vision tokens × (2 × patch size)²

With patch size = 14 and vision tokens ≤ 8192:

  • 8192 × (2 × 14)² = 6,422,528 pixels per video

Because clip duration is fixed and you set fps, the system adjusts per‑frame resolution to stay within the budget:

pixels per frame ≈ budget / frames, where frames = duration × fps

The per-frame side resolution \( \approx \sqrt{\text{pixels per frame}} \) is then snapped down to the 14-pixel patch grid (see the sketch after the example below).

Example: 7s input video (Cosmos Reason params: vision tokens = 8192, fps = 8):

  1. 7 × 8 = 56 frames
  2. 6,422,528 / 56 ≈ 114,700 pixels per frame with side resolution: \( \sqrt{114,700} \approx 338 \) pixels
  3. snapped to patch grid: 24 × 14 = 336 pixels per side ⇒ 24² = 576 patches per frame
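
A small Python sketch of this resolution computation, using the patch size and token budget given above:

```python
import math

def per_frame_resolution(duration_s: float, fps: float,
                         vision_tokens: int = 8192, patch: int = 14) -> tuple[int, int]:
    """Return (side_pixels, patches_per_frame) under the fixed per-video pixel budget."""
    budget = vision_tokens * (2 * patch) ** 2                 # total pixels per video
    frames = int(duration_s * fps)
    pixels_per_frame = budget / frames
    side = int(math.sqrt(pixels_per_frame) // patch) * patch  # snap down to the 14-px grid
    return side, (side // patch) ** 2

print(per_frame_resolution(7, 8))   # -> (336, 576), matching the example above
```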

Guidance:

  • Ensure objects of interest cover multiple patches; otherwise they may be missed
  • If detail is insufficient, shorten the clip (trim or speed up) or lower fps to increase per‑frame spatial resolution

Supervised Fine-Tuning (SFT)

We apply supervised fine-tuning (SFT) to improve subtask boundary recognition, produce standardized timestamp formats, and enhance temporal reasoning accuracy for robotic manipulation sequences.

Training Configuration

We use the following mimicgen_sft.toml configuration optimized for 8 GPUs:

[custom.dataset]
path = "data/mimicgen_dataset_hf"

[train]
epoch = 120
output_dir = "outputs/mimicgen_sft"
compile = false
train_batch_per_replica = 32

[policy]
model_name_or_path = "nvidia/Cosmos-Reason1-7B"
model_max_length = 4096

[logging]
logger = ['console']
project_name = "cosmos_reason1"
experiment_name = "post_training_hf/mimicgen_sft"

[train.train_policy]
type = "sft"
conversation_column_name = "conversations"
mini_batch = 4

[train.ckpt]
enable_checkpoint = true

[policy.parallelism]
tp_size = 1
cp_size = 1
dp_shard_size = 8
pp_size = 1

We use the cosmos-rl library for fine-tuning:

# In the cosmos-reason1 root directory
cd examples/post_training_hf/
cosmos-rl --config configs/mimicgen_sft.toml scripts/custom_sft.py

Re-Evaluation

After fine-tuning, we re-evaluate the model on the test dataset from the Zero-shot Evaluation step. Re-evaluation is performed with postprocess.py:

python postprocess.py /path/to/results --gt-timestamps cube --use-start-time

--gt-timestamps options: nut, cube, bridge, toaster, chips, fork, cup

Results Comparison

| Task | VLM (max_tokens = 4096, fps = 8) | Mean absolute error ⇩ (s) | Mean relative error ⇩ (%) | Improvement |
|---|---|---|---|---|
| #1 Isaac Lab Cube Stacking (10 demos) | Cosmos-Reason 1 7B | 0.769 | 52.0 | baseline |
| | Cosmos-Reason 1 7B SFT | 0.434 | 28.6 | -23.4% MRE |
| | Qwen3 VL 8B | 0.508 | 32.4 | - |
| | Qwen3 VL 30B A3B | 0.459 | 30.4 | - |
| | Qwen3 VL 235B A22B | 0.260 | 17.3 | - |
| #2 Isaac Lab Humanoid Nut Pouring (5 demos) | Cosmos-Reason 1 7B | 1.487 | 61.3 | baseline |
| | Cosmos-Reason 1 7B SFT | 0.506 | 17.5 | -43.8% MRE |
| | Qwen3 VL 8B | 1.001 | 38.1 | - |
| | Qwen3 VL 30B A3B | 1.407 | 56.4 | - |
| | Qwen3 VL 235B A22B | 0.310 | 13.7 | - |
| #3 Real-world BridgeData V2 Cube Stacking (10 demos) | Cosmos-Reason 1 7B | 0.560 | 109.7 | baseline |
| | Cosmos-Reason 1 7B SFT | 0.560 | 109.7 | No change |
| | Qwen3 VL 8B | 0.489 | 94.4 | - |
| | Qwen3 VL 30B A3B | 0.560 | 109.7 | - |
| | Qwen3 VL 235B A22B | 0.158 | 26.5 | - |
| #4 Real-world AgiBot dataset (40 demos) | Cosmos-Reason 1 7B | 2.381 | 68.45 | baseline |
| | Cosmos-Reason 1 7B SFT | 1.571 | 43.03 | -24.75% |
| | Qwen3 VL 2B | 2.965 | 93.2 | - |
| | Qwen3 VL 8B | 1.769 | 50.33 | - |
| | Qwen3 VL 30B A3B | 2.127 | 58.98 | - |
| | Qwen3 VL 235B A22B | 1.608 | 45.45 | - |

Analysis

  • Mean relative error was reduced by roughly 2x on the simulation datasets and 1.5x on the real-world datasets.
  • Overall, fine-tuning reduced the mean relative error from 71% to 47%, improving timestamp accuracy and reliability.

Comparison Examples

The following example shows temporal localization improvements from fine-tuning:

Isaac Lab Humanoid Nut Pouring

Isaac Lab Nut Pouring

  • Before SFT:

    Event 1: <0.0> <4.5> - Picking up the red cylindrical object with Robot Arm 1.
    Event 2: <3.5> <5.5> - Placing the red cylindrical object into the yellow bowl.
    Event 3: <7.2> <9.0> - Grasping the yellow bowl with Robot Arm 2 and lifting it onto the gray sheet.
    
  • After SFT (120 epochs):

    Event 1: <1.7> <5.8> Picking up the red cylinder from the table.
    Event 2: <5.8> <7.9> Placing the red cylinder in the blue tray.
    Event 3: <7.9> <10.6> Picking up the yellow bowl from the table.
    
  • Ground Truth:

    Event 1: <1.7> <6.2> Picking up the red cylinder from the table.
    Event 2: <6.2> <8.3> Placing the red cylinder in the blue tray.
    Event 3: <8.3> <10.6> Picking up the yellow bowl from the table.
    

The fine-tuned model provides more accurate subtask timestamp boundaries compared to the baseline.

Conclusion

Key Achievements

  • Reduced timestamp localization error by up to 2x after fine-tuning.
  • Standardized output formatting, improving consistency and downstream parsing reliability.
  • Demonstrated scalability for automated timestamp annotation in MimicGen pipelines.

Lessons Learned

  • Fine-tuning on task-specific datasets significantly improves timestamp consistency and reduces variance across model runs.
  • Simulation data provides a more reliable training signal for SFT than real-world data.