
Cosmos Reason for MimicGen temporal localization

Authors: Aigul Dzhumamuratova, Alexander Efitorov, Hesam Rabeti, Jingyi Jin

Organization: NVIDIA

Overview

| Model | Workload | Use case |
|---|---|---|
| Cosmos Reason 1 | Post-training | Temporal localization for MimicGen data generation |

MimicGen is a system for automatically synthesizing large-scale, diverse robot-learning datasets from only a small number of human demonstrations.

It works by dividing each demonstration into object-centric subtasks annotated with timestamp boundaries.

Given a new scene with different object configurations, MimicGen:

  • selects an existing demonstration
  • spatially transforms each subtask to match the new context and stitches them together
  • executes the resulting trajectory to collect a new demonstration

This allows a small set of manually recorded demonstrations to be expanded into a large dataset with varied contexts and trajectories.

By leveraging the Cosmos Reason temporal localization capability, timestamp annotations for subtasks can be automatically generated from short simulation videos, reducing manual effort and improving scalability.

Benchmark Selection

For each specific task, we identify a set of subtasks that require timestamp boundary annotations. We overlay timestamps on the video and provide it as input to the model along with the following system and user prompts.

Isaac Lab Cube Stacking Events Timeline

System prompt:

You are a specialized behavior analyst. Your task is to analyze the video and identify MULTIPLE discrete events with precise timestamps. At each frame, the timestamp is embedded at the bottom of the video. You need to extract the timestamp and answer the user question
CRITICAL REQUIREMENTS:
1. Extract timestamps from the bottom of each frame
2. Extract timestamps for USER-DEFINED events

Answer the question in the following format:
<think>
I will analyze the video systematically:
1. First, identify ALL visible timestamps throughout the video
2. Identify USER-DEFINED events
3. Extract timestamps for identified USER-DEFINED events. There will be different timestamps for each video.
4. Always answer in English

Event 1: <start time> - <end time> - Event | reasoning
Event 2: <start time> - <end time> - Event | reasoning
Event 3: <start time> - <end time> - Event | reasoning

[Continue for all events identified]
</think>

<answer>
Event 1: <start time> - <end time> Specific Event | detailed explanation.
Event 2: <start time> - <end time> Specific Event | detailed explanation.
Event 3: <start time> - <end time> Specific Event | detailed explanation.
[Continue for all events identified]
</answer>

User prompt:

You should find the following 3 events in the input video
Event 1: grasping the red cube.
Event 2: releasing the red cube.
Event 3: grasping the green cube.
Extract the exact timestamps for each event.
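
Downstream code needs to parse the `<answer>` block into (start, end, description) triples. A minimal parsing sketch, assuming the `Event N: <start> <end> description | explanation` layout shown above (the regex and function name are illustrative and not taken from the released scripts):

```python
import re

# Accepts "Event 1: <2.2> <5.6> grasping the pod | ..." and the "-"-separated variant.
EVENT_RE = re.compile(
    r"Event\s+(\d+):\s*<?(\d+(?:\.\d+)?)>?\s*-?\s*<?(\d+(?:\.\d+)?)>?\s*-?\s*(.*)"
)

def parse_events(model_output: str) -> list[dict]:
    """Extract event number, start/end timestamps, and description from the <answer> block."""
    answer = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    block = answer.group(1) if answer else model_output
    events = []
    for line in block.splitlines():
        m = EVENT_RE.match(line.strip())
        if m:
            events.append({
                "event": int(m.group(1)),
                "start": float(m.group(2)),
                "end": float(m.group(3)),
                "description": m.group(4).split("|")[0].strip(),
            })
    return events
```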

Zero-shot Evaluation

Run Baseline Evaluation

For zero-shot evaluation we used a test dataset of 60 videos across 7 tasks. It includes both simulation environments from Isaac Lab Mimic (Cube Stacking, Humanoid Nut Pouring) and real-world datasets (AgiBot, BridgeData V2), and covers both robot-arm and humanoid scenarios. The fine-tuning dataset, introduced later, contains only robot-arm demonstrations, so the humanoid scenarios in the test dataset are fully out-of-distribution and particularly useful for evaluating how well the model generalizes during re-evaluation.

| Task | Video | Frame count | Number of videos | Caption |
|---|---|---|---|---|
| Isaac Lab: Cube Stacking | Cube Stacking | 179-247 | 10 | Event 1: grasping the red cube \| The robotic arm closes its grippers around the red cube, securing it firmly.<br>Event 2: releasing the red cube \| The robotic arm releases the red cube, allowing it to rest atop the blue cube.<br>Event 3: grasping the green cube \| The robotic arm closes its grippers around the green cube, picking it up from the surface. |
| Isaac Lab: Humanoid Nut Pouring | Nut Pouring | 288-313 | 5 | Event 1: Picking up the red cylinder from the table.<br>Event 2: Placing the red cylinder in the blue tray.<br>Event 3: Picking up the yellow bowl from the table. |
| BridgeData V2: Cube Stacking | Bridge Cube Stacking | 38-47 | 5 | Event 1: grasping the cube.<br>Event 2: releasing the cube. |
| AgiBot: task 358, Toaster | Toaster | 451 | 10 | Event 1: grasping bread.<br>Event 2: releasing bread.<br>Event 3: pushing the toaster.<br>Event 4: releasing the toaster. |
| AgiBot: task 366, Chips | Chips | 397-451 | 10 | Event 1: grasping a bag of chips.<br>Event 2: releasing a bag of chips. |
| AgiBot: task 378, Fork | Fork | 451 | 10 | Event 1: grasping a fork.<br>Event 2: releasing a fork.<br>Event 3: grasping a bowl.<br>Event 4: releasing a bowl. |
| AgiBot: task 412, Cup | Cup | 451 | 10 | Event 1: grasping a cup.<br>Event 2: grasping a rag.<br>Event 3: releasing a rag. |

Timestamps were overlaid on the videos with add_timestamps_to_all_videos_adaptive.py, a custom adaptive script that handles different FPS values:

python add_timestamps_to_all_videos_adaptive.py \
-i /path/to/videos \
-o /path/to/videos_with_ts
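
The overlay script itself is not reproduced here; a minimal OpenCV-based sketch of the same idea (burning the running timestamp into the bottom of each frame, scaled to the source FPS) could look like this — the file names and font settings are illustrative:

```python
import cv2

def add_timestamps(src_path: str, dst_path: str) -> None:
    """Burn the running timestamp (seconds) into the bottom of every frame."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t = frame_idx / fps  # timestamp in seconds, adaptive to the source FPS
        cv2.putText(frame, f"{t:.2f}s", (10, h - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
        out.write(frame)
        frame_idx += 1
    cap.release()
    out.release()
```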

We used the public checkpoint of Cosmos Reason 1 for evaluation with the following configuration:

  • max_tokens = 4096
  • fps = 4, 8, 12
  • temperature = 0.6

We tested the model using process_video_vllm_uni.py:

python process_video_vllm_uni.py \
  --model_path nvidia/Cosmos-Reason1-7B \
  --prompt cube \
  --video_dir /path/to/videos_with_ts \
  --output_dir /path/to/results \
  --fps_list 8
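
The evaluation script is not reproduced here; the sketch below shows the underlying vLLM call for a single timestamped video, following the standard Qwen2.5-VL-style vLLM workflow that Cosmos Reason 1 supports. The prompt variables and the video path are placeholders, and the actual script may differ in its details:

```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info  # helper used by Qwen2.5-VL-style models

MODEL = "nvidia/Cosmos-Reason1-7B"
SYSTEM_PROMPT = "You are a specialized behavior analyst. ..."   # full system prompt from above
USER_PROMPT = "You should find the following 3 events ..."      # full user prompt from above

llm = LLM(model=MODEL, limit_mm_per_prompt={"video": 1})
sampling = SamplingParams(temperature=0.6, max_tokens=4096)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "video", "video": "file:///path/to/videos_with_ts/demo.mp4", "fps": 8},
        {"type": "text", "text": USER_PROMPT},
    ]},
]

processor = AutoProcessor.from_pretrained(MODEL)
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

outputs = llm.generate(
    [{"prompt": prompt,
      "multi_modal_data": {"video": video_inputs},
      "mm_processor_kwargs": video_kwargs}],
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)
```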

Optimal setting: fps = 8.

The target accuracy for timestamp annotation was defined as < 30% relative error (relative to subtask duration). Beyond this threshold, MimicGen’s trajectory generation time increased significantly.
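
The metric reported in the tables below can be reproduced with a few lines. This is a sketch of how we interpret it — each boundary error is normalized by the ground-truth subtask duration — and the exact definition in postprocess.py may differ slightly:

```python
def timestamp_errors(pred: list[tuple[float, float]],
                     gt: list[tuple[float, float]]) -> tuple[float, float]:
    """Return (mean absolute error in seconds, mean relative error in %).

    Each event contributes its start- and end-boundary errors; the relative
    error normalizes by the ground-truth subtask duration.
    """
    abs_errors, rel_errors = [], []
    for (ps, pe), (gs, ge) in zip(pred, gt):
        duration = ge - gs
        for p, g in ((ps, gs), (pe, ge)):
            err = abs(p - g)
            abs_errors.append(err)
            rel_errors.append(100.0 * err / duration)
    return sum(abs_errors) / len(abs_errors), sum(rel_errors) / len(rel_errors)

# Example with the nut-pouring ground truth and a hypothetical prediction:
gt = [(1.7, 6.2), (6.2, 8.3), (8.3, 10.6)]
pred = [(1.7, 5.8), (5.8, 7.9), (7.9, 10.6)]
print(timestamp_errors(pred, gt))
```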

We also compared Cosmos Reason against the Qwen3-VL model family. Among them, Qwen/Qwen3-VL-235B-A22B-Instruct achieved accuracy close to manual annotation. We evaluated the Qwen3-VL models using process_video_vllm_qwen3_uni.py:

python process_video_vllm_qwen3_uni.py \
--model qwen-2b \
--prompt cube \
--num_trials 10 \
--video_dir /path/to/videos_with_ts \
--output_dir /path/to/results

Models tested: Qwen/Qwen3-VL-2B-Instruct, Qwen/Qwen3-VL-8B-Instruct, Qwen/Qwen3-VL-30B-A3B-Instruct.

For Qwen/Qwen3-VL-235B-A22B-Instruct, we served the model with vLLM following its multi-GPU serving tutorial:

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--tensor-parallel-size 8 \
--mm-encoder-tp-mode data \
--quantization fp8 \
--mm-processor-cache-type shm \
--async-scheduling \
--allowed-local-media-path /mnt/pvc/datasets

Videos were then processed through the OpenAI API-compatible interface using process_video_openai_api.py:

python process_video_openai_api.py \
--model Qwen/Qwen3-VL-235B-A22B-Instruct \
--prompt cube \
--num_trials 10 \
--video_dir /path/to/videos_with_ts \
--output_dir /path/to/results
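
The request pattern behind this step is a standard OpenAI-compatible chat call against the vLLM server started above. A hedged sketch is shown below; the `video_url` content type is vLLM's multimodal extension, the video sub-path under /mnt/pvc/datasets is hypothetical, and the actual script arguments may differ:

```python
from openai import OpenAI

SYSTEM_PROMPT = "You are a specialized behavior analyst. ..."   # full system prompt from above
USER_PROMPT = "You should find the following 3 events ..."      # full user prompt from above

# Point the client at the vLLM server started with `vllm serve` above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    temperature=0.6,
    max_tokens=4096,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            # vLLM resolves local file:// paths under --allowed-local-media-path
            {"type": "video_url",
             "video_url": {"url": "file:///mnt/pvc/datasets/videos_with_ts/demo.mp4"}},
            {"type": "text", "text": USER_PROMPT},
        ]},
    ],
)
print(response.choices[0].message.content)
```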

Example Baseline Results

| Task | VLM (max_tokens = 4096, fps = 8) | Mean absolute error ⇩ (s) | Mean relative error ⇩ (%) | Notes |
|---|---|---|---|---|
| #1 Isaac Lab Cube Stacking (10 demos) | Cosmos-Reason 1 7B | 0.769 | 52.0 | A 52% error for Cosmos Reason 1 reduces the MimicGen success rate by 3x (from 41% to 14.8%) and increases generation time by 3x. |
| | Qwen3 VL 8B | 0.508 | 32.4 | |
| | Qwen3 VL 30B A3B | 0.459 | 30.4 | |
| | Qwen3 VL 235B A22B | 0.260 | 17.3 | |
| #2 Isaac Lab Humanoid Nut Pouring (5 demos) | Cosmos-Reason 1 7B | 1.487 | 61.3 | Error is higher than on the Cube Stacking task. |
| | Qwen3 VL 8B | 1.001 | 38.1 | |
| | Qwen3 VL 30B A3B | 1.407 | 56.4 | |
| | Qwen3 VL 235B A22B | 0.310 | 13.7 | |
| #3 Real-world BridgeData V2 Cube Stacking (10 demos) | Cosmos-Reason 1 7B | 0.560 | 109.7 | Performance on real-world data is worse than on simulated data. |
| | Qwen3 VL 8B | 0.489 | 94.4 | |
| | Qwen3 VL 30B A3B | 0.560 | 109.7 | |
| | Qwen3 VL 235B A22B | 0.158 | 26.5 | |
| #4 Real-world AgiBot dataset (40 demos) | Cosmos-Reason 1 7B | 2.381 | 68.45 | Accuracy is lower than on the humanoid tasks and worse than on the simulated datasets. |
| | Qwen3 VL 2B | 2.965 | 93.2 | |
| | Qwen3 VL 8B | 1.769 | 50.33 | |
| | Qwen3 VL 30B A3B | 2.127 | 58.98 | |
| | Qwen3 VL 235B A22B | 1.608 | 45.45 | |

Problem Analysis

The zero-shot evaluation demonstrated that the current accuracy level is insufficient to automate the MimicGen data preparation pipeline.

Key Gaps Identified:

  • Gap 1: Inaccurate timestamps (mean relative error > 50%)
  • Gap 2: Inconsistent output formatting - different structures between runs.
  • Gap 3: Missed user-defined subtasks in some cases.

Success Criteria:

  • Mean relative error < 30%
  • All user-defined events are correctly identified and labeled.

Data Curation

For this project, we utilized the public MimicGen dataset, focusing on 6 representative manipulation tasks.

To improve visual quality and ensure consistency across samples, all videos were re-rendered with NVIDIA Omniverse. Re-rendering is required because the default MuJoCo renders produced by the robomimic/robosuite environments are of poor visual quality and degrade downstream video-based reasoning. To reproduce the rendering process or adapt it to your own tasks, refer to the robosuite documentation for enabling the Isaac rendering backends to re-render previously collected demonstrations with either ray tracing or path tracing.

The videos were generated at a resolution of 512x512 pixels and a frame rate of 30 FPS. Ground-truth subtask timestamps were extracted semi-automatically based on:

  • Object position trajectories
  • Robot arm kinematics

The automatically extracted timestamps were then manually refined to ensure accurate alignment with task transitions and key manipulation events.
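
As an illustration of the semi-automatic extraction, the sketch below derives a grasp interval from gripper aperture and object height logged by the simulator. The array names and thresholds are hypothetical, and the real extraction was refined by hand:

```python
import numpy as np

def grasp_interval(gripper_aperture: np.ndarray,
                   object_z: np.ndarray,
                   fps: float = 30.0,
                   close_thresh: float = 0.01,
                   lift_thresh: float = 0.02) -> tuple[float, float]:
    """Return (start, end) in seconds for a 'grasping' subtask.

    Start: first frame where the gripper closes below `close_thresh`.
    End:   first subsequent frame where the object has been lifted by `lift_thresh`.
    """
    closed = np.flatnonzero(gripper_aperture < close_thresh)
    start_idx = int(closed[0])
    lifted = np.flatnonzero(object_z[start_idx:] > object_z[start_idx] + lift_thresh)
    end_idx = start_idx + int(lifted[0])
    return start_idx / fps, end_idx / fps
```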

| Task | Video | Frame count | Number of actions | Caption |
|---|---|---|---|---|
| coffee, d0-d2 | coffee | 70 | 1 | Event 1: <2.2> <5.6> grasping the pod |
| nut assembly, d0 | nut_assembly | 308 | 3 | Event 1: <2.7> <5.5> grasping the square nut<br>Event 2: <5.5> <8.1> inserting the square nut<br>Event 3: <8.1> <10.6> grasping the round nut |
| square, d0 | square | 56 | 1 | Event 1: <2.1> <4.3> grasping the square nut |
| stack, d1 | stack | 49 | 1 | Event 1: <1.9> <3.3> grasping the red cube |
| threading, d0-d2 | threading | 56 | 1 | Event 1: <1.9> <6.5> grasping the mallet |
| three piece assembly, d0 d1 | three_piece_assembly | 232 | 3 | Event 1: <2.2> <5.4> grasping the first piece<br>Event 2: <5.5> <7.7> inserting the first piece<br>Event 3: <7.7> <10.5> grasping the second piece |

Preprocessing

  • Add timestamps using add_timestamps_to_all_videos_adaptive.py:

    python add_timestamps_to_all_videos_adaptive.py \
      -i mimicgen_dataset/videos \
      -o mimicgen_dataset/videos_ts

  • Convert to the Hugging Face dataset format using create_dataset_from_local.py:

    python create_dataset_from_local.py \
      --output mimicgen_dataset_hf \
      --prompts_pickle mimicgen_dataset/formatted_prompts_and_responses.pkl \
      --video_dir mimicgen_dataset/videos_ts

Each training example is stored as a conversation consisting of a list of messages. A sample annotation is shown below:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a specialized behavior analyst. Your task is to analyze the video and identify MULTIPLE discrete events with precise timestamps. At each frame, the timestamp is embedded at the bottom of the video. You need to extract the timestamp and answer the user question\nCRITICAL REQUIREMENTS:\n1. Extract timestamps from the bottom of each frame\n 2. Extract timestamps for USER-DEFINED events\n\nAnswer the question in the following format:\n<think>\nI will analyze the video systematically:\n. First, identify ALL visible timestamps throughout the video\n2. Identify USER-DEFINED events\n3. Extract timestamps for identified USER-DEFINED events. There will be different timestamps for each video.\n4. Always answer in English\n\nEvent 1: <start time> - <end time> - Event | reasoning\nEvent 2: <start time> - <end time> - Event | reasoning \nEvent 3: <start time> - <end time> - Event | reasoning\n\n[Continue for all events identified]\n</think>\n\n<answer>\nEvent 1: <start time> - <end time> Specific Event | detailed explanation.\nEvent 2: <start time> - <end time> Specific Event | detailed explanation.\nEvent 3: <start time> - <end time> Specific Event | detailed explanation.\n[Continue for all events identified]\n</answer>"
    },
    {
      "role": "user",
      "content": [
        {
          "type": "video",
          "video": "mimicgen_dataset/videos_ts/coffee_d0_demo2_agentview.mp4"
        },
        {
          "type": "text",
          "text": "Event 1: grasping the pod"
        }
      ]
    },
    {
      "role": "assistant",
      "content": "<answer>\nEvent 1: <2.9> <5.5> grasping the pod\n\n</answer>"
    }
  ]
}
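
create_dataset_from_local.py is not reproduced here. A minimal sketch of converting such conversation records into a Hugging Face dataset is shown below, assuming a list of dicts shaped like the sample above; the column name and the JSON-string encoding are illustrative choices, not necessarily those of the released script:

```python
import json
from datasets import Dataset

def build_dataset(records: list[dict], output_dir: str) -> Dataset:
    """Wrap conversation records into a Hugging Face dataset and save it to disk."""
    # Store each conversation as a JSON string so the column has a uniform type.
    rows = {"conversations": [json.dumps(r["messages"]) for r in records]}
    ds = Dataset.from_dict(rows)
    ds.save_to_disk(output_dir)
    return ds

# Usage: build_dataset(list_of_samples, "mimicgen_dataset_hf")
```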

Dataset configuration:

  • Input Type: Video
  • Resolution: 512×512
  • Encoding: H.264
  • Training Set Size: 3644 videos (MimicGen dataset)
  • Testing Set Size: 60 videos (Isaac Lab + BridgeData V2 + AgiBot)
  • FPS: 30
  • Caption or annotation: see table above

Dynamic Resolution and Pixel Budget Concept

Cosmos Reason 1 uses dynamic resolution: it automatically trades off frame rate and spatial resolution to fit a fixed per‑video total pixel budget:

  • budget = vision tokens × (2 × patch size)²

With patch size = 14 and vision tokens ≤ 8192:

  • 8192 × (2 × 14)² = 6,422,528 pixels per video

Because clip duration is fixed and you set fps, the system adjusts per‑frame resolution to stay within the budget:

pixels per frame ≈ budget / frames, where frames = duration × fps

The per-frame side resolution \( \approx \sqrt{\text{pixels per frame}} \) is then snapped down to the 14-pixel patch grid (see the sketch after the example below).

Example: 7s input video (Cosmos Reason params: vision tokens = 8192, fps = 8):

  1. 7 × 8 = 56 frames
  2. 6,422,528 / 56 ≈ 114,700 pixels per frame with side resolution: \( \sqrt{114,700} \approx 338 \) pixels
  3. snapped to patch grid: 24 × 14 = 336 pixels per side ⇒ 24² = 576 patches per frame
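
A small Python sketch of this resolution computation, using the patch size and token budget given above:

```python
import math

def per_frame_resolution(duration_s: float, fps: float,
                         vision_tokens: int = 8192, patch: int = 14) -> tuple[int, int]:
    """Return (side_pixels, patches_per_frame) under the fixed per-video pixel budget."""
    budget = vision_tokens * (2 * patch) ** 2                 # total pixels per video
    frames = int(duration_s * fps)
    pixels_per_frame = budget / frames
    side = int(math.sqrt(pixels_per_frame) // patch) * patch  # snap down to the 14-px grid
    return side, (side // patch) ** 2

print(per_frame_resolution(7, 8))   # -> (336, 576), matching the example above
```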

Guidance:

  • Ensure objects of interest cover multiple patches; otherwise they may be missed
  • If detail is insufficient, shorten the clip (trim or speed up) or lower fps to increase per‑frame spatial resolution

Supervised Fine-Tuning (SFT)

We apply supervised fine-tuning (SFT) to improve subtask boundary recognition, produce standardized timestamp formats, and enhance temporal reasoning accuracy for robotic manipulation sequences.

Training Configuration

We use the following mimicgen_sft.toml configuration optimized for 8 GPUs:

[custom.dataset]
path = "data/mimicgen_dataset_hf"

[train]
epoch = 120
output_dir = "outputs/mimicgen_sft"
compile = false
train_batch_per_replica = 32

[policy]
model_name_or_path = "nvidia/Cosmos-Reason1-7B"
model_max_length = 4096

[logging]
logger = ['console']
project_name = "cosmos_reason1"
experiment_name = "post_training_hf/mimicgen_sft"

[train.train_policy]
type = "sft"
conversation_column_name = "conversations"
mini_batch = 4

[train.ckpt]
enable_checkpoint = true

[policy.parallelism]
tp_size = 1
cp_size = 1
dp_shard_size = 8
pp_size = 1

We use the cosmos-rl library for fine-tuning:

# In the cosmos-reason1 root directory
cd examples/post_training_hf/
cosmos-rl --config configs/mimicgen_sft.toml scripts/custom_sft.py

Re-Evaluation

After fine-tuning, we re-evaluate the model on the test dataset from the Zero-shot Evaluation step. Re-evaluation is performed with postprocess.py:

python postprocess.py /path/to/results --gt-timestamps cube --use-start-time

--gt-timestamps options: nut, cube, bridge, toaster, chips, fork, cup

Results Comparison

| Task | VLM (max_tokens = 4096, fps = 8) | Mean absolute error ⇩ (s) | Mean relative error ⇩ (%) | Improvement |
|---|---|---|---|---|
| #1 Isaac Lab Cube Stacking (10 demos) | Cosmos-Reason 1 7B | 0.769 | 52.0 | baseline |
| | Cosmos-Reason 1 7B SFT | 0.434 | 28.6 | -23.4% MRE |
| | Qwen3 VL 8B | 0.508 | 32.4 | - |
| | Qwen3 VL 30B A3B | 0.459 | 30.4 | - |
| | Qwen3 VL 235B A22B | 0.260 | 17.3 | - |
| #2 Isaac Lab Humanoid Nut Pouring (5 demos) | Cosmos-Reason 1 7B | 1.487 | 61.3 | baseline |
| | Cosmos-Reason 1 7B SFT | 0.506 | 17.5 | -43.8% MRE |
| | Qwen3 VL 8B | 1.001 | 38.1 | - |
| | Qwen3 VL 30B A3B | 1.407 | 56.4 | - |
| | Qwen3 VL 235B A22B | 0.310 | 13.7 | - |
| #3 Real-world BridgeData V2 Cube Stacking (10 demos) | Cosmos-Reason 1 7B | 0.560 | 109.7 | baseline |
| | Cosmos-Reason 1 7B SFT | 0.560 | 109.7 | No change |
| | Qwen3 VL 8B | 0.489 | 94.4 | - |
| | Qwen3 VL 30B A3B | 0.560 | 109.7 | - |
| | Qwen3 VL 235B A22B | 0.158 | 26.5 | - |
| #4 Real-world AgiBot dataset (40 demos) | Cosmos-Reason 1 7B | 2.381 | 68.45 | baseline |
| | Cosmos-Reason 1 7B SFT | 1.571 | 43.03 | -24.75% |
| | Qwen3 VL 2B | 2.965 | 93.2 | - |
| | Qwen3 VL 8B | 1.769 | 50.33 | - |
| | Qwen3 VL 30B A3B | 2.127 | 58.98 | - |
| | Qwen3 VL 235B A22B | 1.608 | 45.45 | - |

Analysis

  • Mean relative error was reduced by roughly 2x on the simulation datasets and 1.5x on the real-world datasets.
  • Overall, fine-tuning reduced the mean relative error from 71% to 47%, improving timestamp accuracy and reliability.

Comparison Examples

The following example shows temporal localization improvements from fine-tuning:

Isaac Lab Humanoid Nut Pouring

Isaac Lab Nut Pouring

  • Before SFT:

    Event 1: <0.0> <4.5> - Picking up the red cylindrical object with Robot Arm 1.
    Event 2: <3.5> <5.5> - Placing the red cylindrical object into the yellow bowl.
    Event 3: <7.2> <9.0> - Grasping the yellow bowl with Robot Arm 2 and lifting it onto the gray sheet.
    
  • After SFT (120 epochs):

    Event 1: <1.7> <5.8> Picking up the red cylinder from the table.
    Event 2: <5.8> <7.9> Placing the red cylinder in the blue tray.
    Event 3: <7.9> <10.6> Picking up the yellow bowl from the table.
    
  • Ground Truth:

    Event 1: <1.7> <6.2> Picking up the red cylinder from the table.
    Event 2: <6.2> <8.3> Placing the red cylinder in the blue tray.
    Event 3: <8.3> <10.6> Picking up the yellow bowl from the table.
    

The fine-tuned model provides more accurate subtask timestamp boundaries compared to the baseline.

Conclusion

Key Achievements

  • Reduced timestamp localization error by up to 2x after fine-tuning.
  • Standardized output formatting, improving consistency and downstream parsing reliability.
  • Demonstrated scalability for automated timestamp annotation in MimicGen pipelines.

Lessons Learned

  • Fine-tuning on task-specific datasets significantly improves timestamp consistency and reduces variance across model runs.
  • Simulation data provides a more reliable training signal for SFT than real-world data.