
Intelligent Transportation Post-Training with Cosmos Reason 1

Authors: Paris Zhang, Chintan Shah, Tomasz Kornuta
Organization: NVIDIA

Overview

Supervised Fine-Tuning (SFT) is used to improve the accuracy of a pre-trained model by teaching it to follow specific instructions or understand new tasks using labeled examples. While a base model learns general patterns from large, diverse data, SFT aligns the model to specific tasks with desired outputs by showing it clear input–output pairs. Using domain-specific data is essential: it embeds the specialized vocabulary, visual patterns, and reasoning needed for real-world scenarios. In this recipe, we show how to fine-tune the Cosmos Reason 1-7B model to understand the world from a traffic point of view: scene understanding, road attributes, and pedestrian situations.

Before fine-tuning the model, let's look at its zero-shot performance. The model identifies some of the content correctly but misidentifies one of the pedestrians crossing the road.


Here’s the end-to-end workflow to fine-tune Cosmos Reason 1: from data preparation and supervised fine-tuning on the prepared dataset to quantizing and deploying the model for inference.


Data Preparation

For this experiment, we used the Woven Traffic Safety (WTS) dataset from Woven by Toyota, Inc. This is a real-world, pedestrian-centric traffic video dataset featuring 255 traffic scenarios, including staged pedestrian-related accidents, across 1.2k video segments. It provides detailed textual descriptions of pedestrian and vehicle behavior, bounding box annotations, and traffic Visual Question Answering (VQA) with multiple-choice questions (MCQ).

Note: The goal of this cookbook is to demonstrate post-training on Cosmos-Reason 1. The dataset used in this experiment is governed by the terms specified in the WTS DATASET Terms of Use.

Here are examples of the video clips and textual descriptions of the pedestrian from the dataset. Image and text courtesy of Woven by Toyota, Inc.


For this experiment, we fine-tune the Cosmos Reason 1 model on the Environment VQA subset of the WTS dataset. The subset contains 341 videos with 5k MCQ question-answer pairs. The average video length is about 75 seconds.

Data Pre-processing

Step 1: Request and download the dataset from the WTS Dataset homepage. This dataset is owned and managed by Woven by Toyota; you will have to request access directly from them.

Step 2: Pre-process the VQA annotations into Llava format.

# From scripts/examples/reason1/intelligent-transportation directory
python data_preprocess.py --data_path /path/to/WTS/folder
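
For reference, a converted sample roughly follows the LLaVA conversation format. The entry below is a hedged sketch: the sample ID, video path, and question are hypothetical illustrations, and the exact schema emitted by data_preprocess.py may differ.

# Sketch of a single LLaVA-style entry (field values are hypothetical;
# inspect the JSON produced by data_preprocess.py for the exact schema)
sample = {
    "id": "environment_mcq_0001",                    # hypothetical sample ID
    "video": "example_scene/overhead_view.mp4",      # path relative to media_path
    "conversations": [
        {
            "from": "human",
            "value": "<video>\nWhat is the weather condition in the video?\n"
                     "A. Rainy  B. Snowy  C. Clear  D. Foggy",
        },
        {"from": "gpt", "value": "C. Clear"},
    ],
}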


Post-Training with Supervised Fine-Tuning (SFT)

After preprocessing, the WTS dataset is in Llava format and ready for training. To launch training, we follow the default cosmos-rl training command from the Cosmos Reason 1 Post-Training Llava Example:

# From scripts/examples/reason1/intelligent-transportation directory
cosmos-rl --config sft_config.toml custom_sft.py

Hyperparameter optimization

For this SFT experiment, we updated the default configuration in Cosmos Reason 1’s post-training Llava example. Here are the key updates that optimize this experiment for 8 A100 GPUs.

Training Configuration
[custom.dataset]
annotation_path = "/path/to/WTS/environment_mcq_llava_train.py"
media_path = "/path/to/WTS/videos/train"
system_prompt = "You are a helpful assistant that can answer questions about a street-view CCTV footage."

[custom.vision]
fps = 1
max_pixels = 81920

[train]
output_dir = "outputs/wts_environment_mcq"
optm_warmup_steps = 0
train_batch_per_replica = 16
enable_validation = false

[policy]
model_name_or_path = "nvidia/Cosmos-Reason1-7B"
model_max_length = 81920

[logging]
logger = ['console', 'wandb']
project_name = "cosmos_reason1"
experiment_name = "post_training/wts_environment_mcq"

[train.train_policy]
type = "sft"
mini_batch = 1
dataset.test_size = 0

[train.ckpt]
enable_checkpoint = true

[policy.parallelism]
tp_size = 1
cp_size = 1
dp_shard_size = 8
pp_size = 1

Ablation Study

We ablate the training with two common configurations of different frame resolutions:

  1. Max pixels 81,920 per frame: This translates to roughly 4k total vision tokens according to the average WTS video length. Please see below for the calculation.
  2. Total pixels 6,422,528: This translates to a total of 8k vision tokens. Please see below for the calculation.

Calculation for max_pixels 81,920 to 4k total vision tokens:

  1. Frame Calculation: On average, our training video is about 75 seconds long. With 1 fps, it is: 75 seconds * 1 fps = 75 frames per video
  2. Frame Grouping: According to the Qwen2.5-VL paper (Qwen2.5-VL is the backbone model of Cosmos Reason 1), the model groups every two consecutive frames into a single unit of input to the language model: 75 frames / 2 ≈ 38 frame pairs per video
  3. Tokens per Frame: The Qwen2.5-VL vision transformer processes images in patches; each vision token corresponds to a 28 x 28 = 784 pixel area. Since we set max pixels per frame to 81,920: 81,920 pixels/frame / 784 pixels/token ≈ 104 tokens/frame
  4. Total Vision Tokens: Multiply the number of frame pairs by the tokens per pair (each merged pair contributes roughly the tokens of a single frame): 38 frame pairs × 104 tokens/pair ≈ 3,952 tokens

Calculation for total_pixels 6,422,528 to 8k total vision tokens (both calculations are reproduced in the short sketch after this list):

  1. As noted above, based on Qwen2.5-VL model's architecture, each vision token represents a 28 x 28 = 784 pixel area.
  2. Since we set total_pixels as 6,422,528: 6,422,528 pixels / 784 pixels/token = 8,192 tokens
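
Both calculations can be reproduced in a few lines of Python (a small sketch that simply restates the arithmetic above):

import math

# Each vision token covers a 28 x 28 pixel patch
PIXELS_PER_TOKEN = 28 * 28  # 784
AVG_VIDEO_SECONDS = 75
FPS = 1

# Setting 1: max 81,920 pixels per frame
frames = AVG_VIDEO_SECONDS * FPS                # 75 frames
frame_pairs = math.ceil(frames / 2)             # Qwen2.5-VL merges two consecutive frames -> 38
tokens_per_frame = 81920 // PIXELS_PER_TOKEN    # ~104 tokens
print(frame_pairs * tokens_per_frame)           # 3,952 total vision tokens (~4k)

# Setting 2: total budget of 6,422,528 pixels per video
print(6422528 // PIXELS_PER_TOKEN)              # 8,192 total vision tokens (8k)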

Model Evaluation

After training, we evaluate the model on the validation set of the Environment VQA subset of the WTS dataset. The evaluation script below will save the model responses and accuracy score in the results directory.

Please update the eval_config.yaml with the path to the post-trained model and the validation set.

# From scripts/examples/reason1/intelligent-transportation directory
python evaluate.py --config eval_config.yaml
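
The exact output format written by evaluate.py is not reproduced here, but MCQ accuracy is simple exact-match scoring of the predicted option against the ground-truth answer. The sketch below is illustrative only: the results filename and the prediction/answer field names are assumptions, so adjust them to match the files in your results directory.

import json

# Hypothetical results file written by evaluate.py
with open("results/responses.json") as f:
    records = json.load(f)

# Count a response as correct if it starts with the ground-truth option letter
correct = sum(
    1 for r in records
    if r["prediction"].strip().upper().startswith(r["answer"].strip().upper())
)
print(f"MCQ accuracy: {correct / len(records):.2%}")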


Results

Quantitative Results

First, let's review the quantitative results for both experiments on the Environment VQA subset of the WTS dataset: one with max 81,920 pixels/frame (about 4K visual tokens) and one with a total-pixel budget of 6,422,528 (8K visual tokens). This subset is a collection of multiple choice questions (MCQ) on traffic and pedestrian videos. Overall, we see accuracy jump to over 90% for both experiments. There is little difference in training accuracy between the two, but the 4K visual token setting converges faster and gives you faster inference.

Training Time

We ran all the experiments on one node (8 GPUs) of A100. The table below captures the training time for the two settings with one training epoch; both were run with a 1 FPS sampling rate. As expected, with only 81,920 pixels/frame (about 4K vision tokens), the model converged in roughly half the time of the 8K vision token setting. In summary, you can train a highly accurate model on this amount of data in about an hour or less.

Method | FPS & Resolution | Training Time (on 8 A100s)
SFT | 1 FPS, max 81,920 pixels/frame (~4K vision tokens) | 35m
SFT | 1 FPS, total 6,422,528 pixels (8K vision tokens) | 1h 3m

Qualitative Results

After SFT training with multiple choice questions (MCQ), the model achieves a significant accuracy improvement on the validation-set MCQs for WTS videos. Additionally, the model answers open-ended questions on external videos more accurately than the zero-shot baseline. Below we show a qualitative comparison on open-ended questions for an unseen video outside the WTS dataset.


Model Deployment

The last step is to deploy the trained model for inference. You can deploy it using an NVIDIA-optimized NIM, through the VSS blueprint, or in your own application. Before deployment, we first quantize the LLM portion of the VLM to FP8 for faster inference.

FP8 Quantization

The script to quantize the model to FP8 is provided in the NVIDIA Cosmos Reason 1 repo.

Step 1: Clone the Cosmos Reason 1 repo.

Step 2: To run post-training quantization (PTQ), you need the following dependencies:

"vllm==0.9.2"
"transformers>=4.53.1"
"qwen-vl-utils[decord]"
"llmcompressor>=0.6.0"

Step 3: Run the quantization script.

python ./scripts/quantize_fp8.py --model_id 'nvidia/Cosmos-Reason1-7B' --save_dir 'Cosmos-Reason1-7B-W8A8-FP8'

Before deploying the quantized model for inference, please evaluate it for accuracy and ensure quantization doesn’t introduce any accuracy regression.
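
As a quick sanity check before the full evaluation, you can confirm the quantized checkpoint loads and generates with vLLM. This is a minimal text-only smoke test assuming the vllm==0.9.2 dependency listed above; the full accuracy check should still go through evaluate.py from the Model Evaluation section.

from vllm import LLM, SamplingParams

# Load the FP8-quantized checkpoint produced by quantize_fp8.py
llm = LLM(model="Cosmos-Reason1-7B-W8A8-FP8", max_model_len=8192)

# Simple text-only generation to confirm the model loads and responds
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(
    ["Describe what typically happens at a pedestrian crossing."], params
)
print(outputs[0].outputs[0].text)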

Deploy on NIM

NVIDIA NIM™ provides containers to self-host GPU-accelerated inferencing microservices for pretrained and customized AI models across clouds, data centers, and RTX™ AI PCs and workstations, with industry-standard APIs for simple integration into AI applications.

To deploy a post-trained checkpoint, go to the fine-tune-model section in the NIM documentation and open the "Cosmos Reason 1-7B" tab. NIM will automatically serve an optimized vLLM engine for this model. The model must be provided as a Hugging Face checkpoint or a quantized checkpoint.

export CUSTOM_WEIGHTS=/path/to/customized/reason1
docker run -it --rm --name=cosmos-reason1-7b \
    --gpus all \
    --shm-size=32GB \
    -e NIM_MODEL_NAME=$CUSTOM_WEIGHTS \
    -e NIM_SERVED_MODEL_NAME="cosmos-reason1-7b" \
    -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
    -u $(id -u) \
    -p 8000:8000 \
    $NIM_IMAGE
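
Once the container is running, you can query the model through the OpenAI-compatible endpoint that NIM exposes. The sketch below is a minimal text-only request; the port matches the docker run command above, and the exact payload format for passing video inputs to the VLM is described in the NIM documentation.

import requests

# Text-only example request against the NIM OpenAI-compatible API
payload = {
    "model": "cosmos-reason1-7b",  # matches NIM_SERVED_MODEL_NAME above
    "messages": [
        {"role": "user",
         "content": "What should a vehicle do when a pedestrian is waiting at a crosswalk?"}
    ],
    "max_tokens": 128,
}
response = requests.post(
    "http://localhost:8000/v1/chat/completions", json=payload, timeout=120
)
print(response.json()["choices"][0]["message"]["content"])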

Deploy with NVIDIA VSS Blueprint

NVIDIA Blueprint for video search and summarization (VSS) leverages both Vision-Language Models (VLM) and Large Language Models (LLM) to generate captions, answer questions, and summarize video content, enabling rapid search and understanding of videos on NVIDIA hardware.

By default, VSS uses the base Cosmos Reason 1-7B as its VLM. Deployment instructions for VSS are available in its official documentation. VSS supports two primary deployment methods: Docker Compose and Helm charts.

For Docker Compose deployment, navigate to the Configure the VLM section under “Plug-and-Play Guide (Docker Compose Deployment)” to integrate a custom fine-tuned Cosmos Reason 1 checkpoint.

export VLM_MODEL_TO_USE=cosmos-reason1
export MODEL_PATH=</path/to/local/cosmos-reason1-checkpoint>
export MODEL_ROOT_DIR=<MODEL_ROOT_DIR_ON_HOST>

Similarly, for Helm chart deployment, navigate to the Configure the VLM section under “Plug-and-Play Guide (Helm Deployment)”.

vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1
          - name: MODEL_PATH
            value: "/tmp/cosmos-reason1"
  extraPodVolumes:
  - name: local-cosmos-reason1-checkpoint
    hostPath:
      path: </path/to/local/cosmos-reason1-checkpoint>
  extraPodVolumeMounts:
  - name: local-cosmos-reason1-checkpoint
    mountPath: /tmp/cosmos-reason1


Conclusion

Supervised fine-tuning of Cosmos Reason 1 on traffic-specific data boosts accuracy from zero-shot levels to over 90% on traffic VQA tasks. Key insights:

  • Domain data matters: Specialized datasets drive substantial performance gains.
  • Efficient training: The 4K total vision token setting converged roughly twice as fast as 8K with similar accuracy.
  • Seamless deployment: Workflow supports quantization and deployment via NIM or VSS.

This methodology can be applied to any physical AI domain by substituting relevant datasets.