Cosmos Transfer 1 Weather Augmentation for Intelligent Transportation System (ITS) Images

Authors: Reihaneh Entezari • Charul Verma • Arihant Jain • Dharshi Devendran • Ratnesh Kumar Organization: NVIDIA

Model	Workload	Use Case
Cosmos Transfer 1	Inference	Data augmentation

This tutorial demonstrates how to use the Cosmos Transfer 1 model for Synthetic Data Generation (SDG) to augment data and improve the accuracy of Computer Vision (CV) or Vision-Language Model (VLM) algorithms downstream.

Setup and System Requirement

Why Weather Augmentation Matters

Acquiring Intelligent Transportation System (ITS) data in adverse weather conditions is time consuming and challenging, and adding weather-based diversity to pre-recorded datasets is prohibitive and impossible. This data scarcity leads to poor performance of computer vision models in real-world adverse weather scenarios. We can address this critical gap by using Cosmos Transfer 1 to perform weather augmentation, generating synthetic weather-diverse training data from existing clear-weather datasets, thus significantly improving the robustness and accuracy of downstream ITS object detection models across various weather conditions.

Demonstration Overview

This is a demonstration of using Cosmos Transfer 1 for weather augmentation of ITS images. To showcase the impact, this tutorial walks through a step-by-step Cosmos Transfer 1 weather augmentation process for ITS images to improve a downstream ITS object detector RT-DETR model.

Cosmos Transfer 1 Pipeline Components

Sample Input ITS Images/Videos

Start with clear daylight highway scenes as the base input for augmentation.

Input image

Note: If you are dealing with images only, you need to create a video with a repeating image frame before proceeding with the next steps and Cosmos Transfer 1.

Video Captioning Using a VLM

First, generate detailed captions of the input images using a VLM as captioner:

"The video depicts a busy highway scene during what appears to be late afternoon, as indicated by the long shadows cast by the trees and vehicles. The sky is clear and blue, suggesting fair weather with no visible clouds. The highway is multi-lane, with traffic flowing in both directions. There are several cars on the road, varying in color and model, indicating a mix of personal vehicles. The surrounding landscape includes dense greenery on both sides of the highway, with trees lining the edges. In the background, there are hills or low mountains, adding depth to the scene. The overall atmosphere is calm yet active, typical of a weekday commute"

Note: We used Qwen 2.5 VL for captioning with the following input prompt:

Describe this video in detail and please include weather or time of the day detail at the start.

Prompt Augmentation for Rainy Night Conditions Using a LLM

Next, use a LLM to augment the caption for the desired weather conditions (rainy night):

"The video depicts a busy highway scene during a rainy night, shrouded in deep darkness. The sky is obscured, and the only visible light comes from scattered headlights and dim streetlights, casting faint reflections on the rain-soaked road. The highway is multi-lane, with traffic flowing in both directions, and the surrounding landscape is barely visible, with dense greenery on both sides of the highway and hills or low mountains in the background, but these features are only faintly discernible due to the darkness. There are several cars on the road, varying in color and model, their headlights illuminating the rain-soaked road, where puddles and thin streams of water shimmer faintly. The overall atmosphere is calm yet active, typical of a nighttime commute."

NOTE: We used Llama 3.1 with the following input prompt for weather augmentation:

I have the following caption describing a video captured by a CCTV camera. I want to adapt this caption so that it accurately reflects the following environmental condition. The updated caption will be used as a prompt for a video generation model.

Please update the original caption minimally and realistically — do not add any details that are not already implied or explicitly stated. Avoid hallucinations, assumptions, or dramatic interpretations. Do not add events, objects, or actions that do not exist in the original caption. Simply modify the scene description to reflect the given weather or lighting condition while preserving the factual content and structure.

If the environmental condition indicates daytime, remove any mention of headlights, artificial lighting, or shadows caused by artificial sources. If the condition implies nighttime or low visibility (e.g., fog, heavy rain, snow at night), you may include realistic visual effects such as headlights, reflections, or dim street lighting — but only if consistent with the original caption.

Format the output as a single paragraph enclosed within angle brackets <...>. Do not include any other text.

Original Caption: {video_caption}

Environmental Condition to Reflect: {condition}

The weather condition description for rainy night is as follows:

Rainy night shrouded in deep darkness, with only scattered headlights and dim streetlights casting faint reflections on the rain-soaked road, where puddles and thin streams of water shimmer faintly.

Cosmos Transfer 1 Output Image for Rainy Night

Using the augmented prompt, Cosmos Transfer 1 generates realistic rainy night scenes while preserving the structural elements of the original scene.

Rainy night

Note: We have chosen the middle frame of the output video from Cosmos Transfer 1.

Control Parameters in Cosmos Transfer 1

In general, it is possible to control for vis, edge, segmentation, and depth when running Cosmos Transfer 1. However, experiments have demonstrated that when controlling only for segmentation and depth, the generated images are well suited for daylight/weather augmentations, especially when generating night scenes.

Recommended Control Configuration

The recommended control config file is as follows:

{
    "input_video_path": "assets/example1_input_video.mp4",
    "negative_prompt": "The video shows an unrealistic traffic scene with floating or jittery cars that ignore physics, sliding without friction, turning sharply without steering, or vanishing mid-motion. Vehicles overlap unnaturally, lack weight or inertia, and don't align with the road. Road markings are inconsistent or missing, and lanes appear distorted. Lighting is flickering and fake, while backgrounds look melted or warped. Pedestrians and traffic signals are misshapen, duplicated, or misplaced. Overall, the scene feels chaotic, lacks depth, structure, and visual coherence.",
    "guidance": 8.0,
    "sigma_max": 90.0,
    "depth": {
        "control_weight": 0.9
    },
    "seg": {
        "control_weight": 0.9
    }
}

Note: Control weights with magnitude of 0.9 are used so that the output image adheres better to the input image.

Removing Vis and Edge Controls

To illustrate why vis and edge controls are removed when using Cosmos Transfer 1, two rainy night images were generated from the same input image:

With all controls (vis, edge, seg, depth): Night scenes are not generated as expected.
With only seg and depth controls: The generated image achieves the desired darkness level.

Cosmos Transfer 1 Generated Image - All Controls (vis, edge, depth, seg control weights of 0.9)

Rainy night with all controls

Cosmos Transfer 1 Generated Image - Depth and Seg Only (control weights of 0.9)

Rainy night with depth and segmentation controls

As demonstrated, using all controls (vis, edge, depth, seg) fails to generate proper night scenes, while using only depth and seg controls produces the expected dark imagery.

Training Downstream ITS Detector

To illustrate the impact of Cosmos Transfer 1 weather augmentation on a downstream ITS detector, we have trained an RT-DETR detector with and without the Cosmos Transfer 1 augmented images (using all possible weather/lighting conditions). We evaluated the trained models on three public KPIs.

Results

Experiments were conducted with a set of ~12k real ITS images and ~84K weather augmented Cosmos Transfer 1 images using the pipeline described in this tutorial. Below are the results on the three KPIs:

ACDC Dataset

The ACDC dataset is an Intelligent Transportation System (ITS)-related dataset with different weather conditions such as snow, fog, rain, and night. The dataset contains approximately ~2k images and can be found here: https://acdc.vision.ee.ethz.ch/

Below are the AP50 results of the most common objects (car, person, bicycle) in the dataset along all weather conditions:

Result plot ACDC As shown above, the blue curves (which are from the trained detector with Cosmos Transfer 1 augmented images) have consistently higher AP50 compared to the red curves, across all weather/lightings and objects.

SUTD Dataset

The SUTD dataset is another ITS-related dataset with more diverse weather conditions such as snow, fog, rain, night, cloudy, and sunny. The dataset contains approximately ~10k images and can be found here: https://sutdcv.github.io/SUTD-TrafficQA/#/download

Below are the AP50 results of the most common objects (car, person, bicycle) in the dataset along all weather conditions:

Result plot SUTD As shown above, the blue curves (which are from the trained detector with Cosmos Transfer 1 augmented images) have consistently higher AP50 compared to the red curves, across all weather/lightings and objects.

DAWN Dataset

The DAWN dataset is also another ITS-related dataset with different weather conditions such as snow, fog, rain, and sandy. The dataset contains approximately ~1k images and can be found here: https://www.kaggle.com/datasets/shuvoalok/dawn-dataset

Below are the AP50 results of the most common objects (car, person) in the dataset along all weather conditions:

Result plot DAWN As shown above, the blue curves (which are from the trained detector with Cosmos Transfer 1 augmented images) have consistently higher AP50 compared to the red curves, across all weather/lightings and objects.

Conclusion

This tutorial demonstrates how Cosmos Transfer 1 can effectively augment ITS datasets with challenging weather conditions, leading to improved performance of downstream object detection models. These are the key insights:

Use only segmentation and depth controls for optimal weather augmentation.
Proper prompt engineering is crucial for realistic weather condition generation.
Synthetic augmentation significantly improves model performance on rare weather conditions.

For more details on implementation and training configurations, refer to the accompanying setup and configuration files in this repository.