
Style-Guided Video Generation with Cosmos Transfer 2.5

Authors: Fangyin Wei, Aryaman Gupta
Organization: NVIDIA

Overview

Model                 Workload    Use Case
Cosmos Transfer 2.5   Inference   Style-guided video generation using image references

Cosmos Transfer 2.5 introduces a powerful new capability: generating videos that combine structural control (edge/depth/segmentation) with style guidance from reference images. This enables users to create videos that maintain specific visual aesthetics while following precise motion and structure patterns.

Key Features

  • Image-Guided Style Transfer: Use any image as a style reference for video generation
  • Multi-Modal Control: Combine edge/depth/segmentation control with image prompts
  • Flexible Style Application: Control how strongly the reference image influences the output
  • Temporal Consistency: Maintains coherent style across all video frames

How It Works

  1. Input Control Video: Provide structural guidance through edge, depth, or segmentation
  2. Style Reference Image: Supply an image that defines the desired visual style without changing the structure guided by the input control video
  3. Text Prompt: Describe the scene and desired output
  4. Model Processing: Transfer 2.5 combines all inputs to generate stylized video

Note: While Cosmos Transfer 2.5 includes four control checkpoints (edge, blur, depth, and segmentation), the image prompt feature is only supported for edge, depth, and segmentation controls. Blur control is not compatible with image prompts as it already incorporates color and style guidance, making additional image prompts redundant and potentially conflicting.

Dataset and Setup

Input Data Requirements

For style-guided video generation, you need:

  • A control video (edge, depth, or segmentation)
  • A style reference image (JPEG/PNG)
  • A text prompt describing the desired output

Data Structure

The pipeline expects inputs in the following format:

input_directory/
├── control_video.mp4    # Edge/depth/segmentation video
├── style_image.jpg      # Reference image for style
└── prompt.txt           # Text description
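Before launching a job, it can help to verify that an input directory actually contains the three expected files. The sketch below is a minimal, hypothetical helper (not part of the released code) that checks the layout shown above and returns the resolved paths:

```python
from pathlib import Path


def validate_inputs(input_dir: str) -> dict:
    """Check that a style-guided generation directory has the expected files.

    Hypothetical helper; file names follow the layout shown above.
    """
    root = Path(input_dir)
    required = {
        "control_video": "control_video.mp4",  # edge/depth/segmentation video
        "style_image": "style_image.jpg",      # reference image for style
        "prompt": "prompt.txt",                # text description
    }
    paths, missing = {}, []
    for key, name in required.items():
        p = root / name
        if p.is_file():
            paths[key] = p
        else:
            missing.append(name)
    if missing:
        raise FileNotFoundError(
            f"Missing inputs in {root}: {', '.join(missing)}"
        )
    return paths
```

Running this once per input directory surfaces missing files early, rather than partway through inference.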

Results

The following two examples demonstrate how different environmental styles can be applied to the same edge-controlled motion:

Example 1

Text Prompt:

"The camera moves steadily forward, simulating the perspective of a vehicle driving down the street. This forward motion is smooth, without any noticeable shaking or abrupt changes in direction, providing a continuous view of the urban landscape. The video maintains a consistent focus on the road ahead, with the buildings gradually receding into the distance as the camera progresses. The overall atmosphere is calm and quiet, with no pedestrians or vehicles in sight, emphasizing the emptiness of the street."

[Videos: Input (Edge Control, Sunny Style, Sunset Style); Output (Base Generation, Sunny Style Applied, Sunset Style Applied)]

Example 2

Text Prompt:

"A scenic drive unfolds along a coastal highway. The video captures a smooth, continuous journey along a multi-lane road, with the camera positioned as if from the perspective of a vehicle traveling in the right lane. The road is bordered by a tall, green mountain on the right, which casts a shadow over part of the highway, while the left side opens up to a view of the ocean, visible in the distance beyond a row of low-lying vegetation and a sidewalk. Several vehicles, including two red vehicles, travel ahead, maintaining a steady pace. The road is well-maintained, with clear white lane markings and a concrete barrier separating the lanes from the mountain covered by trees on the right. Utility poles and power lines run parallel to the road on the left, adding to the infrastructure of the scene. The camera remains static, providing a consistent view of the road and surroundings, emphasizing the serene and uninterrupted nature of the drive."

[Videos: Input (Edge Control, Darker Style, Greener Style); Output (Base Generation, Darker Mood Applied, Greener Tone Applied)]

Key Observations

  • Style Preservation: The reference image's color palette, lighting, and mood are successfully transferred to the generated video
  • Structure Maintenance: Edge control ensures consistent motion and object boundaries across all style variations
  • Temporal Coherence: Style remains consistent throughout the video sequence
  • Flexible Application: Different styles can dramatically change the video's atmosphere while preserving the underlying motion

Configuration Examples

Basic Style-Guided Generation

Below is an example JSON input for the released code that generates the sunset-style output shown in Example 1.

{
    "name": "image_style",
    "prompt": "The camera moves steadily forward, simulating the perspective of a vehicle driving down the street. This forward motion is smooth, without any noticeable shaking or abrupt changes in direction, providing a continuous view of the urban landscape. The video maintains a consistent focus on the road ahead, with the buildings gradually receding into the distance as the camera progresses. The overall atmosphere is calm and quiet, with no pedestrians or vehicles in sight, emphasizing the emptiness of the street.",
    "video_path": "calm_street.mp4",
    "image_context_path": "sunset.jpg",
    "seed": 1,
    "edge": {
    }
}
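Job specs like the one above can also be assembled programmatically. The sketch below is a minimal, hypothetical helper (the function name is our own; the field names match the JSON example) that builds such a spec:

```python
def build_job_spec(name, prompt, video_path, image_path,
                   control="edge", control_weight=None, seed=None):
    """Assemble an input spec matching the JSON format shown above.

    Hypothetical helper; `control` selects one of the supported
    modality keys ("edge", "depth", or "seg").
    """
    spec = {
        "name": name,
        "prompt": prompt,
        "video_path": video_path,
        "image_context_path": image_path,
        # An empty object uses the modality's defaults, mirroring
        # the "edge": {} entry in the example above.
        control: {} if control_weight is None
                 else {"control_weight": control_weight},
    }
    if seed is not None:
        spec["seed"] = seed
    return spec
```

Serializing the result with `json.dump` reproduces a file in the same shape as the example above.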

Configuration Parameters

Parameter            Description                                                      Required
name                 Identifier for this generation job                               Yes
prompt               Text description of the desired output scene                     Yes
video_path           Path to the input RGB video (used to generate control signals)   Yes
image_context_path   Path to the style reference image (JPEG/PNG)                     Yes
seed                 Random seed for reproducibility                                  No
edge / depth / seg   Control modality configuration (use one)                         Yes
control_weight       Strength of structural control (0.0-1.0, default: 1.0)           No

Control Modality Options

You can use different control types depending on your needs. Only edge, depth, and segmentation support image prompts:

// Edge control - preserves structure and shape
{ "edge": { "control_weight": 1.0 } }

// Depth control - maintains 3D spatial consistency
{ "depth": { "control_weight": 1.0 } }

// Segmentation control - enables semantic replacement
{ "seg": { "control_weight": 0.8 } }
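As noted earlier, blur control does not accept image prompts. A guard like the one below, a hypothetical helper rather than part of the released code, catches that incompatibility before a job is submitted:

```python
# Control types that support image prompts, per the note above; blur
# already carries color/style information, so image prompts are rejected.
IMAGE_PROMPT_CONTROLS = {"edge", "depth", "seg"}


def check_image_prompt_compat(spec: dict) -> None:
    """Raise if a job spec pairs an image prompt with an unsupported control.

    Hypothetical helper; key names follow the JSON format shown above.
    """
    if "image_context_path" not in spec:
        return  # no image prompt, nothing to check
    controls = [k for k in ("edge", "depth", "seg", "blur") if k in spec]
    unsupported = [c for c in controls if c not in IMAGE_PROMPT_CONTROLS]
    if unsupported:
        raise ValueError(
            "image prompts are not supported with: " + ", ".join(unsupported)
        )
```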

Best Practices

Style Image Selection

  1. Lighting Consistency: Choose reference images with lighting that matches your intended scene
  2. Color Harmony: Select images with color palettes that complement your content
  3. Quality Matters: High-resolution reference images produce better style transfer
  4. Contextual Relevance: Images with similar environments work best

Parameter Tuning

  • Control Weight: Balances structure preservation against style flexibility
      • Higher values favor precise motion tracking
      • Lower values allow more artistic interpretation
  • Guidance Scale: Affects adherence to both the text prompt and the reference image
      • Higher values increase the influence of both the text prompt and the reference image
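A practical way to tune control_weight is to run the same job at several values and compare the outputs. The sketch below is a hypothetical helper (not part of the released code) that clones a job spec once per weight:

```python
import copy


def sweep_control_weight(base_spec, control="edge",
                         weights=(0.5, 0.75, 1.0)):
    """Clone a job spec once per control weight so outputs can be compared.

    Hypothetical helper; spec keys follow the JSON format shown earlier.
    """
    variants = []
    for w in weights:
        spec = copy.deepcopy(base_spec)
        # Suffix the name so each variant writes to a distinct output.
        spec["name"] = "{}_cw{}".format(base_spec["name"], w)
        spec[control] = {"control_weight": w}
        variants.append(spec)
    return variants
```

Each variant can then be submitted as a separate job, and the outputs inspected side by side to pick the weight that best balances structure and style.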

Applications

  • Film and Animation: Apply consistent visual styles across scenes
  • Content Creation: Transform videos to match brand aesthetics
  • Artistic Expression: Create unique visual interpretations
  • Environmental Simulation: Generate videos in different lighting/weather conditions
  • Style Consistency: Maintain visual coherence across video series

Troubleshooting

Common Issues and Solutions

Style Not Applying Strongly Enough

  • Increase guidance parameter
  • Use more distinctive reference images
  • Adjust prompt to emphasize style elements

Loss of Motion Coherence

  • Increase control_weight for edge/depth/segmentation
  • Reduce guidance if it's too dominant
  • Ensure control video quality is high

Color Bleeding or Artifacts

  • Check reference image quality
  • Reduce guidance scale
  • Adjust guidance and control weight balance

Resources