Transfer Model Evaluation (ControlNet / Cosmos Transfer)
Evaluate multi‑modality ControlNet models (e.g., Cosmos Transfer) for fidelity to control signals and overall video quality.
Applicability of Predict Metrics
All metrics documented in evaluation_predict.md apply equally to Transfer (ControlNet) models. Use them alongside the ControlNet‑specific metrics below for a comprehensive evaluation.
Core Metrics (Control Fidelity & Technical Quality)
Blur SSIM (Structural Similarity Index Measure)
This metric measures perceptual similarity after applying identical blur to predicted and ground‑truth videos; it is robust to minor misalignment and can be reported per region.
How this metric works
- Apply the same blur strength to both predicted and ground‑truth frames.
- Compute SSIM considering luminance, contrast, and structure on blurred frames.
- Average per‑pixel SSIM over frames; optionally compute FG/BG using masks.
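A minimal sketch of the steps above, assuming grayscale uint8 frames, a Gaussian blur, and scikit-image's `structural_similarity`; the kernel size and sigma here are illustrative, not the evaluator's actual settings.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def blur_ssim(pred_frames, gt_frames, ksize=9, sigma=2.0, mask=None):
    """Mean SSIM over frames after identical Gaussian blur.

    pred_frames, gt_frames: sequences of uint8 grayscale frames (H, W).
    mask: optional boolean (H, W) region mask (e.g., foreground);
          averages the per-pixel SSIM map inside the mask only.
    """
    scores = []
    for pred, gt in zip(pred_frames, gt_frames):
        # Blurring both sides identically makes the score tolerant
        # to small spatial misalignments.
        pred_b = cv2.GaussianBlur(pred, (ksize, ksize), sigma)
        gt_b = cv2.GaussianBlur(gt, (ksize, ksize), sigma)
        score, ssim_map = structural_similarity(gt_b, pred_b, full=True)
        scores.append(ssim_map[mask].mean() if mask is not None else score)
    return float(np.mean(scores))
```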
Canny‑F1 Score
This metric measures edge preservation accuracy; it treats edge detection as a binary classification task and reports F1 with precision/recall.
How this metric works
- Extract Canny edge maps for predicted and ground‑truth frames
- Define positives as “edge” and negatives as “non‑edge”
- Compute: TP (edge in both), FP (edge only in pred), FN (edge only in GT)
- F1 = 2 × (Precision × Recall) / (Precision + Recall); also report precision/recall
- (Optional) FG/BG evaluation via region masks
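A sketch of the F1 computation using OpenCV's `cv2.Canny`; the thresholds are illustrative placeholders, and TP/FP/FN counts are accumulated over all frames before computing precision and recall.

```python
import cv2
import numpy as np

def canny_f1(pred_frames, gt_frames, low=100, high=200, eps=1e-8):
    """Edge-preservation F1 between predicted and GT frames.

    Treats "edge" as the positive class of a binary classification.
    Frames must be uint8 grayscale; (low, high) are Canny thresholds.
    """
    tp = fp = fn = 0
    for pred, gt in zip(pred_frames, gt_frames):
        pred_e = cv2.Canny(pred, low, high) > 0
        gt_e = cv2.Canny(gt, low, high) > 0
        tp += np.sum(pred_e & gt_e)    # edge in both
        fp += np.sum(pred_e & ~gt_e)   # edge only in prediction
        fn += np.sum(~pred_e & gt_e)   # edge only in ground truth
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return f1, precision, recall
```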
Depth RMSE (Root Mean Square Error)
This metric measures scale‑invariant depth error after median scaling; it supports log‑space computation and masking invalid values.
How this metric works
- Use Scale‑Invariant RMSE (SI‑RMSE) for robustness to outliers
- Median scaling: scaled_pred = pred × (median(GT) / median(pred))
- Compute RMSE after scaling: RMSE = sqrt(mean((GT − scaled_pred)²))
- (Optional) compute in log‑space; mask zeros/invalid depth
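A sketch of SI-RMSE with median scaling, masking of invalid pixels, and optional log-space computation, assuming dense float depth maps of matching shape.

```python
import numpy as np

def si_rmse(pred_depth, gt_depth, log_space=False, eps=1e-8):
    """Scale-invariant RMSE with median scaling.

    pred_depth, gt_depth: float arrays of the same shape.
    Zero and non-finite depth values are masked out.
    """
    valid = np.isfinite(gt_depth) & (gt_depth > 0) & (pred_depth > 0)
    pred, gt = pred_depth[valid], gt_depth[valid]
    # Median scaling removes the global scale ambiguity of the prediction.
    pred = pred * (np.median(gt) / (np.median(pred) + eps))
    if log_space:
        pred, gt = np.log(pred + eps), np.log(gt + eps)
    return float(np.sqrt(np.mean((gt - pred) ** 2)))
```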
Seg mIOU (Mean Intersection Over Union)
This metric measures segmentation fidelity between predicted and ground‑truth masks with flexible matching strategies.
How this metric works
- For each object/segment: IOU = Intersection / Union
- Matching strategies: max‑IOU per GT segment or Hungarian (1‑to‑1 optimal assignment)
- Report the mean IOU across matched pairs and recall (the fraction of GT segments matched above an IOU threshold)
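A sketch of both matching strategies, assuming integer label maps with 0 as background; Hungarian matching uses SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def seg_miou(pred_mask, gt_mask, iou_thresh=0.5, hungarian=True):
    """Mean IOU over matched segments, plus GT recall.

    pred_mask, gt_mask: integer label maps (H, W); 0 = background.
    hungarian=True -> optimal 1-to-1 assignment; otherwise each GT
    segment simply takes its max-IOU predicted segment.
    """
    pred_ids = [i for i in np.unique(pred_mask) if i != 0]
    gt_ids = [i for i in np.unique(gt_mask) if i != 0]
    if not pred_ids or not gt_ids:
        return 0.0, 0.0
    # Pairwise IOU table: rows = GT segments, cols = predicted segments.
    iou = np.zeros((len(gt_ids), len(pred_ids)))
    for g, gid in enumerate(gt_ids):
        g_region = gt_mask == gid
        for p, pid in enumerate(pred_ids):
            p_region = pred_mask == pid
            union = np.sum(g_region | p_region)
            iou[g, p] = np.sum(g_region & p_region) / union if union else 0.0
    if hungarian:
        rows, cols = linear_sum_assignment(-iou)  # maximize total IOU
        matched = iou[rows, cols]
    else:
        matched = iou.max(axis=1)  # best prediction per GT segment
    miou = float(matched.mean())
    # Recall: fraction of all GT segments detected above the threshold.
    recall = float(np.sum(matched >= iou_thresh) / len(gt_ids))
    return miou, recall
```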
DOVER Score (Video Quality Assessment)
This metric measures technical video quality, focusing on clarity, compression artifacts, and motion smoothness (not aesthetics).
How this metric works
- Run DOVER (Disentangled Objective Video Quality Evaluator) on full videos.
- Assess clarity/sharpness, compression artifacts, motion smoothness, and overall technical quality.
- Return a single quality score; compute it for both predicted and ground‑truth videos for comparison.
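DOVER's inference entry point depends on how it is installed, so the sketch below treats the scorer as a caller-supplied callable; `dover_score` is a hypothetical stand-in, and only the predicted-vs-ground-truth comparison is shown.

```python
def compare_dover(pred_video, gt_video, dover_score):
    """Score both videos with DOVER and report the quality gap.

    dover_score: caller-supplied callable mapping a video path to a
    scalar technical-quality score (e.g., a thin wrapper around your
    DOVER installation's inference script). Hypothetical stand-in.
    """
    pred_q = dover_score(pred_video)
    gt_q = dover_score(gt_video)
    # A positive gap means the generated video scores below ground truth.
    return {"pred": pred_q, "gt": gt_q, "gap": gt_q - pred_q}
```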