Model Evaluation Predict
Author: Arslan Ali Organization: NVIDIA
This page focuses on evaluation of Predict (generative video) models. It introduces commonly used quality metrics, explains what each metric measures and how it works, and then provides source implementations with step‑by‑step instructions to run them.
Key Terms
| Term | Definition |
|---|---|
| FID (Fréchet Inception Distance) | A metric that measures the similarity between two sets of images by comparing their feature distributions extracted from a pre-trained neural network. |
| FVD (Fréchet Video Distance) | An extension of FID for videos that captures both spatial (appearance) and temporal (motion) quality by comparing video feature distributions. |
| Fréchet Distance | A statistical measure of similarity between two probability distributions; used in FID/FVD to quantify how "far apart" generated content is from real content. |
| Sampson Error | A geometric error metric that measures the distance from matched keypoints to their corresponding epipolar lines; used to evaluate multi-view consistency. |
| TSE (Temporal Sampson Error) | Sampson error computed between consecutive frames within a single camera view; measures temporal stability. |
| CSE (Cross-view Sampson Error) | Sampson error computed between simultaneous frames from different camera views; measures multi-view geometric alignment. |
| Epipolar Geometry | The geometric relationship between two camera views of the same 3D scene; defines constraints on where corresponding points can appear. |
Overview: Quality Metrics for Predict Models
Use these metrics to evaluate generative video models (Predict):
| Metric | Measures | Use Case | Better Direction |
|---|---|---|---|
| FID | Image realism and diversity | Single-frame quality assessment | Lower ↓ |
| FVD | Spatio-temporal video quality | Overall video quality with motion coherence | Lower ↓ |
| TSE | Temporal geometric consistency | Detecting flickering, jitter, or drift within views | Lower ↓ |
| CSE | Cross-view geometric consistency | Multi-camera alignment and 3D consistency | Lower ↓ |
For VLM-based assessment details, refer to Cosmos Reason as Reward and the Cosmos Reason Benchmark Example.
Metrics Cheat Sheet
Use this quick reference to interpret your evaluation scores:
Video Quality (FID/FVD)
| Rating | FID Score | FVD Score | Interpretation |
|---|---|---|---|
| Excellent | < 30 | < 100 | High-quality generation, close to ground truth |
| Good | 30 – 50 | 100 – 200 | Acceptable quality for most applications |
| Fair | 50 – 100 | 200 – 400 | Noticeable quality gaps; consider improvements |
| Poor | > 100 | > 400 | Significant quality issues; requires attention |
Geometric Consistency (Sampson Error)
| Rating | TSE/CSE (pixels) | Interpretation |
|---|---|---|
| Excellent | < 1.0 | Very high geometric consistency |
| Good | 1.0 – 3.0 | Acceptable for most applications |
| Fair | 3.0 – 5.0 | Noticeable inconsistencies; may need improvement |
| Poor | > 5.0 | Significant geometric errors |
Note: These thresholds are general guidelines. Acceptable ranges may vary depending on your specific use case, dataset characteristics, and downstream application requirements.
Video Quality Metrics (FID/FVD)
This metric evaluates the quality of generated videos using standardized metrics that compare predicted videos against ground truth.
Step 1: Install Metrics Dependencies
# Install all metrics dependencies
pip install -r scripts/metrics/requirements.txt
# Or install individually:
pip install decord torchmetrics[image] torch-fidelity # For FID
pip install cd-fvd decord einops scipy # For FVD
Step 2: Compute FID (Fréchet Inception Distance)
FID measures the quality and diversity of generated frames by comparing feature distributions from a pre‑trained Inception network.
What this metric measures
- Distance between distributions of real vs generated image features (lower is better)
How this metric works (high level)
- Extract 2048‑D features with InceptionV3 on real and generated frames.
- Fit Gaussians (means μ and covariances Σ) and compute Fréchet distance:
Example Command
Run the following command to compute FID:
python scripts/metrics/compute_fid_single_view.py \
--pred_video_paths "./path/to/predicted/*.mp4" \
--gt_video_paths "./path/to/ground_truth/*.mp4" \
--num_frames 57 \
--output_file fid_results.json
Arguments
--pred_video_paths,--gt_video_paths,--num_frames,--output_file
Step 3: Compute FVD (Fréchet Video Distance)
FVD extends FID to videos by evaluating spatio‑temporal features, capturing both appearance and motion.
What this metric measures
- Global video quality including temporal coherence and motion dynamics
- More faithful to user perception than purely frame‑based metrics
How this metric works (high level)
- Extract video features for real and generated clips.
- Fit Gaussians and compute Fréchet distance analogously to FID.
Best practices
- Use consistent clip length, frame rate, and spatial size.
- Tune
--batch_sizeto balance throughput and memory. - Ensure paired, alphabetically sorted file lists (pred ↔ GT).
Example Command
Run the following command to compute FVD:
python scripts/metrics/compute_fvd_single_view.py \
--pred_video_paths "./path/to/predicted/*.mp4" \
--gt_video_paths "./path/to/ground_truth/*.mp4" \
--num_frames 57 \
--batch_size 8 \
--target_size 224 224 \
--output_file fvd_results.json
Additional arguments
--batch_size: default: 8--target_size: default: 224 224
Output Format
FID Results (fid_results.json)
{
"FID": 45.23,
"num_pred_videos": 100,
"num_gt_videos": 100,
"num_frames_per_video": 57,
"total_pred_frames": 5700,
"total_gt_frames": 5700
}
FVD Results (fvd_results.json)
{
"FVD": 123.45,
"num_pred_videos": 100,
"num_gt_videos": 100,
"num_frames_per_video": 57,
"batch_size": 8,
"target_size": [224, 224]
}
Important Notes
- Matching Video Counts: Both metrics require equal numbers of predicted and ground truth videos.
- Video Ordering: Videos are sorted alphabetically—ensure consistent naming between predicted and GT sets
- Memory Management: FID loads all frames into memory; FVD processes in batches—adjust the
--batch_sizeparameter for FVD if needed - GPU Usage: These scripts use the GPU if available; otherwise, they fall back to the CPU.
- Supported Formats: These scripts support common video formats (MP4, AVI, MOV) via
decord.
Example: Complete Evaluation Pipeline
# Define video paths
VIDEO_PRED="./generated_videos/*.mp4"
VIDEO_GT="./ground_truth_videos/*.mp4"
# Run FID evaluation
python scripts/metrics/compute_fid_single_view.py \
--pred_video_paths "$VIDEO_PRED" \
--gt_video_paths "$VIDEO_GT" \
--output_file evaluation_fid.json
# Run FVD evaluation
python scripts/metrics/compute_fvd_single_view.py \
--pred_video_paths "$VIDEO_PRED" \
--gt_video_paths "$VIDEO_GT" \
--output_file evaluation_fvd.json
Geometrical Consistency Metrics (Sampson Error)
These metrics evaluate the geometric consistency of multi‑view videos and diagnose temporal instability and cross‑view misalignment that are not captured by FID/FVD.
This metric has the following benefits:
- Lower errors, leading to smoother motion (temporal consistency) and better multi‑view geometry
- Useful for multi‑camera or stitched 2×3 grid outputs
Step 1: Setup Conda Environment for Sampson Metrics
Create and activate the dedicated conda environment for Sampson error evaluation:
# Navigate to the sampson metrics directory
cd scripts/metrics/geometrical_consistency/sampson/
# Create the conda environment from the yml file
conda env create -f environment.yml
# Activate the environment
conda activate sampson
Note: The environment includes Python 3.10, CUDA 12.4.0 support, and the necessary vision packages (OpenCV, Kornia, PyColmap).
Step 2: Prepare Multi‑View Videos
Ensure your videos are in the required 2×3 grid format (in MP4):
Step 3: Compute Sampson Error Metrics
Run the evaluation script to compute both Temporal Sampson Error (TSE) and Cross‑view Sampson Error (CSE):
# Single video
python scripts/metrics/geometrical_consistency/sampson/run_cse_tse.py \
--input path/to/video.mp4 \
--output ./sampson_results \
--verbose
# Directory of videos
python scripts/metrics/geometrical_consistency/sampson/run_cse_tse.py \
--input ./path/to/videos/ \
--pattern "*.mp4" \
--output ./sampson_results
Arguments
--input,--output,--pattern,--verbose
Metrics Explanation
Sampson Error (overview)
- Provides airst‑order approximation of point‑to‑epipolar‑line distance given matched keypoints and the fundamental matrix; lower is better.
Temporal Sampson Error (TSE)
- Measures geometric consistency across consecutive frames within a view; lower values indicate smoother motion and fewer temporal artifacts.
Cross‑view Sampson Error (CSE)
- Measures geometric consistency across simultaneous views; lower values indicate better multi‑view alignment.
Output Format
Per‑Video Results (cse_tse/{video_id}.json)
{
"video_path": "/path/to/video.mp4",
"clip_id": "video_id",
"results": {
"T": { // Temporal errors
"front": {
"mean": 2.345,
"median": 2.123,
"frame_values": [...]
},
"cross_left": {...},
"cross_right": {...}
},
"C": { // Cross-view errors
"front-cross_right": {
"mean": 3.456,
"median": 3.234,
"frame_values": [...]
},
"front-cross_left": {...}
}
}
}
Aggregate Statistics (aggregate_stats.json)
{
"num_videos": 10,
"temporal": {
"front": {"mean": 2.5, "median": 2.3, "std": 0.8},
"overall": {"mean": 2.6, "median": 2.4, "std": 0.9}
},
"cross_view": {
"front-cross_right": {...},
"overall": {...}
}
}
Visualization Plots (cse_tse/{video_id}.png)
Generated plots show the following:
- Solid lines: Temporal Sampson Errors (TSE) for each view
- Dashed lines: Cross-view Sampson Errors (CSE) for view pairs
- Y-axis: Error in √pixels (capped at 10 for visibility)
- X-axis: Frame number
Interpreting Results
Error Value Ranges:
- Excellent (< 1.0 pixels): Very high geometric consistency
- Good (1.0 - 3.0 pixels): Acceptable consistency for most applications
- Fair (3.0 - 5.0 pixels): Noticeable inconsistencies; may need improvement
- Poor (> 5.0 pixels): Significant geometric errors
What High Errors Indicate:
- High TSE: Temporal instability (flickering, jitter, or drift in individual views)
- High CSE: Poor multi-view consistency (misaligned views, incorrect geometry)
- Frame spikes: Sudden error increases suggest problematic frames or scene changes
Example: Complete Sampson Evaluation Pipeline
# Setup environment
conda activate sampson
# Define video paths
VIDEO_DIR="./generated_multi_view_videos"
OUTPUT_DIR="./evaluation_sampson"
# Run Sampson error evaluation on all generated videos
python scripts/metrics/geometrical_consistency/sampson/run_cse_tse.py \
--input "$VIDEO_DIR" \
--pattern "*_gen.mp4" \
--output "$OUTPUT_DIR" \
--verbose
# Results will be in:
# - $OUTPUT_DIR/cse_tse/*.json (per-video metrics)
# - $OUTPUT_DIR/cse_tse/*.png (visualization plots)
# - $OUTPUT_DIR/aggregate_stats.json (summary statistics)
Important Notes
Note the following about Geometrical Consistency metrics:
- Video Format: Videos must be in 2×3 grid format, with 6 camera views.
- Feature Matching: These metrics use SIFT features for correspondence matching between frames/views.
- Memory Usage: Sufficient RAM is required for feature extraction and matching.
- GPU Support: These metrics automatically use GPU acceleration when available for faster processing.