Cosmos Reason Model Evaluation
Standard Benchmarks
Cosmos Reason models can be evaluated using standardized benchmarks that assess reasoning capabilities across diverse scenarios. The Cosmos Reason 1 Benchmark Example provides instructions for running evaluation subsets, including physical reasoning, spatial understanding, and temporal consistency assessments.
Custom Evaluation on Your Data
Use your own video data (e.g., robotics or egocentric footage) to probe task-specific reasoning; a minimal evaluation loop is sketched after the prompt templates below.
Prompt Templates
- "What is happening in this clip?"
- "Describe the motion"
- Domain-specific questions tailored to your use case
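The loop below is a minimal sketch of such a custom evaluation. It assumes a directory of MP4 clips and a placeholder `generate_answer` wrapper around your own Cosmos Reason inference setup (vLLM, Transformers, or an API endpoint); neither is part of an official interface. It logs every prompt/response pair to JSONL for later review.

```python
# Minimal sketch of a custom evaluation loop over your own clips.
# `generate_answer` is a placeholder for however you call the model;
# it is not part of any official Cosmos Reason API.
import json
from pathlib import Path

PROMPTS = [
    "What is happening in this clip?",
    "Describe the motion.",
    # Add domain-specific questions for your use case here.
]

def generate_answer(video_path: Path, prompt: str) -> str:
    """Placeholder: run your Cosmos Reason inference stack here."""
    raise NotImplementedError

def run_custom_eval(video_dir: str, out_file: str = "custom_eval.jsonl") -> None:
    """Query every clip with every prompt and log the responses for review."""
    with open(out_file, "w") as f:
        for video_path in sorted(Path(video_dir).glob("*.mp4")):
            for prompt in PROMPTS:
                answer = generate_answer(video_path, prompt)
                record = {"video": video_path.name, "prompt": prompt, "answer": answer}
                f.write(json.dumps(record) + "\n")

# run_custom_eval("my_robot_clips/")
```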
What to Measure
- Answer correctness: Manual review or LLM-as-a-judge (see the judge sketch after this list)
- Consistency across time: Temporal coherence of responses
- Groundedness: References what is actually visible
- Precision vs hallucination: Especially important post-training
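For answer correctness and groundedness, an LLM-as-a-judge pass can replace or complement manual review. The sketch below assumes an OpenAI-style judge client and a simple JSON rubric; the judge model, rubric wording, and scoring schema are illustrative choices, not part of Cosmos Reason.

```python
# Sketch of LLM-as-a-judge scoring for answer correctness and groundedness.
# Judge model and rubric are illustrative; swap in whichever judge you prefer.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading a video question-answering system.\n"
    "Given the question, a reference description of the clip, and the model's answer, "
    "return JSON: {\"correct\": 0 or 1, \"grounded\": 0 or 1, \"notes\": str}.\n"
    "Mark grounded = 0 if the answer mentions objects or events not in the reference."
)

def judge(question: str, reference: str, answer: str) -> dict:
    """Ask the judge model to score one (question, reference, answer) triple."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable judge model works
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nReference: {reference}\nAnswer: {answer}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```

Consistency across time can be checked with the same rubric by scoring answers for adjacent segments of the same clip and flagging contradictions.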
Automatic Metrics (During Post-Training)
Instruction Tuning (SFT)
Generate answers on a held-out set and measure the metrics below; a scoring sketch follows the list:
- Per-token loss / perplexity: Computed on held-out instruction–response pairs
- Text similarity: BLEU, ROUGE, METEOR vs ground‑truth captions
- Embedding similarity: CLIPScore, BERTScore vs reference answers
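The following is a minimal scoring sketch for these metrics, assuming the Hugging Face `evaluate` library and a checkpoint loadable as a causal LM with Transformers; the data format and metric selection (ROUGE-L, BERTScore) are illustrative, and CLIPScore is omitted because it additionally requires sampled frames.

```python
# Sketch of automatic SFT metrics on a held-out set: perplexity from the
# fine-tuned checkpoint, plus ROUGE-L and BERTScore against references.
import math
import torch
import evaluate

def perplexity(model, tokenizer, texts, device="cuda"):
    """Mean perplexity of the model over full instruction+response strings."""
    losses = []
    model.eval()
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
            loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
            losses.append(loss.item())
    return math.exp(sum(losses) / len(losses))

def text_similarity(predictions, references):
    """ROUGE-L and BERTScore F1 between generated and reference captions."""
    rouge = evaluate.load("rouge")
    bertscore = evaluate.load("bertscore")
    r = rouge.compute(predictions=predictions, references=references)
    b = bertscore.compute(predictions=predictions, references=references, lang="en")
    return {"rougeL": r["rougeL"], "bertscore_f1": sum(b["f1"]) / len(b["f1"])}

# Example wiring (checkpoint name is a placeholder):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("your-sft-checkpoint")
# model = AutoModelForCausalLM.from_pretrained("your-sft-checkpoint").to("cuda")
# print(perplexity(model, tokenizer, held_out_texts))
```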
Video–Caption Post-Training
When post-training on <video, caption> pairs:
- Build an evaluation set in MCQ/BCQ format with ground truth (a scoring sketch follows this list)
- Track whether the model improves at video understanding over time
- Monitor reasoning and comprehension improvements
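The sketch below shows one way to score such an MCQ/BCQ set per checkpoint. The JSONL item format (`video`, `question`, `choices`, `answer`) and the `generate_answer` callable are assumptions for illustration, not a defined Cosmos Reason interface.

```python
# Sketch of tracking multiple-choice accuracy across post-training checkpoints.
import json
import re

def extract_choice(response: str) -> str | None:
    """Naive parser: pull the first standalone option letter (A-D) from a reply."""
    match = re.search(r"\b([A-D])\b", response.strip().upper())
    return match.group(1) if match else None

def mcq_accuracy(eval_file: str, generate_answer) -> float:
    """Fraction of held-out MCQ items the current checkpoint answers correctly."""
    correct = total = 0
    with open(eval_file) as f:
        for line in f:
            item = json.loads(line)
            prompt = item["question"] + "\n" + "\n".join(
                f"{letter}. {choice}" for letter, choice in zip("ABCD", item["choices"])
            )
            pred = extract_choice(generate_answer(item["video"], prompt))
            correct += int(pred == item["answer"])
            total += 1
    return correct / max(total, 1)

# Track this number per checkpoint to see whether video understanding improves.
```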