## Overview
Cosmos-RL provides native support for Vision-Language-Action (VLA) model reinforcement learning, enabling embodied AI agents to learn robotic manipulation tasks through interaction with simulators.
## VLA Models
Cosmos-RL supports two major series of VLA models:
### OpenVLA Series
OpenVLA is a vision-language-action model built on the Prismatic framework. Cosmos-RL supports two variants:
- **OpenVLA**: the original OpenVLA model architecture
- **OpenVLA-OFT**: OpenVLA with the Optimized Fine-Tuning (OFT) recipe, better suited to RL training
**Key Features:**

- Based on a SigLIP/DINOv2 + LLaMA architecture
- Action prediction through language-model token generation (sketched below)
- Norm statistics for action (de)normalization
- Compatible with the HuggingFace model hub
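To make the last two points concrete, here is a minimal sketch of how token-based action decoding typically works for OpenVLA-style models. It is illustrative only, not Cosmos-RL's internal code: the bin count, vocabulary size, and token-to-bin mapping below are assumptions.

```python
import numpy as np

# Illustrative OpenVLA-style action decoding (assumptions: 256 uniform bins
# over [-1, 1], bins mapped to the last 256 ids of a 32k LLaMA vocabulary,
# and per-dimension q01/q99 norm statistics from the training dataset).
N_BINS = 256
VOCAB_SIZE = 32000  # hypothetical vocabulary size
BIN_CENTERS = np.linspace(-1.0, 1.0, N_BINS)

def decode_action(token_ids: np.ndarray, q01: np.ndarray, q99: np.ndarray) -> np.ndarray:
    """Map generated action tokens back to a continuous 7-DoF action."""
    bins = np.clip(VOCAB_SIZE - token_ids, 0, N_BINS - 1)  # token id -> bin
    normalized = BIN_CENTERS[bins]                         # bin -> [-1, 1]
    # Undo dataset normalization with the stored norm statistics.
    return 0.5 * (normalized + 1.0) * (q99 - q01) + q01

# Example: decode 7 hypothetical tokens with unit norm statistics.
tokens = np.array([31744] * 7)
print(decode_action(tokens, q01=-np.ones(7), q99=np.ones(7)))
```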
**Model Configuration:**

```toml
[policy]
model_name_or_path = "Haozhan72/Openvla-oft-SFT-libero10-trajall"

[vla]
vla_type = "openvla-oft"
training_chunk_size = 16
```
### PI Series (PI0.5)
PI0.5 is a diffusion-style VLA model that predicts actions for robotic manipulation via flow matching.
**Key Features:**

- Based on the PaliGemma model with a diffusion-style action expert
- Flow-based action generation with a configurable number of denoising steps (see the sketch after the configuration below)
- Expert network for action prediction
- Supports multiple noise methods: `flow_sde`, `flow_cps`, `flow_noise`
**Model Configuration:**

```toml
[policy]
model_name_or_path = "sunshk/pi05_libero_pytorch"

[vla]
vla_type = "pi05"
training_chunk_size = 16

[custom.pi05]
num_steps = 5  # number of denoising steps
action_chunk = 10
action_env_dim = 7
noise_method = "flow_sde"
train_expert_only = true
```
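With this configuration, inference starts from Gaussian noise and integrates a learned velocity field for `num_steps` steps to produce an `action_chunk` x `action_env_dim` block of actions. The following is a minimal sketch of that loop under simple assumptions (deterministic Euler integration, a placeholder `expert` callable); the real expert conditions on vision/language features and on the configured `noise_method`.

```python
import torch

def sample_actions(expert, obs, num_steps=5, action_chunk=10, action_env_dim=7):
    """Integrate a learned velocity field from noise to an action chunk."""
    x = torch.randn(action_chunk, action_env_dim)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.tensor(i * dt)
        # `expert` is a hypothetical callable predicting the velocity v(x, t | obs);
        # one Euler step moves x along the flow toward the action distribution.
        x = x + dt * expert(x, t, obs)
    return x  # shape: (action_chunk, action_env_dim)

# Example with a dummy velocity field standing in for the real expert:
dummy_expert = lambda x, t, obs: -x
print(sample_actions(dummy_expert, obs=None).shape)  # torch.Size([10, 7])
```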
## Simulators
Cosmos-RL integrates with multiple robotics simulators for VLA training and evaluation:
### LIBERO (MuJoCo-based)
LIBERO (LIfelong learning BEnchmark on RObot manipulation tasks) is a MuJoCo-based simulation environment for long-horizon manipulation tasks.
**Supported Task Suites:**

- `libero_spatial`: 10 spatial reasoning tasks
- `libero_object`: 10 object interaction tasks
- `libero_goal`: 10 goal-oriented tasks
- `libero_10`: 10 long-horizon manipulation tasks
- `libero_90`: 90 diverse manipulation tasks
- `libero_all`: 130 tasks from all suites
**Features:**

- CPU/GPU rendering support
- Multi-environment parallel rollout (see the sketch below)
- Task initialization from dataset states
- Each environment can run a different task simultaneously
- Action space: 7-DoF (position, rotation, gripper)
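The rollout pattern looks like the following toy sketch; `VecEnv` here is a stand-in stub (not Cosmos-RL's actual env wrapper) used only to make the shapes concrete.

```python
import numpy as np

class VecEnv:
    """Stub vectorized env: parallel tasks, 7-DoF actions, image observations."""
    def __init__(self, num_envs: int):
        self.num_envs = num_envs
    def reset(self) -> np.ndarray:
        return np.zeros((self.num_envs, 224, 224, 3), dtype=np.uint8)
    def step(self, actions: np.ndarray):
        assert actions.shape == (self.num_envs, 7)
        obs = np.zeros((self.num_envs, 224, 224, 3), dtype=np.uint8)
        return obs, np.zeros(self.num_envs), np.zeros(self.num_envs, dtype=bool), {}

envs = VecEnv(num_envs=8)  # matches num_envs in the configuration below
obs = envs.reset()         # each parallel env can hold a different task
for _ in range(10):
    # 7-DoF action: 3 position deltas, 3 rotation deltas, 1 gripper command
    actions = np.random.uniform(-1, 1, size=(envs.num_envs, 7))
    obs, rewards, dones, infos = envs.step(actions)
```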
**Configuration:**

```toml
[validation]
dataset.name = "libero"
dataset.subset = "libero_10"
dataset.split = "val"

[vla]
num_envs = 8  # parallel environments per rank
```
**Data Source:** LIBERO GitHub
### BEHAVIOR-1K (IsaacSim-based)
BEHAVIOR-1K is a large-scale benchmark built on OmniGibson/IsaacSim for household manipulation tasks.
**Features:**

- GPU-accelerated physics simulation
- 1000+ diverse household tasks
- Photorealistic rendering
- Task descriptions from the BEHAVIOR benchmark
- Action space in the BEHAVIOR-1K NeurIPS'25 challenge: 23-DoF
**Configuration:**

```toml
[validation]
dataset.name = "b1k"
dataset.subset = "b1k"

[vla]
num_envs = 4
height = 256
width = 256
```
**Requirements:**

- NVIDIA GPU with RT Cores (L20, L40, or RTX series)
- OmniGibson installation
- BEHAVIOR-1K dataset
**Data Source:** Available through OmniGibson
## Reinforcement Learning Algorithms
Cosmos-RL supports multiple RL algorithms optimized for VLA training:
### GRPO (Group Relative Policy Optimization)
GRPO is a policy gradient algorithm designed for efficient on-policy learning without requiring a critic network.
**Key Features:**

- No value function / critic network needed
- Group-based advantage estimation using relative rewards
- Efficient for VLA training with sparse rewards
- Supports reward filtering to remove outliers
**Algorithm Configuration:**

```toml
[train.train_policy]
type = "grpo"
trainer_type = "grpo_vla"  # or "grpo_pi05" for PI models
variant = "dapo"
temperature = 1.6
epsilon_low = 0.2
epsilon_high = 0.28
lower_bound_ratio = 10.0
kl_beta = 0.0
```
**How It Works:**

1. Generate multiple rollouts per task.
2. Compute relative advantages within each group (task).
3. Use a policy gradient with a clipped objective.
4. Update the policy based on the advantage signals.

Switch to the `dapo` variant if you need asymmetric clipping or dynamic sampling.
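Here is a minimal sketch of steps 2-4 under simple assumptions (one scalar reward per rollout, all rollouts from the same task group); Cosmos-RL's trainer adds batching, masking, and distributed details on top of this idea.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, eps_low=0.2, eps_high=0.28):
    """Clipped policy-gradient loss with group-relative advantages."""
    # Group-relative advantage: normalize rewards within the group,
    # so no critic/value network is needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    # Asymmetric (DAPO-style) clipping: eps_high > eps_low leaves more room
    # to raise the probability of above-average rollouts.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return -torch.min(ratio * adv, clipped * adv).mean()

# Example: 8 rollouts of one task with sparse 0/1 rewards.
logp_old = torch.randn(8)
logp_new = logp_old + 0.05 * torch.randn(8)
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
print(grpo_loss(logp_new, logp_old, rewards))
```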
## Quick Start

1. Configure the training recipe by editing the TOML files under `configs/openvla-oft/` or `configs/pi05/`.
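For orientation, the pieces shown earlier on this page combine into one recipe roughly as follows (a sketch assembled from the examples above, not a complete recipe; the actual files under `configs/` contain additional fields):

```toml
[policy]
model_name_or_path = "Haozhan72/Openvla-oft-SFT-libero10-trajall"

[vla]
vla_type = "openvla-oft"
training_chunk_size = 16
num_envs = 8

[train.train_policy]
type = "grpo"
trainer_type = "grpo_vla"
variant = "dapo"

[validation]
dataset.name = "libero"
dataset.subset = "libero_10"
dataset.split = "val"
```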
2. Prepare the simulator environment:
**For LIBERO:**

```bash
cd /your/workspace/cosmos-rl
uv sync --extra vla
# or
pip install -e .[vla]

ROBOT_PLATFORM=LIBERO \
uv run cosmos-rl --config configs/openvla-oft/openvla-oft-7b-fsdp2-8p8r-colocate.toml \
    --log-dir logs \
    cosmos_rl/tools/dataset/libero_grpo.py
```
**For BEHAVIOR-1K:**

```bash
cd /your/workspace/cosmos-rl
apt install python3-dev  # if necessary
uv sync
source .venv/bin/activate

cd /your/workspace
git clone -b v3.7.2 https://github.com/StanfordVL/BEHAVIOR-1K.git
cd BEHAVIOR-1K
uv add pip cffi==1.17.1  # if necessary
apt install -y libsm6 libxt6 libglu1-mesa
UV_LINK_MODE=hardlink ./setup.sh --omnigibson --bddl --joylo --confirm-no-conda --accept-nvidia-eula
```
3. Prepare the pretrained model:
**For PI0.5:**

We provide PI0.5 models that ranked 2nd in the NeurIPS'25 BEHAVIOR-1K challenge, converted to PyTorch format. Download them from `fwd4xl/pi05-b1k-pt12-cs32-v1`.
4. Launch training:
**For OpenVLA-OFT:**

```bash
ROBOT_PLATFORM=LIBERO \
uv run cosmos-rl --config configs/openvla-oft/openvla-oft-7b-fsdp2-8p8r-colocate.toml \
    cosmos_rl/tools/dataset/libero_grpo.py
```
**For PI0.5:**

```bash
uv run cosmos-rl --config configs/pi05/pi05-b1k-grpo-colocate.toml \
    cosmos_rl/tools/dataset/b1k_grpo.py
```