Overview
Cosmos-RL provides native support for supervised finetuning (SFT) and reinforcement learning (RL) of world foundation models (WFMs).
Supported Models
Cosmos-RL uses diffusers-based training pipelines for both WFM SFT and WFM RL, so the same overall workflow applies across image and video diffusion models. The exported checkpoints are also diffusers-compatible, which makes post-training inference straightforward.
SD3
Cosmos-Predict2.5
SANA-Image/Video
Cosmos-RL supports both LoRA finetuning and full-model finetuning.
Configurations
The full configuration schema is defined in Configuration. In practice, the most important config groups for WFM jobs are:
[policy]: sets model_name_or_path, enables diffusers with is_diffusers = true, and optionally enables LoRA through policy.lora.
[policy.diffusers]: controls model-specific behavior such as is_video, max_prompt_length, inference_size, train_frames, and the sampling block under policy.diffusers.sample.
[policy.diffusers.sample]: controls rollout and inference behavior such as num_steps, eval_num_steps, guidance_scale, noise_level, solver, and deterministic_sampling.
[policy.parallelism]: defines cp_size and dp_shard_size for the trainer-side mesh.
[train]: controls optimization and runtime, especially output_dir, param_dtype, fsdp_reduce_dtype, train_batch_per_replica, ema_enable, and compile.
[train.ckpt]: controls checkpoint cadence and diffusers export. Keep export_safetensors = true if you want ready-to-load diffusers checkpoints under train.output_dir/safetensors/.
[validation]: configures periodic evaluation data and validation frequency.
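As an illustration, a recipe touching these groups looks roughly like the following. The values shown are illustrative only; start from one of the starter recipes under configs/ rather than writing a file from scratch.
[policy]
model_name_or_path = "stabilityai/stable-diffusion-3.5-medium"
is_diffusers = true

[policy.diffusers]
is_video = false
max_prompt_length = 256
train_frames = 1

[policy.diffusers.sample]
num_steps = 40
guidance_scale = 4.5

[policy.parallelism]
cp_size = 1
dp_shard_size = 8

[train]
output_dir = "./outputs/my-run"
train_batch_per_replica = 4
ema_enable = true

[train.ckpt]
export_safetensors = true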
SFT
Starter SFT recipes are available under configs/stable-diffusion-3-5/ and configs/sana/:
SD3: stable-diffusion-3-5-image-sft.toml, stable-diffusion-3-5-image-sft-lora.toml
SANA image: sana-image-sft.toml, sana-image-sft-lora.toml
SANA video: sana-video-sft.toml, sana-video-sft-lora.toml
The most important SFT-specific settings are:
Set train.train_policy.type = "sft".
Point train.train_policy.dataset.local_dir at your local image or video dataset.
Add [policy.lora] for adapter finetuning. If you omit it, the trainer updates the full transformer.
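Put together, a minimal SFT fragment looks like this (the dataset path is illustrative):
[train.train_policy]
type = "sft"

[train.train_policy.dataset]
local_dir = "./datasets/my_image_dataset"

# Optional: add a [policy.lora] table here for adapter finetuning;
# omitting it trains the full transformer.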
RL
Starter RL recipes are also available under configs/stable-diffusion-3-5/ and configs/sana/:
SD3: stable-diffusion-3-5-medium-nft.toml
SANA image: sana-image-nft.toml
SANA video: sana-video-nft.toml
Cosmos-Predict2.5: cosmos-predict2-5-2b-720-nft.toml
The most important RL-specific settings are:
Set train.train_policy.type = "grpo" and train.train_policy.trainer_type = "diffusion_nft".
Set mode = "colocated" and train.train_policy.uncentralized_training = true.
Configure prompt sampling and rollout behavior through [rollout] and [policy.diffusers.sample].
Enable remote rewards with train.train_policy.use_remote_reward = true and define one or more [[train.train_policy.remote_reward.reward_fns]] entries.
Keep rollout.parallelism and policy.parallelism aligned in colocated mode, especially dp_shard_size.
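A condensed RL fragment combining these settings might look like this. Key placement follows the starter *-nft.toml recipes, and the reward_fns fields depend on your reward service, so copy that table from a starter recipe:
mode = "colocated"

[train.train_policy]
type = "grpo"
trainer_type = "diffusion_nft"
uncentralized_training = true
use_remote_reward = true

[[train.train_policy.remote_reward.reward_fns]]
# One entry per reward function; see the Reward Service docs for the expected fields.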
Datasets
The dataset entrypoint is a Python launcher, so the main customization path is to edit the corresponding dataset file rather than changing a fixed built-in schema. For more details about dataset customization, please refer to Customization.
SFT
SFT uses cosmos_rl.tools.dataset.diffusers_dataset.
Point train.train_policy.dataset.local_dir to a local directory of paired metadata and visual assets.
Each sample is expected to share the same basename across metadata and asset files, for example 0001.json with 0001.jpg for images, or 0001.json with 0001.mp4 for videos.
policy.diffusers.is_video and policy.diffusers.train_frames determine whether the loader expects images or videos and how many frames are sampled for video training.
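For example, a minimal image dataset with matching basenames (a flat layout is shown for illustration; see the loader in cosmos_rl.tools.dataset.diffusers_dataset for the exact expectations):
datasets/<your_local_dataset>/
├── 0001.json
├── 0001.jpg
├── 0002.json
└── 0002.jpg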
RL
RL uses cosmos_rl.tools.dataset.diffusion_nft.
The built-in launcher currently supports prompt datasets such as pickscore, ocr, geneval, and dance_grpo_t2v through train.train_policy.dataset.name and split.
These datasets are prompt-first rather than paired image/video supervision datasets; the reward signal is provided separately by the reward service.
For custom prompt sources, metadata packing, or reward-service payloads, edit cosmos_rl/tools/dataset/diffusion_nft.py.
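For example, to train against the pickscore prompts (the split value is illustrative; check the dataset file for the available splits):
[train.train_policy.dataset]
name = "pickscore"
split = "train"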
Launch Training
Install the WFM dependencies first:
pip install '.[wfm]'
If you are training Cosmos-Predict2.5, also install:
pip install cosmos_guardrail --no-deps
Training progress can be monitored in Weights & Biases when logging.logger includes wandb.
SFT
Launch SFT with cosmos_rl.tools.dataset.diffusers_dataset. For example, SD3 LoRA SFT:
cosmos-rl --config ./configs/stable-diffusion-3-5/stable-diffusion-3-5-image-sft-lora.toml cosmos_rl.tools.dataset.diffusers_dataset
To run full-model SFT, switch to stable-diffusion-3-5-image-sft.toml or one of the SANA *-sft.toml recipes.
RL
Launch a reward service first by following Reward Service. Then point the trainer to that service through environment variables:
export REMOTE_REWARD_TOKEN="your_token"
export REMOTE_REWARD_ENQUEUE_URL="https://reward-service-host:PORT/api/reward/enqueue"
export REMOTE_REWARD_FETCH_URL="https://reward-service-host:PORT/api/reward/pull"
After that, launch DiffusionNFT training with cosmos_rl.tools.dataset.diffusion_nft. For example, SD3 RL:
cosmos-rl --config ./configs/stable-diffusion-3-5/stable-diffusion-3-5-medium-nft.toml cosmos_rl.tools.dataset.diffusion_nft
To run SANA RL, switch to sana-image-nft.toml or sana-video-nft.toml.
Use Trained Models
When train.ckpt.export_safetensors = true, diffusers-compatible artifacts are exported under train.output_dir/safetensors/step_<N>/.
If it is false, only the final checkpoint is exported to diffusers-compatible safetensors.
If policy.lora is configured, the adapter is saved under step_<N>/lora/.
If policy.lora is not configured, the full diffusers pipeline is saved directly under step_<N>/.
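For orientation, the two export layouts (the LoRA weight file name matches the weight_name used in the loading example below):
LoRA run:
<train.output_dir>/safetensors/step_<N>/lora/model.safetensors
Full-model run:
<train.output_dir>/safetensors/step_<N>/   (a complete diffusers pipeline directory)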
LoRA
SFT: load the base pipeline from policy.model_name_or_path, then attach the adapter from .../safetensors/step_<N>/lora.
RL: the loading flow is the same. RL LoRA checkpoints can be used with regular diffusers inference; the reward service is only needed during training.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

base_model = "stabilityai/stable-diffusion-3.5-medium"
adapter_dir = "./outputs/stable-diffusion-3-5-image-sft-lora/safetensors/step_30/lora"

pipe = DiffusionPipeline.from_pretrained(base_model, torch_dtype=torch.bfloat16)
pipe.load_lora_weights(
    adapter_dir,
    weight_name="model.safetensors",
    adapter_name="default",
)
pipe.set_adapters("default")
pipe = pipe.to("cuda")

result = pipe(
    prompt="A cinematic photo of a corgi astronaut on the moon.",
    num_inference_steps=40,
    guidance_scale=4.5,
)

if hasattr(result, "images"):
    result.images[0].save("sample.png")
else:
    export_to_video(result.frames[0], "sample.mp4", fps=16)
Full Pipeline
SFT: load the exported step directory directly with DiffusionPipeline.from_pretrained.
RL: the same loading path applies when training without policy.lora.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

checkpoint_dir = "./outputs/stable-diffusion-3-5-image-sft/safetensors/step_30"

pipe = DiffusionPipeline.from_pretrained(
    checkpoint_dir,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

result = pipe(
    prompt="A cinematic photo of a corgi astronaut on the moon.",
    num_inference_steps=40,
    guidance_scale=4.5,
)

if hasattr(result, "images"):
    result.images[0].save("sample.png")
else:
    export_to_video(result.frames[0], "sample.mp4", fps=16)
Add New Models
To add support for a new diffusers-based WFM, the best references are the existing implementations in cosmos_rl/policy/model/diffusers/sd3_model.py, cosmos_rl/policy/model/diffusers/sana_model.py, and cosmos_rl/policy/model/diffusers/cosmos_predict2_5_model.py.
The integration pattern is:
Add a new file under cosmos_rl/policy/model/diffusers/.
Define a subclass of DiffuserModel and register it with @ModelRegistry.register(DiffuserModelWeightMapper).
Return the diffusers pipeline class name from supported_model_types().
The supported_model_types() value must match DiffusionPipeline.load_config(model_name_or_path)["_class_name"], because ModelRegistry.build_diffusers_model() uses that value to choose the implementation class.
New files under cosmos_rl/policy/model/diffusers/ are auto-discovered by cosmos_rl.policy.model.__init__, so placing the new model file in that directory is enough as long as it is a regular .py module.
from cosmos_rl.policy.model.base import ModelRegistry
from cosmos_rl.policy.model.diffusers import DiffuserModel
from cosmos_rl.policy.model.diffusers.weight_mapper import DiffuserModelWeightMapper
from cosmos_rl.policy.config import DiffusersConfig


@ModelRegistry.register(DiffuserModelWeightMapper)
class MyModel(DiffuserModel):
    @staticmethod
    def supported_model_types():
        # Must match DiffusionPipeline.load_config(...)["_class_name"].
        return ["MyPipeline"]

    def __init__(self, config: DiffusersConfig, **kwargs):
        super().__init__(config, **kwargs)
        self.set_scheduler_timestep(self.train_sampling_steps)
        self.text_encoders = [self.text_encoder]
        self.tokenizers = [self.tokenizer]
The loaded pipeline is expected to expose a few standard components:
pipeline.transformer: the trainable denoiser used by the SFT, RL, EMA, and LoRA code paths.
pipeline.vae and pipeline.scheduler: used by latent encoding, decoding, and noise scheduling.
pipeline.image_processor or pipeline.video_processor: required by DiffuserModel.init_output_process().
If an upstream diffusers pipeline does not expose its denoiser as transformer and instead uses a different attribute such as unet, you should add a thin compatibility layer before trying to reuse the existing trainer stack. The current WFM diffusers path assumes the trainable module is available through self.transformer.
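A minimal sketch of such a shim is shown below. It assumes the loaded pipeline is reachable from the model instance; the exact attribute through which DiffuserModel stores the pipeline may differ, so treat this as illustrative rather than the current API.
from cosmos_rl.policy.model.diffusers import DiffuserModel
from cosmos_rl.policy.config import DiffusersConfig


class MyUNetModel(DiffuserModel):
    def __init__(self, config: DiffusersConfig, **kwargs):
        super().__init__(config, **kwargs)
        # Alias the UNet under the attribute name the SFT/RL/EMA/LoRA
        # code paths read, so the existing trainer stack can be reused.
        # "self.pipeline" is an assumption about DiffuserModel internals.
        self.transformer = self.pipeline.unet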
SFT
For SFT, the minimum required methods are:
text_embedding(): call pipeline.encode_prompt() and return a dictionary that matches the keyword arguments expected by self.transformer(...). For example, SD3 uses encoder_hidden_states and pooled_projections, while SANA and Cosmos-Predict2.5 do not use pooled projections.
visual_embedding(): encode input images or videos into training latents. This is usually model-specific because the VAE normalization can differ across pipelines.
set_scheduler_timestep(): prepare the scheduler and cache the timestep map used by training.
add_noise(): convert sampled timestep indices into actual scheduler timesteps and return (noised_latent, noise, timesteps).
The existing models illustrate the main latent-conversion differences:
SD3 and Cosmos-Predict2.5 use VAE shift_factor and scaling_factor when converting pixels to latents.
SANA image uses scaling_factor only.
SANA video applies an extra normalization step with latents_mean and latents_std.
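For reference, the SD3-style conversion follows the standard diffusers VAE convention. The snippet below is a standalone sketch using public diffusers APIs, not the exact Cosmos-RL code:
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", subfolder="vae"
)
pixels = torch.randn(1, 3, 512, 512)  # dummy batch scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample()
# SD3 / Cosmos-Predict2.5 style: shift, then scale.
latents = (latents - vae.config.shift_factor) * vae.config.scaling_factor
# SANA image style would instead be: latents * vae.config.scaling_factor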
Once these methods are implemented correctly, DiffuserModel.training_sft_step() can usually be reused without trainer changes.
RL
To support DiffusionNFT RL, you need two more model-specific entry points:
pipeline_with_logprob(): performs rollout-time sampling and returns the generated visuals together with all_latents and all_log_probs.
nft_prepare_transformer_input(): prepares the keyword arguments for self.transformer(...) during the RL training step. diffusers_trainer/nft_trainer.py calls this method directly.
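In skeleton form, the contract looks like the following; the argument lists are illustrative, so mirror the existing SD3/SANA implementations for the real signatures:
from cosmos_rl.policy.model.diffusers import DiffuserModel


class MyModel(DiffuserModel):
    def pipeline_with_logprob(self, prompts, **sample_kwargs):
        # Must return (visuals, all_latents, all_log_probs): the decoded
        # outputs plus per-step latents and log-probabilities recorded
        # during the rollout denoising loop.
        raise NotImplementedError

    def nft_prepare_transformer_input(self, batch):
        # Must return the keyword arguments for self.transformer(...);
        # diffusers_trainer/nft_trainer.py calls this method directly.
        raise NotImplementedError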
There are two common implementation styles in the current codebase:
SD3 uses a relatively thin wrapper around diffusers prompt encoding and latent preparation, then delegates the denoising loop to the shared run_sampling() helper.
SANA and Cosmos-Predict2.5 implement a custom sde_step_with_logprob() helper because their transition and log-probability computation is more model-specific.
Keep model-specific logic close to the model implementation rather than the trainer. Examples include:
classifier-free guidance details
prompt enhancement logic
resolution binning
video-specific conditioning inputs
custom latent packing or padding rules
Configuration
After the model class is added, the config side is usually straightforward:
Set policy.is_diffusers = true.
Set policy.model_name_or_path to the new diffusers repo or local exported pipeline.
Set policy.diffusers.is_video correctly so dataset preprocessing and inference use the right visual path.
Tune policy.diffusers.sample for the new scheduler and guidance behavior.
If you want LoRA training, set policy.lora.target_modules according to the module names under self.transformer.
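For example, a LoRA block targeting the attention projections found in typical diffusers transformers; the names are illustrative, so verify them against self.transformer.named_modules() before relying on them:
[policy.lora]
target_modules = ["to_q", "to_k", "to_v", "to_out.0"]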
Note
The current DiffuserModel implementation only supports tp_size = 1 and cp_size = 1. If a new diffusers backend needs tensor or context parallelism, that support has to be added explicitly.
Validation Checklist
Before wiring a new model into large training runs, it is worth validating these points with a tiny config:
Check that DiffusionPipeline.load_config(model_name_or_path)["_class_name"] matches the string returned by supported_model_types().
Build the model once and confirm that self.transformer, self.vae, self.scheduler, and the expected text encoders/tokenizers are registered.
Run text_embedding() and visual_embedding() on one small batch and verify the output shapes.
Run one SFT forward pass and confirm the returned transformer inputs match the model's forward signature.
If RL is needed, run one pipeline_with_logprob() call and one nft_prepare_transformer_input() call before launching a full RL job.
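The first checklist item can be scripted in a few lines; the model import path and repo id below are hypothetical placeholders:
from diffusers import DiffusionPipeline

from cosmos_rl.policy.model.diffusers.my_model import MyModel  # hypothetical module

model_name_or_path = "my-org/my-pipeline"  # hypothetical repo id
class_name = DiffusionPipeline.load_config(model_name_or_path)["_class_name"]
assert class_name in MyModel.supported_model_types(), (
    f"ModelRegistry.build_diffusers_model() would not find {class_name!r} "
    f"among {MyModel.supported_model_types()!r}"
)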
RL (deprecated)
Cosmos-RL supports the FlowGRPO and DDRL algorithms for world foundation model reinforcement learning.
Quick start: a quick-start guide for world foundation model RL:
Configure the training recipe by editing the toml files under configs/cosmos-predict2-5/.
Launch the reward service; see Reward Service for instructions.
Launch the training script with the configured recipe:
cosmos-rl --config ./configs/cosmos-predict2-5/cosmos-predict2-5-2b-480-grpo-mock-data.toml --wfm-mode cosmos_rl.tools.dataset.wfm_rl
Monitor training progress via Wandb.
Evaluate the trained world foundation model using the evaluation script. For Cosmos-Predict2.5, refer to this repo: cosmos-predict2.5.
Note
You can find detailed tutorials for DDRL here: DDRL Tutorials.
For a quick end-to-end test of the training pipeline, we recommend using the mock-data config file, i.e., ./configs/cosmos-predict2-5/cosmos-predict2-5-2b-480-grpo-mock-data.toml
Reward services: given the computation overhead, it is necessary to use a separate async service for reward computation.
You can launch a reward service by following the instructions here: Reward Service.
Configure the environment variables REMOTE_REWARD_TOKEN, REMOTE_REWARD_ENQUEUE_URL, and REMOTE_REWARD_FETCH_URL so the trainer can communicate with the reward service:
export REMOTE_REWARD_TOKEN="your_token"
export REMOTE_REWARD_ENQUEUE_URL="https://reward_service_host:PORT/api/reward/enqueue"
export REMOTE_REWARD_FETCH_URL="https://reward_service_host:PORT/api/reward/pull"
Models:
Cosmos-Predict2.5-2B/14B
Parallelism: HSDP/FSDP and context parallelism (CP) are supported for world foundation model training. Edit the related configuration in the toml file to enable these techniques:
[model]
fsdp_shard_size = 8
dp_replicate_size = 4
[model_parallel]
context_parallel_size = 2
Datasets:
Local dataset: you can use a local dataset for training. The local dataset structure follows Cosmos-Predict2.5. The dataset folder format should be:
datasets/<your_local_dataset>/
├── metas/
│   └── *.txt
├── videos/
│   └── *.mp4
└── text_embedding <optional>/
    └── *.pickle
Webdataset: configure S3 access via the following environment variables, then you can use webdataset for training.
PROD_S3_CHECKPOINT_ACCESS_KEY_ID: Your S3 access key ID.
PROD_S3_CHECKPOINT_SECRET_ACCESS_KEY: Your S3 secret access key.
PROD_S3_CHECKPOINT_ENDPOINT_URL: Your S3 endpoint URL.
PROD_S3_CHECKPOINT_REGION_NAME: Your S3 region name.
Storage:
Local storage: you can use local disk for storing checkpoints and logs.
S3 storage: configure S3 access via the environment variables above, then you can use S3 storage for storing checkpoints and logs.