Overview

Cosmos-RL is a distributed reinforcement-learning framework written in native PyTorch and built around a single, lightweight controller. By decoupling policy training from environment rollouts, it achieves:

  • Seamless scalability to thousands of GPUs

  • Modular, easy-to-extend design

  • Higher throughput via fully asynchronous execution

Key Features

  • Single-controller architecture – Coordinates all workers; eliminates heavyweight orchestration layers

  • Native PyTorch – Leverages familiar APIs and tooling; no custom C++/CUDA kernels required

  • Asynchronous, parallel policy and rollout – Maximizes hardware utilization; rollouts never sleep while the policy trains (a toy sketch of this pattern follows this list)

  • Fine-grained scaling – Independently scale policy (training) and rollout (data generation) workers
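
The last two features amount to a producer-consumer pattern: rollout workers keep filling a buffer of experience while policy workers drain it. The toy sketch below illustrates the pattern only; none of the names in it belong to the Cosmos-RL API.

    # Toy illustration of the asynchronous policy/rollout split described above.
    # This is a generic producer-consumer sketch of the idea, not the framework's
    # implementation, and none of these names come from Cosmos-RL.
    import queue
    import threading
    import time

    trajectory_queue = queue.Queue(maxsize=64)   # experience buffer between the two pools

    def rollout_worker(worker_id, stop):
        """Producer: keeps generating experience and never waits for training."""
        step = 0
        while not stop.is_set():
            trajectory_queue.put([worker_id, step])   # stand-in for a generated trajectory
            step += 1
            time.sleep(0.01)                          # simulate generation latency

    def policy_trainer(num_updates, batch_size=8):
        """Consumer: trains on whatever experience has already been produced."""
        for update in range(num_updates):
            batch = [trajectory_queue.get() for _ in range(batch_size)]
            # ... a real trainer would compute the RL loss and step the optimizer here ...
            print(f"update {update}: consumed {len(batch)} trajectories")

    stop = threading.Event()
    workers = [threading.Thread(target=rollout_worker, args=(i, stop), daemon=True)
               for i in range(4)]
    for w in workers:
        w.start()
    policy_trainer(num_updates=5)
    stop.set()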

Architecture

Disaggregated Policy & Rollout Workers

In Cosmos-RL, policy trainers and rollout actors run as separate worker pools, each of which can live on different hardware:

  • Flexibility – Spin up rollout workers on cost-effective GPUs (e.g. L40s) while reserving high-end accelerators (e.g. H100s) for policy training (see the layout sketch after this list)

  • Scalability – Scale the data-generation layer independently when rollouts become the bottleneck

  • Performance – True parallelism: no idle time spent offloading model checkpoints between tasks
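
A deployment along these lines could be expressed as a worker-pool layout like the illustrative sketch below. Every key in it is hypothetical except train.sync_weight_interval, which appears in the next section; the real configuration schema is defined by Cosmos-RL itself.

    # Purely illustrative worker-pool layout expressed as a Python dict.
    # These keys are NOT the real Cosmos-RL configuration schema; only
    # train.sync_weight_interval is mentioned elsewhere in this document.
    config = {
        "policy": {
            "num_workers": 8,            # training replicas on high-end accelerators
            "gpu_type": "H100",
            "parallelism": "fsdp",
        },
        "rollout": {
            "num_workers": 32,           # data-generation replicas on cheaper GPUs
            "gpu_type": "L40",
            "tensor_parallel_size": 1,
        },
        "train": {
            "sync_weight_interval": 10,  # push fresh weights every 10 training steps
        },
    }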

Non-blocking Weight Synchronization

Policy workers periodically push updated model weights to rollout actors. This happens every config.train.sync_weight_interval iterations:

  • Rollout side

    Rollout Worker Flow

    Rollout tasks are token-granular. Upon receiving a sync request, a worker will pause its current rollout at the next token boundary, apply the new weights, and resume immediately.

  • Policy side

    Policy Worker Flow

    Weight pushes are handled asynchronously so that the training loop never stalls (see the sketch after this list).
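
Taken together, the two flows look roughly like the self-contained sketch below. It is a toy illustration, assuming a shared in-process mailbox in place of the real transport; none of the helper names are Cosmos-RL APIs.

    # Hedged sketch of non-blocking weight synchronization with toy stand-ins.
    import threading
    import torch
    import torch.nn as nn

    # Shared "mailbox" standing in for the real transport between policy and
    # rollout workers (Cosmos-RL uses RDMA / point-to-point transfers instead).
    weights_mailbox = {"version": 0, "state_dict": None}
    mailbox_lock = threading.Lock()

    def push_weights_to_rollouts(state_dict, version):
        # Hypothetical push: runs in a background thread so training never stalls.
        with mailbox_lock:
            weights_mailbox["state_dict"] = state_dict
            weights_mailbox["version"] = version

    def policy_step(model, optimizer, batch, step, sync_interval):
        """Policy side: one training step, then a non-blocking weight push."""
        loss = model(batch).pow(2).mean()          # placeholder loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step % sync_interval == 0:
            state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
            threading.Thread(target=push_weights_to_rollouts,
                             args=(state, step), daemon=True).start()

    def rollout_loop(model, prompt, max_new_tokens):
        """Rollout side: check for new weights at every token boundary, then resume."""
        applied_version, tokens = 0, list(prompt)
        for _ in range(max_new_tokens):
            with mailbox_lock:
                if weights_mailbox["version"] > applied_version:
                    model.load_state_dict(weights_mailbox["state_dict"])
                    applied_version = weights_mailbox["version"]
            dummy_step = torch.randn(1, model.in_features)   # toy decoding input
            tokens.append(int(model(dummy_step).argmax()))   # toy "next token"
        return tokens

    # Toy usage with identical policy / rollout model shapes.
    policy_model, rollout_model = nn.Linear(16, 8), nn.Linear(16, 8)
    opt = torch.optim.SGD(policy_model.parameters(), lr=0.1)
    policy_step(policy_model, opt, torch.randn(4, 16), step=10, sync_interval=10)
    print(rollout_loop(rollout_model, prompt=[1, 2, 3], max_new_tokens=4))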

High-Performance Weight Transfer

Transferring large models across machines presents unique challenges:

  1. Network topologies vary (InfiniBand, Ethernet, NVLink).

  2. Source and target may use different parallelism schemes (tensor parallelism, pipeline parallelism, or FSDP).

Cosmos-RL overcomes these via:

  • RDMA-accelerated transfers over InfiniBand and NVLink

  • A topology-aware weight-mapping algorithm that avoids a global all-gather (sketched after this list)

  • Minimal peak memory footprint during synchronization

    Figure: point-to-point (P2P) weight transfer between policy and rollout workers (weight_p2p.png)
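
One way to picture the mapping step is a shard-overlap planner: for a parameter sharded evenly along its first dimension, each source rank works out exactly which row ranges each destination rank needs and schedules only those point-to-point copies, so no rank ever materializes the full tensor. The sketch below is a generic illustration of that idea, not the Cosmos-RL algorithm.

    # Simplified shard-overlap planner for a parameter split along dim 0.
    # Generic illustration of topology-aware mapping, not the Cosmos-RL
    # implementation; it only decides who sends which rows to whom.
    def shard_bounds(total_rows, world_size, rank):
        """Contiguous row range [start, end) owned by `rank` under even sharding."""
        rows_per_rank = total_rows // world_size
        return rank * rows_per_rank, (rank + 1) * rows_per_rank

    def build_send_plan(total_rows, src_world, dst_world):
        """Emit one point-to-point copy per overlapping (source, destination) shard pair."""
        plan = []
        for src in range(src_world):
            s_lo, s_hi = shard_bounds(total_rows, src_world, src)
            for dst in range(dst_world):
                d_lo, d_hi = shard_bounds(total_rows, dst_world, dst)
                lo, hi = max(s_lo, d_lo), min(s_hi, d_hi)
                if lo < hi:                       # shards overlap: schedule a P2P copy
                    plan.append({"src": src, "dst": dst, "rows": (lo, hi)})
        return plan

    # Example: policy trained with 4-way sharding, rollout running with 2-way sharding.
    for entry in build_send_plan(total_rows=1024, src_world=4, dst_world=2):
        print(entry)

Each planned slice then moves over the RDMA paths described above rather than through a collective, which is what keeps the peak memory footprint during synchronization small.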

Putting It All Together

By combining asynchronous execution, fine-grained rollout interruption, and optimized weight transfers, Cosmos-RL delivers a highly efficient, scalable RL training stack that:

  • Keeps GPUs busy generating and consuming experience

  • Scales linearly as you add more workers

  • Requires zero custom kernels or external orchestration frameworks