Overview
========

Cosmos-RL is a fully native PyTorch, distributed reinforcement-learning
framework built around a single, lightweight controller. By decoupling policy
training from environment rollouts, it achieves:

- Seamless scalability to thousands of GPUs
- A modular, easy-to-extend design
- Higher throughput via fully asynchronous execution

Key Features
============

* Single-controller architecture – coordinates all workers, eliminating heavyweight orchestration layers
* Native PyTorch – leverages familiar APIs and tooling; no custom C++/CUDA kernels required
* Asynchronous, parallel policy and rollout – maximizes hardware utilization; rollouts never idle while the policy trains
* Fine-grained scaling – independently scale policy (training) and rollout (data-generation) workers

Architecture
============

Disaggregated Policy & Rollout Workers
--------------------------------------

In Cosmos-RL, policy trainers and rollout actors run as separate worker pools,
each of which can live on different hardware:

* Flexibility

  - Spin up rollout workers on cost-effective GPUs (e.g. L40s)
  - Reserve high-end accelerators (e.g. H100s) for policy training

* Scalability

  - Scale the data-generation layer independently when rollouts become the
    bottleneck

* Performance

  - True parallelism: no idle time spent offloading model checkpoints between
    tasks

Non-blocking Weight Synchronization
-----------------------------------

Policy workers periodically push updated model weights to rollout actors. This
happens every ``config.train.sync_weight_interval`` iterations:

* Rollout side

  .. image:: /assets/rollout.png
     :alt: Rollout Worker Flow

  Rollout tasks are token-granular. Upon receiving a sync request, a worker
  pauses its current rollout at the next token boundary, applies the new
  weights, and resumes immediately.

* Policy side

  .. image:: /assets/policy.png
     :alt: Policy Worker Flow

  Weight pushes are handled asynchronously, so training loops never stall.

High-Performance Weight Transfer
--------------------------------

Transferring large models across machines presents unique challenges:

1. Network topologies vary (InfiniBand, Ethernet, NVLink).
2. Source and target may use different parallelism schemes (tensor
   parallelism, pipeline parallelism, or FSDP).

Cosmos-RL overcomes these via:

* RDMA-accelerated transfers over InfiniBand and NVLink
* A topology-aware weight-mapping algorithm that avoids a global all-gather
* A minimal peak memory footprint during synchronization

.. image:: /assets/weight_p2p.png

Putting It All Together
=======================

By combining asynchronous execution, fine-grained rollout interruption, and
optimized weight transfers, Cosmos-RL delivers a highly efficient, scalable RL
training stack that:

* Keeps GPUs busy generating and consuming experience
* Scales linearly as you add more workers
* Requires zero custom kernels or external orchestration frameworks
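
To make these mechanisms concrete, the sketches below illustrate each one in
simplified form. First, the policy-side push: a minimal training loop that
snapshots weights every ``sync_weight_interval`` steps and ships them from a
background thread, so the optimizer never waits on the network. The
``push_weights`` transport helper and the placeholder objective are
hypothetical stand-ins, not the Cosmos-RL API; only the interval-driven,
non-blocking push is taken from this page.

.. code-block:: python

   import threading

   import torch

   def push_weights(state_dict):
       """Hypothetical transport stub: ship a weight snapshot to the rollout
       pool (the real system would issue RDMA sends per shard)."""
       pass

   def train_loop(model, optimizer, data_iter, sync_weight_interval):
       for step, batch in enumerate(data_iter, start=1):
           loss = model(batch).mean()  # placeholder objective
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()

           if step % sync_weight_interval == 0:
               # Snapshot first, then push from a daemon thread so the
               # training loop continues without stalling on the transfer.
               snapshot = {k: v.detach().clone()
                           for k, v in model.state_dict().items()}
               threading.Thread(target=push_weights, args=(snapshot,),
                                daemon=True).start()

   if __name__ == "__main__":
       model = torch.nn.Linear(8, 8)
       opt = torch.optim.SGD(model.parameters(), lr=1e-3)
       batches = (torch.randn(4, 8) for _ in range(10))
       train_loop(model, opt, batches, sync_weight_interval=4)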
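
Second, the rollout side. The sketch below shows what token-granular
interruption means: the generation loop checks for a pending sync request at
every token boundary, applies the new weights if one has arrived, and resumes
immediately. The queue-based hand-off and the tiny demo model are illustrative
assumptions, not Cosmos-RL internals.

.. code-block:: python

   import queue

   import torch

   def generate(model, tokens, max_new_tokens, weight_updates):
       """Token-granular rollout loop (sketch, not Cosmos-RL internals)."""
       for _ in range(max_new_tokens):
           try:
               # Pending sync request? Apply it at this token boundary.
               new_state = weight_updates.get_nowait()
               model.load_state_dict(new_state)
           except queue.Empty:
               pass  # no update pending; keep generating
           with torch.no_grad():
               logits = model(tokens)[:, -1, :]           # last-token logits
               next_token = logits.argmax(dim=-1, keepdim=True)
           tokens = torch.cat([tokens, next_token], dim=-1)
       return tokens

   if __name__ == "__main__":
       vocab = 16
       model = torch.nn.Sequential(torch.nn.Embedding(vocab, 8),
                                   torch.nn.Linear(8, vocab))
       updates = queue.Queue()
       updates.put(model.state_dict())  # simulate one pending sync request
       out = generate(model, torch.zeros(1, 1, dtype=torch.long), 5, updates)
       print(out.shape)  # torch.Size([1, 6])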
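
Finally, the weight transfer. The real topology-aware mapping must reconcile
multi-dimensional sharding (tensor parallelism, pipeline parallelism, FSDP)
with the physical network, which is well beyond a short example; the sketch
below only illustrates the core idea of computing shard overlaps so that every
element moves in exactly one point-to-point send, with no global all-gather
and no full copy materialized on any rank. It assumes even 1-D splits with the
remainder on the last shard.

.. code-block:: python

   def plan_p2p_transfers(numel, n_src, n_dst):
       """For a parameter of `numel` elements split across n_src senders and
       n_dst receivers, return (sender, receiver, lo, hi) element ranges so
       that each element is shipped exactly once, point to point."""
       def bounds(n):
           base = numel // n
           edges = [i * base for i in range(n)] + [numel]
           return list(zip(edges[:-1], edges[1:]))

       plans = []
       for s, (s0, s1) in enumerate(bounds(n_src)):
           for d, (d0, d1) in enumerate(bounds(n_dst)):
               lo, hi = max(s0, d0), min(s1, d1)
               if lo < hi:  # shards overlap: schedule one p2p send
                   plans.append((s, d, lo, hi))
       return plans

   # Example: 10 elements, 2 tensor-parallel senders, 3 FSDP receivers.
   # -> [(0, 0, 0, 3), (0, 1, 3, 5), (1, 1, 5, 6), (1, 2, 6, 10)]
   print(plan_p2p_transfers(10, 2, 3))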