# Elastic Scaling and Fault Tolerance

Cosmos-RL supports elastic scaling and fault tolerance for RL jobs. This document provides a detailed explanation of how these features work.

## Elastic Scaling

Cosmos-RL can run multiple replicas of both the Policy and Rollout components across multiple nodes. Each replica shares the same parallelism configuration. When a replica starts, only the controller is aware of it; the controller then publishes commands to the existing replicas to perform operations such as building the global NCCL mesh and updating weights.

Cosmos-RL supports elastic launching of replicas, enabling users to dynamically add or remove replicas during execution. The system handles necessary coordination automatically.

### Replica Initialization Modes

There are two initialization modes, selected by the configuration field `n_init_replicas`:

- `n_init_replicas = N > 1`: Cosmos-RL waits until N replicas have joined before proceeding, and treats later replicas as dynamic.
- `n_init_replicas = 1` (default): Cosmos-RL immediately treats the first launched replica as active and dynamically integrates subsequent replicas.

Policy and Rollout components each maintain their own n_init_replicas setting, defaulting to 1.
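
The two modes can be illustrated with a minimal sketch. `ReplicaTracker` and its state names (`waiting`, `ready`, `dynamic`) are hypothetical and are not part of the Cosmos-RL API; only the `n_init_replicas` field and its default of 1 come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaTracker:
    """Hypothetical tracker for one component (policy or rollout)."""

    n_init_replicas: int = 1  # mirrors the config field; defaults to 1
    joined: list = field(default_factory=list)

    def register(self, name: str) -> str:
        """Record a newly launched replica and report how it is treated."""
        self.joined.append(name)
        if len(self.joined) < self.n_init_replicas:
            return "waiting"  # still short of the N initial replicas
        if len(self.joined) == self.n_init_replicas:
            return "ready"  # the initial group is complete
        return "dynamic"  # joined after the initial group


# Policy and rollout keep independent settings.
policy = ReplicaTracker(n_init_replicas=2)
rollout = ReplicaTracker()  # default of 1

policy_states = [policy.register(f"policy-{i}") for i in range(3)]
rollout_states = [rollout.register(f"rollout-{i}") for i in range(2)]
print(policy_states)
print(rollout_states)
```

With `n_init_replicas = 2`, the first policy replica waits, the second completes the initial group, and the third is treated as dynamic. With the default of 1, the first rollout replica is active immediately.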

### Policy Replica Initialization Flow

1. Users can launch policy replicas one by one.
2. Training begins only after N replicas have launched, where N is the policy `n_init_replicas` value.
3. The first replica performs weight initialization (e.g., from a checkpoint or a Hugging Face model).
4. Once N replicas are active:
   - The controller sends a BuildMesh command.
   - A selected weight-initialized replica broadcasts weights to the others.
5. The policy group is now ready to begin the RL workflow.

### Rollout Replica Initialization Flow

1. Rollout replicas are also launched one by one.
2. Initially, only the first rollout replica starts generating rollouts.
3. It synchronizes weights from an initialized policy replica.
4. Once N rollout replicas are active, where N is the rollout `n_init_replicas` value:
   - The controller sends a BuildMesh command.
   - The first rollout replica broadcasts weights to the others.
5. All rollout replicas now participate in rollout generation.

Note: For `n_init_replicas = 1`, the flow is the same with N = 1.
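
The two initialization flows can be summarized as an ordered list of steps. The sketch below is illustrative only: `init_flow`, the replica names, and the step strings are hypothetical. It assumes that the first replica is the one selected to broadcast, that the first rollout replica syncs weights before it starts generating, and that the broadcast is skipped when there are no peers (N = 1).

```python
from __future__ import annotations


def init_flow(component: str, replicas: list[str], policy_source: str = "policy-0") -> list[str]:
    """Return the ordered steps for bringing up the initial N replicas."""
    if not replicas:
        raise ValueError("need at least one replica")
    first, others = replicas[0], replicas[1:]
    if component == "policy":
        steps = [f"{first}: initialize weights (checkpoint or Hugging Face model)"]
    elif component == "rollout":
        steps = [
            f"{first}: sync weights from {policy_source}",
            f"{first}: start generating rollouts",
        ]
    else:
        raise ValueError(f"unknown component: {component}")
    # Once all N replicas are active, the controller builds the mesh.
    steps.append(f"controller: BuildMesh over {', '.join(replicas)}")
    if others:  # assumption: nothing to broadcast when N = 1
        steps.append(f"{first}: broadcast weights to {', '.join(others)}")
    steps.append(f"{component} group ready")
    return steps


policy_steps = init_flow("policy", ["policy-0", "policy-1"])
rollout_steps = init_flow("rollout", ["rollout-0"])
for step in policy_steps + rollout_steps:
    print(step)
```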

### Dynamically Launched Replicas

Cosmos-RL supports adding new replicas during runtime. These replicas will be integrated into the RL workflow as follows:

- Policy side:
  - The controller sends a BuildMesh command.
  - A chosen initialized replica sends weights to the new one (via policy-policy unicast).
- Rollout side:
  - The controller sends a BuildMesh command.
  - A chosen initialized rollout replica broadcasts weights to the new one (via rollout-rollout broadcast).

After that, the new replica is fully integrated.
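
As a rough sketch of the integration steps for a dynamically launched replica, the function below returns the commands a controller might issue. `integrate` and the tuple layout are hypothetical, and picking the first initialized replica as the weight source is an assumption; the text above only says that one is chosen.

```python
from __future__ import annotations

# Transfer mode per component, as described above.
TRANSFER = {
    "policy": "policy-policy unicast",
    "rollout": "rollout-rollout broadcast",
}


def integrate(component: str, initialized: list[str], new: str) -> list[tuple]:
    """Return the commands needed to bring `new` into the running group."""
    if component not in TRANSFER:
        raise ValueError(f"unknown component: {component}")
    if not initialized:
        raise ValueError("no initialized replica available as a weight source")
    source = initialized[0]  # assumption: any initialized replica would do
    return [
        ("BuildMesh", initialized + [new]),  # rebuild the mesh with the new member
        (TRANSFER[component], source, new),  # then transfer weights to it
    ]


policy_cmds = integrate("policy", ["policy-0"], "policy-1")
rollout_cmds = integrate("rollout", ["rollout-0"], "rollout-1")
print(policy_cmds)
print(rollout_cmds)
```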

## Demo: Elastic Scaling in Action

### Step 1: Launch Controller

    ./tools/launch_controller.sh --port 8080 --config ./configs/qwen3/qwen3-8b-p-tp4-r-tp2-pp1-grpo.toml

### Step 2: Launch Initial Replicas (Assuming an 8-GPU Machine)

    # Launch 1 policy replica
    CUDA_VISIBLE_DEVICES=0,1 COSMOS_CONTROLLER_HOST=localhost:8080 ./tools/launch_replica.sh --ngpus 2 --type policy

    # Launch 1 rollout replica
    CUDA_VISIBLE_DEVICES=2,3 COSMOS_CONTROLLER_HOST=localhost:8080 ./tools/launch_replica.sh --ngpus 2 --type rollout

Wait a moment to observe the initial RL workflow and check the status in the controller’s web UI.

### Step 3: Add More Replicas Dynamically

    # Add 1 more policy replica
    CUDA_VISIBLE_DEVICES=4,5 COSMOS_CONTROLLER_HOST=localhost:8080 ./tools/launch_replica.sh --ngpus 2 --type policy

    # Add 1 more rollout replica
    CUDA_VISIBLE_DEVICES=6,7 COSMOS_CONTROLLER_HOST=localhost:8080 ./tools/launch_replica.sh --ngpus 2 --type rollout

You will see the new replicas appear in the web UI. Now, all four replicas (2 policy + 2 rollout) are active and working in sync.
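
The GPU assignment used in this demo follows a simple pattern: consecutive pairs of GPUs, alternating between policy and rollout. The helper below is a hypothetical convenience script (not shipped with Cosmos-RL) that only generates the same command strings shown in Steps 2 and 3; it does not launch anything.

```python
from __future__ import annotations


def launch_command(gpus: list[int], kind: str, controller: str = "localhost:8080") -> str:
    """Build the launch_replica.sh command line for one replica."""
    devices = ",".join(str(g) for g in gpus)
    return (
        f"CUDA_VISIBLE_DEVICES={devices} COSMOS_CONTROLLER_HOST={controller} "
        f"./tools/launch_replica.sh --ngpus {len(gpus)} --type {kind}"
    )


def plan(total_gpus: int = 8, gpus_per_replica: int = 2) -> list[str]:
    """Split the GPUs into equal groups, alternating policy and rollout."""
    if total_gpus % gpus_per_replica != 0:
        raise ValueError("total_gpus must be a multiple of gpus_per_replica")
    kinds = ["policy", "rollout"]
    commands = []
    for i, start in enumerate(range(0, total_gpus, gpus_per_replica)):
        gpus = list(range(start, start + gpus_per_replica))
        commands.append(launch_command(gpus, kinds[i % 2]))
    return commands


commands = plan()
for cmd in commands:
    print(cmd)
```

The first two generated commands correspond to Step 2 and the last two to Step 3.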

## Fault Tolerance

Cosmos-RL maintains heartbeat communication between the controller and all replicas.

If either of the following occurs:

- a replica fails to send a heartbeat within `COSMOS_HEARTBEAT_TIMEOUT` seconds (default: 5 minutes), or
- an NCCL operation times out,

the controller considers the replica offline and removes it from the NCCL mesh, so that it does not block the ongoing RL workflow.
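
The detection logic can be sketched as follows. `HeartbeatMonitor` and its methods are hypothetical and only model the two failure conditions described above; the actual controller implementation may differ. Timestamps are passed in explicitly so the example is deterministic.

```python
from __future__ import annotations


class HeartbeatMonitor:
    """Hypothetical model of the controller's liveness tracking."""

    def __init__(self, timeout_s: float = 5 * 60):
        # Stands in for COSMOS_HEARTBEAT_TIMEOUT (default: 5 minutes).
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def beat(self, replica: str, now: float) -> None:
        """Record a heartbeat from a replica at time `now` (seconds)."""
        self.last_seen[replica] = now

    def sweep(self, now: float) -> list[str]:
        """Remove and return replicas whose last heartbeat is too old."""
        stale = [r for r, t in self.last_seen.items() if now - t > self.timeout_s]
        for replica in stale:
            del self.last_seen[replica]
        return stale

    def report_nccl_timeout(self, replica: str) -> None:
        """Treat an NCCL timeout like a missed heartbeat: drop the replica."""
        self.last_seen.pop(replica, None)

    @property
    def mesh(self) -> list[str]:
        """Replicas currently considered online."""
        return sorted(self.last_seen)


monitor = HeartbeatMonitor()
monitor.beat("policy-0", now=0)
monitor.beat("rollout-0", now=0)
monitor.beat("policy-0", now=200)  # rollout-0 goes silent after t=0

offline = monitor.sweep(now=350)
mesh_after_sweep = monitor.mesh
monitor.report_nccl_timeout("policy-0")
mesh_after_nccl = monitor.mesh
print(offline, mesh_after_sweep, mesh_after_nccl)
```

In this sketch a removed replica simply disappears from `mesh`, which mirrors the idea that the remaining replicas continue without waiting on the failed one.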