Elastic Scaling and Fault Tolerance
=============================================

Cosmos-RL supports **elastic scaling** and **fault tolerance** for RL jobs. This document provides a detailed explanation of how these features work.

Elastic Scaling
---------------

Cosmos-RL allows running multiple **replicas** across **multiple nodes**, for both **Policy** and **Rollout** components. Each replica shares the same parallelism configuration. When a replica starts, only the controller is aware of it, and it will publish commands to existing replicas to perform operations such as building global NCCL meshes, weight updates, etc.

Cosmos-RL supports **elastic launching** of replicas, enabling users to dynamically add or remove replicas during execution. The system handles necessary coordination automatically.

### Replica Initialization Modes

There are two initialization modes based on the configuration field `n_init_replicas`:

1. **`n_init_replicas = N > 1`**  
   Cosmos-RL will wait until `N` replicas have joined before proceeding, and will treat later replicas as dynamic.

2. **`n_init_replicas = 1` (default)**  
   Cosmos-RL immediately treats the first launched replica as active and dynamically integrates subsequent replicas.

Policy and Rollout components each maintain their own `n_init_replicas` setting, defaulting to `1`.

### Policy Replica Initialization Flow

1. Users can launch policy replicas one by one.
2. Training begins only **after** `N` replicas have launched.
3. The **first** replica performs weight initialization (e.g., from checkpoint or Hugging Face model).
4. Once `N` replicas are active:
   - The controller sends a `BuildMesh` command.
   - A selected weight-initialized replica broadcasts weights to others.
5. The policy group is now ready to begin the RL workflow.

### Rollout Replica Initialization Flow

1. Rollout replicas are also launched one by one.
2. Only the **first** rollout replica starts generating rollouts initially.
3. It synchronizes weights from an initialized policy replica.
4. Once `N` rollout replicas are active:
   - The controller sends a `BuildMesh` command.
   - The first rollout replica broadcasts weights to others.
5. All rollout replicas now participate in rollout generation.

.. note::
   For `n_init_replicas = 1`, the flow is the same with `N = 1`.

### Dynamically Launched Replicas

Cosmos-RL supports adding new replicas during runtime. These replicas will be integrated into the RL workflow as follows:

- **Policy Side**
  - Controller sends a `BuildMesh` command.
  - A chosen initialized replica sends weights to the new one (via policy-policy unicast).

- **Rollout Side**
  - Controller sends a `BuildMesh` command.
  - A chosen initialized rollout replica broadcasts weights to the new one (via rollout-rollout broadcast).

After that, the new replica is fully integrated.

Demo: Elastic Scaling in Action
-------------------------------

### Step 1: Launch Controller

::

  ./tools/launch_controller.sh --port 8080 --config ./configs/qwen3/qwen3-8b-p-tp4-r-tp2-pp1-grpo.toml

### Step 2: Launch Initial Replicas (Assuming 8-GPU machine)

::

  # Launch 1 policy replica
  CUDA_VISIBLE_DEVICES=0,1 COSMOS_CONTROLLER_HOST=localhost:8080 ./tools/launch_replica.sh --ngpus 2 --type policy

  # Launch 1 rollout replica
  CUDA_VISIBLE_DEVICES=2,3 COSMOS_CONTROLLER_HOST=localhost:8080 ./tools/launch_replica.sh --ngpus 2 --type rollout

Wait a moment to observe the initial RL workflow and check the status in the controller’s web UI.

### Step 3: Add More Replicas Dynamically

::

  # Add 1 more policy replica
  CUDA_VISIBLE_DEVICES=4,5 COSMOS_CONTROLLER_HOST=localhost:8080 ./tools/launch_replica.sh --ngpus 2 --type policy

  # Add 1 more rollout replica
  CUDA_VISIBLE_DEVICES=6,7 COSMOS_CONTROLLER_HOST=localhost:8080 ./tools/launch_replica.sh --ngpus 2 --type rollout

You will see the new replicas appear in the web UI. Now, all four replicas (2 policy + 2 rollout) are active and working in sync.

Fault Tolerance
---------------

Cosmos-RL maintains heartbeat communication between the controller and all replicas. 

If either:

- a replica fails to send heartbeats within `COSMOS_HEARTBEAT_TIMEOUT` (default is 5 minutes) seconds
- NCCL operation times out

the controller will consider it offline and remove it from the NCCL mesh. So it won't block the ongoing RL workflow.