Single node example

This example demonstrates how to run Qwen3-8B on a single node with 8 GPUs.

>>> cosmos-rl \
    --config configs/qwen3/qwen3-8b-p-tp4-r-tp2-pp1-grpo.toml \
    --policy 1 \
    --rollout 2

Explanation of the command:

--config: the path to the training config file.
--policy: the number of policy replicas.
--rollout: the number of rollout replicas.

As the TOML file name suggests, this example uses the Qwen3-8B model with:

4-way tensor parallelism for the policy model
2-way tensor parallelism for the rollout model

In total, 8 GPUs are used: 1 policy replica × 4 GPUs plus 2 rollout replicas × 2 GPUs.
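
The GPU count can be checked with a short calculation. The sketch below is illustrative only: `gpus_needed` is a hypothetical helper, not part of cosmos-rl, and it assumes the file name follows the pattern `p-tp<N>` for policy tensor parallelism and `r-tp<M>-pp<K>` for rollout tensor and pipeline parallelism.

```python
import re


def gpus_needed(config_name: str, policy_replicas: int, rollout_replicas: int) -> int:
    """Estimate total GPUs from the config file name and replica counts.

    Hypothetical helper: it relies on the assumed naming convention
    "-p-tp<N>" (policy) and "-r-tp<M>-pp<K>" (rollout) in the file name.
    """
    policy = re.search(r"-p-tp(\d+)", config_name)
    rollout = re.search(r"-r-tp(\d+)-pp(\d+)", config_name)
    if policy is None or rollout is None:
        raise ValueError(f"unrecognized config name: {config_name}")

    policy_tp = int(policy.group(1))
    rollout_tp = int(rollout.group(1))
    rollout_pp = int(rollout.group(2))

    # Each policy replica spans policy_tp GPUs; each rollout replica
    # spans rollout_tp * rollout_pp GPUs.
    return policy_replicas * policy_tp + rollout_replicas * rollout_tp * rollout_pp


total = gpus_needed("qwen3-8b-p-tp4-r-tp2-pp1-grpo.toml", policy_replicas=1, rollout_replicas=2)
print(total)  # 8
```

With --policy 1 and --rollout 2 this gives 1 × 4 + 2 × 2 × 1 = 8, matching the 8 GPUs on the node.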

If everything goes well, you should see training logs like this:

[rank0]:[cosmos] 2025-06-09 22:14:29,220 - cosmos - INFO - Step: 1/4670, Loss: 0.00000
[rank1]:[cosmos] 2025-06-09 22:14:29,219 - cosmos - INFO - Step: 1/4670, Loss: 0.00000
...

Note

You may see loss values of 0.0 because the GRPO advantage is zero. This is expected, since this is a toy math example.
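
To see how a zero advantage can arise, here is a minimal sketch of group-relative advantage normalization as commonly described for GRPO. It is an assumption about the general technique, not the cosmos-rl implementation; `group_advantages` and `eps` are hypothetical names. When every completion sampled for a prompt receives the same reward, each advantage is zero, so an advantage-weighted loss term is zero as well.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled completions.

    Hypothetical sketch: advantage = (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# All completions scored the same: every advantage is zero.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))

# Mixed rewards: advantages are nonzero and sum to roughly zero.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```

If the rewards within a group start to differ as training progresses, the advantages and the reported loss become nonzero.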