Configuration
Config |
|||||
type |
object |
||||
properties |
|||||
|
#/$defs/TrainingConfig |
||||
|
#/$defs/RolloutConfig |
||||
|
#/$defs/PolicyConfig |
||||
|
#/$defs/LoggingConfig |
||||
|
#/$defs/ProfilerConfig |
||||
|
#/$defs/ValidationConfig |
||||
$defs |
|||||
|
CheckpointConfig |
||||
type |
object |
||||
properties |
|||||
|
Enable Checkpoint |
||||
Enable checkpointing for training. If set to False, no checkpoint will be saved. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Save Freq |
||||
Checkpoint save frequency, in training steps |
|||||
type |
integer |
||||
default |
20 |
||||
|
Save Mode |
||||
Checkpoint save mode for training steps |
|||||
type |
string |
||||
default |
async |
||||
|
Max Keep |
||||
Maximum number of checkpoints to keep. If set to -1, all checkpoints will be kept. |
|||||
type |
integer |
||||
default |
5 |
||||
|
Export Safetensors |
||||
Whether to export safetensors weights for Hugging Face usage, including the related config files. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Upload Hf |
||||
Whether to upload the safetensors weights to Hugging Face. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Hf Repo Name |
||||
The Hugging Face repo name to which the safetensors weights are uploaded. |
|||||
type |
string |
||||
default |
Comos-Reason1 |
||||
|
Upload S3 |
||||
Whether to upload the checkpoint and safetensors to S3. Defaults to False; set to final to upload only the final checkpoint, or to all to upload all checkpoints. |
|||||
default |
False |
||||
anyOf |
type |
boolean |
|||
type |
string |
||||
|
S3 Bucket |
||||
The S3 bucket name to upload the checkpoint and safetensors weight. |
|||||
default |
None |
||||
anyOf |
type |
string |
|||
type |
null |
||||
|
S3 Prefix |
||||
The S3 prefix to upload the checkpoint and safetensors weight. |
|||||
type |
string |
||||
default |
outputs |
||||
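The interaction between Save Freq and Max Keep can be illustrated with a minimal Python sketch. The function name `checkpoints_to_keep` is hypothetical, and two points are assumptions not stated in this reference: that a checkpoint is written at every multiple of the save frequency, and that the oldest checkpoints are the ones dropped first.

```python
def checkpoints_to_keep(saved_steps, max_keep):
    """Return the checkpoint steps that survive pruning.

    Assumption: the oldest checkpoints are dropped first. The reference
    only states how many checkpoints are kept, not which ones.
    """
    if max_keep == -1:
        # -1 means every checkpoint is kept.
        return sorted(saved_steps)
    if max_keep < 0:
        raise ValueError("max_keep must be -1 or a non-negative integer")
    ordered = sorted(saved_steps)
    # Keep only the most recent `max_keep` entries.
    return ordered[-max_keep:] if max_keep > 0 else []


# Assumed schedule: one checkpoint at every multiple of save_freq=20
# over a 200-step run, pruned with the default max_keep=5.
steps = list(range(20, 201, 20))
kept = checkpoints_to_keep(steps, max_keep=5)
print(kept)
```

Under these assumptions, a 200-step run with the defaults would retain only the five most recent of its ten checkpoints.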
|
DatasetConfig |
||||
type |
object |
||||
properties |
|||||
|
Name |
||||
Hugging Face dataset name, or local path to a parquet file |
|||||
type |
string |
||||
default |
|||||
|
Subset |
||||
Dataset subset, if one exists |
|||||
default |
|||||
anyOf |
type |
string |
|||
type |
null |
||||
|
Revision |
||||
Dataset git revision, if one exists; can be a branch name, a tag, or a commit hash. |
|||||
default |
|||||
anyOf |
type |
string |
|||
type |
null |
||||
|
Split |
||||
A dataset split, or a list of dataset splits, to train on |
|||||
default |
|||||
anyOf |
type |
string |
|||
type |
array |
||||
items |
type |
string |
|||
|
Test Size |
||||
Size of the test set. If float, it is the ratio (between 0.0 and 1.0) of the dataset; if int, it is the absolute size of the test set. |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
integer |
||||
type |
null |
||||
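The float-versus-int rule for Test Size can be sketched as follows. This is only an illustration: `resolve_test_size` is a hypothetical helper, and rounding the ratio down and returning `None` when no test size is set are assumptions, since the reference does not specify either behavior.

```python
def resolve_test_size(test_size, dataset_len):
    """Turn a test_size setting into an absolute number of test samples.

    Assumptions: a float ratio is rounded down, and None means that no
    test split is carved out (returned here as None).
    """
    if test_size is None:
        return None
    if isinstance(test_size, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("test_size must be a float, an int, or None")
    if isinstance(test_size, float):
        if not 0.0 <= test_size <= 1.0:
            raise ValueError("a float test_size must lie between 0.0 and 1.0")
        return int(dataset_len * test_size)
    if test_size < 0:
        raise ValueError("an int test_size must be non-negative")
    return test_size


print(resolve_test_size(0.25, 1000))  # ratio of the dataset
print(resolve_test_size(100, 1000))   # absolute size
```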
|
FP8Config |
||||
type |
object |
||||
properties |
|||||
|
Enable Fp8 |
||||
Whether to enable fp8. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Fp8 Recipe |
||||
Recipe for weight scale calculation. |
|||||
type |
string |
||||
default |
dynamic_scaling |
||||
|
Quant Recipe |
||||
Quantization strategy for weight. |
|||||
type |
string |
||||
default |
rowwise |
||||
|
GrpoConfig |
||||
type |
object |
||||
properties |
|||||
|
Type |
||||
type |
string |
||||
const |
grpo |
||||
|
Variant |
||||
Variant of GRPO; currently supports grpo and dapo |
|||||
type |
string |
||||
default |
grpo |
||||
|
Dataset configuration for GRPO training. It includes dataset name, subset, revision, train split, test split and test size. |
||||
#/$defs/DatasetConfig |
|||||
|
Dataloader Shuffle |
||||
Shuffle the dataloader. If False, the dataloader will be used in the order it is loaded. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Enable Dataset Cache |
||||
Enable caching of dataset processing results, which may accelerate dataset loading |
|||||
type |
boolean |
||||
default |
False |
||||
|
Dataloader Num Workers |
||||
Number of subprocesses to use for data loading |
|||||
type |
integer |
||||
default |
0 |
||||
|
Dataloader Prefetch Factor |
||||
Number of batches loaded in advance by each worker. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Prompt Column Name |
||||
Column name for prompt |
|||||
type |
string |
||||
default |
|||||
|
Response Column Name |
||||
Column name for response/reference answer |
|||||
type |
string |
||||
default |
|||||
|
Reward Function |
||||
A list of reward functions for the model. Currently supports single_choice, boxed_math, and format. |
|||||
anyOf |
type |
string |
|||
type |
array |
||||
items |
type |
string |
|||
|
Temperature |
||||
Temperature for sampling. The higher the temperature, the more random the completions. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Epsilon Low |
||||
Epsilon value for clipping. |
|||||
type |
number |
||||
default |
0.2 |
||||
|
Epsilon High |
||||
Upper-bound epsilon value for clipping. If not specified, it defaults to the same value as the lower bound specified in epsilon_low. The DAPO paper recommends 0.28. |
|||||
type |
number |
||||
default |
0.2 |
||||
|
Lower Bound Ratio |
||||
Lower-bound ratio for dual-clip. |
|||||
type |
number |
||||
default |
3.0 |
||||
|
Loss Type |
||||
The type of loss to use for GRPO training. |
|||||
type |
string |
||||
default |
token-mean |
||||
|
Unbiased Loss Max Tokens |
||||
Maximum number of tokens to use for the unbiased loss introduced in Dr.GRPO. If set to None, the unbiased loss is not used. Only available when loss_type is seq-mean-token-mean. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Unbiased Advantage |
||||
Whether to divide the advantage by the standard deviation of rewards. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Configuration for overlong reward penalty. If enabled, the output will be penalized for responses that are too long. |
||||
#/$defs/OverlongRewardConfig |
|||||
|
Kl Beta |
||||
KL coefficient. If 0.0, the reference model is not loaded, reducing memory usage and improving training speed, but may be numerically unstable for long training runs. |
|||||
type |
number |
||||
default |
0.0 |
||||
|
Aipo Rho |
||||
Rho value for AIPO (Asynchronous Importance weighted Policy Optimization): the clipping constant of the importance sampling ratio. The suggested range is [2, 10]. Reference: https://arxiv.org/pdf/2505.24034 |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
null |
||||
|
Mu Iterations |
||||
Number of iterations per batch (denoted as μ in the algorithm). |
|||||
type |
integer |
||||
default |
1 |
||||
|
Mini Batch |
||||
Mini-batch size for GRPO training. |
|||||
type |
integer |
||||
default |
2 |
||||
|
Allowed Outdated Steps |
||||
Allowed outdated-async steps for the rollout engine. If the number of pending rollouts exceeds allowed_outdated_steps * n_policy_replicas * train_batch_per_replica, rollout engine traffic is throttled. |
|||||
type |
integer |
||||
default |
4 |
||||
|
Min Filter Prefix Tokens |
||||
Minimum number of tokens to filter the prefix tokens for the rollouts inside the same group. If the number of tokens is larger than the min_filter_prefix_tokens, the rollouts with the same prefix but different rewards will be filtered out in loss calculation. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
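The throttling rule described under Allowed Outdated Steps reduces to a single comparison, sketched below in Python. The function `rollout_should_throttle` is hypothetical. The default arguments reuse the defaults listed in this reference for allowed_outdated_steps (4) and train_batch_per_replica (8); taking the number of policy replicas as 1 is an assumption.

```python
def rollout_should_throttle(
    pending_rollouts,
    allowed_outdated_steps=4,
    n_policy_replicas=1,  # assumption: a single policy replica
    train_batch_per_replica=8,
):
    """Return True when the rollout engine should be throttled.

    Implements the stated rule: throttle once the number of pending
    rollouts is larger than
    allowed_outdated_steps * n_policy_replicas * train_batch_per_replica.
    """
    limit = allowed_outdated_steps * n_policy_replicas * train_batch_per_replica
    return pending_rollouts > limit


# With the values above the limit is 4 * 1 * 8 = 32 pending rollouts.
print(rollout_should_throttle(32))
print(rollout_should_throttle(33))
```

Raising allowed_outdated_steps lets rollout workers run further ahead of the policy, at the cost of training on samples generated by older weights.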
|
LoggingConfig |
||||
type |
object |
||||
properties |
|||||
|
Logger |
||||
List of loggers to use, e.g., ['console', 'wandb'] |
|||||
type |
array |
||||
items |
type |
string |
|||
|
Project Name |
||||
Wandb project name for logging. If set, the training will be logged to this project. |
|||||
type |
string |
||||
default |
cosmos_rl |
||||
|
Experiment Name |
||||
A short display name for this run. If not set, will use the output_dir as the experiment name. |
|||||
default |
None |
||||
anyOf |
type |
string |
|||
type |
null |
||||
|
OverlongRewardConfig |
||||
type |
object |
||||
properties |
|||||
|
Enable Overlong Penalty |
||||
Enable overlong penalty for the model. If set to True, the output will be penalized for responses that are too long. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Buffer Length |
||||
Length of the buffer for overlong penalty. If the response length exceeds this value, the output will be penalized. |
|||||
type |
integer |
||||
default |
4096 |
||||
|
Penalty Factor |
||||
Penalty factor for overlong penalty. The penalty increases linearly with the length of the response exceeding the buffer length from 0 to the penalty_factor. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
ParallelismConfig |
||||
type |
object |
||||
properties |
|||||
|
N Init Replicas |
||||
Number of initial replicas to be created |
|||||
type |
integer |
||||
default |
1 |
||||
|
Tp Size |
||||
Tensor parallelism size |
|||||
type |
integer |
||||
default |
2 |
||||
|
Cp Size |
||||
Context parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Dp Shard Size |
||||
Data Parallelism size in sharded mode |
|||||
type |
integer |
||||
default |
-1 |
||||
|
Pp Size |
||||
Pipeline parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Pp Dynamic Shape |
||||
Pipeline parallelism dynamic shape |
|||||
type |
boolean |
||||
default |
False |
||||
|
Pp Micro Batch Size |
||||
Pipeline parallelism micro-batch size. n_micro_batch = batch_size / pp_micro_batch_size, and n_micro_batch must be divisible by the number of pp stages |
|||||
type |
integer |
||||
default |
1 |
||||
|
Dp Replicate Size |
||||
Data parallelism size in replica mode. Only configurable in SFT jobs; must be 1 in GRPO jobs to support dynamic scaling. |
|||||
type |
integer |
||||
default |
1 |
||||
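The constraint stated for Pp Micro Batch Size can be checked with a few lines of Python. This is a sketch rather than the library's own validation: `num_micro_batches` is a hypothetical helper, and it assumes that the number of pp stages equals pp_size and that the divisibility requirement applies to n_micro_batch.

```python
def num_micro_batches(batch_size, pp_micro_batch_size, pp_size):
    """Compute n_micro_batch and check it against the pipeline stages.

    Assumptions: the number of pipeline stages equals pp_size, and it is
    n_micro_batch (not batch_size) that must be divisible by it.
    """
    if batch_size % pp_micro_batch_size != 0:
        raise ValueError("batch_size must be a multiple of pp_micro_batch_size")
    n_micro_batch = batch_size // pp_micro_batch_size
    if n_micro_batch % pp_size != 0:
        raise ValueError("n_micro_batch must be divisible by the number of pp stages")
    return n_micro_batch


# 8 samples split into micro-batches of 2 gives 4 micro-batches,
# which divides evenly across 2 pipeline stages.
print(num_micro_batches(batch_size=8, pp_micro_batch_size=2, pp_size=2))
```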
|
PolicyConfig |
||||
type |
object |
||||
properties |
|||||
|
#/$defs/ParallelismConfig |
||||
|
Model Name Or Path |
||||
The model name or path; accepts a Hugging Face model name or a local path |
|||||
type |
string |
||||
default |
Qwen/Qwen2.5-VL-7B-Instruct |
||||
|
Model Max Length |
||||
The maximum sequence length for training; samples longer than this are ignored for training stability |
|||||
type |
integer |
||||
default |
4096 |
||||
|
Model Gradient Checkpointing |
||||
Whether to use gradient checkpointing |
|||||
type |
boolean |
||||
default |
True |
||||
|
ProfilerConfig |
||||
type |
object |
||||
properties |
|||||
|
Enable Profiler |
||||
Enable profiler for training |
|||||
type |
boolean |
||||
default |
False |
||||
|
Sub profiler config |
||||
#/$defs/SubProfilerConfig |
|||||
|
RolloutConfig |
||||
type |
object |
||||
properties |
|||||
|
#/$defs/RolloutParallelismConfig |
||||
|
Enforce Eager |
||||
Whether to enable eager execution for vLLM. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Include Stop Str In Output |
||||
Whether to include stop string in output. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Gpu Memory Utilization |
||||
GPU memory utilization factor for rollout backend. |
|||||
type |
number |
||||
default |
0.8 |
||||
|
Enable Chunked Prefill |
||||
Whether to enable chunked prefill for vLLM. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Max Response Length |
||||
Max output length of rollout generation. |
|||||
type |
integer |
||||
default |
2048 |
||||
|
N Generation |
||||
Number of generations per prompt; same as the n parameter in the OpenAI chat API. |
|||||
type |
integer |
||||
default |
16 |
||||
|
Batch Size |
||||
Batch size for rollout. |
|||||
type |
integer |
||||
default |
1 |
||||
|
Val Batch Size |
||||
Batch size for rollout generation during validation. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Quantization |
||||
Quantization used in vLLM rollout generation. |
|||||
type |
string |
||||
default |
none |
||||
|
Seed |
||||
Random seed for rollout. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
#/$defs/SamplingConfig |
||||
|
Vllm Use Flashinfer |
||||
Use flashinfer for vLLM rollout. |
|||||
type |
boolean |
||||
default |
False |
||||
|
RolloutParallelismConfig |
||||
type |
object |
||||
properties |
|||||
|
N Init Replicas |
||||
Number of initial replicas to be created |
|||||
type |
integer |
||||
default |
1 |
||||
|
Tp Size |
||||
Tensor parallelism size |
|||||
type |
integer |
||||
default |
2 |
||||
|
Cp Size |
||||
Context parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Dp Shard Size |
||||
Data Parallelism size in sharded mode |
|||||
type |
integer |
||||
default |
-1 |
||||
|
Pp Size |
||||
Pipeline parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Pp Dynamic Shape |
||||
Pipeline parallelism dynamic shape |
|||||
type |
boolean |
||||
default |
False |
||||
|
Pp Micro Batch Size |
||||
Pipeline parallelism micro-batch size. n_micro_batch = batch_size / pp_micro_batch_size, and n_micro_batch must be divisible by the number of pp stages |
|||||
type |
integer |
||||
default |
1 |
||||
|
Dp Replicate Size |
||||
Data parallelism size in replica mode; only 1 is supported, to allow dynamic scaling. |
|||||
type |
integer |
||||
default |
1 |
||||
|
SFTDataConfig |
||||
type |
object |
||||
properties |
|||||
|
Type |
||||
type |
string |
||||
const |
sft |
||||
|
Dataset configuration for SFT training. It includes dataset name, subset, revision, train split, and test split. |
||||
#/$defs/DatasetConfig |
|||||
|
Dataloader Shuffle |
||||
Shuffle the dataloader. If False, the dataloader will be used in the order it is loaded. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Enable Dataset Cache |
||||
Enable caching of dataset processing results, which may accelerate dataset loading |
|||||
type |
boolean |
||||
default |
False |
||||
|
Dataloader Num Workers |
||||
Number of subprocesses to use for data loading |
|||||
type |
integer |
||||
default |
0 |
||||
|
Dataloader Prefetch Factor |
||||
Number of batches loaded in advance by each worker. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Conversation Column Name |
||||
Column name for the formatted conversation JSON |
|||||
type |
string |
||||
default |
conversations |
||||
|
System Prompt |
||||
System prompt for the model, which will be prepended to the prompt |
|||||
type |
string |
||||
default |
|||||
|
SamplingConfig |
||||
type |
object |
||||
properties |
|||||
|
Temperature |
||||
Temperature for sampling. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Top P |
||||
Top-p for sampling. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Top K |
||||
Top-k for sampling. |
|||||
type |
integer |
||||
default |
-1 |
||||
|
Repetition Penalty |
||||
Repetition penalty for sampling. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Use Flashinfer |
||||
Use flashinfer for sampling. |
|||||
type |
boolean |
||||
default |
False |
||||
|
SubProfilerConfig |
||||
type |
object |
||||
properties |
|||||
|
Do Profile |
||||
Whether to profile, only used in runtime. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Active Steps |
||||
Number of active steps |
|||||
type |
integer |
||||
default |
1 |
||||
|
Rank Filter |
||||
Rank filter |
|||||
type |
array |
||||
items |
type |
integer |
|||
|
Record Shape |
||||
Whether to record shape |
|||||
type |
boolean |
||||
default |
False |
||||
|
Profile Memory |
||||
Whether to profile memory |
|||||
type |
boolean |
||||
default |
False |
||||
|
With Stack |
||||
Whether to profile stack |
|||||
type |
boolean |
||||
default |
False |
||||
|
With Modules |
||||
Whether to profile modules |
|||||
type |
boolean |
||||
default |
False |
||||
|
TrainingConfig |
||||
type |
object |
||||
properties |
|||||
|
Train Policy |
||||
default |
type |
grpo |
|||
variant |
grpo |
||||
dataset |
name |
||||
revision |
|||||
split |
|||||
subset |
|||||
test_size |
None |
||||
dataloader_shuffle |
True |
||||
enable_dataset_cache |
False |
||||
dataloader_num_workers |
0 |
||||
dataloader_prefetch_factor |
None |
||||
prompt_column_name |
|||||
response_column_name |
|||||
reward_function |
single_choice |
||||
temperature |
1.0 |
||||
epsilon_low |
0.2 |
||||
epsilon_high |
0.2 |
||||
lower_bound_ratio |
3.0 |
||||
loss_type |
token-mean |
||||
unbiased_loss_max_tokens |
None |
||||
unbiased_advantage |
False |
||||
overlong_reward |
buffer_length |
4096 |
|||
enable_overlong_penalty |
False |
||||
penalty_factor |
1.0 |
||||
kl_beta |
0.0 |
||||
aipo_rho |
None |
||||
mu_iterations |
1 |
||||
mini_batch |
2 |
||||
allowed_outdated_steps |
4 |
||||
min_filter_prefix_tokens |
None |
||||
oneOf |
#/$defs/SFTDataConfig |
||||
#/$defs/GrpoConfig |
|||||
|
#/$defs/FP8Config |
||||
|
#/$defs/CheckpointConfig |
||||
|
Resume |
||||
Resume training from a checkpoint. If True, will resume from the latest checkpoint of the output_dir. If a string, will resume from the specified checkpoint path. |
|||||
default |
False |
||||
anyOf |
type |
boolean |
|||
type |
string |
||||
|
Epoch |
||||
Number of epochs for training |
|||||
type |
integer |
||||
default |
1 |
||||
|
Output Dir |
||||
Output directory |
|||||
type |
string |
||||
default |
./outputs |
||||
|
Timestamp |
||||
Timestamp for the output directory and wandb ID. If not set, it is generated automatically |
|||||
type |
string |
||||
default |
|||||
|
Epsilon |
||||
Epsilon for optimizer |
|||||
type |
number |
||||
default |
1e-06 |
||||
|
Optm Name |
||||
Optimizer name |
|||||
type |
string |
||||
default |
AdamW |
||||
|
Optm Lr |
||||
Learning rate for the optimizer; can be a float, or a list of floats for multiple optimizers |
|||||
default |
1e-06 |
||||
anyOf |
type |
number |
|||
type |
array |
||||
items |
type |
number |
|||
|
Optm Impl |
||||
Implementation type for the optimizer; can be a list of strings for multiple optimizers. More info: https://pytorch.org/docs/stable/optim.html |
|||||
default |
fused |
||||
anyOf |
type |
string |
|||
type |
array |
||||
items |
type |
string |
|||
|
Optm Weight Decay |
||||
Weight decay for optimizer |
|||||
type |
number |
||||
default |
0.01 |
||||
|
Optm Betas |
||||
Betas for optimizer |
|||||
type |
array |
||||
default |
0.9 |
||||
0.999 |
|||||
maxItems |
2 |
||||
minItems |
2 |
||||
|
Optm Warmup Steps |
||||
Warmup steps for optimizer |
|||||
type |
integer |
||||
default |
20 |
||||
|
Optm Grad Norm Clip |
||||
Gradient norm clip for optimizer |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Async Tp Enabled |
||||
Whether to use async tensor parallelism |
|||||
type |
boolean |
||||
default |
False |
||||
|
Compile |
||||
Whether to use torch.compile |
|||||
type |
boolean |
||||
default |
True |
||||
|
Param Dtype |
||||
The data type for parameters and activations |
|||||
type |
string |
||||
default |
bfloat16 |
||||
|
Fsdp Reduce Dtype |
||||
The data type for reduction in FSDP |
|||||
type |
string |
||||
default |
float32 |
||||
|
Fsdp Offload |
||||
Whether to offload the model to CPU if using FSDP |
|||||
type |
boolean |
||||
default |
False |
||||
|
Fsdp Reshard After Forward |
||||
Reshard the param after forward pass in FSDP |
|||||
type |
string |
||||
default |
default |
||||
|
Train Batch Per Replica |
||||
The batch size for training per iteration in one replica. This is the local batch size for each gradient accumulation step |
|||||
type |
integer |
||||
default |
8 |
||||
|
Enable Validation |
||||
Enable validation during training. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Validation Step |
||||
Validation frequency during training, in terms of training steps |
|||||
type |
integer |
||||
default |
20 |
||||
|
Validation Batch Per Replica |
||||
The batch size for validation per iteration in one replica. |
|||||
type |
integer |
||||
default |
24 |
||||
|
Sync Weight Interval |
||||
The interval, in training steps, for synchronizing weights between replicas. |
|||||
type |
integer |
||||
default |
1 |
||||
|
ValidationConfig |
||||
type |
object |
||||
properties |
|||||
|
Dataset configuration for validation. It includes dataset name, subset, revision and test split. |
||||
#/$defs/DatasetConfig |
|||||
|
Temperature |
||||
Temperature for sampling during validation. |
|||||
type |
number |
||||
default |
0.9 |
||||
|
Top P |
||||
Top-p for sampling during validation. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Top K |
||||
Top-k for sampling during validation. |
|||||
type |
integer |
||||
default |
10 |
||||
|
Repetition Penalty |
||||
Repetition penalty for sampling during validation. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
N Generation |
||||
Number of generations per prompt during validation; same as the n parameter in the OpenAI chat API. |
|||||
type |
integer |
||||
default |
1 |
||||
|
Max Response Length |
||||
Max output length of rollout generation during validation. |
|||||
type |
integer |
||||
default |
2048 |