Configuration
Config |
|||||
type |
object |
||||
properties |
|||||
|
Custom |
||||
Custom script configuration. |
|||||
type |
object |
||||
additionalProperties |
True |
||||
|
#/$defs/TrainingConfig |
||||
|
#/$defs/RolloutConfig |
||||
|
#/$defs/PolicyConfig |
||||
|
#/$defs/LoggingConfig |
||||
|
#/$defs/ProfilerConfig |
||||
|
#/$defs/ValidationConfig |
||||
$defs |
|||||
|
CheckpointConfig |
||||
type |
object |
||||
properties |
|||||
|
Enable Checkpoint |
||||
Enable checkpointing for training. If set to False, no checkpoint will be saved. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Save Freq |
||||
Checkpoint save frequency for training steps |
|||||
type |
integer |
||||
default |
20 |
||||
|
Save Freq In Epoch |
||||
Checkpoint save frequency for training epochs. Defaults to 0 (disabled). |
|||||
type |
integer |
||||
default |
0 |
||||
|
Save Mode |
||||
Checkpoint save mode for training steps |
|||||
type |
string |
||||
default |
async |
||||
|
Max Keep |
||||
Maximum number of checkpoints to keep. If set to -1, all checkpoints will be kept. |
|||||
type |
integer |
||||
default |
5 |
||||
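As a rough illustration of the Max Keep rule, the sketch below selects which checkpoints survive pruning. The helper name `checkpoints_to_keep` is hypothetical, and the choice to drop the oldest steps first is an assumption; the source only states the limit and the meaning of -1.

```python
from typing import List


def checkpoints_to_keep(steps: List[int], max_keep: int) -> List[int]:
    """Return the checkpoint steps retained under a max_keep limit.

    Assumption (not stated in the reference): the oldest checkpoints
    are the ones removed once the limit is exceeded.
    """
    ordered = sorted(steps)
    if max_keep == -1:
        # -1 means every checkpoint is kept.
        return ordered
    if max_keep < 1:
        raise ValueError("max_keep must be -1 or a positive integer")
    # Keep only the most recent max_keep checkpoints.
    return ordered[-max_keep:]


saved = [20, 40, 60, 80, 100, 120, 140]
print(checkpoints_to_keep(saved, 5))
print(checkpoints_to_keep(saved, -1))
```

With the default limit of 5 and a save frequency of 20 steps, this reading retains the five most recent steps.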
|
Export Safetensors |
||||
Whether to export safetensors weights for Hugging Face usage, including the related config files. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Upload Hf |
||||
Whether to upload the safetensors weights to Hugging Face. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Hf Repo Name |
||||
The Hugging Face repo name to upload the safetensors weights to. |
|||||
type |
string |
||||
default |
Comos-Reason1 |
||||
|
Upload S3 |
||||
Whether to upload the checkpoint and safetensors to S3. Defaults to False. Set to final to upload only the final checkpoint, or to all to upload every checkpoint. |
|||||
default |
False |
||||
anyOf |
type |
boolean |
|||
type |
string |
||||
|
S3 Bucket |
||||
The S3 bucket name to upload the checkpoint and safetensors weight. |
|||||
default |
None |
||||
anyOf |
type |
string |
|||
type |
null |
||||
|
S3 Prefix |
||||
The S3 prefix to upload the checkpoint and safetensors weight. |
|||||
type |
string |
||||
default |
outputs |
||||
|
DatasetConfig |
||||
type |
object |
||||
properties |
|||||
|
Name |
||||
Hugging Face dataset name or local path to a Parquet file |
|||||
type |
string |
||||
default |
|||||
|
Subset |
||||
Dataset subset, if one exists |
|||||
default |
|||||
anyOf |
type |
string |
|||
type |
null |
||||
|
Revision |
||||
Dataset git revision, if one exists. Can be a branch name, a tag, or a commit hash. |
|||||
default |
|||||
anyOf |
type |
string |
|||
type |
null |
||||
|
Split |
||||
A dataset split, or a list of dataset splits, to train on |
|||||
default |
|||||
anyOf |
type |
string |
|||
type |
array |
||||
items |
type |
string |
|||
|
Test Size |
||||
Size of the test set. If float, it is the ratio (between 0.0 and 1.0) of the dataset; if int, it is the absolute size of the test set. |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
integer |
||||
type |
null |
||||
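The float-versus-integer rule for Test Size can be sketched as follows. The helper `resolve_test_size` is hypothetical; truncating the fractional row count, capping an integer at the dataset size, and treating None as "no test rows" are all assumptions for illustration.

```python
from typing import Optional, Union


def resolve_test_size(test_size: Optional[Union[int, float]], n_rows: int) -> int:
    """Translate a test_size setting into an absolute number of test rows."""
    if test_size is None:
        # Assumption: None means no rows are held out.
        return 0
    if isinstance(test_size, bool):
        raise TypeError("test_size must be a number, not a bool")
    if isinstance(test_size, float):
        # A float is a ratio of the dataset.
        if not 0.0 <= test_size <= 1.0:
            raise ValueError("a float test_size must lie in [0.0, 1.0]")
        return int(n_rows * test_size)  # assumption: truncate toward zero
    # An integer is an absolute size; capping at n_rows is an assumption.
    return min(test_size, n_rows)


print(resolve_test_size(0.5, 1000))
print(resolve_test_size(200, 1000))
```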
|
FP4Config |
||||
type |
object |
||||
properties |
|||||
|
Enable Fp4 |
||||
Whether to enable fp4. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Fp4 Recipe |
||||
Recipe for weight scale calculation. |
|||||
type |
string |
||||
default |
dynamic_scaling |
||||
|
Quant Recipe |
||||
Quantization strategy for weight. |
|||||
type |
string |
||||
default |
rowwise |
||||
|
FP8Config |
||||
type |
object |
||||
properties |
|||||
|
Enable Fp8 |
||||
Whether to enable fp8. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Fp8 Recipe |
||||
Recipe for weight scale calculation. |
|||||
type |
string |
||||
default |
dynamic_scaling |
||||
|
Quant Recipe |
||||
Quantization strategy for weight. |
|||||
type |
string |
||||
default |
rowwise |
||||
|
GrpoConfig |
||||
type |
object |
||||
properties |
|||||
|
Type |
||||
type |
string |
||||
const |
grpo |
||||
|
Variant |
||||
Variant of GRPO. Currently supported: grpo, gspo, dapo. |
|||||
type |
string |
||||
default |
grpo |
||||
|
Dataset configuration for GRPO training. It includes dataset name, subset, revision, train split, test split and test size. |
||||
#/$defs/DatasetConfig |
|||||
|
Dataloader Shuffle |
||||
Shuffle the dataloader. If False, the dataloader will be used in the order it is loaded. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Enable Dataset Cache |
||||
Cache the results of dataset processing, which may accelerate dataset loading |
|||||
type |
boolean |
||||
default |
False |
||||
|
Dataloader Num Workers |
||||
Number of subprocesses to use for data loading |
|||||
type |
integer |
||||
default |
0 |
||||
|
Dataloader Prefetch Factor |
||||
Number of batches loaded in advance by each worker. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Dataloader Batch Size |
||||
Batch size for each iteration of the dataloader when fetching prompts from the controller. This only affects the dataloader iterator on the controller side. |
|||||
default |
1 |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Prompt Column Name |
||||
Column name for prompt |
|||||
type |
string |
||||
default |
|||||
|
Response Column Name |
||||
Column name for response/reference answer |
|||||
type |
string |
||||
default |
|||||
|
Reward Function |
||||
Reward functions for the model. Currently supported: single_choice, boxed_math, and format. You can weight each reward function by passing a dict, e.g., {'single_choice': 0.9, 'format': 0.1} |
|||||
anyOf |
type |
string |
|||
type |
array |
||||
items |
type |
string |
|||
type |
object |
||||
additionalProperties |
type |
number |
|||
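Since Reward Function accepts a string, a list of strings, or a dict of weights, one way to handle all three shapes is to normalize them into a single name-to-weight mapping. The sketch below is illustrative only: the helper names are hypothetical, the default weight of 1.0 for unweighted names mirrors the `single_choice: 1.0` default listed under TrainingConfig, and combining scores as a weighted sum is an assumption.

```python
from typing import Dict, List, Union

RewardSpec = Union[str, List[str], Dict[str, float]]


def normalize_reward_functions(spec: RewardSpec) -> Dict[str, float]:
    """Turn any accepted reward_function shape into {name: weight}."""
    if isinstance(spec, str):
        return {spec: 1.0}
    if isinstance(spec, dict):
        return {name: float(weight) for name, weight in spec.items()}
    # A list of names: assume an equal weight of 1.0 for each.
    return {name: 1.0 for name in spec}


def combined_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Assumed combination rule: weighted sum of per-function scores."""
    return sum(weights[name] * scores[name] for name in weights)


weights = normalize_reward_functions({"single_choice": 0.9, "format": 0.1})
print(combined_reward({"single_choice": 1.0, "format": 0.0}, weights))
```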
|
Filter Reward Metric |
||||
Reward function used as the filter in dynamic sampling for DAPO. If specified, only samples whose values for this reward differ will be used for training. If None, no filtering will be applied. |
|||||
anyOf |
type |
string |
|||
type |
array |
||||
items |
type |
string |
|||
|
Temperature |
||||
Temperature for sampling. The higher the temperature, the more random the completions. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Epsilon Low |
||||
Epsilon value for clipping. |
|||||
type |
number |
||||
default |
0.2 |
||||
|
Epsilon High |
||||
Upper-bound epsilon value for clipping. If not specified, it defaults to the same value as the lower bound (epsilon_low). The DAPO paper recommends 0.28. |
|||||
type |
number |
||||
default |
0.2 |
||||
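To show how separate lower and upper epsilon values interact, here is a minimal sketch of the standard PPO-style clipped surrogate with asymmetric bounds. It is a scalar illustration under the usual formulation, not the library's implementation; the function name is hypothetical.

```python
def clipped_surrogate(
    ratio: float,
    advantage: float,
    epsilon_low: float = 0.2,
    epsilon_high: float = 0.2,
) -> float:
    """Per-token clipped objective with asymmetric clipping bounds.

    ratio is the probability ratio between the new and old policies.
    """
    # Clip the ratio into [1 - epsilon_low, 1 + epsilon_high].
    clipped = min(max(ratio, 1.0 - epsilon_low), 1.0 + epsilon_high)
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)


# A larger upper bound lets positive-advantage tokens move further.
print(clipped_surrogate(1.5, 1.0, epsilon_high=0.2))
print(clipped_surrogate(1.5, 1.0, epsilon_high=0.28))
```

Raising only `epsilon_high` loosens the cap on probability increases while leaving the floor on decreases unchanged.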
|
Positive Nll Coef |
||||
Coefficient for Positive Example LM Loss. Set a positive value to enable; None disables the feature. |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
null |
||||
|
Lower Bound Ratio |
||||
Lower-bound ratio for dual-clip. |
|||||
type |
number |
||||
default |
3.0 |
||||
|
Loss Type |
||||
The type of loss to use for GRPO training. |
|||||
type |
string |
||||
default |
token-mean |
||||
|
Unbiased Loss Max Tokens |
||||
Maximum number of tokens to use for the unbiased loss introduced in Dr.GRPO. If set to None, the unbiased loss is not used. Only available when loss_type is seq-mean-token-mean. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Unbiased Advantage |
||||
Whether to divide the advantage by the standard deviation of rewards. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Configuration for overlong reward penalty. If enabled, the output will be penalized for responses that are too long. |
||||
#/$defs/OverlongRewardConfig |
|||||
|
Kl Beta |
||||
KL coefficient. If 0.0, the reference model is not loaded, reducing memory usage and improving training speed, but may be numerically unstable for long training runs. |
|||||
type |
number |
||||
default |
0.0 |
||||
|
Aipo Rho |
||||
Rho value for AIPO (Asynchronous Importance weighted Policy Optimization): the clipping constant of the importance sampling ratio. Suggested range: [2, 10]. Reference: https://arxiv.org/pdf/2505.24034 |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
null |
||||
|
Mu Iterations |
||||
Number of iterations per batch (denoted as μ in the algorithm). |
|||||
type |
integer |
||||
default |
1 |
||||
|
Mini Batch |
||||
Mini-batch size for GRPO training. The batch for each optimization step is split into mini-batches so that it fits into GPU memory. |
|||||
type |
integer |
||||
default |
2 |
||||
|
Batch Size Per Optimize |
||||
Batch size for each optimization in GRPO training. The batch in each training step is split into smaller batches, each of which performs one optimization step. If not set, it is the same as the whole batch size per GPU for each training step. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Max Token Len Per Mini Batch |
||||
Maximum token length per mini batch. If set, dynamic mini-batch sizing will be applied based on this limit. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Entropy Coeff |
||||
Coefficient for entropy regularization. |
|||||
type |
number |
||||
default |
0.0 |
||||
|
Allowed Outdated Steps |
||||
Allowed outdated-async steps for the rollout engine. If the number of remaining uncompleted rollout samples is larger than (allowed_outdated_steps + 1) * n_policy_replicas * train_batch_per_replica, rollout engine traffic will be throttled. |
|||||
type |
integer |
||||
default |
4 |
||||
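The throttling condition above is simple arithmetic, sketched here for concreteness. The function name and the `pending_samples` parameter are hypothetical; the threshold formula is taken directly from the description.

```python
def should_throttle_rollout(
    pending_samples: int,
    allowed_outdated_steps: int,
    n_policy_replicas: int,
    train_batch_per_replica: int,
) -> bool:
    """Return True when uncompleted rollout samples exceed the allowed backlog."""
    limit = (allowed_outdated_steps + 1) * n_policy_replicas * train_batch_per_replica
    return pending_samples > limit


# With the defaults (allowed_outdated_steps=4, train_batch_per_replica=8)
# and one policy replica, the backlog limit is (4 + 1) * 1 * 8 = 40 samples.
print(should_throttle_rollout(41, 4, 1, 8))
print(should_throttle_rollout(40, 4, 1, 8))
```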
|
On Policy |
||||
Enable fully synchronized (on-policy) rollout. If set to True, the rollout engine will wait until the expected weight version is updated before next generation starts. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Outdated Rollout Fetch Batch Size |
||||
Number of outdated rollouts to fetch. If set to 0, the rollout engine will stop generating rollouts if the weight is outdated. |
|||||
type |
integer |
||||
default |
1 |
||||
|
Min Filter Prefix Tokens |
||||
Minimum number of tokens to filter the prefix tokens for the rollouts inside the same group. If the number of tokens is larger than the min_filter_prefix_tokens, the rollouts with the same prefix but different rewards will be filtered out in loss calculation. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Max Retry For On Policy |
||||
Maximum number of retries for on-policy rollout to have enough samples. If non-positive, will retry with no upper limit until enough samples are generated. |
|||||
type |
integer |
||||
default |
10 |
||||
|
Reference Reset Interval |
||||
Interval to reset the reference model to the current model. If set to None or 0, the reference model will not be reset during training. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Reset Optimizer With Reference |
||||
Whether to reset the optimizer state when the reference model is reset. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Balance Dp Token |
||||
Whether to balance the number of tokens in each data parallel replica when calculating the loss. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Use Decoupled Loss |
||||
Whether to use decoupled loss. A decoupled loss separates the optimization of the behavior policy and the target policy, which can help to reduce the variance of the gradient estimate. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Behav Imp Weight Cap |
||||
Clipping cap for behavior importance weights. Useful when decoupled loss is used to avoid large variance. |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
null |
||||
|
Rollout As Token Ids |
||||
Whether to use token ids for rollouts instead of text. This can save tokenization time during rollout generation. |
|||||
type |
boolean |
||||
default |
False |
||||
|
LoggingConfig |
||||
type |
object |
||||
properties |
|||||
|
Logger |
||||
List of loggers to use, e.g., ['console', 'wandb'] |
|||||
type |
array |
||||
items |
type |
string |
|||
|
Project Name |
||||
Wandb project name for logging. If set, the training will be logged to this project. |
|||||
type |
string |
||||
default |
cosmos_rl |
||||
|
Experiment Name |
||||
A short display name for this run. If not set, will use the output_dir as the experiment name. |
|||||
default |
None |
||||
anyOf |
type |
string |
|||
type |
null |
||||
|
LoraConfig |
||||
type |
object |
||||
properties |
|||||
|
R |
||||
LoRA rank |
|||||
type |
integer |
||||
default |
8 |
||||
|
Lora Alpha |
||||
LoRA alpha |
|||||
type |
number |
||||
default |
8.0 |
||||
|
Lora Dropout |
||||
LoRA dropout |
|||||
type |
number |
||||
default |
0.0 |
||||
|
Target Modules |
||||
LoRA target modules. Can be a list of strings or 'all-linear'. |
|||||
default |
None |
||||
anyOf |
type |
array |
|||
items |
type |
string |
|||
type |
string |
||||
|
Use Rslora |
||||
When set to True, uses [Rank-Stabilized LoRA](https://huggingface.co/papers/2312.03732) which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it was proven to work better. Otherwise, it will use the original default value of lora_alpha/r. |
|||||
type |
boolean |
||||
default |
False |
||||
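The two scaling rules described for Use Rslora can be compared directly. This is a small sketch of the formulas as stated; the function name is hypothetical.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Adapter scaling factor: lora_alpha / sqrt(r) with rsLoRA, else lora_alpha / r."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r


# With a fixed lora_alpha, the original rule shrinks the scaling linearly
# in the rank, while the rank-stabilized rule shrinks it with sqrt(rank).
for rank in (8, 16, 64):
    print(rank, lora_scaling(8.0, rank), lora_scaling(8.0, rank, use_rslora=True))
```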
|
Modules To Save |
||||
List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint. |
|||||
default |
None |
||||
anyOf |
type |
array |
|||
items |
type |
string |
|||
type |
null |
||||
|
Alpha Pattern |
||||
Per-module overrides for lora_alpha. Keys are regex patterns; evaluated in insertion order, first match wins. |
|||||
default |
None |
||||
anyOf |
type |
object |
|||
additionalProperties |
type |
number |
|||
type |
null |
||||
|
R Pattern |
||||
Per-module overrides for LoRA rank r. Keys are regex patterns; evaluated in insertion order, first match wins. |
|||||
default |
None |
||||
anyOf |
type |
object |
|||
additionalProperties |
type |
integer |
|||
type |
null |
||||
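The "insertion order, first match wins" rule shared by Alpha Pattern and R Pattern can be sketched as below. The helper `resolve_override` is hypothetical, and using `re.search` (rather than a full match) is an assumption; the source says only that keys are regex patterns.

```python
import re
from typing import Dict, Optional, TypeVar

T = TypeVar("T", int, float)


def resolve_override(module_name: str, patterns: Optional[Dict[str, T]], default: T) -> T:
    """Return the first pattern's value that matches module_name, else default."""
    if patterns:
        # Python dicts preserve insertion order, so iteration order is
        # the order in which the patterns were written.
        for pattern, value in patterns.items():
            if re.search(pattern, module_name):  # assumption: search semantics
                return value
    return default


r_pattern = {r"q_proj$": 16, r"proj$": 4}
print(resolve_override("layers.0.self_attn.q_proj", r_pattern, 8))
print(resolve_override("layers.0.self_attn.k_proj", r_pattern, 8))
print(resolve_override("layers.0.mlp.gate", r_pattern, 8))
```

Because the first match wins, more specific patterns should be listed before broader ones, as `q_proj$` is listed before `proj$` here.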
|
Init Lora Weights |
||||
How to initialize the weights of the adapter layers. Passing True (default) results in the default initialization from the reference implementation from Microsoft, with the LoRA B weight being set to 0. This means that without further training, the LoRA adapter will be a no-op. Setting the initialization to False leads to random initialization of LoRA A and B, meaning that LoRA is not a no-op before training; this setting is intended for debugging purposes. Passing 'gaussian' results in Gaussian initialization scaled by the LoRA rank for linear and layers. Pass 'loftq' to use LoftQ initialization. Passing 'eva' results in a data-driven initialization of Explained Variance Adaptation. EVA initializes LoRA based on the SVD of layer input activations and achieves SOTA performance due to its ability to adapt to the finetuning data. Pass 'olora' to use OLoRA initialization. Passing 'pissa' results in the initialization of https://huggingface.co/papers/2404.02948 |
|||||
default |
True |
||||
anyOf |
type |
boolean |
|||
type |
string |
||||
enum |
gaussian, eva, olora, pissa, pissa_niter_[number of iters] |
||||
|
MultiTurnRolloutConfig |
||||
type |
object |
||||
properties |
|||||
|
Enable |
||||
Whether to enable multi-turn rollout. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Enable Tools |
||||
Whether to enable tools in multi-turn rollout. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Enable Thinking |
||||
Whether to enable thinking in multi-turn rollout. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Custom Chat Template Path |
||||
The path to a custom chat template. |
|||||
default |
None |
||||
anyOf |
type |
string |
|||
type |
null |
||||
|
Max Assistant Turns |
||||
Max assistant turn count for multi-turn rollout. |
|||||
type |
integer |
||||
default |
5 |
||||
|
Add Generation Prompt |
||||
Whether to add generation prompt in multi-turn rollout. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Continue Final Message |
||||
Whether to continue the final message in multi-turn rollout. |
|||||
type |
boolean |
||||
default |
False |
||||
|
OverlongRewardConfig |
||||
type |
object |
||||
properties |
|||||
|
Enable Overlong Penalty |
||||
Enable overlong penalty for the model. If set to True, the output will be penalized for responses that are too long. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Buffer Length |
||||
Length of the buffer for overlong penalty. If the response length exceeds this value, the output will be penalized. |
|||||
type |
integer |
||||
default |
4096 |
||||
|
Penalty Factor |
||||
Penalty factor for overlong penalty. The penalty increases linearly with the length of the response exceeding the buffer length from 0 to the penalty_factor. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
ParallelismConfig |
||||
type |
object |
||||
properties |
|||||
|
N Init Replicas |
||||
Number of initial replicas to be created |
|||||
type |
integer |
||||
default |
1 |
||||
|
Tp Size |
||||
Tensor parallelism size |
|||||
type |
integer |
||||
default |
2 |
||||
|
Cp Size |
||||
Context parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Ep Size |
||||
Expert parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Dp Shard Size |
||||
Data Parallelism size in sharded mode |
|||||
type |
integer |
||||
default |
1 |
||||
|
Pp Size |
||||
Pipeline parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Pp Dynamic Shape |
||||
Pipeline parallelism dynamic shape |
|||||
type |
boolean |
||||
default |
False |
||||
|
Pp Micro Batch Size |
||||
Pipeline parallelism micro-batch size. n_micro_batch = batch_size / pp_micro_batch_size, which must be divisible by the number of pp stages. |
|||||
type |
integer |
||||
default |
1 |
||||
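The micro-batch constraint can be checked with a few lines of arithmetic. In this sketch the function name is hypothetical, and two assumptions are made: the number of pipeline stages equals `pp_size`, and `batch_size` must divide evenly by `pp_micro_batch_size`.

```python
def num_micro_batches(batch_size: int, pp_micro_batch_size: int, pp_size: int) -> int:
    """Compute n_micro_batch and validate it against the pipeline stage count."""
    if batch_size % pp_micro_batch_size != 0:
        raise ValueError("batch_size must be a multiple of pp_micro_batch_size")
    n_micro_batch = batch_size // pp_micro_batch_size
    # Assumption: the number of pp stages is pp_size.
    if n_micro_batch % pp_size != 0:
        raise ValueError("n_micro_batch must be divisible by the number of pp stages")
    return n_micro_batch


print(num_micro_batches(batch_size=8, pp_micro_batch_size=2, pp_size=4))
```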
|
Dp Replicate Size |
||||
Data parallelism size in replica mode. Only configurable in SFT jobs; must be 1 in GRPO jobs to support dynamic scaling. |
|||||
type |
integer |
||||
default |
1 |
||||
|
PolicyConfig |
||||
type |
object |
||||
properties |
|||||
|
#/$defs/ParallelismConfig |
||||
|
Model Name Or Path |
||||
The model name or path, compatible with huggingface model name or local path |
|||||
type |
string |
||||
default |
Qwen/Qwen2.5-VL-7B-Instruct |
||||
|
Model Revision |
||||
The revision of the model to use |
|||||
default |
None |
||||
anyOf |
type |
string |
|||
type |
null |
||||
|
Model Max Length |
||||
The maximum sequence length for training. Samples longer than this are ignored for training stability. |
|||||
type |
integer |
||||
default |
4096 |
||||
|
Model Gradient Checkpointing |
||||
Whether to use gradient checkpointing |
|||||
type |
boolean |
||||
default |
True |
||||
|
LoRA configuration |
||||
default |
None |
||||
anyOf |
#/$defs/LoraConfig |
||||
type |
null |
||||
|
Trainable Map |
||||
Mapping of name -> bool. Keys can be either exact parameter names (from model.named_parameters()) or exact module paths (from model.named_modules()). |
|||||
default |
None |
||||
anyOf |
type |
object |
|||
additionalProperties |
type |
boolean |
|||
type |
null |
||||
|
Enable Liger Kernel |
||||
Whether to use liger kernel. |
|||||
type |
boolean |
||||
default |
False |
||||
|
ProfilerConfig |
||||
type |
object |
||||
properties |
|||||
|
Enable Profiler |
||||
Enable profiler for training |
|||||
type |
boolean |
||||
default |
False |
||||
|
Enable Nsys |
||||
Enable nsys for training |
|||||
type |
boolean |
||||
default |
False |
||||
|
Sub profiler config |
||||
#/$defs/SubProfilerConfig |
|||||
|
RolloutConfig |
||||
type |
object |
||||
properties |
|||||
|
#/$defs/RolloutParallelismConfig |
||||
|
Enforce Eager |
||||
Whether to enable eager execution for vLLM. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Include Stop Str In Output |
||||
Whether to include stop string in output. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Gpu Memory Utilization |
||||
GPU memory utilization factor for rollout backend. |
|||||
type |
number |
||||
default |
0.8 |
||||
|
Enable Chunked Prefill |
||||
Whether to enable chunked prefill for vLLM. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Max Response Length |
||||
Max output length of rollout generation. |
|||||
type |
integer |
||||
default |
2048 |
||||
|
N Generation |
||||
Number of generations per prompt, equivalent to the n parameter in the OpenAI chat API. |
|||||
type |
integer |
||||
default |
16 |
||||
|
N Generation To Batch |
||||
Whether to treat n_generation as batch dimension in rollout generation. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Batch Size |
||||
Batch size for rollout. |
|||||
type |
integer |
||||
default |
1 |
||||
|
Quantization |
||||
Quantization in vllm rollout generation. |
|||||
type |
string |
||||
default |
none |
||||
|
Seed |
||||
Random seed for rollout. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
#/$defs/SamplingConfig |
||||
|
Vllm Use Flashinfer |
||||
Use flashinfer for vllm rollout. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Backend |
||||
Backend for rollout. Currently supported: vllm and trtllm. |
|||||
type |
string |
||||
default |
vllm |
||||
|
Configuration for multi-turn rollout. |
||||
#/$defs/MultiTurnRolloutConfig |
|||||
|
RolloutParallelismConfig |
||||
type |
object |
||||
properties |
|||||
|
N Init Replicas |
||||
Number of initial replicas to be created |
|||||
type |
integer |
||||
default |
1 |
||||
|
Tp Size |
||||
Tensor parallelism size |
|||||
type |
integer |
||||
default |
2 |
||||
|
Cp Size |
||||
Context parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Ep Size |
||||
Expert parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Dp Shard Size |
||||
Data Parallelism size in sharded mode |
|||||
type |
integer |
||||
default |
-1 |
||||
|
Pp Size |
||||
Pipeline parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Pp Dynamic Shape |
||||
Pipeline parallelism dynamic shape |
|||||
type |
boolean |
||||
default |
False |
||||
|
Pp Micro Batch Size |
||||
Pipeline parallelism micro-batch size. n_micro_batch = batch_size / pp_micro_batch_size, which must be divisible by the number of pp stages. |
|||||
type |
integer |
||||
default |
1 |
||||
|
Dp Replicate Size |
||||
Data parallelism size in replica mode. Only 1 is supported, to allow dynamic scaling. |
|||||
type |
integer |
||||
default |
1 |
||||
|
SFTDataConfig |
||||
type |
object |
||||
properties |
|||||
|
Type |
||||
type |
string |
||||
const |
sft |
||||
|
Dataset configuration for SFT training. It includes dataset name, subset, revision, train split, and test split. |
||||
#/$defs/DatasetConfig |
|||||
|
Mini Batch |
||||
Mini-batch size for training. |
|||||
type |
integer |
||||
default |
2 |
||||
|
Dataloader Shuffle |
||||
Shuffle the dataloader. If False, the dataloader will be used in the order it is loaded. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Enable Dataset Cache |
||||
Cache the results of dataset processing, which may accelerate dataset loading |
|||||
type |
boolean |
||||
default |
False |
||||
|
Dataloader Num Workers |
||||
Number of subprocesses to use for data loading |
|||||
type |
integer |
||||
default |
0 |
||||
|
Dataloader Prefetch Factor |
||||
Number of batches loaded in advance by each worker. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Conversation Column Name |
||||
Column name for the formatted conversation JSON |
|||||
type |
string |
||||
default |
conversations |
||||
|
System Prompt |
||||
System prompt for the model, which will be prepended to the prompt |
|||||
type |
string |
||||
default |
|||||
|
Balance Dp Token |
||||
Whether to balance the number of tokens in each data parallel replica when calculating the loss. |
|||||
type |
boolean |
||||
default |
True |
||||
|
SamplingConfig |
||||
type |
object |
||||
properties |
|||||
|
Temperature |
||||
Temperature for sampling. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Top P |
||||
Top-p for sampling. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Top K |
||||
Top-k for sampling. |
|||||
type |
integer |
||||
default |
-1 |
||||
|
Repetition Penalty |
||||
Repetition penalty for sampling. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Use Flashinfer |
||||
Use flashinfer for sampling. |
|||||
type |
boolean |
||||
default |
False |
||||
|
SubProfilerConfig |
||||
type |
object |
||||
properties |
|||||
|
Do Profile |
||||
Whether to profile, only used in runtime. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Active Steps |
||||
Number of active steps |
|||||
type |
integer |
||||
default |
1 |
||||
|
Warmup Steps |
||||
Number of warmup steps |
|||||
type |
integer |
||||
default |
1 |
||||
|
Wait Steps |
||||
Number of wait steps |
|||||
type |
integer |
||||
default |
1 |
||||
|
Rank Filter |
||||
Rank filter |
|||||
type |
array |
||||
items |
type |
integer |
|||
|
Record Shape |
||||
Whether to record shape |
|||||
type |
boolean |
||||
default |
False |
||||
|
Profile Memory |
||||
Whether to profile memory |
|||||
type |
boolean |
||||
default |
False |
||||
|
With Stack |
||||
Whether to profile stack |
|||||
type |
boolean |
||||
default |
False |
||||
|
With Modules |
||||
Whether to profile modules |
|||||
type |
boolean |
||||
default |
False |
||||
|
TrainingConfig |
||||
type |
object |
||||
properties |
|||||
|
Train Policy |
||||
default |
type |
grpo |
|||
variant |
grpo |
||||
dataset |
name |
||||
revision |
|||||
split |
|||||
subset |
|||||
test_size |
None |
||||
dataloader_shuffle |
True |
||||
enable_dataset_cache |
False |
||||
dataloader_num_workers |
0 |
||||
dataloader_prefetch_factor |
None |
||||
dataloader_batch_size |
1 |
||||
prompt_column_name |
|||||
response_column_name |
|||||
reward_function |
single_choice |
1.0 |
|||
filter_reward_metric |
|||||
temperature |
1.0 |
||||
epsilon_low |
0.2 |
||||
epsilon_high |
0.2 |
||||
positive_nll_coef |
None |
||||
lower_bound_ratio |
3.0 |
||||
loss_type |
token-mean |
||||
unbiased_loss_max_tokens |
None |
||||
unbiased_advantage |
False |
||||
overlong_reward |
buffer_length |
4096 |
|||
enable_overlong_penalty |
False |
||||
penalty_factor |
1.0 |
||||
kl_beta |
0.0 |
||||
aipo_rho |
None |
||||
mu_iterations |
1 |
||||
mini_batch |
2 |
||||
batch_size_per_optimize |
None |
||||
max_token_len_per_mini_batch |
None |
||||
entropy_coeff |
0.0 |
||||
allowed_outdated_steps |
4 |
||||
on_policy |
False |
||||
outdated_rollout_fetch_batch_size |
1 |
||||
min_filter_prefix_tokens |
None |
||||
max_retry_for_on_policy |
10 |
||||
reference_reset_interval |
None |
||||
reset_optimizer_with_reference |
True |
||||
balance_dp_token |
False |
||||
use_decoupled_loss |
False |
||||
behav_imp_weight_cap |
None |
||||
rollout_as_token_ids |
False |
||||
oneOf |
#/$defs/SFTDataConfig |
||||
#/$defs/GrpoConfig |
|||||
|
Optm Name |
||||
Optimizer name |
|||||
type |
string |
||||
default |
AdamW |
||||
|
Optm Lr |
||||
Learning rate for optimizer, can be a float or a list of floats for multiple optimizers |
|||||
default |
1e-06 |
||||
anyOf |
type |
number |
|||
type |
array |
||||
items |
type |
number |
|||
|
Optm Impl |
||||
Implementation type for the optimizer. Can be a list of strings for multiple optimizers. More info: https://pytorch.org/docs/stable/optim.html |
|||||
default |
fused |
||||
anyOf |
type |
string |
|||
type |
array |
||||
items |
type |
string |
|||
|
Optm Weight Decay |
||||
Weight decay for optimizer |
|||||
type |
number |
||||
default |
0.01 |
||||
|
Optm Betas |
||||
Betas for optimizer |
|||||
type |
array |
||||
default |
0.9 |
||||
0.999 |
|||||
maxItems |
2 |
||||
minItems |
2 |
||||
|
Optm Warmup Steps |
||||
Warmup steps for the optimizer. Can be an integer or a float; a float in the range [0.0, 1.0] is multiplied by the total number of steps. |
|||||
default |
20 |
||||
anyOf |
type |
integer |
|||
type |
number |
||||
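The integer-or-fraction rule for warmup steps can be sketched as below. The helper name is hypothetical; truncating the product and treating an out-of-range float as an absolute count are assumptions, since the source describes only the in-range float case.

```python
from typing import Union


def resolve_warmup_steps(optm_warmup_steps: Union[int, float], total_steps: int) -> int:
    """Turn the warmup setting into an absolute number of steps."""
    if isinstance(optm_warmup_steps, float):
        if 0.0 <= optm_warmup_steps <= 1.0:
            # A float in [0.0, 1.0] is a fraction of the total steps.
            return int(optm_warmup_steps * total_steps)  # assumption: truncate
        # Assumption: any other float is read as an absolute step count.
        return int(optm_warmup_steps)
    return optm_warmup_steps


print(resolve_warmup_steps(20, 1000))
print(resolve_warmup_steps(0.5, 1000))
```

Note the type sensitivity under this reading: the integer `1` means a single warmup step, whereas the float `1.0` means warming up over the entire run.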
|
Optm Decay Ratio |
||||
Ratio of total steps for decay, range in [0.0, 1.0], 0 means no decay. |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
null |
||||
|
Optm Decay Type |
||||
Type of decay for optimizer |
|||||
default |
None |
||||
anyOf |
type |
string |
|||
type |
null |
||||
|
Optm Min Lr Factor |
||||
Minimum lr factor for optimizer, range in [0.0, 1.0] |
|||||
type |
number |
||||
default |
0.0 |
||||
|
Optm Grad Norm Clip |
||||
Gradient norm clip for optimizer |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Master Dtype |
||||
The master weight data type for optimizers; it is orthogonal to param_dtype. Should be high precision for the sake of convergence. |
|||||
type |
string |
||||
default |
float32 |
||||
|
Param Dtype |
||||
The data type for forward/backward. Outside forward/backward, params are in master_dtype |
|||||
type |
string |
||||
default |
bfloat16 |
||||
|
Transfer Dtype |
||||
The data type for transferring parameters between Policy and Rollout. |
|||||
type |
string |
||||
default |
None |
||||
|
Fsdp Reduce Dtype |
||||
The data type for reduction in FSDP |
|||||
type |
string |
||||
default |
float32 |
||||
|
Fsdp Offload |
||||
Whether to offload the model to CPU if using FSDP |
|||||
type |
boolean |
||||
default |
False |
||||
|
Fsdp Reshard After Forward |
||||
Reshard the param after forward pass in FSDP |
|||||
type |
string |
||||
default |
default |
||||
|
Train Batch Per Replica |
||||
The batch size for training per iteration in one replica. This is the local batch size for each gradient accumulation step. |
|||||
type |
integer |
||||
default |
8 |
||||
|
#/$defs/FP8Config |
||||
|
#/$defs/FP4Config |
||||
|
#/$defs/CheckpointConfig |
||||
|
Resume |
||||
Resume training from a checkpoint. If True, will resume from the latest checkpoint of the output_dir. If a string, will resume from the specified checkpoint path. |
|||||
default |
False |
||||
anyOf |
type |
boolean |
|||
type |
string |
||||
|
Epoch |
||||
Number of epochs for training |
|||||
type |
integer |
||||
default |
1 |
||||
|
Output Dir |
||||
Output directory |
|||||
type |
string |
||||
default |
./outputs |
||||
|
Timestamp |
||||
Timestamp for the output directory and wandb ID. If not set, it will be generated automatically. |
|||||
type |
string |
||||
default |
|||||
|
Epsilon |
||||
Epsilon for optimizer |
|||||
type |
number |
||||
default |
1e-06 |
||||
|
Async Tp Enabled |
||||
Whether to use async tensor parallelism |
|||||
type |
boolean |
||||
default |
False |
||||
|
Compile |
||||
Whether to use torch.compile |
|||||
type |
boolean |
||||
default |
True |
||||
|
Sync Weight Interval |
||||
The interval of train step for synchronizing weights between replicas. |
|||||
type |
integer |
||||
default |
1 |
||||
|
Deterministic |
||||
Whether to use deterministic training. If set to True, will use deterministic training, which is expected to be slower. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Activation Offload |
||||
Whether to use activation offload |
|||||
type |
boolean |
||||
default |
False |
||||
|
Fa Version |
||||
FlashAttention version to use. If None, will use the default version. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Seed |
||||
Random seed for training. If deterministic is set to True, will by default be set to 42. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Local Dataset |
||||
Whether to query samples from the local dataset. If set to True, the local dataset is used. |
|||||
default |
True |
||||
anyOf |
type |
boolean |
|||
type |
null |
||||
|
Max Num Steps |
||||
Optional upper bound on total training steps. If set, training stops when either this step count or the epoch-based limit is reached (whichever comes first). Handy for quick smoke tests. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
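The "whichever comes first" rule for Max Num Steps amounts to taking a minimum, as the sketch below shows. The function name and the `steps_per_epoch` parameter are hypothetical; how the epoch-based limit is actually derived is not specified in the source.

```python
from typing import Optional


def total_training_steps(
    steps_per_epoch: int, epoch: int, max_num_steps: Optional[int] = None
) -> int:
    """Effective step budget: the epoch-based limit, optionally capped."""
    epoch_limit = steps_per_epoch * epoch
    if max_num_steps is None:
        return epoch_limit
    # Training stops at whichever limit is reached first.
    return min(epoch_limit, max_num_steps)


# A quick smoke test can cap a long run at a handful of steps.
print(total_training_steps(steps_per_epoch=100, epoch=3, max_num_steps=10))
```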
|
Sequence Packing |
||||
Whether to enable sequence packing for training. If set to True, the input sequences will be packed into a single tensor for training. |
|||||
type |
boolean |
||||
default |
False |
||||
|
ValidationConfig |
||||
type |
object |
||||
properties |
|||||
|
Enable |
||||
Enable validation during training. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Freq |
||||
Validation frequency during training, in terms of training steps |
|||||
type |
integer |
||||
default |
20 |
||||
|
Batch Size |
||||
Batch size for validation, will use the same batch size as training if not set. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Dataset configuration for validation. It includes dataset name, subset, revision and test split. |
||||
#/$defs/DatasetConfig |
|||||
|
Temperature |
||||
Temperature for sampling during validation. |
|||||
type |
number |
||||
default |
0.0 |
||||
|
Top P |
||||
Top-p for sampling during validation. |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
null |
||||
|
Top K |
||||
Top-k for sampling during validation. |
|||||
default |
1 |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Repetition Penalty |
||||
Repetition penalty for sampling during validation. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
N Generation |
||||
Number of generations per prompt during validation, equivalent to the n parameter in the OpenAI chat API. |
|||||
type |
integer |
||||
default |
1 |
||||
|
Max Response Length |
||||
Max output length of rollout generation during validation. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Reward Function |
||||
Reward functions for the model. Currently supported: single_choice, boxed_math, and format. You can weight each reward function by passing a dict, e.g., {'single_choice': 0.9, 'format': 0.1} |
|||||
default |
|||||
anyOf |
type |
string |
|||
type |
array |
||||
items |
type |
string |
|||
type |
object |
||||
additionalProperties |
type |
number |
|||