Configuration
Config |
|||||
type |
object |
||||
properties |
|||||
|
Custom |
||||
Custom script configuration. |
|||||
type |
object |
||||
additionalProperties |
True |
||||
|
#/$defs/TrainingConfig |
||||
|
#/$defs/RolloutConfig |
||||
|
#/$defs/PolicyConfig |
||||
|
#/$defs/LoggingConfig |
||||
|
#/$defs/ProfilerConfig |
||||
|
#/$defs/ValidationConfig |
||||
$defs |
|||||
|
CheckpointConfig |
||||
type |
object |
||||
properties |
|||||
|
Enable Checkpoint |
||||
Enable checkpointing for training. If set to False, no checkpoint will be saved. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Save Freq |
||||
Checkpoint save frequency, in training steps. |
|||||
type |
integer |
||||
default |
20 |
||||
|
Save Mode |
||||
Checkpoint save mode for training steps |
|||||
type |
string |
||||
default |
async |
||||
|
Max Keep |
||||
Maximum number of checkpoints to keep. If set to -1, all checkpoints will be kept. |
|||||
type |
integer |
||||
default |
5 |
||||
|
Export Safetensors |
||||
Whether to export safetensors weights for Hugging Face usage, including the related config files. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Upload Hf |
||||
Whether to upload the safetensors weight to huggingface. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Hf Repo Name |
||||
The huggingface repo name to upload the safetensors weight. |
|||||
type |
string |
||||
default |
Comos-Reason1 |
||||
|
Upload S3 |
||||
Whether to upload the checkpoint and safetensors to S3. Defaults to False. Set to final to upload only the final checkpoint, or to all to upload every checkpoint. |
|||||
default |
False |
||||
anyOf |
type |
boolean |
|||
type |
string |
||||
|
S3 Bucket |
||||
The S3 bucket name to upload the checkpoint and safetensors weight. |
|||||
default |
None |
||||
anyOf |
type |
string |
|||
type |
null |
||||
|
S3 Prefix |
||||
The S3 prefix to upload the checkpoint and safetensors weight. |
|||||
type |
string |
||||
default |
outputs |
||||
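The effect of Max Keep on saved checkpoints can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name `checkpoints_to_prune` is hypothetical, and pruning the oldest checkpoints first is an assumption.

```python
from typing import List


def checkpoints_to_prune(saved_steps: List[int], max_keep: int) -> List[int]:
    """Return the checkpoint steps to delete so that at most `max_keep` remain.

    Hypothetical helper: assumes the oldest checkpoints are pruned first.
    A `max_keep` of -1 keeps every checkpoint, as the Max Keep row states.
    """
    if max_keep == -1:
        return []
    if max_keep < 0:
        raise ValueError("max_keep must be -1 or a non-negative integer")
    ordered = sorted(saved_steps)
    excess = len(ordered) - max_keep
    return ordered[:excess] if excess > 0 else []


# With save_freq=20 and max_keep=5, seven saved checkpoints leave the five newest.
saved = list(range(20, 160, 20))  # steps 20, 40, ..., 140
print(checkpoints_to_prune(saved, max_keep=5))  # [20, 40]
```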
|
DatasetConfig |
||||
type |
object |
||||
properties |
|||||
|
Name |
||||
Huggingface dataset name or local path to parquet file |
|||||
type |
string |
||||
default |
|||||
|
Subset |
||||
Dataset subset, if one exists. |
|||||
default |
|||||
anyOf |
type |
string |
|||
type |
null |
||||
|
Revision |
||||
Dataset git revision, if one exists. Can be a branch name, a tag, or a commit hash. |
|||||
default |
|||||
anyOf |
type |
string |
|||
type |
null |
||||
|
Split |
||||
A list of dataset splits to train |
|||||
default |
|||||
anyOf |
type |
string |
|||
type |
array |
||||
items |
type |
string |
|||
|
Test Size |
||||
Size of the test set. If float, it is the ratio (between 0.0 and 1.0) of the dataset; if int, it is the absolute size of the test set. |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
integer |
||||
type |
null |
||||
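The two interpretations of Test Size (a ratio for floats, an absolute count for integers) can be sketched as below. The helper name `resolve_test_size` is hypothetical. Truncating the ratio result and treating None as "no test rows" are assumptions of this sketch, not documented behavior.

```python
from typing import Optional, Union


def resolve_test_size(num_rows: int, test_size: Optional[Union[int, float]]) -> int:
    """Translate a Test Size setting into an absolute number of test rows.

    Hypothetical helper. Floats are treated as a ratio in [0.0, 1.0] and ints
    as an absolute count. Truncation and None -> 0 are assumptions.
    """
    if test_size is None:
        return 0
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(test_size, bool):
        raise TypeError("test_size must be an int, a float, or None")
    if isinstance(test_size, float):
        if not 0.0 <= test_size <= 1.0:
            raise ValueError("a float test_size must lie in [0.0, 1.0]")
        return int(num_rows * test_size)
    if not 0 <= test_size <= num_rows:
        raise ValueError("an int test_size must lie in [0, num_rows]")
    return test_size


print(resolve_test_size(1000, 0.25))  # ratio of the dataset
print(resolve_test_size(1000, 200))   # absolute number of rows
```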
|
FP8Config |
||||
type |
object |
||||
properties |
|||||
|
Enable Fp8 |
||||
Whether to enable fp8. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Fp8 Recipe |
||||
Recipe for weight scale calculation. |
|||||
type |
string |
||||
default |
dynamic_scaling |
||||
|
Quant Recipe |
||||
Quantization strategy for weight. |
|||||
type |
string |
||||
default |
rowwise |
||||
|
GrpoConfig |
||||
type |
object |
||||
properties |
|||||
|
Type |
||||
type |
string |
||||
const |
grpo |
||||
|
Variant |
||||
Variant of GRPO. Currently supports grpo and dapo. |
|||||
type |
string |
||||
default |
grpo |
||||
|
Dataset configuration for GRPO training. It includes dataset name, subset, revision, train split, test split and test size. |
||||
#/$defs/DatasetConfig |
|||||
|
Dataloader Shuffle |
||||
Shuffle the dataloader. If False, the dataloader will be used in the order it is loaded. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Enable Dataset Cache |
||||
Cache the results of dataset processing, which may speed up dataset loading. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Dataloader Num Workers |
||||
Number of subprocesses to use for data loading. |
|||||
type |
integer |
||||
default |
0 |
||||
|
Dataloader Prefetch Factor |
||||
Number of batches loaded in advance by each worker. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Prompt Column Name |
||||
Column name for prompt |
|||||
type |
string |
||||
default |
|||||
|
Response Column Name |
||||
Column name for response/reference answer |
|||||
type |
string |
||||
default |
|||||
|
Reward Function |
||||
Reward functions for the model. Currently supports single_choice, boxed_math, and format. You can weight each reward function by passing a dict, e.g., {'single_choice': 0.9, 'format': 0.1}. |
|||||
anyOf |
type |
string |
|||
type |
array |
||||
items |
type |
string |
|||
type |
object |
||||
additionalProperties |
type |
number |
|||
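Reward Function accepts a string, a list of strings, or a dict of weights. One way to bring all three forms into a single shape is sketched below. The helper names are hypothetical. Giving unweighted entries a weight of 1.0 and combining scores by weighted sum are assumptions of this sketch.

```python
from typing import Dict, List, Union

RewardSpec = Union[str, List[str], Dict[str, float]]


def normalize_reward_spec(spec: RewardSpec) -> Dict[str, float]:
    """Convert any accepted Reward Function form into a name -> weight dict.

    Hypothetical helper: a bare name or list entry gets weight 1.0 (assumption).
    """
    if isinstance(spec, str):
        return {spec: 1.0}
    if isinstance(spec, dict):
        return {name: float(weight) for name, weight in spec.items()}
    return {name: 1.0 for name in spec}


def combine_rewards(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-function scores (summing is an assumption here)."""
    return sum(weights[name] * scores[name] for name in weights)


weights = normalize_reward_spec({"single_choice": 0.9, "format": 0.1})
# A response with the correct choice but the wrong format.
print(combine_rewards({"single_choice": 1.0, "format": 0.0}, weights))
```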
|
Temperature |
||||
Temperature for sampling. The higher the temperature, the more random the completions. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Epsilon Low |
||||
Epsilon value for clipping. |
|||||
type |
number |
||||
default |
0.2 |
||||
|
Epsilon High |
||||
Upper-bound epsilon value for clipping. If not specified, it defaults to the same value as the lower bound given by epsilon_low. The DAPO paper recommends 0.28. |
|||||
type |
number |
||||
default |
0.2 |
||||
|
Positive Nll Coef |
||||
Coefficient for Positive Example LM Loss. Set a positive value to enable; None disables the feature. |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
null |
||||
|
Lower Bound Ratio |
||||
Lower-bound ratio for dual-clip. |
|||||
type |
number |
||||
default |
3.0 |
||||
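How Epsilon Low, Epsilon High, and Lower Bound Ratio can interact in a per-token loss is sketched below. This follows the commonly published form of the clipped surrogate objective with dual-clip. It is an illustrative assumption, not a confirmed description of this library's loss code, and the function name is hypothetical.

```python
def clipped_policy_loss(
    ratio: float,
    advantage: float,
    epsilon_low: float = 0.2,
    epsilon_high: float = 0.2,
    lower_bound_ratio: float = 3.0,
) -> float:
    """Per-token clipped surrogate loss with asymmetric bounds and dual-clip.

    `ratio` is the new-to-old policy probability ratio for the token.
    """
    # Clip the ratio into [1 - epsilon_low, 1 + epsilon_high].
    clipped_ratio = min(max(ratio, 1.0 - epsilon_low), 1.0 + epsilon_high)
    # Take the pessimistic (larger) of the unclipped and clipped losses.
    loss = max(-advantage * ratio, -advantage * clipped_ratio)
    if advantage < 0:
        # Dual-clip: cap the loss so a very large ratio cannot dominate.
        loss = min(loss, -advantage * lower_bound_ratio)
    return loss


# A negative-advantage token with a very large ratio is capped by dual-clip.
print(clipped_policy_loss(ratio=10.0, advantage=-1.0))  # 3.0
```

Setting epsilon_high above epsilon_low (for example 0.28 against 0.2) widens only the upper clipping bound.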
|
Loss Type |
||||
The type of loss to use for GRPO training. |
|||||
type |
string |
||||
default |
token-mean |
||||
|
Unbiased Loss Max Tokens |
||||
Maximum number of tokens to use for the unbiased loss introduced in Dr. GRPO. If set to None, the unbiased loss is not used. Only available when loss_type is seq-mean-token-mean. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Unbiased Advantage |
||||
Whether to divide the advantage by the standard deviation of rewards. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Configuration for overlong reward penalty. If enabled, the output will be penalized for responses that are too long. |
||||
#/$defs/OverlongRewardConfig |
|||||
|
Kl Beta |
||||
KL coefficient. If 0.0, the reference model is not loaded, reducing memory usage and improving training speed, but may be numerically unstable for long training runs. |
|||||
type |
number |
||||
default |
0.0 |
||||
|
Aipo Rho |
||||
Rho value for AIPO (Asynchronous Importance weighted Policy Optimization): the clipping constant of the importance sampling ratio. Suggested range: [2, 10]. Reference: https://arxiv.org/pdf/2505.24034 |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
null |
||||
|
Mu Iterations |
||||
Number of iterations per batch (denoted as μ in the algorithm). |
|||||
type |
integer |
||||
default |
1 |
||||
|
Mini Batch |
||||
Mini-batch size for GRPO training. |
|||||
type |
integer |
||||
default |
2 |
||||
|
Allowed Outdated Steps |
||||
Number of outdated (asynchronous) steps allowed for the rollout engine. If the number of pending rollouts exceeds allowed_outdated_steps * n_policy_replicas * train_batch_per_replica, rollout engine traffic is throttled. |
|||||
type |
integer |
||||
default |
4 |
||||
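The throttling condition stated for Allowed Outdated Steps can be written out directly. The function name and the example values for the number of policy replicas and pending rollouts are hypothetical. The comparison itself comes from the description above.

```python
def should_throttle_rollout(
    pending_rollouts: int,
    allowed_outdated_steps: int = 4,
    n_policy_replicas: int = 1,
    train_batch_per_replica: int = 8,
) -> bool:
    """Return True when pending rollouts exceed the allowed backlog."""
    limit = allowed_outdated_steps * n_policy_replicas * train_batch_per_replica
    return pending_rollouts > limit


# With the defaults above the backlog limit is 4 * 1 * 8 = 32 rollouts.
print(should_throttle_rollout(pending_rollouts=33))  # True
print(should_throttle_rollout(pending_rollouts=32))  # False
```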
|
On Policy |
||||
Enable fully synchronized (on-policy) rollout. If set to True, the rollout engine will wait until the expected weight version is updated before next generation starts. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Min Filter Prefix Tokens |
||||
Minimum shared-prefix length, in tokens, for filtering rollouts inside the same group. If the number of shared prefix tokens is larger than min_filter_prefix_tokens, rollouts with the same prefix but different rewards are filtered out of the loss calculation. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
LoggingConfig |
||||
type |
object |
||||
properties |
|||||
|
Logger |
||||
List of loggers to use, e.g., ['console', 'wandb'] |
|||||
type |
array |
||||
items |
type |
string |
|||
|
Project Name |
||||
Wandb project name for logging. If set, the training will be logged to this project. |
|||||
type |
string |
||||
default |
cosmos_rl |
||||
|
Experiment Name |
||||
A short display name for this run. If not set, will use the output_dir as the experiment name. |
|||||
default |
None |
||||
anyOf |
type |
string |
|||
type |
null |
||||
|
LoraConfig |
||||
type |
object |
||||
properties |
|||||
|
R |
||||
LoRA rank |
|||||
type |
integer |
||||
default |
8 |
||||
|
Lora Alpha |
||||
LoRA alpha |
|||||
type |
number |
||||
default |
8.0 |
||||
|
Lora Dropout |
||||
LoRA dropout |
|||||
type |
number |
||||
default |
0.0 |
||||
|
Target Modules |
||||
LoRA target modules. Can be a list of strings or 'all-linear'. |
|||||
default |
None |
||||
anyOf |
type |
array |
|||
items |
type |
string |
|||
type |
string |
||||
|
Use Rslora |
||||
When set to True, uses [Rank-Stabilized LoRA](https://huggingface.co/papers/2312.03732) which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it was proven to work better. Otherwise, it will use the original default value of lora_alpha/r. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Modules To Save |
||||
List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint. |
|||||
default |
None |
||||
anyOf |
type |
array |
|||
items |
type |
string |
|||
type |
null |
||||
|
Init Lora Weights |
||||
How to initialize the weights of the adapter layers. Passing True (default) results in the default initialization from the reference implementation from Microsoft, with the LoRA B weight being set to 0. This means that without further training, the LoRA adapter will be a no-op. Setting the initialization to False leads to random initialization of LoRA A and B, meaning that LoRA is not a no-op before training; this setting is intended for debugging purposes. Passing 'gaussian' results in Gaussian initialization scaled by the LoRA rank for linear and layers. Pass 'loftq' to use LoftQ initialization. Passing 'eva' results in a data-driven initialization of Explained Variance Adaptation. EVA initializes LoRA based on the SVD of layer input activations and achieves SOTA performance due to its ability to adapt to the finetuning data. Pass 'olora' to use OLoRA initialization. Passing 'pissa' results in the initialization of https://huggingface.co/papers/2404.02948 |
|||||
default |
True |
||||
anyOf |
type |
boolean |
|||
type |
string |
||||
enum |
gaussian, eva, olora, pissa, pissa_niter_[number of iters] |
||||
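The two adapter scaling factors named under Use Rslora, lora_alpha/r and lora_alpha/math.sqrt(r), can be compared with a small sketch. The function name is hypothetical. The formulas are the ones given in the Use Rslora description.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Adapter scaling factor: lora_alpha / sqrt(r) for rsLoRA, else lora_alpha / r."""
    if r <= 0:
        raise ValueError("LoRA rank r must be positive")
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


# With the defaults above (r=8, lora_alpha=8.0) the original scaling is 1.0.
print(lora_scaling(8.0, 8))
# Rank-stabilized scaling shrinks more slowly as the rank grows.
print(lora_scaling(8.0, 8, use_rslora=True))
```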
|
OverlongRewardConfig |
||||
type |
object |
||||
properties |
|||||
|
Enable Overlong Penalty |
||||
Enable overlong penalty for the model. If set to True, the output will be penalized for responses that are too long. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Buffer Length |
||||
Length of the buffer for overlong penalty. If the response length exceeds this value, the output will be penalized. |
|||||
type |
integer |
||||
default |
4096 |
||||
|
Penalty Factor |
||||
Penalty factor for overlong penalty. The penalty increases linearly with the length of the response exceeding the buffer length from 0 to the penalty_factor. |
|||||
type |
number |
||||
default |
1.0 |
||||
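One possible reading of the linear overlong penalty is sketched below. It assumes that the penalty zone is the last buffer_length tokens before a maximum response length, and that the penalty is applied as a negative reward. Both are assumptions of this sketch, as are the function name and the `max_response_length` argument. The descriptions above only state that the penalty grows linearly from 0 to penalty_factor.

```python
def overlong_penalty(
    response_length: int,
    max_response_length: int,
    buffer_length: int = 4096,
    penalty_factor: float = 1.0,
) -> float:
    """Linear length penalty in [-penalty_factor, 0.0].

    Assumption: no penalty up to (max_response_length - buffer_length) tokens,
    then a linear ramp that reaches -penalty_factor at max_response_length.
    """
    if buffer_length <= 0:
        raise ValueError("buffer_length must be positive")
    expected_length = max_response_length - buffer_length
    exceed = response_length - expected_length
    if exceed <= 0:
        return 0.0
    # Cap the ramp so the penalty never exceeds penalty_factor in magnitude.
    return -min(exceed / buffer_length, 1.0) * penalty_factor


# Halfway through the buffer zone the penalty is half of penalty_factor.
print(overlong_penalty(response_length=6144, max_response_length=8192))  # -0.5
```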
|
ParallelismConfig |
||||
type |
object |
||||
properties |
|||||
|
N Init Replicas |
||||
Number of initial replicas to be created |
|||||
type |
integer |
||||
default |
1 |
||||
|
Tp Size |
||||
Tensor parallelism size |
|||||
type |
integer |
||||
default |
2 |
||||
|
Cp Size |
||||
Context parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Ep Size |
||||
Expert parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Dp Shard Size |
||||
Data Parallelism size in sharded mode |
|||||
type |
integer |
||||
default |
-1 |
||||
|
Pp Size |
||||
Pipeline parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Pp Dynamic Shape |
||||
Pipeline parallelism dynamic shape |
|||||
type |
boolean |
||||
default |
False |
||||
|
Pp Micro Batch Size |
||||
Pipeline parallelism micro batch size, n_micro_batch = batch_size / pp_micro_batch_size, which must be divisible by pp stages |
|||||
type |
integer |
||||
default |
1 |
||||
|
Dp Replicate Size |
||||
Data parallelism size in replica mode. Only configurable in SFT jobs; must be 1 in GRPO jobs to support dynamic scaling. |
|||||
type |
integer |
||||
default |
1 |
||||
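The constraint given for Pp Micro Batch Size can be checked with a short sketch. The function name is hypothetical, and treating the number of pipeline stages as equal to pp_size is an assumption. The relation n_micro_batch = batch_size / pp_micro_batch_size comes from the description above.

```python
def num_pp_micro_batches(batch_size: int, pp_micro_batch_size: int, pp_size: int) -> int:
    """Return n_micro_batch, validating the pipeline-parallel constraints.

    Assumption: the number of pipeline stages equals pp_size.
    """
    if batch_size % pp_micro_batch_size != 0:
        raise ValueError("batch_size must be a multiple of pp_micro_batch_size")
    n_micro_batch = batch_size // pp_micro_batch_size
    if n_micro_batch % pp_size != 0:
        raise ValueError("n_micro_batch must be divisible by the number of pp stages")
    return n_micro_batch


print(num_pp_micro_batches(batch_size=8, pp_micro_batch_size=2, pp_size=4))  # 4
```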
|
PolicyConfig |
||||
type |
object |
||||
properties |
|||||
|
#/$defs/ParallelismConfig |
||||
|
Model Name Or Path |
||||
The model name or path, compatible with huggingface model name or local path |
|||||
type |
string |
||||
default |
Qwen/Qwen2.5-VL-7B-Instruct |
||||
|
Model Revision |
||||
The revision of the model to use |
|||||
default |
None |
||||
anyOf |
type |
string |
|||
type |
null |
||||
|
Model Max Length |
||||
The maximum sequence length for training. Samples longer than this are ignored for training stability. |
|||||
type |
integer |
||||
default |
4096 |
||||
|
Model Gradient Checkpointing |
||||
Whether to use gradient checkpointing |
|||||
type |
boolean |
||||
default |
True |
||||
|
LoRA configuration |
||||
default |
None |
||||
anyOf |
#/$defs/LoraConfig |
||||
type |
null |
||||
|
Trainable Map |
||||
Mapping of name -> bool. Keys can be either exact parameter names (from model.named_parameters()) or exact module paths (from model.named_modules()). |
|||||
default |
None |
||||
anyOf |
type |
object |
|||
additionalProperties |
type |
boolean |
|||
type |
null |
||||
|
Enable Liger Kernel |
||||
Whether to use liger kernel. |
|||||
type |
boolean |
||||
default |
False |
||||
|
ProfilerConfig |
||||
type |
object |
||||
properties |
|||||
|
Enable Profiler |
||||
Enable profiler for training |
|||||
type |
boolean |
||||
default |
False |
||||
|
Enable Nsys |
||||
Enable nsys for training |
|||||
type |
boolean |
||||
default |
False |
||||
|
Sub profiler config |
||||
#/$defs/SubProfilerConfig |
|||||
|
RolloutConfig |
||||
type |
object |
||||
properties |
|||||
|
#/$defs/RolloutParallelismConfig |
||||
|
Enforce Eager |
||||
Whether to enable eager execution for vLLM. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Include Stop Str In Output |
||||
Whether to include stop string in output. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Gpu Memory Utilization |
||||
GPU memory utilization factor for rollout backend. |
|||||
type |
number |
||||
default |
0.8 |
||||
|
Enable Chunked Prefill |
||||
Whether to enable chunked prefill for vLLM. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Max Response Length |
||||
Max output length of rollout generation. |
|||||
type |
integer |
||||
default |
2048 |
||||
|
N Generation |
||||
Number of completions to generate per prompt, like the n parameter in the OpenAI chat API. |
|||||
type |
integer |
||||
default |
16 |
||||
|
Batch Size |
||||
Batch size for rollout. |
|||||
type |
integer |
||||
default |
1 |
||||
|
Val Batch Size |
||||
Batch size for rollout generation during validation. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Quantization |
||||
Quantization in vllm rollout generation. |
|||||
type |
string |
||||
default |
none |
||||
|
Seed |
||||
Random seed for rollout. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
#/$defs/SamplingConfig |
||||
|
Vllm Use Flashinfer |
||||
Use flashinfer for vllm rollout. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Backend |
||||
Backend for rollout. Currently supports vllm and trtllm. |
|||||
type |
string |
||||
default |
vllm |
||||
|
RolloutParallelismConfig |
||||
type |
object |
||||
properties |
|||||
|
N Init Replicas |
||||
Number of initial replicas to be created |
|||||
type |
integer |
||||
default |
1 |
||||
|
Tp Size |
||||
Tensor parallelism size |
|||||
type |
integer |
||||
default |
2 |
||||
|
Cp Size |
||||
Context parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Ep Size |
||||
Expert parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Dp Shard Size |
||||
Data Parallelism size in sharded mode |
|||||
type |
integer |
||||
default |
-1 |
||||
|
Pp Size |
||||
Pipeline parallelism size |
|||||
type |
integer |
||||
default |
1 |
||||
|
Pp Dynamic Shape |
||||
Pipeline parallelism dynamic shape |
|||||
type |
boolean |
||||
default |
False |
||||
|
Pp Micro Batch Size |
||||
Pipeline parallelism micro batch size, n_micro_batch = batch_size / pp_micro_batch_size, which must be divisible by pp stages |
|||||
type |
integer |
||||
default |
1 |
||||
|
Dp Replicate Size |
||||
Data parallelism size in replica mode. Only 1 is supported, to allow dynamic scaling. |
|||||
type |
integer |
||||
default |
1 |
||||
|
SFTDataConfig |
||||
type |
object |
||||
properties |
|||||
|
Type |
||||
type |
string |
||||
const |
sft |
||||
|
Dataset configuration for SFT training. It includes dataset name, subset, revision, train split, and test split. |
||||
#/$defs/DatasetConfig |
|||||
|
Mini Batch |
||||
Mini-batch size for training. |
|||||
type |
integer |
||||
default |
2 |
||||
|
Dataloader Shuffle |
||||
Shuffle the dataloader. If False, the dataloader will be used in the order it is loaded. |
|||||
type |
boolean |
||||
default |
True |
||||
|
Enable Dataset Cache |
||||
Cache the results of dataset processing, which may speed up dataset loading. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Dataloader Num Workers |
||||
Number of subprocesses to use for data loading. |
|||||
type |
integer |
||||
default |
0 |
||||
|
Dataloader Prefetch Factor |
||||
Number of batches loaded in advance by each worker. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Conversation Column Name |
||||
Column name for the formatted conversation JSON. |
|||||
type |
string |
||||
default |
conversations |
||||
|
System Prompt |
||||
System prompt for the model, which will be prepended to the prompt |
|||||
type |
string |
||||
default |
|||||
|
SamplingConfig |
||||
type |
object |
||||
properties |
|||||
|
Temperature |
||||
Temperature for sampling. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Top P |
||||
Top-p for sampling. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Top K |
||||
Top-k for sampling. |
|||||
type |
integer |
||||
default |
-1 |
||||
|
Repetition Penalty |
||||
Repetition penalty for sampling. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Use Flashinfer |
||||
Use flashinfer for sampling. |
|||||
type |
boolean |
||||
default |
False |
||||
|
SubProfilerConfig |
||||
type |
object |
||||
properties |
|||||
|
Do Profile |
||||
Whether to profile, only used in runtime. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Active Steps |
||||
Number of active steps |
|||||
type |
integer |
||||
default |
1 |
||||
|
Warmup Steps |
||||
Number of warmup steps |
|||||
type |
integer |
||||
default |
1 |
||||
|
Wait Steps |
||||
Number of wait steps |
|||||
type |
integer |
||||
default |
1 |
||||
|
Rank Filter |
||||
Rank filter |
|||||
type |
array |
||||
items |
type |
integer |
|||
|
Record Shape |
||||
Whether to record shape |
|||||
type |
boolean |
||||
default |
False |
||||
|
Profile Memory |
||||
Whether to profile memory |
|||||
type |
boolean |
||||
default |
False |
||||
|
With Stack |
||||
Whether to profile stack |
|||||
type |
boolean |
||||
default |
False |
||||
|
With Modules |
||||
Whether to profile modules |
|||||
type |
boolean |
||||
default |
False |
||||
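The Wait Steps, Warmup Steps, and Active Steps settings resemble the wait, warmup, and active arguments of torch.profiler.schedule. Assuming the phases run once in that order (an assumption, as the descriptions above do not say), the phase for a given step can be sketched in plain Python. The function name is hypothetical.

```python
def profiler_phase(
    step: int,
    wait_steps: int = 1,
    warmup_steps: int = 1,
    active_steps: int = 1,
) -> str:
    """Return the profiler phase for a zero-based step index.

    Assumption: a single wait -> warmup -> active cycle, with no repeats.
    """
    if step < wait_steps:
        return "wait"
    if step < wait_steps + warmup_steps:
        return "warmup"
    if step < wait_steps + warmup_steps + active_steps:
        return "active"
    return "done"


# With the defaults above, only step 2 is recorded.
print([profiler_phase(step) for step in range(4)])
```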
|
TrainingConfig |
||||
type |
object |
||||
properties |
|||||
|
Train Policy |
||||
default |
type |
grpo |
|||
variant |
grpo |
||||
dataset |
name |
||||
revision |
|||||
split |
|||||
subset |
|||||
test_size |
None |
||||
dataloader_shuffle |
True |
||||
enable_dataset_cache |
False |
||||
dataloader_num_workers |
0 |
||||
dataloader_prefetch_factor |
None |
||||
prompt_column_name |
|||||
response_column_name |
|||||
reward_function |
single_choice |
1.0 |
|||
temperature |
1.0 |
||||
epsilon_low |
0.2 |
||||
epsilon_high |
0.2 |
||||
positive_nll_coef |
None |
||||
lower_bound_ratio |
3.0 |
||||
loss_type |
token-mean |
||||
unbiased_loss_max_tokens |
None |
||||
unbiased_advantage |
False |
||||
overlong_reward |
buffer_length |
4096 |
|||
enable_overlong_penalty |
False |
||||
penalty_factor |
1.0 |
||||
kl_beta |
0.0 |
||||
aipo_rho |
None |
||||
mu_iterations |
1 |
||||
mini_batch |
2 |
||||
allowed_outdated_steps |
4 |
||||
on_policy |
False |
||||
min_filter_prefix_tokens |
None |
||||
oneOf |
#/$defs/SFTDataConfig |
||||
#/$defs/GrpoConfig |
|||||
|
Optm Name |
||||
Optimizer name |
|||||
type |
string |
||||
default |
AdamW |
||||
|
Optm Lr |
||||
Learning rate for optimizer, can be a float or a list of floats for multiple optimizers |
|||||
default |
1e-06 |
||||
anyOf |
type |
number |
|||
type |
array |
||||
items |
type |
number |
|||
|
Optm Impl |
||||
Implementation type for optimizer. More info: https://pytorch.org/docs/stable/optim.html, can be a list of strings for multiple optimizers |
|||||
default |
fused |
||||
anyOf |
type |
string |
|||
type |
array |
||||
items |
type |
string |
|||
|
Optm Weight Decay |
||||
Weight decay for optimizer |
|||||
type |
number |
||||
default |
0.01 |
||||
|
Optm Betas |
||||
Betas for optimizer |
|||||
type |
array |
||||
default |
0.9 |
||||
0.999 |
|||||
maxItems |
2 |
||||
minItems |
2 |
||||
|
Optm Warmup Steps |
||||
Warmup steps for the optimizer. Can be an integer or a float; a float in the range [0.0, 1.0] is multiplied by the total number of steps. |
|||||
default |
20 |
||||
anyOf |
type |
integer |
|||
type |
number |
||||
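The integer-or-fraction rule for Optm Warmup Steps can be sketched as follows. The function name is hypothetical, and truncating the fractional result to an integer is an assumption of this sketch.

```python
from typing import Union


def resolve_warmup_steps(optm_warmup_steps: Union[int, float], total_steps: int) -> int:
    """Resolve Optm Warmup Steps to an absolute step count.

    An int is used as-is. A float in [0.0, 1.0] is multiplied by total_steps
    (the result is truncated, which is an assumption).
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(optm_warmup_steps, bool):
        raise TypeError("optm_warmup_steps must be an int or a float")
    if isinstance(optm_warmup_steps, float):
        if not 0.0 <= optm_warmup_steps <= 1.0:
            raise ValueError("a float optm_warmup_steps must lie in [0.0, 1.0]")
        return int(optm_warmup_steps * total_steps)
    return optm_warmup_steps


print(resolve_warmup_steps(20, total_steps=1000))    # 20
print(resolve_warmup_steps(0.25, total_steps=1000))  # 250
```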
|
Optm Decay Ratio |
||||
Ratio of total steps for decay, range in [0.0, 1.0], 0 means no decay. |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
null |
||||
|
Optm Decay Type |
||||
Type of decay for optimizer |
|||||
default |
None |
||||
anyOf |
type |
string |
|||
type |
null |
||||
|
Optm Min Lr Factor |
||||
Minimum lr factor for optimizer, range in [0.0, 1.0] |
|||||
type |
number |
||||
default |
0.0 |
||||
|
Optm Grad Norm Clip |
||||
Gradient norm clip for optimizer |
|||||
type |
number |
||||
default |
1.0 |
||||
|
Master Dtype |
||||
The master weight data type for optimizers; orthogonal to param_dtype. |
|||||
default |
float32 |
||||
anyOf |
type |
string |
|||
type |
null |
||||
|
Param Dtype |
||||
The data type for forward/backward. Outside forward/backward, params are in master_dtype |
|||||
type |
string |
||||
default |
bfloat16 |
||||
|
Transfer Dtype |
||||
The data type used to transfer parameters between Policy and Rollout. |
|||||
type |
string |
||||
default |
None |
||||
|
Fsdp Reduce Dtype |
||||
The data type for reduction in FSDP |
|||||
type |
string |
||||
default |
float32 |
||||
|
Fsdp Offload |
||||
Whether to offload the model to CPU if using FSDP |
|||||
type |
boolean |
||||
default |
False |
||||
|
Fsdp Reshard After Forward |
||||
Reshard the param after forward pass in FSDP |
|||||
type |
string |
||||
default |
default |
||||
|
Train Batch Per Replica |
||||
The batch size for training per iteration in one replica, this is the local batch size for each gradient accumulation step |
|||||
type |
integer |
||||
default |
8 |
||||
|
Enable Validation |
||||
Enable validation during training. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Validation Step |
||||
Validation frequency during training, in terms of training steps |
|||||
type |
integer |
||||
default |
20 |
||||
|
Validation Batch Per Replica |
||||
The batch size for validation per iteration in one replica. |
|||||
type |
integer |
||||
default |
24 |
||||
|
#/$defs/FP8Config |
||||
|
#/$defs/CheckpointConfig |
||||
|
Resume |
||||
Resume training from a checkpoint. If True, will resume from the latest checkpoint of the output_dir. If a string, will resume from the specified checkpoint path. |
|||||
default |
False |
||||
anyOf |
type |
boolean |
|||
type |
string |
||||
|
Epoch |
||||
Number of epochs for training |
|||||
type |
integer |
||||
default |
1 |
||||
|
Output Dir |
||||
Output directory |
|||||
type |
string |
||||
default |
./outputs |
||||
|
Timestamp |
||||
Timestamp for the output directory and wandb ID. If not set, it is generated automatically. |
|||||
type |
string |
||||
default |
|||||
|
Epsilon |
||||
Epsilon for optimizer |
|||||
type |
number |
||||
default |
1e-06 |
||||
|
Async Tp Enabled |
||||
Whether to use async tensor parallelism |
|||||
type |
boolean |
||||
default |
False |
||||
|
Compile |
||||
Whether to use torch.compile |
|||||
type |
boolean |
||||
default |
True |
||||
|
Sync Weight Interval |
||||
The interval of train step for synchronizing weights between replicas. |
|||||
type |
integer |
||||
default |
1 |
||||
|
Deterministic |
||||
Whether to use deterministic training. If set to True, will use deterministic training, which is expected to be slower. |
|||||
type |
boolean |
||||
default |
False |
||||
|
Seed |
||||
Random seed for training. If deterministic is set to True, will by default be set to 42. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Max Num Steps |
||||
Optional upper bound on total training steps. If set, training stops when either this step count or the epoch-based limit is reached (whichever comes first). Handy for quick smoke tests. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
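The "whichever comes first" rule for Max Num Steps can be sketched as below. The function name and the `steps_per_epoch` argument are hypothetical. How the library derives the epoch-based limit is not stated here, so steps_per_epoch * epoch is an assumption.

```python
from typing import Optional


def total_training_steps(
    steps_per_epoch: int,
    epoch: int = 1,
    max_num_steps: Optional[int] = None,
) -> int:
    """Number of steps actually run: the smaller of the two limits."""
    epoch_limit = steps_per_epoch * epoch  # assumed form of the epoch-based limit
    if max_num_steps is None:
        return epoch_limit
    return min(epoch_limit, max_num_steps)


# A quick smoke test: stop after 10 steps even though an epoch has 500.
print(total_training_steps(steps_per_epoch=500, epoch=1, max_num_steps=10))  # 10
```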
|
ValidationConfig |
||||
type |
object |
||||
properties |
|||||
|
Dataset configuration for validation. It includes dataset name, subset, revision and test split. |
||||
#/$defs/DatasetConfig |
|||||
|
Temperature |
||||
Temperature for sampling during validation. |
|||||
type |
number |
||||
default |
0.0 |
||||
|
Top P |
||||
Top-p for sampling during validation. |
|||||
default |
None |
||||
anyOf |
type |
number |
|||
type |
null |
||||
|
Top K |
||||
Top-k for sampling during validation. |
|||||
default |
1 |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Repetition Penalty |
||||
Repetition penalty for sampling during validation. |
|||||
type |
number |
||||
default |
1.0 |
||||
|
N Generation |
||||
Number of completions to generate per prompt during validation, like the n parameter in the OpenAI chat API. |
|||||
type |
integer |
||||
default |
1 |
||||
|
Max Response Length |
||||
Max output length of rollout generation during validation. |
|||||
default |
None |
||||
anyOf |
type |
integer |
|||
type |
null |
||||
|
Reward Function |
||||
Reward functions for the model. Currently supports single_choice, boxed_math, and format. You can weight each reward function by passing a dict, e.g., {'single_choice': 0.9, 'format': 0.1}. |
|||||
anyOf |
type |
string |
|||
type |
array |
||||
items |
type |
string |
|||
type |
object |
||||
additionalProperties |
type |
number |