Configuration

Config

type

object

properties

  • custom

Custom

Custom script configuration.

type

object

additionalProperties

True

  • train

#/$defs/TrainingConfig

  • rollout

#/$defs/RolloutConfig

  • policy

#/$defs/PolicyConfig

  • logging

#/$defs/LoggingConfig

  • profiler

#/$defs/ProfilerConfig

  • validation

#/$defs/ValidationConfig

  • distillation

#/$defs/DistillationConfig

  • vla

#/$defs/VLAConfig

  • mode

Mode

Running mode. Either ‘disaggregated’ or ‘colocated’.

type

string

default

disaggregated

$defs

  • CheckpointConfig

CheckpointConfig

type

object

properties

  • enable_checkpoint

Enable Checkpoint

Enable checkpointing for training. If set to False, no checkpoint will be saved.

type

boolean

default

False

  • save_freq

Save Freq

Checkpoint save frequency for training steps

type

integer

default

20

  • save_freq_in_epoch

Save Freq In Epoch

Checkpoint save frequency in training epochs. Defaults to 0 (disabled).

type

integer

default

0

  • save_mode

Save Mode

Checkpoint save mode for training steps

type

string

default

async

  • max_keep

Max Keep

Maximum number of checkpoints to keep. If set to -1, all checkpoints will be kept.

type

integer

default

5

  • export_safetensors

Export Safetensors

Whether to export safetensors weights for Hugging Face usage, including the related config files.

type

boolean

default

True

  • upload_hf

Upload Hf

Whether to upload the safetensors weight to huggingface.

type

boolean

default

False

  • hf_repo_name

Hf Repo Name

The huggingface repo name to upload the safetensors weight.

type

string

default

Comos-Reason1

  • upload_s3

Upload S3

Whether to upload the checkpoint and safetensors to S3. Defaults to False. Set to ‘final’ to upload only the final checkpoint, or ‘all’ to upload every checkpoint.

default

False

anyOf

type

boolean

type

string

  • s3_bucket

S3 Bucket

The S3 bucket name to upload the checkpoint and safetensors weight.

default

None

anyOf

type

string

type

null

  • s3_prefix

S3 Prefix

The S3 prefix to upload the checkpoint and safetensors weight.

type

string

default

outputs
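
The interaction between save_freq and max_keep can be sketched as follows. This is a minimal illustration, not the framework's implementation: the helper names should_save and prune_checkpoints are hypothetical, and the sketch assumes that the oldest checkpoints are the ones dropped once max_keep is exceeded.

```python
def should_save(step: int, save_freq: int) -> bool:
    """Return True when `step` is a positive multiple of `save_freq`."""
    return save_freq > 0 and step > 0 and step % save_freq == 0


def prune_checkpoints(saved_steps: list[int], max_keep: int) -> list[int]:
    """Return the steps whose checkpoints survive, oldest first.

    Assumption: the oldest checkpoints are removed first.
    """
    ordered = sorted(saved_steps)
    if max_keep == -1:  # -1 keeps every checkpoint, per the field description
        return ordered
    return ordered[-max_keep:] if max_keep > 0 else []


# With the defaults save_freq=20 and max_keep=5, a 200-step run
# saves ten checkpoints and keeps the five most recent ones.
saved = [s for s in range(1, 201) if should_save(s, save_freq=20)]
kept = prune_checkpoints(saved, max_keep=5)
print(kept)
```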

  • DatasetConfig

DatasetConfig

type

object

properties

  • name

Name

Huggingface dataset name or local path to parquet file

type

string

default

  • subset

Subset

Dataset subset if exists

default

anyOf

type

string

type

null

  • revision

Revision

Dataset git revision, if one exists. Can be a branch name, a tag, or a commit hash.

default

anyOf

type

string

type

null

  • split

Split

A list of dataset splits to train

default

anyOf

type

string

type

array

items

type

string

  • test_size

Test Size

Size of the test set. If float, it is the ratio (between 0.0 and 1.0) of the dataset; if int, it is the absolute size of the test set.

default

None

anyOf

type

number

type

integer

type

null

  • local_dir

Local Dir

Local path to load dataset

type

string

default

  • DiffusersConfig

DiffusersConfig

type

object

properties

  • dtype

Dtype

Data type for diffusers model

type

string

default

float16

  • is_video

Is Video

True if this model is a video generation model

type

boolean

default

False

  • max_prompt_length

Max Prompt Length

Maximum sequence length to use for the prompt

type

integer

default

300

  • weighting_scheme

Weighting Scheme

Method used to sample timestep

type

string

default

logit_normal

  • train_flow_shift

Train Flow Shift

Flow shift used for training

type

number

default

3.0

  • offload

Offload

Whether to dynamically offload model parts from CUDA to CPU

type

boolean

default

True

  • logit_mean

Logit Mean

Mean of the logits used to randomly sample timesteps for noise addition

type

number

default

0.0

  • logit_std

Logit Std

Standard deviation of the logits used to randomly sample timesteps for noise addition

type

number

default

1.0

  • inference_size

Inference Size

Image/video size for generation, [height, width]

type

array

default

1024

1024

items

type

integer

  • inference_frames

Inference Frames

Total number of video frames for generation

type

integer

default

41

  • train_frames

Train Frames

Total number of video frames for training

type

integer

default

41

  • timesteps_fraction

Timesteps Fraction

Fraction of timesteps to use during training. If set to less than 1.0, the model will be trained on a subset of the timesteps for each sample. This speeds up training but reduces the accuracy of policy gradient estimates.

type

number

default

1.0

  • weight_copy_decay_type

Weight Copy Decay Type

Weight copy decay type for diffusers model in rl training

type

integer

default

0

  • lora

LoRA configuration for diffusers model

default

None

anyOf

#/$defs/LoraConfig

type

null

  • sample

Sampling configuration

#/$defs/SampleConfig

  • tokenizer

#/$defs/TokenizerConfig
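
As an illustration of how logit_mean and logit_std could drive timestep sampling when weighting_scheme is logit_normal, the sketch below draws a normal variate and squashes it through a sigmoid. It assumes the scheme is the usual logit-normal distribution over (0, 1); the function name sample_timestep is hypothetical, and train_flow_shift is deliberately not modelled.

```python
import math
import random


def sample_timestep(rng: random.Random,
                    logit_mean: float = 0.0,
                    logit_std: float = 1.0) -> float:
    """Draw u ~ Normal(logit_mean, logit_std) and map it into (0, 1)."""
    u = rng.gauss(logit_mean, logit_std)
    return 1.0 / (1.0 + math.exp(-u))  # sigmoid


rng = random.Random(0)
timesteps = [sample_timestep(rng) for _ in range(1000)]
# With logit_mean=0.0 the samples are centred around t = 0.5;
# a larger logit_std pushes more mass toward 0 and 1.
print(min(timesteps), max(timesteps))
```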

  • DistillationConfig

DistillationConfig

type

object

properties

  • enable

Enable

Whether to enable distillation.

type

boolean

default

False

  • parallelism

#/$defs/ParallelismConfig

  • model_name_or_path

Model Name Or Path

The teacher model name or path, compatible with huggingface model name or local path

type

string

default

Qwen/Qwen2.5-VL-7B-Instruct

  • model_revision

Model Revision

The revision of the teacher model to use

default

None

anyOf

type

string

type

null

  • compile

Compile

Whether to use torch.compile for teacher model.

type

boolean

default

True

  • master_dtype

Master Dtype

The master weight data type for the teacher model; orthogonal to param_dtype. Should be high precision for the sake of convergence.

type

string

default

float32

  • param_dtype

Param Dtype

The data type for forward/backward of teacher model. Outside forward/backward, params are in master_dtype

type

string

default

bfloat16

  • logprob_dtype

Logprob Dtype

The data type for logprobs calculation of teacher model.

type

string

default

float32

  • fsdp_reduce_dtype

Fsdp Reduce Dtype

The data type for reduction in FSDP for teacher model.

type

string

default

float32

  • fsdp_offload

Fsdp Offload

Whether to offload the teacher model to CPU if using FSDP

type

boolean

default

False

  • fsdp_reshard_after_forward

Fsdp Reshard After Forward

Reshard the param after forward pass in FSDP for teacher model. Default to ‘never’ to avoid unnecessary overhead.

type

string

default

never

  • batch_size_per_replica

Batch Size Per Replica

Batch size for teacher model per replica.

type

integer

default

1

  • max_token_len_per_mini_batch

Max Token Len Per Mini Batch

Maximum token length per mini batch. If set, dynamic mini-batch sizing will be applied based on this limit for teacher model.

default

None

anyOf

type

integer

type

null

  • sequence_packing

Sequence Packing

Whether to enable sequence packing for teacher model. If set to True, the input sequences will be packed into a single tensor for training stability.

type

boolean

default

False

  • mini_batch

Mini Batch

mini batch size for teacher model in each replica.

type

integer

default

2

  • seed

Seed

Random seed for teacher model.

default

None

anyOf

type

integer

type

null

  • kl_penalty_coef

Kl Penalty Coef

The coefficient for KL penalty.

type

number

default

1.0

  • kl_discount_factor

Kl Discount Factor

The discount factor for KL penalty.

type

number

default

0.0

  • include_prompt

Include Prompt

Whether to include prompt in the teacher model KL calculation.

type

boolean

default

False

  • top_k

Top K

Top-k filtering for teacher model logits before KL calculation. If larger than 0, generalized Jensen-Shannon Divergence will be used.

type

integer

default

0

  • jsd_beta

Jsd Beta

Interpolation coefficient between 0.0 and 1.0 of the Generalized Jensen-Shannon Divergence loss. When beta is 0.0, the loss is the KL divergence. When beta is 1.0, the loss is the Inverse KL Divergence.

type

number

default

0.5

  • trainer_token_ids_from_teacher

Trainer Token Ids From Teacher

Whether the trainer gets all top_k token ids directly from the teacher model, which it interacts with through Redis, during distillation rather than from the rollout structure. This can simplify the rollout payload transferred within the framework.

type

boolean

default

True

  • rollout_top_k_recompute

Rollout Top K Recompute

Whether to recompute all top-k logprobs from the top-k token ids after the full sequence has been generated during rollout for distillation. This keeps completion generation from carrying a large top-k, so generation efficiency is not degraded.

type

boolean

default

False
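
The role of jsd_beta can be illustrated with a small pure-Python sketch of one common formulation of the generalized Jensen-Shannon divergence. The function names are hypothetical, and two points are assumptions rather than facts from this schema: that "the KL divergence" at beta 0.0 means KL(teacher || student), and that the endpoints are special-cased because the mixture form collapses to zero there.

```python
import math


def kl(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd(teacher: list[float], student: list[float],
                    beta: float = 0.5) -> float:
    """Interpolate between KL(teacher || student) and KL(student || teacher)."""
    if beta == 0.0:  # assumed endpoint: forward KL
        return kl(teacher, student)
    if beta == 1.0:  # assumed endpoint: inverse (reverse) KL
        return kl(student, teacher)
    mixture = [beta * t + (1.0 - beta) * s for t, s in zip(teacher, student)]
    return beta * kl(teacher, mixture) + (1.0 - beta) * kl(student, mixture)


teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
for beta in (0.0, 0.5, 1.0):
    print(beta, generalized_jsd(teacher, student, beta))
```

At beta = 0.5 the value is symmetric in its two arguments, which is the classic Jensen-Shannon divergence.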

  • FP4Config

FP4Config

type

object

properties

  • enable_fp4

Enable Fp4

Whether to enable fp4.

type

boolean

default

False

  • fp4_recipe

Fp4 Recipe

Recipe for weight scale calculation.

type

string

default

dynamic_scaling

  • quant_recipe

Quant Recipe

Quantization strategy for weight.

type

string

default

rowwise

  • FP8Config

FP8Config

type

object

properties

  • enable_fp8

Enable Fp8

Whether to enable fp8.

type

boolean

default

False

  • fp8_recipe

Fp8 Recipe

Recipe for weight scale calculation.

type

string

default

dynamic_scaling

  • quant_recipe

Quant Recipe

Quantization strategy for weight.

type

string

default

rowwise

  • GrpoConfig

GrpoConfig

type

object

properties

  • type

Type

type

string

const

grpo

  • trainer_type

Trainer Type

Type of the trainer for GRPO.

type

string

default

grpo

  • variant

Variant

Variant of GRPO. Currently supports grpo, gspo, and dapo.

type

string

default

grpo

  • dataset

Dataset configuration for GRPO training. It includes dataset name, subset, revision, train split, test split and test size.

#/$defs/DatasetConfig

  • dataloader_shuffle

Dataloader Shuffle

Shuffle the dataloader. If False, the dataloader will be used in the order it is loaded.

type

boolean

default

True

  • dataloader_seed

Dataloader Seed

random seed for dataloader shuffling

type

integer

default

0

  • data_dispatch_as_rank_in_mesh

Data Dispatch As Rank In Mesh

Whether to dispatch data according to rank in global mesh. If True, each rank will get its specific data shard based on its rank in the global mesh.

type

boolean

default

False

  • enable_dataset_cache

Enable Dataset Cache

Enable caching of dataset processing results, which may accelerate dataset loading

type

boolean

default

False

  • dataloader_num_workers

Dataloader Num Workers

Number of subprocesses to use for data loading

type

integer

default

0

  • dataloader_prefetch_factor

Dataloader Prefetch Factor

Number of batches loaded in advance by each worker.

default

None

anyOf

type

integer

type

null

  • dataloader_batch_size

Dataloader Batch Size

Batch size for each dataloader iteration when fetching prompts from the controller. This only configures the dataloader iterator on the controller side.

default

1

anyOf

type

integer

type

null

  • prompt_column_name

Prompt Column Name

Column name for prompt

type

string

default

  • response_column_name

Response Column Name

Column name for response/reference answer

type

string

default

  • reward_function

Reward Function

Reward functions for the model. Currently supports single_choice, boxed_math, and format. You can add a weight to each reward function by passing a dict, e.g., {‘single_choice’: 0.9, ‘format’: 0.1}

anyOf

type

string

type

array

items

type

string

type

object

additionalProperties

type

number
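
The three accepted shapes of reward_function (a single name, a list of names, or a name-to-weight dict) can be normalized into one mapping before use. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that unweighted names receive a weight of 1.0 and that the final reward is a weighted sum.

```python
from typing import Union

RewardSpec = Union[str, list[str], dict[str, float]]


def normalize_reward_spec(spec: RewardSpec) -> dict[str, float]:
    """Turn any accepted reward_function form into a name -> weight mapping."""
    if isinstance(spec, str):
        return {spec: 1.0}
    if isinstance(spec, dict):
        return {name: float(weight) for name, weight in spec.items()}
    return {name: 1.0 for name in spec}  # assumption: default weight is 1.0


def combine_rewards(scores: dict[str, float], spec: RewardSpec) -> float:
    """Weighted sum of per-function scores (assumed combination rule)."""
    weights = normalize_reward_spec(spec)
    return sum(weights[name] * scores[name] for name in weights)


# A correct answer with a wrong format, under the weights from the description.
total = combine_rewards({"single_choice": 1.0, "format": 0.0},
                        {"single_choice": 0.9, "format": 0.1})
print(total)
```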

  • use_remote_reward

Use Remote Reward

Whether to use remote reward calculation. If set to True, the reward calculation will be done in a remote worker. If False, the reward calculation will be done in the local process.

type

boolean

default

False

  • remote_reward

Configuration for remote reward calculation.

#/$defs/RemoteRewardConfig

  • filter_reward_metric

Filter Reward Metric

Reward function used for filtering in DAPO dynamic sampling. If specified, only samples that differ in this reward will be used for training. If None, no filtering is applied.

anyOf

type

string

type

array

items

type

string

  • bypass_reward

Bypass Reward

Bypass reward computation and use fixed reward of 0.0 for all samples. Useful for distillation or debugging.

type

boolean

default

False

  • temperature

Temperature

Temperature for sampling. The higher the temperature, the more random the completions.

type

number

default

1.0

  • epsilon_low

Epsilon Low

Epsilon value for clipping.

type

number

default

0.2

  • epsilon_high

Epsilon High

Upper-bound epsilon value for clipping. If not specified, it defaults to the same value as the lower bound, epsilon_low. The DAPO paper recommends 0.28.

type

number

default

0.2
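
How epsilon_low and epsilon_high bound the policy ratio can be shown with the standard PPO-style clipped surrogate for a single token. This is a generic sketch rather than the trainer's actual loss: the function name is hypothetical, and the dual-clip term controlled by lower_bound_ratio is not included.

```python
def clipped_objective(ratio: float, advantage: float,
                      epsilon_low: float = 0.2,
                      epsilon_high: float = 0.2) -> float:
    """Clipped surrogate for one token (a quantity to be maximized).

    `ratio` is pi_theta / pi_old for the sampled token.
    """
    clipped_ratio = min(max(ratio, 1.0 - epsilon_low), 1.0 + epsilon_high)
    # Taking the minimum removes the incentive to move the ratio
    # outside the interval [1 - epsilon_low, 1 + epsilon_high].
    return min(ratio * advantage, clipped_ratio * advantage)


# Raising only the upper bound (as DAPO suggests) lets positive-advantage
# tokens grow further before the clip takes effect.
print(clipped_objective(1.5, 1.0))
print(clipped_objective(1.5, 1.0, epsilon_high=0.28))
```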

  • advantage_low

Advantage Low

Lower-bound advantage value for clipping.

type

number

default

-5.0

  • advantage_high

Advantage High

Upper-bound advantage value for clipping.

type

number

default

5.0

  • positive_nll_coef

Positive Nll Coef

[Optional] Coefficient for Positive Example LM Loss. Set a positive value to enable; None disables. Ref: VAPO Sec. 4.3 (Positive Example LM Loss): https://arxiv.org/pdf/2504.05118

default

None

anyOf

type

number

type

null

  • lower_bound_ratio

Lower Bound Ratio

Lower-bound ratio for dual-clip.

type

number

default

3.0

  • loss_type

Loss Type

The type of loss to use for GRPO training.

type

string

default

token-mean

  • unbiased_loss_max_tokens

Unbiased Loss Max Tokens

Maximum number of tokens to use for the unbiased loss introduced in Dr.GRPO. If set to None, the unbiased loss is not used. Only available when loss_type is seq-mean-token-mean.

default

None

anyOf

type

integer

type

null

  • unbiased_advantage

Unbiased Advantage

Whether to divide the advantage by the standard deviation of rewards.

type

boolean

default

False

  • overlong_reward

Configuration for overlong reward penalty. If enabled, the output will be penalized for responses that are too long.

#/$defs/OverlongRewardConfig

  • kl_beta

Kl Beta

KL coefficient. If 0.0, the reference model is not loaded, reducing memory usage and improving training speed, but may be numerically unstable for long training runs.

type

number

default

0.0

  • unbiased_kl_estimate

Unbiased Kl Estimate

[Optional] Unbiased K3 with IS: D_KL ≈ E_{π_old}[ w · ( r − log r − 1 ) ], w=π_θ/π_old, r=π_ref/π_θ. Note: This option is ignored when kl_beta is 0.0. Ref: DeepSeek-V3.2 Sec. 3.1 (Unbiased KL Estimate): https://huggingface.co/deepseek-ai/DeepSeek-V3.2/resolve/main/assets/paper.pdf

type

boolean

default

False

  • off_policy_masking_delta

Off Policy Masking Delta

Off-Policy Sequence Masking threshold δ (None disables). Per-sequence mask: M_i = 0 if Â_i < 0 and (1/|o_i|)∑_t log[π_old(o_{i,t}|·)/π_θ(o_{i,t}|·)] > δ; else M_i = 1. Ref: DeepSeek-V3.2 Sec. 3.1 (Off-Policy Sequence Masking): https://huggingface.co/deepseek-ai/DeepSeek-V3.2/resolve/main/assets/paper.pdf

default

None

anyOf

type

number

type

null
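
The per-sequence mask defined for off_policy_masking_delta translates directly into code. The sketch below follows the formula in the description; the function name and the plain-list representation of per-token log-probabilities are assumptions made for illustration.

```python
from typing import Optional


def off_policy_mask(advantage: float,
                    old_logprobs: list[float],
                    new_logprobs: list[float],
                    delta: Optional[float]) -> int:
    """Return M_i: 0 drops the sequence from the loss, 1 keeps it."""
    if delta is None:  # None disables masking
        return 1
    # (1/|o_i|) * sum_t log[pi_old(o_t) / pi_theta(o_t)]
    mean_log_ratio = sum(
        old - new for old, new in zip(old_logprobs, new_logprobs)
    ) / len(old_logprobs)
    return 0 if (advantage < 0.0 and mean_log_ratio > delta) else 1


# A negative-advantage sequence that the current policy has moved far
# away from (mean log-ratio 1.0 > delta 0.5) is masked out.
print(off_policy_mask(-1.0, [-1.0, -1.0], [-2.0, -2.0], delta=0.5))
```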

  • aipo_rho

Aipo Rho

Rho value for AIPO (Asynchronous Importance weighted Policy Optimization): the clipping constant of the importance sampling ratio. Suggested range: [2, 10]. Ref: https://arxiv.org/pdf/2505.24034

default

None

anyOf

type

number

type

null

  • mu_iterations

Mu Iterations

Number of iterations per batch (denoted as μ in the algorithm).

type

integer

default

1

  • mini_batch

Mini Batch

Mini-batch size for GRPO training. Mini-batches split the per-optimization batch into smaller batches that fit into GPU memory.

type

integer

default

2

  • batch_size_per_optimize

Batch Size Per Optimize

Batch size for each optimization in GRPO training. The batch in each training step is split into smaller batches, each of which performs one optimization step. If not set, it equals the whole per-GPU batch size for each training step.

default

None

anyOf

type

integer

type

null

  • max_token_len_per_mini_batch

Max Token Len Per Mini Batch

Maximum token length per mini batch. If set, dynamic mini-batch sizing will be applied based on this limit.

default

None

anyOf

type

integer

type

null

  • entropy_coeff

Entropy Coeff

Coefficient for entropy regularization.

type

number

default

0.0

  • allowed_outdated_steps

Allowed Outdated Steps

Number of outdated (asynchronous) steps allowed for the rollout engine. If the number of remaining uncompleted rollout samples exceeds (allowed_outdated_steps + 1) * n_policy_replicas * train_batch_per_replica, rollout engine traffic is throttled.

type

integer

default

4
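
The throttling condition for allowed_outdated_steps is simple arithmetic, worked through below. The function names are hypothetical; the formula itself is the one given in the description.

```python
def throttle_threshold(allowed_outdated_steps: int,
                       n_policy_replicas: int,
                       train_batch_per_replica: int) -> int:
    """Largest number of uncompleted rollout samples tolerated before throttling."""
    return (allowed_outdated_steps + 1) * n_policy_replicas * train_batch_per_replica


def should_throttle(pending_samples: int, threshold: int) -> bool:
    """Throttle only when the pending count is strictly larger than the threshold."""
    return pending_samples > threshold


# Default allowed_outdated_steps=4 with 2 policy replicas and 32 samples
# per replica: (4 + 1) * 2 * 32 = 320 samples may be outstanding.
limit = throttle_threshold(4, n_policy_replicas=2, train_batch_per_replica=32)
print(limit, should_throttle(321, limit))
```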

  • on_policy

On Policy

Enable fully synchronized (on-policy) rollout. If set to True, the rollout engine will wait until the expected weight version is updated before next generation starts.

type

boolean

default

False

  • outdated_rollout_fetch_batch_size

Outdated Rollout Fetch Batch Size

Number of outdated rollouts to fetch. If set to 0, the rollout engine will stop generating rollouts if the weight is outdated.

type

integer

default

0

  • min_filter_prefix_tokens

Min Filter Prefix Tokens

Minimum number of tokens to filter the prefix tokens for the rollouts inside the same group. If the number of tokens is larger than the min_filter_prefix_tokens, the rollouts with the same prefix but different rewards will be filtered out in loss calculation.

default

None

anyOf

type

integer

type

null

  • max_retry_for_on_policy

Max Retry For On Policy

Maximum number of retries for on-policy rollout to have enough samples. If non-positive, will retry with no upper limit until enough samples are generated.

type

integer

default

-1

  • reference_reset_interval

Reference Reset Interval

Interval to reset the reference model to the current model. If set to None or 0, the reference model will not be reset during training.

default

None

anyOf

type

integer

type

null

  • reset_optimizer_with_reference

Reset Optimizer With Reference

Whether to reset the optimizer state when the reference model is reset.

type

boolean

default

True

  • balance_dp_token

Balance Dp Token

Whether to balance the number of tokens in each data parallel replica when calculating the loss.

type

boolean

default

False

  • use_decoupled_loss

Use Decoupled Loss

Whether to use decoupled loss. A decoupled loss separates the optimization of the behavior policy and the target policy, which can help to reduce the variance of the gradient estimate.

type

boolean

default

False

  • behav_imp_weight_cap

Behav Imp Weight Cap

Clipping cap for behavior importance weights. Useful when decoupled loss is used to avoid large variance.

default

None

anyOf

type

number

type

null

  • rollout_as_token_ids

Rollout As Token Ids

Whether to use token ids for rollouts instead of text. This can save tokenization time during rollout generation.

type

boolean

default

False

  • collect_rollout_logprobs

Collect Rollout Logprobs

Whether to collect logprobs for rollouts instead of text. This can save logprob calculation time during rollout generation.

type

boolean

default

False

  • use_rollout_logprobs_for_loss

Use Rollout Logprobs For Loss

Whether to use collected logprobs from rollouts for loss calculation. This is an alternative to calculating logprobs during training as old logprobs for importance sampling.

type

boolean

default

False

  • LoggingConfig

LoggingConfig

type

object

properties

  • logger

Logger

List of loggers to use, e.g., [‘console’, ‘wandb’]

type

array

items

type

string

  • project_name

Project Name

Wandb project name for logging. If set, the training will be logged to this project.

type

string

default

cosmos_rl

  • group_name

Group Name

Wandb group name for logging. If set, the training will be logged to this group.

default

None

anyOf

type

string

type

null

  • experiment_name

Experiment Name

A short display name for this run. If not set, will use the output_dir as the experiment name.

default

None

anyOf

type

string

type

null

  • LoraConfig

LoraConfig

type

object

properties

  • r

R

LoRA rank

type

integer

default

8

  • lora_names

Lora Names

A list of names for the LoRA adapters. If multiple names are provided, multiple LoRA adapters will be created and trained simultaneously.

type

array

default

default

items

type

string

  • lora_path

Lora Path

Path to pre-trained LoRA weights

default

None

anyOf

type

string

type

null

  • lora_alpha

Lora Alpha

LoRA alpha

type

number

default

8.0

  • lora_dropout

Lora Dropout

LoRA dropout

type

number

default

0.0

  • target_modules

Target Modules

LoRA target modules, can be a list of strings or ‘all-linear’

default

None

anyOf

type

array

items

type

string

type

string

  • use_rslora

Use Rslora

When set to True, uses [Rank-Stabilized LoRA](https://huggingface.co/papers/2312.03732) which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it was proven to work better. Otherwise, it will use the original default value of lora_alpha/r.

type

boolean

default

False

  • modules_to_save

Modules To Save

List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint.

default

None

anyOf

type

array

items

type

string

type

null

  • alpha_pattern

Alpha Pattern

Per-module overrides for lora_alpha. Keys are regex patterns; evaluated in insertion order, first match wins.

default

None

anyOf

type

object

additionalProperties

type

number

type

null

  • r_pattern

R Pattern

Per-module overrides for LoRA rank r. Keys are regex patterns; evaluated in insertion order, first match wins.

default

None

anyOf

type

object

additionalProperties

type

integer

type

null

  • init_lora_weights

Init Lora Weights

How to initialize the weights of the adapter layers. Passing True (default) results in the default initialization from the reference implementation from Microsoft, with the LoRA B weight being set to 0. This means that without further training, the LoRA adapter will be a no-op. Setting the initialization to False leads to random initialization of LoRA A and B, meaning that LoRA is not a no-op before training; this setting is intended for debugging purposes. Passing ‘gaussian’ results in Gaussian initialization scaled by the LoRA rank for linear and layers. Pass ‘loftq’ to use LoftQ initialization. Passing ‘eva’ results in a data-driven initialization of Explained Variance Adaptation. EVA initializes LoRA based on the SVD of layer input activations and achieves SOTA performance due to its ability to adapt to the finetuning data. Pass ‘olora’ to use OLoRA initialization. Passing ‘pissa’ results in the initialization of https://huggingface.co/papers/2404.02948

default

True

anyOf

type

boolean

type

string

enum

gaussian, eva, olora, pissa, pissa_niter_[number of iters]
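
The "first match wins" rule for r_pattern and alpha_pattern, together with the two scaling formulas named under use_rslora, can be sketched as below. The helper names are hypothetical, and the use of re.search (rather than a full match) is an assumption about how patterns are applied to module names.

```python
import math
import re
from typing import Optional


def resolve_override(module_name: str, pattern_map: Optional[dict], default):
    """Return the value of the first pattern (in insertion order) that matches."""
    for pattern, value in (pattern_map or {}).items():
        if re.search(pattern, module_name):  # assumption: search semantics
            return value
    return default


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """lora_alpha / sqrt(r) for rank-stabilized LoRA, else lora_alpha / r."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


r_pattern = {r"q_proj$": 16, r"proj$": 4}  # the more specific pattern comes first
for name in ("layers.0.q_proj", "layers.0.v_proj", "layers.0.mlp.gate"):
    rank = resolve_override(name, r_pattern, default=8)
    print(name, rank, lora_scaling(8.0, rank), lora_scaling(8.0, rank, use_rslora=True))
```

Because dicts preserve insertion order, placing the broader pattern first would shadow the more specific one.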

  • MultiTurnRolloutConfig

MultiTurnRolloutConfig

type

object

properties

  • enable

Enable

Whether to enable multi-turn rollout.

type

boolean

default

False

  • enable_tools

Enable Tools

Whether to enable tools in multi-turn rollout.

type

boolean

default

False

  • enable_thinking

Enable Thinking

Whether to enable thinking in multi-turn rollout.

type

boolean

default

False

  • custom_chat_template_path

Custom Chat Template Path

The path to the custom chat template.

default

None

anyOf

type

string

type

null

  • max_assistant_turns

Max Assistant Turns

Max assistant turn count for multi-turn rollout.

type

integer

default

5

  • add_generation_prompt

Add Generation Prompt

Whether to add generation prompt in multi-turn rollout.

type

boolean

default

True

  • continue_final_message

Continue Final Message

Whether to continue the final message in multi-turn rollout.

type

boolean

default

False

  • OverlongRewardConfig

OverlongRewardConfig

type

object

properties

  • enable_overlong_penalty

Enable Overlong Penalty

Enable overlong penalty for the model. If set to True, the output will be penalized for responses that are too long.

type

boolean

default

False

  • buffer_length

Buffer Length

Length of the buffer for overlong penalty. If the response length exceeds this value, the output will be penalized.

type

integer

default

4096

  • penalty_factor

Penalty Factor

Penalty factor for overlong penalty. The penalty increases linearly with the length of the response exceeding the buffer length from 0 to the penalty_factor.

type

number

default

1.0
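
One way to read buffer_length and penalty_factor is as a linear penalty ramp in the style of DAPO's soft overlong punishment. The schema does not state where the ramp starts, so the sketch below rests on a loud assumption: the penalty window is the last buffer_length tokens before a maximum response length, passed here as the hypothetical argument max_response_length. The function name is likewise hypothetical.

```python
def overlong_penalty(response_len: int,
                     max_response_length: int,
                     buffer_length: int = 4096,
                     penalty_factor: float = 1.0) -> float:
    """Return a non-positive reward adjustment for a response of `response_len` tokens."""
    # Assumption: penalties begin buffer_length tokens before the hard limit.
    soft_limit = max_response_length - buffer_length
    overflow = response_len - soft_limit
    if overflow <= 0:
        return 0.0
    # Grows linearly from 0 to penalty_factor across the buffer, then saturates.
    return -min(overflow / buffer_length, 1.0) * penalty_factor


for length in (1000, 6144, 8192):
    print(length, overlong_penalty(length, max_response_length=8192))
```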

  • ParallelismConfig

ParallelismConfig

type

object

properties

  • n_init_replicas

N Init Replicas

Number of initial replicas to be created

type

integer

default

1

  • tp_size

Tp Size

Tensor parallelism size

type

integer

default

2

  • cp_size

Cp Size

Context parallelism size

type

integer

default

1

  • ep_size

Ep Size

Expert parallelism size

type

integer

default

1

  • dp_shard_size

Dp Shard Size

Data Parallelism size in sharded mode

type

integer

default

1

  • pp_size

Pp Size

Pipeline parallelism size

type

integer

default

1

  • pp_dynamic_shape

Pp Dynamic Shape

Pipeline parallelism dynamic shape

type

boolean

default

False

  • pp_micro_batch_size

Pp Micro Batch Size

Pipeline parallelism micro batch size, n_micro_batch = batch_size / pp_micro_batch_size, which must be divisible by pp stages

type

integer

default

1

  • dp_replicate_size

Dp Replicate Size

Data parallelism size in replica mode. Only configurable for SFT jobs; must be 1 for GRPO jobs to support dynamic scaling.

type

integer

default

1
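
The constraint stated for pp_micro_batch_size can be checked up front. This is a small validation sketch with a hypothetical function name; it assumes that the number of pipeline stages equals pp_size.

```python
def n_micro_batches(batch_size: int, pp_micro_batch_size: int, pp_size: int) -> int:
    """Compute n_micro_batch and check it against the pipeline stage count."""
    if batch_size % pp_micro_batch_size != 0:
        raise ValueError("batch_size must be a multiple of pp_micro_batch_size")
    n_micro_batch = batch_size // pp_micro_batch_size
    # Assumption: one pipeline stage per pipeline-parallel rank.
    if n_micro_batch % pp_size != 0:
        raise ValueError("n_micro_batch must be divisible by the number of pp stages")
    return n_micro_batch


print(n_micro_batches(batch_size=8, pp_micro_batch_size=2, pp_size=2))
```

A batch of 6 with the same settings would yield 3 micro-batches, which two stages cannot divide evenly, so the check raises ValueError.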

  • PolicyConfig

PolicyConfig

type

object

properties

  • parallelism

#/$defs/ParallelismConfig

  • diffusers

anyOf

#/$defs/DiffusersConfig

type

null

  • is_diffusers

Is Diffusers

Whether this model is diffusers or not

type

boolean

default

False

  • model_name_or_path

Model Name Or Path

The model name or path, compatible with huggingface model name or local path

type

string

default

Qwen/Qwen2.5-VL-7B-Instruct

  • model_revision

Model Revision

The revision of the model to use

default

None

anyOf

type

string

type

null

  • model_max_length

Model Max Length

The maximum length for training, longer than this will be ignored for training stability

type

integer

default

4096

  • model_gradient_checkpointing

Model Gradient Checkpointing

Whether to use gradient checkpointing

type

boolean

default

True

  • lora

LoRA configuration

default

None

anyOf

#/$defs/LoraConfig

type

null

  • trainable_map

Trainable Map

Mapping of name -> bool. Keys can either be: - exact parameter names (from model.named_parameters()) - exact module paths (from model.named_modules())

default

None

anyOf

type

object

additionalProperties

type

boolean

type

null

  • freeze_pattern

Freeze Pattern

Pattern-based configuration to freeze parts of the model. A list of regex patterns that are matched against parameter names; matched parameters will be frozen (requires_grad=False). Example: freeze_pattern = [‘^visual\..*’] freezes all visual components; freeze_pattern = [‘^model\.layers\.[0-9]+\.’] freezes every numbered layer (drop the + to match only layers 0-9).

default

None

anyOf

type

array

items

type

string

type

null

  • enable_liger_kernel

Enable Liger Kernel

Whether to use liger kernel.

type

boolean

default

False
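
To see which parameters a freeze_pattern list would select, the patterns can be tried against parameter names with Python's re module. The sketch below works on plain strings so it runs without a model; the function name and the sample parameter names are hypothetical, and re.search is an assumption about the matching semantics. In a real run the selected parameters would have requires_grad set to False.

```python
import re


def frozen_parameters(param_names: list[str], freeze_pattern: list[str]) -> set[str]:
    """Return the parameter names matched by at least one pattern."""
    compiled = [re.compile(p) for p in freeze_pattern]
    return {name for name in param_names if any(c.search(name) for c in compiled)}


names = [
    "visual.patch_embed.weight",
    "model.layers.3.mlp.weight",
    "model.layers.12.mlp.weight",
    "lm_head.weight",
]
# The second pattern has no "+", so it matches single-digit layers only.
frozen = frozen_parameters(names, [r"^visual\..*", r"^model\.layers\.[0-9]\."])
print(sorted(frozen))
```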

  • ProfilerConfig

ProfilerConfig

type

object

properties

  • enable_profiler

Enable Profiler

Enable profiler for training

type

boolean

default

False

  • enable_nsys

Enable Nsys

Enable nsys for training

type

boolean

default

False

  • sub_profiler_config

Sub profiler config

#/$defs/SubProfilerConfig

  • RemoteRewardConfig

RemoteRewardConfig

type

object

properties

  • enabled

Enabled

type

boolean

default

True

  • score_key

Score Key

type

string

default

overall_reward

  • scale

Scale

type

number

default

1.0

  • reward_fn

Reward Fn

Dictionary of reward functions and their weights for remote reward calculation.

type

object

additionalProperties

type

number

  • reward_clip_min

Reward Clip Min

type

number

default

-5.0

  • reward_clip_max

Reward Clip Max

type

number

default

5.0
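
Several RemoteRewardConfig fields carry no description, so the following is only a guess at how they might combine, based on their names: read the score under score_key, multiply by scale, then clip to [reward_clip_min, reward_clip_max]. The function name, the shape of the response dict, and the scale-then-clip order are all assumptions.

```python
def postprocess_remote_reward(response: dict,
                              score_key: str = "overall_reward",
                              scale: float = 1.0,
                              reward_clip_min: float = -5.0,
                              reward_clip_max: float = 5.0) -> float:
    """Scale and clip a score returned by a remote reward service (assumed flow)."""
    raw = float(response[score_key])
    scaled = raw * scale  # assumption: scaling happens before clipping
    return min(max(scaled, reward_clip_min), reward_clip_max)


print(postprocess_remote_reward({"overall_reward": 12.0}))
print(postprocess_remote_reward({"overall_reward": 0.5}, scale=2.0))
```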

  • RolloutAsyncConfig

RolloutAsyncConfig

type

object

properties

  • max_concurrent_requests

Max Concurrent Requests

Maximum number of concurrent requests for rollout engine.

type

integer

default

10

  • RolloutConfig

RolloutConfig

type

object

properties

  • parallelism

#/$defs/RolloutParallelismConfig

  • enforce_eager

Enforce Eager

Whether to enable eager execution for vLLM.

type

boolean

default

True

  • include_stop_str_in_output

Include Stop Str In Output

Whether to include stop string in output.

type

boolean

default

False

  • gpu_memory_utilization

Gpu Memory Utilization

GPU memory utilization factor for rollout backend.

type

number

default

0.8

  • enable_chunked_prefill

Enable Chunked Prefill

Whether to enable chunked prefill for vLLM.

type

boolean

default

False

  • max_response_length

Max Response Length

Max output length of rollout generation.

type

integer

default

2048

  • n_generation

N Generation

The n parameter, with the same meaning as in the OpenAI chat API.

type

integer

default

16

  • n_generation_to_batch

N Generation To Batch

Whether to treat n_generation as batch dimension in rollout generation.

type

boolean

default

False

  • batch_size

Batch Size

Batch size for rollout.

type

integer

default

1

  • quantization

Quantization

Quantization in vllm rollout generation.

type

string

default

none

  • seed

Seed

random seed for rollout.

default

None

anyOf

type

integer

type

null

  • sampling_config

#/$defs/SamplingConfig

  • vllm_use_flashinfer

Vllm Use Flashinfer

Use flashinfer for vllm rollout.

type

boolean

default

False

  • backend

Backend

Backend for rollout. Currently supports vllm, vllm_async, trtllm, and other custom backends.

type

string

default

vllm

  • multi_turn_config

Configuration for multi-turn rollout.

#/$defs/MultiTurnRolloutConfig

  • mode

Mode

Rollout mode, could be ‘sync’ or ‘async’.

type

string

default

sync

  • async_config

Configuration for async rollout.

#/$defs/RolloutAsyncConfig

  • RolloutParallelismConfig

RolloutParallelismConfig

type

object

properties

  • n_init_replicas

N Init Replicas

Number of initial replicas to be created

type

integer

default

1

  • tp_size

Tp Size

Tensor parallelism size

type

integer

default

2

  • cp_size

Cp Size

Context parallelism size

type

integer

default

1

  • ep_size

Ep Size

Expert parallelism size

type

integer

default

1

  • dp_shard_size

Dp Shard Size

Data Parallelism size in sharded mode

type

integer

default

1

  • pp_size

Pp Size

Pipeline parallelism size

type

integer

default

1

  • pp_dynamic_shape

Pp Dynamic Shape

Pipeline parallelism dynamic shape

type

boolean

default

False

  • pp_micro_batch_size

Pp Micro Batch Size

Pipeline parallelism micro batch size, n_micro_batch = batch_size / pp_micro_batch_size, which must be divisible by pp stages

type

integer

default

1

  • dp_replicate_size

Dp Replicate Size

Data parallelism size in replica mode. Only configurable for SFT jobs; must be 1 for GRPO jobs to support dynamic scaling.

type

integer

default

1

  • SFTDataConfig

SFTDataConfig

type

object

properties

  • type

Type

type

string

const

sft

  • trainer_type

Trainer Type

Type of the trainer for SFT.

type

string

default

sft

  • dataset

Dataset configuration for SFT training. It includes dataset name, subset, revision, train split, and test split.

#/$defs/DatasetConfig

  • mini_batch

Mini Batch

mini-batch size for training.

type

integer

default

2

  • dataloader_shuffle

Dataloader Shuffle

Shuffle the dataloader. If False, the dataloader will be used in the order it is loaded.

type

boolean

default

True

  • dataloader_seed

Dataloader Seed

Random seed for dataloader shuffling.

type

integer

default

0

  • enable_dataset_cache

Enable Dataset Cache

Cache the results of dataset processing, which may speed up dataset loading.

type

boolean

default

False

  • dataloader_num_workers

Dataloader Num Workers

Number of subprocesses to use for data loading.

type

integer

default

0

  • dataloader_prefetch_factor

Dataloader Prefetch Factor

Number of batches loaded in advance by each worker.

default

None

anyOf

type

integer

type

null

  • dataloader_drop_last

Dataloader Drop Last

Whether to drop the last batch of the dataloader if it is not complete.

type

boolean

default

True

  • data_dispatch_as_rank_in_mesh

Data Dispatch As Rank In Mesh

Whether to dispatch data according to rank in global mesh. If True, each rank will get its specific data shard based on its rank in the global mesh.

type

boolean

default

False

  • conversation_column_name

Conversation Column Name

Column name for the formatted conversation JSON.

type

string

default

conversations

  • system_prompt

System Prompt

System prompt for the model, which will be prepended to the prompt

type

string

default

  • balance_dp_token

Balance Dp Token

Whether to balance the number of tokens in each data parallel replica when calculating the loss.

type

boolean

default

True
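The effect of dataloader_drop_last on the number of batches per epoch can be worked out directly. The sketch below is only an illustration: the function name and the batch_size argument are hypothetical, and it assumes the usual dataloader convention that an incomplete final batch is either dropped or kept whole.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool = True) -> int:
    """Batches per epoch under the assumed drop-last convention (hypothetical helper)."""
    if drop_last:
        # The incomplete trailing batch is discarded.
        return num_samples // batch_size
    # The incomplete trailing batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


# 10 samples with batches of 4: the last 2 samples are dropped or kept.
print(num_batches(10, 4, drop_last=True), num_batches(10, 4, drop_last=False))
```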

  • SampleConfig

SampleConfig

type

object

properties

  • num_steps

Num Steps

Number of sampler inference steps for training

type

integer

default

40

  • eval_num_steps

Eval Num Steps

Number of sampler inference steps for evaluation

type

integer

default

40

  • guidance_scale

Guidance Scale

Classifier-free guidance weight

type

number

default

4.5

  • global_std

Global Std

Whether to use all samples in a batch to compute std

type

boolean

default

True

  • noise_level

Noise Level

Noise level for sampling

type

number

default

1.0

  • deterministic_sampling

Deterministic Sampling

Whether to use deterministic sampling

type

boolean

default

False

  • solver

Solver

Sampler solver to be used

type

string

default

dpm2
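For orientation, guidance_scale is described as a classifier-free guidance weight. The sketch below shows the commonly used form of that combination on plain Python lists. It is an assumption that the sampler here follows this exact formula; the function and argument names are hypothetical.

```python
def apply_guidance(uncond, cond, guidance_scale=4.5):
    """Combine unconditional and conditional predictions element-wise.

    Assumed standard form: uncond + guidance_scale * (cond - uncond).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# A scale of 1.0 returns the conditional prediction unchanged;
# larger values push the result further away from the unconditional one.
print(apply_guidance([0.5], [2.0], guidance_scale=1.0))
print(apply_guidance([0.0], [1.0], guidance_scale=4.5))
```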

  • SamplingConfig

SamplingConfig

type

object

properties

  • temperature

Temperature

Temperature for sampling.

type

number

default

1.0

  • top_p

Top P

Top-p for sampling.

type

number

default

1.0

  • top_k

Top K

Top-k for sampling.

type

integer

default

-1

  • repetition_penalty

Repetition Penalty

Repetition penalty for sampling.

type

number

default

1.0

  • use_flashinfer

Use Flashinfer

Use flashinfer for sampling.

type

boolean

default

False
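How temperature, top_k, and top_p interact can be sketched in pure Python. This is an illustration of the usual meaning of these parameters, not the rollout backend's implementation: the function name is hypothetical, top_k = -1 is assumed to mean "no top-k limit", top-p is assumed to be applied after top-k, and repetition_penalty is left out.

```python
import math


def filtered_probs(logits, temperature=1.0, top_k=-1, top_p=1.0):
    """Return token probabilities after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("this sketch requires temperature > 0")
    # Rank token indices from most to least likely.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # keep only the k highest-scoring tokens
    # Softmax over the surviving tokens, shifted by the peak for numerical stability.
    peak = logits[order[0]] / temperature
    weights = {i: math.exp(logits[i] / temperature - peak) for i in order}
    total = sum(weights.values())
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break
    mass = sum(weights[i] for i in kept)
    return [weights[i] / mass if i in kept else 0.0 for i in range(len(logits))]


print(filtered_probs([2.0, 1.0, 0.0], top_k=2))
```

With the defaults above (temperature 1.0, top_p 1.0, top_k -1) nothing is filtered and the result is a plain softmax.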

  • SubProfilerConfig

SubProfilerConfig

type

object

properties

  • do_profile

Do Profile

Whether to profile; only used at runtime.

type

boolean

default

False

  • active_steps

Active Steps

Number of active steps

type

integer

default

1

  • warmup_steps

Warmup Steps

Number of warmup steps

type

integer

default

1

  • wait_steps

Wait Steps

Number of wait steps

type

integer

default

1

  • rank_filter

Rank Filter

Rank filter

type

array

items

type

integer

  • record_shape

Record Shape

Whether to record shape

type

boolean

default

False

  • profile_memory

Profile Memory

Whether to profile memory

type

boolean

default

False

  • with_stack

With Stack

Whether to profile stack

type

boolean

default

False

  • with_modules

With Modules

Whether to profile modules

type

boolean

default

False
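The three step counts (wait_steps, warmup_steps, active_steps) suggest a phased profiling schedule. The sketch below assumes the phases run once in the order wait, warmup, active, similar to the schedule used by common profilers; the function name and the "done" label are hypothetical.

```python
def profiler_phase(step, wait_steps=1, warmup_steps=1, active_steps=1):
    """Return the assumed profiler phase for a zero-based training step."""
    if step < wait_steps:
        return "wait"      # profiler idle
    if step < wait_steps + warmup_steps:
        return "warmup"    # profiler running, results discarded
    if step < wait_steps + warmup_steps + active_steps:
        return "active"    # results recorded
    return "done"


# With the defaults, step 2 is the single recorded step.
print([profiler_phase(s) for s in range(4)])
```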

  • TokenizerConfig

TokenizerConfig

type

object

properties

  • chunk_duration

Chunk Duration

type

integer

default

81

  • load_mean_std

Load Mean Std

type

boolean

default

False

  • compile_encode

Compile Encode

type

boolean

default

False

  • temporal_window

Temporal Window

type

integer

default

16

  • TrainingConfig

TrainingConfig

type

object

properties

  • train_policy

Train Policy

default

type: grpo
trainer_type: grpo
variant: grpo
dataset:
    local_dir:
    name:
    revision:
    split:
    subset:
    test_size: None
dataloader_shuffle: True
dataloader_seed: 0
data_dispatch_as_rank_in_mesh: False
enable_dataset_cache: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
dataloader_batch_size: 1
prompt_column_name:
response_column_name:
reward_function:
    single_choice: 1.0
use_remote_reward: False
remote_reward:
    enabled: True
    reward_clip_max: 5.0
    reward_clip_min: -5.0
    reward_fn:
        dance_grpo: 1.0
    scale: 1.0
    score_key: overall_reward
filter_reward_metric:
bypass_reward: False
temperature: 1.0
epsilon_low: 0.2
epsilon_high: 0.2
advantage_low: -5.0
advantage_high: 5.0
positive_nll_coef: None
lower_bound_ratio: 3.0
loss_type: token-mean
unbiased_loss_max_tokens: None
unbiased_advantage: False
overlong_reward:
    buffer_length: 4096
    enable_overlong_penalty: False
    penalty_factor: 1.0
kl_beta: 0.0
unbiased_kl_estimate: False
off_policy_masking_delta: None
aipo_rho: None
mu_iterations: 1
mini_batch: 2
batch_size_per_optimize: None
max_token_len_per_mini_batch: None
entropy_coeff: 0.0
allowed_outdated_steps: 4
on_policy: False
outdated_rollout_fetch_batch_size: 0
min_filter_prefix_tokens: None
max_retry_for_on_policy: -1
reference_reset_interval: None
reset_optimizer_with_reference: True
balance_dp_token: False
use_decoupled_loss: False
behav_imp_weight_cap: None
rollout_as_token_ids: False
collect_rollout_logprobs: False
use_rollout_logprobs_for_loss: False

oneOf

#/$defs/SFTDataConfig

#/$defs/GrpoConfig

  • optm_name

Optm Name

Optimizer name

type

string

default

AdamW

  • optm_lr

Optm Lr

Learning rate for optimizer, can be a float or a list of floats for multiple optimizers

default

1e-06

anyOf

type

number

type

array

items

type

number

  • optm_impl

Optm Impl

Implementation type for the optimizer; can be a list of strings for multiple optimizers. More info: https://pytorch.org/docs/stable/optim.html

default

fused

anyOf

type

string

type

array

items

type

string

  • optm_weight_decay

Optm Weight Decay

Weight decay for optimizer

type

number

default

0.01

  • optm_betas

Optm Betas

Betas for optimizer

type

array

default

[0.9, 0.999]

maxItems

2

minItems

2

  • optm_warmup_steps

Optm Warmup Steps

Warmup steps for the optimizer. Can be an integer or a float; a float in the range [0.0, 1.0] is multiplied by the total number of steps.

default

20

anyOf

type

integer

type

number
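The integer-or-fraction rule for optm_warmup_steps can be sketched as follows. The function name is hypothetical, and truncating the fractional result to an integer is an assumption, since the description does not say how it is rounded.

```python
def resolve_warmup_steps(optm_warmup_steps, total_steps):
    """Turn optm_warmup_steps into an absolute step count (hypothetical helper)."""
    if isinstance(optm_warmup_steps, float):
        if not 0.0 <= optm_warmup_steps <= 1.0:
            raise ValueError("a float warmup value must lie in [0.0, 1.0]")
        # Assumption: the fraction of total steps is truncated toward zero.
        return int(optm_warmup_steps * total_steps)
    return int(optm_warmup_steps)


print(resolve_warmup_steps(20, total_steps=1000))    # integer: used as-is
print(resolve_warmup_steps(0.25, total_steps=1000))  # float: fraction of total steps
```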

  • optm_decay_ratio

Optm Decay Ratio

Ratio of total steps used for decay, in the range [0.0, 1.0]; 0 means no decay.

default

None

anyOf

type

number

type

null

  • optm_decay_type

Optm Decay Type

Type of decay for optimizer

default

None

anyOf

type

string

type

null

  • optm_min_lr_factor

Optm Min Lr Factor

Minimum learning-rate factor for the optimizer, in the range [0.0, 1.0].

type

number

default

0.0

  • optm_grad_norm_clip

Optm Grad Norm Clip

Gradient norm clip for optimizer

type

number

default

1.0

  • ema_enable

Ema Enable

Whether to enable EMA for model parameters. Only diffusers models are supported for now.

type

boolean

default

False

  • ema_decay

Ema Decay

Decay rate for EMA

type

number

default

0.9999

  • ema_update_step_interval

Ema Update Step Interval

Interval, in steps, between EMA parameter updates; 0 means update every step.

type

integer

default

0
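The EMA settings can be illustrated with the standard exponential-moving-average update on plain lists. This is a sketch under assumptions: the function names are hypothetical, and for a non-zero ema_update_step_interval it assumes updates happen on steps that are multiples of the interval.

```python
def should_update_ema(step, ema_update_step_interval=0):
    """0 means every step; otherwise assume updates on multiples of the interval."""
    if ema_update_step_interval <= 0:
        return True
    return step % ema_update_step_interval == 0


def ema_update(ema_params, params, ema_decay=0.9999):
    """Standard EMA step: ema = decay * ema + (1 - decay) * param."""
    return [ema_decay * e + (1.0 - ema_decay) * p for e, p in zip(ema_params, params)]


ema = [1.0]
for step in range(1, 5):
    if should_update_ema(step, ema_update_step_interval=2):
        ema = ema_update(ema, [0.0], ema_decay=0.5)
print(ema)  # updated on steps 2 and 4 only
```

A decay close to 1, such as the default 0.9999, makes the averaged weights follow the trained weights very slowly.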

  • master_dtype

Master Dtype

The master weight data type for optimizers; it is orthogonal to param_dtype. It should be high precision for the sake of convergence.

type

string

default

float32

  • param_dtype

Param Dtype

The data type for forward/backward. Outside forward/backward, params are in master_dtype

type

string

default

bfloat16

  • transfer_dtype

Transfer Dtype

The data type for transferring parameters between Policy and Rollout.

type

string

default

None

  • logprob_dtype

Logprob Dtype

The data type for logprobs calculation.

type

string

default

float32

  • fsdp_reduce_dtype

Fsdp Reduce Dtype

The data type for reduction in FSDP

type

string

default

float32

  • fsdp_offload

Fsdp Offload

Whether to offload the model to CPU if using FSDP

type

boolean

default

False

  • fsdp_reshard_after_forward

Fsdp Reshard After Forward

Reshard the param after forward pass in FSDP

type

string

default

default

  • train_batch_per_replica

Train Batch Per Replica

The batch size for training per iteration in one replica. This is the local batch size for each gradient accumulation step.

type

integer

default

8

  • fp8

#/$defs/FP8Config

  • fp4

#/$defs/FP4Config

  • ckpt

#/$defs/CheckpointConfig

  • resume

Resume

Resume training from a checkpoint. If True, will resume from the latest checkpoint of the output_dir. If a string, will resume from the specified checkpoint path.

default

False

anyOf

type

boolean

type

string

  • epoch

Epoch

Number of epochs for training

type

integer

default

1

  • output_dir

Output Dir

Output directory

type

string

default

./outputs

  • timestamp

Timestamp

Timestamp for the output directory and wandb ID. If not set, it is generated automatically.

type

string

default

  • epsilon

Epsilon

Epsilon for optimizer

type

number

default

1e-06

  • async_tp_enabled

Async Tp Enabled

Whether to use async tensor parallelism

type

boolean

default

False

  • compile

Compile

Whether to use torch.compile

type

boolean

default

True

  • sync_weight_interval

Sync Weight Interval

The interval of train step for synchronizing weights between replicas.

type

integer

default

1

  • deterministic

Deterministic

Whether to use deterministic training, which is expected to be slower.

type

boolean

default

False

  • activation_offload

Activation Offload

Whether to use activation offload

type

boolean

default

False

  • fa_version

Fa Version

FlashAttention version to use. If None, will use the default version.

default

None

anyOf

type

integer

type

null

  • seed

Seed

Random seed for training. If deterministic is set to True and no seed is given, the seed defaults to 42.

default

None

anyOf

type

integer

type

null
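The interaction between seed and deterministic can be summarized in a few lines. The helper below is hypothetical and only restates the documented default of 42; what happens to an unset seed in non-deterministic mode is not specified, so the sketch simply returns None in that case.

```python
def resolve_seed(seed=None, deterministic=False):
    """Return the effective seed under the documented default (hypothetical helper)."""
    if seed is None and deterministic:
        return 42  # documented fallback for deterministic training
    return seed    # an explicit seed always wins; None stays unset


print(resolve_seed(deterministic=True))
print(resolve_seed(seed=7, deterministic=True))
```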

  • local_dataset

Local Dataset

Whether to query samples from the local dataset.

default

True

anyOf

type

boolean

type

null

  • force_use_hf

Force Use Hf

Whether to force using Huggingface dataset even if local dataset is available.

default

False

anyOf

type

boolean

type

null

  • non_text

Non Text

Whether to train in non-text mode. If set to True, the inputs and outputs are not pure text and may contain other modalities such as images, videos, or tensors.

type

boolean

default

False

  • max_num_steps

Max Num Steps

Optional upper bound on total training steps. If set, training stops when either this step count or the epoch-based limit is reached (whichever comes first). Handy for quick smoke tests.

default

None

anyOf

type

integer

type

null
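The "whichever comes first" rule for max_num_steps amounts to taking a minimum. In the sketch below the function name and the steps_per_epoch argument are hypothetical; the epoch-based limit is assumed to be steps_per_epoch multiplied by epoch.

```python
def total_training_steps(steps_per_epoch, epoch=1, max_num_steps=None):
    """Number of steps before training stops (hypothetical helper)."""
    epoch_limit = steps_per_epoch * epoch  # assumed epoch-based limit
    if max_num_steps is None:
        return epoch_limit
    return min(epoch_limit, max_num_steps)


# A quick smoke test: cap a 500-step epoch at 10 steps.
print(total_training_steps(steps_per_epoch=500, epoch=1, max_num_steps=10))
```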

  • sequence_packing

Sequence Packing

Whether to enable sequence packing for training. If set to True, the input sequences will be packed into a single tensor for training.

type

boolean

default

False

  • VLAConfig

VLAConfig

type

object

properties

  • vla_type

Vla Type

VLA type, could be ‘openvla-oft’ or ‘openvla’

type

string

default

openvla-oft

  • num_envs

Num Envs

Number of environments to roll out.

type

integer

default

1

  • use_proprio

Use Proprio

Whether to use proprioceptive information.

type

boolean

default

False

  • proprio_dim

Proprio Dim

Dimension of proprioceptive information.

type

integer

default

7

  • num_images_in_input

Num Images In Input

Number of images in input.

type

integer

default

1

  • training_chunk_size

Training Chunk Size

Number of chunks to train in one iteration.

type

integer

default

16

  • save_video

Save Video

Whether to save video of validation rollout.

type

boolean

default

False

  • continuous

Continuous

Whether to enable continuous simulation + rollout.

type

boolean

default

False

  • trace_verbosity

Trace Verbosity

Verbosity level for tracing. 0=disabled, 1=validation only, 2=all.

type

integer

default

1
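The three trace_verbosity levels map onto a simple decision. The helper below is a hypothetical restatement of the documented levels (0 = disabled, 1 = validation only, 2 = all); treating values above 2 like 2 is an assumption.

```python
def should_trace(trace_verbosity, is_validation):
    """Decide whether a rollout is traced at the given verbosity level."""
    if trace_verbosity <= 0:
        return False          # 0: tracing disabled
    if trace_verbosity == 1:
        return is_validation  # 1: validation rollouts only
    return True               # 2 (and, by assumption, higher): everything


print(should_trace(1, is_validation=False), should_trace(1, is_validation=True))
```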

  • ValidationConfig

ValidationConfig

type

object

properties

  • enable

Enable

Enable validation during training.

type

boolean

default

False

  • val_before_train

Val Before Train

Enable validation before training starts (at step 0, after weight initialization).

type

boolean

default

False

  • freq

Freq

Validation frequency during training, in terms of training steps

type

integer

default

20

  • batch_size

Batch Size

Batch size for validation. If not set, the training batch size is used.

default

None

anyOf

type

integer

type

null

  • dataset

Dataset configuration for validation. It includes dataset name, subset, revision and test split.

#/$defs/DatasetConfig

  • temperature

Temperature

Temperature for sampling during validation.

type

number

default

0.0

  • top_p

Top P

Top-p for sampling during validation.

default

None

anyOf

type

number

type

null

  • top_k

Top K

Top-k for sampling during validation.

default

1

anyOf

type

integer

type

null

  • repetition_penalty

Repetition Penalty

Repetition penalty for sampling during validation.

type

number

default

1.0

  • n_generation

N Generation

The n parameter for validation, as in the OpenAI chat API: the number of generations per prompt.

type

integer

default

1

  • max_response_length

Max Response Length

Max output length of rollout generation during validation.

default

None

anyOf

type

integer

type

null

  • reward_function

Reward Function

Reward functions for the model. Currently supports single_choice, boxed_math, and format. To weight each reward function, pass a dict, e.g., {'single_choice': 0.9, 'format': 0.1}.

default

anyOf

type

string

type

array

items

type

string

type

object

additionalProperties

type

number
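Because reward_function accepts a string, a list of strings, or a dict of weights, it helps to see the three forms reduced to one shape. This is a sketch, not the library's parsing code: the function names are hypothetical, the weighted sum is an assumed way of combining scores, and giving unweighted entries a weight of 1.0 is an assumption.

```python
def normalize_reward_functions(reward_function):
    """Reduce str / list / dict forms to a {name: weight} dict (hypothetical helper)."""
    if not reward_function:
        return {}
    if isinstance(reward_function, str):
        return {reward_function: 1.0}
    if isinstance(reward_function, dict):
        return {name: float(weight) for name, weight in reward_function.items()}
    # A list of names: assume every entry has weight 1.0.
    return {name: 1.0 for name in reward_function}


def combined_reward(scores, weights):
    """Weighted sum of per-function scores (assumed combination rule)."""
    return sum(weights[name] * scores[name] for name in weights)


weights = normalize_reward_functions({"single_choice": 0.9, "format": 0.1})
print(combined_reward({"single_choice": 1.0, "format": 0.0}, weights))
```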