Configuration

Config

type

object

properties

  • custom

Custom

Custom script configuration.

type

object

additionalProperties

True

  • train

#/$defs/TrainingConfig

  • rollout

#/$defs/RolloutConfig

  • policy

#/$defs/PolicyConfig

  • logging

#/$defs/LoggingConfig

  • profiler

#/$defs/ProfilerConfig

  • validation

#/$defs/ValidationConfig
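As a rough illustration of how the top-level sections listed above fit together, the sketch below builds a plain Python dictionary with one entry per section. The field values are defaults or examples taken from this reference; how the dictionary would actually be loaded or validated is not shown here and is left as an assumption.

```python
# Hypothetical in-memory view of a configuration; keys mirror the
# top-level properties listed above, values are defaults from this page.
config = {
    "custom": {},  # free-form: additionalProperties is True
    "train": {
        "train_policy": {"type": "grpo", "variant": "grpo"},
        "train_batch_per_replica": 8,
        "epoch": 1,
        "output_dir": "./outputs",
    },
    "rollout": {"backend": "vllm", "n_generation": 16, "max_response_length": 2048},
    "policy": {
        "model_name_or_path": "Qwen/Qwen2.5-VL-7B-Instruct",
        "model_max_length": 4096,
    },
    "logging": {"logger": ["console"], "project_name": "cosmos_rl"},
    "profiler": {"enable_profiler": False},
    "validation": {"temperature": 0.0, "n_generation": 1},
}

for section in config:
    print(section)
```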

$defs

  • CheckpointConfig

CheckpointConfig

type

object

properties

  • enable_checkpoint

Enable Checkpoint

Enable checkpointing for training. If set to False, no checkpoint will be saved.

type

boolean

default

False

  • save_freq

Save Freq

Checkpoint save frequency, in training steps.

type

integer

default

20

  • save_mode

Save Mode

Checkpoint save mode for training steps

type

string

default

async

  • max_keep

Max Keep

Maximum number of checkpoints to keep. If set to -1, all checkpoints will be kept.

type

integer

default

5
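The retention rule described by max_keep can be sketched as follows. This is only an illustration: the function name is hypothetical, and it assumes checkpoints are identified by training step and that the oldest ones are pruned first, which this reference does not state.

```python
def checkpoints_to_delete(saved_steps, max_keep):
    """Return the checkpoint steps to prune, oldest first.

    Assumption: the most recent `max_keep` checkpoints are the ones kept.
    """
    if max_keep == -1:
        return []  # -1 keeps every checkpoint
    ordered = sorted(saved_steps)
    excess = len(ordered) - max_keep
    return ordered[:excess] if excess > 0 else []


# With save_freq=20 and max_keep=5, seven saved checkpoints leave two to prune.
steps = [20, 40, 60, 80, 100, 120, 140]
print(checkpoints_to_delete(steps, max_keep=5))   # [20, 40]
print(checkpoints_to_delete(steps, max_keep=-1))  # []
```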

  • export_safetensors

Export Safetensors

Whether to export safetensors weights for Hugging Face usage, including the related config files.

type

boolean

default

True

  • upload_hf

Upload Hf

Whether to upload the safetensors weights to Hugging Face.

type

boolean

default

False

  • hf_repo_name

Hf Repo Name

The Hugging Face repo name to upload the safetensors weights to.

type

string

default

Comos-Reason1

  • upload_s3

Upload S3

Whether to upload the checkpoint and safetensors to S3. Defaults to False. Set to final to upload only the final checkpoint, or to all to upload all checkpoints.

default

False

anyOf

type

boolean

type

string

  • s3_bucket

S3 Bucket

The S3 bucket name to upload the checkpoint and safetensors weight.

default

None

anyOf

type

string

type

null

  • s3_prefix

S3 Prefix

The S3 prefix to upload the checkpoint and safetensors weight.

type

string

default

outputs

  • DatasetConfig

DatasetConfig

type

object

properties

  • name

Name

Huggingface dataset name or local path to parquet file

type

string

default

  • subset

Subset

Dataset subset, if one exists.

default

anyOf

type

string

type

null

  • revision

Revision

Dataset git revision, if one exists. Can be a branch name, a tag, or a commit hash.

default

anyOf

type

string

type

null

  • split

Split

A list of dataset splits to train

default

anyOf

type

string

type

array

items

type

string

  • test_size

Test Size

Size of the test set. If float, it is the ratio (between 0.0 and 1.0) of the dataset; if int, it is the absolute size of the test set.

default

None

anyOf

type

number

type

integer

type

null
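The float-versus-integer rule for test_size can be sketched like this. The function name is hypothetical, and the rounding of the ratio case (truncation) and the handling of None are assumptions made for the example, not behavior confirmed by this reference.

```python
def resolve_test_size(test_size, dataset_len):
    """Turn a test_size setting into an absolute number of test examples."""
    if test_size is None:
        return None  # assumption: no test size requested
    if isinstance(test_size, float):
        if not 0.0 <= test_size <= 1.0:
            raise ValueError("a float test_size must lie between 0.0 and 1.0")
        return int(dataset_len * test_size)  # assumption: truncate
    if test_size > dataset_len:
        raise ValueError("an int test_size cannot exceed the dataset size")
    return test_size  # an int is already the absolute size


print(resolve_test_size(0.25, 200))  # 50
print(resolve_test_size(30, 200))    # 30
```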

  • FP8Config

FP8Config

type

object

properties

  • enable_fp8

Enable Fp8

Whether to enable fp8.

type

boolean

default

False

  • fp8_recipe

Fp8 Recipe

Recipe for weight scale calculation.

type

string

default

dynamic_scaling

  • quant_recipe

Quant Recipe

Quantization strategy for weight.

type

string

default

rowwise

  • GrpoConfig

GrpoConfig

type

object

properties

  • type

Type

type

string

const

grpo

  • variant

Variant

Variant of GRPO. Currently supports grpo and dapo.

type

string

default

grpo

  • dataset

Dataset configuration for GRPO training. It includes dataset name, subset, revision, train split, test split and test size.

#/$defs/DatasetConfig

  • dataloader_shuffle

Dataloader Shuffle

Shuffle the dataloader. If False, the dataloader will be used in the order it is loaded.

type

boolean

default

True

  • enable_dataset_cache

Enable Dataset Cache

Cache the results of dataset processing, which may accelerate dataset loading.

type

boolean

default

False

  • dataloader_num_workers

Dataloader Num Workers

Number of subprocesses to use for data loading.

type

integer

default

0

  • dataloader_prefetch_factor

Dataloader Prefetch Factor

Number of batches loaded in advance by each worker.

default

None

anyOf

type

integer

type

null

  • prompt_column_name

Prompt Column Name

Column name for prompt

type

string

default

  • response_column_name

Response Column Name

Column name for response/reference answer

type

string

default

  • reward_function

Reward Function

Reward functions for the model. Currently supports single_choice, boxed_math, and format. You can weight each reward function by passing a dict, e.g., {'single_choice': 0.9, 'format': 0.1}

anyOf

type

string

type

array

items

type

string

type

object

additionalProperties

type

number
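Because reward_function accepts a string, a list of strings, or a name-to-weight dict, it can help to picture all three forms normalized into one weight mapping. The sketch below is an assumption about how that might look: the helper names are hypothetical, unweighted entries are given weight 1.0, and rewards are combined as a weighted sum.

```python
def normalize_reward_functions(reward_function):
    """Map any accepted form of reward_function to {name: weight}."""
    if isinstance(reward_function, str):
        return {reward_function: 1.0}
    if isinstance(reward_function, dict):
        return {name: float(weight) for name, weight in reward_function.items()}
    return {name: 1.0 for name in reward_function}  # list of names


def combine_rewards(scores, weights):
    """Weighted sum of per-function scores (assumed combination rule)."""
    return sum(weights[name] * scores[name] for name in weights)


weights = normalize_reward_functions({"single_choice": 0.9, "format": 0.1})
print(combine_rewards({"single_choice": 1.0, "format": 0.0}, weights))  # 0.9
```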

  • temperature

Temperature

Temperature for sampling. The higher the temperature, the more random the completions.

type

number

default

1.0

  • epsilon_low

Epsilon Low

Epsilon value for clipping.

type

number

default

0.2

  • epsilon_high

Epsilon High

Upper-bound epsilon value for clipping. If not specified, it defaults to the same value as the lower bound specified in argument epsilon. The DAPO paper recommends 0.28.

type

number

default

0.2
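To show what separate lower and upper clipping bounds do, here is a minimal sketch of a PPO-style clipped surrogate for a single token. It assumes the usual form, in which the probability ratio is clipped to [1 - epsilon_low, 1 + epsilon_high]; the function name is hypothetical and the exact loss used by the trainer is not confirmed by this reference.

```python
def clipped_surrogate(ratio, advantage, epsilon_low=0.2, epsilon_high=0.2):
    """Per-token clipped objective with asymmetric bounds (assumed form)."""
    clipped_ratio = min(max(ratio, 1.0 - epsilon_low), 1.0 + epsilon_high)
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# A larger epsilon_high lets positive-advantage tokens move further
# before the objective is clipped.
print(clipped_surrogate(1.5, 1.0, epsilon_high=0.2))
print(clipped_surrogate(1.5, 1.0, epsilon_high=0.28))
```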

  • positive_nll_coef

Positive Nll Coef

Coefficient for Positive Example LM Loss. Set a positive value to enable; None disables the feature.

default

None

anyOf

type

number

type

null

  • lower_bound_ratio

Lower Bound Ratio

Lower-bound ratio for dual-clip.

type

number

default

3.0

  • loss_type

Loss Type

The type of loss to use for GRPO training.

type

string

default

token-mean

  • unbiased_loss_max_tokens

Unbiased Loss Max Tokens

Maximum number of tokens to use for the unbiased loss introduced in Dr.GRPO. If set to None, the unbiased loss is not used. Only available when loss_type is seq-mean-token-mean.

default

None

anyOf

type

integer

type

null

  • unbiased_advantage

Unbiased Advantage

Whether to divide the advantage by the standard deviation of rewards.

type

boolean

default

False

  • overlong_reward

Configuration for overlong reward penalty. If enabled, the output will be penalized for responses that are too long.

#/$defs/OverlongRewardConfig

  • kl_beta

Kl Beta

KL coefficient. If 0.0, the reference model is not loaded, reducing memory usage and improving training speed, but may be numerically unstable for long training runs.

type

number

default

0.0

  • aipo_rho

Aipo Rho

Rho value for AIPO (Asynchronous Importance weighted Policy Optimization): the clipping constant of the importance sampling ratio. Suggested range: [2, 10]. Reference: https://arxiv.org/pdf/2505.24034

default

None

anyOf

type

number

type

null

  • mu_iterations

Mu Iterations

Number of iterations per batch (denoted as μ in the algorithm).

type

integer

default

1

  • mini_batch

Mini Batch

Mini-batch size for GRPO training.

type

integer

default

2

  • allowed_outdated_steps

Allowed Outdated Steps

Allowed outdated-async steps for the rollout engine. If the number of pending rollouts exceeds allowed_outdated_steps * n_policy_replicas * train_batch_per_replica, rollout engine traffic is throttled.

type

integer

default

4
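The throttling condition above is a simple product of three settings. The sketch below only evaluates that threshold; the function name is hypothetical and what throttling does in practice is outside the scope of the example.

```python
def should_throttle(pending_rollouts, allowed_outdated_steps,
                    n_policy_replicas, train_batch_per_replica):
    """True when pending rollouts exceed the allowed backlog."""
    limit = allowed_outdated_steps * n_policy_replicas * train_batch_per_replica
    return pending_rollouts > limit


# Defaults from this page: allowed_outdated_steps=4, train_batch_per_replica=8,
# with a single policy replica, so the backlog limit is 32 rollouts.
print(should_throttle(32, 4, 1, 8))  # False
print(should_throttle(33, 4, 1, 8))  # True
```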

  • on_policy

On Policy

Enable fully synchronized (on-policy) rollout. If set to True, the rollout engine will wait until the expected weight version is updated before next generation starts.

type

boolean

default

False

  • min_filter_prefix_tokens

Min Filter Prefix Tokens

Minimum number of prefix tokens used to filter rollouts inside the same group. If the number of tokens is larger than min_filter_prefix_tokens, rollouts with the same prefix but different rewards are filtered out of the loss calculation.

default

None

anyOf

type

integer

type

null

  • LoggingConfig

LoggingConfig

type

object

properties

  • logger

Logger

List of loggers to use, e.g., ['console', 'wandb']

type

array

items

type

string

  • project_name

Project Name

Wandb project name for logging. If set, the training will be logged to this project.

type

string

default

cosmos_rl

  • experiment_name

Experiment Name

A short display name for this run. If not set, will use the output_dir as the experiment name.

default

None

anyOf

type

string

type

null

  • LoraConfig

LoraConfig

type

object

properties

  • r

R

LoRA rank

type

integer

default

8

  • lora_alpha

Lora Alpha

LoRA alpha

type

number

default

8.0

  • lora_dropout

Lora Dropout

LoRA dropout

type

number

default

0.0

  • target_modules

Target Modules

LoRA target modules, can be a list of strings or ‘all-linear’

default

None

anyOf

type

array

items

type

string

type

string

  • use_rslora

Use Rslora

When set to True, uses [Rank-Stabilized LoRA](https://huggingface.co/papers/2312.03732) which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it was proven to work better. Otherwise, it will use the original default value of lora_alpha/r.

type

boolean

default

False
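The two adapter scaling factors mentioned for use_rslora, lora_alpha/math.sqrt(r) and lora_alpha/r, can be compared directly. The helper below is a small illustrative function (its name is hypothetical), not the library's own code.

```python
import math


def lora_scaling(lora_alpha, r, use_rslora=False):
    """Adapter scaling factor for standard LoRA or Rank-Stabilized LoRA."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r


# With the defaults on this page (r=8, lora_alpha=8.0):
print(lora_scaling(8.0, 8))                   # 1.0
print(lora_scaling(8.0, 8, use_rslora=True))  # larger, since sqrt(8) < 8
```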

  • modules_to_save

Modules To Save

List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint.

default

None

anyOf

type

array

items

type

string

type

null

  • init_lora_weights

Init Lora Weights

How to initialize the weights of the adapter layers. Passing True (default) results in the default initialization from the reference implementation from Microsoft, with the LoRA B weight being set to 0. This means that without further training, the LoRA adapter will be a no-op. Setting the initialization to False leads to random initialization of LoRA A and B, meaning that LoRA is not a no-op before training; this setting is intended for debugging purposes. Passing 'gaussian' results in Gaussian initialization scaled by the LoRA rank for linear and layers. Pass 'loftq' to use LoftQ initialization. Passing 'eva' results in a data-driven initialization of Explained Variance Adaptation. EVA initializes LoRA based on the SVD of layer input activations and achieves SOTA performance due to its ability to adapt to the finetuning data. Pass 'olora' to use OLoRA initialization. Passing 'pissa' results in PiSSA initialization (https://huggingface.co/papers/2404.02948).

default

True

anyOf

type

boolean

type

string

enum

gaussian, eva, olora, pissa, pissa_niter_[number of iters]

  • OverlongRewardConfig

OverlongRewardConfig

type

object

properties

  • enable_overlong_penalty

Enable Overlong Penalty

Enable overlong penalty for the model. If set to True, the output will be penalized for responses that are too long.

type

boolean

default

False

  • buffer_length

Buffer Length

Length of the buffer for overlong penalty. If the response length exceeds this value, the output will be penalized.

type

integer

default

4096

  • penalty_factor

Penalty Factor

Penalty factor for overlong penalty. The penalty increases linearly with the length of the response exceeding the buffer length from 0 to the penalty_factor.

type

number

default

1.0
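One way to read the linear penalty described above is the DAPO-style soft overlong punishment, sketched below. This is an assumption, not a confirmed description of the implementation: it supposes the penalty window is the last buffer_length tokens before a maximum response length, and the function name and the max_response_length argument are introduced only for illustration.

```python
def overlong_penalty(response_length, max_response_length,
                     buffer_length=4096, penalty_factor=1.0):
    """Reward adjustment (<= 0) that grows linearly inside the buffer window.

    Assumption: the window starts at max_response_length - buffer_length.
    """
    expected_length = max_response_length - buffer_length
    exceed = response_length - expected_length
    if exceed <= 0:
        return 0.0  # short enough: no penalty
    # Ramp linearly from 0 to penalty_factor across the buffer, then saturate.
    return -min(exceed / buffer_length, 1.0) * penalty_factor


# With an 8192-token limit, the penalty ramps between 4096 and 8192 tokens.
print(overlong_penalty(4096, 8192))  # 0.0
print(overlong_penalty(7168, 8192))  # -0.75
print(overlong_penalty(9000, 8192))  # -1.0
```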

  • ParallelismConfig

ParallelismConfig

type

object

properties

  • n_init_replicas

N Init Replicas

Number of initial replicas to be created

type

integer

default

1

  • tp_size

Tp Size

Tensor parallelism size

type

integer

default

2

  • cp_size

Cp Size

Context parallelism size

type

integer

default

1

  • ep_size

Ep Size

Expert parallelism size

type

integer

default

1

  • dp_shard_size

Dp Shard Size

Data Parallelism size in sharded mode

type

integer

default

-1

  • pp_size

Pp Size

Pipeline parallelism size

type

integer

default

1

  • pp_dynamic_shape

Pp Dynamic Shape

Pipeline parallelism dynamic shape

type

boolean

default

False

  • pp_micro_batch_size

Pp Micro Batch Size

Pipeline parallelism micro-batch size. The number of micro-batches, n_micro_batch = batch_size / pp_micro_batch_size, must be divisible by the number of pp stages.

type

integer

default

1
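The divisibility constraint on pipeline micro-batches can be checked with a few lines of arithmetic. In this sketch the function name is hypothetical, and the number of pp stages is assumed to equal pp_size, which this reference does not state explicitly.

```python
def num_micro_batches(batch_size, pp_micro_batch_size, pp_size):
    """Compute n_micro_batch and validate it against the pipeline stages."""
    if batch_size % pp_micro_batch_size != 0:
        raise ValueError("batch_size must be a multiple of pp_micro_batch_size")
    n_micro_batch = batch_size // pp_micro_batch_size
    # Assumption: the number of pipeline stages equals pp_size.
    if n_micro_batch % pp_size != 0:
        raise ValueError("n_micro_batch must be divisible by the pp stages")
    return n_micro_batch


print(num_micro_batches(batch_size=8, pp_micro_batch_size=1, pp_size=2))  # 8
```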

  • dp_replicate_size

Dp Replicate Size

Data parallelism size in replica mode. Only configurable for SFT jobs; it must be 1 for GRPO jobs in order to support dynamic scaling.

type

integer

default

1

  • PolicyConfig

PolicyConfig

type

object

properties

  • parallelism

#/$defs/ParallelismConfig

  • model_name_or_path

Model Name Or Path

The model name or path, compatible with huggingface model name or local path

type

string

default

Qwen/Qwen2.5-VL-7B-Instruct

  • model_revision

Model Revision

The revision of the model to use

default

None

anyOf

type

string

type

null

  • model_max_length

Model Max Length

The maximum sequence length for training. Longer sequences are ignored for training stability.

type

integer

default

4096

  • model_gradient_checkpointing

Model Gradient Checkpointing

Whether to use gradient checkpointing

type

boolean

default

True

  • lora

LoRA configuration

default

None

anyOf

#/$defs/LoraConfig

type

null

  • trainable_map

Trainable Map

Mapping of name -> bool. Keys can be either exact parameter names (from model.named_parameters()) or exact module paths (from model.named_modules()).

default

None

anyOf

type

object

additionalProperties

type

boolean

type

null

  • enable_liger_kernel

Enable Liger Kernel

Whether to use liger kernel.

type

boolean

default

False

  • ProfilerConfig

ProfilerConfig

type

object

properties

  • enable_profiler

Enable Profiler

Enable profiler for training

type

boolean

default

False

  • enable_nsys

Enable Nsys

Enable nsys for training

type

boolean

default

False

  • sub_profiler_config

Sub profiler config

#/$defs/SubProfilerConfig

  • RolloutConfig

RolloutConfig

type

object

properties

  • parallelism

#/$defs/RolloutParallelismConfig

  • enforce_eager

Enforce Eager

Whether to enable eager execution for vLLM.

type

boolean

default

True

  • include_stop_str_in_output

Include Stop Str In Output

Whether to include stop string in output.

type

boolean

default

False

  • gpu_memory_utilization

Gpu Memory Utilization

GPU memory utilization factor for rollout backend.

type

number

default

0.8

  • enable_chunked_prefill

Enable Chunked Prefill

Whether to enable chunked prefill for vLLM.

type

boolean

default

False

  • max_response_length

Max Response Length

Max output length of rollout generation.

type

integer

default

2048

  • n_generation

N Generation

Number of generations per prompt, the same as the n parameter in the OpenAI chat API.

type

integer

default

16

  • batch_size

Batch Size

Batch size for rollout.

type

integer

default

1

  • val_batch_size

Val Batch Size

Batch size for rollout generation during validation.

default

None

anyOf

type

integer

type

null

  • quantization

Quantization

Quantization in vllm rollout generation.

type

string

default

none

  • seed

Seed

Random seed for rollout.

default

None

anyOf

type

integer

type

null

  • sampling_config

#/$defs/SamplingConfig

  • vllm_use_flashinfer

Vllm Use Flashinfer

Use flashinfer for vllm rollout.

type

boolean

default

False

  • backend

Backend

Backend for rollout. Currently supports vllm and trtllm.

type

string

default

vllm

  • RolloutParallelismConfig

RolloutParallelismConfig

type

object

properties

  • n_init_replicas

N Init Replicas

Number of initial replicas to be created

type

integer

default

1

  • tp_size

Tp Size

Tensor parallelism size

type

integer

default

2

  • cp_size

Cp Size

Context parallelism size

type

integer

default

1

  • ep_size

Ep Size

Expert parallelism size

type

integer

default

1

  • dp_shard_size

Dp Shard Size

Data Parallelism size in sharded mode

type

integer

default

-1

  • pp_size

Pp Size

Pipeline parallelism size

type

integer

default

1

  • pp_dynamic_shape

Pp Dynamic Shape

Pipeline parallelism dynamic shape

type

boolean

default

False

  • pp_micro_batch_size

Pp Micro Batch Size

Pipeline parallelism micro-batch size. The number of micro-batches, n_micro_batch = batch_size / pp_micro_batch_size, must be divisible by the number of pp stages.

type

integer

default

1

  • dp_replicate_size

Dp Replicate Size

Data parallelism size in replica mode. Only 1 is supported, in order to allow dynamic scaling.

type

integer

default

1

  • SFTDataConfig

SFTDataConfig

type

object

properties

  • type

Type

type

string

const

sft

  • dataset

Dataset configuration for SFT training. It includes dataset name, subset, revision, train split, and test split.

#/$defs/DatasetConfig

  • mini_batch

Mini Batch

Mini-batch size for training.

type

integer

default

2

  • dataloader_shuffle

Dataloader Shuffle

Shuffle the dataloader. If False, the dataloader will be used in the order it is loaded.

type

boolean

default

True

  • enable_dataset_cache

Enable Dataset Cache

Cache the results of dataset processing, which may accelerate dataset loading.

type

boolean

default

False

  • dataloader_num_workers

Dataloader Num Workers

Number of subprocesses to use for data loading.

type

integer

default

0

  • dataloader_prefetch_factor

Dataloader Prefetch Factor

Number of batches loaded in advance by each worker.

default

None

anyOf

type

integer

type

null

  • conversation_column_name

Conversation Column Name

Column name for the formatted conversation JSON.

type

string

default

conversations

  • system_prompt

System Prompt

System prompt for the model, which will be prepended to the prompt

type

string

default

  • SamplingConfig

SamplingConfig

type

object

properties

  • temperature

Temperature

Temperature for sampling.

type

number

default

1.0

  • top_p

Top P

Top-p for sampling.

type

number

default

1.0

  • top_k

Top K

Top-k for sampling.

type

integer

default

-1

  • repetition_penalty

Repetition Penalty

Repetition penalty for sampling.

type

number

default

1.0

  • use_flashinfer

Use Flashinfer

Use flashinfer for sampling.

type

boolean

default

False

  • SubProfilerConfig

SubProfilerConfig

type

object

properties

  • do_profile

Do Profile

Whether to profile, only used in runtime.

type

boolean

default

False

  • active_steps

Active Steps

Number of active steps

type

integer

default

1

  • warmup_steps

Warmup Steps

Number of warmup steps

type

integer

default

1

  • wait_steps

Wait Steps

Number of wait steps

type

integer

default

1

  • rank_filter

Rank Filter

Rank filter

type

array

items

type

integer

  • record_shape

Record Shape

Whether to record shape

type

boolean

default

False

  • profile_memory

Profile Memory

Whether to profile memory

type

boolean

default

False

  • with_stack

With Stack

Whether to profile stack

type

boolean

default

False

  • with_modules

With Modules

Whether to profile modules

type

boolean

default

False

  • TrainingConfig

TrainingConfig

type

object

properties

  • train_policy

Train Policy

default

type: grpo
variant: grpo
dataset:
  name: ""
  revision: ""
  split: ""
  subset: ""
  test_size: None
dataloader_shuffle: True
enable_dataset_cache: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
prompt_column_name: ""
response_column_name: ""
reward_function:
  single_choice: 1.0
temperature: 1.0
epsilon_low: 0.2
epsilon_high: 0.2
positive_nll_coef: None
lower_bound_ratio: 3.0
loss_type: token-mean
unbiased_loss_max_tokens: None
unbiased_advantage: False
overlong_reward:
  buffer_length: 4096
  enable_overlong_penalty: False
  penalty_factor: 1.0
kl_beta: 0.0
aipo_rho: None
mu_iterations: 1
mini_batch: 2
allowed_outdated_steps: 4
on_policy: False
min_filter_prefix_tokens: None

oneOf

#/$defs/SFTDataConfig

#/$defs/GrpoConfig

  • optm_name

Optm Name

Optimizer name

type

string

default

AdamW

  • optm_lr

Optm Lr

Learning rate for optimizer, can be a float or a list of floats for multiple optimizers

default

1e-06

anyOf

type

number

type

array

items

type

number

  • optm_impl

Optm Impl

Implementation type for optimizer. Can be a list of strings for multiple optimizers. More info: https://pytorch.org/docs/stable/optim.html

default

fused

anyOf

type

string

type

array

items

type

string

  • optm_weight_decay

Optm Weight Decay

Weight decay for optimizer

type

number

default

0.01

  • optm_betas

Optm Betas

Betas for optimizer

type

array

default

0.9

0.999

maxItems

2

minItems

2

  • optm_warmup_steps

Optm Warmup Steps

Warmup steps for optimizer. Can be an integer or a float. A float in the range [0.0, 1.0] is multiplied by the total number of steps.

default

20

anyOf

type

integer

type

number
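The integer-or-fraction rule for optm_warmup_steps can be sketched as below. The function name is hypothetical; truncating the product to an integer and rejecting floats outside [0.0, 1.0] are assumptions, since this reference only describes the in-range case.

```python
def resolve_warmup_steps(optm_warmup_steps, total_steps):
    """Convert an int or fractional warmup setting into a step count."""
    if isinstance(optm_warmup_steps, float):
        if not 0.0 <= optm_warmup_steps <= 1.0:
            # Assumption: out-of-range floats are treated as an error here.
            raise ValueError("a float warmup must lie in [0.0, 1.0]")
        return int(optm_warmup_steps * total_steps)  # fraction of total steps
    return optm_warmup_steps  # an int is an absolute number of steps


print(resolve_warmup_steps(20, 400))    # 20
print(resolve_warmup_steps(0.25, 400))  # 100
```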

  • optm_decay_ratio

Optm Decay Ratio

Ratio of total steps for decay, range in [0.0, 1.0], 0 means no decay.

default

None

anyOf

type

number

type

null

  • optm_decay_type

Optm Decay Type

Type of decay for optimizer

default

None

anyOf

type

string

type

null

  • optm_min_lr_factor

Optm Min Lr Factor

Minimum lr factor for optimizer, range in [0.0, 1.0]

type

number

default

0.0

  • optm_grad_norm_clip

Optm Grad Norm Clip

Gradient norm clip for optimizer

type

number

default

1.0

  • master_dtype

Master Dtype

The master weight data type for optimizers. It is orthogonal to param_dtype.

default

float32

anyOf

type

string

type

null

  • param_dtype

Param Dtype

The data type for forward/backward. Outside forward/backward, params are in master_dtype

type

string

default

bfloat16

  • transfer_dtype

Transfer Dtype

The data type for transferring parameters between Policy and Rollout.

type

string

default

None

  • fsdp_reduce_dtype

Fsdp Reduce Dtype

The data type for reduction in FSDP

type

string

default

float32

  • fsdp_offload

Fsdp Offload

Whether to offload the model to CPU if using FSDP

type

boolean

default

False

  • fsdp_reshard_after_forward

Fsdp Reshard After Forward

Reshard the param after forward pass in FSDP

type

string

default

default

  • train_batch_per_replica

Train Batch Per Replica

The batch size for training per iteration in one replica, this is the local batch size for each gradient accumulation step

type

integer

default

8

  • enable_validation

Enable Validation

Enable validation during training.

type

boolean

default

False

  • validation_step

Validation Step

Validation frequency during training, in terms of training steps

type

integer

default

20

  • validation_batch_per_replica

Validation Batch Per Replica

The batch size for validation per iteration in one replica.

type

integer

default

24

  • fp8

#/$defs/FP8Config

  • ckpt

#/$defs/CheckpointConfig

  • resume

Resume

Resume training from a checkpoint. If True, will resume from the latest checkpoint of the output_dir. If a string, will resume from the specified checkpoint path.

default

False

anyOf

type

boolean

type

string

  • epoch

Epoch

Number of epochs for training

type

integer

default

1

  • output_dir

Output Dir

Output directory

type

string

default

./outputs

  • timestamp

Timestamp

Timestamp for the output directory and wandb ID. If not set, it is generated automatically.

type

string

default

  • epsilon

Epsilon

Epsilon for optimizer

type

number

default

1e-06

  • async_tp_enabled

Async Tp Enabled

Whether to use async tensor parallelism

type

boolean

default

False

  • compile

Compile

Whether to use torch.compile

type

boolean

default

True

  • sync_weight_interval

Sync Weight Interval

The interval of train step for synchronizing weights between replicas.

type

integer

default

1

  • deterministic

Deterministic

Whether to use deterministic training. If set to True, will use deterministic training, which is expected to be slower.

type

boolean

default

False

  • seed

Seed

Random seed for training. If deterministic is set to True, will by default be set to 42.

default

None

anyOf

type

integer

type

null

  • max_num_steps

Max Num Steps

Optional upper bound on total training steps. If set, training stops when either this step count or the epoch-based limit is reached (whichever comes first). Handy for quick smoke tests.

default

None

anyOf

type

integer

type

null
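The "whichever comes first" rule for max_num_steps amounts to taking a minimum. In this sketch the function name is hypothetical and the epoch-based step count is assumed to be computed elsewhere and passed in.

```python
def effective_total_steps(epoch_based_steps, max_num_steps=None):
    """Number of steps actually run, given the optional upper bound."""
    if max_num_steps is None:
        return epoch_based_steps  # no cap: the epoch-based limit applies
    return min(epoch_based_steps, max_num_steps)


print(effective_total_steps(1000))        # 1000
print(effective_total_steps(1000, 10))    # 10, e.g. for a quick smoke test
print(effective_total_steps(1000, 5000))  # 1000
```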

  • ValidationConfig

ValidationConfig

type

object

properties

  • dataset

Dataset configuration for validation. It includes dataset name, subset, revision and test split.

#/$defs/DatasetConfig

  • temperature

Temperature

Temperature for sampling during validation.

type

number

default

0.0

  • top_p

Top P

Top-p for sampling during validation.

default

None

anyOf

type

number

type

null

  • top_k

Top K

Top-k for sampling during validation.

default

1

anyOf

type

integer

type

null

  • repetition_penalty

Repetition Penalty

Repetition penalty for sampling during validation.

type

number

default

1.0

  • n_generation

N Generation

Number of generations per prompt during validation, the same as the n parameter in the OpenAI chat API.

type

integer

default

1

  • max_response_length

Max Response Length

Max output length of rollout generation during validation.

default

None

anyOf

type

integer

type

null

  • reward_function

Reward Function

Reward functions for the model. Currently supports single_choice, boxed_math, and format. You can weight each reward function by passing a dict, e.g., {'single_choice': 0.9, 'format': 0.1}

anyOf

type

string

type

array

items

type

string

type

object

additionalProperties

type

number