DGXC-Lepton Job

If you already have access to the DGX Cloud Lepton Platform, you can launch your training job with the following steps:

Ensure the LeptonAI CLI is installed and you’re logged in to your workspace.
```
pip install -U "leptonai>=0.25.0"
```
Go to your workspace dashboard, generate a token, and log in with:
```
lep login -c <workspace_id>:<token>
```
(OPTIONAL) In your training configuration file, set the output_dir to the mounted storage path.
Build your Docker image.
Push it to your container registry (e.g., NVIDIA NGC, Docker Hub).
Go to the DGX Cloud Lepton workspace dashboard → Settings to configure your container registries and secrets.

You can also quickly create a secret in your workspace using the CLI:
```
lep secret create --name MY_HF_TOKEN --value hf_xxxxxxxxx
```
In the examples below, we’ll reference this secret using the name MY_HF_TOKEN.
Include the image, registries, and secrets in the launch command as shown below.
(OPTIONAL) Ensure your job mounts the available Local File System as shown below.
Check the available node groups and nodes with:
```
lep node list
lep node list --detail
```

To launch your job, use the following command as a template. Make sure to replace the placeholders (e.g., image, config, secrets) with your own values.

cosmos-rl \
--config configs/qwen3/qwen3-8b-p-tp4-r-tp2-pp1-grpo.toml \
--lepton-mode \
--lepton-job-name <job_name> \
--lepton-container-image <image> \
--lepton-resource-shape <resource-type> \
--lepton-node-group <node_group> \
--lepton-image-pull-secrets <registry_name> \
--lepton-secret HUGGING_FACE_HUB_TOKEN=MY_HF_TOKEN \
--lepton-env <ENVIRONMENT_VARIABLE_NAME>=<VALUE> \
--lepton-mount /:<mount_path>:local-path-for-local:<local_disc_volume_name> \
tools/dataset/gsm8k_grpo.py

Notes on the placeholders:

--lepton-resource-shape takes a resource shape such as gpu.8xa100-80gb.
--lepton-secret HUGGING_FACE_HUB_TOKEN=MY_HF_TOKEN is an example of using a secret. Make sure to set up MY_HF_TOKEN in your workspace under Settings → Secrets.
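
When the same job is launched repeatedly with different values, it can help to assemble the command programmatically. The sketch below only builds the argument list from the flags shown in the template above; it is not part of cosmos-rl. The function name `build_launch_command`, the job name, the image, and the node group values are hypothetical placeholders.

```python
import shlex
from typing import Optional


def build_launch_command(
    config: str,
    script: str,
    job_name: str,
    image: str,
    resource_shape: str,
    node_group: str,
    image_pull_secrets: Optional[list] = None,
    secrets: Optional[dict] = None,
    env: Optional[dict] = None,
    mounts: Optional[list] = None,
) -> list:
    """Assemble the cosmos-rl argv; repeatable flags are emitted once per value."""
    cmd = [
        "cosmos-rl",
        "--config", config,
        "--lepton-mode",
        "--lepton-job-name", job_name,
        "--lepton-container-image", image,
        "--lepton-resource-shape", resource_shape,
        "--lepton-node-group", node_group,
    ]
    for name in image_pull_secrets or []:
        cmd += ["--lepton-image-pull-secrets", name]
    # Maps the env var seen by the job to the workspace secret name.
    for env_name, secret_name in (secrets or {}).items():
        cmd += ["--lepton-secret", f"{env_name}={secret_name}"]
    for key, value in (env or {}).items():
        cmd += ["--lepton-env", f"{key}={value}"]
    for mount in mounts or []:
        cmd += ["--lepton-mount", mount]
    cmd.append(script)  # the dataset script goes last, as in the template
    return cmd


cmd = build_launch_command(
    config="configs/qwen3/qwen3-8b-p-tp4-r-tp2-pp1-grpo.toml",
    script="tools/dataset/gsm8k_grpo.py",
    job_name="demo-job",                                 # hypothetical
    image="registry.example.com/team/cosmos-rl:latest",  # hypothetical
    resource_shape="gpu.8xa100-80gb",
    node_group="my-node-group",                          # hypothetical
    secrets={"HUGGING_FACE_HUB_TOKEN": "MY_HF_TOKEN"},
)
print(shlex.join(cmd))  # a shell-quoted string, safe to paste into a terminal
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the job without going through a shell, which avoids quoting problems with values that contain spaces.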

Warning

--lepton-mount currently only works for node groups that have Local Disk enabled.

Example

If your node group has Local Disk enabled and you have a volume named volume_A that contains a dataset folder, you can mount that folder at the path /job_dataset in your job using --lepton-mount:

--lepton-mount /dataset:/job_dataset:local-path-for-local:volume_A
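
The mount value in the example above follows the pattern `<path_in_volume>:<mount_path>:local-path-for-local:<volume_name>`. A small helper can build it consistently. This is a sketch: the function name `lepton_mount_spec` is hypothetical, and the check that both paths are absolute is an assumption based on the examples in this page, not a documented rule.

```python
def lepton_mount_spec(path_in_volume: str, mount_path: str, volume_name: str) -> str:
    """Format a --lepton-mount value for a Local Disk volume."""
    # Assumption: both paths are absolute, as in the examples on this page.
    for label, path in (("path_in_volume", path_in_volume), ("mount_path", mount_path)):
        if not path.startswith("/"):
            raise ValueError(f"{label} must be an absolute path, got {path!r}")
    return f"{path_in_volume}:{mount_path}:local-path-for-local:{volume_name}"


print(lepton_mount_spec("/dataset", "/job_dataset", "volume_A"))
# prints: /dataset:/job_dataset:local-path-for-local:volume_A
```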

For more available options, see the "Option Reference for cosmos-rl command" section below.

Valid Resource Shapes

Use the command lep node list --detail to view the available GPUs in the node group. Note: CPU resource shapes may be used within a GPU node group.

CPU Instances

cpu.small
cpu.medium
cpu.large

GPU Instances

NVIDIA A10:

- gpu.a10
- gpu.a10.6xlarge

NVIDIA A100 (40GB):

- gpu.a100-40gb
- gpu.2xa100-40gb
- gpu.4xa100-40gb
- gpu.8xa100-40gb

NVIDIA A100 (80GB):

- gpu.a100-80gb
- gpu.2xa100-80gb
- gpu.4xa100-80gb
- gpu.8xa100-80gb

NVIDIA H100 SXM:

- gpu.h100-sxm
- gpu.2xh100-sxm
- gpu.4xh100-sxm
- gpu.8xh100-sxm
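
A typo in the resource shape is easy to make, so a local check against the list above can catch it before a job is submitted. The sketch below hard-codes the shapes listed in this section; the names `VALID_SHAPES` and `shape_family` are hypothetical, and whether a given shape is actually available still depends on your node group.

```python
import difflib

# Shapes copied from the "Valid Resource Shapes" list in this page.
VALID_SHAPES = {
    "CPU": ["cpu.small", "cpu.medium", "cpu.large"],
    "NVIDIA A10": ["gpu.a10", "gpu.a10.6xlarge"],
    "NVIDIA A100 (40GB)": [
        "gpu.a100-40gb", "gpu.2xa100-40gb", "gpu.4xa100-40gb", "gpu.8xa100-40gb",
    ],
    "NVIDIA A100 (80GB)": [
        "gpu.a100-80gb", "gpu.2xa100-80gb", "gpu.4xa100-80gb", "gpu.8xa100-80gb",
    ],
    "NVIDIA H100 SXM": [
        "gpu.h100-sxm", "gpu.2xh100-sxm", "gpu.4xh100-sxm", "gpu.8xh100-sxm",
    ],
}


def shape_family(shape: str) -> str:
    """Return the instance family of a listed shape, or raise ValueError."""
    for family, shapes in VALID_SHAPES.items():
        if shape in shapes:
            return family
    # Offer near matches to make typos easier to spot.
    all_shapes = [s for shapes in VALID_SHAPES.values() for s in shapes]
    hints = difflib.get_close_matches(shape, all_shapes, n=3)
    raise ValueError(f"unknown resource shape {shape!r}; close matches: {hints}")


print(shape_family("gpu.8xa100-80gb"))  # prints: NVIDIA A100 (80GB)
```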

Option Reference for cosmos-rl command

Option (long)	Short	Type / Action	Default	Description
--config	—	str (required)	—	Path to TOML configuration file (algorithm, model, data, parallelism…).
--url	—	str	None	Controller URL ip:port; local controller if absent or IP is local.
--port	—	int	8000	Controller port when --url is not provided.
--policy	—	int	None	Total policy replicas (else read from TOML).
--rollout	—	int	None	Total rollout replicas (else read from TOML).
--log-dir	—	str	None	Directory for logs (stdout if not set).
--weight-sync-check	-wc	store_true	False	Debug: check weight-sync correctness between policy and rollout.
--num-workers	—	int	1	Workers used for multi-node training.
--worker-idx	—	int	0	Local worker index (ignored in Lepton mode, which automatically assigns each replica’s index).
--lepton-mode	—	store_true	False	Enable Lepton remote-execution mode.
Lepton-specific options
--lepton-job-name	-n	str (required)	None	Job name (required in Lepton mode).
--lepton-container-port	—	str, append	None	Exposed ports port[:protocol] (repeatable).
--lepton-resource-shape	—	str	None	Pod resource shape.
--lepton-node-group	-ng	str, append	None	Target node group(s).
--lepton-max-failure-retry	—	int	None	Max per-worker retries.
--lepton-max-job-failure-retry	—	int	None	Max job-level retries.
--lepton-env	-e	str, append	None	Env vars NAME=VALUE (repeatable).
--lepton-secret	-s	str, append	None	Secrets (repeatable).
--lepton-mount	—	str, append	None	Persistent storage mounts.
--lepton-image-pull-secrets	—	str, append	None	Image-pull secrets.
--lepton-intra-job-communication	—	bool	None	Enable intra-job communication.
--lepton-privileged	—	store_true	False	Run in privileged mode.
--lepton-ttl-seconds-after-finished	—	int	259200	TTL (s) for finished jobs.
--lepton-log-collection	-lg	bool	None	Enable/disable log collection.
--lepton-node-id	-ni	str, append	None	Specific node(s) to run on.
--lepton-queue-priority	-qp	str	None	Queue priority.
--lepton-visibility	—	str	None	Job visibility (public/private).
--lepton-shared-memory-size	—	int	None	Shared memory size (MiB).
--lepton-with-reservation	—	str	None	Reservation ID for dedicated node groups.
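
The "Type / Action" column uses argparse-style terms. To illustrate what `store_true` and `append` mean in practice, here is a minimal sketch that mirrors a handful of the options above with Python's standard `argparse` module. It is an illustration only, not the actual cosmos-rl parser; the program name `cosmos-rl-sketch` and the sample values are hypothetical.

```python
import argparse

parser = argparse.ArgumentParser(prog="cosmos-rl-sketch")
parser.add_argument("--config", type=str, required=True)
parser.add_argument("--port", type=int, default=8000)
# store_true: the flag takes no value; present means True, absent means False.
parser.add_argument("--lepton-mode", action="store_true")
# append: the flag may be repeated; each value is collected into a list.
parser.add_argument("--lepton-env", "-e", action="append", default=None)
parser.add_argument("--lepton-secret", "-s", action="append", default=None)

args = parser.parse_args([
    "--config", "my_config.toml",
    "--lepton-mode",
    "--lepton-env", "A=1",
    "-e", "B=2",
])

print(args.lepton_mode)    # True, because the flag was given
print(args.lepton_env)     # ['A=1', 'B=2'], one entry per occurrence
print(args.lepton_secret)  # None, because the flag was never given
print(args.port)           # 8000, the default
```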