DGXC-Lepton Job

If you already have access to the DGX Cloud Lepton Platform, you can launch your training job with the following steps:

  1. Ensure the LeptonAI CLI is installed and you’re logged in to your workspace.

    pip install -U "leptonai>=0.25.0"
    

    Go to your workspace dashboard, generate a token, and log in with:

    lep login -c <workspace_id>:<token>
    
  2. (OPTIONAL) In your training configuration file, set the output_dir to the mounted storage path.

  3. Build your Docker image.

  4. Push it to your container registry (e.g., NVIDIA NGC, Docker Hub).

  5. Go to the DGX Lepton workspace dashboard → Settings to configure your container registries and secrets.

    You can also quickly create a secret in your workspace using the CLI:

    lep secret create --name MY_HF_TOKEN --value hf_xxxxxxxxx
    

    In the examples below, we’ll reference this secret using the name MY_HF_TOKEN.

  6. Include the image, registries, and secrets in the launch command as shown below.

  7. (OPTIONAL) Ensure your job mounts the available Local File System as shown below.

  8. Check the available node groups and nodes with:

    lep node list
    lep node list --detail
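
Steps 3, 4, and 8 above can be strung together in a small script. The following is only a dry-run sketch: the image reference is a hypothetical placeholder, the `run` and `prepare_job` helpers are names invented here for illustration, and the script assumes `docker` and `lep` are on your PATH and already logged in. By default it prints the commands instead of executing them.

```shell
# Hypothetical image reference; replace with your own registry/repository:tag.
IMAGE="${IMAGE:-registry.example.com/my-team/cosmos-rl:latest}"
# DRY_RUN=1 (default) prints commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

prepare_job() {
  run docker build -t "$IMAGE" .   # step 3: build the image from the current directory
  run docker push "$IMAGE"         # step 4: push it to your container registry
  run lep node list --detail       # step 8: inspect the available node groups and nodes
}

prepare_job
```

Run it with `DRY_RUN=0` once the printed commands look right for your setup.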
    

To launch your job, use the following command as a template. Make sure to replace the placeholders (e.g., image, config, secrets) with your own values.

# --lepton-resource-shape takes a resource shape, e.g., gpu.8xa100-80gb.
# --lepton-secret shows example usage of a secret. Make sure to set up 'MY_HF_TOKEN' in your workspace under Settings → Secrets.
cosmos-rl \
--config configs/qwen3/qwen3-8b-p-tp4-r-tp2-pp1-grpo.toml \
--lepton-mode \
--lepton-job-name <job_name> \
--lepton-container-image <image> \
--lepton-resource-shape <resource-type> \
--lepton-node-group <node_group> \
--lepton-image-pull-secrets <registry_name> \
--lepton-secret HUGGING_FACE_HUB_TOKEN=MY_HF_TOKEN \
--lepton-env <ENVIRONMENT_VARIABLE_NAME>=<VALUE> \
--lepton-mount /:<mount_path>:local-path-for-local:<local_disc_volume_name> \
tools/dataset/gsm8k_grpo.py
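
A shell comment cannot follow a line-continuation backslash, so a long launch command is hard to annotate in place. One possible workaround, sketched below, is to assemble the argument list one flag at a time, which lets each flag carry its own comment and makes optional flags easy to add. The `launch` function, the variable names, and all default values are hypothetical placeholders rather than part of `cosmos-rl`; only a subset of the template's flags is shown.

```shell
# Hypothetical placeholder values; replace them with your own.
JOB_NAME="${JOB_NAME:-my-grpo-job}"
IMAGE="${IMAGE:-registry.example.com/my-team/cosmos-rl:latest}"
SHAPE="${SHAPE:-gpu.8xa100-80gb}"
NODE_GROUP="${NODE_GROUP:-my-node-group}"
REGISTRY_SECRET="${REGISTRY_SECRET:-my-registry}"
MOUNT="${MOUNT:-}"   # optional, e.g. /dataset:/job_dataset:local-path-for-local:volume_A

launch() {
  set -- cosmos-rl --config configs/qwen3/qwen3-8b-p-tp4-r-tp2-pp1-grpo.toml
  set -- "$@" --lepton-mode
  set -- "$@" --lepton-job-name "$JOB_NAME"
  set -- "$@" --lepton-container-image "$IMAGE"
  set -- "$@" --lepton-resource-shape "$SHAPE"                    # see Valid Resource Shapes
  set -- "$@" --lepton-node-group "$NODE_GROUP"
  set -- "$@" --lepton-image-pull-secrets "$REGISTRY_SECRET"
  set -- "$@" --lepton-secret HUGGING_FACE_HUB_TOKEN=MY_HF_TOKEN  # secret created in the workspace
  if [ -n "$MOUNT" ]; then
    set -- "$@" --lepton-mount "$MOUNT"                           # only for node groups with Local Disk enabled
  fi
  set -- "$@" tools/dataset/gsm8k_grpo.py                         # the launch script comes last

  # DRY_RUN=1 (default) prints the command; DRY_RUN=0 executes it.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

launch
```

Because the arguments are kept as separate words rather than one string, values containing spaces survive intact when the command is executed with `DRY_RUN=0`.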

Warning

--lepton-mount currently works only for node groups that have Local Disk enabled.

Example

If your node group has Local Disk enabled and you have a volume named volume_A that contains a dataset folder, you can mount that folder at the path /job_dataset in your job using --lepton-mount:

--lepton-mount /dataset:/job_dataset:local-path-for-local:volume_A
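
The mount value in this example has four colon-separated parts: the path inside the volume, the mount path in the job, the literal `local-path-for-local`, and the volume name. A small helper can assemble it and catch a stray colon early. This is a sketch only: `lepton_mount_spec` is a hypothetical function, and rejecting empty or colon-containing parts is an assumption made here because a colon would make the value ambiguous, not a documented rule.

```shell
# Hypothetical helper: build a --lepton-mount value from its three variable parts.
# Usage: lepton_mount_spec <path_in_volume> <mount_path_in_job> <volume_name>
lepton_mount_spec() {
  for part in "$1" "$2" "$3"; do
    case "$part" in
      ''|*:*)
        # Assumption: an empty part or an embedded colon would make the value ambiguous.
        echo "invalid mount component: '$part'" >&2
        return 1
        ;;
    esac
  done
  printf '%s:%s:local-path-for-local:%s\n' "$1" "$2" "$3"
}

lepton_mount_spec /dataset /job_dataset volume_A
# prints /dataset:/job_dataset:local-path-for-local:volume_A
```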

For more available options, see the "Option Reference for cosmos-rl command" section below.

Valid Resource Shapes

Use the command lep node list --detail to view the available GPUs in the node group. Note: CPU resource shapes can also be used within a GPU node group.

CPU Instances

  • cpu.small

  • cpu.medium

  • cpu.large

GPU Instances

NVIDIA A10:

  • gpu.a10

  • gpu.a10.6xlarge

NVIDIA A100 (40GB):

  • gpu.a100-40gb

  • gpu.2xa100-40gb

  • gpu.4xa100-40gb

  • gpu.8xa100-40gb

NVIDIA A100 (80GB):

  • gpu.a100-80gb

  • gpu.2xa100-80gb

  • gpu.4xa100-80gb

  • gpu.8xa100-80gb

NVIDIA H100 SXM:

  • gpu.h100-sxm

  • gpu.2xh100-sxm

  • gpu.4xh100-sxm

  • gpu.8xh100-sxm
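
A typo in the shape name is easy to make, so it can help to check the value before submitting a job. The sketch below hard-codes the shape names listed in this section; `is_valid_shape` is a hypothetical helper, and passing this check says nothing about whether a shape is actually available in your node group (use lep node list --detail for that).

```shell
# Hypothetical helper: succeed only if the argument is one of the shapes listed above.
is_valid_shape() {
  case "$1" in
    cpu.small|cpu.medium|cpu.large) return 0 ;;
    gpu.a10|gpu.a10.6xlarge) return 0 ;;
    gpu.a100-40gb|gpu.2xa100-40gb|gpu.4xa100-40gb|gpu.8xa100-40gb) return 0 ;;
    gpu.a100-80gb|gpu.2xa100-80gb|gpu.4xa100-80gb|gpu.8xa100-80gb) return 0 ;;
    gpu.h100-sxm|gpu.2xh100-sxm|gpu.4xh100-sxm|gpu.8xh100-sxm) return 0 ;;
    *) return 1 ;;
  esac
}

for shape in gpu.8xa100-80gb gpu.3xa100-80gb; do
  if is_valid_shape "$shape"; then
    echo "$shape: listed"
  else
    echo "$shape: not in the list above"
  fi
done
```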

Option Reference for cosmos-rl command

| Option (long) | Short | Type / Action | Default | Description |
|---|---|---|---|---|
| --config | | str (required) | | Path to TOML configuration file (algorithm, model, data, parallelism…). |
| --url | | str | None | Controller URL ip:port; local controller if absent or IP is local. |
| --port | | int | 8000 | Controller port when --url is not provided. |
| --policy | | int | None | Total policy replicas (else read from TOML). |
| --rollout | | int | None | Total rollout replicas (else read from TOML). |
| --log-dir | | str | None | Directory for logs (stdout if not set). |
| --weight-sync-check | -wc | store_true | False | Debug: check weight-sync correctness between policy and rollout. |
| --num-workers | | int | 1 | Workers used for multi-node training. |
| --worker-idx | | int | 0 | Local worker index (ignored in Lepton mode, which automatically assigns each replica's index). |
| --lepton-mode | | store_true | False | Enable Lepton remote-execution mode. |

Lepton-specific options

| Option (long) | Short | Type / Action | Default | Description |
|---|---|---|---|---|
| --lepton-job-name | -n | str (required) | None | Job name (required in Lepton mode). |
| --lepton-container-port | | str, append | None | Exposed ports port[:protocol] (repeatable). |
| --lepton-resource-shape | | str | None | Pod resource shape. |
| --lepton-node-group | -ng | str, append | None | Target node group(s). |
| --lepton-max-failure-retry | | int | None | Max per-worker retries. |
| --lepton-max-job-failure-retry | | int | None | Max job-level retries. |
| --lepton-env | -e | str, append | None | Env vars NAME=VALUE (repeatable). |
| --lepton-secret | -s | str, append | None | Secrets (repeatable). |
| --lepton-mount | | str, append | None | Persistent storage mounts. |
| --lepton-image-pull-secrets | | str, append | None | Image-pull secrets. |
| --lepton-intra-job-communication | | bool | None | Enable intra-job communication. |
| --lepton-privileged | | store_true | False | Run in privileged mode. |
| --lepton-ttl-seconds-after-finished | | int | 259200 | TTL (s) for finished jobs. |
| --lepton-log-collection | -lg | bool | None | Enable/disable log collection. |
| --lepton-node-id | -ni | str, append | None | Specific node(s) to run on. |
| --lepton-queue-priority | -qp | str | None | Queue priority. |
| --lepton-visibility | | str | None | Job visibility (public/private). |
| --lepton-shared-memory-size | | int | None | Shared memory size (MiB). |
| --lepton-with-reservation | | str | None | Reservation ID for dedicated node groups. |
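
Options described as repeatable, such as --lepton-env and --lepton-secret, are passed once per value. As a rough illustration, the hypothetical `repeat_flag` helper below expands a list of values into one flag/value pair per line. The environment variable names and values are examples chosen here, not values the job requires.

```shell
# Hypothetical helper: print "<flag> <value>" once for each value given.
# Usage: repeat_flag <flag> <value>...
repeat_flag() {
  flag=$1
  shift
  for value in "$@"; do
    printf '%s %s\n' "$flag" "$value"
  done
}

repeat_flag --lepton-env NCCL_DEBUG=INFO WANDB_MODE=offline
# prints:
# --lepton-env NCCL_DEBUG=INFO
# --lepton-env WANDB_MODE=offline
```

The output can be spliced into a command line with unquoted command substitution, for example `cosmos-rl ... $(repeat_flag --lepton-env A=1 B=2) ...`, but that relies on word splitting and therefore only works for values without whitespace.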