Profiler
This page explains how to enable and use runtime profiling for a running RL task.
Prerequisites
Start your Cosmos-RL RL task first.
Make sure the profiler is enabled in your TOML config.
Example:
[profiler]
enable_profiler = true
[profiler.sub_profiler_config]
active_steps = 1
warmup_steps = 1
wait_steps = 1
rank_filter = [0]
record_shape = false
profile_memory = false
with_stack = false
with_modules = false
Important:
Profiling commands below only work when profiler.enable_profiler = true.
You can tune profiling behavior with profiler.sub_profiler_config.
Profiler config definitions are in cosmos_rl/policy/config/__init__.py (ProfilerConfig and SubProfilerConfig).
Step-by-Step Workflow
1) List running replicas
Run:
python -m cosmos_rl.cli.cli replica ls -cp 8000 -ch localhost
Notes:
-cp is the controller port.
-ch is the controller host.
Replace 8000 and localhost with your real controller endpoint.
2) Pick the target policy replica
From replica ls output, identify:
Replica name
Replica role
Choose the policy replica you want to profile.
3) Enable profiling for that replica
Run:
python -m cosmos_rl.cli.cli profile set 04c8f8f4-e4e2-46ac-be3c-28ad63a7c108 -cp 8000 -ch localhost
Replace the replica name, host, and port with your own values.
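If you run these commands from a script, building the argument list in one place keeps the controller host and port consistent between replica ls and profile set. This is a sketch only: build_cli_command is a hypothetical helper, and it uses just the module path and the -cp / -ch flags shown in the commands above.

```python
import shlex


def build_cli_command(subcommand, host="localhost", port=8000):
    """Hypothetical helper: return the argv list for a cosmos_rl CLI call."""
    return [
        "python", "-m", "cosmos_rl.cli.cli",
        *subcommand,
        "-cp", str(port),  # controller port
        "-ch", host,       # controller host
    ]


list_cmd = build_cli_command(["replica", "ls"])
profile_cmd = build_cli_command(
    ["profile", "set", "04c8f8f4-e4e2-46ac-be3c-28ad63a7c108"]
)

# Print the commands in copy-pasteable shell form.
print(shlex.join(list_cmd))
print(shlex.join(profile_cmd))
```

Either list can be passed to subprocess.run(cmd, check=True) to execute it from Python.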
4) Confirm profiler logs on the policy replica
In policy replica logs, you should see messages like:
[Profiler] init profiler for rank 0
[Profiler] start to trace for rank: 0
[Profiler] save trace for rank: 0 to file: ./outputs/.../profile_trace/04c8f8f4-e4e2-46ac-be3c-28ad63a7c108_0/0_trace.json.gz after 3 steps.
The save log contains the full trace file path.
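When the policy logs are long, the trace path can be pulled out programmatically. The sketch below assumes the save message keeps exactly the wording shown in the sample log above; the regular expression and the parse_save_line helper are illustrative and would need adjusting if the log format differs.

```python
import re

# Pattern modeled on the sample "save trace" log line shown above.
SAVE_LINE = re.compile(
    r"\[Profiler\] save trace for rank: (?P<rank>\d+) "
    r"to file: (?P<path>\S+) after (?P<steps>\d+) steps"
)


def parse_save_line(line):
    """Return (rank, trace_path, steps), or None if the line does not match."""
    match = SAVE_LINE.search(line)
    if match is None:
        return None
    return int(match["rank"]), match["path"], int(match["steps"])


sample = (
    "[Profiler] save trace for rank: 0 to file: "
    "./outputs/.../profile_trace/04c8f8f4-e4e2-46ac-be3c-28ad63a7c108_0/"
    "0_trace.json.gz after 3 steps."
)
print(parse_save_line(sample))
```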
5) Open the trace in Chrome trace viewer
The saved file is a gzip-compressed JSON trace, for example 0_trace.json.gz for rank 0.
You can open it with:
Perfetto UI: https://ui.perfetto.dev
Chrome tracing (legacy):
chrome://tracing
Perfetto is recommended.
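For a quick look without a browser, the trace can also be read directly in Python. This sketch assumes the file follows the Chrome Trace Event format, meaning either a JSON object with a traceEvents list or a bare JSON array of events; summarize_trace is a hypothetical helper, not a Cosmos-RL function.

```python
import gzip
import json
from collections import Counter


def summarize_trace(path, top=5):
    """Return the `top` most frequent event names in a gzipped JSON trace."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    # Object form keeps events under "traceEvents"; array form is the list itself.
    events = data.get("traceEvents", []) if isinstance(data, dict) else data
    names = Counter(
        event.get("name", "<unnamed>")
        for event in events
        if isinstance(event, dict)
    )
    return names.most_common(top)
```

Call it with the path printed in the save log, for example summarize_trace("0_trace.json.gz").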
Quick Troubleshooting
No profiler logs:
- Verify profiler.enable_profiler = true.
- Check rank_filter includes your rank.
profile set has no visible effect:
- Re-check controller host/port and replica name.
No trace file generated:
- Make sure training runs for at least wait_steps + warmup_steps + active_steps.
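The step requirement can be made concrete with a small calculation. The sketch below assumes the three phases run once, in the order wait, warmup, active, which is consistent with the sum above and with the "after 3 steps" message in the sample log, but is not confirmed against the Cosmos-RL implementation. Both function names are hypothetical.

```python
def min_steps_for_trace(wait_steps, warmup_steps, active_steps):
    """Minimum number of training steps before a trace can be saved."""
    return wait_steps + warmup_steps + active_steps


def profiler_phase(step, wait_steps, warmup_steps, active_steps):
    """Phase for a 0-indexed step, assuming wait -> warmup -> active order."""
    if step < wait_steps:
        return "wait"
    if step < wait_steps + warmup_steps:
        return "warmup"
    if step < wait_steps + warmup_steps + active_steps:
        return "active"
    return "done"


# Values from the example config: one step per phase.
print(min_steps_for_trace(1, 1, 1))
print([profiler_phase(s, 1, 1, 1) for s in range(4)])
```

If a run stops before reaching the minimum step count, no trace file should be expected under this assumption.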