Load-Balanced Dynamic Batching
==============================

Overview
--------

Load-balanced dynamic batching is a data loading strategy designed to minimize padding waste and improve training efficiency in distributed training scenarios. Unlike traditional fixed-size batching, this approach dynamically creates batches that maximize token utilization while respecting a maximum token constraint per batch.

Key Features
------------

- **Dynamic Batch Formation**: Batches are created on-the-fly based on sample lengths, maximizing batch size while staying within token limits
- **Load Balancing**: Balances the number of tokens across data parallel ranks, reducing padding waste
- **Step-Based Training**: Training is controlled by optimizer steps (``max_num_steps`` in ``[train]``), not epochs. User-provided epoch configuration is ignored
- **Automatic Data Looping**: When ``infinite_loop = true`` (default), data automatically restarts when exhausted, with the epoch incremented to produce a new data ordering
- **Gradient Accumulation Support**: Built-in support for accumulating multiple batches per optimizer step
- **Resume Support**: Properly handles training resumption with deterministic data ordering based on ``train_step``

How It Works
------------

The load-balanced batching system consists of two main components:

1. **ShardedIterableDataset**: Shards the base dataset across data parallel ranks and converts it to an ``IterableDataset``
2. **LoadBalancedDataset**: Maintains a pool of samples and dynamically creates batches using a best-fit strategy

**Training Mode**:

- When ``enable_dp_load_balancing = true``, training is **step-based**, not epoch-based
- Training duration is controlled by ``max_num_steps`` in ``[train]`` (the number of optimizer steps)
- The user-provided ``epoch`` configuration parameter is **ignored**
- Epoch is managed internally for deterministic data ordering (different epoch = different shuffle)
- When ``infinite_loop = true`` (default), data automatically restarts when exhausted, with the epoch incremented
- Each rank may consume data at a different rate due to dynamic batching, but training stops once ``total_steps`` is reached

Batch Formation Strategy
------------------------

The system uses a pool-based approach:

1. **Sample Pool**: Each rank maintains a pool of samples (default: 32 samples)
2. **Best-Fit Selection**: When forming a batch, the system selects samples from the pool based on the batching mode, as sketched after this section

**Without Sequence Packing** (default):

- Maximizes ``batch_size * max_input_len`` while staying within ``max_tokens_for_batch``
- Uses padding to align sequences to the same length
- Batching strategies:

  - ``prefer_closest``: Selects samples with lengths closest to existing samples in the batch (minimizes padding)
  - ``prefer_first``: FIFO selection (faster but may have more padding)

**With Sequence Packing** (``sequence_packing = true``):

- Maximizes total tokens (the sum of all sequence lengths) while staying within ``max_tokens_for_batch``
- Multiple sequences are packed into a single tensor without padding
- Uses a simpler greedy strategy: adds sequences until the total token count would exceed the limit
- More efficient token utilization, but requires model support for sequence packing
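The padded (non-packing) mode can be pictured with a short sketch. This is a minimal illustration of the token-budget check in its ``prefer_first`` flavor, assuming each sample is a list of token IDs; the function name and structure are hypothetical and do not mirror the actual cosmos-rl implementation:

.. code-block:: python

   # Hypothetical sketch of padded-mode batch formation (prefer_first
   # flavor); not the actual cosmos-rl implementation.
   from typing import List


   def form_padded_batch(pool: List[List[int]],
                         max_tokens_for_batch: int) -> List[List[int]]:
       """Grow a batch while batch_size * max_len stays within the budget."""
       batch: List[List[int]] = []
       max_len = 0
       while pool:
           chosen = None
           for i, sample in enumerate(pool):
               new_max_len = max(max_len, len(sample))
               # Padded cost: every sequence is padded to the longest one.
               if (len(batch) + 1) * new_max_len <= max_tokens_for_batch:
                   chosen = i
                   break
           if chosen is None:
               break  # no remaining sample fits the budget
           sample = pool.pop(chosen)
           max_len = max(max_len, len(sample))
           batch.append(sample)
       return batch

A ``prefer_closest`` variant would score each candidate by how little padding it adds instead of taking the first fit; a sketch of that scoring appears after the parameter reference below.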
Configuration
-------------

Load-balanced batching is configured through the training policy configuration:

.. code-block:: toml

   [train]
   max_num_steps = 100  # Required when enable_dp_load_balancing is true
   sequence_packing = false  # Set to true to enable sequence packing

   [train.train_policy]
   enable_dp_load_balancing = true
   load_balanced_pool_size = 32
   load_balanced_max_tokens_for_batch = 32768
   load_balanced_batching_strategy = "prefer_closest"  # or "prefer_first"
   load_balanced_batches_per_optimizer_step = 1  # Also known as load_balanced_accumulate_steps

Configuration Parameters
------------------------

enable_dp_load_balancing
    Enable load-balanced dynamic batching (default: false)

load_balanced_pool_size
    Size of the sample pool maintained by each rank (default: 32). Larger pool sizes allow better batch formation but use more memory.

load_balanced_max_tokens_for_batch
    Maximum number of tokens per batch (default: 32768). This is the primary constraint for batch formation. The system will create batches that maximize ``batch_size * max_input_len`` while staying within this limit.

load_balanced_batching_strategy
    Batching strategy: "prefer_closest" or "prefer_first" (default: "prefer_closest"). A scoring sketch follows this parameter list.

    - ``prefer_closest``: Minimizes padding by selecting samples with similar lengths
    - ``prefer_first``: FIFO selection, faster but may have more padding

max_num_steps (in ``[train]``)
    Maximum number of optimizer steps (training steps). **Required** when ``enable_dp_load_balancing = true``. This defines the number of times ``optimizer.step()`` will be called. The actual number of batches processed will be ``max_num_steps * load_balanced_batches_per_optimizer_step``.

    **Important**: When ``enable_dp_load_balancing = true``, training is **step-based**, not epoch-based. The user-provided ``epoch`` configuration parameter is ignored. The system uses ``max_num_steps`` (in the ``[train]`` section) to determine when training should stop.

load_balanced_batches_per_optimizer_step
    Number of batches to accumulate per optimizer step for gradient accumulation (default: 1). Each DataLoader iteration returns this many batches, which are processed before calling ``optimizer.step()``. The total number of batches processed is ``max_num_steps`` (in ``[train]``) * ``load_balanced_batches_per_optimizer_step``.

sequence_packing
    Enable sequence packing for training (default: false). When enabled, multiple sequences are packed into a single tensor without padding, maximizing token utilization. The batch formation strategy changes from maximizing ``batch_size * max_input_len`` to maximizing ``sum(sequence_lengths)`` within the token limit.

    **Important**: Sequence packing requires model support. Not all models support sequence packing. The system will check compatibility and warn if the model doesn't support it.

    When sequence packing is enabled:

    - The batching strategy (``prefer_closest`` vs ``prefer_first``) is ignored
    - A greedy algorithm is used: sequences are added until the total token count would exceed the limit
    - Token utilization is more efficient than with padding-based batching
    - The model must handle variable-length sequences within a batch
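To make the two strategies concrete, here is a minimal, hypothetical scoring helper. It only illustrates how candidates could be ranked; the token-budget check from batch formation is omitted, and the real cosmos-rl scoring may differ:

.. code-block:: python

   # Hypothetical illustration of the two batching strategies; the
   # budget check is omitted and the real scoring may differ.
   from typing import List, Optional


   def pick_next_sample(pool: List[List[int]],
                        batch_max_len: int,
                        strategy: str) -> Optional[int]:
       """Return the pool index of the next sample to add, or None."""
       if not pool:
           return None
       if strategy == "prefer_first":
           return 0  # FIFO: take the oldest sample
       # prefer_closest: the smaller the gap to the current batch's max
       # length, the less padding the candidate introduces.
       gaps = [abs(len(sample) - batch_max_len) for sample in pool]
       return gaps.index(min(gaps))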
Usage Example
-------------

Here's a complete configuration example:

.. code-block:: toml

   [train]
   max_num_steps = 1000

   [train.train_policy]
   enable_dp_load_balancing = true
   load_balanced_pool_size = 64
   load_balanced_max_tokens_for_batch = 65536
   load_balanced_batching_strategy = "prefer_closest"
   load_balanced_batches_per_optimizer_step = 4
   dataloader_seed = 42

In this example:

- Each rank maintains a pool of 64 samples
- Maximum tokens per batch is 65536
- The "prefer_closest" strategy is used to minimize padding
- Training will run for 1000 optimizer steps
- Each optimizer step accumulates 4 batches (gradient accumulation)
- Total batches processed = 1000 * 4 = 4000 batches

Example with Sequence Packing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: toml

   [train]
   sequence_packing = true
   max_num_steps = 1000

   [train.train_policy]
   enable_dp_load_balancing = true
   load_balanced_pool_size = 64
   load_balanced_max_tokens_for_batch = 65536
   load_balanced_batches_per_optimizer_step = 4
   dataloader_seed = 42

In this example with sequence packing:

- Sequence packing is enabled (no padding needed)
- Maximum total tokens per batch is 65536 (the sum of all sequence lengths)
- The batching strategy is automatically set to greedy packing
- Token utilization is more efficient than with padding-based batching

Gradient Accumulation
---------------------

Load-balanced batching supports gradient accumulation at the data loading level. When ``load_balanced_batches_per_optimizer_step > 1``:

1. Each DataLoader iteration returns a list of batches (instead of a single batch)
2. The trainer processes all batches in the list, accumulating gradients
3. A single ``optimizer.step()`` is called after processing all batches

This approach moves gradient accumulation logic from the trainer to the data loading layer, providing better modularity and efficiency. The sketch below shows the resulting control flow.
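The following is a minimal sketch of a training loop consuming batch groups. The ``dataloader``, ``model``, and ``optimizer`` objects and the loss access pattern are placeholders, not cosmos-rl APIs:

.. code-block:: python

   # Minimal sketch of step-based training with batch groups;
   # `model(**batch).loss` is a placeholder access pattern.
   def train(dataloader, model, optimizer, max_num_steps: int):
       for train_step, batch_group in enumerate(dataloader):
           if train_step >= max_num_steps:
               break  # step-based stop condition, independent of epochs
           optimizer.zero_grad()
           # `batch_group` is a list of batches when
           # load_balanced_batches_per_optimizer_step > 1.
           for batch in batch_group:
               loss = model(**batch).loss / len(batch_group)  # scale for accumulation
               loss.backward()  # gradients accumulate across the group
           optimizer.step()  # exactly one optimizer step per group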
Infinite Loop and Epoch Management
----------------------------------

When ``enable_dp_load_balancing = true``, the system uses an ``infinite_loop`` parameter to control data iteration behavior:

**Infinite Loop (default: true)**:

- When ``infinite_loop = true``: Data automatically restarts when exhausted

  - The epoch is automatically incremented each time data restarts
  - Different epochs use different random seeds for data shuffling (deterministic but varied ordering)
  - This ensures training can reach ``max_num_steps`` even if data is exhausted
  - Recommended for step-based training, where you want to train for a fixed number of optimizer steps

- When ``infinite_loop = false``: Data iteration stops when exhausted

  - Training stops when all data has been processed
  - Not recommended for step-based training, as training may stop before reaching ``max_num_steps``

**Epoch Management**:

- Epoch is **managed internally** and is used only for deterministic data ordering
- The user-provided ``epoch`` configuration parameter is **ignored** when ``enable_dp_load_balancing = true``
- Epoch does not control training duration (training is step-based, controlled by ``max_num_steps``)
- When data restarts (``infinite_loop = true``), the epoch is automatically incremented to ensure a different data ordering
- The initial epoch is set to 0 when resuming training

**Why Infinite Loop?**:

- In dynamic batching, different ranks consume data at different rates
- Some ranks may exhaust their data shard before reaching ``max_num_steps``
- With ``infinite_loop = true``, data automatically restarts, ensuring all ranks can continue training
- Training stops when ``train_step >= total_steps`` (where ``total_steps = max_num_steps``), not when data is exhausted

Resume Support
--------------

Load-balanced batching properly handles training resumption:

1. **Step-Based Training**: Training is based on optimizer steps (``max_num_steps``), not epochs
2. **Automatic Epoch Management**: Epoch is automatically managed internally for deterministic data ordering

   - The initial epoch is set to 0 when resuming
   - The epoch is automatically incremented when data loops (if ``infinite_loop = true``)
   - Different epochs use different random seeds for data shuffling

3. **Batch Skipping**: Batch groups that have already been processed are skipped based on ``train_step``

The resume logic ensures that:

- Data ordering matches the original training run (deterministic shuffling)
- Only batches within the current step range are skipped
- Training continues from the correct position

**Note**: The user-provided ``epoch`` configuration parameter is **ignored** when ``enable_dp_load_balancing = true``. Epoch is managed internally for data ordering purposes only.

Implementation Details
----------------------

ShardedIterableDataset
~~~~~~~~~~~~~~~~~~~~~~

The ``ShardedIterableDataset`` class:

- Shards the base dataset across data parallel ranks
- Converts a regular ``Dataset`` to an ``IterableDataset``
- Supports deterministic shuffling based on the epoch number
- Ensures each rank only sees its portion of the data

LoadBalancedDataset
~~~~~~~~~~~~~~~~~~~

The ``LoadBalancedDataset`` class:

- Maintains a pool of samples for dynamic batch formation
- Implements best-fit batching strategies (with or without sequence packing)
- Supports gradient accumulation by yielding multiple batches per iteration
- Provides ``set_epoch()`` and ``skip_batches()`` methods for resume support
- Supports automatic data looping via the ``infinite_loop`` parameter:

  - When ``infinite_loop = true`` (default): Automatically restarts data iteration when exhausted, incrementing the epoch for a new data ordering
  - When ``infinite_loop = false``: Stops iteration when data is exhausted

- Adapts the batch formation algorithm based on the ``seq_packing_enabled`` flag:

  - Without packing: maximizes ``batch_size * max_length`` (with padding)
  - With packing: maximizes the sum of sequence lengths (without padding)

**Epoch Management**:

- Epoch is managed internally for deterministic data ordering (different epoch = different shuffle)
- When ``infinite_loop = true``, the epoch is automatically incremented each time data restarts
- Epoch does not control training duration (training is step-based, controlled by ``max_num_steps``)

The sketch below illustrates this looping and epoch behavior.
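The interaction of ``infinite_loop``, ``set_epoch()``, and deterministic shuffling can be condensed into a small sketch. The class below is a simplified stand-in, assuming an epoch-seeded shuffle; it is not the actual ``LoadBalancedDataset`` implementation:

.. code-block:: python

   # Simplified stand-in for epoch-seeded, optionally infinite iteration;
   # not the actual LoadBalancedDataset implementation.
   import random
   from typing import Iterator, List


   class LoopingShardIterator:
       def __init__(self, shard: List, seed: int, infinite_loop: bool = True):
           self.shard = shard
           self.seed = seed
           self.infinite_loop = infinite_loop
           self.epoch = 0

       def set_epoch(self, epoch: int) -> None:
           self.epoch = epoch  # controls ordering only, never duration

       def __iter__(self) -> Iterator:
           while True:
               # Deterministic but epoch-dependent ordering: the shuffle
               # seed combines the base seed with the current epoch.
               order = list(range(len(self.shard)))
               random.Random(self.seed + self.epoch).shuffle(order)
               for idx in order:
                   yield self.shard[idx]
               if not self.infinite_loop:
                   return  # stop once the shard is exhausted
               self.epoch += 1  # restart with a new ordering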
Best-Fit Algorithm
~~~~~~~~~~~~~~~~~~

The best-fit algorithm works differently depending on whether sequence packing is enabled:

**Without Sequence Packing** (default):

1. Start with an empty batch
2. For each sample in the pool:

   - Calculate the new batch size and max length if this sample were added
   - Check that the total tokens (``batch_size * max_length``) would not exceed ``max_tokens_for_batch``
   - If valid, calculate a score based on the batching strategy

3. Select the sample with the best score (highest for "prefer_closest", first for "prefer_first")
4. Add the sample to the batch and remove it from the pool
5. Repeat until no more samples can be added

**With Sequence Packing** (``sequence_packing = true``):

1. Start with an empty batch
2. For each sample in the pool (in order):

   - Calculate the new total tokens (the sum of all sequence lengths) if this sample were added
   - Check that the total tokens would not exceed ``max_tokens_for_batch``
   - If valid, add the sample to the batch and remove it from the pool

3. Repeat until no more samples can be added

The sequence packing algorithm uses a greedy approach: it adds sequences until the total token count would exceed the limit, maximizing token utilization without padding. A sketch of this greedy loop follows.
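As an illustration, the greedy packing pass might look like the following. The function name and types are hypothetical; the actual implementation lives in ``LoadBalancedDataset``:

.. code-block:: python

   # Hypothetical sketch of the greedy packing pass described above.
   from typing import List


   def form_packed_batch(pool: List[List[int]],
                         max_tokens_for_batch: int) -> List[List[int]]:
       """Pack whole sequences until no remaining sample fits the budget."""
       batch: List[List[int]] = []
       total_tokens = 0
       i = 0
       while i < len(pool):
           sample = pool[i]
           if total_tokens + len(sample) <= max_tokens_for_batch:
               batch.append(pool.pop(i))  # consume from the pool in order
               total_tokens += len(sample)  # no padding: cost is true length
           else:
               i += 1  # this sample no longer fits; try the next one
       return batch

With no padding, the token cost of a batch is exactly ``sum(sequence_lengths)``, which is why packing can reach higher utilization than the padded mode.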
Advantages
----------

1. **Reduced Padding Waste**: By grouping samples with similar lengths, padding is minimized. With sequence packing enabled, padding is completely eliminated
2. **Better GPU Utilization**: More tokens per batch means better GPU utilization
3. **Flexible Batch Sizes**: Adapts to varying sample lengths automatically
4. **Distributed Training Friendly**: Balances load across ranks while maintaining the data distribution
5. **Sequence Packing Support**: When enabled, eliminates padding entirely by packing multiple sequences into a single tensor, maximizing token efficiency

Limitations
-----------

1. **Approximate Length**: ``len(dataset)`` returns an approximate value based on the sample count, not the actual batch count
2. **Memory Overhead**: Maintaining a sample pool requires additional memory
3. **Step-Based Training**: When ``enable_dp_load_balancing = true``, training is step-based, not epoch-based; the user-provided epoch configuration is ignored
4. **Epoch Management**: Epoch is managed internally for data ordering purposes only; it does not control training duration
5. **Deterministic Resume**: Resume is deterministic based on ``train_step``, but exact batch composition may vary slightly due to dynamic batching

Best Practices
--------------

1. **Pool Size**: Choose a pool size that balances memory usage and batch quality. Larger pools (64-128) work well for most cases
2. **Max Tokens**: Set ``max_tokens_for_batch`` based on your GPU memory and model size. Common values: 32768, 65536, 131072
3. **Batching Strategy**: Use "prefer_closest" for better efficiency, or "prefer_first" if speed is more important (only applies when sequence packing is disabled)
4. **Sequence Packing**: Enable sequence packing if your model supports it for better token utilization. Check model compatibility before enabling
5. **Gradient Accumulation**: Use ``load_balanced_batches_per_optimizer_step`` to control the effective batch size
6. **Seed**: Set ``dataloader_seed`` for reproducibility
7. **Step-Based Training**: Remember that when ``enable_dp_load_balancing = true``, training is step-based. Set ``max_num_steps`` (in ``[train]``) to control training duration, not ``epoch``
8. **Infinite Loop**: Keep ``infinite_loop = true`` (default) for step-based training. Data will automatically restart when exhausted, ensuring training can reach ``max_num_steps``

When to Use Sequence Packing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sequence packing is recommended when:

- Your model supports variable-length sequences within a batch
- You want to maximize token utilization (reduce padding waste)
- You have sequences with highly variable lengths
- Your model architecture can handle packed sequences efficiently

Sequence packing is NOT recommended when:

- Your model doesn't support sequence packing (check compatibility)
- You need fixed-size batches for certain operations
- The overhead of handling variable-length sequences outweighs the benefits

Troubleshooting
---------------

**Issue**: Training seems slower than expected

- Check whether ``load_balanced_pool_size`` is too small
- Verify that ``max_tokens_for_batch`` is appropriate for your hardware
- Consider the "prefer_first" strategy for faster batch formation

**Issue**: Out of memory errors

- Reduce ``load_balanced_max_tokens_for_batch``
- Reduce ``load_balanced_pool_size``
- Reduce ``load_balanced_batches_per_optimizer_step``

**Issue**: Resume doesn't work correctly

- Ensure ``max_num_steps`` (in ``[train]``) matches the original training configuration
- Check that the checkpoint contains correct ``train_step`` information
- Verify that ``dataloader_seed`` is the same as in the original training run
- Note: the user-provided ``epoch`` parameter is ignored when ``enable_dp_load_balancing = true``; epoch is managed internally

**Issue**: Data keeps looping indefinitely

- This is expected behavior when ``infinite_loop = true`` (default)
- Training stops based on ``max_num_steps``, not data exhaustion
- Set ``infinite_loop = false`` if you want iteration to stop when data is exhausted (not recommended for step-based training)

**Issue**: Sequence packing not working

- Verify that ``sequence_packing = true`` is set in the configuration
- Check whether your model supports sequence packing (the system will warn if not)
- Ensure ``enable_dp_load_balancing = true`` is also set
- Check model compatibility: not all models support sequence packing

Related Documentation
---------------------

- :doc:`configuration` - General configuration guide
- :doc:`dataflow` - Data flow in cosmos-rl