Load-Balanced Dynamic Batching

Overview

Load-balanced dynamic batching is a data loading strategy designed to minimize padding waste and improve training efficiency in distributed training scenarios. Unlike traditional fixed-size batching, this approach dynamically creates batches that maximize token utilization while respecting a maximum token constraint per batch.

Key Features

  • Dynamic Batch Formation: Batches are created on-the-fly based on sample lengths, maximizing batch size while staying within token limits

  • Load Balancing: Balances the number of tokens across different data parallel ranks, reducing padding waste

  • Step-Based Training: Training is controlled by optimizer steps (max_num_steps in [train]), not epochs. User-provided epoch configuration is ignored

  • Automatic Data Looping: When infinite_loop = true (default), data automatically restarts when exhausted, with epoch incremented for new data ordering

  • Gradient Accumulation Support: Built-in support for accumulating multiple batches per optimizer step

  • Resume Support: Properly handles training resumption with deterministic data ordering based on train_step

How It Works

The load-balanced batching system consists of two main components:

  1. ShardedIterableDataset: Shards the base dataset across data parallel ranks and converts it to an IterableDataset

  2. LoadBalancedDataset: Maintains a pool of samples and dynamically creates batches using a best-fit strategy

Training Mode: - When enable_dp_load_balancing = true, training is step-based, not epoch-based - Training duration is controlled by max_num_steps in [train] (number of optimizer steps) - User-provided epoch configuration parameter is ignored - Epoch is managed internally for deterministic data ordering (different epoch = different shuffle) - When infinite_loop = true (default), data automatically restarts when exhausted, with epoch incremented - Each rank may consume data at different rates due to dynamic batching, but training stops when total_steps is reached

Batch Formation Strategy

The system uses a pool-based approach:

  1. Sample Pool: Each rank maintains a pool of samples (default: 32 samples)

  2. Best-Fit Selection: When forming a batch, the system selects samples from the pool based on the batching mode:

    Without Sequence Packing (default): - Maximizes batch_size * max_input_len while staying within max_tokens_for_batch - Uses padding to align sequences to the same length - Batching strategies:

    • prefer_closest: Selects samples with lengths closest to existing samples in the batch (minimizes padding)

    • prefer_first: FIFO selection (faster but may have more padding)

    With Sequence Packing (sequence_packing = true): - Maximizes total tokens (sum of all sequence lengths) while staying within max_tokens_for_batch - Multiple sequences are packed into a single tensor without padding - Uses a simpler greedy strategy: adds sequences until total tokens exceed the limit - More efficient token utilization, but requires model support for sequence packing

Configuration

Load-balanced batching is configured through the training policy configuration:

[train]
max_num_steps = 100  # Required when enable_dp_load_balancing is true
sequence_packing = false  # Set to true to enable sequence packing

[train.train_policy]
enable_dp_load_balancing = true
load_balanced_pool_size = 32
load_balanced_max_tokens_for_batch = 32768
load_balanced_batching_strategy = "prefer_closest"  # or "prefer_first"
load_balanced_batches_per_optimizer_step = 1  # Also known as load_balanced_accumulate_steps

Configuration Parameters

enable_dp_load_balancing

Enable load-balanced dynamic batching (default: false)

load_balanced_pool_size

Size of the sample pool maintained by each rank (default: 32)

Larger pool sizes allow better batch formation but use more memory.

load_balanced_max_tokens_for_batch

Maximum number of tokens per batch (default: 32768)

This is the primary constraint for batch formation. The system will create batches that maximize batch_size * max_input_len while staying within this limit.

load_balanced_batching_strategy

Batching strategy: “prefer_closest” or “prefer_first” (default: “prefer_closest”)

  • prefer_closest: Minimizes padding by selecting samples with similar lengths

  • prefer_first: FIFO selection, faster but may have more padding

max_num_steps (in [train])

Maximum number of optimizer steps (training steps). Required when enable_dp_load_balancing = true.

This defines the number of times optimizer.step() will be called. The actual number of batches processed will be max_num_steps * load_balanced_batches_per_optimizer_step.

Important: When enable_dp_load_balancing = true, training is step-based, not epoch-based. The user-provided epoch configuration parameter is ignored. The system uses max_num_steps (in [train] section) to determine when training should stop.

load_balanced_batches_per_optimizer_step

Number of batches to accumulate per optimizer step for gradient accumulation (default: 1)

Each DataLoader iteration will return this many batches, which are processed before calling optimizer.step(). The total number of batches processed = max_num_steps (in [train]) * load_balanced_batches_per_optimizer_step.

sequence_packing

Enable sequence packing for training (default: false)

When enabled, multiple sequences are packed into a single tensor without padding, maximizing token utilization. The batch formation strategy changes from maximizing batch_size * max_input_len to maximizing sum(sequence_lengths) within the token limit.

Important: Sequence packing requires model support. Not all models support sequence packing. The system will check compatibility and warn if the model doesn’t support it.

When sequence packing is enabled: - The batching strategy (prefer_closest vs prefer_first) is ignored - A greedy algorithm is used: sequences are added until total tokens exceed the limit - More efficient token utilization compared to padding-based batching - Requires the model to handle variable-length sequences within a batch

Usage Example

Here’s a complete configuration example:

[train]
max_num_steps = 1000

[train.train_policy]
enable_dp_load_balancing = true
load_balanced_pool_size = 64
load_balanced_max_tokens_for_batch = 65536
load_balanced_batching_strategy = "prefer_closest"
load_balanced_batches_per_optimizer_step = 4
dataloader_seed = 42

In this example: - Each rank maintains a pool of 64 samples - Maximum tokens per batch is 65536 - Uses “prefer_closest” strategy to minimize padding - Training will run for 1000 optimizer steps - Each optimizer step accumulates 4 batches (gradient accumulation) - Total batches processed = 1000 * 4 = 4000 batches

Example with Sequence Packing

[train]
sequence_packing = true
max_num_steps = 1000

[train.train_policy]
enable_dp_load_balancing = true
load_balanced_pool_size = 64
load_balanced_max_tokens_for_batch = 65536
load_balanced_batches_per_optimizer_step = 4
dataloader_seed = 42

In this example with sequence packing: - Sequence packing is enabled (no padding needed) - Maximum total tokens per batch is 65536 (sum of all sequence lengths) - The batching strategy is automatically set to greedy packing - More efficient token utilization compared to padding-based batching

Gradient Accumulation

Load-balanced batching supports gradient accumulation at the data loading level. When load_balanced_batches_per_optimizer_step > 1:

  1. Each DataLoader iteration returns a list of batches (instead of a single batch)

  2. The trainer processes all batches in the list, accumulating gradients

  3. A single optimizer.step() is called after processing all batches

This approach moves gradient accumulation logic from the trainer to the data loading layer, providing better modularity and efficiency.

Infinite Loop and Epoch Management

When enable_dp_load_balancing = true, the system uses an infinite_loop parameter to control data iteration behavior:

Infinite Loop (default: true): - When infinite_loop = true: Data automatically restarts when exhausted

  • Epoch is automatically incremented each time data restarts

  • Different epochs use different random seeds for data shuffling (deterministic but varied ordering)

  • This ensures training can reach max_num_steps even if data is exhausted

  • Recommended for step-based training where you want to train for a fixed number of optimizer steps

  • When infinite_loop = false: Data stops when exhausted - Training stops when all data has been processed - Not recommended for step-based training as training may stop before reaching max_num_steps

Epoch Management: - Epoch is managed internally and is used only for deterministic data ordering - User-provided epoch configuration parameter is ignored when enable_dp_load_balancing = true - Epoch does not control training duration (training is step-based, controlled by max_num_steps) - When data restarts (infinite_loop = true), epoch is automatically incremented to ensure different data ordering - Initial epoch is set to 0 when resuming training

Why Infinite Loop?: - In dynamic batching, different ranks consume data at different rates - Some ranks may exhaust their data shard before reaching max_num_steps - With infinite_loop = true, data automatically restarts, ensuring all ranks can continue training - Training stops when train_step >= total_steps (where total_steps = max_num_steps), not when data is exhausted

Resume Support

Load-balanced batching properly handles training resumption:

  1. Step-Based Training: Training is based on optimizer steps (max_num_steps), not epochs

  2. Automatic Epoch Management: Epoch is automatically managed internally for deterministic data ordering - Initial epoch is set to 0 when resuming - Epoch is automatically incremented when data loops (if infinite_loop = true) - Different epochs use different random seeds for data shuffling

  3. Batch Skipping: Skips batch groups that have already been processed based on train_step

The resume logic ensures that: - Data ordering matches the original training (deterministic shuffling) - Only batches within the current step range are skipped - Training continues from the correct position

Note: The user-provided epoch configuration parameter is ignored when enable_dp_load_balancing = true. Epoch is managed internally for data ordering purposes only.

Implementation Details

ShardedIterableDataset

The ShardedIterableDataset class:

  • Shards the base dataset across data parallel ranks

  • Converts a regular Dataset to an IterableDataset

  • Supports deterministic shuffling based on epoch number

  • Ensures each rank only sees its portion of the data

LoadBalancedDataset

The LoadBalancedDataset class:

  • Maintains a pool of samples for dynamic batch formation

  • Implements best-fit batching strategies (with or without sequence packing)

  • Supports gradient accumulation by yielding multiple batches per iteration

  • Provides set_epoch() and skip_batches() methods for resume support

  • Supports automatic data looping with infinite_loop parameter: - When infinite_loop = true (default): Automatically restarts data iteration when exhausted, incrementing epoch for new data ordering - When infinite_loop = false: Stops iteration when data is exhausted

  • Adapts batch formation algorithm based on seq_packing_enabled flag: - Without packing: maximizes batch_size * max_length (with padding) - With packing: maximizes sum of sequence lengths (without padding)

Epoch Management: - Epoch is managed internally for deterministic data ordering (different epoch = different shuffle) - When infinite_loop = true, epoch is automatically incremented each time data restarts - Epoch does not control training duration (training is step-based, controlled by max_num_steps)

Best-Fit Algorithm

The best-fit algorithm works differently depending on whether sequence packing is enabled:

Without Sequence Packing (default):

  1. Start with an empty batch

  2. For each sample in the pool: - Calculate the new batch size and max length if this sample is added - Check if the total tokens (batch_size * max_length) <= max_tokens_for_batch - If valid, calculate a score based on the batching strategy

  3. Select the sample with the best score (highest for “prefer_closest”, first for “prefer_first”)

  4. Add the sample to the batch and remove it from the pool

  5. Repeat until no more samples can be added

With Sequence Packing (sequence_packing = true):

  1. Start with an empty batch

  2. For each sample in the pool (in order): - Calculate the new total tokens (sum of all sequence lengths) if this sample is added - Check if the total tokens <= max_tokens_for_batch - If valid, add the sample to the batch and remove it from the pool

  3. Repeat until no more samples can be added

The sequence packing algorithm uses a greedy approach: it adds sequences until the total token count exceeds the limit, maximizing token utilization without padding.

Advantages

  1. Reduced Padding Waste: By grouping samples with similar lengths, padding is minimized. With sequence packing enabled, padding is completely eliminated

  2. Better GPU Utilization: More tokens per batch means better GPU utilization

  3. Flexible Batch Sizes: Adapts to varying sample lengths automatically

  4. Distributed Training Friendly: Balances load across ranks while maintaining data distribution

  5. Sequence Packing Support: When enabled, eliminates padding entirely by packing multiple sequences into a single tensor, maximizing token efficiency

Limitations

  1. Approximate Length: len(dataset) returns an approximate value based on sample count, not actual batch count

  2. Memory Overhead: Maintaining a sample pool requires additional memory

  3. Step-Based Training: When enable_dp_load_balancing = true, training is step-based, not epoch-based. User-provided epoch configuration is ignored

  4. Epoch Management: Epoch is managed internally for data ordering purposes only. It does not control training duration

  5. Deterministic Resume: Resume is deterministic based on train_step, but exact batch composition may vary slightly due to dynamic batching

Best Practices

  1. Pool Size: Choose a pool size that balances memory usage and batch quality. Larger pools (64-128) work well for most cases

  2. Max Tokens: Set max_tokens_for_batch based on your GPU memory and model size. Common values: 32768, 65536, 131072

  3. Batching Strategy: Use “prefer_closest” for better efficiency, “prefer_first” if speed is more important (only applies when sequence packing is disabled)

  4. Sequence Packing: Enable sequence packing if your model supports it for better token utilization. Check model compatibility before enabling

  5. Gradient Accumulation: Use load_balanced_batches_per_optimizer_step to control effective batch size

  6. Seed: Set dataloader_seed for reproducibility

  7. Step-Based Training: Remember that when enable_dp_load_balancing = true, training is step-based. Set max_num_steps (in [train]) to control training duration, not epoch

  8. Infinite Loop: Keep infinite_loop = true (default) for step-based training. Data will automatically restart when exhausted, ensuring training can reach max_num_steps

When to Use Sequence Packing

Sequence packing is recommended when: - Your model supports variable-length sequences within a batch - You want to maximize token utilization (reduce padding waste) - You have sequences with highly variable lengths - Your model architecture can handle packed sequences efficiently

Sequence packing is NOT recommended when: - Your model doesn’t support sequence packing (check compatibility) - You need fixed-size batches for certain operations - The overhead of handling variable-length sequences outweighs the benefits

Troubleshooting

Issue: Training seems slower than expected
  • Check if load_balanced_pool_size is too small

  • Verify max_tokens_for_batch is appropriate for your hardware

  • Consider using “prefer_first” strategy for faster batch formation

Issue: Out of memory errors
  • Reduce load_balanced_max_tokens_for_batch

  • Reduce load_balanced_pool_size

  • Reduce load_balanced_batches_per_optimizer_step

Issue: Resume doesn’t work correctly
  • Ensure max_num_steps (in [train]) matches the original training configuration

  • Check that checkpoint contains correct train_step information

  • Verify dataloader_seed is the same as original training

  • Note: User-provided epoch parameter is ignored when enable_dp_load_balancing = true. Epoch is managed internally

Issue: Data keeps looping indefinitely
  • This is expected behavior when infinite_loop = true (default)

  • Training stops based on max_num_steps, not data exhaustion

  • Set infinite_loop = false if you want data to stop when exhausted (not recommended for step-based training)

Issue: Sequence packing not working
  • Verify that sequence_packing = true is set in the configuration

  • Check if your model supports sequence packing (the system will warn if not)

  • Ensure enable_dp_load_balancing = true is also set

  • Check model compatibility: not all models support sequence packing