, optimizer, model_config, optimizer_config)

A helper function that integrates the model and optimizer with the ZeRO optimizer and offloading.

  • model (torch.nn.Module) – Your model object

  • optimizer_config (dict) – Your optimizer configuration


Returns

(model, optimizer)

Return type


class, shard_strategy, process_group=None, reduce_scatter_process_group=None, reduce_scatter_bucket_size_mb=25, fp32_reduce_scatter=False, tensor_placement_policy='cuda', gradient_predivide_factor=1.0, reuse_fp16_shard=False, *args, **kwargs)

A wrapper for a PyTorch module that shards the model parameters across the memory of multiple GPUs. Only 1/#nproc of the parameters and gradients are stored in local CUDA memory, so forward and backward passes can run within a limited CUDA memory budget.
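To make the 1/#nproc figure concrete, the following is a minimal sketch of how a flat tensor could be split evenly across ranks. The helper names `shard_numel` and `shard_bounds` are hypothetical and are not part of the library; the even split with a padded tail is an assumption for illustration only.

```python
from typing import Tuple


def shard_numel(total_numel: int, nproc: int) -> int:
    """Elements each rank holds when a flat tensor is split evenly (tail is padded)."""
    return (total_numel + nproc - 1) // nproc  # integer ceiling division


def shard_bounds(total_numel: int, nproc: int, rank: int) -> Tuple[int, int]:
    """Half-open [start, end) range of real (unpadded) elements owned by `rank`."""
    size = shard_numel(total_numel, nproc)
    start = min(rank * size, total_numel)
    end = min(start + size, total_numel)
    return start, end


if __name__ == "__main__":
    # Illustrative numbers: 10 elements split over 4 processes.
    for rank in range(4):
        print(rank, shard_bounds(10, 4, rank))
```

Each rank keeps roughly a quarter of the elements in this example, which is why the local CUDA memory needed for parameters and gradients shrinks as the number of processes grows.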


You must use ShardedModelV2 with ShardedOptimizerV2.


If you enable reuse_fp16_shard, make sure you do not use gradient accumulation and that your optimizer can work with fp16 gradients and fp32 parameters.

  • module (nn.Module) – A sharded module, which must be initialized by ZeroInitContext.

  • shard_strategy (BaseShardStrategy) – A shard strategy to manage shard behavior.

  • process_group (Optional[ProcessGroup], optional) – Data parallel process group. Defaults to None.

  • reduce_scatter_process_group (Optional[ProcessGroup], optional) – Reduce-scatter process group. Generally it should be None, in which case process_group is used. Defaults to None.

  • reduce_scatter_bucket_size_mb (int, optional) – Reduce-scatter bucket size in MB. Defaults to 25.

  • fp32_reduce_scatter (bool, optional) – If set to True, gradients are forced to FP32 before reduce-scatter. Defaults to False.

  • tensor_placement_policy (str) – The device on which held tensors are placed. It can be ‘cpu’, ‘cuda’ or ‘auto’. If it’s ‘cpu’, parameters, gradients and optimizer states are offloaded to CPU, which minimizes CUDA memory usage. If it’s ‘cuda’, they are not offloaded, which maximizes CUDA memory usage. If it’s ‘auto’, they are moved dynamically based on CPU and CUDA memory usage, so that the heterogeneous memory space is used in a balanced way. Note that the ‘auto’ policy only works well when no other processes use CUDA during your training. Defaults to ‘cuda’.

  • gradient_predivide_factor (Optional[float], optional) – The gradient is divided by this value before reduce-scatter. Defaults to 1.0.

  • reuse_fp16_shard (bool, optional) – Whether to reuse fp16 shard for param and grad. Enabling this can reduce GPU memory usage, but you have to make sure you disable it when using gradient accumulation. In this mode, grad will be fp16. Make sure your optimizer supports mixed precision (fp32 param and fp16 grad). We find that PyTorch’s optimizers don’t support mixed precision, so we recommend you enable this only when using our CPUAdam with CPU offload. Defaults to False.
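As an illustration of gradient_predivide_factor, here is a minimal sketch of averaging a scalar gradient across ranks with a pre-division step. Only the division before the reduction is stated above; the post-division by world_size / predivide_factor is an assumption borrowed from common data-parallel implementations, and the function name `averaged_gradient` is hypothetical.

```python
from typing import List


def averaged_gradient(per_rank_grads: List[float], predivide_factor: float = 1.0) -> float:
    """Average one scalar gradient over all ranks using a pre-division step."""
    world_size = len(per_rank_grads)
    # Step 1: every rank divides its local gradient before the reduction.
    prescaled = [g / predivide_factor for g in per_rank_grads]
    # Step 2: the reduction sums the contributions of all ranks.
    reduced = sum(prescaled)
    # Step 3 (assumption): divide by the remaining factor so that the
    # overall division equals world_size, i.e. the result is the mean.
    postdivide_factor = world_size / predivide_factor
    return reduced / postdivide_factor


if __name__ == "__main__":
    grads = [1.0, 2.0, 3.0, 6.0]
    print(averaged_gradient(grads, predivide_factor=1.0))
    print(averaged_gradient(grads, predivide_factor=2.0))
```

Under this assumption the final mean does not depend on the factor; a larger factor only keeps the intermediate sum smaller, which can help when the reduction runs in fp16.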


Dump the information collected by the memory tracer to a file. Example:

    try:
        # forward: model(inputs)
        # backward: optimizer.backward()
        pass
    except Exception as e:
        # Write the collected memory statistics before exiting.
        model.dump_memory_stats()
        exit(0)

class, optimizer, gpu_margin_mem_ratio=0.0, initial_scale=4294967296, min_scale=1, growth_factor=2, backoff_factor=0.5, growth_interval=1000, hysteresis=2, max_scale=4294967296, dp_process_group=None, mp_process_group=None, verbose=False)

A wrapper for an optimizer. ShardedOptimizerV2 and ShardedModelV2 together implement the Zero Redundancy Optimizer (ZeRO).

By default, ZeRO optimizer stage 3 offloads optimizer states (OS) to the CPU.

We apply the Device-aware Operator Placement technique for OS placement from the following paper.

PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management

GPU margin space is the remaining space after removing peak non-model data from the overall GPU memory, which is detected by a runtime memory tracer.

We place as many OS chunks in the margin space as possible.

The size of the margin space can be controlled by gpu_margin_mem_ratio. If it is set to 0.0, the behavior is the same as the classical ZeRO optimizer.
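The margin-space idea can be sketched with plain arithmetic. The example below is illustrative only: the function names, the byte counts, and the greedy first-fit placement are assumptions, not the library's actual placement logic.

```python
from typing import List


def gpu_margin_bytes(total_gpu_bytes: int, peak_non_model_bytes: int,
                     gpu_margin_mem_ratio: float) -> int:
    """Bytes of margin space that optimizer-state (OS) chunks may occupy."""
    # Margin space: overall GPU memory minus the peak non-model data.
    margin = max(total_gpu_bytes - peak_non_model_bytes, 0)
    # Only the configured fraction of the margin is made available.
    return int(margin * gpu_margin_mem_ratio)


def place_os_chunks(chunk_bytes: List[int], budget_bytes: int) -> List[str]:
    """Greedy sketch (assumption): keep chunks on CUDA while they fit, else on CPU."""
    placement = []
    used = 0
    for size in chunk_bytes:
        if used + size <= budget_bytes:
            placement.append("cuda")
            used += size
        else:
            placement.append("cpu")
    return placement


if __name__ == "__main__":
    budget = gpu_margin_bytes(16, 6, 0.5)
    print(budget, place_os_chunks([2, 2, 2], budget))
```

With a ratio of 0.0 the budget is zero and every chunk stays on the CPU, which matches the description of the classical ZeRO behavior.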


You must use ShardedOptimizerV2 with ShardedModelV2.


If you set gpu_margin_mem_ratio > 0, make sure you set tensor_placement_policy in ShardedModelV2 to “auto”.

  • sharded_model (ShardedModelV2) – A sharded model initialized by class ShardedModelV2. The optimizer will use the shard strategy provided by sharded model to shard param fp32 tensors.

  • optimizer (Optimizer) – An Optimizer instance.

  • gpu_margin_mem_ratio (float, optional) – The fraction of the GPU memory remaining after the first forward-backward pass that the hybrid CPU optimizer may use. This argument has no effect unless tensor_placement_policy of ShardedModelV2 is “auto”. Defaults to 0.0.

  • initial_scale (float, optional) – Initial scale used by DynamicGradScaler. Defaults to 2**32.

  • min_scale (float, optional) – Min scale used by DynamicGradScaler. Defaults to 1.

  • growth_factor (float, optional) – growth_factor used by DynamicGradScaler. Defaults to 2.

  • backoff_factor (float, optional) – backoff_factor used by DynamicGradScaler. Defaults to 0.5.

  • growth_interval (float, optional) – growth_interval used by DynamicGradScaler. Defaults to 1000.

  • hysteresis (float, optional) – hysteresis used by DynamicGradScaler. Defaults to 2.

  • max_scale (int, optional) – max_scale used by DynamicGradScaler. Defaults to 2**32.

  • dp_process_group (Optional[ProcessGroup], optional) – Data parallel process group. Defaults to None.

  • mp_process_group (Optional[ProcessGroup], optional) – Model parallel process group. Defaults to None.
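To show how the scaling parameters above typically interact, here is a minimal sketch of generic dynamic loss scaling. The class `SimpleDynamicScaler` is hypothetical and is not the library's DynamicGradScaler; in particular, the way hysteresis is counted and reset here is an assumption.

```python
class SimpleDynamicScaler:
    """Generic dynamic loss-scaling sketch; parameter names mirror the list above."""

    def __init__(self, initial_scale=2**32, min_scale=1, growth_factor=2,
                 backoff_factor=0.5, growth_interval=1000, hysteresis=2,
                 max_scale=2**32):
        self.scale = float(initial_scale)
        self.min_scale = float(min_scale)
        self.max_scale = float(max_scale)
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.hysteresis = hysteresis
        self._clean_steps = 0  # consecutive steps without overflow
        self._overflows = 0    # overflows seen since the last scale change

    def update(self, overflow: bool) -> None:
        if overflow:
            self._clean_steps = 0
            self._overflows += 1
            # Assumption: back off only after `hysteresis` overflows.
            if self._overflows >= self.hysteresis:
                self.scale = max(self.scale * self.backoff_factor, self.min_scale)
                self._overflows = 0
        else:
            self._clean_steps += 1
            # Grow the scale after `growth_interval` clean steps, capped at max_scale.
            if self._clean_steps == self.growth_interval:
                self._clean_steps = 0
                self._overflows = 0
                self.scale = min(self.scale * self.growth_factor, self.max_scale)


if __name__ == "__main__":
    scaler = SimpleDynamicScaler(initial_scale=1024, growth_interval=2, max_scale=4096)
    for overflow in (True, True, False, False):
        scaler.update(overflow)
        print(overflow, scaler.scale)
```

Because initial_scale and max_scale share the same default, a scaler built with the defaults in this sketch can only shrink from its starting value and later grow back up to it.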


Get the memory usage of the optimizer, including master_params (param fp32), momentum (self.state[p]['exp_avg']) and variance (self.state[p]['exp_avg_sq']).


Returns

CUDA/CPU memory usage in bytes.

Return type

Tuple[int, int]
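The kind of accounting described here can be sketched as follows. The function `optimizer_memory_usage` below is a hypothetical stand-in, not the library method; it assumes every tracked tensor (master param, exp_avg, exp_avg_sq) is stored in fp32 and returns the result in (cuda, cpu) order to match the description above.

```python
from typing import List, Tuple

FP32_BYTES = 4  # assumption: all tracked optimizer tensors are fp32


def optimizer_memory_usage(tensors: List[Tuple[str, int]]) -> Tuple[int, int]:
    """Sum bytes per device for (device, numel) pairs; returns (cuda_bytes, cpu_bytes)."""
    cuda_bytes = 0
    cpu_bytes = 0
    for device, numel in tensors:
        nbytes = numel * FP32_BYTES
        if device == "cuda":
            cuda_bytes += nbytes
        elif device == "cpu":
            cpu_bytes += nbytes
        else:
            raise ValueError(f"unknown device: {device}")
    return cuda_bytes, cpu_bytes


if __name__ == "__main__":
    # One parameter shard of 10 elements: master param on CUDA,
    # momentum and variance offloaded to CPU (illustrative placement).
    print(optimizer_memory_usage([("cuda", 10), ("cpu", 10), ("cpu", 10)]))
```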