colossalai.nn.optimizer.hybrid_adam

class colossalai.nn.optimizer.hybrid_adam.HybridAdam(model_params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, adamw_mode=True, nvme_offload_fraction=0.0, nvme_offload_dir=None, **defaults)[source]

Implements Adam algorithm.

Supports parameter updates on both GPU and CPU, depending on the device of the parameters. The parameters and their gradients must be on the same device (see the sketch after this list):

  • Parameters on CPU and gradients on CPU is allowed.

  • Parameters on GPU and gradients on GPU is allowed.

  • Parameters on GPU and gradients on CPU is not allowed.
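
A minimal sketch of the CPU-side case, assuming ColossalAI is installed with its CPU Adam kernel built; the model, data, and hyperparameters below are illustrative only:

```python
import torch
from colossalai.nn.optimizer import HybridAdam

model = torch.nn.Linear(16, 4)                 # parameters stay on CPU
optimizer = HybridAdam(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 16)).sum()
loss.backward()                                # gradients are created on CPU as well
optimizer.step()                               # CPU parameters are updated via the CPU path
```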

Requires ColossalAI to be installed via pip install .

This version of Hybrid Adam is a hybrid of CPUAdam and FusedAdam.

  • For parameters updating on CPU, it uses CPUAdam.

  • For parameters updating on GPU, it uses FusedAdam.

  • Mixed-precision calculation with fp16 and fp32 is supported, e.g., fp32 parameters and fp16 gradients.

colossalai.nn.optimizer.HybridAdam may be used as a drop-in replacement for torch.optim.AdamW, or for torch.optim.Adam when adamw_mode=False.
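
A hedged sketch of the drop-in usage on GPU, assuming a CUDA device is available; the model, data, and hyperparameters are placeholders:

```python
import torch
from colossalai.nn.optimizer import HybridAdam

model = torch.nn.Linear(16, 4).cuda()          # parameters (and later gradients) on GPU
# Roughly equivalent to torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01);
# pass adamw_mode=False for plain Adam-style L2 regularization instead.
optimizer = HybridAdam(model.parameters(), lr=1e-3, weight_decay=0.01)

for _ in range(10):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 16, device="cuda")).sum()
    loss.backward()
    optimizer.step()                           # GPU parameters go through the fused path
```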

Adam was proposed in Adam: A Method for Stochastic Optimization.

Parameters
  • model_params (iterable) – iterable of parameters or dicts defining parameter groups.

  • lr (float, optional) – learning rate. (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False) NOT SUPPORTED yet in CPUAdam!

  • adamw_mode (boolean, optional) – whether to apply decoupled weight decay, also known as AdamW (True), or plain L2 regularization (False). (default: True)

  • simd_log (boolean, optional) – whether to log whether SIMD acceleration is being used. (default: False)

  • nvme_offload_fraction (float, optional) – Fraction of optimizer states to be offloaded to NVMe. Defaults to 0.0.

  • nvme_offload_dir (Optional[str], optional) – Directory to save NVMe offload files. If it’s None, a random temporary directory will be used. Defaults to None.
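
A minimal sketch of configuring NVMe offloading, assuming the optional NVMe/async-IO dependencies are installed; the offload directory below is a hypothetical path:

```python
import torch
from colossalai.nn.optimizer import HybridAdam

model = torch.nn.Linear(1024, 1024)
optimizer = HybridAdam(
    model.parameters(),
    lr=1e-3,
    nvme_offload_fraction=0.5,                 # offload half of the optimizer states to NVMe
    nvme_offload_dir="/mnt/nvme/offload",      # hypothetical path; None uses a random temp dir
)
```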