colossalai.initialize

colossalai.initialize.get_default_parser()

Reads the user command line and uses an argument parser to parse the input arguments, which include the configuration, host, port, world size, local rank, and backend for torch.distributed.

Returns

Returns the parser with the default arguments; the user may add customized arguments to this parser (see the example below).

Return type

Namespace
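
A minimal usage sketch, assuming the parser helper is re-exported at the package level; the --epochs flag is a hypothetical custom argument added for illustration:

    import colossalai

    parser = colossalai.get_default_parser()
    # add a custom argument on top of the default ones (hypothetical flag)
    parser.add_argument('--epochs', type=int, default=10)
    args = parser.parse_args()
    # args now holds the default fields (config, host, port, world size,
    # rank, local rank, backend) plus the custom --epochs value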

colossalai.initialize.launch(config, rank, world_size, host, port, backend='nccl', local_rank=None, seed=1024, verbose=True)

This function first parses the configuration arguments, using parse_args() in case any of the input arguments are not given. It then initializes and sets the distributed environment by calling global_context’s functions (see the example below).

Parameters
  • config (Union[str, dict, Config]) – Config file or config file path; both are acceptable

  • rank (int) – Rank for the default process group

  • world_size (int) – World size of the default process group

  • host (str) – The master address for distributed training

  • port (str) – The master port for distributed training

  • backend (str, optional) – Backend for torch.distributed, defaults to nccl

  • local_rank (int, optional) – Rank for the process on the node and is used to set the default CUDA device, defaults to None. If local_rank = None, the default device ordinal will be calculated automatically.

  • seed (int, optional) – Specified random seed for every process. Defaults to 1024.

  • verbose (bool, optional) – Whether to print logs. Defaults to True.

Raises

Exception – Raised when the config type is invalid
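
A minimal sketch of launching a single-process group manually; the config path, host and port below are placeholder values:

    import colossalai

    colossalai.launch(
        config='./config.py',   # placeholder path to a config file
        rank=0,
        world_size=1,
        host='127.0.0.1',       # placeholder master address
        port=29500,             # placeholder master port
        backend='nccl',
    )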

colossalai.initialize.launch_from_slurm(config, host, port, backend='nccl', seed=1024, verbose=True)

A wrapper of colossalai.launch for the SLURM launcher; it reads the rank and world size from the environment variables set by SLURM (see the example below).

Parameters
  • config (Union[str, dict, Config]) – Config file or config file path; both are acceptable

  • host (str) – The master address for distributed training

  • port (str) – The master port for distributed training

  • backend (str, optional) – Backend for torch.distributed, defaults to nccl

  • seed (int, optional) – Specified random seed for every process. Defaults to 1024.

  • verbose (bool, optional) – Whether to print logs. Defaults to True.
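
A minimal sketch for a job step started with srun; the rank and world size are taken from the SLURM environment, so only the master address and port need to be supplied (placeholder values shown):

    import colossalai

    colossalai.launch_from_slurm(
        config='./config.py',   # placeholder config path
        host='node-0',          # placeholder master address
        port=29500,             # placeholder master port
    )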

colossalai.initialize.launch_from_openmpi(config, host, port, backend='nccl', seed=1024, verbose=True)

A wrapper of colossalai.launch for the OpenMPI launcher; it reads the rank and world size from the environment variables set by OpenMPI (see the example below).

Parameters
  • config (Union[str, dict, Config]) – Config file or config file path; both are acceptable

  • host (str) – The master address for distributed training

  • port (str) – The master port for distributed training

  • backend (str, optional) – Backend for torch.distributed, defaults to nccl

  • seed (int, optional) – Specified random seed for every process. Defaults to 1024.

  • verbose (bool, optional) – Whether to print logs. Defaults to True.
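
A minimal sketch for a run started with mpirun; the rank and world size come from the environment variables set by OpenMPI (config path, host and port are placeholders):

    import colossalai

    colossalai.launch_from_openmpi(
        config='./config.py',   # placeholder config path
        host='node-0',          # placeholder master address
        port=29500,             # placeholder master port
    )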

colossalai.initialize.launch_from_torch(config, backend='nccl', seed=1024, verbose=True)

A wrapper of colossalai.launch for torchrun or torch.distributed.launch; it reads the rank and world size from the environment variables set by PyTorch (see the example below).

Parameters
  • config (Union[str, dict, Config]) – Config file or config file path; both are acceptable

  • backend (str, optional) – Backend for torch.distributed, defaults to nccl

  • seed (int, optional) – Specified random seed for every process. Defaults to 1024.

  • verbose (bool, optional) – Whether to print logs. Defaults to True.
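
A minimal sketch for a script started with torchrun, e.g. torchrun --nproc_per_node=4 train.py; the rank, world size, master address and port are all read from the environment variables set by the PyTorch launcher:

    import colossalai

    colossalai.launch_from_torch(config='./config.py')   # placeholder config path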

colossalai.initialize.initialize(model, optimizer, criterion=None, train_dataloader=None, test_dataloader=None, lr_scheduler=None, ophooks=None, verbose=True)

Core function to wrap the essential training components with our functionality based on the config, which is loaded into gpc.config (see the example below).

Parameters
  • model (torch.nn.Module or Callable) – Your model instance or a function to build the model.

  • optimizer (torch.optim.optimizer.Optimizer or Type[torch.optim.optimizer]) – Your optimizer instance.

  • criterion (torch.nn.modules.loss._Loss, optional) – Your criterion instance.

  • train_dataloader (torch.utils.data.DataLoader, optional) – Dataloader for training.

  • test_dataloader (torch.utils.data.DataLoader, optional) – Dataloader for testing.

  • lr_scheduler (torch.optim.lr_scheduler._LRScheduler, optional) – Your lr scheduler instance.

  • verbose (bool, optional) – Whether to print logs.

Returns

A tuple of (engine, train_dataloader, test_dataloader, lr_scheduler), where only engine is guaranteed not to be None.

Return type

Tuple (engine, train_dataloader, test_dataloader, lr_scheduler)
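
A minimal end-to-end sketch, assuming the distributed environment has already been set up with one of the launch helpers above; the toy model and dataloader are placeholders chosen only for illustration:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    import colossalai

    colossalai.launch_from_torch(config='./config.py')   # placeholder config path

    model = torch.nn.Linear(16, 2)                        # toy model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    train_loader = DataLoader(
        TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))),
        batch_size=8,
    )

    engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(
        model=model,
        optimizer=optimizer,
        criterion=criterion,
        train_dataloader=train_loader,
    )
    # engine wraps the model, optimizer and criterion; test_dataloader and
    # lr_scheduler are None here because they were not passed in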