colossalai.gemini

class colossalai.gemini.StatefulTensorMgr(tensor_placement_policy)[source]

Stateful Tensor Manager, inspired by PatrickStar.

PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management https://arxiv.org/abs/2108.05818

finish_iter()[source]

This function must be called at the end of each iteration.

adjust_layout()[source]

Adjust the layout of stateful tensors according to the information provided by mem_stats_collector, which should belong to a sharded model.
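
A minimal usage sketch of the intended call pattern. The names policy, model and train_loader are stand-ins: policy is a tensor placement policy instance constructed elsewhere, and model/train_loader represent a sharded model and its data loader.

    import colossalai.gemini as gemini

    mgr = gemini.StatefulTensorMgr(tensor_placement_policy=policy)  # `policy` built elsewhere

    for batch in train_loader:
        mgr.adjust_layout()        # re-place stateful tensors before the step
        loss = model(batch).sum()  # stand-in forward pass
        loss.backward()
        mgr.finish_iter()          # must run once at the end of every iteration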

class colossalai.gemini.GeminiManager(placement_policy, chunk_manager)[source]

Stateful Tensor Manager, inspired by PatrickStar.

PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management https://arxiv.org/abs/2108.05818

Parameters
  • placement_policy (str) – the device on which to place held tensors. It can be ‘cpu’, ‘cuda’ or ‘auto’. If ‘cpu’, parameters, gradients and optimizer states are offloaded to CPU, which minimizes CUDA memory usage. If ‘cuda’, they are never offloaded, which maximizes CUDA memory usage. If ‘auto’, they are moved dynamically based on CPU and CUDA memory usage, making even use of the heterogeneous memory space. Note that the ‘auto’ policy can only work well when no other process uses CUDA during your training.

  • chunk_manager (ChunkManager) – A ChunkManager instance.
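
A construction sketch, assuming chunk_manager was built as shown in the ChunkManager section below; the policy string is the only real choice to make here:

    from colossalai.gemini import GeminiManager

    # 'cpu'  -> offload parameters, gradients and optimizer states to CPU
    # 'cuda' -> keep everything on CUDA, no offloading
    # 'auto' -> move tensors dynamically based on CPU/CUDA memory usage
    manager = GeminiManager(placement_policy='auto', chunk_manager=chunk_manager)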

post_iter()[source]

This function must be called at the end of each iteration.

adjust_layout(chunks, group_name)[source]

Adjust the layout of stateful tensors according to the information provided by mem_stats_collector, which should belong to a sharded model.

class colossalai.gemini.ChunkManager(chunk_size, process_group, enable_distributed_storage=False, init_device=None)[source]

A manager class to manipulate the tensors in chunks.

Parameters
  • chunk_size (int) – the size of a chunk, i.e. the number of elements it holds.

  • process_group (ColoProcessGroup) – process group of the chunk.

  • enable_distributed_storage (bool) – optional, whether to allow distributed storage of a chunk. The default is False.

  • init_device (torch.device) – optional, the device on which the chunk is initialized. The default is None.
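
A construction sketch, assuming torch.distributed is already initialized and pg is a ColoProcessGroup created elsewhere; the chunk size of 32 * 1024 ** 2 elements is an arbitrary example value:

    import torch
    from colossalai.gemini import ChunkManager

    chunk_manager = ChunkManager(chunk_size=32 * 1024 ** 2,
                                 process_group=pg,  # built elsewhere
                                 enable_distributed_storage=True,
                                 init_device=torch.device('cpu'))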

create_group(group_name, force_data_on_cuda=False)[source]

Create a chunk group. A usage sketch combining this method with append_tensor follows the append_tensor entry below.

Parameters
  • group_name (str) – group name

  • force_data_on_cuda (bool, optional) – If True, the data of chunks in this group is always kept on CUDA. Defaults to False.

append_tensor(tensor, group_name)[source]

Append a tensor to a chunk.

Parameters
  • tensor (torch.Tensor) – a tensor to append to the chunk.

  • group_name (str) – the name of the chunk group.
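
A sketch combining create_group and append_tensor, assuming chunk_manager was constructed as above and model is a stand-in torch.nn.Module; the group name 'fp16_params' is hypothetical:

    # Chunks in this group keep their data on CUDA.
    chunk_manager.create_group('fp16_params', force_data_on_cuda=True)

    # Pack every parameter of the model into chunks of that group.
    for param in model.parameters():
        chunk_manager.append_tensor(param, group_name='fp16_params')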

access_chunk(chunk)[source]

Synchronize the chunk across processes via broadcast.

Parameters

chunk (Chunk) – the chunk to synchronize.

release_chunk(chunk)[source]

Release the memory space of a chunk.

Parameters

chunk (Chunk) – the chunk whose memory space will be released
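
A sketch of the access/release cycle, assuming tensor is a stand-in tensor that lives in some chunk and that every rank calls these methods collectively:

    chunk = chunk_manager.get_chunk(tensor)  # get_chunk is documented below

    chunk_manager.access_chunk(chunk)    # broadcast: every rank now holds the data
    # ... compute with the tensors stored in the chunk ...
    chunk_manager.release_chunk(chunk)   # free the space on non-owning ranks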

move_chunk(chunk, device, update_ptr=True)[source]

Move the chunk to the target device.

Parameters
  • chunk (Chunk) – the chunk to move to the target device

  • device (torch.device) – the target device

  • update_ptr (bool) – optional, whether to update the data pointers of the chunk’s tensors after the move. The default is True.

trans_tensor_state(tensor, state)[source]

Transition the tensor’s state according to the pre-defined state machine.

Parameters
  • tensor (torch.Tensor) – the tensor for the state transition

  • state (TensorState) – the next tensor state for the transition
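
A sketch of a state transition around a compute step; COMPUTE and HOLD are assumed members of the TensorState enumeration (documented at the end of this page), chosen to match the usual compute/hold life cycle:

    from colossalai.gemini import TensorState

    chunk_manager.trans_tensor_state(tensor, TensorState.COMPUTE)  # before use
    # ... forward or backward computation that reads `tensor` ...
    chunk_manager.trans_tensor_state(tensor, TensorState.HOLD)     # after use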

reduce_chunk(chunk)[source]

Reduce or all-reduce the chunk. If enable_distributed_storage is True, the chunk is reduced to the rank that owns it; otherwise it is all-reduced so that every rank holds the result.

Parameters

chunk (Chunk) – the chunk for reduction.

copy_tensor_to_chunk_slice(tensor, data)[source]

Copy data to the chunk slice indexed by the input tensor.

Parameters
  • tensor (torch.Tensor) – the tensor used to retrieve meta information

  • data (torch.Tensor) – the tensor to be copied to the chunk

get_chunk(tensor)[source]

Return the chunk owning the tensor.

Parameters

tensor (torch.Tensor) – a torch tensor object

add_lazy_release_tensors(tensors)[source]

Add tensors to the buffer for lazy release.

Parameters

tensors (List[torch.Tensor]) – the tensors to be released lazily

exec_lazy_release()[source]

Execute release for tensors added to the lazy release buffer.
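
A sketch pairing add_lazy_release_tensors with exec_lazy_release, assuming model is a stand-in module whose parameters should be freed at a single synchronization point rather than one by one:

    # Queue the tensors; nothing is freed yet.
    chunk_manager.add_lazy_release_tensors(list(model.parameters()))

    # ... finish the remaining work of the step ...

    # Release everything queued above in one pass.
    chunk_manager.exec_lazy_release()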

static get_chunk_util(chunk_size, params_numel)[source]

Calculate the utilization rate of a chunk.

Parameters
  • chunk_size (int) – the size of a chunk

  • params_numel (List[int]) – the list of integers representing the number of elements of parameters
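
For intuition, a worked example under the assumption that tensors are packed into chunks greedily and that utilization is the occupied fraction of the allocated chunk capacity (both assumptions about the internal rule): with a chunk size of 1024, parameters of 500, 600 and 900 elements each end up in their own chunk, giving (500 + 600 + 900) / (3 * 1024) ≈ 0.65.

    from colossalai.gemini import ChunkManager

    util = ChunkManager.get_chunk_util(chunk_size=1024,
                                       params_numel=[500, 600, 900])
    print(util)  # around 0.65 under the packing assumption above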

static search_chunk_size(module, search_range, n_grids, min_chunk_size=None, filter_exlarge_params=True)[source]

Search for the chunk size for optimal chunk utilization.

Parameters
  • module (torch.nn.Module) – a torch module object

  • search_range (int) – the range of chunk size to search. The actual search range will be from max(min_chunk_size, max_param_size) to max(min_chunk_size, max_param_size) + search_range.

  • n_grids (int) – the number of intervals in the search range

  • min_chunk_size (int) – optional, the minimum size of a chunk. The default is None.

  • filter_exlarge_params (bool) – optional, whether to exclude abnormally large parameters from the utilization search. The default is True.
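
A usage sketch with a toy two-layer module; the search bounds are arbitrary example values:

    import torch.nn as nn
    from colossalai.gemini import ChunkManager

    module = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024))

    chunk_size = ChunkManager.search_chunk_size(module,
                                                search_range=64 * 1024,
                                                n_grids=8,
                                                min_chunk_size=32 * 1024)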

copy_chunk_group(dest_group_name, src_group_name)[source]

Copy chunk data from one group to another group.

Parameters
  • dest_group_name (str) – the destination group which receives the copied data

  • src_group_name (str) – the source group which provides the data to copy

get_chunks(tensors)[source]

Get all chunks owning the input tensors.

Parameters

tensors (Iterable[torch.Tensor]) – the tensors used to look for chunks

add_extern_static_tensor(tensor)[source]

Add an external static tensor to the chunk manager. Such tensors are not managed by the chunk manager, but their memory usage is still monitored. They are “static” in the sense that their shape, dtype and device never change, so their memory usage never changes either.

Parameters

tensor (torch.Tensor) – an external static tensor, e.g. an optimizer state.
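
A sketch registering optimizer state tensors, assuming optimizer is a stand-in torch optimizer whose state tensors should be counted in the chunk manager’s memory accounting without being chunked:

    import torch

    for state in optimizer.state.values():
        for value in state.values():
            if isinstance(value, torch.Tensor):
                chunk_manager.add_extern_static_tensor(value)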

class colossalai.gemini.TensorInfo(state: colossalai.gemini.chunk.TensorState, offset: int, end: int)[source]

class colossalai.gemini.Chunk(chunk_size, src_rank, process_group, dtype, init_device=None, force_data_on_cuda=False)[source]

A chunk is a contiguous memory space which contains multiple tensors.

Parameters
  • chunk_size (int) – the number of elements in a chunk

  • src_rank (int) – the rank of the process that owns the chunk

  • process_group (ColoProcessGroup) – the process group to which the chunk belongs

  • dtype (torch.dtype) – the data type of the chunk

  • init_device (torch.device) – optional, the device on which the chunk is initialized. The default is None, in which case the current GPU is used.

  • force_data_on_cuda (bool) – optional, if True, chunk.data is always kept on CUDA. Defaults to False.
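
A construction sketch, assuming torch.distributed is initialized, rank 0 owns the chunk, and pg is a ColoProcessGroup created elsewhere; the chunk size is an arbitrary example value:

    import torch
    from colossalai.gemini import Chunk

    chunk = Chunk(chunk_size=1024 * 1024,   # one million fp16 elements
                  src_rank=0,
                  process_group=pg,
                  dtype=torch.float16,
                  force_data_on_cuda=True)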

append(tensor)[source]

Add a tensor to the chunk.

Parameters

tensor (torch.Tensor) – a tensor to be added to the chunk

release()[source]

Release the memory space on processes which do not own the chunk.

access()[source]

Broadcast the chunk to synchronize the tensors across data parallel processes.

move_device(device, update_ptr=True)[source]

Move the chunk to a target device.

Parameters

  • device (torch.device) – the target device for data movement

  • update_ptr (bool) – optional, whether to update the data pointers of the chunk’s tensors after the move. The default is True.

reduce(is_all_reduce=False)[source]

Reduce or all-reduce the chunk.

Parameters

is_all_reduce (bool) – optional, whether to all-reduce the chunk. The default is False.

tensor_trans_state(tensor, tensor_state)[source]

Make a transition of the tensor into the next state.

Parameters
  • tensor (torch.Tensor) – a torch Tensor object.

  • tensor_state (TensorState) – the target state for transition.

copy_tensor_to_chunk_slice(tensor, data_slice)[source]

Copy data slice to the memory space indexed by the input tensor in the chunk.

Parameters
  • tensor (torch.Tensor) – the tensor used to retrieve meta information

  • data_slice (torch.Tensor) – the tensor to be copied to the chunk

property can_release

Check whether the chunk can be released.

property can_move_device

Check whether the chunk can be moved across devices.

property can_reduce

Check whether the chunk can be reduced.

property is_empty

Check whether the chunk is empty.

property has_inf_or_nan

Check whether the chunk contains inf or NaN values.
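
A sketch of an overflow check before an optimizer step, assuming chunks is an iterable of Chunk objects (e.g. from ChunkManager.get_chunks) and optimizer is a stand-in:

    overflow = any(chunk.has_inf_or_nan for chunk in chunks)
    if overflow:
        optimizer.zero_grad()   # skip the step on overflow, a common fp16 pattern
    else:
        optimizer.step()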

copy_(dest_chunk)[source]

Copy the data of this chunk to a destination chunk.

property device_type

Get the device type of the chunk.

class colossalai.gemini.TensorState(value)[source]

An enumeration of the states a tensor can occupy while it is managed in a chunk; transitions follow the state machine used by trans_tensor_state and tensor_trans_state.
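
A minimal sketch listing the defined states; the exact member names belong to the library’s state machine (COMPUTE and HOLD, used in the trans_tensor_state sketch above, are assumed members):

    from colossalai.gemini import TensorState

    for state in TensorState:
        print(state.name, state.value)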