class colossalai.kernel.FusedScaleMaskSoftmax(input_in_fp16, input_in_bf16, attn_mask_type, scaled_masked_softmax_fusion, mask_func, softmax_in_fp32, scale)[source]

Fused operation: scaling + mask + softmax

  • input_in_fp16 – Flag indicating whether the input is in fp16 format.

  • input_in_bf16 – Flag indicating whether the input is in bf16 format.

  • attn_mask_type – Attention mask type (pad or causal)

  • scaled_masked_softmax_fusion – Flag indicating whether the user wants to use softmax fusion.

  • mask_func – Mask function to be applied.

  • softmax_in_fp32 – If True, softmax is performed at fp32 precision.

  • scale – Scaling factor used in input tensor scaling.
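For orientation, the three steps that this class fuses can be written out unfused. The following is a minimal pure-Python sketch of the math only, not the library's kernel: the function names, the list-of-rows layout, the convention that True in the mask marks a hidden position, and the fill value of -10000.0 are all assumptions made for illustration.

```python
import math


def scale_mask_softmax(scores, mask, scale=None, mask_value=-10000.0):
    """Unfused reference: scale, then mask, then softmax over each row.

    scores: list of rows of floats (attention logits for one head).
    mask:   same shape as scores; True marks a position to hide (assumed convention).
    mask_value: fill value for hidden positions (assumed, not taken from the library).
    """
    out = []
    for row, mask_row in zip(scores, mask):
        # 1. Scaling: multiply every logit by the scale factor, if one is given.
        if scale is not None:
            row = [x * scale for x in row]
        # 2. Masking: overwrite hidden positions with a large negative value.
        row = [mask_value if hidden else x for x, hidden in zip(row, mask_row)]
        # 3. Softmax: subtract the row maximum first for numerical stability.
        row_max = max(row)
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def causal_mask(seq_len):
    """Hide every key position that lies after the query position."""
    return [[k > q for k in range(seq_len)] for q in range(seq_len)]


scores = [[1.0, 2.0, 3.0] for _ in range(3)]
probs = scale_mask_softmax(scores, causal_mask(3), scale=0.5)
```

With a causal mask, the first query can attend only to the first key, so the first row of probs puts all of its weight on position 0, while the last row is an ordinary softmax over all three scaled logits.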

class colossalai.kernel.MultiHeadAttention(hidden_size, nhead, batch_size, max_seq_len, dropout=0.0, norm_first=False, fp16=True, pg=None)[source]

Initialize the MultiHeadAttention.

Static variable:

layer_id: The layer-index counter starting from 0 and incrementing by 1 every time a layer object is instantiated, e.g. if a model has 24 transformer layers, layer_id goes from 0 to 23.

  • hidden_size – Total dimension of the hidden state.

  • nhead – Number of parallel attention heads.

  • batch_size – Batch size for one forward pass.

  • max_seq_len – Max length of input sequence

  • dropout – Dropout probability

  • norm_first – Whether to perform layer normalization before attention.
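The layer_id counter described above follows a common class-level counter pattern. The following is a minimal sketch of that pattern, assuming a hypothetical CountedLayer class; it is not the ColossalAI implementation and only illustrates how each new instance could receive the next index.

```python
class CountedLayer:
    """Hypothetical stand-in for a layer class that numbers its instances."""

    layer_id = 0  # class-level counter shared by all instances

    def __init__(self):
        # Record this instance's index, then advance the shared counter.
        self.layer_id = CountedLayer.layer_id
        CountedLayer.layer_id += 1


# Building 24 layers assigns the indices 0 through 23 in creation order.
layers = [CountedLayer() for _ in range(24)]
ids = [layer.layer_id for layer in layers]
```

Because the counter lives on the class rather than on an instance, it keeps increasing across every layer created in the same process, so a second model built afterwards would continue from 24 in this sketch.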