colossalai.kernel
- class colossalai.kernel.FusedScaleMaskSoftmax(input_in_fp16, input_in_bf16, attn_mask_type, scaled_masked_softmax_fusion, mask_func, softmax_in_fp32, scale)
Fused operation: scaling + mask + softmax
- Parameters
input_in_fp16 – Flag to indicate whether the input is in fp16 data format.
input_in_bf16 – Flag to indicate whether the input is in bf16 data format.
attn_mask_type – Attention mask type (pad or causal).
scaled_masked_softmax_fusion – Flag to indicate whether the user wants to use softmax fusion.
mask_func – Mask function to be applied.
softmax_in_fp32 – If True, softmax is performed at fp32 precision.
scale – Scaling factor used in input tensor scaling.
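A minimal usage sketch based on the constructor signature above. The AttnMaskType import path, the mask_func contract (filling masked positions with a large negative value), and the [batch, num_heads, seq_q, seq_k] score layout are assumptions following Megatron-style fused softmax kernels, not confirmed details of this API.

```python
import torch

from colossalai.kernel import FusedScaleMaskSoftmax
# Assumed import path for the mask-type enum; it may differ between versions.
from colossalai.kernel.cuda_native.scaled_softmax import AttnMaskType


def attention_mask_func(attention_scores, attention_mask):
    # Assumed mask_func contract: fill masked positions with a large
    # negative value before the softmax is applied.
    return attention_scores.masked_fill(attention_mask, -10000.0)


fused_softmax = FusedScaleMaskSoftmax(
    input_in_fp16=True,
    input_in_bf16=False,
    attn_mask_type=AttnMaskType.padding,   # the "pad" mask type from the docs above
    scaled_masked_softmax_fusion=True,
    mask_func=attention_mask_func,
    softmax_in_fp32=True,
    scale=0.125,                           # e.g. 1 / sqrt(head_dim) for head_dim = 64
)

# Assumed score layout [batch, num_heads, seq_q, seq_k]; mask is True where
# positions should be ignored.
scores = torch.randn(4, 16, 128, 128, dtype=torch.float16, device="cuda")
mask = torch.zeros(4, 1, 128, 128, dtype=torch.bool, device="cuda")
probs = fused_softmax(scores, mask)
```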
- class colossalai.kernel.MultiHeadAttention(hidden_size, nhead, batch_size, max_seq_len, dropout=0.0, norm_first=False, fp16=True, pg=None)
Initialize the MultiHeadAttention.
Static variable:
layer_id: A layer-index counter that starts at 0 and increments by 1 each time a layer object is instantiated, e.g. if a model has 24 transformer layers, layer_id runs from 0 to 23.
- Parameters
hidden_size – Total dimension of the hidden states.
nhead – Number of parallel attention heads.
batch_size – Batch size for one forward pass.
max_seq_len – Maximum length of the input sequence.
dropout – Dropout probability.
norm_first – Whether to perform LayerNorm before attention.
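A minimal construction and usage sketch for the parameters listed above, assuming a CUDA device and the fused attention extension are available. The forward call (hidden states in [batch, seq_len, hidden_size] layout plus a per-token padding mask) and the mask semantics are assumptions and may differ between versions.

```python
import torch

from colossalai.kernel import MultiHeadAttention

batch_size, max_seq_len, hidden_size, nhead = 8, 512, 1024, 16

# Build the fused attention layer with the documented parameters;
# pg=None means no tensor-parallel process group is used.
attn = MultiHeadAttention(
    hidden_size=hidden_size,
    nhead=nhead,
    batch_size=batch_size,
    max_seq_len=max_seq_len,
    dropout=0.1,
    norm_first=False,
    fp16=True,
    pg=None,
).cuda()

# Assumed forward contract: hidden states plus a padding mask whose non-zero
# entries mark padded positions (an assumption, not a confirmed API).
hidden_states = torch.randn(batch_size, max_seq_len, hidden_size,
                            dtype=torch.float16, device="cuda")
padding_mask = torch.zeros(batch_size, max_seq_len,
                           dtype=torch.float16, device="cuda")
output = attn(hidden_states, padding_mask)
```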