mpcompress.layers
Attention
Attention(dim: int, num_heads: int = 8, qkv_bias: bool = False, qk_norm: bool = False, attn_drop: float = 0.0, proj_drop: float = 0.0, norm_layer: Module = nn.LayerNorm, **kwargs)
Multi-head self-attention mechanism.
This module implements scaled dot-product attention with optional QK normalization and fused attention support. It computes attention over the input sequence using query, key, and value projections.
The attention mechanism follows: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where d_k is the head dimension.
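A minimal usage sketch, assuming the module can be imported as `mpcompress.layers.Attention` (import path assumed) and behaves as documented below:

```python
import torch
from mpcompress.layers import Attention  # import path assumed

# 196 tokens of dimension 768 split across 12 heads (64 dims per head)
attn = Attention(dim=768, num_heads=12, qkv_bias=True)

x = torch.randn(2, 196, 768)            # [B, N, C]
mask = torch.zeros(2, 196, 196)         # additive mask; 0 keeps a position visible
mask[:, :, 100:] = float("-inf")        # hide tokens 100..195 from every query

out = attn(x, attn_mask=mask)
assert out.shape == x.shape             # [B, N, C] is preserved
```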
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dim` | `int` | Embedding dimension of the input tokens. Must be divisible by `num_heads`. | required |
| `num_heads` | `int` | Number of attention heads. | `8` |
| `qkv_bias` | `bool` | Whether to use bias in the QKV projection. | `False` |
| `qk_norm` | `bool` | Whether to apply normalization to Q and K. | `False` |
| `attn_drop` | `float` | Dropout probability for the attention weights. | `0.0` |
| `proj_drop` | `float` | Dropout probability for the output projection. | `0.0` |
| `norm_layer` | `Module` | Normalization layer used for QK normalization. | `nn.LayerNorm` |
| `**kwargs` | | Additional keyword arguments (unused). | `{}` |
forward
forward(x: Tensor, attn_mask: Tensor = None) -> torch.Tensor
Forward pass through attention layer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor of shape `[B, N, C]`, where B is the batch size, N is the sequence length, and C is the embedding dimension. | required |
| `attn_mask` | `Tensor` | Attention mask of shape `[B, N, N]` or any broadcastable shape. Values are added to the attention scores before the softmax. | `None` |
Returns:
| Type | Description |
|---|---|
| `torch.Tensor` | Output tensor of the same shape as the input, `[B, N, C]`. |
Block
Block(dim: int, num_heads: int, mlp_ratio: float = 4.0, qkv_bias: bool = False, qk_norm: bool = False, proj_drop: float = 0.0, attn_drop: float = 0.0, init_values: Optional[float] = None, drop_path: float = 0.0, act_layer: Module = nn.GELU, norm_layer: Module = nn.LayerNorm, mlp_layer: Module = Mlp, attn_layer: Module = Attention)
Vision Transformer block with attention and MLP layers.
This block implements a standard Transformer block for Vision Transformers, consisting of:
- Multi-head self-attention with optional layer scaling
- Feed-forward MLP with optional layer scaling
- Residual connections with optional drop path regularization
- Layer normalization before each sub-layer
The block follows the architecture: x = x + DropPath(LayerScale(Attn(Norm(x)))), followed by x = x + DropPath(LayerScale(MLP(Norm(x)))).
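In code, this structure can be sketched roughly as follows; the attribute names (`norm1`, `attn`, `ls1`, `drop_path1`, and so on) are assumptions for illustration, not the module's actual attributes:

```python
# Simplified sketch of the pre-norm residual structure described above
# (attribute names are assumptions, not the actual implementation):
def block_forward(self, x, attn_mask=None):
    # attention sub-layer: pre-norm, optional LayerScale, residual with DropPath
    x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x), attn_mask=attn_mask)))
    # MLP sub-layer: same pattern
    x = x + self.drop_path2(self.ls2(self.mlp(self.norm2(x))))
    return x
```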
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dim` | `int` | Embedding dimension of the input tokens. | required |
| `num_heads` | `int` | Number of attention heads. | required |
| `mlp_ratio` | `float` | Ratio of the MLP hidden dimension to the embedding dimension. | `4.0` |
| `qkv_bias` | `bool` | Whether to use bias in the QKV projection. | `False` |
| `qk_norm` | `bool` | Whether to apply normalization to Q and K. | `False` |
| `proj_drop` | `float` | Dropout probability for the projection layers. | `0.0` |
| `attn_drop` | `float` | Dropout probability for the attention weights. | `0.0` |
| `init_values` | `Optional[float]` | Initial value for layer scaling. If None, layer scaling is disabled. | `None` |
| `drop_path` | `float` | Drop path probability for stochastic depth. | `0.0` |
| `act_layer` | `Module` | Activation function for the MLP. | `nn.GELU` |
| `norm_layer` | `Module` | Normalization layer to use. | `nn.LayerNorm` |
| `mlp_layer` | `Module` | MLP layer class to use. | `Mlp` |
| `attn_layer` | `Module` | Attention layer class to use. | `Attention` |
forward
forward(x: Tensor, attn_mask: Tensor = None) -> torch.Tensor
Forward pass through the Transformer block.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor of shape `[B, N, C]`, where B is the batch size, N is the sequence length, and C is the embedding dimension. | required |
| `attn_mask` | `Tensor` | Attention mask tensor. If provided, it is applied in the attention computation. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | `torch.Tensor` | Output tensor of the same shape as the input, `[B, N, C]`. |
DepthConvBlock
DepthConvBlock(in_ch, out_ch, shortcut=False, force_adaptor=False)
Depthwise convolution block with feed-forward network.
This block implements a residual block using depthwise separable convolutions and a feed-forward network. It consists of:
- Optional channel adaptor (1x1 conv) for dimension matching
- Depthwise convolution path with residual connection
- Feed-forward network with residual connection
- Optional shortcut connection from input
- Optional quantization step scaling
- Optional tensor concatenation
Supports both PyTorch and CUDA implementations for efficient inference.
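A minimal usage sketch, assuming the import path shown below; the `quant_step` broadcast shape and the channel-wise concatenation axis are assumptions for illustration:

```python
import torch
from mpcompress.layers import DepthConvBlock  # import path assumed

block = DepthConvBlock(in_ch=64, out_ch=128)

x = torch.randn(1, 64, 32, 32)                 # [B, C, H, W]
y = block(x)                                   # [1, 128, 32, 32]

# Optional quantization-step scaling and concatenation with another feature map
quant_step = torch.ones(1, 128, 1, 1)          # broadcastable scale (shape assumed)
skip = torch.randn(1, 32, 32, 32)
y_cat = block(x, quant_step=quant_step, to_cat=skip, cat_at_front=True)
# assuming channel-wise concatenation with skip placed first: [1, 32 + 128, 32, 32]
```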
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `in_ch` | | Number of input channels. | required |
| `out_ch` | | Number of output channels. | required |
| `shortcut` | | Whether to add a shortcut connection from the input to the final output. | `False` |
| `force_adaptor` | | Whether to force use of the channel adaptor even when `in_ch == out_ch`. | `False` |
forward
forward(x, quant_step=None, to_cat=None, cat_at_front=True)
Forward pass with optional quantization and concatenation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, C, H, W]`. | required |
| `quant_step` | | Quantization step used to scale the output. If provided, the output is multiplied by `quant_step`. | `None` |
| `to_cat` | | Tensor to concatenate with the output. | `None` |
| `cat_at_front` | | If True, `to_cat` is placed before the output in the concatenation; if False, after it. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Processed tensor of shape `[B, out_ch, H, W]`, or the concatenated tensor if `to_cat` is provided. |
forward_cuda
forward_cuda(x, quant_step=None, to_cat=None, cat_at_front=True)
CUDA-optimized implementation of forward pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, C, H, W]`. | required |
| `quant_step` | | Quantization step used to scale the output. | `None` |
| `to_cat` | | Tensor to concatenate with the output. | `None` |
| `cat_at_front` | | Concatenation order. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Processed or concatenated tensor. |
forward_torch
forward_torch(x, quant_step=None, to_cat=None, cat_at_front=True)
PyTorch implementation of forward pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, C, H, W]`. | required |
| `quant_step` | | Quantization step used to scale the output. | `None` |
| `to_cat` | | Tensor to concatenate with the output. | `None` |
| `cat_at_front` | | Concatenation order. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Processed or concatenated tensor. |
LayerScale
LayerScale(dim: int, init_values: float = 1e-05, inplace: bool = False)
Layer scaling module for stabilizing deep networks.
This module scales the input by a learnable parameter gamma. It is commonly used in Vision Transformers to stabilize training of very deep networks. The scaling factor is initialized to a small value (e.g., 1e-5) and learned during training.
Reference: "Going deeper with Image Transformers" (Touvron et al., 2021)
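The operation amounts to an elementwise multiplication by a learnable per-channel vector. A minimal, illustrative re-statement of the behavior described above (not the module's actual source):

```python
import torch
import torch.nn as nn

class LayerScaleSketch(nn.Module):
    """Illustrative sketch of the scaling described above."""
    def __init__(self, dim: int, init_values: float = 1e-5, inplace: bool = False):
        super().__init__()
        self.inplace = inplace
        self.gamma = nn.Parameter(init_values * torch.ones(dim))  # one scale per channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # broadcasting scales the last dimension of x by gamma
        return x.mul_(self.gamma) if self.inplace else x * self.gamma
```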
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dim` | `int` | Dimension of the input tensor (last dimension). | required |
| `init_values` | `float` | Initial value of the scaling parameter gamma. | `1e-05` |
| `inplace` | `bool` | Whether to perform the multiplication in place. | `False` |
forward
forward(x: Tensor) -> torch.Tensor
Scale input tensor by learnable parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor of shape `[..., dim]`, where `dim` matches the dimension used at initialization. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | `torch.Tensor` | Scaled tensor of the same shape as the input. |
ResidualBlockUpsample
ResidualBlockUpsample(in_ch, out_ch)
Residual block with 2x upsampling.
This block performs 2x spatial upsampling followed by depthwise convolution processing. It combines a sub-pixel convolution for upsampling with a DepthConvBlock for feature refinement.
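A short shape-check sketch, assuming the import path shown below:

```python
import torch
from mpcompress.layers import ResidualBlockUpsample  # import path assumed

up = ResidualBlockUpsample(in_ch=128, out_ch=64)
x = torch.randn(1, 128, 16, 16)
y = up(x)
assert y.shape == (1, 64, 32, 32)   # [B, out_ch, 2H, 2W]
```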
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `in_ch` | | Number of input channels. | required |
| `out_ch` | | Number of output channels. | required |
forward
forward(x)
Forward pass with 2x upsampling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, in_ch, H, W]`. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Upsampled and processed tensor of shape `[B, out_ch, 2H, 2W]`. |
ResidualBlockWithStride2
ResidualBlockWithStride2(in_ch, out_ch)
Residual block with 2x downsampling.
This block performs 2x spatial downsampling followed by depthwise convolution processing. It combines a strided convolution for downsampling with a DepthConvBlock for feature refinement.
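Paired with ResidualBlockUpsample, this block forms a simple round trip over the spatial resolution; a brief sketch, assuming the import path shown below:

```python
import torch
from mpcompress.layers import ResidualBlockWithStride2, ResidualBlockUpsample  # import path assumed

down = ResidualBlockWithStride2(in_ch=3, out_ch=64)
up = ResidualBlockUpsample(in_ch=64, out_ch=3)

x = torch.randn(1, 3, 64, 64)
z = down(x)                          # [1, 64, 32, 32]
x_hat = up(z)                        # [1, 3, 64, 64]  spatial size restored
```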
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `in_ch` | | Number of input channels. | required |
| `out_ch` | | Number of output channels. | required |
forward
forward(x)
Forward pass with 2x downsampling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, in_ch, H, W]`. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Downsampled and processed tensor of shape `[B, out_ch, H//2, W//2]`. |
RoPEAttention
RoPEAttention(*args, num_prefix_tokens=1, num_latent_tokens=32, num_image_tokens=256, rope_theta=10.0, rope_mixed=True, **kwargs)
Multi-head attention with rotary position embeddings (RoPE).
This attention mechanism extends standard multi-head attention by applying rotary position embeddings to query and key vectors. It supports two modes:
- Mixed mode: Learnable 2D frequencies for image tokens and 1D frequencies for latent tokens
- Axial mode: Fixed 2D axial frequencies for image tokens
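A minimal usage sketch with the default token layout (1 prefix token + 256 image tokens + 32 latent tokens = 289 tokens); the import path and the forwarding of positional arguments to the parent Attention class are assumptions based on the signature above:

```python
import torch
from mpcompress.layers import RoPEAttention  # import path assumed

attn = RoPEAttention(
    768,                    # dim, forwarded to the parent Attention class
    num_heads=12,
    num_prefix_tokens=1,    # e.g. a CLS token, no positional embedding
    num_image_tokens=256,   # e.g. a 16x16 grid of patch tokens (2D RoPE)
    num_latent_tokens=32,   # latent tokens (1D RoPE)
    rope_mixed=True,        # learnable mixed 2D frequencies
)

x = torch.randn(2, 1 + 256 + 32, 768)   # [B, N, C]
out = attn(x)
assert out.shape == x.shape
```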
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*args` | | Positional arguments passed to the parent `Attention` class. | `()` |
| `num_prefix_tokens` | | Number of prefix tokens (e.g., the CLS token) that do not receive positional embeddings. | `1` |
| `num_latent_tokens` | | Number of latent tokens that receive 1D positional embeddings. | `32` |
| `num_image_tokens` | | Number of image tokens that receive 2D positional embeddings. | `256` |
| `rope_theta` | | Base frequency parameter for RoPE. Higher values result in lower frequencies. | `10.0` |
| `rope_mixed` | | If True, use learnable mixed 2D frequencies; if False, use fixed axial 2D frequencies. | `True` |
| `**kwargs` | | Additional keyword arguments passed to the parent `Attention` class. | `{}` |
forward
forward(x, attn_mask=None)
Forward pass with rotary position embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, N, C]`, where B is the batch size, N is the sequence length (1 + num_image_tokens + num_latent_tokens), and C is the embedding dimension. | required |
| `attn_mask` | | Attention mask tensor. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Output tensor of the same shape as the input, `[B, N, C]`. |
SubpelConv2x
SubpelConv2x(in_ch, out_ch, kernel_size, padding=0)
Sub-pixel convolution layer for 2x upsampling.
This layer performs 2x upsampling using sub-pixel convolution (also known as pixel shuffle). It uses a convolution followed by PixelShuffle to achieve efficient upsampling. Supports both PyTorch and CUDA implementations.
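The PyTorch path described above (a convolution followed by PixelShuffle) can be sketched as an equivalent Sequential; this is an illustrative equivalent, not the module's actual source:

```python
import torch
import torch.nn as nn

def subpel_conv2x_sketch(in_ch: int, out_ch: int, kernel_size: int, padding: int = 0) -> nn.Sequential:
    """Illustrative conv + PixelShuffle pipeline for 2x upsampling."""
    return nn.Sequential(
        # produce 4x the output channels, then rearrange them into a 2x2 spatial block
        nn.Conv2d(in_ch, out_ch * 4, kernel_size, padding=padding),
        nn.PixelShuffle(2),
    )

x = torch.randn(1, 64, 16, 16)
y = subpel_conv2x_sketch(64, 32, kernel_size=3, padding=1)(x)
assert y.shape == (1, 32, 32, 32)    # [B, out_ch, 2H, 2W]
```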
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `in_ch` | | Number of input channels. | required |
| `out_ch` | | Number of output channels. | required |
| `kernel_size` | | Size of the convolution kernel. | required |
| `padding` | | Padding size for the convolution. | `0` |
forward
forward(x, to_cat=None, cat_at_front=True)
Forward pass with optional tensor concatenation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, C, H, W]`. | required |
| `to_cat` | | Tensor to concatenate with the output. If None, only the upsampled output is returned. | `None` |
| `cat_at_front` | | If True, `to_cat` is placed before the output in the concatenation; if False, after it. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Upsampled tensor of shape `[B, out_ch, 2H, 2W]`, or the concatenated tensor if `to_cat` is provided. |
forward_cuda
forward_cuda(x, to_cat=None, cat_at_front=True)
CUDA-optimized implementation of forward pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, C, H, W]`. | required |
| `to_cat` | | Tensor to concatenate with the output. | `None` |
| `cat_at_front` | | Concatenation order. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Upsampled or concatenated tensor. |
forward_torch
forward_torch(x, to_cat=None, cat_at_front=True)
PyTorch implementation of forward pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, C, H, W]`. | required |
| `to_cat` | | Tensor to concatenate with the output. | `None` |
| `cat_at_front` | | Concatenation order. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Upsampled or concatenated tensor. |