mpcompress.layers

Attention

Attention(dim: int, num_heads: int = 8, qkv_bias: bool = False, qk_norm: bool = False, attn_drop: float = 0.0, proj_drop: float = 0.0, norm_layer: Module = nn.LayerNorm, **kwargs)

Multi-head self-attention mechanism.

This module implements scaled dot-product attention with optional QK normalization and fused attention support. It computes attention over the input sequence using query, key, and value projections.

The attention mechanism follows: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V where d_k is the head dimension.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dim | int | Embedding dimension of input tokens. Must be divisible by num_heads. | required |
| num_heads | int | Number of attention heads. | 8 |
| qkv_bias | bool | Whether to use bias in the QKV projection. | False |
| qk_norm | bool | Whether to apply normalization to Q and K. | False |
| attn_drop | float | Dropout probability for attention weights. | 0.0 |
| proj_drop | float | Dropout probability for the output projection. | 0.0 |
| norm_layer | Module | Normalization layer for QK normalization. | LayerNorm |
| **kwargs | dict | Additional keyword arguments (unused). | {} |

forward

forward(x: Tensor, attn_mask: Tensor = None) -> torch.Tensor

Forward pass through attention layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input tensor of shape [B, N, C], where B is batch size, N is sequence length, and C is embedding dimension. | required |
| attn_mask | Tensor | Attention mask tensor of shape [B, N, N] or a broadcastable shape. Values are added to attention scores before softmax. | None |

Returns:

| Type | Description |
| --- | --- |
| Tensor | Output tensor of the same shape as the input, [B, N, C]. |
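
A minimal usage sketch, assuming Attention is importable from mpcompress.layers as the module path above suggests; shapes and the additive mask semantics follow the tables above:

```python
import torch
from mpcompress.layers import Attention

# 768-dim tokens split across 12 heads (dim must be divisible by num_heads).
attn = Attention(dim=768, num_heads=12, qkv_bias=True)

x = torch.randn(2, 197, 768)          # [B, N, C]
mask = torch.zeros(2, 197, 197)       # additive mask: 0 keeps a position
mask[:, :, -5:] = float("-inf")       # -inf hides the last 5 key positions

y = attn(x, attn_mask=mask)           # same shape as the input: [2, 197, 768]
```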

Block

Block(dim: int, num_heads: int, mlp_ratio: float = 4.0, qkv_bias: bool = False, qk_norm: bool = False, proj_drop: float = 0.0, attn_drop: float = 0.0, init_values: Optional[float] = None, drop_path: float = 0.0, act_layer: Module = nn.GELU, norm_layer: Module = nn.LayerNorm, mlp_layer: Module = Mlp, attn_layer: Module = Attention)

Vision Transformer block with attention and MLP layers.

This block implements a standard Transformer block for Vision Transformers, consisting of:

  • Multi-head self-attention with optional layer scaling
  • Feed-forward MLP with optional layer scaling
  • Residual connections with optional drop path regularization
  • Layer normalization before each sub-layer

The block follows the architecture: x = x + DropPath(LayerScale(Attn(Norm(x)))) followed by x = x + DropPath(LayerScale(MLP(Norm(x)))).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dim | int | Embedding dimension of the input tokens. | required |
| num_heads | int | Number of attention heads. | required |
| mlp_ratio | float | Ratio of MLP hidden dimension to embedding dimension. | 4.0 |
| qkv_bias | bool | Whether to use bias in the QKV projection. | False |
| qk_norm | bool | Whether to apply normalization to Q and K. | False |
| proj_drop | float | Dropout probability for projection layers. | 0.0 |
| attn_drop | float | Dropout probability for attention weights. | 0.0 |
| init_values | Optional[float] | Initial value for layer scaling. If None, layer scaling is disabled. | None |
| drop_path | float | Drop path probability for stochastic depth. | 0.0 |
| act_layer | Module | Activation function for the MLP. | GELU |
| norm_layer | Module | Normalization layer to use. | LayerNorm |
| mlp_layer | Module | MLP layer class to use. | Mlp |
| attn_layer | Module | Attention layer class to use. | Attention |

forward

forward(x: Tensor, attn_mask: Tensor = None) -> torch.Tensor

Forward pass through the Transformer block.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input tensor of shape [B, N, C], where B is batch size, N is sequence length, and C is embedding dimension. | required |
| attn_mask | Tensor | Attention mask tensor. If provided, it is applied to the attention computation. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| out | Tensor | Output tensor of the same shape as the input, [B, N, C]. |
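
A minimal usage sketch, assuming Block is importable from mpcompress.layers; the chosen dimensions and regularization values are illustrative:

```python
import torch
from mpcompress.layers import Block

# Pre-norm ViT block: attention + MLP with 4x hidden width, LayerScale enabled
# via init_values and stochastic depth via drop_path.
block = Block(dim=768, num_heads=12, mlp_ratio=4.0, qkv_bias=True,
              init_values=1e-5, drop_path=0.1)

x = torch.randn(2, 197, 768)          # [B, N, C]
y = block(x)                          # [2, 197, 768]
```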

DepthConvBlock

DepthConvBlock(in_ch, out_ch, shortcut=False, force_adaptor=False)

Depthwise convolution block with feed-forward network.

This block implements a residual block using depthwise separable convolutions and a feed-forward network. It consists of:

  • Optional channel adaptor (1x1 conv) for dimension matching
  • Depthwise convolution path with residual connection
  • Feed-forward network with residual connection
  • Optional shortcut connection from input
  • Optional quantization step scaling
  • Optional tensor concatenation

Supports both PyTorch and CUDA implementations for efficient inference.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| in_ch | int | Number of input channels. | required |
| out_ch | int | Number of output channels. | required |
| shortcut | bool | Whether to add a shortcut connection from the input to the final output. | False |
| force_adaptor | bool | Whether to force use of the channel adaptor even when in_ch == out_ch. | False |

forward

forward(x, quant_step=None, to_cat=None, cat_at_front=True)

Forward pass with optional quantization and concatenation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input tensor of shape [B, C, H, W]. | required |
| quant_step | float | Quantization step for scaling the output. If provided, the output is multiplied by quant_step. | None |
| to_cat | Tensor | Tensor to concatenate with the output. | None |
| cat_at_front | bool | If True, concatenate to_cat before the output; if False, after it. | True |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| out | Tensor | Processed tensor of shape [B, out_ch, H, W], or the concatenated tensor if to_cat is provided. |
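
A minimal usage sketch, assuming DepthConvBlock is importable from mpcompress.layers; that concatenation happens along the channel dimension is an assumption, not stated above:

```python
import torch
from mpcompress.layers import DepthConvBlock

block = DepthConvBlock(in_ch=64, out_ch=96)   # channel adaptor handles 64 -> 96

x = torch.randn(1, 64, 32, 32)                # [B, C, H, W]
y = block(x)                                  # [1, 96, 32, 32]

# Optional quantization-step scaling and concatenation, per forward() above.
skip = torch.randn(1, 16, 32, 32)
z = block(x, quant_step=0.5, to_cat=skip, cat_at_front=True)
# Assumption: to_cat is concatenated along the channel dim, in front of the
# scaled output, giving [1, 16 + 96, 32, 32].
```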

forward_cuda

forward_cuda(x, quant_step=None, to_cat=None, cat_at_front=True)

CUDA-optimized implementation of forward pass.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input tensor of shape [B, C, H, W]. | required |
| quant_step | float | Quantization step for scaling. | None |
| to_cat | Tensor | Tensor to concatenate. | None |
| cat_at_front | bool | Concatenation order. | True |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| out | Tensor | Processed or concatenated tensor. |

forward_torch

forward_torch(x, quant_step=None, to_cat=None, cat_at_front=True)

PyTorch implementation of forward pass.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input tensor of shape [B, C, H, W]. | required |
| quant_step | float | Quantization step for scaling. | None |
| to_cat | Tensor | Tensor to concatenate. | None |
| cat_at_front | bool | Concatenation order. | True |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| out | Tensor | Processed or concatenated tensor. |

LayerScale

LayerScale(dim: int, init_values: float = 1e-05, inplace: bool = False)

Layer scaling module for stabilizing deep networks.

This module scales the input by a learnable parameter gamma. It is commonly used in Vision Transformers to stabilize training of very deep networks. The scaling factor is initialized to a small value (e.g., 1e-5) and learned during training.

Reference: "Going deeper with Image Transformers" (Touvron et al., 2021)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dim | int | Dimension of the input tensor (last dimension). | required |
| init_values | float | Initial value for the scaling parameter gamma. | 1e-05 |
| inplace | bool | Whether to perform in-place multiplication. | False |

forward

forward(x: Tensor) -> torch.Tensor

Scale input tensor by learnable parameter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input tensor of shape [..., dim], where dim matches the dimension used at initialization. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| out | Tensor | Scaled tensor of the same shape as the input. |
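
A minimal usage sketch, assuming LayerScale is importable from mpcompress.layers:

```python
import torch
from mpcompress.layers import LayerScale

ls = LayerScale(dim=768, init_values=1e-5)

x = torch.randn(2, 197, 768)    # any shape ending in dim=768
y = ls(x)                       # same shape; initially y is approx. 1e-5 * x,
                                # and gamma is learned during training
```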

ResidualBlockUpsample

ResidualBlockUpsample(in_ch, out_ch)

Residual block with 2x upsampling.

This block performs 2x spatial upsampling followed by depthwise convolution processing. It combines a sub-pixel convolution for upsampling with a DepthConvBlock for feature refinement.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| in_ch | int | Number of input channels. | required |
| out_ch | int | Number of output channels. | required |

forward

forward(x)

Forward pass with 2x upsampling.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input tensor of shape [B, in_ch, H, W]. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| out | Tensor | Upsampled and processed tensor of shape [B, out_ch, 2H, 2W]. |
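
A minimal usage sketch, assuming ResidualBlockUpsample is importable from mpcompress.layers:

```python
import torch
from mpcompress.layers import ResidualBlockUpsample

up = ResidualBlockUpsample(in_ch=128, out_ch=64)

x = torch.randn(1, 128, 16, 16)   # [B, in_ch, H, W]
y = up(x)                         # [1, 64, 32, 32]: spatial size doubled
```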

ResidualBlockWithStride2

ResidualBlockWithStride2(in_ch, out_ch)

Residual block with 2x downsampling.

This block performs 2x spatial downsampling followed by depthwise convolution processing. It combines a strided convolution for downsampling with a DepthConvBlock for feature refinement.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| in_ch | int | Number of input channels. | required |
| out_ch | int | Number of output channels. | required |

forward

forward(x)

Forward pass with 2x downsampling.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input tensor of shape [B, in_ch, H, W]. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| out | Tensor | Downsampled and processed tensor of shape [B, out_ch, H//2, W//2]. |
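
A minimal usage sketch, assuming ResidualBlockWithStride2 is importable from mpcompress.layers:

```python
import torch
from mpcompress.layers import ResidualBlockWithStride2

down = ResidualBlockWithStride2(in_ch=64, out_ch=128)

x = torch.randn(1, 64, 32, 32)    # [B, in_ch, H, W]
y = down(x)                       # [1, 128, 16, 16]: spatial size halved
```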

RoPEAttention

RoPEAttention(*args, num_prefix_tokens=1, num_latent_tokens=32, num_image_tokens=256, rope_theta=10.0, rope_mixed=True, **kwargs)

Multi-head attention with rotary position embeddings (RoPE).

This attention mechanism extends standard multi-head attention by applying rotary position embeddings to query and key vectors. It supports two modes:

  • Mixed mode: Learnable 2D frequencies for image tokens and 1D frequencies for latent tokens
  • Axial mode: Fixed 2D axial frequencies for image tokens

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *args | tuple | Positional arguments passed to the parent Attention class. | () |
| num_prefix_tokens | int | Number of prefix tokens (e.g., the CLS token) that do not receive positional embeddings. | 1 |
| num_latent_tokens | int | Number of latent tokens that receive 1D positional embeddings. | 32 |
| num_image_tokens | int | Number of image tokens that receive 2D positional embeddings. | 256 |
| rope_theta | float | Base frequency parameter for RoPE. Higher values result in lower frequencies. | 10.0 |
| rope_mixed | bool | If True, use learnable mixed 2D frequencies; if False, use fixed axial 2D frequencies. | True |
| **kwargs | dict | Additional keyword arguments passed to the parent Attention class. | {} |

forward

forward(x, attn_mask=None)

Forward pass with rotary position embeddings.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input tensor of shape [B, N, C], where B is batch size, N is sequence length (1 + num_image_tokens + num_latent_tokens), and C is embedding dimension. | required |
| attn_mask | Tensor | Attention mask tensor. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| out | Tensor | Output tensor of the same shape as the input, [B, N, C]. |
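
A minimal usage sketch, assuming RoPEAttention is importable from mpcompress.layers; dim and num_heads are forwarded to the parent Attention class, as documented for **kwargs above:

```python
import torch
from mpcompress.layers import RoPEAttention

attn = RoPEAttention(dim=768, num_heads=12,
                     num_prefix_tokens=1,     # e.g. a CLS token, no RoPE
                     num_image_tokens=256,    # 2D RoPE (e.g. a 16x16 grid)
                     num_latent_tokens=32,    # 1D RoPE
                     rope_mixed=True)         # learnable mixed frequencies

# Sequence length must be 1 + num_image_tokens + num_latent_tokens = 289.
x = torch.randn(2, 289, 768)                  # [B, N, C]
y = attn(x)                                   # [2, 289, 768]
```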

SubpelConv2x

SubpelConv2x(in_ch, out_ch, kernel_size, padding=0)

Sub-pixel convolution layer for 2x upsampling.

This layer performs 2x upsampling using sub-pixel convolution (also known as pixel shuffle). It uses a convolution followed by PixelShuffle to achieve efficient upsampling. Supports both PyTorch and CUDA implementations.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| in_ch | int | Number of input channels. | required |
| out_ch | int | Number of output channels. | required |
| kernel_size | int | Size of the convolution kernel. | required |
| padding | int | Padding size for the convolution. | 0 |

forward

forward(x, to_cat=None, cat_at_front=True)

Forward pass with optional tensor concatenation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input tensor of shape [B, C, H, W]. | required |
| to_cat | Tensor | Tensor to concatenate with the output. If None, only the upsampled output is returned. | None |
| cat_at_front | bool | If True, concatenate to_cat before the output; if False, after it. | True |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| out | Tensor | Upsampled tensor of shape [B, out_ch, 2H, 2W], or the concatenated tensor if to_cat is provided. |
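
A minimal usage sketch, assuming SubpelConv2x is importable from mpcompress.layers; the channel-dimension concatenation is an assumption, not stated above:

```python
import torch
from mpcompress.layers import SubpelConv2x

up = SubpelConv2x(in_ch=128, out_ch=64, kernel_size=3, padding=1)

x = torch.randn(1, 128, 16, 16)   # [B, C, H, W]
y = up(x)                         # [1, 64, 32, 32]: 2x upsampled

# Optional concatenation, per forward() above. Assumption: skip is placed
# before the upsampled output along the channel dim -> [1, 32 + 64, 32, 32].
skip = torch.randn(1, 32, 32, 32)
z = up(x, to_cat=skip, cat_at_front=True)
```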

forward_cuda

forward_cuda(x, to_cat=None, cat_at_front=True)

CUDA-optimized implementation of forward pass.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input tensor of shape [B, C, H, W]. | required |
| to_cat | Tensor | Tensor to concatenate with the output. | None |
| cat_at_front | bool | Concatenation order. | True |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| out | Tensor | Upsampled or concatenated tensor. |

forward_torch

forward_torch(x, to_cat=None, cat_at_front=True)

PyTorch implementation of forward pass.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input tensor of shape [B, C, H, W]. | required |
| to_cat | Tensor | Tensor to concatenate with the output. | None |
| cat_at_front | bool | Concatenation order. | True |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| out | Tensor | Upsampled or concatenated tensor. |