mpcompress.layers
Attention
Attention(dim: int, num_heads: int = 8, qkv_bias: bool = False, qk_norm: bool = False, attn_drop: float = 0.0, proj_drop: float = 0.0, norm_layer: Module = nn.LayerNorm, **kwargs)
Multi-head self-attention mechanism.
This module implements scaled dot-product attention with optional QK normalization and fused attention support. It computes attention over the input sequence using query, key, and value projections.
The attention mechanism follows: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where d_k is the head dimension.
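A minimal usage sketch, assuming the module can be imported as `mpcompress.layers.Attention` (import path assumed) and behaves as documented below:

```python
import torch
from mpcompress.layers import Attention  # import path assumed

# 196 tokens of dimension 768 split across 12 heads (64 dims per head)
attn = Attention(dim=768, num_heads=12, qkv_bias=True)

x = torch.randn(2, 196, 768)            # [B, N, C]
mask = torch.zeros(2, 196, 196)         # additive mask; 0 keeps a position visible
mask[:, :, 100:] = float("-inf")        # hide tokens 100..195 from every query

out = attn(x, attn_mask=mask)
assert out.shape == x.shape             # [B, N, C] is preserved
```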
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dim` | `int` | Embedding dimension of the input tokens. Must be divisible by `num_heads`. | required |
| `num_heads` | `int` | Number of attention heads. | `8` |
| `qkv_bias` | `bool` | Whether to use bias in the QKV projection. | `False` |
| `qk_norm` | `bool` | Whether to apply normalization to Q and K. | `False` |
| `attn_drop` | `float` | Dropout probability for the attention weights. | `0.0` |
| `proj_drop` | `float` | Dropout probability for the output projection. | `0.0` |
| `norm_layer` | `Module` | Normalization layer used for QK normalization. | `nn.LayerNorm` |
| `**kwargs` | | Additional keyword arguments (unused). | `{}` |
forward
forward(x: Tensor, attn_mask: Tensor = None) -> torch.Tensor
Forward pass through attention layer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor of shape `[B, N, C]`, where B is the batch size, N is the sequence length, and C is the embedding dimension. | required |
| `attn_mask` | `Tensor` | Attention mask of shape `[B, N, N]` or any broadcastable shape. Values are added to the attention scores before the softmax. | `None` |
Returns:
| Type | Description |
|---|---|
| `torch.Tensor` | Output tensor of the same shape as the input, `[B, N, C]`. |
Block
Block(dim: int, num_heads: int, mlp_ratio: float = 4.0, qkv_bias: bool = False, qk_norm: bool = False, proj_drop: float = 0.0, attn_drop: float = 0.0, init_values: Optional[float] = None, drop_path: float = 0.0, act_layer: Module = nn.GELU, norm_layer: Module = nn.LayerNorm, mlp_layer: Module = Mlp, attn_layer: Module = Attention)
Vision Transformer block with attention and MLP layers.
This block implements a standard Transformer block for Vision Transformers, consisting of:
- Multi-head self-attention with optional layer scaling
- Feed-forward MLP with optional layer scaling
- Residual connections with optional drop path regularization
- Layer normalization before each sub-layer
The block follows the architecture: x = x + DropPath(LayerScale(Attn(Norm(x)))), followed by x = x + DropPath(LayerScale(MLP(Norm(x)))).
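In code, this structure can be sketched roughly as follows; the attribute names (`norm1`, `attn`, `ls1`, `drop_path1`, and so on) are assumptions for illustration, not the module's actual attributes:

```python
# Simplified sketch of the pre-norm residual structure described above
# (attribute names are assumptions, not the actual implementation):
def block_forward(self, x, attn_mask=None):
    # attention sub-layer: pre-norm, optional LayerScale, residual with DropPath
    x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x), attn_mask=attn_mask)))
    # MLP sub-layer: same pattern
    x = x + self.drop_path2(self.ls2(self.mlp(self.norm2(x))))
    return x
```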
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dim` | `int` | Embedding dimension of the input tokens. | required |
| `num_heads` | `int` | Number of attention heads. | required |
| `mlp_ratio` | `float` | Ratio of the MLP hidden dimension to the embedding dimension. | `4.0` |
| `qkv_bias` | `bool` | Whether to use bias in the QKV projection. | `False` |
| `qk_norm` | `bool` | Whether to apply normalization to Q and K. | `False` |
| `proj_drop` | `float` | Dropout probability for the projection layers. | `0.0` |
| `attn_drop` | `float` | Dropout probability for the attention weights. | `0.0` |
| `init_values` | `Optional[float]` | Initial value for layer scaling. If None, layer scaling is disabled. | `None` |
| `drop_path` | `float` | Drop path probability for stochastic depth. | `0.0` |
| `act_layer` | `Module` | Activation function for the MLP. | `nn.GELU` |
| `norm_layer` | `Module` | Normalization layer to use. | `nn.LayerNorm` |
| `mlp_layer` | `Module` | MLP layer class to use. | `Mlp` |
| `attn_layer` | `Module` | Attention layer class to use. | `Attention` |
forward
forward(x: Tensor, attn_mask: Tensor = None) -> torch.Tensor
Forward pass through the Transformer block.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor of shape `[B, N, C]`, where B is the batch size, N is the sequence length, and C is the embedding dimension. | required |
| `attn_mask` | `Tensor` | Attention mask tensor. If provided, it is applied in the attention computation. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | `torch.Tensor` | Output tensor of the same shape as the input, `[B, N, C]`. |
DepthConvBlock
DepthConvBlock(in_ch, out_ch, shortcut=False, force_adaptor=False)
Depthwise convolution block with feed-forward network.
This block implements a residual block using depthwise separable convolutions and a feed-forward network. It consists of:
- Optional channel adaptor (1x1 conv) for dimension matching
- Depthwise convolution path with residual connection
- Feed-forward network with residual connection
- Optional shortcut connection from input
- Optional quantization step scaling
- Optional tensor concatenation
Supports both PyTorch and CUDA implementations for efficient inference.
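A minimal usage sketch, assuming the import path shown below; the `quant_step` broadcast shape and the channel-wise concatenation axis are assumptions for illustration:

```python
import torch
from mpcompress.layers import DepthConvBlock  # import path assumed

block = DepthConvBlock(in_ch=64, out_ch=128)

x = torch.randn(1, 64, 32, 32)                 # [B, C, H, W]
y = block(x)                                   # [1, 128, 32, 32]

# Optional quantization-step scaling and concatenation with another feature map
quant_step = torch.ones(1, 128, 1, 1)          # broadcastable scale (shape assumed)
skip = torch.randn(1, 32, 32, 32)
y_cat = block(x, quant_step=quant_step, to_cat=skip, cat_at_front=True)
# assuming channel-wise concatenation with skip placed first: [1, 32 + 128, 32, 32]
```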
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `in_ch` | | Number of input channels. | required |
| `out_ch` | | Number of output channels. | required |
| `shortcut` | | Whether to add a shortcut connection from the input to the final output. | `False` |
| `force_adaptor` | | Whether to force use of the channel adaptor even when `in_ch == out_ch`. | `False` |
forward
forward(x, quant_step=None, to_cat=None, cat_at_front=True)
Forward pass with optional quantization and concatenation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, C, H, W]`. | required |
| `quant_step` | | Quantization step used to scale the output. If provided, the output is multiplied by `quant_step`. | `None` |
| `to_cat` | | Tensor to concatenate with the output. | `None` |
| `cat_at_front` | | If True, `to_cat` is placed before the output in the concatenation; if False, after it. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Processed tensor of shape `[B, out_ch, H, W]`, or the concatenated tensor if `to_cat` is provided. |
forward_cuda
forward_cuda(x, quant_step=None, to_cat=None, cat_at_front=True)
CUDA-optimized implementation of forward pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, C, H, W]`. | required |
| `quant_step` | | Quantization step used to scale the output. | `None` |
| `to_cat` | | Tensor to concatenate with the output. | `None` |
| `cat_at_front` | | Concatenation order. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Processed or concatenated tensor. |
forward_torch
forward_torch(x, quant_step=None, to_cat=None, cat_at_front=True)
PyTorch implementation of forward pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, C, H, W]`. | required |
| `quant_step` | | Quantization step used to scale the output. | `None` |
| `to_cat` | | Tensor to concatenate with the output. | `None` |
| `cat_at_front` | | Concatenation order. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Processed or concatenated tensor. |
LayerScale
LayerScale(dim: int, init_values: float = 1e-05, inplace: bool = False)
Layer scaling module for stabilizing deep networks.
This module scales the input by a learnable parameter gamma. It is commonly used in Vision Transformers to stabilize training of very deep networks. The scaling factor is initialized to a small value (e.g., 1e-5) and learned during training.
Reference: "Going deeper with Image Transformers" (Touvron et al., 2021)
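The operation amounts to an elementwise multiplication by a learnable per-channel vector. A minimal, illustrative re-statement of the behavior described above (not the module's actual source):

```python
import torch
import torch.nn as nn

class LayerScaleSketch(nn.Module):
    """Illustrative sketch of the scaling described above."""
    def __init__(self, dim: int, init_values: float = 1e-5, inplace: bool = False):
        super().__init__()
        self.inplace = inplace
        self.gamma = nn.Parameter(init_values * torch.ones(dim))  # one scale per channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # broadcasting scales the last dimension of x by gamma
        return x.mul_(self.gamma) if self.inplace else x * self.gamma
```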
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dim` | `int` | Dimension of the input tensor (last dimension). | required |
| `init_values` | `float` | Initial value of the scaling parameter gamma. | `1e-05` |
| `inplace` | `bool` | Whether to perform the multiplication in place. | `False` |
forward
forward(x: Tensor) -> torch.Tensor
Scale input tensor by learnable parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor of shape `[..., dim]`, where `dim` matches the dimension used at initialization. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | `torch.Tensor` | Scaled tensor of the same shape as the input. |
ResidualBlockUpsample
ResidualBlockUpsample(in_ch, out_ch)
Residual block with 2x upsampling.
This block performs 2x spatial upsampling followed by depthwise convolution processing. It combines a sub-pixel convolution for upsampling with a DepthConvBlock for feature refinement.
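A short shape-check sketch, assuming the import path shown below:

```python
import torch
from mpcompress.layers import ResidualBlockUpsample  # import path assumed

up = ResidualBlockUpsample(in_ch=128, out_ch=64)
x = torch.randn(1, 128, 16, 16)
y = up(x)
assert y.shape == (1, 64, 32, 32)   # [B, out_ch, 2H, 2W]
```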
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `in_ch` | | Number of input channels. | required |
| `out_ch` | | Number of output channels. | required |
forward
forward(x)
Forward pass with 2x upsampling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, in_ch, H, W]`. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Upsampled and processed tensor of shape `[B, out_ch, 2H, 2W]`. |
ResidualBlockWithStride2
ResidualBlockWithStride2(in_ch, out_ch)
Residual block with 2x downsampling.
This block performs 2x spatial downsampling followed by depthwise convolution processing. It combines a strided convolution for downsampling with a DepthConvBlock for feature refinement.
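Paired with ResidualBlockUpsample, this block forms a simple round trip over the spatial resolution; a brief sketch, assuming the import path shown below:

```python
import torch
from mpcompress.layers import ResidualBlockWithStride2, ResidualBlockUpsample  # import path assumed

down = ResidualBlockWithStride2(in_ch=3, out_ch=64)
up = ResidualBlockUpsample(in_ch=64, out_ch=3)

x = torch.randn(1, 3, 64, 64)
z = down(x)                          # [1, 64, 32, 32]
x_hat = up(z)                        # [1, 3, 64, 64]  spatial size restored
```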
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `in_ch` | | Number of input channels. | required |
| `out_ch` | | Number of output channels. | required |
forward
forward(x)
Forward pass with 2x downsampling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, in_ch, H, W]`. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Downsampled and processed tensor of shape `[B, out_ch, H//2, W//2]`. |
RoPEAttention
RoPEAttention(*args, num_prefix_tokens=1, num_latent_tokens=32, num_image_tokens=256, rope_theta=10.0, rope_mixed=True, **kwargs)
Multi-head attention with rotary position embeddings (RoPE).
This attention mechanism extends standard multi-head attention by applying rotary position embeddings to query and key vectors. It supports two modes:
- Mixed mode: Learnable 2D frequencies for image tokens and 1D frequencies for latent tokens
- Axial mode: Fixed 2D axial frequencies for image tokens
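A minimal usage sketch with the default token layout (1 prefix token + 256 image tokens + 32 latent tokens = 289 tokens); the import path and the forwarding of positional arguments to the parent Attention class are assumptions based on the signature above:

```python
import torch
from mpcompress.layers import RoPEAttention  # import path assumed

attn = RoPEAttention(
    768,                    # dim, forwarded to the parent Attention class
    num_heads=12,
    num_prefix_tokens=1,    # e.g. a CLS token, no positional embedding
    num_image_tokens=256,   # e.g. a 16x16 grid of patch tokens (2D RoPE)
    num_latent_tokens=32,   # latent tokens (1D RoPE)
    rope_mixed=True,        # learnable mixed 2D frequencies
)

x = torch.randn(2, 1 + 256 + 32, 768)   # [B, N, C]
out = attn(x)
assert out.shape == x.shape
```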
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*args` | | Positional arguments passed to the parent `Attention` class. | `()` |
| `num_prefix_tokens` | | Number of prefix tokens (e.g., the CLS token) that do not receive positional embeddings. | `1` |
| `num_latent_tokens` | | Number of latent tokens that receive 1D positional embeddings. | `32` |
| `num_image_tokens` | | Number of image tokens that receive 2D positional embeddings. | `256` |
| `rope_theta` | | Base frequency parameter for RoPE. Higher values result in lower frequencies. | `10.0` |
| `rope_mixed` | | If True, use learnable mixed 2D frequencies; if False, use fixed axial 2D frequencies. | `True` |
| `**kwargs` | | Additional keyword arguments passed to the parent `Attention` class. | `{}` |
forward
forward(x, attn_mask=None)
Forward pass with rotary position embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, N, C]`, where B is the batch size, N is the sequence length (1 + num_image_tokens + num_latent_tokens), and C is the embedding dimension. | required |
| `attn_mask` | | Attention mask tensor. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Output tensor of the same shape as the input, `[B, N, C]`. |
SubpelConv2x
SubpelConv2x(in_ch, out_ch, kernel_size, padding=0)
Sub-pixel convolution layer for 2x upsampling.
This layer performs 2x upsampling using sub-pixel convolution (also known as pixel shuffle). It uses a convolution followed by PixelShuffle to achieve efficient upsampling. Supports both PyTorch and CUDA implementations.
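The PyTorch path described above (a convolution followed by PixelShuffle) can be sketched as an equivalent Sequential; this is an illustrative equivalent, not the module's actual source:

```python
import torch
import torch.nn as nn

def subpel_conv2x_sketch(in_ch: int, out_ch: int, kernel_size: int, padding: int = 0) -> nn.Sequential:
    """Illustrative conv + PixelShuffle pipeline for 2x upsampling."""
    return nn.Sequential(
        # produce 4x the output channels, then rearrange them into a 2x2 spatial block
        nn.Conv2d(in_ch, out_ch * 4, kernel_size, padding=padding),
        nn.PixelShuffle(2),
    )

x = torch.randn(1, 64, 16, 16)
y = subpel_conv2x_sketch(64, 32, kernel_size=3, padding=1)(x)
assert y.shape == (1, 32, 32, 32)    # [B, out_ch, 2H, 2W]
```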
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `in_ch` | | Number of input channels. | required |
| `out_ch` | | Number of output channels. | required |
| `kernel_size` | | Size of the convolution kernel. | required |
| `padding` | | Padding size for the convolution. | `0` |
forward
forward(x, to_cat=None, cat_at_front=True)
Forward pass with optional tensor concatenation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, C, H, W]`. | required |
| `to_cat` | | Tensor to concatenate with the output. If None, only the upsampled output is returned. | `None` |
| `cat_at_front` | | If True, `to_cat` is placed before the output in the concatenation; if False, after it. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Upsampled tensor of shape `[B, out_ch, 2H, 2W]`, or the concatenated tensor if `to_cat` is provided. |
forward_cuda
forward_cuda(x, to_cat=None, cat_at_front=True)
CUDA-optimized implementation of forward pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, C, H, W]`. | required |
| `to_cat` | | Tensor to concatenate with the output. | `None` |
| `cat_at_front` | | Concatenation order. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Upsampled or concatenated tensor. |
forward_torch
forward_torch(x, to_cat=None, cat_at_front=True)
PyTorch implementation of forward pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | | Input tensor of shape `[B, C, H, W]`. | required |
| `to_cat` | | Tensor to concatenate with the output. | `None` |
| `cat_at_front` | | Concatenation order. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `out` | | Upsampled or concatenated tensor. |