mpcompress.backbone

Dinov2OrgBackbone

Dinov2OrgBackbone(model_size='small', img_size=256, patch_size=16, dynamic_size=False, slot=-4, n_last_blocks=4, ckpt_path=None)

DINOv2 backbone using the original Facebook Research implementation.

This class extends the original DINOv2 model to provide flexible feature extraction. The `slot` parameter determines the splitting point for dividing the ViT blocks into:

  • encode part: `blocks[:slot]`
  • decode part: `blocks[slot:]`

Intermediate features are extracted after the encode part and before the decode part.
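
As an illustration (not library code), the snippet below shows how the default `slot=-4` would split a hypothetical 12-block ViT using ordinary Python slicing:

```python
# Illustration only: how slot=-4 partitions a hypothetical 12-block ViT.
blocks = list(range(12))      # stand-in for the list of transformer blocks
slot = -4
encode_part = blocks[:slot]   # blocks 0-7, run inside encode()
decode_part = blocks[slot:]   # blocks 8-11, run inside decode()
assert len(decode_part) == 4  # matches the default n_last_blocks=4
```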

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_size` | `str` | Model variant specification (`'small'`, `'base'`, `'large'`, `'giant'`). | `'small'` |
| `img_size` | `int` | Base input image size. | `256` |
| `patch_size` | `int` | Patch embedding size. | `16` |
| `dynamic_size` | `bool` | Whether to support dynamically varying input sizes. | `False` |
| `slot` | `int` | Block slicing position for feature extraction. Follows Python list slicing conventions; `-4` means the fourth block from the end. | `-4` |
| `n_last_blocks` | `int` | Number of final blocks to use for feature aggregation. | `4` |
| `ckpt_path` | `str` | Path to a pre-trained checkpoint for initialization. | `None` |
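
A minimal construction-and-inference sketch, assuming the class is importable from `mpcompress.backbone` as the page heading suggests; the input size and checkpoint handling are illustrative:

```python
import torch
from mpcompress.backbone import Dinov2OrgBackbone  # assumed import path

backbone = Dinov2OrgBackbone(
    model_size="small",
    img_size=256,
    patch_size=16,
    slot=-4,
    n_last_blocks=4,
    ckpt_path=None,  # or a path to a pre-trained checkpoint
)

x = torch.randn(1, 3, 256, 256)    # dummy image batch
feats = backbone(x, task="whole")  # list of token-sequence features
```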

decode

decode(h, token_res=None, task='whole')

Decode encoded features through the decoder part of the DINOv2 model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `h` | `Tensor` | Encoded features from the encoder. | *required* |
| `token_res` | `tuple` | Token resolution (H, W) for reshaping patch tokens. | `None` |
| `task` | `str` | Decoding task type. Must be one of `"whole"` (return full token sequences from multiple layers), `"cls"` (return class tokens and patch tokens separately), or `"seg"` (return patch tokens reshaped to 2D spatial format). | `'whole'` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `feats` | `list[Tensor, ...]` | Decoded features; format depends on `task`. |

encode

encode(x)

Encode input images through the encoder part of the DINOv2 model.

The encoding process applies input normalization, prepares tokens with masks, and processes the input through the encoder blocks (`blocks[:slot]`).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | Input images of shape (B, 3, H, W). | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `h` | `Tensor` | Encoded features after the encoder blocks. |
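
A sketch of splitting inference into the two halves, assuming `token_res` is the patch-token grid (height and width divided by `patch_size`) for the input resolution:

```python
x = torch.randn(1, 3, 256, 256)
h = backbone.encode(x)              # run blocks[:slot]
token_res = (256 // 16, 256 // 16)  # assumed patch-token grid for 256 px inputs, patch size 16
feats = backbone.decode(h, token_res=token_res, task="seg")
```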

forward

forward(x, task='whole')

Forward pass through the backbone.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | Input images of shape (B, 3, H, W). | *required* |
| `task` | `str` | Task type, one of `["whole", "cls", "seg"]`. | `'whole'` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `feats` | `list[Tensor, ...]` | Output features; format depends on `task`. |
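
The three task modes can be requested from the same backbone; a short sketch:

```python
x = torch.randn(2, 3, 256, 256)
whole_feats = backbone(x, task="whole")  # full token sequences from the last blocks
cls_feats = backbone(x, task="cls")      # class tokens and patch tokens separately
seg_feats = backbone(x, task="seg")      # patch tokens reshaped to 2D feature maps
```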

slide_decode_seg

slide_decode_seg(feature_list, slide_res)

Decode features from sliding window encoding for segmentation task.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `feature_list` | `list` | List of encoded features from `slide_encode`. | *required* |
| `slide_res` | `tuple` | Token resolution (H, W) for each crop. | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `multi_crop_feats` | `list[list[Tensor, ...]]` | List of decoded segmentation features, one per crop. |

slide_encode

slide_encode(img, slide_window, slide_stride)

Encode images using sliding window approach.

This method extracts features from overlapping image crops using a sliding window strategy. Useful for processing large images that don't fit in memory or for extracting features at multiple scales.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `img` | `Tensor` | Input image of shape (B, 3, H_img, W_img). | *required* |
| `slide_window` | `tuple` | Window size for cropping (h_crop, w_crop). | *required* |
| `slide_stride` | `tuple` | Stride for the sliding window (h_stride, w_stride). | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `multi_crop_feats` | `list[Tensor]` | List of encoded features, one per crop. |
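
A sliding-window sketch for a larger input, assuming the per-crop token resolution is the crop size divided by the patch size:

```python
img = torch.randn(1, 3, 512, 512)  # image larger than the training resolution
crop_feats = backbone.slide_encode(
    img,
    slide_window=(256, 256),   # crop size (h_crop, w_crop)
    slide_stride=(128, 128),   # stride between crops (h_stride, w_stride)
)
token_res = (256 // 16, 256 // 16)  # assumed token grid per crop
seg_feats = backbone.slide_decode_seg(crop_feats, slide_res=token_res)
```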

Dinov2TimmBackbone

Dinov2TimmBackbone(model_size='small', img_size=256, patch_size=16, dynamic_size=False, slot=-4, n_last_blocks=4, ckpt_path=None, device='cuda' if torch.cuda.is_available() else 'cpu', cast_dtype='float', with_registers=False)

This class extends the DINOv2 model to provide flexible feature extraction. The DINOv2 backbone implemented with timm supports variable patch sizes and dynamic input image sizes. The `slot` parameter determines the splitting point for dividing the ViT blocks into:

  • encode part: `blocks[:slot]`
  • decode part: `blocks[slot:]`

Intermediate features are extracted after the encode part and before the decode part.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_size` | `str` | Model variant specification (`'small'`, `'base'`, `'large'`, `'giant'`). | `'small'` |
| `img_size` | `int` | Base input image size. | `256` |
| `patch_size` | `int` | Patch embedding size. | `16` |
| `dynamic_size` | `bool` | Whether to support dynamically varying input sizes. | `False` |
| `slot` | `int or None` | Block slicing position for feature extraction. Follows Python list slicing conventions. | `-4` |
| `n_last_blocks` | `int` | Number of final blocks to use for feature aggregation. | `4` |
| `ckpt_path` | `str` | Path to a pre-trained checkpoint for initialization. | `None` |
| `cast_dtype` | `str or dtype` | Data type for autocast mixed precision. Accepts strings such as `"torch.float"`, `"torch.float16"`, or `"float32"`. | `'float'` |
| `device` | `str` | Device to run the model on. | `'cuda' if torch.cuda.is_available() else 'cpu'` |
| `with_registers` | `bool` | Whether to use register tokens in the model. | `False` |
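
A construction sketch for the timm variant, assuming the same import location as above; the chosen model size, dtype, and device are illustrative:

```python
import torch
from mpcompress.backbone import Dinov2TimmBackbone  # assumed import path

backbone = Dinov2TimmBackbone(
    model_size="base",
    img_size=256,
    patch_size=16,
    dynamic_size=True,           # allow input sizes other than img_size
    device="cuda" if torch.cuda.is_available() else "cpu",
    cast_dtype="torch.float16",  # autocast mixed-precision dtype
    with_registers=False,
)
```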

decode

decode(h, token_res=None, task='whole')

Decode encoded features through the decoder part of the DINOv2 model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `h` | `Tensor` | Encoded features from the encoder. | *required* |
| `token_res` | `tuple` | Token resolution (H, W) for reshaping patch tokens. | `None` |
| `task` | `str` | Decoding task type. Must be one of `"whole"` (return full token sequences from multiple layers), `"cls"` (return class tokens and patch tokens separately), or `"seg"` (return patch tokens reshaped to 2D spatial format). | `'whole'` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `feats` | `list[Tensor, ...]` | Decoded features; format depends on `task`. |

encode

encode(x)

Encode input images through the encoder part of the DINOv2 model.

The encoding process applies input normalization, patch embedding, and positional embedding, then processes the input through the encoder blocks (`blocks[:slot]`). The intermediate features are extracted after the encoder part and before the decoder part.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | Input images of shape (B, 3, H, W). | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `h` | `Tensor` | Encoded features after the encoder blocks. |

forward

forward(x, task='whole')

Forward pass through the backbone.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | Input images of shape (B, 3, H, W). | *required* |
| `task` | `str` | Task type, one of `["whole", "cls", "seg"]`. | `'whole'` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `feats` | `list[Tensor]` | Output features; format depends on `task`. |
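
With `dynamic_size=True`, forwarding an input at a resolution other than `img_size` should be possible; a small sketch (the specific resolution is illustrative):

```python
x = torch.randn(1, 3, 320, 320)  # non-default resolution, assuming dynamic_size=True
feats = backbone(x, task="cls")  # class tokens and patch tokens
```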

VqganBackbone

VqganBackbone(vqgan_config, **kwargs)

VQGAN-based backbone for image encoding and decoding.

This backbone uses a VQGAN (Vector Quantized Generative Adversarial Network) model to encode images into discrete tokens and decode them back to images. The encoding process converts images to latent codes and quantizes them using a codebook.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `vqgan_config` | `dict` | Configuration dictionary for the VQModel initialization. | *required* |
| `**kwargs` | `dict` | Unused keyword arguments, kept for API compatibility. | `{}` |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `vqgan` | `VQModel` | The underlying VQGAN model. |
| `codebook_size` | `int` | Size of the quantization codebook. |

decode

decode(z_q)

Decode quantized latent codes back to images.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `z_q` | `Tensor` | Quantized latent codes of shape (B, C, H, W). | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `x_hat` | `Tensor` | Reconstructed images of shape (B, 3, H, W) in range [0, 1]. |

encode

encode(x)

Encode input images into latent codes and tokens.

The input images `x` are expected to be in the range [0, 1]; they are transformed to [-1, 1] for the VQGAN encoder. The encoder produces latent codes that are quantized using the codebook to produce discrete tokens. The quantization process produces `z_q' = z + (z_q - z).detach()`, which incurs a small MSE error (approximately 1e-18) between `z_q'` and `z_q`. We use `z_q` as the context for consistency.
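
The straight-through trick described above can be sketched in isolation; the rounding quantizer below is a stand-in for the codebook lookup, not the library's implementation:

```python
import torch

z = torch.randn(2, 256, 16, 16, requires_grad=True)  # continuous latents
z_q = torch.round(z)                 # stand-in quantizer (a codebook lookup in VQGAN)
z_q_st = z + (z_q - z).detach()      # straight-through estimator: gradients flow to z
print(torch.mean((z_q_st - z_q) ** 2))  # negligible MSE between z_q' and z_q
```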

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | Input images of shape (B, 3, H, W) in range [0, 1]. | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `vqgan_enc` | `dict` | A dictionary containing the keys listed below. |

  • `"z"` (`torch.Tensor`): Continuous latent codes before quantization.
  • `"z_q"` (`torch.Tensor`): Quantized latent codes of shape (B, C, H, W).
  • `"tokens"` (`torch.Tensor`): Discrete token indices of shape (B, H, W).
  • `"shape"` (`tuple`): Spatial dimensions (H, W) of the latent representation.

tokens_to_features

tokens_to_features(tokens)

Convert discrete tokens to quantized latent features.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokens` | `Tensor` | Discrete token indices of shape (B, H, W). | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `z_q` | `Tensor` | Quantized latent features of shape (B, C, H, W). |
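
This allows reconstruction from tokens alone; a sketch continuing the encode example above:

```python
tokens = enc["tokens"]                           # (B, H, W) discrete indices
z_q = vqgan_backbone.tokens_to_features(tokens)  # (B, C, H, W) quantized latents
x_hat = vqgan_backbone.decode(z_q)               # decode images from tokens alone
```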