mpcompress.backbone

Dinov2OrgBackbone

Dinov2OrgBackbone(model_size='small', img_size=256, patch_size=16, dynamic_size=False, slot=-4, n_last_blocks=4, ckpt_path=None)

DINOv2 backbone using the original Facebook Research implementation.

This class extends the original DINOv2 model to provide flexible feature extraction. The `slot` parameter determines the splitting point for dividing the ViT blocks into:

  • encode part: `blocks[:slot]`
  • decode part: `blocks[slot:]`

Intermediate features are extracted after the encode part and before the decode part.
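
As an illustration (not library code), the snippet below shows how the default `slot=-4` would split a hypothetical 12-block ViT using ordinary Python slicing:

```python
# Illustration only: how slot=-4 partitions a hypothetical 12-block ViT.
blocks = list(range(12))      # stand-in for the list of transformer blocks
slot = -4
encode_part = blocks[:slot]   # blocks 0-7, run inside encode()
decode_part = blocks[slot:]   # blocks 8-11, run inside decode()
assert len(decode_part) == 4  # matches the default n_last_blocks=4
```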

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_size` | `str` | Model variant specification (`'small'`, `'base'`, `'large'`, `'giant'`). | `'small'` |
| `img_size` | `int` | Base input image size. | `256` |
| `patch_size` | `int` | Patch embedding size. | `16` |
| `dynamic_size` | `bool` | Whether to support dynamically varying input sizes. | `False` |
| `slot` | `int` | Block slicing position for feature extraction. Follows Python list slicing conventions; `-4` means the fourth block from the end. | `-4` |
| `n_last_blocks` | `int` | Number of final blocks to use for feature aggregation. | `4` |
| `ckpt_path` | `str` | Path to a pre-trained checkpoint for initialization. | `None` |
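
A minimal construction-and-inference sketch, assuming the class is importable from `mpcompress.backbone` as the page heading suggests; the input size and checkpoint handling are illustrative:

```python
import torch
from mpcompress.backbone import Dinov2OrgBackbone  # assumed import path

backbone = Dinov2OrgBackbone(
    model_size="small",
    img_size=256,
    patch_size=16,
    slot=-4,
    n_last_blocks=4,
    ckpt_path=None,  # or a path to a pre-trained checkpoint
)

x = torch.randn(1, 3, 256, 256)    # dummy image batch
feats = backbone(x, task="whole")  # list of token-sequence features
```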

decode

decode(h, token_res=None, task='whole')

Decode encoded features through the decoder part of the DINOv2 model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `h` | `Tensor` | Encoded features from the encoder. | *required* |
| `token_res` | `tuple` | Token resolution (H, W) for reshaping patch tokens. | `None` |
| `task` | `str` | Decoding task type. Must be one of `"whole"` (return full token sequences from multiple layers), `"cls"` (return class tokens and patch tokens separately), or `"seg"` (return patch tokens reshaped to 2D spatial format). | `'whole'` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `feats` | `list[Tensor, ...]` | Decoded features; format depends on `task`. |

encode

encode(x)

Encode input images through the encoder part of the DINOv2 model.

The encoding process applies input normalization, prepares tokens with masks, and processes the input through the encoder blocks (`blocks[:slot]`).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | Input images of shape (B, 3, H, W). | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `h` | `Tensor` | Encoded features after the encoder blocks. |
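
A sketch of splitting inference into the two halves, assuming `token_res` is the patch-token grid (height and width divided by `patch_size`) for the input resolution:

```python
x = torch.randn(1, 3, 256, 256)
h = backbone.encode(x)              # run blocks[:slot]
token_res = (256 // 16, 256 // 16)  # assumed patch-token grid for 256 px inputs, patch size 16
feats = backbone.decode(h, token_res=token_res, task="seg")
```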

forward

forward(x, task='whole')

Forward pass through the backbone.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | Input images of shape (B, 3, H, W). | *required* |
| `task` | `str` | Task type, one of `["whole", "cls", "seg"]`. | `'whole'` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `feats` | `list[Tensor, ...]` | Output features; format depends on `task`. |
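
The three task modes can be requested from the same backbone; a short sketch:

```python
x = torch.randn(2, 3, 256, 256)
whole_feats = backbone(x, task="whole")  # full token sequences from the last blocks
cls_feats = backbone(x, task="cls")      # class tokens and patch tokens separately
seg_feats = backbone(x, task="seg")      # patch tokens reshaped to 2D feature maps
```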

slide_decode_seg

slide_decode_seg(feature_list, slide_res)

Decode features from sliding window encoding for segmentation task.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `feature_list` | `list` | List of encoded features from `slide_encode`. | *required* |
| `slide_res` | `tuple` | Token resolution (H, W) for each crop. | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `multi_crop_feats` | `list[list[Tensor, ...]]` | List of decoded segmentation features, one per crop. |

slide_encode

slide_encode(img, slide_window, slide_stride)

Encode images using sliding window approach.

This method extracts features from overlapping image crops using a sliding window strategy. Useful for processing large images that don't fit in memory or for extracting features at multiple scales.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `img` | `Tensor` | Input image of shape (B, 3, H_img, W_img). | *required* |
| `slide_window` | `tuple` | Window size for cropping (h_crop, w_crop). | *required* |
| `slide_stride` | `tuple` | Stride for the sliding window (h_stride, w_stride). | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `multi_crop_feats` | `list[Tensor]` | List of encoded features, one per crop. |
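
A sliding-window sketch for a larger input, assuming the per-crop token resolution is the crop size divided by the patch size:

```python
img = torch.randn(1, 3, 512, 512)  # image larger than the training resolution
crop_feats = backbone.slide_encode(
    img,
    slide_window=(256, 256),   # crop size (h_crop, w_crop)
    slide_stride=(128, 128),   # stride between crops (h_stride, w_stride)
)
token_res = (256 // 16, 256 // 16)  # assumed token grid per crop
seg_feats = backbone.slide_decode_seg(crop_feats, slide_res=token_res)
```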

Dinov2TimmBackbone

Dinov2TimmBackbone(model_size='small', img_size=256, patch_size=16, dynamic_size=False, slot=-4, n_last_blocks=4, ckpt_path=None, device='cuda' if torch.cuda.is_available() else 'cpu', cast_dtype='float', with_registers=False)

This class extends the DINOv2 model to provide flexible feature extraction. The DINOv2 backbone implemented with timm supports variable patch sizes and dynamic input image sizes. The `slot` parameter determines the splitting point for dividing the ViT blocks into:

  • encode part: `blocks[:slot]`
  • decode part: `blocks[slot:]`

Intermediate features are extracted after the encode part and before the decode part.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_size` | `str` | Model variant specification (`'small'`, `'base'`, `'large'`, `'giant'`). | `'small'` |
| `img_size` | `int` | Base input image size. | `256` |
| `patch_size` | `int` | Patch embedding size. | `16` |
| `dynamic_size` | `bool` | Whether to support dynamically varying input sizes. | `False` |
| `slot` | `int or None` | Block slicing position for feature extraction. Follows Python list slicing conventions. | `-4` |
| `n_last_blocks` | `int` | Number of final blocks to use for feature aggregation. | `4` |
| `ckpt_path` | `str` | Path to a pre-trained checkpoint for initialization. | `None` |
| `cast_dtype` | `str or dtype` | Data type for autocast mixed precision. Accepts strings such as `"torch.float"`, `"torch.float16"`, or `"float32"`. | `'float'` |
| `device` | `str` | Device to run the model on. | `'cuda' if torch.cuda.is_available() else 'cpu'` |
| `with_registers` | `bool` | Whether to use register tokens in the model. | `False` |
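
A construction sketch for the timm variant, assuming the same import location as above; the chosen model size, dtype, and device are illustrative:

```python
import torch
from mpcompress.backbone import Dinov2TimmBackbone  # assumed import path

backbone = Dinov2TimmBackbone(
    model_size="base",
    img_size=256,
    patch_size=16,
    dynamic_size=True,           # allow input sizes other than img_size
    device="cuda" if torch.cuda.is_available() else "cpu",
    cast_dtype="torch.float16",  # autocast mixed-precision dtype
    with_registers=False,
)
```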

decode

decode(h, token_res=None, task='whole')

Decode encoded features through the decoder part of the DINOv2 model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `h` | `Tensor` | Encoded features from the encoder. | *required* |
| `token_res` | `tuple` | Token resolution (H, W) for reshaping patch tokens. | `None` |
| `task` | `str` | Decoding task type. Must be one of `"whole"` (return full token sequences from multiple layers), `"cls"` (return class tokens and patch tokens separately), or `"seg"` (return patch tokens reshaped to 2D spatial format). | `'whole'` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `feats` | `list[Tensor, ...]` | Decoded features; format depends on `task`. |

encode

encode(x)

Encode input images through the encoder part of the DINOv2 model.

The encoding process applies input normalization, patch embedding, and positional embedding, then processes the input through the encoder blocks (`blocks[:slot]`). The intermediate features are extracted after the encoder part and before the decoder part.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | Input images of shape (B, 3, H, W). | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `h` | `Tensor` | Encoded features after the encoder blocks. |

forward

forward(x, task='whole')

Forward pass through the backbone.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | Input images of shape (B, 3, H, W). | *required* |
| `task` | `str` | Task type, one of `["whole", "cls", "seg"]`. | `'whole'` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `feats` | `list[Tensor]` | Output features; format depends on `task`. |
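
With `dynamic_size=True`, forwarding an input at a resolution other than `img_size` should be possible; a small sketch (the specific resolution is illustrative):

```python
x = torch.randn(1, 3, 320, 320)  # non-default resolution, assuming dynamic_size=True
feats = backbone(x, task="cls")  # class tokens and patch tokens
```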

VqganBackbone

VqganBackbone(vqgan_config, **kwargs)

VQGAN-based backbone for image encoding and decoding.

This backbone uses a VQGAN (Vector Quantized Generative Adversarial Network) model to encode images into discrete tokens and decode them back to images. The encoding process converts images to latent codes and quantizes them using a codebook.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `vqgan_config` | `dict` | Configuration dictionary for the VQModel initialization. | *required* |
| `**kwargs` | `dict` | Unused keyword arguments, kept for API compatibility. | `{}` |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `vqgan` | `VQModel` | The underlying VQGAN model. |
| `codebook_size` | `int` | Size of the quantization codebook. |

decode

decode(z_q)

Decode quantized latent codes back to images.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `z_q` | `Tensor` | Quantized latent codes of shape (B, C, H, W). | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `x_hat` | `Tensor` | Reconstructed images of shape (B, 3, H, W) in range [0, 1]. |

encode

encode(x)

Encode input images into latent codes and tokens.

The input images `x` are expected to be in the range [0, 1]; they are transformed to [-1, 1] for the VQGAN encoder. The encoder produces latent codes that are quantized using the codebook to produce discrete tokens. The quantization process produces `z_q' = z + (z_q - z).detach()`, which incurs a small MSE error (approximately 1e-18) between `z_q'` and `z_q`. We use `z_q` as the context for consistency.
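
The straight-through trick described above can be sketched in isolation; the rounding quantizer below is a stand-in for the codebook lookup, not the library's implementation:

```python
import torch

z = torch.randn(2, 256, 16, 16, requires_grad=True)  # continuous latents
z_q = torch.round(z)                 # stand-in quantizer (a codebook lookup in VQGAN)
z_q_st = z + (z_q - z).detach()      # straight-through estimator: gradients flow to z
print(torch.mean((z_q_st - z_q) ** 2))  # negligible MSE between z_q' and z_q
```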

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | Input images of shape (B, 3, H, W) in range [0, 1]. | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `vqgan_enc` | `dict` | A dictionary containing the keys listed below. |

  • `"z"` (`torch.Tensor`): Continuous latent codes before quantization.
  • `"z_q"` (`torch.Tensor`): Quantized latent codes of shape (B, C, H, W).
  • `"tokens"` (`torch.Tensor`): Discrete token indices of shape (B, H, W).
  • `"shape"` (`tuple`): Spatial dimensions (H, W) of the latent representation.

tokens_to_features

tokens_to_features(tokens)

Convert discrete tokens to quantized latent features.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokens` | `Tensor` | Discrete token indices of shape (B, H, W). | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `z_q` | `Tensor` | Quantized latent features of shape (B, C, H, W). |
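
This allows reconstruction from tokens alone; a sketch continuing the encode example above:

```python
tokens = enc["tokens"]                           # (B, H, W) discrete indices
z_q = vqgan_backbone.tokens_to_features(tokens)  # (B, C, H, W) quantized latents
x_hat = vqgan_backbone.decode(z_q)               # decode images from tokens alone
```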