mpcompress.backbone
Dinov2OrgBackbone
Dinov2OrgBackbone(model_size='small', img_size=256, patch_size=16, dynamic_size=False, slot=-4, n_last_blocks=4, ckpt_path=None)
DINOv2 backbone using the original Facebook Research implementation.
This class extends the original DINOv2 model to provide flexible feature extraction.
The slot parameter determines the splitting point for dividing the ViT blocks into:
- encode part: blocks[:slot],
- decode part: blocks[slot:]
Intermediate features are extracted after the encode part and before the decode part.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_size | | Model variant specification ('small', 'base', 'large', 'giant'). Defaults to 'small'. | 'small' |
| img_size | | Base input image size. Defaults to 256. | 256 |
| patch_size | | Patch embedding size. Defaults to 16. | 16 |
| dynamic_size | | Whether to support dynamically varying input sizes. Defaults to False. | False |
| slot | | Block slicing position for feature extraction. Follows Python list slicing conventions; -4 means the fourth block from the end. Defaults to -4. | -4 |
| n_last_blocks | | Number of final blocks to utilize for feature aggregation. Defaults to 4. | 4 |
| ckpt_path | | Path to a pre-trained checkpoint for initialization. Defaults to None. | None |
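A minimal construction-and-inference sketch. The import path and the tensor shapes shown are assumptions; the constructor arguments and the forward(x, task=...) call follow the signatures documented here.

```python
import torch
from mpcompress.backbone import Dinov2OrgBackbone  # import path assumed

# Build the backbone with the documented defaults.
backbone = Dinov2OrgBackbone(
    model_size="small",
    img_size=256,
    patch_size=16,
    dynamic_size=False,
    slot=-4,           # encode part: blocks[:-4], decode part: blocks[-4:]
    n_last_blocks=4,
    ckpt_path=None,    # optionally a path to a pre-trained checkpoint
)

x = torch.randn(2, 3, 256, 256)  # a batch of two 256x256 RGB images

with torch.no_grad():
    feats = backbone(x, task="whole")  # output format depends on the task
```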
decode
decode(h, token_res=None, task='whole')
Decode encoded features through the decoder part of the DINOv2 model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| h | | Encoded features from the encoder. | required |
| token_res | | Token resolution (H, W) for reshaping patch tokens. Defaults to None. | None |
| task | | Decoding task type; must be one of the task types accepted by forward ("whole", "cls", "seg"). Defaults to 'whole'. | 'whole' |
Returns:
| Name | Type | Description |
|---|---|---|
| feats | | Decoded features, format depends on task. |
encode
encode(x)
Encode input images through the encoder part of the DINOv2 model.
The encoding process applies input normalization, prepares tokens with masks,
and processes the input through the encoder blocks (blocks[:slot]).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | | Input images of shape (B, 3, H, W). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| h | | Encoded features after the encoder blocks. |
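Continuing from the construction sketch above, a two-step sketch of the documented encode/decode split (whether this path reproduces forward(x) exactly is an assumption):

```python
with torch.no_grad():
    h = backbone.encode(x)                    # run the encode part, blocks[:slot]
    feats = backbone.decode(h, task="whole")  # run the decode part, blocks[slot:]
```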
forward
forward(x, task='whole')
Forward pass through the backbone.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | | Input images of shape (B, 3, H, W). | required |
| task | | Task type, one of ["whole", "cls", "seg"]. Defaults to "whole". | 'whole' |
Returns:
| Name | Type | Description |
|---|---|---|
| feats | | Output features, format depends on task. |
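The three task strings come from the signature above; what each variant returns structurally is not specified here, so the comments below are assumptions. Continuing from the construction sketch above:

```python
with torch.no_grad():
    whole_feats = backbone(x, task="whole")  # generic feature output
    cls_feats = backbone(x, task="cls")      # features for classification heads
    seg_feats = backbone(x, task="seg")      # features for segmentation heads
```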
slide_decode_seg
slide_decode_seg(feature_list, slide_res)
Decode features from sliding window encoding for segmentation task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_list | | List of encoded features from slide_encode. | required |
| slide_res | | Token resolution (H, W) for each crop. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| multi_crop_feats | | List of decoded segmentation features, one for each crop. |
slide_encode
slide_encode(img, slide_window, slide_stride)
Encode images using sliding window approach.
This method extracts features from overlapping image crops using a sliding window strategy. Useful for processing large images that don't fit in memory or for extracting features at multiple scales.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| img | | Input image of shape (B, 3, H_img, W_img). | required |
| slide_window | | Window size for cropping (h_crop, w_crop). | required |
| slide_stride | | Stride for sliding window (h_stride, w_stride). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| multi_crop_feats | | List of encoded features, one for each crop. |
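A sliding-window sketch combining slide_encode and slide_decode_seg, continuing from the construction sketch above. The window/stride values and the per-crop token resolution of 256 / 16 = 16 are assumptions based on the documented patch size.

```python
img = torch.randn(1, 3, 512, 768)  # an image larger than the 256x256 base size

with torch.no_grad():
    crop_feats = backbone.slide_encode(
        img,
        slide_window=(256, 256),  # crop size (h_crop, w_crop)
        slide_stride=(192, 192),  # 64-pixel overlap between neighbouring crops
    )
    # Per-crop token resolution for a 256-pixel crop with 16-pixel patches.
    seg_feats = backbone.slide_decode_seg(crop_feats, slide_res=(16, 16))
```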
Dinov2TimmBackbone
Dinov2TimmBackbone(model_size='small', img_size=256, patch_size=16, dynamic_size=False, slot=-4, n_last_blocks=4, ckpt_path=None, device='cuda' if torch.cuda.is_available() else 'cpu', cast_dtype='float', with_registers=False)
This class extends the DINOv2 model to provide flexible feature extraction.
The DINOv2 backbone implemented with timm supports variable patch sizes and dynamic input image sizes.
The slot parameter determines the splitting point for dividing the ViT blocks into:
- encode part: blocks[:slot],
- decode part: blocks[slot:]
Intermediate features are extracted after the encode part and before the decode part.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_size | | Model variant specification ('small', 'base', 'large', 'giant'). Defaults to 'small'. | 'small' |
| img_size | | Base input image size. Defaults to 256. | 256 |
| patch_size | | Patch embedding size. Defaults to 16. | 16 |
| dynamic_size | | Whether to support dynamically varying input sizes. Defaults to False. | False |
| slot | | Block slicing position for feature extraction. Follows Python list slicing conventions. Defaults to -4. | -4 |
| n_last_blocks | | Number of final blocks to utilize for feature aggregation. Defaults to 4. | 4 |
| ckpt_path | | Path to a pre-trained checkpoint for initialization. Defaults to None. | None |
| cast_dtype | | Data type for autocast mixed precision. Supports string formats like "torch.float", "torch.float16", "float32", etc. Defaults to "torch.float". | 'float' |
| device | | Device to run the model on. Defaults to "cuda" if available, else "cpu". | 'cuda' if torch.cuda.is_available() else 'cpu' |
| with_registers | | Whether to use register tokens in the model. Defaults to False. | False |
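A construction sketch for the timm variant. The import path is an assumption; the arguments follow the signature above.

```python
import torch
from mpcompress.backbone import Dinov2TimmBackbone  # import path assumed

backbone = Dinov2TimmBackbone(
    model_size="base",
    img_size=256,
    patch_size=16,
    dynamic_size=True,           # accept input sizes other than img_size
    slot=-4,
    n_last_blocks=4,
    ckpt_path=None,
    device="cuda" if torch.cuda.is_available() else "cpu",
    cast_dtype="torch.float16",  # one of the documented string formats
    with_registers=False,
)
```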
decode
decode(h, token_res=None, task='whole')
Decode encoded features through the decoder part of the DINOv2 model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| h | | Encoded features from the encoder. | required |
| token_res | | Token resolution (H, W) for reshaping patch tokens. Defaults to None. | None |
| task | | Decoding task type; must be one of the task types accepted by forward ("whole", "cls", "seg"). Defaults to 'whole'. | 'whole' |
Returns:
| Name | Type | Description |
|---|---|---|
| feats | | Decoded features, format depends on task. |
encode
encode(x)
Encode input images through the encoder part of the DINOv2 model.
The encoding process applies input normalization, patch embedding, positional
embedding, and processes the input through the encoder blocks (blocks[:slot]).
The intermediate features are extracted after the encoder part and before
the decoder part.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | | Input images of shape (B, 3, H, W). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| h | | Encoded features after the encoder blocks. |
forward
forward(x, task='whole')
Forward pass through the backbone.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | | Input images of shape (B, 3, H, W). | required |
| task | | Task type, one of ["whole", "cls", "seg"]. Defaults to "whole". | 'whole' |
Returns:
| Name | Type | Description |
|---|---|---|
| feats | | Output features, format depends on task. |
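Continuing from the construction sketch above with dynamic_size=True, a sketch of processing inputs whose spatial size differs from img_size (the divisibility-by-patch_size requirement is an assumption):

```python
x_small = torch.randn(1, 3, 256, 256)
x_large = torch.randn(1, 3, 384, 384)  # both sides divisible by patch_size=16

with torch.no_grad():
    feats_small = backbone(x_small, task="whole")
    feats_large = backbone(x_large, task="whole")
```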
VqganBackbone
VqganBackbone(vqgan_config, **kwargs)
VQGAN-based backbone for image encoding and decoding.
This backbone uses a VQGAN (Vector Quantized Generative Adversarial Network) model to encode images into discrete tokens and decode them back to images. The encoding process converts images to latent codes and quantizes them using a codebook.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vqgan_config | | Configuration dictionary for the VQModel initialization. | required |
| **kwargs | | Unused keyword arguments for API compatibility. | {} |
Attributes:
| Name | Type | Description |
|---|---|---|
| | | The underlying VQGAN model. |
| | | Size of the quantization codebook. |
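A construction sketch. The import path is an assumption, and load_vqgan_config is a hypothetical placeholder for however the VQModel configuration dictionary is obtained in your setup.

```python
from mpcompress.backbone import VqganBackbone  # import path assumed

vqgan_config = load_vqgan_config()  # hypothetical helper returning the VQModel config dict
backbone = VqganBackbone(vqgan_config=vqgan_config)
```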
decode
decode(z_q)
Decode quantized latent codes back to images.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| z_q | | Quantized latent codes of shape (B, C, H, W). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| x_hat | | Reconstructed images of shape (B, 3, H, W) in range [0, 1]. |
encode
encode(x)
Encode input images into latent codes and tokens.
The input images x are expected to be in the range [0, 1]; they are transformed to [-1, 1] for the VQGAN encoder. The encoder produces latent codes that are quantized using the codebook to produce discrete tokens. The quantization produces z_q' = z + (z_q - z).detach() (a straight-through estimator), which incurs a tiny MSE error (approximately 1e-18) between z_q' and z_q. We use z_q as the context for consistency.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | | Input images of shape (B, 3, H, W) in range [0, 1]. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| vqgan_enc | | A dictionary containing the encoded outputs (latent codes and discrete tokens). |
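A self-contained sketch of the straight-through trick mentioned above; torch.round is only a stand-in for the actual codebook lookup.

```python
import torch

def straight_through(z: torch.Tensor, z_q: torch.Tensor) -> torch.Tensor:
    # Forward value equals z_q (up to floating-point error), but gradients
    # flow back to z, bypassing the non-differentiable quantization step.
    return z + (z_q - z).detach()

z = torch.randn(2, 256, 16, 16, requires_grad=True)
z_q = torch.round(z)  # stand-in for the codebook quantizer

z_q_st = straight_through(z, z_q)
print(torch.mean((z_q_st - z_q) ** 2))  # tiny, on the order of numerical precision
```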
tokens_to_features
tokens_to_features(tokens)
Convert discrete tokens to quantized latent features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokens | | Discrete token indices of shape (B, H, W). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| z_q | | Quantized latent features of shape (B, C, H, W). |
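Continuing from the VqganBackbone construction sketch above, a tokens-to-image pipeline; the "tokens" key used to pull the token grid out of the encode output is hypothetical.

```python
import torch

with torch.no_grad():
    x = torch.rand(2, 3, 256, 256)  # images in [0, 1]
    enc = backbone.encode(x)        # dictionary of encoded outputs

    tokens = enc["tokens"]          # hypothetical key for the (B, H, W) token grid
    z_q = backbone.tokens_to_features(tokens)  # (B, C, H, W) quantized latents
    x_hat = backbone.decode(z_q)               # (B, 3, H, W) reconstruction in [0, 1]
```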