
mpcompress.heads

Dinov2ClassifierHead

Dinov2ClassifierHead(embed_dim, layers, checkpoint_path)

Classification head for DINOv2 model.

This head takes multi-layer features from a DINOv2 backbone and produces classification logits. Supports 1-layer and 4-layer configurations.

Parameters:

  embed_dim (int): Embedding dimension of the features. Required.
  layers (int): Number of layers to use. Supported values: 1 and 4. Required.
  checkpoint_path (str or None): Path to a checkpoint file to load weights from. If None, weights are randomly initialized. Required.

forward

forward(feature_list)

Forward pass through the classifier head.

Parameters:

  feature_list (list[list[Tensor]]): List of per-layer features, where each element is [cls_token, patch_tokens]. Required. Shapes:
    • cls_token: (B, embed_dim)
    • patch_tokens: (B, N_patches, embed_dim)

Returns:

  logits (Tensor): Classification logits of shape (B, 1000).
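
For intuition, here is a minimal NumPy sketch of how a multi-layer head can flatten feature_list into one vector per image and project it to 1000 logits. The recipe shown (cls token concatenated with mean-pooled patch tokens, per layer) follows the common DINOv2 linear-eval setup; it is not necessarily the exact combination this class implements, and `assemble_features` is a hypothetical helper, not part of the API.

```python
import numpy as np

def assemble_features(feature_list):
    """Flatten multi-layer features into one vector per image.

    Per layer: concatenate the cls token with the mean of the patch
    tokens; then concatenate across layers. (Illustrative only - the
    real head's combination is an implementation detail.)
    """
    per_layer = []
    for cls_token, patch_tokens in feature_list:
        pooled = patch_tokens.mean(axis=1)                       # (B, embed_dim)
        per_layer.append(np.concatenate([cls_token, pooled], axis=1))
    return np.concatenate(per_layer, axis=1)                     # (B, 2 * embed_dim * n_layers)

# Toy input: 4 layers, batch of 2, embed_dim 8, 5 patch tokens per layer.
rng = np.random.default_rng(0)
feats = [[rng.standard_normal((2, 8)), rng.standard_normal((2, 5, 8))]
         for _ in range(4)]
x = assemble_features(feats)                                     # (2, 64)
W = rng.standard_normal((x.shape[1], 1000))                      # toy linear head
logits = x @ W                                                   # (2, 1000), like forward()
```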

predict

predict(feature_list, topk=1)

Predict top-k class indices from features.

Parameters:

  feature_list (list[list[Tensor]]): List of per-layer features, where each element is [cls_token, patch_tokens]. Required.
  topk (int): Number of top predictions to return. Defaults to 1.

Returns:

  indices (Tensor): Top-k class indices of shape (B, topk).
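
Assuming predict simply ranks the forward() logits, the top-k selection reduces to the following NumPy sketch (`topk_indices` is a hypothetical helper for illustration):

```python
import numpy as np

def topk_indices(logits, topk=1):
    """Return the indices of the topk largest logits per row, best first."""
    order = np.argsort(-logits, axis=1)   # sort each row descending
    return order[:, :topk]                # (B, topk)

logits = np.array([[0.1, 2.0, 0.5],
                   [1.5, 0.2, 1.7]])
idx = topk_indices(logits, topk=2)        # idx.tolist() == [[1, 2], [2, 0]]
```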

Dinov2SegmentationHead

Dinov2SegmentationHead(in_channels, in_index, input_transform, channels, resize_factors=None, align_corners=False, num_classes=21, patch_size=16, dropout_ratio=0, checkpoint=None, **kwargs)

Segmentation head for DINOv2 model.

This head applies BatchNorm and Conv layers to produce segmentation predictions from multi-level features. It supports several input transformation modes and sliding-window inference for large images.
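
As a shape sketch of the BatchNorm + Conv pipeline, the snippet below runs an inference-mode BatchNorm followed by a 1x1 convolution over a (B, C, H, W) map, yielding (B, num_classes, H, W). The head's actual layer layout and parameters are internal; `bn_conv1x1` is a hypothetical stand-in.

```python
import numpy as np

def bn_conv1x1(x, gamma, beta, mean, var, W, b, eps=1e-5):
    """Inference BatchNorm then 1x1 conv on a (B, C, H, W) map.

    W: (num_classes, C) 1x1-conv weight; b: (num_classes,) bias.
    Only illustrates the (B, C, H, W) -> (B, num_classes, H, W) mapping.
    """
    x = (x - mean[None, :, None, None]) / np.sqrt(var[None, :, None, None] + eps)
    x = gamma[None, :, None, None] * x + beta[None, :, None, None]
    # A 1x1 conv is a per-pixel linear map over channels:
    return np.einsum('bchw,kc->bkhw', x, W) + b[None, :, None, None]

B, C, K = 2, 6, 21
x = np.random.default_rng(0).standard_normal((B, C, 4, 4))
out = bn_conv1x1(x, np.ones(C), np.zeros(C), np.zeros(C), np.ones(C),
                 np.random.default_rng(1).standard_normal((K, C)), np.zeros(K))
# out has shape (2, 21, 4, 4)
```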

Parameters:

  in_channels (int or Sequence[int]): Number of input channels. Required.
  in_index (int or Sequence[int]): Indices of the input features to use. Required.
  input_transform (str or None): Transformation applied to the input features. Options: 'resize_concat', 'multiple_select', None. Required.
  channels (int): Number of channels after the transformation. Required.
  resize_factors (Sequence[float]): Resize factor for each input. Defaults to None.
  align_corners (bool): Whether to align corners in interpolation. Defaults to False.
  num_classes (int): Number of segmentation classes. Defaults to 21.
  patch_size (int): Patch size of the vision transformer. Defaults to 16.
  dropout_ratio (float): Dropout ratio. Defaults to 0.
  checkpoint (str or None): Path to a checkpoint file to load weights from. Defaults to None.
  **kwargs (dict): Additional keyword arguments passed to the parent class.
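
To illustrate the 'resize_concat' transform, the sketch below selects features by index, resizes each to a common resolution, and concatenates them along the channel axis. The real head would use bilinear interpolation (honoring align_corners); integer nearest-neighbour upsampling keeps this sketch dependency-free. `resize_concat` and `nearest_upsample` are hypothetical helpers.

```python
import numpy as np

def nearest_upsample(x, factor):
    """Nearest-neighbour upsample of a (B, C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=2).repeat(factor, axis=3)

def resize_concat(features, in_index, factors):
    """Select features by index, bring them to a shared resolution,
    and concatenate along channels (the 'resize_concat' idea)."""
    selected = [nearest_upsample(features[i], f) for i, f in zip(in_index, factors)]
    return np.concatenate(selected, axis=1)

# Two levels: (1, 4, 8, 8) and (1, 4, 4, 4); upsample the coarser one by 2.
feats = [np.ones((1, 4, 8, 8)), np.ones((1, 4, 4, 4))]
out = resize_concat(feats, in_index=[0, 1], factors=[1, 2])  # (1, 8, 8, 8)
```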

forward

forward(inputs)

Forward pass through the segmentation head.

Parameters:

  inputs (list[Tensor] or Tensor): Input features from the backbone; either a list of multi-level features or a single tensor. Required.

Returns:

  seg_logits (Tensor): Segmentation logits of shape (B, num_classes, H, W).

predict

predict(inputs, scale=1, size=None)

Predict segmentation logits with optional resizing.

Parameters:

  inputs (list[Tensor] or Tensor): Input features from the backbone. Required.
  scale (float): Scale factor for resizing the output; applied when scale != 1. Defaults to 1.
  size (tuple[int, int]): Target (height, width); if provided, the output is resized to this size. Defaults to None.

Returns:

  seg_logits (Tensor): Segmentation logits of shape (B, num_classes, H, W).
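
The interaction of scale and size can be summarized by the following target-size resolution, assuming an explicit size takes precedence over a non-unit scale (the docstring leaves the precedence unstated; `output_size` is a hypothetical helper):

```python
def output_size(current_size, scale=1, size=None):
    """Resolve the output (H, W) the way predict()'s arguments suggest:
    an explicit size wins; otherwise a non-unit scale factor applies;
    otherwise the logits keep their native size."""
    if size is not None:
        return tuple(size)
    if scale != 1:
        return (round(current_size[0] * scale), round(current_size[1] * scale))
    return tuple(current_size)

output_size((100, 200), size=(50, 50))   # (50, 50): size wins
output_size((100, 200), scale=0.5)       # (50, 100)
output_size((100, 200))                  # (100, 200): unchanged
```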

slide_predict

slide_predict(feature_list, current_size, slide_window, slide_stride, target_size=None)

Perform sliding window prediction for large images.

This method processes an image in overlapping crops using a sliding window approach and averages predictions in overlapping regions.

Parameters:

  feature_list (list[list[Tensor]]): Per-crop features, where each crop contributes a list of per-layer feature maps: [[(B, C, H, W), ...], ..., [(B, C, H, W), ...]], i.e. N_crop lists of N_layer tensors. Required.
  current_size (tuple[int, int]): Current image size (height, width). Required.
  slide_window (tuple[int, int]): Sliding-window size (height, width). Required.
  slide_stride (tuple[int, int]): Sliding-window stride (height, width). Required.
  target_size (tuple[int, int]): Target size of the final output. If None, the output is at current_size. Defaults to None.

Returns:

  preds (Tensor): Averaged segmentation logits of shape (B, num_classes, H, W).
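
The crop-and-average mechanics described above can be sketched in NumPy: enumerate window positions (clamping the last one to the image border), paste each crop's logits back at its position, count how many crops cover each pixel, and divide. `slide_boxes` and `slide_average` are hypothetical helpers illustrating the docstring's behavior, not this class's internals; the sketch assumes the window fits inside the image.

```python
import numpy as np

def slide_boxes(current_size, window, stride):
    """Top-left (y, x) corners of sliding-window crops covering the image,
    with the last row/column clamped to end exactly at the border.
    Assumes window <= current_size in both dimensions."""
    H, W = current_size; wh, ww = window; sh, sw = stride
    ys = list(range(0, H - wh + 1, sh))
    xs = list(range(0, W - ww + 1, sw))
    if ys[-1] + wh < H: ys.append(H - wh)
    if xs[-1] + ww < W: xs.append(W - ww)
    return [(y, x) for y in ys for x in xs]

def slide_average(crop_preds, current_size, window, stride):
    """Paste per-crop logits back and average overlapping regions.
    crop_preds must be ordered like slide_boxes() (row-major)."""
    B, K = crop_preds[0].shape[:2]
    H, W = current_size; wh, ww = window
    acc = np.zeros((B, K, H, W))
    cnt = np.zeros((1, 1, H, W))          # coverage count per pixel
    for (y, x), p in zip(slide_boxes(current_size, window, stride), crop_preds):
        acc[:, :, y:y + wh, x:x + ww] += p
        cnt[:, :, y:y + wh, x:x + ww] += 1
    return acc / cnt

# 6x6 image, 4x4 window, stride 2 -> 4 overlapping crops.
boxes = slide_boxes((6, 6), (4, 4), (2, 2))          # [(0,0), (0,2), (2,0), (2,2)]
crops = [np.full((1, 2, 4, 4), float(i)) for i in range(len(boxes))]
avg = slide_average(crops, (6, 6), (4, 4), (2, 2))   # (1, 2, 6, 6)
# Corner (0,0) is covered by one crop; the center by all four.
```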