
mpcompress.heads

Dinov2ClassifierHead

Dinov2ClassifierHead(embed_dim, layers, checkpoint_path)

Classification head for DINOv2 model.

This head takes multi-layer features from a DINOv2 backbone and produces classification logits. Supports 1-layer and 4-layer configurations.

Parameters:

  embed_dim (int): Embedding dimension of the features. Required.
  layers (int): Number of layers to use. Supported values: 1 and 4. Required.
  checkpoint_path (str or None): Path to a checkpoint file to load weights from. If None, weights are randomly initialized. Required.

forward

forward(feature_list)

Forward pass through the classifier head.

Parameters:

  feature_list (list[list[Tensor]]): List of per-layer features, where each element is [cls_token, patch_tokens]. Required. Shapes:
    • cls_token: (B, embed_dim)
    • patch_tokens: (B, N_patches, embed_dim)

Returns:

  logits (Tensor): Classification logits of shape (B, 1000).
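
For intuition, here is a minimal NumPy sketch of how a multi-layer head can flatten feature_list into one vector per image and project it to 1000 logits. The recipe shown (cls token concatenated with mean-pooled patch tokens, per layer) follows the common DINOv2 linear-eval setup; it is not necessarily the exact combination this class implements, and `assemble_features` is a hypothetical helper, not part of the API.

```python
import numpy as np

def assemble_features(feature_list):
    """Flatten multi-layer features into one vector per image.

    Per layer: concatenate the cls token with the mean of the patch
    tokens; then concatenate across layers. (Illustrative only - the
    real head's combination is an implementation detail.)
    """
    per_layer = []
    for cls_token, patch_tokens in feature_list:
        pooled = patch_tokens.mean(axis=1)                       # (B, embed_dim)
        per_layer.append(np.concatenate([cls_token, pooled], axis=1))
    return np.concatenate(per_layer, axis=1)                     # (B, 2 * embed_dim * n_layers)

# Toy input: 4 layers, batch of 2, embed_dim 8, 5 patch tokens per layer.
rng = np.random.default_rng(0)
feats = [[rng.standard_normal((2, 8)), rng.standard_normal((2, 5, 8))]
         for _ in range(4)]
x = assemble_features(feats)                                     # (2, 64)
W = rng.standard_normal((x.shape[1], 1000))                      # toy linear head
logits = x @ W                                                   # (2, 1000), like forward()
```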

predict

predict(feature_list, topk=1)

Predict top-k class indices from features.

Parameters:

  feature_list (list[list[Tensor]]): List of per-layer features, where each element is [cls_token, patch_tokens]. Required.
  topk (int): Number of top predictions to return. Defaults to 1.

Returns:

  indices (Tensor): Top-k class indices of shape (B, topk).
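
Assuming predict simply ranks the forward() logits, the top-k selection reduces to the following NumPy sketch (`topk_indices` is a hypothetical helper for illustration):

```python
import numpy as np

def topk_indices(logits, topk=1):
    """Return the indices of the topk largest logits per row, best first."""
    order = np.argsort(-logits, axis=1)   # sort each row descending
    return order[:, :topk]                # (B, topk)

logits = np.array([[0.1, 2.0, 0.5],
                   [1.5, 0.2, 1.7]])
idx = topk_indices(logits, topk=2)        # idx.tolist() == [[1, 2], [2, 0]]
```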

Dinov2SegmentationHead

Dinov2SegmentationHead(in_channels, in_index, input_transform, channels, resize_factors=None, align_corners=False, num_classes=21, patch_size=16, dropout_ratio=0, checkpoint=None, **kwargs)

Segmentation head for DINOv2 model.

This head applies BatchNorm and Conv layers to produce segmentation predictions from multi-level features. It supports several input transformation modes and sliding-window inference for large images.
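
As a shape sketch of the BatchNorm + Conv pipeline, the snippet below runs an inference-mode BatchNorm followed by a 1x1 convolution over a (B, C, H, W) map, yielding (B, num_classes, H, W). The head's actual layer layout and parameters are internal; `bn_conv1x1` is a hypothetical stand-in.

```python
import numpy as np

def bn_conv1x1(x, gamma, beta, mean, var, W, b, eps=1e-5):
    """Inference BatchNorm then 1x1 conv on a (B, C, H, W) map.

    W: (num_classes, C) 1x1-conv weight; b: (num_classes,) bias.
    Only illustrates the (B, C, H, W) -> (B, num_classes, H, W) mapping.
    """
    x = (x - mean[None, :, None, None]) / np.sqrt(var[None, :, None, None] + eps)
    x = gamma[None, :, None, None] * x + beta[None, :, None, None]
    # A 1x1 conv is a per-pixel linear map over channels:
    return np.einsum('bchw,kc->bkhw', x, W) + b[None, :, None, None]

B, C, K = 2, 6, 21
x = np.random.default_rng(0).standard_normal((B, C, 4, 4))
out = bn_conv1x1(x, np.ones(C), np.zeros(C), np.zeros(C), np.ones(C),
                 np.random.default_rng(1).standard_normal((K, C)), np.zeros(K))
# out has shape (2, 21, 4, 4)
```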

Parameters:

  in_channels (int or Sequence[int]): Number of input channels. Required.
  in_index (int or Sequence[int]): Indices of the input features to use. Required.
  input_transform (str or None): Transformation applied to the input features. Options: 'resize_concat', 'multiple_select', None. Required.
  channels (int): Number of channels after the transformation. Required.
  resize_factors (Sequence[float]): Resize factor for each input. Defaults to None.
  align_corners (bool): Whether to align corners in interpolation. Defaults to False.
  num_classes (int): Number of segmentation classes. Defaults to 21.
  patch_size (int): Patch size of the vision transformer. Defaults to 16.
  dropout_ratio (float): Dropout ratio. Defaults to 0.
  checkpoint (str or None): Path to a checkpoint file to load weights from. Defaults to None.
  **kwargs (dict): Additional keyword arguments passed to the parent class.
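
To illustrate the 'resize_concat' transform, the sketch below selects features by index, resizes each to a common resolution, and concatenates them along the channel axis. The real head would use bilinear interpolation (honoring align_corners); integer nearest-neighbour upsampling keeps this sketch dependency-free. `resize_concat` and `nearest_upsample` are hypothetical helpers.

```python
import numpy as np

def nearest_upsample(x, factor):
    """Nearest-neighbour upsample of a (B, C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=2).repeat(factor, axis=3)

def resize_concat(features, in_index, factors):
    """Select features by index, bring them to a shared resolution,
    and concatenate along channels (the 'resize_concat' idea)."""
    selected = [nearest_upsample(features[i], f) for i, f in zip(in_index, factors)]
    return np.concatenate(selected, axis=1)

# Two levels: (1, 4, 8, 8) and (1, 4, 4, 4); upsample the coarser one by 2.
feats = [np.ones((1, 4, 8, 8)), np.ones((1, 4, 4, 4))]
out = resize_concat(feats, in_index=[0, 1], factors=[1, 2])  # (1, 8, 8, 8)
```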

forward

forward(inputs)

Forward pass through the segmentation head.

Parameters:

  inputs (list[Tensor] or Tensor): Input features from the backbone; either a list of multi-level features or a single tensor. Required.

Returns:

  seg_logits (Tensor): Segmentation logits of shape (B, num_classes, H, W).

predict

predict(inputs, scale=1, size=None)

Predict segmentation logits with optional resizing.

Parameters:

  inputs (list[Tensor] or Tensor): Input features from the backbone. Required.
  scale (float): Scale factor for resizing the output; applied when scale != 1. Defaults to 1.
  size (tuple[int, int]): Target (height, width); if provided, the output is resized to this size. Defaults to None.

Returns:

  seg_logits (Tensor): Segmentation logits of shape (B, num_classes, H, W).
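
The interaction of scale and size can be summarized by the following target-size resolution, assuming an explicit size takes precedence over a non-unit scale (the docstring leaves the precedence unstated; `output_size` is a hypothetical helper):

```python
def output_size(current_size, scale=1, size=None):
    """Resolve the output (H, W) the way predict()'s arguments suggest:
    an explicit size wins; otherwise a non-unit scale factor applies;
    otherwise the logits keep their native size."""
    if size is not None:
        return tuple(size)
    if scale != 1:
        return (round(current_size[0] * scale), round(current_size[1] * scale))
    return tuple(current_size)

output_size((100, 200), size=(50, 50))   # (50, 50): size wins
output_size((100, 200), scale=0.5)       # (50, 100)
output_size((100, 200))                  # (100, 200): unchanged
```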

slide_predict

slide_predict(feature_list, current_size, slide_window, slide_stride, target_size=None)

Perform sliding window prediction for large images.

This method processes an image in overlapping crops using a sliding window approach and averages predictions in overlapping regions.

Parameters:

  feature_list (list[list[Tensor]]): Per-crop features, where each crop contributes a list of per-layer feature maps: [[(B, C, H, W), ...], ..., [(B, C, H, W), ...]], i.e. N_crop lists of N_layer tensors. Required.
  current_size (tuple[int, int]): Current image size (height, width). Required.
  slide_window (tuple[int, int]): Sliding-window size (height, width). Required.
  slide_stride (tuple[int, int]): Sliding-window stride (height, width). Required.
  target_size (tuple[int, int]): Target size of the final output. If None, the output is at current_size. Defaults to None.

Returns:

  preds (Tensor): Averaged segmentation logits of shape (B, num_classes, H, W).
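
The crop-and-average mechanics described above can be sketched in NumPy: enumerate window positions (clamping the last one to the image border), paste each crop's logits back at its position, count how many crops cover each pixel, and divide. `slide_boxes` and `slide_average` are hypothetical helpers illustrating the docstring's behavior, not this class's internals; the sketch assumes the window fits inside the image.

```python
import numpy as np

def slide_boxes(current_size, window, stride):
    """Top-left (y, x) corners of sliding-window crops covering the image,
    with the last row/column clamped to end exactly at the border.
    Assumes window <= current_size in both dimensions."""
    H, W = current_size; wh, ww = window; sh, sw = stride
    ys = list(range(0, H - wh + 1, sh))
    xs = list(range(0, W - ww + 1, sw))
    if ys[-1] + wh < H: ys.append(H - wh)
    if xs[-1] + ww < W: xs.append(W - ww)
    return [(y, x) for y in ys for x in xs]

def slide_average(crop_preds, current_size, window, stride):
    """Paste per-crop logits back and average overlapping regions.
    crop_preds must be ordered like slide_boxes() (row-major)."""
    B, K = crop_preds[0].shape[:2]
    H, W = current_size; wh, ww = window
    acc = np.zeros((B, K, H, W))
    cnt = np.zeros((1, 1, H, W))          # coverage count per pixel
    for (y, x), p in zip(slide_boxes(current_size, window, stride), crop_preds):
        acc[:, :, y:y + wh, x:x + ww] += p
        cnt[:, :, y:y + wh, x:x + ww] += 1
    return acc / cnt

# 6x6 image, 4x4 window, stride 2 -> 4 overlapping crops.
boxes = slide_boxes((6, 6), (4, 4), (2, 2))          # [(0,0), (0,2), (2,0), (2,2)]
crops = [np.full((1, 2, 4, 4), float(i)) for i in range(len(boxes))]
avg = slide_average(crops, (6, 6), (4, 4), (2, 2))   # (1, 2, 6, 6)
# Corner (0,0) is covered by one crop; the center by all four.
```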