mpcompress.heads
Dinov2ClassifierHead
Dinov2ClassifierHead(embed_dim, layers, checkpoint_path)
Classification head for DINOv2 model.
This head takes multi-layer features from a DINOv2 backbone and produces classification logits. It supports 1-layer and 4-layer configurations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| embed_dim | | Embedding dimension of the features. | required |
| layers | | Number of layers to use. Supported values: 1, 4. | required |
| checkpoint_path | | Path to checkpoint file to load weights from. If None, weights are randomly initialized. | required |
forward
forward(feature_list)
Forward pass through the classifier head.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_list | | List of layer features, where each element is [cls_token, patch_tokens]. Shape: | required |
Returns:
| Name | Type | Description |
|---|---|---|
| logits | | Classification logits of shape (B, 1000). |
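As a rough illustration of how a head like this might turn `feature_list` into one vector per image, the sketch below follows the linear-evaluation convention of the public DINOv2 code: concatenate the class tokens of the selected layers, then append the mean of the last layer's patch tokens. Whether `Dinov2ClassifierHead` does exactly this is an assumption. `build_linear_input` is a hypothetical helper, and plain Python lists stand in for tensors so the example runs without dependencies.

```python
def build_linear_input(feature_list):
    """Build the classifier input for ONE image (hypothetical helper).

    feature_list: list of [cls_token, patch_tokens] pairs, one per layer.
    cls_token is a list of D floats; patch_tokens is a list of N such lists.
    """
    linear_input = []
    # Concatenate the class token of every selected layer.
    for cls_token, _patch_tokens in feature_list:
        linear_input.extend(cls_token)
    # Append the mean patch token of the last layer (assumed convention).
    _, last_patches = feature_list[-1]
    dim = len(last_patches[0])
    mean_patch = [
        sum(token[d] for token in last_patches) / len(last_patches)
        for d in range(dim)
    ]
    linear_input.extend(mean_patch)
    return linear_input


embed_dim = 3
layer = [[1.0, 2.0, 3.0], [[0.0, 0.0, 0.0], [2.0, 4.0, 6.0]]]

one_layer = build_linear_input([layer])       # layers=1
four_layer = build_linear_input([layer] * 4)  # layers=4

# Under this convention the input width is (layers + 1) * embed_dim.
print(len(one_layer), len(four_layer))
```

If the head does follow this convention, the final linear layer would take `(layers + 1) * embed_dim` input features and produce the 1000 logits listed above.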
predict
predict(feature_list, topk=1)
Predict top-k class indices from features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_list | | List of layer features, where each element is [cls_token, patch_tokens]. | required |
| topk | | Number of top predictions to return. Defaults to 1. | 1 |
Returns:
| Name | Type | Description |
|---|---|---|
| indices | | Top-k class indices of shape (B, topk). |
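To make the `(B, topk)` return shape concrete, here is a dependency-free sketch of top-k selection over a batch of logits. It assumes `predict` amounts to a forward pass followed by a top-k over the class dimension, which the source does not state explicitly; `topk_indices` is a hypothetical name. On tensors, the analogous call would be `logits.topk(topk, dim=1).indices`.

```python
def topk_indices(logits, topk=1):
    """Return, for each row of logits, the indices of the topk largest values."""
    result = []
    for row in logits:
        # Sort class indices by descending logit value.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append(order[:topk])
    return result


# Two images (B=2), four classes each.
logits = [
    [0.1, 2.5, -1.0, 0.7],
    [3.0, 0.2, 0.9, 1.1],
]

top1 = topk_indices(logits)          # shape (2, 1)
top2 = topk_indices(logits, topk=2)  # shape (2, 2)
print(top1, top2)
```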
Dinov2SegmentationHead
Dinov2SegmentationHead(in_channels, in_index, input_transform, channels, resize_factors=None, align_corners=False, num_classes=21, patch_size=16, dropout_ratio=0, checkpoint=None, **kwargs)
Segmentation head for DINOv2 model.
This head consists of BatchNorm and Conv layers that produce segmentation predictions from multi-level features. It supports several input transformation modes and sliding window inference for large images.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| in_channels | | Number of input channels. | required |
| in_index | | Indices of input features to use. | required |
| input_transform | | Transformation type of input features. Options: 'resize_concat', 'multiple_select', None. | required |
| channels | | Number of channels after transformation. | required |
| resize_factors | | Resize factors for each input. Defaults to None. | None |
| align_corners | | Whether to align corners in interpolation. Defaults to False. | False |
| num_classes | | Number of segmentation classes. Defaults to 21. | 21 |
| patch_size | | Patch size of the vision transformer. Defaults to 16. | 16 |
| dropout_ratio | | Dropout ratio. Defaults to 0. | 0 |
| checkpoint | | Path to checkpoint file to load weights from. Defaults to None. | None |
| **kwargs | | Additional keyword arguments passed to parent class. | {} |
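The three `input_transform` options share their names with the decode-head convention used in MMSegmentation. Assuming this head follows that convention (the source does not confirm it), the sketch below shows how `in_channels` and `in_index` would interact in each mode. Both helpers are hypothetical, and strings stand in for feature maps.

```python
def effective_in_channels(in_channels, in_index, input_transform):
    """Channel count(s) the head would see after the input transform."""
    if input_transform in ("resize_concat", "multiple_select"):
        if len(in_channels) != len(in_index):
            raise ValueError("in_channels and in_index must have equal length")
        if input_transform == "resize_concat":
            # Selected maps are resized to a common size and concatenated
            # along the channel axis, so the widths add up.
            return sum(in_channels)
        # 'multiple_select' keeps the selected maps separate.
        return list(in_channels)
    if input_transform is None:
        # A single feature map is picked; in_channels is a plain int.
        return in_channels
    raise ValueError(f"unsupported input_transform: {input_transform!r}")


def select_inputs(inputs, in_index, input_transform):
    """Pick the backbone outputs named by in_index."""
    if input_transform in ("resize_concat", "multiple_select"):
        return [inputs[i] for i in in_index]
    if input_transform is None:
        return inputs[in_index]
    raise ValueError(f"unsupported input_transform: {input_transform!r}")


features = ["f0", "f1", "f2", "f3"]  # stand-ins for four feature maps

concat_width = effective_in_channels([384] * 4, [0, 1, 2, 3], "resize_concat")
picked = select_inputs(features, [1, 3], "multiple_select")
single = select_inputs(features, 3, None)
print(concat_width, picked, single)
```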
forward
forward(inputs)
Forward pass through the segmentation head.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | | Input features from backbone. Can be a list of multi-level features or a single tensor. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| seg_logits | | Segmentation logits of shape (B, num_classes, H, W). |
predict
predict(inputs, scale=1, size=None)
Predict segmentation logits with optional resizing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | | Input features from backbone. | required |
| scale | | Scale factor for resizing output. If scale != 1, output will be resized by this factor. Defaults to 1. | 1 |
| size | | Target size (height, width) for resizing. If provided, output will be resized to this size. Defaults to None. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| seg_logits | | Segmentation logits of shape (B, num_classes, H, W). |
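The interaction of `scale` and `size` can be summarised as a small size calculation. The sketch below is one plausible reading, not the confirmed behaviour: it assumes that an explicit `size` takes precedence over `scale` and that scaled dimensions are truncated to integers. `resolve_output_size` is a hypothetical helper; the resizing itself would presumably be an interpolation of the logits to the resolved size.

```python
def resolve_output_size(height, width, scale=1, size=None):
    """Spatial size (H, W) of the returned logits under the stated assumptions."""
    if size is not None:
        # Assumption: an explicit target size wins over the scale factor.
        return (int(size[0]), int(size[1]))
    if scale != 1:
        # Assumption: scaled dimensions are truncated to integers.
        return (int(height * scale), int(width * scale))
    # No resizing requested: logits keep the head's native resolution.
    return (height, width)


native = resolve_output_size(32, 48)
scaled = resolve_output_size(32, 48, scale=4)
fixed = resolve_output_size(32, 48, size=(512, 512))
print(native, scaled, fixed)
```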
slide_predict
slide_predict(feature_list, current_size, slide_window, slide_stride, target_size=None)
Perform sliding window prediction for large images.
This method processes an image in overlapping crops using a sliding window approach and averages predictions in overlapping regions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_list | | List of crop features, where each crop contains features from multiple layers. Shape: [[(B,C,H,W), ...], ..., [(B,C,H,W), ...]] which is N_crop times N_layer of features. | required |
| current_size | | Current image size (height, width). | required |
| slide_window | | Sliding window size (height, width). | required |
| slide_stride | | Sliding window stride (height, width). | required |
| target_size | | Target size for final output. If None, output will be at current_size. Defaults to None. | None |
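The three size parameters above determine where each crop sits in the image. The following sketch computes crop boxes the way sliding-window inference is commonly implemented (for example in MMSegmentation): windows advance by the stride, and the last window in each direction is shifted back so it stays inside the image. That this method uses the same layout is an assumption, and `sliding_window_boxes` is a hypothetical helper.

```python
def sliding_window_boxes(current_size, slide_window, slide_stride):
    """Return crop boxes as (y1, x1, y2, x2), in row-major order."""
    h_img, w_img = current_size
    h_crop, w_crop = slide_window
    h_stride, w_stride = slide_stride

    # Number of windows needed to cover each axis.
    h_grids = max(h_img - h_crop + h_stride - 1, 0) // h_stride + 1
    w_grids = max(w_img - w_crop + w_stride - 1, 0) // w_stride + 1

    boxes = []
    for h_idx in range(h_grids):
        for w_idx in range(w_grids):
            y2 = min(h_idx * h_stride + h_crop, h_img)
            x2 = min(w_idx * w_stride + w_crop, w_img)
            # Shift the window back so it never leaves the image.
            y1 = max(y2 - h_crop, 0)
            x1 = max(x2 - w_crop, 0)
            boxes.append((y1, x1, y2, x2))
    return boxes


boxes = sliding_window_boxes((512, 768), (512, 512), (341, 341))
print(boxes)  # [(0, 0, 512, 512), (0, 256, 512, 768)]
```

Under this layout, the number of boxes would be expected to match the number of crops (N_crop) in `feature_list`.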
Returns:
| Name | Type | Description |
|---|---|---|
| preds | | Averaged segmentation logits of shape (B, num_classes, H, W). |
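Averaging in overlapping regions is usually done by accumulating per-crop predictions into a full-size buffer while counting how many crops touched each pixel, then dividing. The sketch below shows that idea on a single channel with nested lists in place of tensors. It is an illustration of the general technique rather than this method's actual code, and `average_overlaps` and the example boxes are hypothetical.

```python
def average_overlaps(crop_preds, boxes, current_size):
    """Average per-crop predictions where crops overlap (single channel).

    crop_preds: list of 2-D lists, one per crop.
    boxes: list of (y1, x1, y2, x2) giving each crop's position in the image.
    Every pixel must be covered by at least one box.
    """
    h_img, w_img = current_size
    total = [[0.0] * w_img for _ in range(h_img)]
    count = [[0] * w_img for _ in range(h_img)]

    for pred, (y1, x1, y2, x2) in zip(crop_preds, boxes):
        for y in range(y1, y2):
            for x in range(x1, x2):
                total[y][x] += pred[y - y1][x - x1]
                count[y][x] += 1

    # Divide the accumulated sum by the number of contributing crops.
    return [
        [total[y][x] / count[y][x] for x in range(w_img)]
        for y in range(h_img)
    ]


# Two 2x3 crops covering a 2x4 image; columns 1 and 2 overlap.
boxes = [(0, 0, 2, 3), (0, 1, 2, 4)]
crop_a = [[1.0] * 3 for _ in range(2)]
crop_b = [[3.0] * 3 for _ in range(2)]

averaged = average_overlaps([crop_a, crop_b], boxes, (2, 4))
print(averaged[0])  # [1.0, 2.0, 2.0, 3.0]
```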