Balancing Shared and Task-Specific Representations: A Hybrid Approach to Depth-Aware Video Panoptic Segmentation
Kurt H.W. Stolle
Eindhoven University of Technology
Abstract
In this work, we present Multiformer, a novel approach to depth-aware video panoptic segmentation (DVPS) based on the mask transformer paradigm. Our method learns object representations that are shared across segmentation, monocular depth estimation, and object tracking subtasks. In contrast to recent unified approaches that progressively refine a common object representation, we propose a hybrid method using task-specific branches within each decoder block, ultimately fusing them into a shared representation at the block interfaces. Extensive experiments on the Cityscapes-DVPS and SemKITTI-DVPS datasets demonstrate that Multiformer achieves state-of-the-art performance across all DVPS metrics, outperforming previous methods by substantial margins. With a ResNet-50 backbone, Multiformer surpasses the previous best result by 3.0 DVPQ points while also improving depth estimation accuracy. Using a Swin-B backbone, Multiformer further improves performance by 4.0 DVPQ points. Multiformer also provides valuable insights into the design of multi-task decoder architectures.
Summary


Context & Background
Depth-aware video panoptic segmentation (DVPS) plays a critical role in autonomous driving and robotics applications. Current approaches to DVPS share information across tasks either explicitly or implicitly.
- Explicit approaches define the interactions of task-specific object representations in the network architecture to jointly model depth and segmentation. This enables fine-grained control over the flow of information and allocation of computational resources.
- Implicit approaches use a single shared object representation that embeds both tasks in a unified network. The relevant interactions can be learned through training, but the network may struggle to balance the requirements of each task.
Recent work has shown that implicit models, having shared object representations, can reliably improve the performance of DVPS models. We hypothesize that a hybrid approach that primarily uses shared object representations but also includes task-specific representations can further improve the performance of DVPS models. To this end, we introduce Multiformer, a hybrid model that balances shared and task-specific representations in a mask transformer model.
Key aspects
- Mask transformer: based on Mask2Former, with added depth and tracking capabilities.
- Hybrid decoder: features a context adapter and branched decoder blocks.
- Metric depth: improved accuracy and stability without dataset-specific hyperparameters.
Architecture Overview
Multiformer processes video frames through a multi-stage architecture combining shared and task-specific processing. The model first extracts multi-scale features using an ImageNet-pretrained backbone enhanced with a pixel decoder employing multi-scale deformable attention. These features are combined through a feature pyramid network to create rich task-specific representations for segmentation and depth estimation.
Query Initialization & Processing
A context adapter initializes the shared object queries by compressing task-specific features into a condensed context representation through cross-attention. The hybrid decoder then processes these queries in alternating phases of task-specific refinement and shared-representation fusion. Each decoder block splits the queries into parallel mask and depth branches for specialized processing, then merges the branch updates through learned linear transforms.
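The branch-then-fuse pattern of a single decoder block can be illustrated with a minimal sketch. This is not the Multiformer implementation: the function names (`linear`, `hybrid_block`), the residual form of the fusion, and the constant stand-in branches are all assumptions made for illustration. In the actual model, each branch would refine the queries against task-specific features.

```python
def linear(vector, weight):
    """Apply a D_out x D_in weight matrix to a length-D_in vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def hybrid_block(queries, mask_branch, depth_branch, w_mask, w_depth):
    """One hypothetical decoder block: branch per task, then fuse.

    mask_branch / depth_branch are stand-ins that map a query to an
    update vector; w_mask / w_depth play the role of the learned
    linear transforms that merge the updates into the shared query.
    """
    fused = []
    for query in queries:
        # Task-specific refinement in two parallel branches.
        mask_update = mask_branch(query)
        depth_update = depth_branch(query)
        # Project each update and add both to the shared query
        # (the residual form is an assumption of this sketch).
        mask_term = linear(mask_update, w_mask)
        depth_term = linear(depth_update, w_depth)
        fused.append(
            [q + m + d for q, m, d in zip(query, mask_term, depth_term)]
        )
    return fused


# Toy example: two 2-dimensional queries and constant stand-in branches.
queries = [[0.0, 0.0], [1.0, 1.0]]
half_identity = [[0.5, 0.0], [0.0, 0.5]]
fused = hybrid_block(
    queries,
    mask_branch=lambda q: [1.0, 0.0],
    depth_branch=lambda q: [0.0, 2.0],
    w_mask=half_identity,
    w_depth=half_identity,
)
```

The point of the sketch is only the data flow: the shared queries enter the block, each task computes its own update, and a single shared representation leaves the block.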
Monocular depth estimation
The final stage is a metric depth head that directly predicts log-scale depths through query-wise affine transforms and dynamic softmax merging, which eliminates the need for dataset-specific normalization.
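One possible reading of this depth head is sketched below, under stated assumptions: each query contributes a scale and a shift that are applied to a shared per-pixel depth feature, and the per-query log-depths are merged per pixel with softmax weights taken from the mask logits. The function names, the scalar depth feature, and the use of mask logits as merging weights are hypothetical and are not confirmed by the text above.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def merge_depth(depth_features, mask_logits, scales, shifts):
    """Merge per-query log-depth predictions into one metric depth map.

    depth_features: P per-pixel scalar features (hypothetical)
    mask_logits:    Q lists of length P, one logit per query and pixel
    scales, shifts: Q affine parameters, one pair per query
    """
    depths = []
    for p, feature in enumerate(depth_features):
        # Per-pixel weights over queries ("dynamic softmax merging").
        weights = softmax([logits[p] for logits in mask_logits])
        # Query-wise affine transform, predicted in log space.
        log_depth = sum(
            w * (a * feature + b)
            for w, a, b in zip(weights, scales, shifts)
        )
        # Exponentiate to obtain depth in metric units.
        depths.append(math.exp(log_depth))
    return depths


# Toy example: two queries, two pixels; each pixel belongs to one query.
depths = merge_depth(
    depth_features=[0.0, 1.0],
    mask_logits=[[10.0, -10.0], [-10.0, 10.0]],
    scales=[1.0, 2.0],
    shifts=[math.log(5.0), math.log(20.0)],
)
```

Because the output is exponentiated from a log-scale prediction, it is always positive and needs no fixed depth range, which is consistent with the claim that no dataset-specific normalization is required.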
Object tracking
Object tracking is achieved through cosine similarity matching of shared query embeddings between frames, requiring no dedicated video training pipeline due to the rich temporal consistency in the learned representations.
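As a rough illustration of cosine-similarity matching between frames, the sketch below links each current-frame query to its most similar previous-frame query. The greedy one-to-one assignment and the `min_similarity` threshold are assumptions of this example. The text above does not specify the assignment strategy or any threshold value.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_queries(prev_embeddings, curr_embeddings, min_similarity=0.5):
    """Greedily match current-frame queries to previous-frame queries.

    Returns a dict mapping current index -> previous index. Current
    queries without a sufficiently similar partner (new objects) are
    absent from the dict. `min_similarity` is a hypothetical threshold.
    """
    pairs = []
    for i, curr in enumerate(curr_embeddings):
        for j, prev in enumerate(prev_embeddings):
            pairs.append((cosine_similarity(curr, prev), i, j))
    pairs.sort(key=lambda pair: pair[0], reverse=True)  # best pairs first

    matches, used_prev = {}, set()
    for similarity, i, j in pairs:
        if similarity < min_similarity:
            break  # all remaining pairs are even less similar
        if i in matches or j in used_prev:
            continue  # keep the assignment one-to-one
        matches[i] = j
        used_prev.add(j)
    return matches


# Toy example: two tracked objects swap index order, one new object appears.
prev = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
curr = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
matches = match_queries(prev, curr)
```

In this toy example the first two current queries inherit the identities of the previous objects they resemble, while the third has no similar partner and would start a new track.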
Results and Conclusions
Our results include multi-task and task-specific evaluations on the Cityscapes-DVPS and SemKITTI-DVPS datasets, which demonstrate the effectiveness and versatility of the proposed approach.
For more details on usage and contribution, please refer to the section below.
Resources
- Install dependencies.
pip install multiformer
- Run the model.
multiformer cityscapes/large.swinb MY_DATA
Replace MY_DATA with your own data (file or directory).
For researchers
If you find this work useful, please consider citing the paper.
@InProceedings{Stolle2025Balancing,
title = {Balancing Shared and Task-Specific Representations: A Hybrid Approach to Depth-Aware Video Panoptic Segmentation},
author = {Stolle, Kurt Henricus Werner},
booktitle = {WACV},
year = {2025}
}
This publication is part of the NEON project with file number 17628 of the Crossover research program, which is (partly) financed by the Dutch Research Council (NWO). The Dutch national compute infrastructure was used with the support of the SURF Cooperative using grant EINF-5438.