Balancing Shared and Task-Specific Representations: A Hybrid Approach to Depth-Aware Video Panoptic Segmentation

Eindhoven University of Technology

Abstract

In this work, we present Multiformer, a novel approach to depth-aware video panoptic segmentation (DVPS) based on the mask transformer paradigm. Our method learns object representations that are shared across segmentation, monocular depth estimation, and object tracking subtasks. In contrast to recent unified approaches that progressively refine a common object representation, we propose a hybrid method using task-specific branches within each decoder block, ultimately fusing them into a shared representation at the block interfaces. Extensive experiments on the Cityscapes-DVPS and SemKITTI-DVPS datasets demonstrate that Multiformer achieves state-of-the-art performance across all DVPS metrics, outperforming previous methods by substantial margins. With a ResNet-50 backbone, Multiformer surpasses the previous best result by 3.0 DVPQ points while also improving depth estimation accuracy. Using a Swin-B backbone, Multiformer further improves performance by 4.0 DVPQ points. Our findings also provide valuable insights into the design of multi-task decoder architectures.

Summary

Context & Background

Depth-aware video panoptic segmentation (DVPS) plays a critical role in autonomous driving and robotics applications. Current approaches to DVPS share information across tasks either explicitly or implicitly.

Recent work has shown that implicit models, which share object representations across subtasks, can reliably improve DVPS performance. We hypothesize that a hybrid approach, one that primarily uses shared object representations but also maintains task-specific ones, can improve performance further. To this end, we introduce Multiformer, a hybrid model that balances shared and task-specific representations within a mask transformer.

Key aspects

Mask transformer

Based on Mask2Former with added depth and tracking capabilities.

Hybrid decoder

Featuring a context adapter and branched decoder blocks.

Metric depth

Improved accuracy and stability without dataset-specific hyperparameters.

Architecture Overview

Figure: Multiformer architecture. A backbone, pixel decoder, and feature pyramid feed the hybrid query decoder, which is formed by the context adapter and three branched decoder blocks; final predictions are made through a metric depth estimation head and a tracking module.

Multiformer processes video frames through a multi-stage architecture that combines shared and task-specific processing. The model first extracts multi-scale features with an ImageNet-pretrained backbone; a pixel decoder employing multi-scale deformable attention then enhances these features, and a feature pyramid network combines them into rich task-specific representations for segmentation and depth estimation.
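
As a rough sketch of this feature pipeline, the snippet below implements a plain top-down feature pyramid in PyTorch. The class name, channel widths, and tensor shapes are illustrative, and the multi-scale deformable-attention pixel decoder is replaced by simple convolutions for brevity; this is not the paper's exact implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleFeaturePyramid(nn.Module):
        """Top-down fusion of backbone stages (strides 4 to 32) into a pyramid."""
        def __init__(self, in_channels=(256, 512, 1024, 2048), dim=256):
            super().__init__()
            # 1x1 convs project each backbone stage to a common width.
            self.lateral = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_channels)
            self.output = nn.ModuleList(nn.Conv2d(dim, dim, 3, padding=1) for _ in in_channels)

        def forward(self, feats):
            x = [lat(f) for lat, f in zip(self.lateral, feats)]
            # Coarse-to-fine: upsample each coarser level and add it to the finer one.
            for i in range(len(x) - 2, -1, -1):
                x[i] = x[i] + F.interpolate(x[i + 1], size=x[i].shape[-2:], mode="nearest")
            return [out(xi) for out, xi in zip(self.output, x)]

    # Example with ResNet-50-like stage outputs for a 512x1024 input.
    feats = [torch.randn(1, c, 128 // 2**i, 256 // 2**i)
             for i, c in enumerate((256, 512, 1024, 2048))]
    pyramid = SimpleFeaturePyramid()(feats)  # four maps, each with 256 channels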

Figure: Context adapter. The block compresses task-specific features into a condensed context representation that interacts with the learnable queries through cross-attention.
Figure: Branched decoder block. The block splits the queries into parallel mask and depth branches for specialized processing before merging the updates through learned linear transforms.

Query Initialization & Processing

A context adapter initializes the shared object queries by compressing task-specific features into a condensed context representation and attending to it through cross-attention. The hybrid decoder then processes these queries through alternating phases of task-specific refinement and shared-representation fusion: each decoder block splits the queries into parallel mask and depth branches for specialized processing, then merges the branch updates through learned linear transforms.
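
A minimal sketch of one branched decoder block is given below, under the assumption that each branch is a standard transformer decoder layer and that fusion is a learned linear map over the concatenated branch updates; the names and layer choices are illustrative rather than the paper's exact design.

    import torch
    import torch.nn as nn

    class BranchedDecoderBlock(nn.Module):
        """One hybrid block: task-specific refinement, then shared fusion."""
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.mask_branch = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
            self.depth_branch = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
            # Learned linear transform fuses the branch updates into shared queries.
            self.fuse = nn.Linear(2 * dim, dim)

        def forward(self, queries, pixel_feats):
            # queries: (B, N, dim) shared object queries
            # pixel_feats: (B, HW, dim) flattened pixel features
            q_mask = self.mask_branch(queries, pixel_feats)    # mask-specific update
            q_depth = self.depth_branch(queries, pixel_feats)  # depth-specific update
            return self.fuse(torch.cat([q_mask, q_depth], dim=-1))

    block = BranchedDecoderBlock()
    queries = torch.randn(2, 100, 256)           # 100 object queries per image
    pixel_feats = torch.randn(2, 32 * 64, 256)   # flattened feature map
    queries = block(queries, pixel_feats)        # refined shared queries, same shape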

Figure: Depth estimation heads. Previous methods use a naive min-max denormalized depth head; ours predicts log-scale depths directly through query-wise affine transforms and dynamic softmax merging.

Monocular depth estimation

The final stage features a metric depth head that directly predicts log-scale depths through query-wise affine transforms and dynamic softmax merging, eliminating the need for dataset-specific depth normalization.
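
The sketch below illustrates one plausible reading of this head: each query predicts a (scale, shift) pair applied to a shared dense log-depth map, and the per-query depth maps are merged with a softmax over the mask logits. This is a hedged reconstruction for intuition, not the paper's exact implementation.

    import torch
    import torch.nn as nn

    class MetricDepthHead(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.affine = nn.Linear(dim, 2)       # per-query (scale, shift), assumed form
            self.logdepth = nn.Conv2d(dim, 1, 1)  # shared dense log-depth basis

        def forward(self, queries, pixel_feats, mask_logits):
            # queries: (B, N, dim); pixel_feats: (B, dim, H, W); mask_logits: (B, N, H, W)
            scale, shift = self.affine(queries).unbind(-1)      # (B, N) each
            base = self.logdepth(pixel_feats)                   # (B, 1, H, W)
            per_query = scale[..., None, None] * base + shift[..., None, None]
            weights = mask_logits.softmax(dim=1)                # dynamic merge over queries
            return (weights * per_query).sum(dim=1).exp()       # (B, H, W) metric depth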

Figure: Object tracker. The module matches shared query embeddings between frames through cosine similarity.

Object tracking

Object tracking is achieved through cosine similarity matching of shared query embeddings between frames, requiring no dedicated video training pipeline due to the rich temporal consistency in the learned representations.
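
A minimal sketch of this matching step is shown below, pairing queries across frames by cosine similarity with Hungarian assignment; the similarity threshold is an illustrative heuristic, not a value from the paper.

    import torch
    import torch.nn.functional as F
    from scipy.optimize import linear_sum_assignment

    def match_tracks(prev_queries, curr_queries, sim_threshold=0.5):
        """Match (N, dim) query embeddings across frames; returns (prev, curr) index pairs."""
        sim = F.normalize(prev_queries, dim=-1) @ F.normalize(curr_queries, dim=-1).T
        # Hungarian assignment maximizes total cosine similarity.
        rows, cols = linear_sum_assignment(-sim.detach().cpu().numpy())
        # Keep only confident matches; unmatched queries start new tracks.
        return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= sim_threshold]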

Results and Conclusions

Our comprehensive evaluation includes multi-task and task-specific experiments on the Cityscapes-DVPS and SemKITTI-DVPS datasets, which demonstrate the effectiveness and versatility of the proposed approach.

For more details on usage and contributing, please refer to the resources below.

Resources

  1. Install dependencies.
    pip install multiformer
  2. Run the model.
    multiformer cityscapes/large.swinb MY_DATA
    Replace MY_DATA with your own data (file or directory).

For researchers

If you find this work useful, please consider citing the paper.

@InProceedings{Stolle2025Balancing,
  title     = {Balancing Shared and Task-Specific Representations: A Hybrid Approach to Depth-Aware Video Panoptic Segmentation},
  author    = {Stolle, Kurt Henricus Werner},
  booktitle = {WACV},
  year      = {2025}
}

This publication is part of the NEON project with file number 17628 of the Crossover research program, which is (partly) financed by the Dutch Research Council (NWO). The Dutch national compute infrastructure was used with the support of the SURF Cooperative using grant EINF-5438.