UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation
Jointly processing information from multiple sensors is crucial to achieving
accurate and robust perception for reliable autonomous driving systems.
However, current 3D perception research follows a modality-specific paradigm,
leading to additional computational overhead and inefficient collaboration
between different sensor data. In this paper, we present an efficient
multi-modal backbone for outdoor 3D perception named UniTR, which processes a
variety of modalities with unified modeling and shared parameters. Unlike
previous works, UniTR introduces a modality-agnostic transformer encoder to
handle these view-discrepant sensor data for parallel modal-wise representation
learning and automatic cross-modal interaction without additional fusion steps.
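As a rough, illustrative sketch of this weight-shared, single-pass design (not the actual UniTR implementation), the following PyTorch snippet processes camera and LiDAR tokens with one shared transformer encoder; the class name, modality embeddings, and token shapes are our own assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SharedModalityEncoder(nn.Module):
    """Hypothetical sketch: one weight-shared encoder for camera and LiDAR tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 8, num_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # learned embeddings telling the shared weights which modality a token came from
        self.modality_embed = nn.Embedding(2, dim)  # 0: camera, 1: LiDAR

    def forward(self, cam_tokens: torch.Tensor, lidar_tokens: torch.Tensor) -> torch.Tensor:
        # cam_tokens: (B, N_cam, dim), lidar_tokens: (B, N_lidar, dim)
        cam = cam_tokens + self.modality_embed.weight[0]
        lidar = lidar_tokens + self.modality_embed.weight[1]
        tokens = torch.cat([cam, lidar], dim=1)   # one unified token sequence
        return self.encoder(tokens)               # shared parameters, single forward pass

# toy usage with hypothetical token counts
enc = SharedModalityEncoder()
fused = enc(torch.randn(2, 1024, 256), torch.randn(2, 2048, 256))
print(fused.shape)  # torch.Size([2, 3072, 256])
```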
More importantly, to make full use of these complementary sensor types, we
present a novel multi-modal integration strategy that jointly considers
semantic-abundant 2D perspective relations and geometry-aware 3D sparse
neighborhood relations. UniTR is also a fundamentally task-agnostic backbone that naturally
supports different 3D perception tasks. It sets new state-of-the-art
performance on the nuScenes benchmark, achieving a +1.1 NDS gain for 3D object
detection and a +12.0 mIoU gain for BEV map segmentation with lower inference
latency. Code will be available at https://github.com/Haiyang-W/UniTR.
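To make the integration strategy above more concrete, here is a hedged sketch of one way a shared attention module could be applied under two complementary token partitions: one by projected 2D image-plane location and one by 3D (BEV) voxel location. All names, window sizes, and shapes below are hypothetical and do not reflect the released code.

```python
import torch
import torch.nn as nn

def group_ids(coords: torch.Tensor, window: int) -> torch.Tensor:
    """Map integer coordinates (N, 2) to a local-window group id per token."""
    cells = coords // window                    # which local window each token falls into
    return cells[:, 0] * 10_000 + cells[:, 1]   # fold the 2D cell index into a single id

def grouped_attention(tokens: torch.Tensor, ids: torch.Tensor,
                      attn: nn.MultiheadAttention) -> torch.Tensor:
    """Run self-attention independently inside every group of tokens."""
    out = tokens.clone()
    for g in ids.unique():
        mask = ids == g
        x = tokens[mask].unsqueeze(0)           # (1, n_g, dim) tokens of one local window
        out[mask] = attn(x, x, x)[0].squeeze(0)
    return out

dim = 256
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
tokens = torch.randn(500, dim)                  # fused camera + LiDAR tokens (hypothetical)
pix_uv = torch.randint(0, 640, (500, 2))        # projected 2D image-plane coordinates
bev_xy = torch.randint(0, 200, (500, 2))        # 3D (BEV) voxel coordinates

# the same shared weights attend first over 2D-perspective neighborhoods,
# then over 3D sparse (BEV) neighborhoods
tokens = grouped_attention(tokens, group_ids(pix_uv, window=32), attn)
tokens = grouped_attention(tokens, group_ids(bev_xy, window=16), attn)
```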