Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer
Large-scale deployment of autonomous vehicles has been continually delayed
due to safety concerns. On the one hand, comprehensive scene understanding is
indispensable; without it, a vehicle remains vulnerable to rare but complex
traffic situations, such as the sudden appearance of unknown objects.
However, reasoning from a global context requires access to sensors of multiple
types and adequate fusion of multi-modal sensor signals, which is difficult to
achieve. On the other hand, the lack of interpretability in learning models
also hampers safety, since the causes of failures cannot be verified. In this paper, we
propose a safety-enhanced autonomous driving framework, named Interpretable
Sensor Fusion Transformer (InterFuser), to fully process and fuse information
from multi-modal multi-view sensors for achieving comprehensive scene
understanding and adversarial event detection. In addition, our framework generates
intermediate interpretable features that provide richer semantics and are used to
constrain actions to lie within safe sets. We conducted extensive experiments on
CARLA benchmarks, where our model outperforms prior methods and ranks first on the
public CARLA Leaderboard.
Our code will be made available at https://github.com/opendilab/InterFuser.
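
As an illustration of the kind of pipeline sketched in this abstract, the following is a minimal PyTorch-style sketch of fusing multi-view camera tokens and LiDAR tokens with a standard transformer encoder, decoding both an interpretable intermediate output (an object density map) and a driving action, and then filtering the action against a simple safe-set rule. The module names, feature dimensions, and the speed-capping safety filter are illustrative assumptions, not the InterFuser architecture itself.

import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Illustrative multi-modal, multi-view fusion with a transformer encoder."""

    def __init__(self, d_model=256, n_views=3, n_tokens=64):
        super().__init__()
        # Per-modality projections map raw features into a shared token space.
        self.cam_proj = nn.Linear(512, d_model)    # per-view camera features
        self.lidar_proj = nn.Linear(128, d_model)  # LiDAR BEV features
        self.pos_embed = nn.Parameter(torch.zeros(1, (n_views + 1) * n_tokens, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        # Two heads: an interpretable intermediate map and the control action.
        self.density_head = nn.Linear(d_model, 1)  # per-token object density
        self.action_head = nn.Linear(d_model, 2)   # e.g., target speed and steering

    def forward(self, cam_feats, lidar_feats):
        # cam_feats: (B, n_views * n_tokens, 512); lidar_feats: (B, n_tokens, 128)
        tokens = torch.cat([self.cam_proj(cam_feats), self.lidar_proj(lidar_feats)], dim=1)
        fused = self.fusion(tokens + self.pos_embed)
        density = self.density_head(fused).squeeze(-1).sigmoid()  # interpretable map
        action = self.action_head(fused.mean(dim=1))              # pooled action proposal
        return density, action

def constrain_action(action, density, crowd_threshold=0.5, slow_speed=2.0, max_speed=8.0):
    # Illustrative safety filter: cap the target speed when the interpretable
    # density map reports nearby objects, keeping the action inside a "safe set".
    target_speed, steer = action[:, 0], action[:, 1]
    crowded = density.max(dim=1).values > crowd_threshold
    safe_speed = torch.where(crowded,
                             target_speed.clamp(max=slow_speed),
                             target_speed.clamp(max=max_speed))
    return torch.stack([safe_speed, steer], dim=1)

Usage on dummy inputs would then read: density, action = model(cam_feats, lidar_feats), followed by action = constrain_action(action, density), so the interpretable intermediate output directly gates the final control command.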