Less Is More: Linear Layers on CLIP Features as Powerful VizWiz Model
Multispectral image pairs provide complementary information, making object
detection more reliable and robust in the open world. To fully
exploit the different modalities, we present a simple yet effective
cross-modality feature fusion approach, the Cross-Modality Fusion Transformer
(CFT). Unlike prior CNN-based works, guided by the transformer
scheme, our network learns long-range dependencies and integrates global
contextual information in the feature extraction stage. More importantly, by
leveraging the self-attention of the transformer, the network can naturally
carry out simultaneous intra-modality and inter-modality fusion, and robustly
capture the latent interactions between the RGB and thermal domains, thereby
significantly improving the performance of multispectral object detection.
Extensive experiments and ablation studies on multiple datasets demonstrate
that our approach is effective and achieves state-of-the-art detection
performance. Our code and models are available at
https://github.com/DocF/multispectral-object-detection.
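
To illustrate how self-attention can perform intra-modality and inter-modality fusion at the same time, the following is a minimal NumPy sketch, not the released implementation. It assumes a single attention head, random projection weights, no positional embeddings, and equally sized RGB and thermal feature maps; the function name `fuse_tokens` and all shapes are hypothetical. The tokens of both modalities are concatenated into one sequence, so the resulting attention matrix contains RGB-to-RGB and thermal-to-thermal blocks (intra-modality) alongside RGB-to-thermal and thermal-to-RGB blocks (inter-modality).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_tokens(rgb_feat, thermal_feat, w_q, w_k, w_v):
    """Single-head self-attention over concatenated RGB and thermal tokens.

    rgb_feat, thermal_feat: feature maps of shape (C, H, W).
    w_q, w_k, w_v: (C, C) projection matrices (random here; an assumption).
    Returns the fused RGB map, the fused thermal map, and the attention matrix.
    """
    c, h, w = rgb_feat.shape
    n = h * w
    # Flatten each map into n tokens of dimension c.
    rgb_tokens = rgb_feat.reshape(c, n).T
    thermal_tokens = thermal_feat.reshape(c, n).T
    # One joint sequence of 2n tokens: RGB first, thermal second.
    tokens = np.concatenate([rgb_tokens, thermal_tokens], axis=0)

    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)  # (2n, 2n)
    out = attn @ v                                           # (2n, c)

    # Split the sequence back into per-modality feature maps.
    rgb_out = out[:n].T.reshape(c, h, w)
    thermal_out = out[n:].T.reshape(c, h, w)
    return rgb_out, thermal_out, attn

rng = np.random.default_rng(0)
c, h, w = 8, 4, 4
n = h * w
rgb = rng.standard_normal((c, h, w))
thermal = rng.standard_normal((c, h, w))
w_q, w_k, w_v = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))

rgb_out, thermal_out, attn = fuse_tokens(rgb, thermal, w_q, w_k, w_v)

# Attention that RGB tokens pay to RGB tokens (intra-modality)
# and to thermal tokens (inter-modality).
intra_rgb = attn[:n, :n]
inter_rgb_from_thermal = attn[:n, n:]
```

Because each row of the attention matrix is normalized over all tokens of both modalities, every output token is a weighted mix of RGB and thermal values, and the split between the two blocks is determined by the attention scores rather than fixed by hand.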