InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
Multimodal Large Language Models (MLLMs) have advanced significantly in recent
years. Nevertheless, they still struggle to accurately recognize and comprehend
intricate details in high-resolution images. Although this capability is
indispensable for developing robust MLLMs, the area remains underinvestigated.
To tackle this challenge, we introduce InfiMM-HD, a novel architecture designed
to process images of varying resolutions with low computational overhead,
facilitating the scaling of MLLMs to higher resolutions.
InfiMM-HD incorporates a cross-attention module and visual windows to reduce
computation costs. By integrating this architectural design with a four-stage
training pipeline, our model attains improved visual perception efficiently and
cost-effectively. Empirical studies underscore the robustness and effectiveness
of InfiMM-HD, opening new avenues for exploration in related areas. Code and
models can be found at https://huggingface.co/Infi-MM/infimm-hd.
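
To make the cross-attention-with-visual-windows idea concrete, below is a
minimal PyTorch sketch of text tokens attending to visual tokens within local
windows, so that attention cost scales with window size rather than with the
full image resolution. The module names, shapes, and the mean pooling over
windows are illustrative assumptions for this sketch, not InfiMM-HD's actual
implementation.

```python
# Minimal sketch: windowed cross-attention between text and image tokens.
# All names and shapes here are illustrative, not InfiMM-HD's real API.
import torch
import torch.nn as nn


def split_into_windows(feat: torch.Tensor, window: int) -> torch.Tensor:
    """Partition a (B, H, W, C) feature map into (B * nWin, window*window, C)."""
    B, H, W, C = feat.shape
    feat = feat.view(B, H // window, window, W // window, window, C)
    feat = feat.permute(0, 1, 3, 2, 4, 5).contiguous()
    return feat.view(-1, window * window, C)


class WindowedCrossAttention(nn.Module):
    """Text queries attend to visual keys/values inside each local window,
    so per-window attention cost is independent of total image size."""

    def __init__(self, dim: int, num_heads: int = 8, window: int = 8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # text: (B, T, C); image_feat: (B, H, W, C), H and W divisible by window
        B, T, C = text.shape
        vis = split_into_windows(image_feat, self.window)  # (B*nWin, w*w, C)
        n_win = vis.shape[0] // B
        # Repeat the text queries once per window, attend locally, then pool.
        q = text.repeat_interleave(n_win, dim=0)           # (B*nWin, T, C)
        out, _ = self.attn(q, vis, vis)                    # (B*nWin, T, C)
        out = out.view(B, n_win, T, C).mean(dim=1)         # fuse window outputs
        return text + out                                  # residual update


# Usage: 1024-dim features on a 32x32 grid, 16 text tokens.
layer = WindowedCrossAttention(dim=1024, num_heads=8, window=8)
text = torch.randn(2, 16, 1024)
image_feat = torch.randn(2, 32, 32, 1024)
print(layer(text, image_feat).shape)  # torch.Size([2, 16, 1024])
```

Under these assumptions, attention is computed over window-sized token sets
(here 64 visual tokens each) rather than all 1024 image tokens at once, which
is one plausible way such a design keeps computational overhead low as input
resolution grows.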