Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
Multimodal Large Language Models (MLLMs) excel at generating responses based
on visual inputs. However, they often exhibit a bias toward generating
responses that resemble their pretraining corpus, overshadowing the importance of
visual information. We treat this bias as a "preference" for pretraining
statistics, which hinders the model's grounding in visual input. To mitigate
this issue, we propose Bootstrapped Preference Optimization (BPO), which
conducts preference learning with datasets containing negative responses
bootstrapped from the model itself. Specifically, we propose two
strategies: 1) using distorted image inputs to the MLLM to elicit responses
that exhibit amplified pretraining bias; and 2) leveraging a text-based LLM to
explicitly inject erroneous but common elements into the original response.
These undesirable responses are paired with the original annotated responses
from the datasets to construct the preference dataset, which is subsequently
used for preference learning. Our approach effectively suppresses the
pretrained LLM's bias, enabling enhanced grounding in visual inputs. Extensive
experiments demonstrate significant performance improvements across multiple
benchmarks, advancing the state of the art in multimodal conversational
systems.
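
As a rough sketch of the bootstrapping pipeline above, the following Python snippet illustrates how the two negative-response strategies could be combined into preference pairs. The helper names (mllm_generate, llm_inject_errors) and the choice of Gaussian blur as the image distortion are illustrative assumptions, not the paper's actual implementation.

    # Hypothetical sketch of bootstrapped preference-pair construction.
    # `mllm_generate` and `llm_inject_errors` are placeholder callables,
    # not the paper's actual API.
    import random
    from PIL import Image, ImageFilter

    def distort(image: Image.Image) -> Image.Image:
        # Weaken the visual signal (here: heavy blur) so the MLLM's answer
        # leans on pretraining statistics rather than the image content.
        return image.filter(ImageFilter.GaussianBlur(radius=8))

    def build_preference_pairs(dataset, mllm_generate, llm_inject_errors):
        pairs = []
        for image, question, gold_answer in dataset:
            # Strategy 1: elicit a bias-dominated answer from a distorted image.
            neg_from_image = mllm_generate(distort(image), question)
            # Strategy 2: have a text-only LLM inject plausible-but-wrong details.
            neg_from_text = llm_inject_errors(gold_answer)
            rejected = random.choice([neg_from_image, neg_from_text])
            # Gold annotation is "chosen"; the bootstrapped response is "rejected".
            pairs.append({"prompt": question, "image": image,
                          "chosen": gold_answer, "rejected": rejected})
        return pairs

The resulting pairs can then be fed to a standard preference-learning objective (e.g., DPO); the abstract does not name the specific objective used.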