Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Contrastive learning-based video-language representation learning approaches,
e.g., CLIP, which pursue semantic interaction over pre-defined video-text
pairs, have achieved outstanding performance. To move beyond this
coarse-grained global interaction, fine-grained cross-modal learning must
confront challenging interactions at the level of individual frames and words.
In this paper, we model video and text as players in a multivariate
cooperative game to handle the uncertainty of fine-grained semantic
interactions, which vary in granularity, combination, and intensity.
Concretely, we propose Hierarchical Banzhaf Interaction (HBI), which values
the possible correspondences between video frames and text words to enable
sensitive and explainable cross-modal contrast. To efficiently realize the
cooperative game among multiple video frames and text words, the proposed
method clusters the original video frames (text words) and computes the
Banzhaf Interaction between the merged tokens. By stacking token-merge
modules, we play cooperative games at different semantic levels. Extensive
experiments on commonly used text-video retrieval and video-question answering
benchmarks show that HBI achieves superior performance. More encouragingly,
HBI can also serve as a visualization tool that promotes the understanding of
cross-modal interaction, which may have a far-reaching impact on the community.
Project page is available at https://jpthu17.github.io/HBI/.
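For reference, the following is the standard pairwise Banzhaf interaction index
from cooperative game theory, which the Banzhaf Interaction used by HBI builds
on. The notation here is illustrative rather than the paper's own: $N$ denotes
the set of players (e.g., frame and word tokens), $n = |N|$, $v$ is a coalition
value function, and $i, j$ are two players.

\[
  I_B(\{i,j\}) \;=\; \frac{1}{2^{\,n-2}} \sum_{S \subseteq N \setminus \{i,j\}}
  \Big[\, v\big(S \cup \{i,j\}\big) - v\big(S \cup \{i\}\big)
        - v\big(S \cup \{j\}\big) + v(S) \,\Big]
\]

A positive value indicates that $i$ and $j$ contribute more together than
separately; per the abstract, a quantity of this form is computed between
merged video and text tokens to score frame-word correspondences.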