Semantic Relation Reasoning for Shot-Stable Few-Shot Object Detection
Few-shot object detection is an important and long-standing problem due to
the inherent long-tail distribution of real-world data. Its performance is
largely limited by the data scarcity of novel classes. However, the semantic
relation between the novel classes and the base classes is constant regardless
of the data availability. In this work, we investigate utilizing this semantic
relation together with the visual information and introduce explicit relation
reasoning into the learning of novel object detection. Specifically, we
represent each class concept by a semantic embedding learned from a large
corpus of text. The detector is trained to project the image representations of
objects into this embedding space. We also identify the problems of naively
using the raw embeddings with a heuristic knowledge graph and propose to
augment the embeddings with a dynamic relation graph. As a result, our few-shot
detector, termed SRR-FSD, is robust and stable to variation in the number of
shots of novel objects. Experiments show that SRR-FSD achieves competitive
results at higher shots and, more importantly, significantly better performance
when both explicit and implicit shots are scarce. The benchmark protocol with implicit
shots removed from the pretrained classification dataset can serve as a more
realistic setting for future research.
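
The two ideas described above, projecting object features into a semantic
embedding space and augmenting the class embeddings with a dynamic relation
graph, can be illustrated with a minimal sketch. This is not the SRR-FSD
implementation. The abstract does not specify how the relation graph is built,
so the sketch assumes a self-attention layer over the class embeddings. The
names `P`, `Q`, `K`, `augment_embeddings`, and `classify`, as well as all toy
values, are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def augment_embeddings(W, Q, K):
    """Assumed form of a dynamic relation graph: self-attention over classes.

    W: class embeddings, one row per class (n_classes x d_e).
    Q, K: hypothetical learned query/key projections (d_k x d_e).
    Returns each embedding plus an attention-weighted mix of all embeddings.
    """
    queries = [matvec(Q, w) for w in W]
    keys = [matvec(K, w) for w in W]
    scale = math.sqrt(len(keys[0]))
    augmented = []
    for q, w in zip(queries, W):
        # Edge weights from this class to every class, computed from the data.
        attn = softmax([dot(q, k) / scale for k in keys])
        mixed = [sum(a * wj[i] for a, wj in zip(attn, W)) for i in range(len(w))]
        augmented.append([wi + mi for wi, mi in zip(w, mixed)])  # residual
    return augmented

def classify(visual_feature, P, W):
    """Project a visual feature into the semantic space and score each class.

    P: hypothetical learned projection (d_e x d_v).
    W: (possibly augmented) class embeddings (n_classes x d_e).
    """
    z = matvec(P, visual_feature)
    return softmax([dot(z, w) for w in W])

# Toy example: three classes with 2-d embeddings; classes 0 and 2 are related.
W = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
P = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # maps 3-d visual features to 2-d

W_aug = augment_embeddings(W, Q, K)
probs = classify([2.0, 0.0, 5.0], P, W_aug)
```

In this toy setup, a feature that projects close to class 0 also receives a
relatively high score for the semantically related class 2, which is the kind
of cross-class knowledge sharing that semantic relation reasoning is meant to
exploit when novel classes have few examples.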