Learning Algebraic Recombination for Compositional Generalization
Vision-language modeling has enabled open-vocabulary tasks where predictions
can be queried using any text prompt in a zero-shot manner. Existing
open-vocabulary tasks focus on object classes, whereas research on object
attributes is limited due to the lack of a reliable attribute-focused
evaluation benchmark. This paper introduces the Open-Vocabulary Attribute
Detection (OVAD) task and the corresponding OVAD benchmark. The objective of
the novel task and benchmark is to probe object-level attribute information
learned by vision-language models. To this end, we create a clean and densely
annotated test set covering 117 attribute classes on the 80 object classes of
MS COCO. It includes both positive and negative annotations, which enable
open-vocabulary evaluation. Overall, the benchmark consists of 1.4 million
annotations. For reference, we provide a first baseline method for
open-vocabulary attribute detection. Moreover, we demonstrate the benchmark's
value by studying the attribute detection performance of several foundation
models. Project page: https://ovad-benchmark.github.io