kNN-Diffusion: Image Generation via Large-Scale Retrieval
Recent text-to-image models have achieved impressive results. However, since
they require large-scale datasets of text-image pairs, it is impractical to
train them on new domains where data is scarce or not labeled. In this work, we
propose using large-scale retrieval methods, in particular, efficient
k-Nearest-Neighbors (kNN), which offers several novel capabilities: (1) training a
substantially smaller and more efficient text-to-image diffusion model without any
text, (2) generating out-of-distribution images by simply swapping the
retrieval database at inference time, and (3) performing text-driven local
semantic manipulations while preserving object identity. To demonstrate the
robustness of our method, we apply our kNN approach to two state-of-the-art
diffusion backbones and show results on several datasets. As
evaluated by human studies and automatic metrics, our method achieves
state-of-the-art results among existing approaches that train
text-to-image generation models using images only (without paired text data).
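To make the retrieval mechanism concrete, the following is a minimal, illustrative sketch rather than our implementation. It assumes a shared image-text embedding space such as CLIP and a FAISS index over image embeddings; the random stand-in embeddings and the final `diffusion_model` call are hypothetical placeholders.

```python
# Minimal sketch of kNN-conditioned generation (illustrative only).
# Assumptions: a shared image-text embedding space of dimension d
# (e.g., CLIP ViT-B/32), a FAISS index over image embeddings, and a
# hypothetical diffusion model conditioned on retrieved embeddings.
import numpy as np
import faiss

d, k = 512, 10                        # embedding dim, neighbors retrieved
rng = np.random.default_rng(0)

# Stand-in for CLIP image embeddings of an unlabeled image collection.
image_embs = rng.standard_normal((10_000, d)).astype(np.float32)
faiss.normalize_L2(image_embs)        # cosine similarity via inner product

index = faiss.IndexFlatIP(d)
index.add(image_embs)

def retrieve_neighbors(query_emb: np.ndarray) -> np.ndarray:
    """Return the k nearest image embeddings to a (text or image) query."""
    q = query_emb.astype(np.float32)[None, :]
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return image_embs[ids[0]]         # shape (k, d)

# During training, each image's own embedding serves as the query, so no
# text is needed; at inference, a text prompt embedded in the same space
# takes its place. Swapping `index` / `image_embs` for a different image
# collection shifts the output distribution without retraining.
text_emb = rng.standard_normal(d)     # stand-in for a CLIP text embedding
neighbors = retrieve_neighbors(text_emb)
# image = diffusion_model.generate(condition=neighbors)  # hypothetical
```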