PosSAM: Panoptic Open-vocabulary Segment Anything
In this paper, we introduce an open-vocabulary panoptic segmentation model
that effectively unifies the strengths of the Segment Anything Model (SAM) with
the vision-language CLIP model in an end-to-end framework. While SAM excels in
generating spatially-aware masks, its decoder falls short in recognizing
object class information and tends to oversegment without additional guidance.
Existing approaches address this limitation by using multi-stage techniques and
employing separate models to generate class-aware prompts, such as bounding
boxes or segmentation masks. Our proposed method, PosSAM, is an end-to-end model
that leverages SAM's spatially rich features to produce instance-aware masks
and harnesses CLIP's semantically discriminative features for effective
instance classification. Specifically, we address the limitations of SAM and
propose a novel Local Discriminative Pooling (LDP) module leveraging
class-agnostic SAM and class-aware CLIP features for unbiased open-vocabulary
classification. Furthermore, we introduce a Mask-Aware Selective Ensembling
(MASE) algorithm that adaptively enhances the quality of generated masks and
boosts the performance of open-vocabulary classification during inference for
each image. We conducted extensive experiments to demonstrate our method's
strong generalization properties across multiple datasets, achieving
state-of-the-art performance with substantial improvements over existing
open-vocabulary panoptic segmentation methods. In both the COCO to ADE20K and
ADE20K to COCO settings, PosSAM outperforms the previous state-of-the-art
methods by a large margin of 2.4 PQ and 4.6 PQ, respectively. Project Website:
https://vibashan.github.io/possam-web/.
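
Because the abstract only sketches how PosSAM combines SAM and CLIP features, the snippet below is a minimal, hypothetical PyTorch sketch of the general idea of pooling dense SAM and CLIP features inside each predicted mask and classifying the fused embedding against CLIP text embeddings of arbitrary category names. The function names, tensor shapes, and the `fuse` head are assumptions for illustration, not the paper's Local Discriminative Pooling implementation.

```python
# Hypothetical sketch: mask-pooled fusion of SAM and CLIP features for
# open-vocabulary mask classification. Not the authors' implementation.
import torch
import torch.nn.functional as F


def mask_pool(features: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Average-pool a dense feature map inside each predicted mask.

    features: (C, H, W) dense feature map (e.g., SAM or CLIP image features).
    masks:    (N, H, W) soft or binary masks for N mask proposals.
    returns:  (N, C) one pooled embedding per mask.
    """
    weights = masks.flatten(1)                                   # (N, H*W)
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-6)
    return weights @ features.flatten(1).t()                     # (N, C)


def open_vocab_logits(sam_feats, clip_feats, masks, text_embeds, fuse):
    """Score each mask against CLIP text embeddings of candidate class names.

    sam_feats:   (C1, H, W) spatially rich, class-agnostic SAM features.
    clip_feats:  (C2, H, W) semantically discriminative CLIP image features.
    masks:       (N, H, W) mask proposals.
    text_embeds: (K, D) CLIP text embeddings for K candidate class names.
    fuse:        a learned head mapping concatenated pooled features to dim D
                 (assumed here; e.g., torch.nn.Linear(C1 + C2, D)).
    """
    pooled_sam = mask_pool(sam_feats, masks)                     # (N, C1)
    pooled_clip = mask_pool(clip_feats, masks)                   # (N, C2)
    fused = fuse(torch.cat([pooled_sam, pooled_clip], dim=-1))   # (N, D)
    fused = F.normalize(fused, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    return fused @ text.t()                                      # (N, K) cosine logits
```

In this sketch the cosine-similarity logits could be turned into per-mask class predictions with a softmax over the K candidate names; how PosSAM actually weights SAM-derived versus CLIP-derived evidence at inference (the Mask-Aware Selective Ensembling step) is described in the paper, not here.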