Guess What Moves: Unsupervised Video and Image Segmentation by Anticipating Motion
Motion, measured via optical flow, provides a powerful cue to discover and
learn objects in images and videos. However, compared to using appearance, it
has some blind spots, such as the fact that objects become invisible if they do
not move. In this work, we propose an approach that combines the strengths of
motion-based and appearance-based segmentation. We propose to supervise an
image segmentation network with the pretext task of predicting regions that are
likely to contain simple motion patterns, and thus likely to correspond to
objects. As the model only uses a single image as input, we can apply it in two
settings: unsupervised video segmentation, and unsupervised image segmentation.
We achieve state-of-the-art results for videos, and demonstrate the viability
of our approach on still images containing novel objects. Additionally, we
experiment with different motion models and optical flow backbones, and find
the method to be robust to these changes. Project page and code available at
https://www.robots.ox.ac.uk/~vgg/research/gwm.
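
To make the pretext task concrete, the following is a minimal sketch, not the authors' implementation, of this kind of motion-supervised loss in PyTorch: a segmentation network sees only a single image and predicts soft masks, while the training loss checks how well precomputed optical flow is explained by a simple per-segment motion model. The constant-flow (per-segment mean) motion model, the tensor shapes, and the function name are illustrative assumptions; the paper's exact motion models may differ.

```python
import torch

def flow_reconstruction_loss(masks: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Motion-based pretext loss sketch.

    masks: (B, K, H, W) soft segment masks from an image-only network
           (e.g. softmax over K segments).
    flow:  (B, 2, H, W) precomputed optical flow; used only in the loss,
           never as network input, so inference needs a single image.
    """
    B, K, H, W = masks.shape
    m = masks.reshape(B, K, 1, H * W)   # (B, K, 1, HW)
    f = flow.reshape(B, 1, 2, H * W)    # (B, 1, 2, HW)
    # Simplest motion model: one constant flow vector per segment,
    # estimated as the mask-weighted mean flow.
    mean = (m * f).sum(-1, keepdim=True) / (m.sum(-1, keepdim=True) + 1e-6)
    # Penalize flow that the per-segment model cannot reconstruct, so masks
    # are pushed toward regions with simple (here: constant) motion patterns.
    return (m * (f - mean).pow(2)).sum(2).mean()

# Usage sketch: masks = segmenter(image).softmax(dim=1)
#               loss = flow_reconstruction_loss(masks, optical_flow)
```

Because the motion model only enters through the loss, swapping in a richer parametric model (e.g. affine flow per segment) or a different optical flow backbone changes the supervision signal but not the image-only segmentation network, which is consistent with the robustness to these choices reported above.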