TFCNet: Temporal Fully Connected Networks for Static Unbiased Temporal Reasoning
Temporal reasoning is an important capability for visual intelligence. In the
computer vision research community, temporal reasoning is usually studied in
the form of video classification, for which many state-of-the-art neural
network architectures and benchmark datasets have been proposed in recent
years, most notably 3D CNNs and Kinetics. However, recent work has found that
current video classification benchmarks contain strong biases towards static
features and thus cannot accurately reflect temporal modeling ability. New
video classification benchmarks aiming to eliminate static biases have been
proposed, and experiments on these benchmarks show that current clip-based
3D CNNs are outperformed by RNN architectures and recent video transformers.
In this paper, we find that 3D CNNs and their efficient depthwise variants,
when a video-level sampling strategy is used, can actually beat RNNs and
recent vision transformers by significant margins on static-unbiased temporal
reasoning benchmarks. Furthermore, we propose the Temporal Fully Connected
block (TFC block), an efficient and effective component that approximates
fully connected layers along the temporal dimension to obtain a video-level
receptive field, enhancing spatiotemporal reasoning ability. With TFC blocks
inserted into video-level 3D CNNs (V3D), our proposed TFCNets establish new
state-of-the-art results on the synthetic temporal reasoning benchmark CATER
and the real-world static-unbiased dataset Diving48, surpassing all previous
methods.
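To make the two core ideas concrete, the sketch below illustrates (a) video-level frame sampling, which draws frames uniformly across the whole video rather than from one short clip, and (b) a fully connected mixing applied along the temporal dimension, which gives every output frame a video-level receptive field. This is a minimal NumPy sketch of the general technique; the function names and the plain T-by-T weight matrix are illustrative assumptions, not the authors' exact TFC block design.

```python
import numpy as np

def sample_video_level(num_frames: int, t: int) -> np.ndarray:
    # Video-level sampling: t indices spread uniformly over the
    # entire video, in contrast to a contiguous clip of t frames.
    return np.linspace(0, num_frames - 1, t).round().astype(int)

def temporal_fc(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    # x: (N, C, T, H, W) feature map; w: (T, T) learned mixing matrix.
    # Equivalent to one fully connected layer applied along the time
    # axis independently at every (n, c, h, w) position, so each output
    # frame can attend to all T input frames (video-level receptive field).
    return np.einsum('ncthw,ts->ncshw', x, w)

# Toy usage: sample 4 frame indices from a 32-frame video,
# then mix a random feature map along time.
idx = sample_video_level(32, 4)            # e.g. indices spanning [0, 31]
rng = np.random.default_rng(0)
feat = rng.standard_normal((2, 3, 4, 5, 5))
w = np.eye(4)                              # identity mixing leaves features unchanged
out = temporal_fc(feat, w)
assert np.allclose(out, feat)
```

With an identity matrix the temporal mixing is a no-op, which makes the shape contract easy to verify; a learned dense matrix in its place is what provides the temporal fusion described above.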