SignBERT+: Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding
Hand gesture plays a crucial role in the expression of sign language.
Current deep-learning-based methods for sign language understanding (SLU) are
prone to over-fitting due to insufficient sign data resources and suffer from
limited interpretability. In this paper, we propose SignBERT+, the first
self-supervised pre-trainable framework with a model-aware hand prior incorporated. In
our framework, the hand pose is regarded as a visual token, which is derived
from an off-the-shelf detector. Each visual token is embedded with gesture
state and a spatio-temporal position encoding. To take full advantage of the
available sign data resources, we first perform self-supervised learning to
model their statistics. To this end, we design multi-level masked modeling strategies
(joint, frame, and clip) to mimic common failure cases of the detector. Jointly with
these masked modeling strategies, we incorporate a model-aware hand prior to
better capture hierarchical context over the sequence. After the pre-training,
we carefully design simple yet effective prediction heads for downstream tasks.
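
The following is a minimal NumPy sketch, not the authors' implementation, of the pre-training input pipeline described above: hand-pose tokens from an off-the-shelf detector are embedded with a toy spatio-temporal position encoding, and joint-, frame-, and clip-level masks mimic common detector failure cases. The array shapes, masking ratios, projection, and [MASK] embedding are illustrative assumptions.

```python
# A minimal sketch (not the authors' code), assuming pose sequences of shape
# (T, J, 2): T frames, J hand joints, 2D coordinates from an off-the-shelf detector.
import numpy as np

rng = np.random.default_rng(0)

def multi_level_mask(T, J, p_joint=0.1, p_frame=0.05, clip_len=4, p_clip=0.05):
    """Boolean mask of shape (T, J); True marks a masked pose token at the
    joint, frame, or clip level, mimicking common detector failure cases."""
    mask = np.zeros((T, J), dtype=bool)
    # Joint-level: randomly drop individual joints (e.g. occluded fingertips).
    mask |= rng.random((T, J)) < p_joint
    # Frame-level: drop all joints of randomly chosen frames (whole-hand misses).
    mask[rng.random(T) < p_frame, :] = True
    # Clip-level: drop short contiguous runs of frames (tracking losses).
    for t in range(T):
        if rng.random() < p_clip:
            mask[t:t + clip_len, :] = True
    return mask

def embed_tokens(pose, mask, d_model=64):
    """Hypothetical token embedding: masked joints are replaced by a stand-in
    [MASK] vector; each token gets a toy spatio-temporal position encoding."""
    T, J, _ = pose.shape
    W = rng.standard_normal((2, d_model)) * 0.02     # stand-in linear projection
    mask_vec = rng.standard_normal(d_model) * 0.02   # stand-in [MASK] embedding
    tokens = pose @ W                                 # (T, J, d_model)
    tokens[mask] = mask_vec
    # Sinusoidal encoding over frame index (temporal) and joint index (spatial).
    t_idx = np.arange(T)[:, None, None]
    j_idx = np.arange(J)[None, :, None]
    k = np.arange(d_model)[None, None, :]
    tokens = tokens + np.sin(t_idx / 10000 ** (k / d_model)) \
                    + np.cos(j_idx / 10000 ** (k / d_model))
    return tokens

pose_seq = rng.standard_normal((16, 21, 2))           # 16 frames, 21 hand joints
m = multi_level_mask(*pose_seq.shape[:2])
x = embed_tokens(pose_seq, m)
print(x.shape, m.mean())                              # (16, 21, 64), masking ratio
```
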
To validate our framework, we perform extensive experiments on three main SLU
tasks: isolated and continuous sign language recognition (SLR) and sign
language translation (SLT). Experimental
results demonstrate the effectiveness of our method, achieving new
state-of-the-art performance with a notable gain.
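
As an illustration of what a simple downstream prediction head could look like, here is a hypothetical PyTorch sketch for isolated SLR fine-tuning on top of the pre-trained encoder; the feature width, class count, and pooling choice are assumptions rather than the paper's settings.

```python
# Hypothetical fine-tuning head for isolated SLR (not the paper's design):
# assume the pre-trained encoder yields per-token features of width d_model;
# mean-pool over tokens and classify into gloss categories.
import torch
import torch.nn as nn

class IsolatedSLRHead(nn.Module):
    def __init__(self, d_model=64, num_classes=2000, dropout=0.1):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)   # average pooling over tokens
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, features):               # features: (B, N_tokens, d_model)
        x = self.pool(features.transpose(1, 2)).squeeze(-1)  # (B, d_model)
        return self.fc(self.drop(x))            # gloss logits

head = IsolatedSLRHead()
logits = head(torch.randn(2, 16 * 21, 64))       # batch of 2 encoded sequences
print(logits.shape)                              # torch.Size([2, 2000])
```
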