Multimodal Pretraining for Dense Video Captioning
Learning specific hands-on skills such as cooking, car maintenance, and home
repairs increasingly happens via instructional videos. The user experience with
such videos is known to be improved by meta-information such as time-stamped
annotations for the main steps involved. Generating such annotations
automatically is challenging, and we describe here two relevant contributions.
First, we construct and release a new dense video captioning dataset, Video
Timeline Tags (ViTT), featuring a variety of instructional videos together with
time-stamped annotations. Second, we explore several multimodal
sequence-to-sequence pretraining strategies that leverage large unsupervised
datasets of videos and caption-like texts. We pretrain and subsequently
finetune dense video captioning models using both YouCook2 and ViTT. We show
that such models generalize well and are robust over a wide variety of
instructional videos.
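
To make the notion of time-stamped step annotations concrete, below is a minimal illustrative sketch (not from the paper; all field and function names are assumptions) of how ViTT-style timeline tags could be represented and flattened into a single target string for a sequence-to-sequence captioning model.

```python
# Illustrative sketch only: one possible representation of time-stamped
# step annotations and a simple flattening into a seq2seq training target.
from dataclasses import dataclass
from typing import List


@dataclass
class TimelineTag:
    """A single time-stamped annotation marking a main step in the video."""
    timestamp_sec: float  # when the step begins
    caption: str          # short free-text description of the step


@dataclass
class AnnotatedVideo:
    video_id: str
    tags: List[TimelineTag]


def to_seq2seq_target(video: AnnotatedVideo) -> str:
    """Flatten the ordered tags into one target string (assumed format)."""
    ordered = sorted(video.tags, key=lambda t: t.timestamp_sec)
    return " ; ".join(f"[{t.timestamp_sec:.0f}s] {t.caption}" for t in ordered)


if __name__ == "__main__":
    example = AnnotatedVideo(
        video_id="demo",
        tags=[
            TimelineTag(12.0, "gather ingredients"),
            TimelineTag(58.0, "mix the batter"),
            TimelineTag(140.0, "bake for 30 minutes"),
        ],
    )
    print(to_seq2seq_target(example))
    # -> [12s] gather ingredients ; [58s] mix the batter ; [140s] bake for 30 minutes
```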