Controlling Hallucinations at Word Level in Data-to-Text Generation
Although the estimation of 3D human pose and shape (HPS) is rapidly
progressing, current methods still cannot reliably estimate moving humans in
global coordinates, a capability that is critical for many applications. This is
particularly challenging when the camera is also moving, entangling human and
camera motion. To address these issues, we adopt a novel 5D representation
(space, time, and identity) that enables end-to-end reasoning about people in
scenes. Our method, called TRACE, introduces several novel architectural
components. Most importantly, it uses two new "maps" to reason about the 3D
trajectory of people over time in both camera and world coordinates. An additional
memory unit enables persistent tracking of people even during long occlusions.
TRACE is the first one-stage method to jointly recover and track 3D humans in
global coordinates from dynamic cameras. By training it end-to-end and using
full image information, TRACE achieves state-of-the-art performance on tracking
and HPS benchmarks. The code and dataset are released for research purposes.