Enhancing Intra-class Information Extraction for Heterophilous Graphs: One Neural Architecture Search Approach
Given an image with multiple people, our goal is to directly regress the pose
and shape of all the people as well as their relative depth. Inferring the
depth of a person in an image, however, is fundamentally ambiguous without
knowing their height. This is particularly problematic when the scene contains
people of very different sizes, e.g. from infants to adults. Solving this
requires three things. First, we develop a novel method to infer the poses and
depth of multiple people in a single image. While previous work on
multi-person estimation reasons in the image plane, our method, called
BEV, adds an imaginary Bird's-Eye-View representation to explicitly
reason about depth. BEV reasons simultaneously about body centers in the image
and in depth and, by combining these, estimates 3D body position. Unlike prior
work, BEV is a single-shot method that is end-to-end differentiable. Second,
height varies with age, making it impossible to resolve depth without also
estimating the age of people in the image. To do so, we exploit a 3D body model
space that lets BEV infer shapes from infants to adults. Third, to train BEV,
we need a new dataset. Specifically, we create a "Relative Human" (RH) dataset
that includes age labels and relative depth relationships between the people in
the images. Extensive experiments on RH and AGORA demonstrate the effectiveness
of the model and training scheme. BEV outperforms existing methods on depth
reasoning, child shape estimation, and robustness to occlusion. The code and
dataset are released for research purposes.
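
The height-depth ambiguity described above can be illustrated with a simple pinhole camera model. The sketch below is not part of BEV; the function name, the focal length, and the example heights are assumptions chosen only for illustration. It shows that the same observed pixel height is consistent with very different depths depending on the assumed real height of the person.

```python
def depth_from_pixel_height(pixel_height, real_height_m, focal_px):
    """Depth under a pinhole model, where
    pixel_height = focal_px * real_height_m / depth.

    All arguments are illustrative assumptions, not values from the paper.
    """
    return focal_px * real_height_m / pixel_height


focal_px = 1000.0      # assumed focal length, in pixels
pixel_height = 400.0   # assumed observed height of a person, in pixels

# The same image evidence, interpreted under two assumed body heights.
adult_depth = depth_from_pixel_height(pixel_height, 1.7, focal_px)
infant_depth = depth_from_pixel_height(pixel_height, 0.7, focal_px)

print(f"assumed adult (1.7 m):  depth = {adult_depth:.2f} m")
print(f"assumed infant (0.7 m): depth = {infant_depth:.2f} m")
```

Under these assumed numbers, the two interpretations differ by a factor of more than two in depth, which is why an estimate of age or body shape is needed before depth can be resolved.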