Introduction
Figure 1: The goal of this work is to take an image such as the one in
(a), detect a human figure, and localize its
joints and limbs (b) along with their associated pixel masks (c).
Pixels, Superpixels, and Segmentations
Figure 2: Stages of low-level processing. (a) Input image. (b) Edge map (Pb),
which handles high-frequency texture well. (c) A Normalized Cuts
segmentation with k=40. Salient limbs pop out as single segments; the
head and torso consist of several segments. (d) A "superpixel" map
with 200 superpixels, which preserves the fine details of the image.
Flow of the Algorithm
Learning Limb Detection
Each segment (Figure 2(c)) is a candidate for limb detection. We formulate the
limb detection problem as a classification between "good" limbs (from
groundtruth human data) and "bad" limbs (random segments). The features we use
include:
- contour cue: the low-level contrast between a segment and its
surroundings.
- shape cue: shape similarity between a segment
and a template (a rectangle).
- shading cue: shading similarity to a stored template (as a limb
often has a characteristic shading pattern).
- focus cue: the foreground is typically in focus and thus has
high-frequency content (a weak cue, present in many news photos).
A logistic classifier is used to combine these cues.
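As a rough illustration of how a logistic classifier can combine the four cues, the sketch below fits one weight per cue with plain gradient descent. The cue names, the dictionary layout, the learning rate, and the training loop are assumptions made for this example; the page does not say how the classifier was actually trained.

```python
import math

# Hypothetical cue names; each candidate segment is a dict of cue scores.
CUES = ("contour", "shape", "shading", "focus")


def limb_probability(cues, weights, bias):
    """Logistic combination: sigmoid of a weighted sum of the cue scores."""
    z = bias + sum(weights[name] * cues[name] for name in CUES)
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, lr=0.5, epochs=2000):
    """Fit weights by batch gradient descent on the log loss.

    samples: list of cue dicts; labels: 1 for "good" limbs, 0 for "bad" ones.
    """
    weights = {name: 0.0 for name in CUES}
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = {name: 0.0 for name in CUES}
        grad_b = 0.0
        for cues, y in zip(samples, labels):
            err = limb_probability(cues, weights, bias) - y
            for name in CUES:
                grad_w[name] += err * cues[name]
            grad_b += err
        for name in CUES:
            weights[name] -= lr * grad_w[name] / n
        bias -= lr * grad_b / n
    return weights, bias
```

The returned probability can then serve as the score by which candidate segments are ranked.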
This figure shows the performance of our (half-)limb detector. On average, if
we keep the top 8 candidates per image, we detect about 4 true half-limbs,
which seems a good trade-off point.
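The trade-off described here can be measured with a small helper like the one below: keep the k best-scoring candidates per image and count how many of them are true half-limbs. The field names `score` and `is_true` are hypothetical, chosen only for this sketch.

```python
def keep_top_candidates(candidates, k=8):
    """Return the k highest-scoring candidates (all of them if fewer than k)."""
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]


def mean_true_detections(images, k=8):
    """Average number of true half-limbs among the top k candidates per image."""
    hits = [
        sum(1 for c in keep_top_candidates(cands, k) if c["is_true"])
        for cands in images
    ]
    return sum(hits) / len(hits)
```

Sweeping k and plotting the result gives a curve showing how many candidates are needed to reach a given number of true detections.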
Assembly of Parts
For every baseball player image, there are a few half-limbs that are very
salient and easily detectable, which we call the islands of saliency. On
average we can detect 4 half-limbs among the top 8 candidates. We also have 5
candidates for torso, obtained from a similar classifier. The goal of the
assembly stage is to find out which top candidates are true half-limbs, label
them and use them to recover the whole-body configuration.
To label the half-limbs and the torso, we use the following global configuration cues:
- relative width: limbs may be foreshortened but their widths
remain unchanged. Therefore, the relative width of an upper-leg and a lower-arm
must be compatible with each other (we use anthropometric data as groundtruth).
- length given torso: we assume that torsos are not foreshortened
much. Therefore a torso candidate gives conservative upper bounds on the
lengths of limbs.
- adjacency: parts that are connected in the body should be adjacent to each other in the image.
- symmetry of clothing: symmetric limbs should have the same appearance.
We use exhaustive search with pruning to find the best configurations. Because
we are able to limit the number of candidates (with the help of sophisticated
low-level processing), the search space is reasonably small.
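A minimal sketch of exhaustive search with pruning is given below, using only the relative-width and length-given-torso cues as pruning tests; the adjacency and symmetry cues are left out for brevity. The label names, the numeric tables, the tolerance, and the additive score are all assumptions made for illustration and are not the values or the scoring function used in the paper.

```python
# Hypothetical anthropometric numbers, for illustration only.
EXPECTED_WIDTH = {"upper_arm": 1.0, "lower_arm": 0.8,
                  "upper_leg": 1.6, "lower_leg": 1.1}
MAX_LENGTH_VS_TORSO = {"upper_arm": 0.7, "lower_arm": 0.6,
                       "upper_leg": 0.9, "lower_leg": 0.9}
WIDTH_TOLERANCE = 0.3  # allowed relative deviation from the expected ratio


def fits_torso(label, cand, torso):
    """Length cue: the torso gives an upper bound on each limb length."""
    return cand["length"] <= MAX_LENGTH_VS_TORSO[label] * torso["length"]


def widths_compatible(label_a, cand_a, label_b, cand_b):
    """Width cue: the observed width ratio must be near the expected one."""
    expected = EXPECTED_WIDTH[label_a] / EXPECTED_WIDTH[label_b]
    observed = cand_a["width"] / cand_b["width"]
    return abs(observed - expected) <= WIDTH_TOLERANCE * expected


def best_configuration(labels, candidates, torso):
    """Depth-first search over label assignments, pruning invalid branches."""
    best = {"score": float("-inf"), "assignment": None}

    def extend(i, assignment, used, score):
        if i == len(labels):
            if score > best["score"]:
                best["score"] = score
                best["assignment"] = dict(assignment)
            return
        label = labels[i]
        # Option 1: leave this label unassigned (the limb was not detected).
        extend(i + 1, assignment, used, score)
        # Option 2: try each unused candidate; skip it as soon as a cue fails.
        for idx, cand in enumerate(candidates):
            if idx in used or not fits_torso(label, cand, torso):
                continue
            if not all(widths_compatible(label, cand, other, candidates[j])
                       for other, j in assignment.items()):
                continue
            assignment[label] = idx
            used.add(idx)
            extend(i + 1, assignment, used, score + cand["score"])
            used.remove(idx)
            del assignment[label]

    extend(0, {}, set(), 0.0)
    return best["assignment"], best["score"]
```

With only a handful of candidates per image, the number of branches that survive the pruning tests stays small, which is what makes exhaustive enumeration affordable in this setting.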
In more recent work (ICCV 2005), we
employed Integer Quadratic Programming (IQP) to solve the assembly problem.
References
- Recovering Human Body Configurations: Combining Segmentation and Recognition.
Greg Mori, Xiaofeng Ren, Alyosha Efros and Jitendra Malik, in CVPR '04, volume 2, pages 326-333, Washington, DC 2004.