Introduction
There are many methods for reliable recognition of frontal face images,
and more recently for recognition from profile gait sequences [1]. However, in many practical settings the
subjects to be identified do not conform to these pose and motion limitations. For
a recognition method to work under such conditions, it
needs to be view-independent. Furthermore, once a moving subject is observed by
the camera(s), it is appealing to integrate the recognition results
based on multiple modalities, to obtain a more robust recognition
scheme. We are currently developing techniques for view-independent, integrated face-and-gait
recognition. The experiments use an implementation of the Image-Based
Visual Hulls (IBVH) system, installed at the MIT AI Lab.
Methodology
An example of typical input to our system is shown in this AVI file.
At times the subject faces none of the cameras, nor does he present his
profile to any of them. With two dynamic cameras, one staying in
front of the subject looking at his face and another keeping a
desired distance while looking perpendicular to his direction of motion, one
would obtain data that could be fed to view-dependent recognition
methods. We do not have such cameras. However, we do have a few (four, in
our case) far-placed cameras, providing information on the 3D structure
of the observed scene and its appearance (texture).
Using IBVH, we can reconstruct the 3D visual hull (VH) of the person. Then, for any desired camera position, we can project this 3D hull onto the image plane and obtain a synthetic silhouette of the VH viewed from that viewpoint. Choosing virtual viewpoints so that the viewing direction is perpendicular to the subject's motion vector at the given time provides synthetic profiles (from both left and right), which can be used for recognition by gait. An example of such a profile sequence, corresponding to the input above, is available.
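The viewpoint selection above amounts to rotating the motion vector by ±90° in the ground plane to obtain the left- and right-profile viewing directions. A minimal sketch (the function name and the 2D ground-plane parameterization are illustrative assumptions, not part of the IBVH system):

```python
import numpy as np

def profile_view_directions(motion_vec):
    """Given the subject's horizontal motion vector, return unit viewing
    directions perpendicular to it (left and right profiles).
    Assumes an upright subject, so rotation stays in the ground plane."""
    v = np.array([motion_vec[0], motion_vec[1]], dtype=float)
    v /= np.linalg.norm(v)
    left = np.array([-v[1], v[0]])   # rotate 90 degrees counter-clockwise
    right = np.array([v[1], -v[0]])  # rotate 90 degrees clockwise
    return left, right
```

Either direction, paired with a virtual camera placed at the desired distance from the VH centroid, yields one synthetic profile view.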
Clearly, face recognition requires more than a silhouette. The visual hull provides a 3D approximation of the face surface. As long as this surface is visible to at least one camera, a texture can be mapped onto the VH, and for any virtual viewpoint it can be rendered using common computer-graphics techniques. Thus, we can obtain a synthetic frontal view of the face by placing the virtual camera in front of the face, looking towards it. To determine the face orientation, we use the novel fast face-detection method described in [3]. We can dramatically reduce the search space: given the VH, and assuming a generally upright position, we need to search only the different views of the head (the top part of the VH). An AVI clip demonstrates such a search space and the face images detected as close to frontal. Heuristic assumptions, such as taking the face orientation to be roughly parallel to the motion vector, can provide further cues on the initial spatial angle in which the frontal face is sought.
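The restricted search over head views can be sketched as sampling virtual camera positions on an arc around the head, centred ahead of the subject along the motion vector (the motion-parallel heuristic above). All names, the arc radius, and the angular spread here are illustrative assumptions, not the actual search procedure of [3]:

```python
import numpy as np

def candidate_head_views(head_center, motion_vec, radius=1.0, n=9, spread=np.pi / 2):
    """Sample n virtual camera positions on a horizontal arc around the head.
    The arc is centred on the assumed facing direction (parallel to the
    motion vector); each camera looks back toward head_center."""
    v = np.asarray(motion_vec[:2], dtype=float)
    base = np.arctan2(v[1], v[0])                       # assumed facing direction
    angles = base + np.linspace(-spread / 2, spread / 2, n)
    cams = [head_center + radius * np.array([np.cos(a), np.sin(a), 0.0])
            for a in angles]
    return cams
```

Each candidate view would then be rendered from the textured VH and scored by the face detector, keeping the most frontal-looking result.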
Trajectory estimation
To estimate the motion of the subject, we examine the
position of the centroid of the VH, which is in turn an estimate of
the body's center of mass. This is a coarse estimate,
susceptible to errors due both to measurement noise (imperfect
silhouette extraction leads to a distorted VH) and to dynamic
changes in body shape (such as swinging arms and legs). However, we
found it reasonably precise for subjects who walk across the scene at
normal walking speed.
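The centroid estimate itself is straightforward, assuming the VH is available as a set of 3D sample points (voxel centres or surface samples; this representation is an assumption of the sketch, not a detail of the IBVH implementation):

```python
import numpy as np

def vh_centroid(points):
    """Mean of the visual-hull sample points, used as a coarse
    estimate of the body's centre of mass."""
    pts = np.asarray(points, dtype=float)
    return pts.mean(axis=0)
```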
A simple Kalman filter is applied to the observed VH centroids to produce an estimate of the motion direction in each frame. Once this vector is obtained, we can compute the viewpoints relevant to the synthetic face and gait data. Synthetic face images can be rendered and recognized on the fly, while the gait feature vector can be computed once a sufficient number of synthetic profiles has been collected.
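A minimal constant-velocity Kalman filter over the ground-plane centroids might look as follows. The state layout and the noise scales q and r are illustrative assumptions, not necessarily those used in our implementation:

```python
import numpy as np

class CentroidKalman:
    """Constant-velocity Kalman filter over 2D ground-plane centroids.
    State: [x, y, vx, vy]; measurement: [x, y]."""

    def __init__(self, dt=1.0, q=1e-2, r=1e-1):
        self.F = np.eye(4)                    # state transition (constant velocity)
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.zeros((2, 4))             # measurement model: observe position
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = q * np.eye(4)                # process noise (illustrative scale)
        self.R = r * np.eye(2)                # measurement noise (illustrative scale)
        self.x = None
        self.P = np.eye(4)

    def update(self, z):
        """Fold in one observed centroid; return the current velocity estimate."""
        z = np.asarray(z, dtype=float)
        if self.x is None:                    # initialize on the first observation
            self.x = np.array([z[0], z[1], 0.0, 0.0])
            return self.x[2:]
        # Predict.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Correct.
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[2:]
```

The filtered velocity component gives the motion direction used to place the virtual profile and face cameras.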
Integration
Since face and gait recognition results are independent given the
constructed VH, one can potentially improve recognition performance by
combining them. We are exploring a probabilistic framework
for such integration. The ad-hoc rule currently in use assigns
a confidence measure to each possible ID under each modality. These
confidence levels are set according to the distance between the
feature vector computed from the observed data and those of the
individuals represented in the database. The final classification is
then obtained by averaging the confidence levels assigned by the face
and gait classifiers and picking the ID with the maximal confidence.
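The ad-hoc averaging rule can be sketched as follows, assuming each classifier outputs a per-ID confidence in [0, 1] (the dict-based interface is an illustrative assumption):

```python
def fuse_confidences(face_conf, gait_conf):
    """Average per-ID confidences from the face and gait classifiers and
    pick the ID with the maximal combined confidence.
    Inputs: dicts mapping ID -> confidence; missing IDs count as 0."""
    ids = set(face_conf) | set(gait_conf)
    combined = {i: 0.5 * (face_conf.get(i, 0.0) + gait_conf.get(i, 0.0))
                for i in ids}
    best = max(combined, key=combined.get)
    return best, combined
```

For example, if the face classifier strongly favours one ID while the gait classifier mildly favours another, the average can still select the face classifier's candidate.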
Results
Our experiments involved 12 individuals. The table below shows error
rates, obtained with leave-one-out cross validation over 54 sequences:
                       | With VH view-normalization | Without view-normalization
Face only              | .20                        | .69
Gait only              | .13                        | .48
Combined face and gait | .09                        | .36