
7th Intern. Conf. on Computer Analysis of Images and Patterns, Kiel, Germany, September 1997.

Face Recognition by Elastic Bunch Graph Matching*

Laurenz Wiskott¹†, Jean-Marc Fellous²‡, Norbert Krüger¹, and Christoph von der Malsburg¹,²

¹ Institut für Neuroinformatik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
http://www.neuroinformatik.ruhr-uni-bochum.de

² Computer Science Department, University of Southern California, Los Angeles, CA 90089, USA

* Supported by grants from the German Federal Ministry for Science and Technology (413-5839-01 IN 101 B9) and from ARPA and the U.S. Army Research Lab (01/93/K-109).
† Current address: Computational Neurobiology Laboratory, The Salk Institute for Biological Studies, San Diego, CA 92186-5800, http://www.cnl.salk.edu/CNL, wiskott@salk.edu.
‡ Current address: Volen Center for Complex Systems, Brandeis University, Waltham, MA 02254-9110.

Abstract

We present a system for recognizing human faces from single images out of a large database with one image per person. The task is difficult because of image variation in terms of position, size, expression, and pose. The system collapses most of this variance by extracting concise face descriptions in the form of image graphs. In these, fiducial points on the face (eyes, mouth etc.) are described by sets of wavelet components (jets). Image graph extraction is based on a novel approach, the bunch graph, which is constructed from a small set of sample image graphs. Recognition is based on a straightforward comparison of image graphs. We report recognition experiments on the FERET database and the Bochum database, including recognition across pose.

1 Introduction

The system presented here is based on a face recognition system described in [3]. Faces are represented by labeled graphs based on a Gabor wavelet transform. A matching process allows finding image graphs of new faces automatically. Recognition is then based on a simple similarity measure between the image graph and a gallery of model graphs. We have made three major extensions to the preceding system in order to handle larger galleries and larger variations in pose, and to increase the matching accuracy. Firstly, we use the phase of the complex Gabor wavelet coefficients to achieve a more accurate location of the nodes of the graph. Secondly, we employ object-adapted graphs, so that nodes refer to specific facial landmarks, called fiducial points. The correct correspondences between two faces can then be found across large viewpoint changes. Thirdly, we have introduced a new data structure, called the bunch graph, which serves as a generalized representation of faces by combining jets of a small set of individual faces. This allows the system to find the fiducial points in one matching process, which eliminates the need for matching each model graph individually. This reduces computational effort significantly. A more detailed description of this system is given in [8].

2 The System

2.1 Preprocessing with Gabor Wavelets

The representation of local features is based on a Gabor wavelet transform. It is defined as a convolution of the grey value image with a family of Gabor kernels

$$\psi_j(\vec{x}) = \frac{k_j^2}{\sigma^2} \exp\!\left(-\frac{k_j^2 x^2}{2\sigma^2}\right) \left[\exp(i\,\vec{k}_j \cdot \vec{x}) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right] \qquad (1)$$

in the shape of plane waves with wave vector $\vec{k}_j$, restricted by a Gaussian envelope function. We employ a discrete set of 5 different frequencies, index $\nu = 0,\ldots,4$, and 8 orientations, index $\mu = 0,\ldots,7$, yielding wave vectors $\vec{k}_j = (k_{jx}, k_{jy}) = (k_\nu \cos\varphi_\mu,\; k_\nu \sin\varphi_\mu)$ with $j = \mu + 8\nu$, $k_\nu = 2^{-\frac{\nu+2}{2}}\pi$, and $\varphi_\mu = \mu\frac{\pi}{8}$. This sampling evenly covers a band in frequency space. The width $\sigma/k$ of the Gaussian is controlled by the parameter $\sigma = 2\pi$. The second term in the bracket of (1) makes the kernels DC-free, i.e. the integral $\int \psi_j(\vec{x})\, d^2\vec{x}$ vanishes. One speaks of a wavelet transform since the family of kernels is self-similar, all kernels being generated from one mother wavelet by dilation and rotation.

A jet $\mathcal{J}$ is defined as the set $\{J_j\}$ of 40 complex coefficients obtained for one image point by this convolution. It can be written as

$$J_j = a_j \exp(i\phi_j) \qquad (2)$$

with amplitudes $a_j(\vec{x})$, which slowly vary with position, and phases $\phi_j(\vec{x})$, which rotate with a rate set by the spatial frequency or wave vector $\vec{k}_j$ of the kernels.
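For illustration, a minimal NumPy sketch of this preprocessing might look as follows. It is not the authors' implementation: the 65×65 kernel window is an assumed size, and the jet is obtained by correlating each kernel with a local patch (the complex conjugate of the convolution of Eq. (1), so amplitudes are unaffected), whereas a full system would transform the whole image, e.g. via FFT.

```python
import numpy as np

def gabor_kernel(nu, mu, size=65, sigma=2 * np.pi):
    """Kernel psi_j of Eq. (1) with j = mu + 8*nu; 'size' is an assumed window."""
    k = 2 ** (-(nu + 2) / 2) * np.pi               # k_nu
    phi = mu * np.pi / 8                           # varphi_mu
    kx, ky = k * np.cos(phi), k * np.sin(phi)      # wave vector k_j
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    gauss = (k ** 2 / sigma ** 2) * np.exp(-k ** 2 * (x ** 2 + y ** 2)
                                           / (2 * sigma ** 2))
    # subtracting exp(-sigma^2/2) makes the kernel DC-free (its integral vanishes)
    return gauss * (np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2))

def jet(image, x0, y0, size=65):
    """The 40 complex coefficients J_j of Eq. (2) at image point (x0, y0),
    computed as the correlation of each kernel with the local patch."""
    half = size // 2
    patch = image[y0 - half:y0 + half + 1, x0 - half:x0 + half + 1].astype(float)
    return np.array([(gabor_kernel(nu, mu, size) * patch).sum()
                     for nu in range(5) for mu in range(8)])
```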

Due to phase rotation, jets taken from image points with a relative distance $\vec{d}$ of a few pixels have very different coefficients, although representing almost the same local feature. This can cause severe problems for matching. We therefore either ignore the phase or compensate for its variation explicitly. The former leads to the similarity function

$$S_a(\mathcal{J}, \mathcal{J}') = \frac{\sum_j a_j a'_j}{\sqrt{\sum_j a_j^2 \sum_j a_j'^2}}, \qquad (3)$$

already used in [1, 3]. Compensating for phase variation leads to the similarity function

$$S_\phi(\mathcal{J}, \mathcal{J}') = \frac{\sum_j a_j a'_j \cos(\phi_j - \phi'_j - \vec{d} \cdot \vec{k}_j)}{\sqrt{\sum_j a_j^2 \sum_j a_j'^2}}. \qquad (4)$$

In order to compute it, the displacement $\vec{d}$ has to be estimated. This can be done by maximizing $S_\phi$ in its Taylor expansion with respect to $\vec{d}$. If applied iteratively from low to high frequencies, displacements of up to eight pixels can be estimated [8], which is useful to guide the matching process.
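Continuing the sketch, the two similarity functions and a least-squares displacement estimate from the Taylor expansion of Eq. (4) might be written as follows. The phase wrapping and the closed-form normal equations are our reading of the procedure; the iterative low-to-high-frequency refinement of [8] is omitted.

```python
# the 40 wave vectors k_j, in the same j = mu + 8*nu order as jet() above
k_vectors = np.array([[2 ** (-(nu + 2) / 2) * np.pi * np.cos(mu * np.pi / 8),
                       2 ** (-(nu + 2) / 2) * np.pi * np.sin(mu * np.pi / 8)]
                      for nu in range(5) for mu in range(8)])

def similarity_a(J1, J2):
    """Phase-insensitive jet similarity, Eq. (3)."""
    a1, a2 = np.abs(J1), np.abs(J2)
    return (a1 * a2).sum() / np.sqrt((a1 ** 2).sum() * (a2 ** 2).sum())

def similarity_phi(J1, J2, d, k_vectors):
    """Phase-sensitive jet similarity, Eq. (4), given an estimated displacement d."""
    a1, a2 = np.abs(J1), np.abs(J2)
    dphi = np.angle(J1) - np.angle(J2) - k_vectors @ d
    return (a1 * a2 * np.cos(dphi)).sum() / np.sqrt((a1 ** 2).sum() * (a2 ** 2).sum())

def estimate_displacement(J1, J2, k_vectors):
    """Taylor expansion of Eq. (4): maximizing S_phi reduces to minimizing
    sum_j w_j (dphi_j - d.k_j)^2 with weights w_j = a_j a'_j, solved in closed form."""
    w = np.abs(J1) * np.abs(J2)
    dphi = np.angle(J1) - np.angle(J2)
    dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi   # wrap phases to (-pi, pi]
    gamma = (k_vectors * w[:, None]).T @ k_vectors   # 2x2 normal matrix
    return np.linalg.solve(gamma, (w * dphi) @ k_vectors)
```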

2.2 Face Representation

For faces, we have defined a set of fiducial points, e.g. the pupils, the corners of the mouth, the tip of the nose, the top and bottom of the ears, etc. A labeled graph $G$ representing a face consists of $N$ nodes on these fiducial points at positions $\vec{x}_n$, $n = 1,\ldots,N$, and $E$ edges between them. The nodes are labeled with jets $\mathcal{J}_n$. The edges are labeled with distances $\Delta\vec{x}_e = \vec{x}_n - \vec{x}_{n'}$, $e = 1,\ldots,E$, where edge $e$ connects node $n'$ with $n$. Hence the edge labels are two-dimensional vectors. (When we want to refer to the geometrical structure of a graph, unlabeled by jets, we call it a grid.) This face graph is object-adapted, since the nodes are selected from face-specific points (fiducial points); see Fig. 2.

Graphs for different head poses differ in geometry and local features. Although the fiducial points refer to corresponding object locations, some may be occluded, and jets as well as distances vary due to rotation in depth. To be able to compare graphs for different poses, we manually define pointers to associate corresponding nodes in the different graphs.

In order to find fiducial points in new faces, one needs a general representation rather than models of individual faces. This representation should cover a wide range of possible variations in the appearance of faces, such as differently shaped eyes, mouths, or noses, different types of beards, variations due to gender, age, and race, etc. It is obvious that it would be too expensive to cover each feature combination by a separate graph. We instead combine a representative set of individual model graphs into a stack-like structure, called a face bunch graph (FBG) (see Fig. 1). Each model has the same grid structure and the nodes refer to identical fiducial points. A set of jets referring to one fiducial point is called a bunch. An eye bunch, for instance, may include jets from closed, open, female, and male eyes etc. to cover these local variations. During the location of fiducial points in a face not seen before, the procedure described in the next section selects the best fitting jet, called the local expert, from the bunch dedicated to each fiducial point. Thus, the full combinatorics of jets in the bunch graph is available, covering a much larger range of facial variation than represented in the constituting model graphs themselves.

Assume that for a particular pose there are $M$ model graphs $G^{B_m}$ ($m = 1,\ldots,M$) of identical structure, taken from different model faces. The corresponding FBG $B$ is then given the same structure; its nodes are labeled with bunches of jets $\mathcal{J}_n^{B_m}$ and its edges are labeled with the averaged distances $\Delta\vec{x}_e^B = \sum_m \Delta\vec{x}_e^{B_m} / M$.

Figure 1: The face bunch graph (FBG) serves as a representation of faces in general. It is designed to cover all possible variations in the appearance of faces. The FBG combines information from a number of face graphs. Its nodes are labeled with sets of jets, called bunches, and its edges are labeled with averages of distance vectors. During comparison to an image, the best fitting jet in each bunch is selected independently, indicated by grey shading.
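As a data-structure sketch (illustrative, with assumed field names rather than the authors' API), a face graph and a face bunch graph can be held like this:

```python
from dataclasses import dataclass

@dataclass
class FaceGraph:
    jets:  list  # N jets, one 40-component complex array per fiducial point
    edges: list  # E entries (n, n_prime, dx) with dx = x_n - x_{n'} (2-vector)

@dataclass
class BunchGraph:
    bunches:   list  # N bunches: one list of jets per fiducial point, one jet per model
    avg_edges: list  # E entries (n, n_prime, dx averaged over the M model graphs)
```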

2.3 Generating Face Representations by Elastic Bunch Graph Matching

So far we have only described how individual faces and general knowledge about faces are represented by labeled graphs and the FBG, respectively. How can these graphs be generated? The simplest method is to do so manually. We have used this method to generate initial graphs for the system, one graph for each pose, together with pointers to indicate which pairs of nodes in graphs for different poses correspond to each other. Once the system has an FBG (possibly consisting of only one manually defined model), graphs for new images can be generated automatically by elastic bunch graph matching. Initially, when the FBG contains only a few faces, it is necessary to review and correct the resulting matches, but once the FBG is rich enough (approximately 70 graphs) one can rely on the matching and generate large galleries of model graphs automatically.

A key role in elastic bunch graph matching is played by a function evaluating the graph similarity between an image graph and the FBG of identical pose. It depends on the jet similarities and on the distortion of the image grid relative to the FBG grid. For an image graph $G^I$ with nodes $n = 1,\ldots,N$ and edges $e = 1,\ldots,E$ and an FBG $B$ with model graphs $m = 1,\ldots,M$ the similarity is defined as

$$S_B(G^I, B) = \frac{1}{N} \sum_n \max_m S_\phi\!\left(\mathcal{J}_n^I, \mathcal{J}_n^{B_m}\right) - \frac{\lambda}{E} \sum_e \frac{\left(\Delta\vec{x}_e^I - \Delta\vec{x}_e^B\right)^2}{\left(\Delta\vec{x}_e^B\right)^2}, \qquad (5)$$

where $\lambda$ determines the relative importance of jets and metric structure. $\mathcal{J}_n$ are the jets at node $n$ and $\Delta\vec{x}_e$ are the distance vectors used as labels at edges $e$. Since the FBG provides several jets for each fiducial point, the best one is selected and used for comparison. These best fitting jets serve as local experts for the image face.
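A direct transcription of Eq. (5) over the structures sketched above might read as follows; the value of $\lambda$ is made up for illustration, as the paper does not state one here.

```python
def graph_similarity(G, B, k_vectors, lam=2.0):
    """Eq. (5): mean best-in-bunch jet similarity minus lambda times the
    mean squared relative distortion of the edges. lam is illustrative."""
    def best_in_bunch(j, bunch):
        # pick the 'local expert': the bunch jet most similar to the image jet,
        # with the displacement of Eq. (4) estimated per pair
        return max(similarity_phi(j, jb, estimate_displacement(j, jb, k_vectors),
                                  k_vectors) for jb in bunch)
    jet_term = np.mean([best_in_bunch(j, bunch)
                        for j, bunch in zip(G.jets, B.bunches)])
    distortion = np.mean([np.sum((np.asarray(di) - np.asarray(db)) ** 2)
                          / np.sum(np.asarray(db) ** 2)
                          for (_, _, di), (_, _, db) in zip(G.edges, B.avg_edges)])
    return jet_term - lam * distortion
```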

The goal of elastic bunch graph matching on a probe image is to find the fiducial points and thus to extract from the image a graph which maximizes the similarity with the FBG as defined in Equation (5). In practice, one has to apply a heuristic algorithm to come close to the optimum within reasonable time. We use a coarse-to-fine approach in which we introduce the degrees of freedom of the FBG progressively: translation, scale, aspect ratio, and finally local distortions. Phase information is used only in the latter steps. The resulting graph is called the image graph and is stored as a representation for the individual face of the image.
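The first, purely translational stage of such a heuristic could be sketched as follows: the undistorted average grid of the FBG is scanned rigidly over the image, so the distortion term of Eq. (5) vanishes and only the jet term matters. Step size, window margin, and the node layout as offsets from a grid center are assumptions; the later scale, aspect-ratio, and local-distortion moves are omitted, and a real implementation would cache the wavelet transform of the whole image rather than recompute jets per offset.

```python
def scan_translation(image, B, rel_positions, k_vectors, step=4, half=32):
    """Slide the rigid FBG grid over the image, keeping the center position
    that maximizes Eq. (5). rel_positions: N x 2 node offsets from the center."""
    h, w = image.shape
    rel = np.asarray(rel_positions, dtype=int)
    best, best_s = None, -np.inf
    for cy in range(half, h - half, step):
        for cx in range(half, w - half, step):
            nodes = rel + (cx, cy)
            if (nodes < half).any() or (nodes[:, 0] >= w - half).any() \
                    or (nodes[:, 1] >= h - half).any():
                continue  # keep every Gabor window inside the image
            G = FaceGraph(jets=[jet(image, x, y) for x, y in nodes],
                          edges=B.avg_edges)  # rigid grid: distortion term is zero
            s = graph_similarity(G, B, k_vectors)
            if s > best_s:
                best, best_s = (cx, cy), s
    return best
```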

Figure 2: Object-adapted grids for different poses. The nodes are positioned automatically by elastic graph matching against the corresponding face bunch graphs. The two top images show originals with widely differing size, with grids as used for the normalization stage, with many nodes on the outline. The other images are already rescaled to normal size. Their grids have more nodes on the face, which is more appropriate for recognition. (The grids used in the experiments had about 14 additional nodes, which are not shown here for simplicity.)

To minimize computing effort and to optimize reliability we extract a face representation in two stages. The first stage, called the normalization stage, has the purpose of estimating the position and size of the face in the original image, so that the image can be scaled and cut to standard size. The second stage takes this image as input and extracts a precise image graph appropriate for face recognition purposes. The two stages differ in emphasis. The first one has to deal with greater uncertainty as to size and position of the head and has to optimize the reliability with which it finds the face, but there is no need to find fiducial points with any precision or to extract data important for face recognition. The second stage can start with little uncertainty as to position and size of the head but has to extract a detailed face graph with high precision. The two stages took about 20 plus 10 seconds on a SPARCstation 10-512 with a 50 MHz processor.

2.4 Recognition

After having extracted model graphs from the gallery images and image graphs from the probe images, recognition is possible with relatively little computational effort by comparing an image graph with all model graphs and picking the one with the highest similarity value. The similarity function is based on the average similarity between corresponding jets. For image and model graphs referring to different poses, we compare jets according to the manually provided correspondences. Here we use the similarity function without phase (Eq. 3). A comparison against a gallery of 250 individuals took slightly less than a second.
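For same-pose comparisons this amounts to ranking the gallery by the mean of Eq. (3) over corresponding nodes. A minimal sketch, assuming the gallery is a dict from identity to FaceGraph:

```python
def recognize(probe, gallery):
    """Return the gallery identity whose model graph is most similar to the
    probe's image graph, using the phase-insensitive similarity of Eq. (3)."""
    def graph_score(model):
        return np.mean([similarity_a(jp, jm)
                        for jp, jm in zip(probe.jets, model.jets)])
    return max(gallery, key=lambda name: graph_score(gallery[name]))
```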

3 Experiments

For the experiments we used image galleries taken from the ARPA/ARL FERET database provided by the US Army Research Laboratory. The poses are: frontal view, half-profile right or left (rotated by about 40°-70°), and profile right or left. For most faces there are two frontal views with different facial expression. Apart from a few exceptions there are no disguises, variations in hair-style or in clothing. The background is always a homogeneous light or grey, except for smoothly varying shadows. The size of the faces varies by about a factor of three (but is constant for each individual, information which we could have used to improve recognition rates, but didn't). The format of the original images is 256×384 pixels.

Model gallery      Probe images        First rank       First 10 ranks
                                       #      %         #      %
250 fa             250 fb              245    98        248    99
250 hr             181 hl              103    57        147    81
250 pr             250 pl              210    84        236    94
249 fa + 1 fb      171 hl + 79 hr      44     18        111    44
171 hl + 79 hr     249 fa + 1 fb       42     17        95     38
170 hl + 80 hr     217 pl + 33 pr      22     9         67     27
217 pl + 33 pr     170 hl + 80 hr      31     12        80     32

Table 1: Recognition results for cross-runs between different galleries (f: frontal views; a, b: expression a and b; h: half profiles; p: profiles; l, r: left and right). Each gallery contained only one image per person; the different compositions in the four bottom lines are due to the fact that not all poses were available for all people. The table shows how often the correct model was identified as rank one and how often it was among the first 10 (4%).

We used various model and probe galleries with faces of different pose. Each model gallery contained 250 faces with just one image per person. Recognition results are shown in Table 1. The recognition rate is very high for frontal against frontal images (first row). This is mainly due to the fact that in this database two frontal views show only little variation, and any face recognition system should perform well under these circumstances (cf. the results on the Bochum database below for a more challenging example). Before comparing left against right poses we flipped all right-pose images over. The recognition rate is high for right profile against left profile (84%, third row). The sharply reduced recognition rate of 57% (second row) when comparing left and right half-profiles is probably due to the poor control of rotation angle in the database; inspection of the images shows that right and left rotation angles differ by up to 30°. When comparing half profiles with either frontal views or profiles, another reduction in recognition rate is observed (although even a correct recognition rate of 10% out of a gallery of 250 is still high above chance level, which would be 0.4%!).

We also did experiments on the Bochum database, which is described in greater detail in [3]. In this database the differences in facial expression are stronger than in the FERET database; some faces were even half covered by hair or a hand. First-rank recognition rates against a model gallery of 108 frontal views were 91% on frontal views of different facial expression, 94% on faces rotated in depth by 11°, and 88% on faces rotated by 22°. The results are similar to those in [3] and demonstrate robustness against variations in facial expression and rotation in depth up to 22°.

4 Discussion

The system presented is general and flexible. It is designed for an in-class recognition task, i.e. for recognizing members of a known class of objects. We have applied it to face recognition, but the system is in no way specialized to faces and we assume that it can be directly applied to other in-class recognition tasks, such as recognizing individuals of a given animal species, given the same level of standardization of the images. In contrast to many neural network systems, no extensive training for new faces or new object classes is required. Only a moderate number of typical examples have to be inspected to build up a bunch graph, and individuals can then be recognized after storing a single image.

We tested the system with respect to rotation in depth and differences in facial expression. We did not investigate robustness to other variations, such as illumination changes or structured background. The performance is high on faces of the same pose. We also showed robustness against rotation in depth up to about 22°. For large rotation angles the performance degrades significantly. Our system performs well compared to other systems; results of a blind test of different systems on the FERET database were published in [5, 6].

In comparison to the system [3], on the basis of which we have developed the system presented here, we have made several major modifications. We now utilize wavelet phase information for accurate node localization; previously, node localization was rather imprecise. We have introduced the potential to specialize the system to specific object types and to handle different poses with the help of object-adapted grids. The face bunch graph is able to represent a wide variety of faces, which allows matching on face images of unseen individuals. These improvements make it possible to extract an image graph from a new face image in one matching process. Even if the person in the new image is not included in the FBG, the image graph reliably refers to the fiducial points. This accelerates recognition from large databases considerably, since for each probe image, correct node positions need to be searched only once instead of in each attempted match to a gallery image, as was previously necessary. We did not expect, and the system does not show, an improvement in terms of recognition rates compared to the preceding system.


Besides the ability to handle larger galleries and larger rotation in depth, the main achievement of the system presented here is that the increased matching accuracy, the object-adapted graphs, and the face bunch graph provide the basis for further developments [2, 4, 7].

Acknowledgements: This work has benefited from the help of many colleagues. We particularly wish to thank Irving Biederman, Michael Lyons, Thomas Maurer, Jonathon Phillips, Ladan Shams, Marni Stewart Bartlett, and Jan Vorbrüggen. For the experiments we have used the FERET database of facial images collected under the ARPA/ARL FERET program and the Bochum database.

References

[1] Buhmann, J., Lange, J., von der Malsburg, C., Vorbrüggen, J. C., and Würtz, R. P. Object recognition with Gabor functions in the dynamic link architecture: Parallel implementation on a transputer network. In Neural Networks for Signal Processing, B. Kosko, Ed. Prentice Hall, Englewood Cliffs, NJ 07632, 1992, pp. 121-159.

[2] Krüger, N. An algorithm for the learning of weights in discrimination functions using a priori constraints. Accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997.

[3] Lades, M., Vorbrüggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P., and Konen, W. Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers 42, 3 (1993), 300-311.

[4] Maurer, T., and von der Malsburg, C. Linear feature transformations to recognize faces rotated in depth. In Proceedings of the International Conference on Artificial Neural Networks, ICANN'95 (Paris, Oct. 1995), pp. 353-358.

[5] Phillips, P. J., Rauss, P. J., and Der, S. Z. FERET (face recognition technology) recognition algorithm development and test report. Tech. Rep. ARL-TR-995, U.S. Army Research Laboratory, 2800 Powder Mill Road, Adelphi, MD 20783-1197, Oct. 1996.

[6] Rauss, P. J., Phillips, J., Hamilton, M. K., and DePersia, A. T. FERET (face-recognition technology) recognition algorithms. In Proc. of the Fifth Automatic Target Recognizer System and Technology Symposium (1996).

[7] Wiskott, L. Phantom faces for face analysis. Pattern Recognition 30, 6 (1996).

[8] Wiskott, L., Fellous, J.-M., Krüger, N., and von der Malsburg, C. Face recognition by elastic bunch graph matching. Tech. Rep. IR-INI 96-08, Institut für Neuroinformatik, Ruhr-Universität Bochum, D-44780 Bochum, Germany, 1996.
