
An Algorithm for the Learning of Weights in Discrimination Functions using a priori Constraints

Norbert Krüger

Abstract

We introduce a learning algorithm for the weights in a very common class of discrimination functions usually called "weighted average". The learning algorithm can reduce the number of free variables by simple but effective a priori criteria about significant features. Here we apply our algorithm to three tasks of different dimensionality, all concerned with face recognition.

1 Introduction

Many pattern recognition systems can be roughly divided into two parts, feature extraction and pattern discrimination. In feature extraction an input I is transformed into a vector $I_k \in \mathbb{R}^N$. (In speech recognition $I_k$ can, for example, represent the Fourier transformation in a certain time interval in a specific frequency band [14]; in image processing $I_k$ could be the filter response of a wavelet-like filter at a certain position in the grey-level picture [11, 15].) In discrimination the input I has to be assigned to a specific class c. The extracted features are used to evaluate certain similarities to the different classes. Let $\mathrm{Sim}_k(c, I)$ be a measure for the similarity of the input I to class c regarding only the k-th feature, respectively submodule. $\mathrm{Sim}_k(c, I)$ is assumed to be high if, according to the k-th submodule, I is a member of class c, and low if not. $\mathrm{Sim}_k(c, I)$ may, for example, be the distance between a representative of class c and the input I in a specific feature, as in vector quantization [6]. Very often the final discrimination function is similar to

$$\mathrm{Sim}_{tot}(c, I) = \sum_{k=1}^{n} \frac{\mathrm{Sim}_k(c, I)}{n}, \qquad (1)$$

and it is said that I belongs to class c if $\mathrm{Sim}_{tot}(c, I)$ is maximal for c. The disadvantage of (1) is that it does not take into account whether the feature $I_k$ is important for the decision whether I belongs to class c or not. A better choice would be

$$\mathrm{Sim}_{tot}(c, I) = \sum_{k=1}^{n} \lambda_k^c\, \mathrm{Sim}_k(c, I), \qquad (2)$$

with the restrictions $\lambda_k^c \geq 0$ and $\sum_k \lambda_k^c = 1$ for every c. For the features k that are more significant for the recognition of a representative of class c, we expect $\lambda_k^c$ to be high, otherwise low. In many pattern recognition tasks (e.g., [11, 6]) a discrimination function of type (1) is used. Here we introduce an algorithm to extend (1) to (2), and we give a learning rule for the free parameters $\lambda_k^c$. We therefore give an algorithm to improve a standard choice such as (1) by introducing and learning new parameters.
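As a concrete illustration of (1) and (2), the following sketch (Python; the array names and values are hypothetical) contrasts the plain average with the weighted sum; the restrictions on the weights are enforced by clipping and normalization.

```python
import numpy as np

def sim_tot_unweighted(sims):
    """Eq. (1): plain average of the n per-feature similarities Sim_k(c, I)."""
    return float(np.mean(sims))

def sim_tot_weighted(sims, weights):
    """Eq. (2): weighted sum with lambda_k >= 0 and sum_k lambda_k = 1."""
    w = np.maximum(weights, 0.0)        # enforce non-negativity
    w = w / w.sum()                     # enforce the sum-to-one restriction
    return float(np.dot(w, sims))

# Hypothetical per-feature similarities of an input I to one class c:
sims = np.array([0.9, 0.2, 0.8, 0.4])
weights = np.array([2.0, 0.1, 1.5, 0.2])   # significant features get high weight
print(sim_tot_unweighted(sims), sim_tot_weighted(sims, weights))
```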

We apply this algorithm to different tasks related to face recognition. The first two tasks are both concerned with face discrimination, but the dimension of the problems is very different. In the first task we have to determine approximately 40 parameters $\lambda_k^c$; in the second task we determine 1800 parameters. For the first two tasks $\lambda_k^c$ does not depend on the class c (i.e., $\lambda_k^c = \lambda_k^{c'}$ for all c, c'); in our notation we then only have $\lambda_k$. The third task is concerned with pose estimation and the weights are chosen class dependently ($\lambda_k^c \neq \lambda_k^{c'}$ for $c \neq c'$), where c, c' represent the different poses. We describe the learning scheme by applying the algorithm to the first task and give a brief description of its extension to the other two tasks. We would like to stress that our algorithm is not restricted to face recognition, but is able to deal with any problem which fits into the formalism defined above.

2 Problems in choosing suitable discrimination functions

The approach of statistical decision theory is described in detail in [1] and [4]. Bayes' formula

$$P(c \mid I) = \frac{P(I \mid c)\, P(c)}{P(I)} \qquad (3)$$

gives the optimal discrimination function we can achieve. Unfortunately the problem arises of estimating the parameters P(c) and the conditional densities P(I|c), which are N-dimensional functions for every c (P(I) can be calculated from P(I|c) and P(c)). Therefore a priori assumptions about the P(I|c) usually have to be made, and one has great difficulties if these assumptions are not fulfilled (see for example [5, 12]). Another discrimination function used in applications is the Mahalanobis metric (see e.g. [7]), which is a linear function. But also in this case there are many problems in determining the free parameters of the metric.

It is well known that the number of free parameters is an important factor in learning algorithms because it is correlated with the number of training examples and the time needed to train the system in order to get acceptable generalization abilities. Here it is our aim to reduce the number of free parameters by using a priori settings. For the second task we are able to reduce a problem with 1800 free parameters to fewer than 10 parameters. Only this reduction allows us to find a suitable solution.

We assume that a general problem in pattern recognition is the right mixture of a priori knowledge and learning. There are two extremes: One can invent an expert system; its a priori assumptions are problem dependent and the system may be suitable to solve well defined tasks, e.g., the recognition of traffic signs. Such a system will not be able to handle, for example, the recognition of faces. The other extreme is a very general statement such as Bayes' formula, but then learning may take forever. We think that the right mixture of a priori knowledge and learning has to be found. That means that the a priori assumptions have to be very general, effective, and applicable in many situations (see also [9, 10]).

One of the advantages of our algorithm is the mixture of a priori knowledge and learning. The a priori settings are very simple and evident, and they are controlled by the learning, which also gives the required flexibility to find a solution adapted to the problem.

3 The face recognition system

As basic visual feature we use a local image descriptor in the form of a vector $I_{s,o}(x, y)$ called "jet" [2, 11]. Each component (s, o) of a jet is the response of a Gabor wavelet of specific frequency s and orientation o, extracted from the image at a definite point (x, y). We are employing Gabor wavelets of 5 different frequencies and 8 different orientations for a total of 40 complex components. Such a jet describes the area surrounding (x, y). We represent objects as labeled graphs (see figure 1). That means different landmarks (like the tip of the nose, the left eye, etc.) form a graph labeled with jets at its nodes (respectively landmarks). To compare jets and graphs, similarity functions are defined: The normalized dot product of two jets $I_k, J_k$,

$$\mathrm{Sim}_{jet}(I_k, J_k) = \sum_{s,o} \frac{I_{(k,s,o)}\, J_{(k,s,o)}}{\|I_k\|\, \|J_k\|}, \qquad (4)$$

yields their similarity. The sum over jet similarities

$$\mathrm{Sim}_{tot}(I, J) = \sum_{k=1}^{n_{nod}} \frac{\mathrm{Sim}_{jet}(I_k, J_k)}{n_{nod}} \qquad (5)$$

of the nodes of the graph gives the total similarity between two faces I and J; $n_{nod}$ is the number of nodes in the graph. If the faces have different poses, jets describing the same landmark are compared against each other (see figure 1 (top)).
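A minimal sketch of (4) and (5), assuming each graph is a list of jets stored as real-valued NumPy arrays (the paper's jets have 40 complex components; comparing magnitudes, as here, is a simplifying assumption):

```python
import numpy as np

def sim_jet(jet_i, jet_j):
    """Eq. (4): normalized dot product of two jets (cosine similarity)."""
    return float(np.dot(jet_i, jet_j) /
                 (np.linalg.norm(jet_i) * np.linalg.norm(jet_j)))

def sim_tot(graph_i, graph_j):
    """Eq. (5): average jet similarity over the n_nod corresponding nodes."""
    assert len(graph_i) == len(graph_j)
    return sum(sim_jet(a, b) for a, b in zip(graph_i, graph_j)) / len(graph_i)

# Two hypothetical graphs, each with 3 nodes carrying 40-dimensional jets:
rng = np.random.default_rng(0)
g1 = [rng.random(40) for _ in range(3)]
g2 = [rng.random(40) for _ in range(3)]
print(sim_tot(g1, g2))
```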

For pose estimation we position the graphs shown at the bottom of figure 1 automatically on a given picture, using the algorithm described in [15]. A similarity between a node of a graph and a general representation of poses (called gfk, general face knowledge) is defined. We calculate the weighted average over these similarities for each graph representing a certain pose. The pose corresponding to the graph yielding the highest total similarity is chosen as the correct one. The pose estimation algorithm is described more precisely in [8, 9].

4 Description of the algorithm for finding weights for nodes

In this section we describe the basic idea of our algorithm using the example of learning weights for the nodes of the graphs for face discrimination. We modify (5) into

$$\mathrm{Sim}_{tot}(I, J) = \sum_{k=1}^{n_{nod}} \lambda_k\, \mathrm{Sim}_{jet}(I_k, J_k). \qquad (6)$$



We introduce an evaluation function

$$Q(T, \lambda_1, \ldots, \lambda_n), \qquad (7)$$

which measures the quality of recognition on a certain training set T depending on the weights $\lambda_1, \ldots, \lambda_n$. Q might be the number of correctly recognized faces or a more complicated function. It turned out that our algorithm is not very sensitive to the different choices of Q we have tried. We will therefore not describe our choices of Q in detail.
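For concreteness, one simple choice of Q, sketched under the assumption that a probe counts as recognized when its weighted similarity (6) to its own gallery graph is maximal over the whole gallery; the sim_table layout is a hypothetical convention, reused in the sketches below:

```python
import numpy as np

def q_recognition_rate(sim_table, weights):
    """Q(T, lambda_1, ..., lambda_n): fraction of probes I^i whose weighted
    similarity (6) to their own gallery graph G^i beats every other G^j.

    sim_table[i, j, k] holds Sim_jet(I^i_k, G^j_k) for M persons, n nodes."""
    total = np.tensordot(sim_table, weights, axes=([2], [0]))  # eq. (6) per pair (i, j)
    hits = np.argmax(total, axis=1) == np.arange(total.shape[0])
    return float(np.mean(hits))
```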

We assume that in T there are only pairs of pictures of the same person. Our system "knows" a person by extracting a labeled graph from a picture of this person. We call $G^i$ the graph of the i-th person stored in the system; the set of all graphs stored in the system we call a "gallery". $I^i$ is the graph extracted from the second picture of the i-th person, which is input to the system and has to be recognized.

We applied an optimization algorithm for finding local minima of a multidimensional function to the n-dimensional function Q in order to optimize the weights $\lambda_k$. Because of the large number of parameters we had difficulties with generalization, i.e., we achieved a large improvement on the training data but a decrease of performance on the test data.

In this paper we use the simplex algorithm [13] as optimization algorithm. The simplex algorithm finds minima of a multi-dimensional function $f : \mathbb{R}^n \to \mathbb{R}$ without using any information about the derivative of the function f. Furthermore, the simplex algorithm is more robust against local minima than many other optimization algorithms.

Let $T_k$ be the part of the training set which includes only the k-th feature from all $I^i$ and $G^j$. If, for example, the feature k is the jet representing the nose, $T_k$ is the set of all nose jets in the training set. In the following we derive a function $J(T_k)$ which calculates the $\lambda_k$ directly from $T_k$:

$$\lambda_k = J(T_k).$$

We call J the "parametrization function". We assume that the information needed to evaluate the discriminative power of the k-th node is contained in the set of jets of the k-th feature, i.e., in $T_k$. The function J should make use of a priori knowledge about significance. It is simply stated that for a feature f to be considered significant, the following two conditions hold:

C1 For two representatives of the same class the similarities for the feature f are in general high.

C2 For two representatives of different classes the similarities for the feature f are in general low.

In our notation C1 and C2 can be summarized as follows: A feature k has large discriminative power if

C1 $\mathrm{Sim}_{jet}(I^i_k, G^i_k)$ is large for many i, and

C2 $\mathrm{Sim}_{jet}(I^i_k, G^j_k)$ is small for many $j \neq i$.

We can even combine C1 and C2 into the following statement: A node k has large discriminative power if

C3 $\mathrm{Sim}_{jet}(I^i_k, G^i_k) - \mathrm{Sim}_{jet}(I^i_k, G^j_k)$ is large for many i and $j \neq i$.



Therefore we look at the values

$$\Delta(i, j)_k := \mathrm{Sim}_{jet}(I^i_k, G^i_k) - \mathrm{Sim}_{jet}(I^i_k, G^j_k). \qquad (8)$$

We define

$$H^i_k = \{\Delta(i, j)_k \mid j : 1, \ldots, M\}.$$

Figure 2 shows examples of such sets. For every node there exist M sets $H^i_k$, $i : 1, \ldots, M$.

These sets are connected to significant features in the following way. Figure 2 shows three sets $H^i_k$. The large bar represents the similarity $\mathrm{Sim}_{jet}(I^i_k, G^i_k)$ and the dots represent a scatter plot of the $\mathrm{Sim}_{jet}(I^i_k, G^j_k)$ with $j \neq i$. Figure 2a) corresponds to a node which is very suitable for discrimination of the i-th class from the other classes, because $\mathrm{Sim}_{jet}(I^i_k, G^i_k)$ is much higher than the $\mathrm{Sim}_{jet}(I^i_k, G^j_k)$, $j \neq i$. Fig. 2b) corresponds to a node which is not as suitable for discrimination as 2a) but is more suitable than the node corresponding to 2c).

If we now have, for a certain feature, a large number of $H^i_k$ of type 2a) and 2b), the corresponding feature is significant and the function $J(T_k)$ should give a high value. But if we have many $H^i_k$ of type 2b) and 2c), the corresponding feature is not very significant and the function $J(T_k)$ should give a low value.
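A sketch of the quantities in (8), reusing the hypothetical sim_table layout from the Q example above (sim_table[i, j, k] = Sim_jet(I^i_k, G^j_k)); the set $H^i_k$ is then one slice of differences:

```python
import numpy as np

def delta_sets(sim_table):
    """Eq. (8): Delta(i, j)_k = Sim_jet(I^i_k, G^i_k) - Sim_jet(I^i_k, G^j_k).

    Returns an (M, M, n) array; the set H^i_k is the slice delta[i, :, k]
    (its j = i entry is zero by construction and can be ignored)."""
    M = sim_table.shape[0]
    own = sim_table[np.arange(M), np.arange(M), :]  # Sim_jet(I^i_k, G^i_k), shape (M, n)
    return own[:, None, :] - sim_table              # broadcast the difference over j
```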

We assume $J(T_k)$ to be of the following form:

$$J(T_k) = j_3\left[ \sum_i \frac{1}{n}\; j_2\left[ \sum_{j \neq i} \frac{1}{n}\; j_1\left[ \Delta(i, j)_k \right] \right] \right], \qquad (9)$$

where

$$j_i : \mathbb{R} \to \mathbb{R}, \quad i : 1, \ldots, 3,$$

are real monotonically increasing functions. Examples for functions J are given in section 5.

This choice of J can be motivated as follows:

- Let us call eq. (8) a "simple comparison". It represents the stability in time of a feature for a certain class and the difference of this feature to another class.

- The simple comparisons are evaluated by a function $j_1$. $j_1$ must be monotonically increasing because a larger value of eq. (8) indicates a more significant feature than a low value of eq. (8). (A similar argument also holds for $j_2$ and $j_3$.)

- The simple comparisons for one class compared with all others are averaged and evaluated by $j_2$, i.e., one set $H^i_k$ is evaluated. The histogram $H^i_k$ can be regarded as a "complex" comparison in which a certain feature of one person is compared with the same feature for many other persons.

- Each judgement of a histogram $H^i_k$ is averaged again and finally judged by a function $j_3$. Therefore the "complex" comparison expressed in $H^i_k$ is evaluated many times and this information is used to determine the final $\lambda_k$ (see the sketch below).
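A minimal sketch of (9) for a single node k; the monotone functions are passed in as arguments, and the defaults here are illustrative placeholders, not the paper's choices (those follow in section 5). The 1/n normalizations of (9) are implemented as plain means:

```python
import numpy as np

def parametrization_J(delta_k, j1=np.tanh, j2=lambda x: x, j3=lambda x: x):
    """Eq. (9): lambda_k = J(T_k), computed from delta_k[i, j] = Delta(i, j)_k.

    j1 judges the single "simple comparisons", j2 judges the average over
    one set H^i_k, and j3 judges the average over all sets; all three are
    assumed to be monotonically increasing."""
    M = delta_k.shape[0]
    off_diag = ~np.eye(M, dtype=bool)     # restrict the inner average to j != i
    inner = j1(delta_k)
    per_class = j2(np.array([inner[i, off_diag[i]].mean() for i in range(M)]))
    return float(j3(per_class.mean()))
```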

In the $j_i$ we allow some free parameters $\mu_1, \ldots, \mu_N$. The problem we have is the following: we know some properties of $j_1$, $j_2$ and $j_3$, e.g. that they have to be monotonically increasing, but their exact shape is unknown. This uncertainty is expressed by the parameters $\mu_1, \ldots, \mu_N$. We substitute

$$\lambda_k = J(T_k; \mu_1, \ldots, \mu_N)$$


and we get

$$Q(T; \mu_1, \ldots, \mu_N) = Q\big(J(T_1; \mu_1, \ldots, \mu_N), \ldots, J(T_n; \mu_1, \ldots, \mu_N)\big).$$

We apply the simplex algorithm [13] to this function Q, which now depends on N parameters. It is assumed that N is much smaller than n; therefore, we have a reduction of dimensionality from n to N. J represents a law which calculates the significance of a feature from the subset $T_k$ of the training set T. The function J is the same for each node k, but because it depends on $T_k$ it gives them different values. The whole method can be summarized as an "estimation of an incompletely known dependence". That is to say, some conditions like C1 and C2 are known, but there are still some unknown parameters. The crucial point is now how much a priori knowledge can be fixed. If we do not give enough free variables we may miss the goal. If we give too many free variables we increase the search space and we will get problems in generalization. In section 5 we will use different functions for $j_1$, $j_2$ and $j_3$.
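Putting the pieces together, a sketch of the reduced search: the derivative-free simplex (Nelder-Mead) method, here via SciPy, runs over the few parameters $\mu$, and every candidate $\mu$ is expanded through J into all n node weights before Q is evaluated. The helper names are the hypothetical ones introduced above:

```python
import numpy as np
from scipy.optimize import minimize

def make_objective(sim_table, j_family):
    """Build -Q(T; mu_1..N) for the simplex search; j_family(mu) must return
    the three monotone functions (j1, j2, j3) for a parameter vector mu."""
    delta = delta_sets(sim_table)     # Delta(i, j)_k for all nodes, shape (M, M, n)
    n = sim_table.shape[2]

    def neg_q(mu):
        j1, j2, j3 = j_family(mu)
        lam = np.array([parametrization_J(delta[:, :, k], j1, j2, j3)
                        for k in range(n)])          # lambda_k = J(T_k; mu)
        return -q_recognition_rate(sim_table, lam)   # simplex minimizes, so negate Q
    return neg_q

# The simplex algorithm of [13] is available as method="Nelder-Mead":
# result = minimize(make_objective(sim_table, j_family), x0, method="Nelder-Mead")
```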

The extension to learn a weight matrix $\lambda_{(k,s,o)}$ for all jet components is straightforward and is described more precisely in [8]. The basic difference to the algorithm described above is the choice of the $\Delta$ expressions. For the learning of the $\lambda_{(k,s,o)}$ we use as $\Delta$ the differences of the similarities of two faces of the same person and of a different person in each jet component. For pose estimation we learn class dependent weights $\lambda_k^c$ (where c represents the classes frontal, half profile and profile) by setting $\Delta$ to the difference of the node similarities achieved on a picture of a face with correct pose and with incorrect pose (for details see [8]).

5 Results

Our complete data set contains more than 1500 pictures of approximately 350 persons. The poses frontal, half profile and profile exist for most of the persons in the data set. We found that the weights depend on the actual task. For example, the weights for the discrimination problem (see figure 3 (left)) differ from the weights for the pose estimation problem (see figure 3 (right)). But even for the discrimination problem the weight matrices depend very much on the poses compared against each other.

Let C[a, b] be the space of continuous functions from the interval [a, b] to $\mathbb{R}$. We have to find functions

$$j_i \in C[a_i, b_i], \quad i : 1, \ldots, 3,$$

which are suitable as parametrization functions. A very general approach is to approximate the $j_i$ with splines (see [3] for a more precise description of splines). We have done this for task 1 and task 2. A spline can be defined with different numbers of free variables. In our simulations we found that 3 free variables are optimal and that $j_2$ can be set to the identity map $j_2(x) = x$ without loss of performance but with better generalization properties. The borders of the spline are set depending on the mean and variance of the distributions of $j_1(x)$ and $j_3(x)$ measured on the training set. As free parameters in the parametrization function we therefore have six free variables, three for each of $j_1$ and $j_3$. The optimization problem is therefore reduced to six dimensions.
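One way to realize such a monotone spline with three free variables, sketched with SciPy's PCHIP interpolator (which preserves the monotonicity of its data points). The knot placement and the softplus construction for strictly increasing knot values are our illustrative choices, not details taken from the paper:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def monotone_spline(mu, a, b):
    """A monotonically increasing spline on [a, b] with 3 free variables mu.

    Four fixed knots span [a, b]; the knot values start at 0 and grow by
    softplus(mu_i) > 0, so the interpolant is increasing by construction."""
    x = np.linspace(a, b, 4)
    steps = np.log1p(np.exp(np.asarray(mu)))   # softplus: strictly positive steps
    y = np.concatenate([[0.0], np.cumsum(steps)])
    return PchipInterpolator(x, y)

j1 = monotone_spline([0.3, 1.2, 0.5], a=-1.0, b=1.0)  # Delta values lie in [-1, 1]
print(j1(0.0), j1(0.5))
```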

We also tried the following settings:

$$j_1(x) = \arctan(\mu_1 x), \qquad j_2(x) = x, \qquad j_3(x) = (\max(x, 0))^{\mu_2}. \qquad (10)$$


In this case the problem is reduced to two dimensions, expressed by the parameters $\mu_1, \mu_2$. At first glance the settings in (10) look somewhat arbitrary, but we already got good results with them. The function arctan ensures that outliers in the expressions (8) do not have too much influence, and the function $(\max(x, 0))^{\mu_2}$ is used as a scatter function for the calculation of the final weights from the "judgement" which is already done in the sums of the parametrization function (the maximum function is only needed to ensure that the power function $x^{\mu_2}$ is defined). But for other problems the more general spline approach may be better.
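The settings (10) in code, under the reconstruction used above that the arctan slope $\mu_1$ and the exponent $\mu_2$ are the two free parameters:

```python
import numpy as np

def j_family_eq10(mu):
    """The parametrization family of eq. (10): two free parameters mu = (mu1, mu2)."""
    mu1, mu2 = mu
    j1 = lambda x: np.arctan(mu1 * x)           # damps outliers in the Delta values
    j2 = lambda x: x                            # identity
    j3 = lambda x: np.maximum(x, 0.0) ** mu2    # max(., 0) keeps the power defined
    return j1, j2, j3

# Plugged into the pipeline above:
# result = minimize(make_objective(sim_table, j_family_eq10), x0=[1.0, 2.0],
#                   method="Nelder-Mead")
```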

Table 1 gives the results for the learning of the node weights with the parametrization function (10). Using splines as parametrization functions we get similar results. The weighting of all jet components achieves slightly better results. Table 2 gives the results for the learning of class dependent weights for the pose estimation task. The time needed for learning for task 1 with the settings (10) for one weight matrix is less than 5 minutes on a Sparc 10. If we approximate the $j_i$ with splines, the learning takes approximately 15 minutes. The learning of weights for all jet components takes approximately 12 hours, and the learning for the pose estimation problem takes less than 5 minutes. In [8] the results of our simulations are discussed in more detail.

6 Conclusion

We introduced a learning algorithm for weights in discrimination functions and we applied this algorithm to very different tasks in face recognition. Nevertheless, we expect our algorithm can also be applied to other discrimination tasks, because we only make use of simple properties: the difference of similarities of submodules within classes compared to the similarities of submodules between different classes. We expect this algorithm might be very useful in other pattern recognition systems which make use of a discrimination function of the type (1), e.g., vector quantization methods. The transformation of the input space induced by the weighting is simple: it is a stretching or compression in each dimension of the input space. The improvement which can be achieved will depend very much on the quality of the already extracted submodules or features. If there are already very significant features for many classes and a lot of other features which are insignificant, the algorithm will find the significant features and will yield a large improvement (as in the case of pose estimation). But if all of the features are not suitable to recognize elements of the different classes, the transformation we can learn with our algorithm will give less improvement. Then a more complex transformation has to be performed, i.e., the feature extraction itself has to be improved. In recent work [9] we utilized a priori constraints similar to our criteria C1 and C2 to derive more efficient features.

Acknowledgement:

We would like to thank Laurenz Wiskott, Jan Vorbrüggen, Thomas Maurer and Christoph von der Malsburg for very fruitful discussions. Portions of the research use the FERET database of facial images collected under the ARPA/ARL FERET program.

References

[1] J.O. Berger, Statistical Decision Theory: Foundations, Concepts and Methods (2nd ed.), New York, Springer, 1985.

[2] J.G. Daugman, "Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by 2D visual cortical filters", Journal of the Optical Society of America A, vol. 2(7), pp. 1160-1169, 1985.

[3] C. De Boor, A Practical Guide to Splines, New York, Springer Verlag, 1978.

[4] T.S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach, New York, Academic Press, 1967.

[5] K. Fukunaga, Introduction to Statistical Pattern Recognition (2nd ed.), Boston, Academic Press, 1990.

[6] R.M. Gray, "Vector Quantization", IEEE ASSP Magazine, 1(2):4-29, 1984.

[7] T. Kohonen, Self-Organization and Associative Memory (3rd ed.), Springer Series in Information Sciences 8, Heidelberg, 1989.

[8] N. Krüger, "An Algorithm for the Learning of Weights in Discrimination Functions", IRINI 08-95 (Technical Report).

[9] N. Krüger, G. Peters, C. v.d. Malsburg, "Object Recognition with Banana Wavelets", accepted for ESANN 97.

[10] N. Krüger, M. Pötzsch, T. Maurer, M. Rinne, "Estimation of Face Position and Pose with Labeled Graphs", Proceedings of the BMVC 1996.

[11] M. Lades, J.C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R.P. Würtz, W. Konen, "Distortion Invariant Object Recognition in the Dynamic Link Architecture", IEEE Transactions on Computers, 42(3):300-311, March 1993.

[12] M.S. Landy, L.T. Maloney, E.B. Johnston, M. Young, "Measurement and modeling of depth cue combination: in defense of weak fusion", Vision Research, 35(3):389-412, February 1995.

[13] J.A. Nelder, R. Mead, "A simplex method for function minimization", Computer Journal, vol. 7, pp. 308-313, 1965.

[14] L.R. Rabiner, R.W. Schafer, Digital Processing of Speech Signals, Bell Laboratories, 1978.

[15] L. Wiskott, J.-M. Fellous, N. Krüger, C. von der Malsburg, "Face Recognition and Gender Determination", Proceedings of the International Workshop on Automatic Face- and Gesture-Recognition, Zurich, June 1995.


Fig. 1. Top: Flexible graphs for frontal and half profile views. At every node jets are extracted. For the final decision which face in the gallery corresponds to a certain input face, the jets belonging to the same landmark in different poses are compared with each other. Bottom: The graphs used for the preprocessing for the different poses (frontal, half profile, profile). The size differences are quite typical for our data set.

Fig. 2: Examples of $H^i_k$. The large bar represents the similarity $\mathrm{Sim}_{jet}(I^i_k, G^i_k)$ and the dots represent the $\mathrm{Sim}_{jet}(I^i_k, G^j_k)$ with $j \neq i$. The similarity of two jets ranges between 0 and 1 (for further explanation see text).

Fig. 3. Left: A weight matrix for the comparison of frontal with half profile views. The corresponding graph is shown in figure 1 (top). The eyes are more important for discrimination of half profiles and frontals than the mouth and chin. The part of the face rotated towards the camera is more important than the opposite part of the face. Right: The learned weight matrices for the pose estimation task for frontal (left), half profile (middle) and profile (right) views. The weights correspond to the nodes of the graphs shown in figure 1 (bottom). The nodes corresponding to the top of the head are very insignificant for pose estimation because for all poses they represent similar features. The nodes were selected in [15] for face segmentation under the assumption that the pose is already known; for that task they are very useful. The tip of the nose is significant for the recognition of the frontal and half profile poses, and the lips are significant for the discrimination of frontals and half profiles. The eyes are not very significant for our pose estimation algorithm.

Tab. 1: The results for weight learning for the nodes, in percentages. As parametrization function we used (10). In the left column the pair of poses compared against each other is marked: f means frontal, h half profile and p profile. "ph" means profile is compared against half profile, i.e., profile is the pose of the faces we have stored in the gallery and half profile is the pose of the unknown pictures which are compared against the gallery. "hh" means that a half profile is compared against a half profile of the opposite view of the face. The columns labeled "norm." give the results without any weighting and the columns labeled "weight." give the results with the learned weights. The columns labeled "co" give the percentage of correctly recognized faces and the columns labeled "ra" give the number of pictures for which the corresponding entry in the gallery was within the rank of the first ten percent best matches. The size of training and test sets was between 130 and 150 pairs of entries. We had no problems with local minima; the simplex algorithm always found the same minima regardless of the initial conditions. The improvements are in the range between 5% and 15%. A weight matrix for the comparison of half profile views and frontals is shown in figure 3 (left).

Tab. 2: The results for the algorithm applied to pose estimation. Here we used the parametrization function in (10). The corresponding weight matrices are shown in figure 3 (right). We remark that in this case we used a kind of graph which was created for a different task: the segmentation of a face under the assumption that the pose is already known, as described in [15]. Just by introducing weights in the discrimination function the errors could be reduced by half. No fine tuning, like adding new nodes or selecting a special kind of gfk adapted to pose estimation, was done to increase the performance. This kind of work is described more precisely in [10].

[Figure 1 appears here; image not reproduced.]

[Figure 2 appears here: three panels a), b), c); vertical axis Sim ranging from 0 to 1; image not reproduced.]

[Figure 3 appears here: weight matrices for frontal, half profile and profile views; image not reproduced.]


                train                     test
         norm.      weight.        norm.      weight.
pose     co    ra   co    ra       co    ra   co    ra
fh       18%   61%  27%   63%      21%   50%  26%   64%
hf       19%   50%  23%   58%      13%   50%  21%   61%
hp       19%   49%  24%   53%      21%   43%  28%   51%
ph       23%   58%  26%   62%      24%   53%  30%   57%
hh       69%   85%  73%   88%      46%   88%  50%   90%

Tab. 1.


pose estimation                      norm.   weight.
training set (100 for each pose)     79%     89%
test set (100 for each pose)         80%     91%

Tab. 2.
