
An Algorithm for the Learning of Weights in Discrimination Functions using a priori Constraints

Norbert Krüger

Abstract

We introduce a learning algorithm for the weights in a very common class of discrimination functions usually called "weighted average". The learning algorithm can reduce the number of free variables by simple but effective a priori criteria about significant features. Here we apply our algorithm to three tasks of different dimensionality, all concerned with face recognition.

1 Introduction

Many pattern recognition systems can be roughly divided into two parts, feature extraction and pattern discrimination. In feature extraction an input I is transformed into a vector $I_k \in \mathbb{R}^N$. (In speech recognition $I_k$ can, for example, represent the Fourier transformation in a certain time interval in a specific frequency band [14]; in image processing $I_k$ could be the filter response of a wavelet-like filter at a certain position in the grey-level picture [11, 15].) In discrimination the input I has to be assigned to a specific class c. The extracted features are used to evaluate certain similarities to the different classes. Let $\mathrm{Sim}_k(c, I)$ be a measure for the similarity of the input I to class c regarding only the k-th feature, respectively submodule. $\mathrm{Sim}_k(c, I)$ is assumed to be high if, according to the k-th submodule, I is a member of class c, and low if not. $\mathrm{Sim}_k(c, I)$ may, for example, be the distance between a representative of class c and the input I in a specific feature, as in vector quantization [6]. Very often the final discrimination function is similar to

$$\mathrm{Sim}_{tot}(c, I) = \sum_{k=1}^{n} \frac{\mathrm{Sim}_k(c, I)}{n}, \qquad (1)$$

and it is said that I belongs to class c if $\mathrm{Sim}_{tot}(c, I)$ is maximal for c. The disadvantage of (1) is that it does not take into account whether the feature $I_k$ is important for the decision whether I belongs to class c or not. A better choice would be

$$\mathrm{Sim}_{tot}(c, I) = \sum_{k=1}^{n} \lambda_k^c\, \mathrm{Sim}_k(c, I), \qquad (2)$$

with the restrictions $\lambda_k^c \geq 0$ and $\sum_k \lambda_k^c = 1$ for every c. For the features k that are more significant for the recognition of a representative of class c, we expect $\lambda_k^c$ to be high, otherwise low. In many pattern recognition tasks (e.g., [11, 6]) a discrimination function of type (1) is used. Here we introduce an algorithm to extend (1) to (2), and we give a learning rule for the free parameters $\lambda_k^c$. We therefore give an algorithm to improve a standard choice such as (1) by introducing and learning new parameters.
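As a concrete illustration of (1) and (2), the following sketch (Python; the array names and values are hypothetical) contrasts the plain average with the weighted sum; the restrictions on the weights are enforced by clipping and normalization.

```python
import numpy as np

def sim_tot_unweighted(sims):
    """Eq. (1): plain average of the n per-feature similarities Sim_k(c, I)."""
    return float(np.mean(sims))

def sim_tot_weighted(sims, weights):
    """Eq. (2): weighted sum with lambda_k >= 0 and sum_k lambda_k = 1."""
    w = np.maximum(weights, 0.0)        # enforce non-negativity
    w = w / w.sum()                     # enforce the sum-to-one restriction
    return float(np.dot(w, sims))

# Hypothetical per-feature similarities of an input I to one class c:
sims = np.array([0.9, 0.2, 0.8, 0.4])
weights = np.array([2.0, 0.1, 1.5, 0.2])   # significant features get high weight
print(sim_tot_unweighted(sims), sim_tot_weighted(sims, weights))
```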

We apply this algorithm to different tasks related to face recognition. The first two tasks are both concerned with face discrimination, but the dimension of the problems is very different. In the first task we have to determine approximately 40 parameters $\lambda_k^c$; in the second task we determine 1800 parameters. For the first two tasks $\lambda_k^c$ does not depend on the class c (i.e., $\lambda_k^c = \lambda_k^{c'}$ for all c, c'); in our notation we then only have $\lambda_k$. The third task is concerned with pose estimation and the weights are chosen class dependently ($\lambda_k^c \neq \lambda_k^{c'}$ for $c \neq c'$), where c, c' represent the different poses. We describe the learning scheme by applying the algorithm to the first task and give a brief description of its extension to the other two tasks. We would like to stress that our algorithm is not restricted to face recognition, but is able to deal with any problem which fits into the formalism defined above.

2 Problems in choosing suitable discrimination functions

The approach of statistical decision theory is described in detail in [1] and [4]. Bayes' formula

$$P(c \mid I) = \frac{P(I \mid c)\, P(c)}{P(I)} \qquad (3)$$

gives the optimal discrimination function we can achieve. Unfortunately the problem arises of estimating the parameters P(c) and the conditional densities P(I|c), which are N-dimensional functions for every c (P(I) can be calculated from P(I|c) and P(c)). Therefore a priori assumptions about the P(I|c) usually have to be made, and one has great difficulties if these assumptions are not fulfilled (see for example [5, 12]). Another discrimination function used in applications is the Mahalanobis metric (see e.g. [7]), which is a linear function. But also in this case there are many problems in determining the free parameters of the metric.

It is well known that the number of free parameters is an important factor in learning algorithms because it is correlated with the number of training examples and the time needed to train the system in order to get acceptable generalization abilities. Here it is our aim to reduce the number of free parameters by using a priori settings. For the second task we are able to reduce a problem with 1800 free parameters to fewer than 10 parameters. Only this reduction allows us to find a suitable solution.

We assume that a general problem in pattern recognition is the right mixture of a priori knowledge and learning. There are two extremes: One can invent an expert system; its a priori assumptions are problem dependent and the system may be suitable to solve well defined tasks, e.g., the recognition of traffic signs. Such a system will not be able to handle, for example, the recognition of faces. The other extreme is a very general statement such as Bayes' formula, but then learning may take forever. We think that the right mixture of a priori knowledge and learning has to be found. That means that the a priori assumptions have to be very general, effective, and applicable in many situations (see also [9, 10]).

One of the advantages of our algorithm is the mixture of a priori knowledge and learning. The a priori settings are very simple and evident, and they are controlled by the learning, which also gives the required flexibility to find a solution adapted to the problem.

3 The face recognition system

As basic visual feature we use a local image descriptor in the form of a vector $I_{s,o}(x, y)$ called "jet" [2, 11]. Each component (s, o) of a jet is the response of a Gabor wavelet of specific frequency s and orientation o, extracted from the image at a definite point (x, y). We are employing Gabor wavelets of 5 different frequencies and 8 different orientations for a total of 40 complex components. Such a jet describes the area surrounding (x, y). We represent objects as labeled graphs (see figure 1). That means different landmarks (like the tip of the nose, the left eye, etc.) form a graph labeled with jets at its nodes (respectively landmarks). To compare jets and graphs, similarity functions are defined: The normalized dot product of two jets $I_k, J_k$,

$$\mathrm{Sim}_{jet}(I_k, J_k) = \sum_{s,o} \frac{I_{(k,s,o)}\, J_{(k,s,o)}}{\|I_k\|\, \|J_k\|}, \qquad (4)$$

yields their similarity. The sum over jet similarities

$$\mathrm{Sim}_{tot}(I, J) = \sum_{k=1}^{n_{nod}} \frac{\mathrm{Sim}_{jet}(I_k, J_k)}{n_{nod}} \qquad (5)$$

of the nodes of the graph gives the total similarity between two faces I and J; $n_{nod}$ is the number of nodes in the graph. If the faces have different poses, jets describing the same landmark are compared against each other (see figure 1 (top)).
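A minimal sketch of (4) and (5), assuming each graph is a list of jets stored as real-valued NumPy arrays (the paper's jets have 40 complex components; comparing magnitudes, as here, is a simplifying assumption):

```python
import numpy as np

def sim_jet(jet_i, jet_j):
    """Eq. (4): normalized dot product of two jets (cosine similarity)."""
    return float(np.dot(jet_i, jet_j) /
                 (np.linalg.norm(jet_i) * np.linalg.norm(jet_j)))

def sim_tot(graph_i, graph_j):
    """Eq. (5): average jet similarity over the n_nod corresponding nodes."""
    assert len(graph_i) == len(graph_j)
    return sum(sim_jet(a, b) for a, b in zip(graph_i, graph_j)) / len(graph_i)

# Two hypothetical graphs, each with 3 nodes carrying 40-dimensional jets:
rng = np.random.default_rng(0)
g1 = [rng.random(40) for _ in range(3)]
g2 = [rng.random(40) for _ in range(3)]
print(sim_tot(g1, g2))
```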

For pose estimation we position the graphs shown at the bottom of figure 1 automatically on a given picture, using the algorithm described in [15]. A similarity between a node of a graph and a general representation of poses (called gfk, general face knowledge) is defined. We calculate the weighted average over these similarities for each graph representing a certain pose. The pose corresponding to the graph yielding the highest total similarity is chosen as the correct one. The pose estimation algorithm is described more precisely in [8, 9].

4 Description of the algorithm for finding weights for nodes

In this section we describe the basic idea of our algorithm using the example of learning weights for the nodes of the graphs for face discrimination. We modify (5) into

$$\mathrm{Sim}_{tot}(I, J) = \sum_{k=1}^{n_{nod}} \lambda_k\, \mathrm{Sim}_{jet}(I_k, J_k). \qquad (6)$$



We introduce an evaluation function

$$Q(T, \lambda_1, \ldots, \lambda_n), \qquad (7)$$

which measures the quality of recognition on a certain training set T depending on the weights $\lambda_1, \ldots, \lambda_n$. Q might be the number of correctly recognized faces or a more complicated function. It turned out that our algorithm is not very sensitive to the different choices of Q we have tried. We will therefore not describe our choices of Q in detail.
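For concreteness, one simple choice of Q, sketched under the assumption that a probe counts as recognized when its weighted similarity (6) to its own gallery graph is maximal over the whole gallery; the sim_table layout is a hypothetical convention, reused in the sketches below:

```python
import numpy as np

def q_recognition_rate(sim_table, weights):
    """Q(T, lambda_1, ..., lambda_n): fraction of probes I^i whose weighted
    similarity (6) to their own gallery graph G^i beats every other G^j.

    sim_table[i, j, k] holds Sim_jet(I^i_k, G^j_k) for M persons, n nodes."""
    total = np.tensordot(sim_table, weights, axes=([2], [0]))  # eq. (6) per pair (i, j)
    hits = np.argmax(total, axis=1) == np.arange(total.shape[0])
    return float(np.mean(hits))
```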

We assume that in T there are only pairs of pictures of the same person. Our system "knows" a person by extracting a labeled graph from a picture of this person. We call $G^i$ the graph of the i-th person stored in the system; the set of all graphs stored in the system we call a "gallery". $I^i$ is the graph extracted from the second picture of the i-th person, which is input to the system and has to be recognized.

We applied an optimization algorithm for finding local minima of a multidimensional function to the n-dimensional function Q in order to optimize the weights $\lambda_k$. Because of the large number of parameters we had difficulties with generalization, i.e., we achieved a large improvement on the training data but a decrease of performance on the test data.

In this paper we use the simplex algorithm [13] as optimization algorithm. The simplex algorithm finds minima of a multi-dimensional function $f : \mathbb{R}^n \to \mathbb{R}$ without using any information about the derivative of the function f. Furthermore, the simplex algorithm is more robust against local minima than many other optimization algorithms.

Let $T_k$ be the part of the training set which includes only the k-th feature from all $I^i$ and $G^j$. If, for example, the feature k is the jet representing the nose, $T_k$ is the set of all nose jets in the training set. In the following we derive a function $J(T_k)$ which calculates the $\lambda_k$ directly from $T_k$:

$$\lambda_k = J(T_k).$$

We call J the "parametrization function". We assume that the information needed to evaluate the discriminative power of the k-th node is contained in the set of jets of the k-th feature, i.e., in $T_k$. The function J should make use of a priori knowledge about significance. It is simply stated that for a feature f to be considered significant, the following two conditions hold:

C1 For two representatives of the same class the similarities for the feature f are in general high.

C2 For two representatives of different classes the similarities for the feature f are in general low.

In our notation C1 and C2 can be summarized as follows: A feature k has large discriminative power if

C1 $\mathrm{Sim}_{jet}(I^i_k, G^i_k)$ is large for many i, and

C2 $\mathrm{Sim}_{jet}(I^i_k, G^j_k)$ is small for many $j \neq i$.

We can even combine C1 and C2 into the following statement: A node k has large discriminative power if

C3 $\mathrm{Sim}_{jet}(I^i_k, G^i_k) - \mathrm{Sim}_{jet}(I^i_k, G^j_k)$ is large for many i and $j \neq i$.



Therefore we look at the values

$$\Delta(i, j)_k := \mathrm{Sim}_{jet}(I^i_k, G^i_k) - \mathrm{Sim}_{jet}(I^i_k, G^j_k). \qquad (8)$$

We define

$$H^i_k = \{\Delta(i, j)_k \mid j : 1, \ldots, M\}.$$

Figure 2 shows examples of such sets. For every node there exist M sets $H^i_k$, $i : 1, \ldots, M$.

These sets are connected to significant features in the following way. Figure 2 shows three sets $H^i_k$. The large bar represents the similarity $\mathrm{Sim}_{jet}(I^i_k, G^i_k)$ and the dots represent a scatter plot of the $\mathrm{Sim}_{jet}(I^i_k, G^j_k)$ with $j \neq i$. Figure 2a) corresponds to a node which is very suitable for discrimination of the i-th class from the other classes, because $\mathrm{Sim}_{jet}(I^i_k, G^i_k)$ is much higher than the $\mathrm{Sim}_{jet}(I^i_k, G^j_k)$, $j \neq i$. Fig. 2b) corresponds to a node which is not as suitable for discrimination as 2a) but is more suitable than the node corresponding to 2c).

If we now have, for a certain feature, a large number of $H^i_k$ of type 2a) and 2b), the corresponding feature is significant and the function $J(T_k)$ should give a high value. But if we have many $H^i_k$ of type 2b) and 2c), the corresponding feature is not very significant and the function $J(T_k)$ should give a low value.
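A sketch of the quantities in (8), reusing the hypothetical sim_table layout from the Q example above (sim_table[i, j, k] = Sim_jet(I^i_k, G^j_k)); the set $H^i_k$ is then one slice of differences:

```python
import numpy as np

def delta_sets(sim_table):
    """Eq. (8): Delta(i, j)_k = Sim_jet(I^i_k, G^i_k) - Sim_jet(I^i_k, G^j_k).

    Returns an (M, M, n) array; the set H^i_k is the slice delta[i, :, k]
    (its j = i entry is zero by construction and can be ignored)."""
    M = sim_table.shape[0]
    own = sim_table[np.arange(M), np.arange(M), :]  # Sim_jet(I^i_k, G^i_k), shape (M, n)
    return own[:, None, :] - sim_table              # broadcast the difference over j
```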

We assume $J(T_k)$ to be of the following form:

$$J(T_k) = j_3\left[ \sum_i \frac{1}{n}\; j_2\left[ \sum_{j \neq i} \frac{1}{n}\; j_1\left[ \Delta(i, j)_k \right] \right] \right], \qquad (9)$$

where

$$j_i : \mathbb{R} \to \mathbb{R}, \quad i : 1, \ldots, 3,$$

are real monotonically increasing functions. Examples for functions J are given in section 5.

This choice of J can be motivated as follows:

- Let us call eq. (8) a "simple comparison". It represents the stability in time of a feature for a certain class and the difference of this feature to another class.

- The simple comparisons are evaluated by a function $j_1$. $j_1$ must be monotonically increasing because a larger value of eq. (8) indicates a more significant feature than a low value of eq. (8). (A similar argument also holds for $j_2$ and $j_3$.)

- The simple comparisons for one class compared with all others are averaged and evaluated by $j_2$, i.e., one set $H^i_k$ is evaluated. The histogram $H^i_k$ can be regarded as a "complex" comparison in which a certain feature of one person is compared with the same feature for many other persons.

- Each judgement of a histogram $H^i_k$ is averaged again and finally judged by a function $j_3$. Therefore the "complex" comparison expressed in $H^i_k$ is evaluated many times and this information is used to determine the final $\lambda_k$ (see the sketch below).
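A minimal sketch of (9) for a single node k; the monotone functions are passed in as arguments, and the defaults here are illustrative placeholders, not the paper's choices (those follow in section 5). The 1/n normalizations of (9) are implemented as plain means:

```python
import numpy as np

def parametrization_J(delta_k, j1=np.tanh, j2=lambda x: x, j3=lambda x: x):
    """Eq. (9): lambda_k = J(T_k), computed from delta_k[i, j] = Delta(i, j)_k.

    j1 judges the single "simple comparisons", j2 judges the average over
    one set H^i_k, and j3 judges the average over all sets; all three are
    assumed to be monotonically increasing."""
    M = delta_k.shape[0]
    off_diag = ~np.eye(M, dtype=bool)     # restrict the inner average to j != i
    inner = j1(delta_k)
    per_class = j2(np.array([inner[i, off_diag[i]].mean() for i in range(M)]))
    return float(j3(per_class.mean()))
```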

In the $j_i$ we allow some free parameters $\mu_1, \ldots, \mu_N$. The problem we have is the following: we know some properties of $j_1$, $j_2$ and $j_3$, e.g. that they have to be monotonically increasing, but their exact shape is unknown. This uncertainty is expressed by the parameters $\mu_1, \ldots, \mu_N$. We substitute

$$\lambda_k = J(T_k; \mu_1, \ldots, \mu_N)$$


and we get

$$Q(T; \mu_1, \ldots, \mu_N) = Q\big(J(T_1; \mu_1, \ldots, \mu_N), \ldots, J(T_n; \mu_1, \ldots, \mu_N)\big).$$

We apply the simplex algorithm [13] to this function Q, which now depends on N parameters. It is assumed that N is much smaller than n; therefore, we have a reduction of dimensionality from n to N. J represents a law which calculates the significance of a feature from the subset $T_k$ of the training set T. The function J is the same for each node k, but because it depends on $T_k$ it gives them different values. The whole method can be summarized as an "estimation of an incompletely known dependence". That is to say, some conditions like C1 and C2 are known, but there are still some unknown parameters. The crucial point is now how much a priori knowledge can be fixed. If we do not give enough free variables we may miss the goal. If we give too many free variables we increase the search space and we will get problems in generalization. In section 5 we will use different functions for $j_1$, $j_2$ and $j_3$.
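Putting the pieces together, a sketch of the reduced search: the derivative-free simplex (Nelder-Mead) method, here via SciPy, runs over the few parameters $\mu$, and every candidate $\mu$ is expanded through J into all n node weights before Q is evaluated. The helper names are the hypothetical ones introduced above:

```python
import numpy as np
from scipy.optimize import minimize

def make_objective(sim_table, j_family):
    """Build -Q(T; mu_1..N) for the simplex search; j_family(mu) must return
    the three monotone functions (j1, j2, j3) for a parameter vector mu."""
    delta = delta_sets(sim_table)     # Delta(i, j)_k for all nodes, shape (M, M, n)
    n = sim_table.shape[2]

    def neg_q(mu):
        j1, j2, j3 = j_family(mu)
        lam = np.array([parametrization_J(delta[:, :, k], j1, j2, j3)
                        for k in range(n)])          # lambda_k = J(T_k; mu)
        return -q_recognition_rate(sim_table, lam)   # simplex minimizes, so negate Q
    return neg_q

# The simplex algorithm of [13] is available as method="Nelder-Mead":
# result = minimize(make_objective(sim_table, j_family), x0, method="Nelder-Mead")
```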

The extension to learn a weight matrix $\lambda_{(k,s,o)}$ for all jet components is straightforward and is described more precisely in [8]. The basic difference to the algorithm described above is the choice of the $\Delta$ expressions. For the learning of the $\lambda_{(k,s,o)}$ we use as $\Delta$ the differences of the similarities of two faces of the same person and of a different person in each jet component. For pose estimation we learn class dependent weights $\lambda_k^c$ (where c represents the classes frontal, half profile and profile) by setting $\Delta$ to the difference of the node similarities achieved on a picture of a face with correct pose and with incorrect pose (for details see [8]).

5 Results

Our complete data set contains more than 1500 pictures of approximately 350 persons. The poses frontal, half profile and profile exist for most of the persons in the data set. We found that the weights depend on the actual task. For example, the weights for the discrimination problem (see figure 3 (left)) differ from the weights for the pose estimation problem (see figure 3 (right)). But even for the discrimination problem the weight matrices depend very much on the poses compared against each other.

Let C[a, b] be the space of continuous functions from the interval [a, b] to $\mathbb{R}$. We have to find functions

$$j_i \in C[a_i, b_i], \quad i : 1, \ldots, 3,$$

which are suitable as parametrization functions. A very general approach is to approximate the $j_i$ with splines (see [3] for a more precise description of splines). We have done this for task 1 and task 2. A spline can be defined with different numbers of free variables. In our simulations we found that 3 free variables are optimal and that $j_2$ can be set to the identity map $j_2(x) = x$ without loss of performance but with better generalization properties. The borders of the spline are set depending on the mean and variance of the distributions of $j_1(x)$ and $j_3(x)$ measured on the training set. As free parameters in the parametrization function we therefore have six free variables, three for each of $j_1$ and $j_3$. The optimization problem is therefore reduced to six dimensions.
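One way to realize such a monotone spline with three free variables, sketched with SciPy's PCHIP interpolator (which preserves the monotonicity of its data points). The knot placement and the softplus construction for strictly increasing knot values are our illustrative choices, not details taken from the paper:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def monotone_spline(mu, a, b):
    """A monotonically increasing spline on [a, b] with 3 free variables mu.

    Four fixed knots span [a, b]; the knot values start at 0 and grow by
    softplus(mu_i) > 0, so the interpolant is increasing by construction."""
    x = np.linspace(a, b, 4)
    steps = np.log1p(np.exp(np.asarray(mu)))   # softplus: strictly positive steps
    y = np.concatenate([[0.0], np.cumsum(steps)])
    return PchipInterpolator(x, y)

j1 = monotone_spline([0.3, 1.2, 0.5], a=-1.0, b=1.0)  # Delta values lie in [-1, 1]
print(j1(0.0), j1(0.5))
```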

We also tried the following settings:

$$j_1(x) = \arctan(\mu_1 x), \qquad j_2(x) = x, \qquad j_3(x) = (\max(x, 0))^{\mu_2}. \qquad (10)$$


In this case the problem is reduced to two dimensions, expressed by the parameters $\mu_1, \mu_2$. At first glance the settings in (10) look somewhat arbitrary, but we already got good results with them. The function arctan ensures that outliers in the expressions (8) do not have too much influence, and the function $(\max(x, 0))^{\mu_2}$ is used as a scatter function for the calculation of the final weights from the "judgement" which is already done in the sums of the parametrization function (the maximum function is only needed to ensure that the power function $x^{\mu_2}$ is defined). But for other problems the more general spline approach may be better.
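The settings (10) in code, under the reconstruction used above that the arctan slope $\mu_1$ and the exponent $\mu_2$ are the two free parameters:

```python
import numpy as np

def j_family_eq10(mu):
    """The parametrization family of eq. (10): two free parameters mu = (mu1, mu2)."""
    mu1, mu2 = mu
    j1 = lambda x: np.arctan(mu1 * x)           # damps outliers in the Delta values
    j2 = lambda x: x                            # identity
    j3 = lambda x: np.maximum(x, 0.0) ** mu2    # max(., 0) keeps the power defined
    return j1, j2, j3

# Plugged into the pipeline above:
# result = minimize(make_objective(sim_table, j_family_eq10), x0=[1.0, 2.0],
#                   method="Nelder-Mead")
```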

Table 1 gives the results for the learning of the node weights with the parametrization function (10). Using splines as parametrization functions we get similar results. The weighting of all jet components achieves slightly better results. Table 2 gives the results for the learning of class dependent weights for the pose estimation task. The time needed for learning for task 1 with the settings (10) for one weight matrix is less than 5 minutes on a Sparc 10. If we approximate the $j_i$ with splines, the learning takes approximately 15 minutes. The learning of weights for all jet components takes approximately 12 hours, and the learning for the pose estimation problem takes less than 5 minutes. In [8] the results of our simulations are discussed in more detail.

6 Conclusion

We introduced a learning algorithm for weights in discrimination functions and we applied this algorithm to very different tasks in face recognition. Nevertheless, we expect our algorithm can also be applied to other discrimination tasks, because we only make use of simple properties: the difference of similarities of submodules within classes compared to the similarities of submodules between different classes. We expect this algorithm might be very useful in other pattern recognition systems which make use of a discrimination function of the type (1), e.g., vector quantization methods. The transformation of the input space induced by the weighting is simple: it is a stretching or compression in each dimension of the input space. The improvement which can be achieved will depend very much on the quality of the already extracted submodules or features. If there are already very significant features for many classes and a lot of other features which are insignificant, the algorithm will find the significant features and will yield a large improvement (as in the case of pose estimation). But if all of the features are not suitable to recognize elements of the different classes, the transformation we can learn with our algorithm will give less improvement. Then a more complex transformation has to be performed, i.e., the feature extraction itself has to be improved. In recent work [9] we utilized a priori constraints similar to our criteria C1 and C2 to derive more efficient features.

Acknowledgement:

We would like to thank Laurenz Wiskott, Jan Vorbrüggen, Thomas Maurer and Christoph von der Malsburg for very fruitful discussions. Portions of the research use the FERET database of facial images collected under the ARPA/ARL FERET program.

References

[1] J.O. Berger, Statistical Decision Theory: Foundations, Concepts and Methods (2nd ed.), New York, Springer, 1985.

[2] J.G. Daugman, "Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by 2D visual cortical filters", Journal of the Optical Society of America A, vol. 2(7), pp. 1160-1169, 1985.

[3] C. De Boor, A Practical Guide to Splines, New York, Springer Verlag, 1978.

[4] T.S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach, New York, Academic Press, 1967.

[5] K. Fukunaga, Introduction to Statistical Pattern Recognition (2nd ed.), Boston, Academic Press, 1990.

[6] R.M. Gray, "Vector Quantization", IEEE ASSP Magazine, 1(2):4-29, 1984.

[7] T. Kohonen, Self-Organization and Associative Memory (3rd ed.), Springer Series in Information Sciences 8, Heidelberg, 1989.

[8] N. Krüger, "An Algorithm for the Learning of Weights in Discrimination Functions", IRINI 08-95 (Technical Report).

[9] N. Krüger, G. Peters, C. v.d. Malsburg, "Object Recognition with Banana Wavelets", accepted for ESANN 97.

[10] N. Krüger, M. Pötzsch, T. Maurer, M. Rinne, "Estimation of Face Position and Pose with Labeled Graphs", Proceedings of the BMVC 1996.

[11] M. Lades, J.C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R.P. Würtz, W. Konen, "Distortion Invariant Object Recognition in the Dynamic Link Architecture", IEEE Transactions on Computers, 42(3):300-311, March 1993.

[12] M.S. Landy, L.T. Maloney, E.B. Johnston, M. Young, "Measurement and modeling of depth cue combination: in defense of weak fusion", Vision Research, 35(3):389-412, February 1995.

[13] J.A. Nelder, R. Mead, "A simplex method for function minimization", Computer Journal, vol. 7, pp. 308-313, 1965.

[14] L.R. Rabiner, R.W. Schafer, Digital Processing of Speech Signals, Bell Laboratories, 1978.

[15] L. Wiskott, J.-M. Fellous, N. Krüger, C. von der Malsburg, "Face Recognition and Gender Determination", Proceedings of the International Workshop on Automatic Face- and Gesture-Recognition, Zurich, June 1995.


Fig. 1. Top: Flexible graphs for frontal and half profile views. At every node jets are extracted. For the final decision which face in the gallery corresponds to a certain input face, the jets belonging to the same landmark in different poses are compared with each other. Bottom: The graphs used for the preprocessing for the different poses (frontal, half profile, profile). The size differences are quite typical for our data set.

Fig. 2: Examples of $H^i_k$. The large bar represents the similarity $\mathrm{Sim}_{jet}(I^i_k, G^i_k)$ and the dots represent the $\mathrm{Sim}_{jet}(I^i_k, G^j_k)$ with $j \neq i$. The similarity of two jets ranges between 0 and 1 (for further explanation see text).

Fig. 3. Left: A weight matrix for the comparison of frontal with half profile views. The corresponding graph is shown in figure 1 (top). The eyes are more important for discrimination of half profiles and frontals than the mouth and chin. The part of the face rotated towards the camera is more important than the opposite part of the face. Right: The learned weight matrices for the pose estimation task for frontal (left), half profile (middle) and profile (right) views. The weights correspond to the nodes of the graphs shown in figure 1 (bottom). The nodes corresponding to the top of the head are very insignificant for pose estimation because for all poses they represent similar features. The nodes were selected in [15] for face segmentation under the assumption that the pose is already known; for that task they are very useful. The tip of the nose is significant for the recognition of the frontal and half profile poses, and the lips are significant for the discrimination of frontals and half profiles. The eyes are not very significant for our pose estimation algorithm.

Tab. 1: The results for weight learning for the nodes, in percentages. As parametrization function we used (10). In the left column the pair of poses compared against each other is marked: f means frontal, h half profile and p profile. "ph" means profile is compared against half profile, i.e., profile is the pose of the faces we have stored in the gallery and half profile is the pose of the unknown pictures which are compared against the gallery. "hh" means that a half profile is compared against a half profile of the opposite view of the face. The columns labeled "norm." give the results without any weighting and the columns labeled "weight." give the results with the learned weights. The columns labeled "co" give the percentage of correctly recognized faces and the columns labeled "ra" give the number of pictures for which the corresponding entry in the gallery was within the rank of the first ten percent best matches. The size of training and test sets was between 130 and 150 pairs of entries. We had no problems with local minima; the simplex algorithm always found the same minima regardless of the initial conditions. The improvements are in the range between 5% and 15%. A weight matrix for the comparison of half profile views and frontals is shown in figure 3 (left).

Tab. 2: The results for the algorithm applied to pose estimation. Here we used the parametrization function in (10). The corresponding weight matrices are shown in figure 3 (right). We remark that in this case we used a kind of graph which was created for a different task: the segmentation of a face under the assumption that the pose is already known, as described in [15]. Just by introducing weights in the discrimination function the errors could be reduced by half. No fine tuning, like adding new nodes or selecting a special kind of gfk adapted to pose estimation, was done to increase the performance. This kind of work is described more precisely in [10].

[Figure 1 appears here; image not reproduced.]

[Figure 2 appears here: three panels a), b), c); vertical axis Sim ranging from 0 to 1; image not reproduced.]

[Figure 3 appears here: weight matrices for frontal, half profile and profile views; image not reproduced.]


                train                     test
         norm.      weight.        norm.      weight.
pose     co    ra   co    ra       co    ra   co    ra
fh       18%   61%  27%   63%      21%   50%  26%   64%
hf       19%   50%  23%   58%      13%   50%  21%   61%
hp       19%   49%  24%   53%      21%   43%  28%   51%
ph       23%   58%  26%   62%      24%   53%  30%   57%
hh       69%   85%  73%   88%      46%   88%  50%   90%

Tab. 1.


pose estimation                      norm.   weight.
training set (100 for each pose)     79%     89%
test set (100 for each pose)         80%     91%

Tab. 2.
