Thesis - VIBOT congrat page
Thesis - VIBOT congrat page
Thesis - VIBOT congrat page
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Robust Face Descriptors in Uncontrolled Settings<br />
Kenneth Alberto Funes Mora<br />
LEAR Team<br />
INRIA Rhône-Alpes<br />
Supervisors<br />
Cordelia Schmid, Jakob Verbeek and Matthieu Guillaumin<br />
A <strong>Thesis</strong> Submitted for the Degree of<br />
MSc Erasmus Mundus in Vision and Robotics (<strong>VIBOT</strong>)<br />
· 2010 ·
Abstract<br />
Face Recognition is known to be a difficult problem for the computer vision community.<br />
Factors such as pose, expression, illumination conditions and occlusions, among others, span<br />
a very large set of images that can be generated by a single person. Therefore the automatic<br />
decision of whether a pair of images depict the same person or not, in uncontrolled settings,<br />
becomes a highly challenging problem.<br />
Due to the large quantity of potential applications, over the past years many algorithms<br />
have been proposed, which can be separated into three categories: holistic, facial feature based<br />
and hybrid. Even though some algorithms have achieved a high accuracy, there is still the need<br />
for a significant improvement to achieve robustness in uncontrolled conditions while achieving<br />
a high computational efficiency.<br />
In this thesis we explore the use of a Histogram of Oriented Gradients as a holistic descriptor.<br />
The experimental results show that a considerable performance is achieved when a proper set<br />
of parameters are combined with a prior face alignment. The classification function is given by<br />
a metric learning algorithm, i.e. an algorithm which finds the best Mahalanobis distance that<br />
separates the input data.<br />
Additionally a facial feature based descriptor is presented, which is the concatenation of<br />
SIFT descriptors, computed in the location of interest points found by a facial feature detection<br />
algorithm. More importantly, a method to handle occlusions is proposed, where a confidence<br />
is obtained from each facial feature and later combined into the classification function. Also,<br />
non-linear strategies for face recognition are discussed.<br />
Finally it is shown that there is complementary information between both descriptors, as<br />
their combination improves the performance such that it becomes comparable to the current<br />
state of the art algorithms.
Contents<br />
Acknowledgments iii<br />
1 Introduction 1<br />
1.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2<br />
1.2 Outline and contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4<br />
2 Related work 5<br />
2.1 Marginalized k-Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . 5<br />
2.2 Automatic naming of characters in TV video . . . . . . . . . . . . . . . . . . . . 6<br />
2.3 Attribute and simile descriptor for face identification . . . . . . . . . . . . . . . . 8<br />
2.4 Face recognition with learning based descriptor . . . . . . . . . . . . . . . . . . . 10<br />
2.5 Multiple one-shots using label class information . . . . . . . . . . . . . . . . . . . 12<br />
3 The face recognition pipeline 14<br />
3.1 Face detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14<br />
3.2 Facial features localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15<br />
3.3 Face alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17<br />
3.4 Preprocessing for illumination invariance . . . . . . . . . . . . . . . . . . . . . . . 19<br />
3.5 Face descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20<br />
3.6 Learning/Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23<br />
3.7 Datasets and evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26<br />
i
3.8 Baseline performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27<br />
4 Histogram of Oriented Gradients for face recognition 29<br />
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29<br />
4.2 Alignment comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30<br />
4.3 HoG parametric study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31<br />
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />
5 Facial feature based representations 36<br />
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36<br />
5.2 Feature wise classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37<br />
5.3 Non-linear approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41<br />
6 Combining face representations 46<br />
6.1 Results for LFW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47<br />
6.2 Results for PubFig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48<br />
7 Conclusions and future work 50<br />
Bibliography 55<br />
ii
Acknowledgments<br />
First of all I want to thank all the people who brought the Vibot program into existence and<br />
that every year work very hard for its improvement. The coordinators Fabrice Meriadeau,<br />
David Fofi, Joaquim Salvi, Jordi Freixenet, Robert Martí and Yvan Petillot and every single<br />
one of the lecturers and administrative staff. Without your effort and initiative we would not<br />
be here.<br />
To my supervisors: Cordelia Schmid, Jakob Verbeek and Matthieu Guillaumin. I feel very<br />
thankful for receiving me in the LEAR team, and for your valuable guidance, which helped me<br />
to grow in knowledge and experience. As well to all the members of the LEAR team, for their<br />
friendship and for making these months a very gratifying and enrichful experience.<br />
To all my Vibot colleages, I have learned too many things from every single one of you. The<br />
cultures you were representing, your different world views and your experience. It is one of the<br />
things that I would never forget of the Vibot program. It helped to grow in so many ways that<br />
I can not express. The world is a small place but it contains great people. Your friendship will<br />
be always alive and I hope we will be meeting in the future.<br />
I want to thank my friends at home, who have been in contact with me all this time. Always<br />
willing to listen, always willing to advice, always willing to talk. Definitely a true friend is not<br />
separated by the distance. You guys know who you are. . .<br />
I want to thank my family, my parents Carlos Funes and Ruth Mora, and my brother<br />
Michael Funes for their support at the distance and encouranging words in moments of need,<br />
¡Papá, Mamá, Michael, los amo enormemente!, ¡Gracias!.<br />
I would like to thank my God and saviour Jesus Christ, you are the source of my strength<br />
and my motivation, you take me by the hand when I need it the most. Thank you. . .<br />
years.<br />
Last but not least, to the European Commission for funding my studies during these two<br />
iii
Chapter 1<br />
Introduction<br />
Face Recognition can be divided in two main applications: Face Identification and Face Verifi-<br />
cation. The former refers to the association between a set of probe faces and a gallery, in order<br />
to determine the identity of each of the exemplars from the probe set. The latter refers to the<br />
decision of whether a pair of face instances correspond or not to the same person. This defini-<br />
tion is different to that of visual identification [13], where the term identification is used for the<br />
pair matching problem. It can be noticed that the face verification problem is wider, in such<br />
way that the face identification task can be formulated by solving face verification subproblems.<br />
Within this thesis we focus on face verification. Therefore, the goal is to design an algorithm<br />
to automatically decide, whether a pair of unseen face images, depict the same person or not.<br />
It is a supervised classification problem, in which the decision function is trained based on a set<br />
of example faces labeled with identities, or pairs of face images labeled as similar or dissimilar.<br />
The availability of a solution to this problem is highly attractive for its many applications.<br />
It comprises fields such as entertainment, smart cards, information security, law inforcement,<br />
surveillance, etc. [45]. Within the context of scene interpretation, we want to be able to auto-<br />
matically determine what is happening in an image or a video [35]. Face recognition is highly<br />
valuable as it helps to determine the question of Who is in the scene [11,20]. This will open<br />
the possibility for applications such as categorization, retrieval and indexing based on iden-<br />
tity [15,16]. The use of face recognition technology is becoming more and more visible, e.g. the<br />
recent launch of tools for automatic face labeling in sites such as Picasa 1 .<br />
More than 35 years in research have generated many algorithms [1,3,7,11,15,16,22,26,35,<br />
36,38,40,42,45] and benchmarks [19,22,27,33], which have pushed face recognition to achieve<br />
outstanding results, a proof is the current availability of commercial software [28]. In general,<br />
this software is designed for the case in which the person cooperates in the image acquisition<br />
1 Picasa Web Albums, http://picasaweb.google.com/<br />
1
Chapter 1: Introduction 2<br />
(a) (b) (c)<br />
(d) (e) (f)<br />
Figure 1.1: Face variations due to: (a) Viewpoint changes (b) Illumination variation (c) Occlusions<br />
(d) Expression (e) Age variations (f) Image quality<br />
in a controlled environment, and therefore there are no major changes in illumination, pose,<br />
expression, etc. However, face recognition in uncontrolled settings, from still images and videos,<br />
is still an unsolved problem. Despite the large amount of research carried out, a significant<br />
improvement is still required in order to achieve robustness in such settings.<br />
The main challenge is that a single person can virtually generate an infinite number of<br />
images. This is due to the many factors that influence the image acquisition. Among the most<br />
important are: major pose or viewpoint changes, including scaling differences, variations in the<br />
illumination conditions, the possibility of occlusions due to sunglasses, hats and other objects,<br />
differences in expression,aging, changes in hair and facial hair and image quality. Figure 1.1<br />
shows examples of how this factors affect the resulting image.<br />
1.1 Problem definition<br />
Even though many algorithms can be found in the literature, a general pipeline can be identified,<br />
shown in Fig. 1.2. Its steps are intended to overcome the challenges previously mentioned. Face<br />
detection is the first step, it defines a bounding box for the location and scale of the face. Then<br />
three optional steps can be applied: alignment, facial feature localization and/or preprocessing<br />
to gain invariance to illumination. The goal is to build a visual descriptor that can be used as<br />
the input for machine learning algorithms. These algorithms are capable of classifying a pair<br />
of examples as belonging to the same individual or not. Three categories of algorithms can be<br />
identified: Holistic, Feature based and Hybrid approaches.
3 1.1 Problem definition<br />
Face<br />
detection<br />
Alignment<br />
(optional)<br />
Facial features<br />
extraction<br />
(optional)<br />
Illumination<br />
normalization<br />
(optional)<br />
Figure 1.2: General face recognition pipeline<br />
Visual feature<br />
extraction<br />
Face<br />
identification<br />
Holistic face description methods consider the face image as a whole to build the descriptor.<br />
Examples of such approaches are the subspace learning algorithms, where a face is represented<br />
as a point in a high dimensional space, with the intensity of each pixel as one dimension,<br />
followed by the use of techniques such as Principal Component Analysis (Eigenfaces) [38] or<br />
Linear Discriminant Analysis (Fisherfaces) [3]. In such cases, the objective is to project the data<br />
into a lower dimensional space where most of the information is maintained (PCA) or the dis-<br />
criminant information between different classes (people) is emphasized (LDA) when computing<br />
the projection matrix. Bayesian methods also fall into this category, refering to those meth-<br />
ods that generate a Maximum a Posteriori (MAP) estimation of a intrapersonal/extrapersonal<br />
classifier [24].<br />
Aditionally, proposals has been presented to unify Bayesian approaches with Eigenfaces<br />
and Fisherfaces [40]. These algorithms have shown to provide good results under controlled<br />
conditions, using benchmarks such as the FERET database [33]. However, they are not suitable<br />
for uncontrolled settings, where high non-linearities are introduced, e.g. as a result of major<br />
pose changes, and are sensitive to the localization given by the face detector.<br />
Proposals have been presented to improve the performance in uncontrolled conditions, by<br />
creating more complex descriptors than simply the set of pixel values, e.g. using Local Binary<br />
Patterns [1] or by extending subspace learning to handle non-linear data, using the kernel<br />
trick [6, 44]. Additionally through methods specialized in non-linear dimension reduction, by<br />
learning an invariant mapping [17]. In this thesis, a holistic approach based on Histogram of<br />
Oriented Gradients (HoG) will be presented in Chapter 4.<br />
Feature based face description algorithms are grounded in the localization of a set of facial<br />
features, such as the position of the mouth, the eyes, the nose, etc, after face detection [11,29].<br />
A descriptor is built using the location information. In the past years, algorithms based on<br />
Facial Features localization have gained a growing attention [7,10,11,16,22], as they are less<br />
sensitive to pose variations and misalignments introduced by the face detector.<br />
Therefore they are appropriate for the face recognition tasks in uncontrolled settings. How-<br />
ever, the facial feature localization itself is still problematic, and needs further improvements.<br />
In this thesis a feature-based algorithm using multiscale SIFT [16, 23] will be presented, and
Chapter 1: Introduction 4<br />
compared to the Holistic approach based on HoG descriptors.<br />
Hybrid face description methods combine holistic and feature based paradigms, through<br />
either early or late fusion. Early fusion refers to the case in which descriptors are combined<br />
into one using aggregation methods, such as concatenation of the feature vectors. In this case,<br />
the information is combined prior to classification. Late fusion makes a classification based on<br />
each descriptor, and their corresponding scores are combined into one, to make a more robust<br />
decision. In this thesis we use a late fusion method, which combines the HoG and multiscale<br />
SIFT descriptors.<br />
1.2 Outline and contributions<br />
In Chapter 2 different state of the art algorithms are described in detail. These were identified<br />
for being the current state of the art for challenging benchmarks such as the Labeled Faces in<br />
the Wild [19] dataset, or because they were an important influence for our work. In Chapter 3<br />
there is a detailed description of the face recognition pipeline from Fig. 1.2. Each of the stages<br />
are described, together with algorithms for their implementation.<br />
The first contribution is given in Chapter 4, where we explore the use of a Histogram of<br />
Oriented Gradients descriptor for face recognition. We show in this chapter that an alignment<br />
robust regarding translations is necessary to obtain a good performance. Furthermore, we<br />
identify set of parameters for which a highest accuracy is achieved.<br />
Our second contribution, described in Chapter 5, is related with feature based algorithms.<br />
We propose a strategy in which learning is done for each facial feature, after which we combine<br />
them through late fusion. Even though this does not help the overall performance, it is good<br />
to handle occlusions. This is done by detecting outliers based on a discriminative appearance<br />
model. The occlusion information is later on inserted into the classification function.<br />
The third contribution is showed in Chapter 6, where we combine the use of HoG and mul-<br />
tiscale SIFT representations through late fusion. This combinations increases the performance<br />
of the algorithm such that it is comparable to the state of the art. Finally, in Chapter 7, we<br />
give a summary of our work pointing out the main conclusions, from which we define our future<br />
work.
Chapter 2<br />
Related work<br />
In Chapter 1 different face recognition algorithms were mentioned. We identified a few methods<br />
that have given promising results in uncontrolled settings, and are recognized as the state of<br />
the art. These algorithms are described in more detail in this chapter.<br />
2.1 Marginalized k-Nearest Neighbors<br />
Guillaumin et al. proposed the use of metric learning approaches for face recognition [16], more<br />
specifically, Logistic Discriminant Metric Learning (LDML), an algorithm that searches for<br />
the best Mahalanobis distance between pairs of feature vectors, explained in more detail in<br />
Section 3.6.2.<br />
Even though LDML has proven to be effective, any Metric Learning algorithm will generate<br />
a linear transformation of the input space. However data, for face recognition, is believed to<br />
be highly non linear, due to major changes in pose and expression. Therefore, metric learning<br />
approaches might not be able to effectively separate the classes. To overcome this problem,<br />
Guillaumin et al. proposed a modification of k-Nearest Neighbors (k-NN). In k-NN classification,<br />
an unseen example is assigned to the class with most occurrence within its k neighbors, that<br />
are defined according to some measure, e.g. minimim Euclidean distance.<br />
If n i c denote the quantity of neighbors of xi belonging to class c. Then the probability of xi<br />
to be of class c is estimated as p(yi = c|xi) = n i c/k. The proposal is to classify the pair (xi,xj)<br />
as belonging to the same class by marginalizing over all the possible classes within the training<br />
set. This is shown in Eq. (2.1).<br />
p(yi = yj|xi,xj) = �<br />
c<br />
p(yi = c|xi)p(yj = c|xj) = 1<br />
k 2<br />
5<br />
�<br />
n c in c j<br />
c<br />
(2.1)
Chapter 2: Related work 6<br />
This result can be thought as a binary k-Nearest Neighbors classifier in the implicit space<br />
of N 2 pairs. This can be observed in Fig. 2.1, where for each point of the pair to be classified,<br />
their k neighbors are selected and then the vote is given by all the pairs that can be generated<br />
from their neighbors, divided by the quantity of possible pairs Eq. (2.1).<br />
The descriptors used in [16] were Local Binary Patterns (LBP) [42] and SIFT [23], computed<br />
at 3 scales in the locations given by the facial feature localization algorithm, i.e. the corner of<br />
the eyes, nose and mouth. The metric used to define the neighborhood was given by a Large<br />
Margin Nearest Neighbors [41]. An algorithm designed to find a metric specifically optimized<br />
for the k-NN problem.<br />
xi<br />
B<br />
A<br />
C<br />
12 pairs<br />
24 pairs<br />
6 pairs<br />
C<br />
6 pairs<br />
Figure 2.1: Marginalized K Nearest Neighbors [16]<br />
2.2 Automatic naming of characters in TV video<br />
Everingham et al. [11] considered the problem of automatic naming of characters in video. They<br />
combined information such as subtitles and scripts to determine which characters are present<br />
in the scene and when. Using visual information are able to associate a name to each character<br />
for certain tracks. These tracks are used as well to generate a set of training examples for<br />
a face recognition algorithm, used to determine the identity of characters from the remaining<br />
unlabeled tracks.<br />
In this case, the problem is simpler in terms of face recognition, tracking can be used to<br />
associate faces in a sequence of frames. Moreover, video can easily generate a large amount of<br />
training examples, and generally, there is a small amount of characters to recognize.<br />
The first step is to align the script (dialogue-character) with the subtitles (dialogue-timing)<br />
to determine which characters are talking and when. Then they proceed to obtain face tracks,<br />
that are face detections linked as the same person over a group of not necessarilly sequential<br />
frames. This is done using a Kanade-Lucas-Tomasi (KLT) tracker [34], this algorithm uses a<br />
interest point detector for the first frame and then propagates the points over the following<br />
A<br />
xj<br />
B
7 2.2 Automatic naming of characters in TV video<br />
(a) (b)<br />
Figure 2.2: (a)Example of face tracking to build the training set (b) Features Patches extraction<br />
[11]<br />
frames. Based on the tracked interest points, which follow a path intersecting face detections,<br />
the creation of the face tracks are obtained as seen in Fig. 2.2a. The face tracking is done<br />
separately for each shot of the whole video, where a change of shot is detected by thresholding<br />
the difference of color histograms between succesive frames. Notice that this simplifies the<br />
problem of face matching and no real face recognition is done yet.<br />
In order to build a face descriptor, the facial feature detector, described in detail in Section<br />
3.2 is used. The pixel values surrounding each localization are extracted, as showed in Fig.2.2b,<br />
normalized to have zero mean and unitary variance, in order to acquire photometric invariance.<br />
Using the localization of the mouth, a speaker detection is used, simply by computing the<br />
variation of the mouth pixels in sequential frames and thresholding. Additionally to facial<br />
information, clothing information is used, with a color histogram for a bounding box below the<br />
face detection. Finally knowing which face track is speaking and associating it with the script<br />
and subtitle information, a set of face tracks can be properly labeled with an identity. These<br />
tracks can be used as training examples for a classification problem, in order to label the rest<br />
of the face tracks that could not be labeled in the previous steps.<br />
To label the rest of face tracks, a similarity measure comparing two characters combines<br />
facial and clothing information, as given in Eq. (2.2)<br />
�<br />
S(pi,pj) = exp − df(pi,pj)<br />
2σ2 � �<br />
exp −<br />
f<br />
dc(pi,pj)<br />
2σ2 �<br />
c<br />
(2.2)<br />
Taking into account this similarity measure, a classification based on Nearest Neighbors or<br />
Support Vector Machines can be used to label the rest of face tracks in the video. More details
Chapter 2: Related work 8<br />
can be found in [11].<br />
Table 2.1: Low level features parameters for a single trait classifier<br />
Pixel Value Types Normalization Aggregation<br />
RGB(r) None(n) None(n)<br />
HSV (h) Mean-Normalization (m) Histogram (h)<br />
Image Intensity (i) Energy-Normalization (e) Statistics (s)<br />
Edge Magnitude (m)<br />
Edge Orientation (o)<br />
2.3 Attribute and simile descriptor for face identification<br />
The work presented by Kumar et al. [22] has presented one of the best results for the Labeled<br />
Faces in the Wild benchmark, when using the “restricted” protocol (explained in Section 3.7.1).<br />
They presented two separate strategies: the attribute and the simile classifier.<br />
2.3.1 Attribute descriptor<br />
The attribute classifier algorithm is based on the idea that a person’s identity can be infered<br />
from a set of high level attributes, such as gender, age, race, etc. The result is a descriptor<br />
with entries according to each of the attributes, as shown in Fig. 2.3a. Each trait is determined<br />
using the algorithm in [21]: the face image is divided into regions, as shown in Fig. 2.3c. The<br />
aim is to have a set of low level features that are created by the combination of a region, using<br />
a specific pixel value type, normalization and aggregation. The options are listed in Table 2.1.<br />
The selection of which combinations to use is trait dependent.<br />
Kumar et al. proposed to use forward feature selection to know which low-level features to se-<br />
lect for a given trait. Then a SVM classifier with RBF Kernel is trained concatenating the useful<br />
low-level features. In [22], the low level descriptor is defined as F(I) = 〈f1(I),f2(I),...,fk(I)〉<br />
where fi(I) represent the feature i of image I, a selection from Table 2.1. The attribute descrip-<br />
tor is build using the output of the trait classifiers as xi = 〈C1(F(Ii)),C2(F(Ii)),...,Cn(F(Ii))〉.<br />
Finally the recognition function is given in Eq. (2.3)<br />
f(Ii,Ij) = D(xi,xj) (2.3)<br />
With D(xi,xj) as a classification function, described in Section 2.3.3, such that the output<br />
is positive for the same identity and negative for different identities.
9 2.3 Attribute and simile descriptor for face identification<br />
(a) (b)<br />
(c)<br />
Figure 2.3: (a) Descriptor based on high level attributes (b) Training examples for the attributes<br />
(c) Face Regions for the attribute classifiers [21]<br />
2.3.2 Simile descriptor<br />
A problem with the attribute classifier is that a significant amount of annotation must be<br />
done, and only features that can be described with words such as gendre must be used. Simile<br />
descriptors are based on the intuition of describing a person based on similarities with reference<br />
individuals. For example: “Nose similar to subject 1” and “Mouth Not similar to subject 2”. To<br />
create such description, a set of reference face images was created. A classifier is trained based<br />
on at least 600 positive examples for each feature and at least 10 times more negative examples.<br />
The final descriptor is depicted in Fig.2.4a, while Fig.2.4b show some training examples.<br />
For a pair of unseen examples, their respective simile feature vectors, xi and xj, are com-<br />
puted. Then a classifier is used to take the decision of whether they depict the same person<br />
(Eq. (2.4)).<br />
2.3.3 Verification classifier<br />
f(Ii,Ij) = D(xi,xj) (2.4)<br />
Both Eq.(2.3) and Eq.(2.4) use the same algorithm, which is a Support Vector Machine classifier<br />
optimized to give higher importance to the sign than to the absolute value of the entries of the
Chapter 2: Related work 10<br />
(a) (b)<br />
Figure 2.4: (a) Descriptor based on similarity of features (b) Training examples for the features<br />
descriptor. This is done based on the observation that the trait classifiers are designed to be<br />
binary outputs, in the range [−1,1].<br />
To do that they proposed to generate pairs pi = (|ai − bi|,ai.bi)g( 1<br />
2 (ai + bi)), where<br />
ai = Ci(I1), bi = Ci(I2) and g(z) is a Gaussian weighting. The concatenation of all the pairs<br />
generate the feature vector that is used for an SVM RBF classifier. Even though these algo-<br />
rithms have both achieved outstanding results for Labeled Faces in the Wild, they do not follow<br />
the strict evaluation protocol as they use training data not available in the Labeled Faces in<br />
the Wild dataset. It also has the disadvantage of using a large set of classifiers just to build the<br />
descriptor. This is not desirable in terms of computational efficiency.<br />
2.4 Face recognition with learning based descriptor<br />
Recently, Cao et.al [7] introduced a novel method which is comparable to the best performing<br />
algorithms for Labeled Faces in the Wild. It brings two main contributions, the first one is<br />
that there is no manually defined descriptor, but a proper encoding is learned specifically for<br />
facial images, in an unsupervised manner. The second contribution consist in a pose dependent<br />
classification.<br />
As illustrated in the top part of Fig. 2.5b, the descriptor is learned as follows: a sampling<br />
method is defined in which, for every pixel, its neighbors are retrieved in a predefined pattern,<br />
to form a low level vector. Examples can be observed in Fig. 2.5a where different options for<br />
patterns are presented. The sampling is done for every pixel in the image, for all the images in<br />
the training set, and therefore each pixel will have an associated low level feature vector.<br />
A vector quantization algorithm is used, which might be K-Means, PCA-tree or random-<br />
projection tree. Empirically they found that random-projection tree gives better results. The
11 2.4 Face recognition with learning based descriptor<br />
(1)<br />
(3)<br />
R 1<br />
R 1<br />
(2)<br />
R 1<br />
R 2<br />
(a)<br />
(4)<br />
R 1<br />
R 2<br />
Preprocessed<br />
image<br />
*<br />
Landmark<br />
detection<br />
R 2<br />
R 1<br />
Sampling and<br />
normalization<br />
left eye<br />
.<br />
.<br />
.<br />
nose<br />
left eye<br />
.<br />
.<br />
.<br />
nose<br />
Component<br />
alignment<br />
d d d 1,1 1,2 1,w<br />
d d d 2,1 2,2 2,w<br />
{ }<br />
d h,1 d h,2<br />
Normalized low-level<br />
feature vectors<br />
DoG<br />
d h,w<br />
LE<br />
descriptor<br />
extractor<br />
LE descriptor<br />
extraction<br />
•<br />
• •<br />
• •• •<br />
(b)<br />
Learning-based<br />
encoding<br />
{<br />
{<br />
left eye<br />
.<br />
.<br />
.<br />
nose<br />
left eye<br />
.<br />
.<br />
.<br />
nose<br />
Component<br />
representaion<br />
Code image<br />
s 2<br />
s 9<br />
... s1<br />
Component<br />
similarity vector<br />
PCA and<br />
normalization<br />
Concatenated<br />
patch histogram<br />
Pose<br />
evaluation<br />
Pose-adaptive<br />
classifier<br />
Pose-adaptive<br />
face similarity<br />
LE descriptor<br />
Face<br />
verificaton<br />
Figure 2.5: (a) Sampling patterns for Learning-based Descriptor. Neighboring pixels are sampled<br />
in a circular pattern [7] (b) Face Recognition with Learning-based Descriptor [7]. The top<br />
part shows the pipeline used to learn the face encoding (descriptor). The bottom section shows<br />
the overall pipeline, showing the pose adaptive recognition.<br />
quantization will transform the low-level features into a single code, as shown in Fig. 2.5b, which defines a code image. Then a spatial grid is defined and, for each cell, a histogram of the occurrence of codes is created. All the histograms are then concatenated to form the final vector. However, depending on the size of the grid and the predefined number of codes, this histogram might be very large; therefore PCA is used to reduce its size.
Surprisingly, they showed empirically that the discriminative power is even higher after the dimensionality reduction, and improves further by simply normalizing the projected vector. They also show how different sampling patterns can be combined to boost performance, as they may retrieve complementary information. It is important to remark that the best results were obtained not with a holistic descriptor but with feature localization: the encoding is computed for each facial feature independently, and the alignment is done per component rather than globally.
Besides the encoding, an adaptive matching was used, in which three exemplar images, with left, frontal and right pose, were selected. For an unseen image, the similarity of its descriptor is computed against each of the exemplars, and the assigned pose is that of the exemplar with the highest similarity. A classifier was trained for each combination of poses (left-left, frontal-frontal, right-right, left-frontal, left-right, frontal-right), such that, depending on the inferred combination of poses of the input images, the corresponding classifier
Chapter 2: Related work 12<br />
is used. Their results showed that this also improves the classification accuracy.
2.5 Multiple one-shots using label class information<br />
This method, introduced by Taigman et al. [36], is based on the one-shot similarity score (OSS). The OSS score is computed as follows: a set of face examples A is obtained, which must be disjoint, in terms of identity, from the images to be compared. Then, if a pair of images xi and xj is to be classified, first a discriminative classifier fi is trained using image xi as the single positive example and the set A as the negative examples. The process is repeated for xj to obtain a classifier fj. The OSS score is the average of the cross classification, i.e. s = (fi(xj) + fj(xi))/2.
The work of Taigman et al. [36] extends this method to benefit from label information. The proposal is to split the set A according to identity, yielding subsets Ai, i = 1, 2, ..., n, and then to create one OSS score from each subset, building a multiple one-shot vector. The motivation is to obtain classifiers which are more discriminative towards identity than towards other factors, such as pose. If a subset Ai contains images of only one person, with variation in factors such as pose and expression, then the classifier is more likely to discriminate identity. If instead a factor such as pose is constant within the subset Ai, the OSS score will be discriminative towards pose rather than identity; however, they argue this information is still beneficial when combining a large set of OSS scores into the multiple one-shot vector. Accordingly, they also created subsets of images sharing the same pose to generate additional OSS scores.
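As an illustration, the OSS computation can be sketched with an LDA-style closed-form one-shot classifier; the following numpy snippet is a simplified toy version (the regularization constant and the exact linear form are illustrative assumptions, not the precise formulation of [36]):

```python
import numpy as np

def one_shot_classifier(x_pos, A):
    """Train a linear one-shot classifier: a single positive example x_pos
    vs. the negative set A (rows are examples). The direction points from
    the negative mean towards the positive example, whitened by the
    (regularized) covariance of the negative set."""
    mu_A = A.mean(axis=0)
    Sw = np.cov(A, rowvar=False) + 1e-3 * np.eye(A.shape[1])
    w = np.linalg.solve(Sw, x_pos - mu_A)
    b = -w @ (x_pos + mu_A) / 2.0       # threshold halfway between the two
    return lambda x: w @ x + b

def oss_score(xi, xj, A):
    """One-shot similarity: average of the two cross-classifications,
    s = (fi(xj) + fj(xi)) / 2."""
    fi = one_shot_classifier(xi, A)
    fj = one_shot_classifier(xj, A)
    return 0.5 * (fi(xj) + fj(xi))
```

Note that the score is symmetric by construction, since swapping xi and xj only swaps the two terms of the average.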
The pipeline for this algorithm can be observed in Fig. 2.6, and it is described as follows: the<br />
two images being compared are aligned, using a similar strategy to that of Section 3.3.2, from<br />
Figure 2.6: The multiple one-shot pipeline
which they create a feature vector. They experimented with densely sampled SIFT, Local Binary Patterns (LBP), and the three-patch and four-patch LBP [42]. PCA is later used to reduce the dimensionality of the descriptor. Then Information Theoretic Metric Learning (ITML) is used to learn a Mahalanobis distance d(xi, xj) = (xi − xj)⊤S(xi − xj), which yields a distance above a certain threshold for negative pairs while keeping the distance below another threshold for positive pairs [9]. The learned matrix can be factorized using a Cholesky decomposition, S = G⊤G, from which the matrix G is used to project the feature vectors. In the new space, computing the Euclidean distance is equivalent to computing the Mahalanobis distance in the original space. The metric and the PCA projection are obtained from the training set prior to the computation of the OSS scores.
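The equivalence between the learned Mahalanobis metric and a Euclidean distance after projecting with G can be verified numerically; in this sketch the positive-definite matrix S is synthetic, standing in for a learned metric:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic positive-definite matrix S standing in for a learned metric.
B = rng.normal(size=(4, 4))
S = B.T @ B + 0.1 * np.eye(4)

# Cholesky factorization S = G.T @ G (numpy returns a lower-triangular L
# with S = L @ L.T, so we take G = L.T).
G = np.linalg.cholesky(S).T

xi, xj = rng.normal(size=4), rng.normal(size=4)

# Mahalanobis distance in the original space...
d_mahalanobis = (xi - xj) @ S @ (xi - xj)
# ...equals the squared Euclidean distance after projecting with G.
d_euclidean = np.sum((G @ xi - G @ xj) ** 2)

assert np.isclose(d_mahalanobis, d_euclidean)
```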
Finally, for a pair of face images to be classified, their feature vectors, projected using the matrix G, are used to generate multiple OSS scores using the subsets Ai; these are concatenated into a vector which is fed into an SVM classifier.
This algorithm currently has the highest accuracy reported for the Labeled Faces in the Wild benchmark under the “unrestricted” protocol, explained in Section 3.7.1. However, notice that the computation of OSS scores is very expensive, as many different discriminative models have to be trained in order to create the multiple OSS score vector.
Chapter 3<br />
The face recognition pipeline<br />
In this chapter the pipeline depicted in Fig. 3.1 is discussed in more detail. The function of<br />
each stage is described, and relevant algorithms for their implementation are presented.<br />
3.1 Face detection<br />
Face detection is the search for the location and scale of instances of human faces within an arbitrary image. Again, the difficulty is to perform well in the presence of the factors that affect images acquired in uncontrolled conditions (c.f. Fig. 1.1). Viola & Jones [39] proposed an efficient algorithm for face detection, based on Haar wavelet features and a cascade of classifiers selected by the Adaboost algorithm.
Adaboost [14] is an algorithm designed to create a “strong classifier” from a set of “weak classifiers” through their linear combination. The algorithm iteratively selects, from the space of weak classifiers, the one which minimizes a weighted error over the training data. The weight assigned to the selected classifier depends on its error and, at each iteration, the distribution over the training examples is updated so that misclassified examples are given higher importance in the following iterations.
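The reweighting loop can be sketched as follows; for brevity this toy version uses one-dimensional threshold stumps as weak classifiers instead of Haar-feature classifiers:

```python
import numpy as np

def adaboost(X, y, n_rounds=10):
    """Minimal AdaBoost with decision stumps. X: (n, d), y: labels in {-1, +1}.
    Returns a list of (feature, threshold, polarity, alpha) stumps."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)            # distribution over training examples
    stumps = []
    for _ in range(n_rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for f in range(d):
            for thr in np.unique(X[:, f]):
                for pol in (+1, -1):
                    pred = pol * np.where(X[:, f] >= thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, pred)
        err, f, thr, pol, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak classifier
        stumps.append((f, thr, pol, alpha))
        # Reweight: misclassified examples gain importance.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
    return stumps

def predict(stumps, X):
    score = sum(a * p * np.where(X[:, f] >= t, 1, -1) for f, t, p, a in stumps)
    return np.sign(score)
```

The exhaustive stump search stands in for the much larger Haar-feature space scanned by the real detector; the reweighting and the classifier weights alpha follow the standard discrete AdaBoost formulation.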
Face detection → Alignment (optional) → Facial feature extraction (optional) → Illumination normalization (optional) → Visual feature extraction → Face identification
Figure 3.1: General face recognition pipeline
Figure 3.2: Viola-Jones object detection based on Haar features. (a) Examples of Haar features. (b) Feature computation from the integral image: notice that the area marked as D can be computed using the points 1, 2, 3 and 4 from the integral image, D = 4 + 1 − (2 + 3) [39]
Their algorithm has the advantage of providing a fast way to evaluate the Haar wavelets by precomputing what is called an integral image, Eq. (3.1). This is possible due to the rectangular geometry of the Haar wavelets (Fig. 3.2a), whose responses can then be computed by adding a few terms from the integral image (Fig. 3.2b). This is an important asset for detection, as different Haar filters must be computed at many locations and scales within the probe image.
Ĩ(x, y) = Σ_{x′≤x, y′≤y} I(x′, y′)   (3.1)
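A minimal numpy sketch of the integral image and the four-lookup rectangle sum of Fig. 3.2b:

```python
import numpy as np

def integral_image(I):
    """Integral image of Eq. (3.1): sum of I over all pixels (x', y') with
    x' <= x and y' <= y, computed with two cumulative sums. Zero-padded on
    the top/left so rectangle sums need no boundary checks."""
    ii = np.zeros((I.shape[0] + 1, I.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = I.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of I over a rectangle using 4 integral-image lookups:
    D = 4 + 1 - (2 + 3) in the notation of Fig. 3.2b."""
    b, r = top + height, left + width
    return ii[b, r] + ii[top, left] - ii[top, r] - ii[b, left]
```

Any rectangular Haar response then reduces to a handful of such lookups, independently of the rectangle size.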
While this algorithm is widely used because of its accuracy and speed, the implementation used in this thesis is an extension of the Viola-Jones algorithm. Besides Haar features, Histogram of Oriented Gradients (HoG) features (see Section 3.5.1) are used. The advantage of using HoG features is that the same concept of the integral image can be applied, by creating the integral histogram [32, 47]. This strategy boosts the speed of the algorithm, which benefits in terms of robustness from the use of additional features. Fig. 3.3 shows some examples of face detections.
3.2 Facial features localization<br />
Facial feature point localization is the first step in feature-based algorithms. Its robustness is crucial for performance. The detector used in this thesis is the one from [11], which is an improvement over the pictorial structure model [12]. The algorithm maximizes the following measure:
Figure 3.3: (a) Correct face detections (b) Example of missed detection due to large pose<br />
variation (c) Incorrect detections due to a cluttered region<br />
p(F | p1, ..., pn) ∝ p(p1, ..., pn | F) ∏_{i=1}^{n} [ p(ai | F) / p(ai | F̄) ]   (3.2)
Eq. (3.2) shows the probability of having the set of features F given a localization (p1, ..., pn). This is proportional to the probability of such a localization (whether the relative positioning of the points is plausible according to the expected geometry), multiplied, for each feature, by the ratio between the probability of observing the appearance ai given that the feature is present and the probability of observing that appearance given that the feature is not present. The appearance model assumes mutual independence between all the facial features, as well as independence from their localization, which is why it appears as a product. Eq. (3.2) can thus be understood as the combination of two models: one for the relative localization of the features and another for their appearance.
The appearance ratios are modeled using a binary classifier trained with feature/non-feature examples. It uses Haar wavelets and Adaboost for the combination of the weak classifiers given by the Haar features, following exactly the same algorithm as in Section 3.1, and its output is substituted directly into Eq. (3.2). The localization, on the other hand, is modeled with a tree-structured Gaussian mixture in which the covariance dependencies form a tree. Each covariance depends on its parent node, as shown in Fig. 3.4, where nodes 2, 3 and 4 are drawn with an uncertainty relative to their parent node (1).
The combination of both models yields highly reliable localization, able to cope with large pose variations. It also handles occlusions, as the expected positions compensate for appearance problems.
As discussed in [12], the tree structure of the Gaussian mixture model allows for efficient algorithms to maximize Eq. (3.2), and using the Viola-Jones algorithm for appearance modeling speeds up the overall method as well.
Figure 3.4: Tree-like Gaussian Mixture Model for the localization of facial features
3.3 Face alignment<br />
Many recognition algorithms rely on the ability of the face detector to give a standard location<br />
and scale for the face. However, this is not always the case, standard face detectors such as<br />
Viola-Jones’s, and the one used for this project, give poorly aligned images. This is the trade-<br />
off between having the ability to detect faces with large changes in pose and expression with<br />
alignment and localization. In order to compesate those misalignments, different algorithms<br />
have been proposed to bring an arbitrary facial image to a canonical pose, in which facial<br />
features can be more easily compared. Recent algorithms have been proposed for non-rigid<br />
transformations, such that proper positioning of the facial features are infered, despite the<br />
pose, see Zhu et al. [46]. In this section, two algorithms restricted to rigid transformations are<br />
described.<br />
3.3.1 Funneling<br />
In 2007, Huang et al. [18] introduced a technique called unsupervised joint alignment. This algorithm models an arbitrary set of images (in this case, face images) as a distribution field, i.e. a model in which every pixel in the image is a random variable Xi with possible values from an alphabet χ, for example the set of pixel intensities of an 8-bit gray-scale image, χ = {0, 1, ..., 255}. Each pixel Xi is then assigned a distribution over χ.
The first step of the algorithm, which can be considered as training, is called congealing. It computes the empirical distribution for each pixel, based on the stack of images, i.e. the empirical distribution field. Then, for each image, it finds a transformation (e.g. an affine transformation) such that the entropy over the distribution field is minimized. It then recomputes the empirical distribution field for the transformed images and iterates until
convergence.<br />
Figure 3.5: Congealing example [18]. The panels show the distribution fields at iterations 1, 2, ..., n.
Fig. 3.5 illustrates the idea of congealing. The distribution field is formed by a stack of 1D binary images, i.e. χ = {0, 1}. At each iteration, a horizontal translation is chosen for each image such that the overall entropy is reduced. As a result, at iteration n, the images are at positions such that they are considered aligned.
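A toy numpy sketch of congealing on a stack of 1D binary images, as in Fig. 3.5 (greedy integer shifts instead of full affine transformations):

```python
import numpy as np

def field_entropy(images):
    """Sum over pixels of the entropy of the empirical distribution of
    binary values at that pixel (the distribution field)."""
    p = images.mean(axis=0)                      # P(X_i = 1) per pixel
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.sum(-p * np.log(p) - (1 - p) * np.log(1 - p)))

def congeal_1d(images, shifts=range(-2, 3), n_iter=5):
    """Greedy congealing on a stack of 1D binary images: repeatedly pick,
    for each image, the horizontal shift that most lowers the entropy of
    the resulting distribution field (shift 0 is a candidate, so entropy
    never increases)."""
    images = images.copy()
    for _ in range(n_iter):
        for k in range(len(images)):
            candidates = [np.roll(images[k], s) for s in shifts]
            scores = []
            for cand in candidates:
                trial = images.copy()
                trial[k] = cand
                scores.append(field_entropy(trial))
            images[k] = candidates[int(np.argmin(scores))]
    return images

# Misaligned "bars": the same 3-pixel pattern at different offsets.
base = np.array([0, 0, 1, 1, 1, 0, 0, 0], dtype=float)
stack = np.stack([np.roll(base, s) for s in (-1, 0, 1, 2)])

aligned = congeal_1d(stack)
assert field_entropy(aligned) <= field_entropy(stack)
```

The real algorithm optimizes continuous affine parameters and, in [18], works on quantized SIFT responses rather than raw intensities; the coordinate-descent structure is the same.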
Notice that congealing can be used directly to align a set of face images. However, it cannot be applied to an unseen example unless the new image is inserted into the training set and congealing is run again. Funneling is an efficient way around this: the idea is to keep the sequence of distribution fields produced at each iteration of congealing, and to choose a sequence of transformations for the new image based on those stored fields. In [18], instead of using pixel values, SIFT descriptors were computed at each pixel location; k-means is then used to obtain 12 clusters, which serve as the alphabet χ.
3.3.2 Facial features coordinates based alignment<br />
Another strategy consists in using the output of the facial feature localization, i.e. the coordinates, to infer the affine transformation which brings the facial feature points to a canonical pose, one that is shared among all the images.
Let x_f = (x^f_0, x^f_1, 1)⊤ be the homogeneous coordinates of feature f in a non-aligned image, and y_f = (y^f_0, y^f_1)⊤ the desired coordinates for the same feature. We want to obtain the affine transformation A (2 × 3) such that y_f = A x_f. To obtain the six parameters of A only three features are needed; however, in order to compensate for wrong localizations, all the features can be used to obtain the set of parameters which minimizes the least-squares error in localization.
Define A′ as the vector with the entries of A, and Y as the vector with the target coordinates, Y = (y^0_0, y^0_1, ..., y^{F−1}_0, y^{F−1}_1)⊤. Finally, the matrix X, with the input coordinates of all the features, is defined as shown in Eq. (3.3).
Figure 3.6: Examples of Facial Features based alignment<br />
X = ⎛ x^0_0      x^0_1      1   0          0          0 ⎞
    ⎜ 0          0          0   x^0_0      x^0_1      1 ⎟
    ⎜ ⋮                                                 ⎟
    ⎜ x^{F−1}_0  x^{F−1}_1  1   0          0          0 ⎟
    ⎝ 0          0          0   x^{F−1}_0  x^{F−1}_1  1 ⎠   (3.3)
Then, with the new variables, y_f = A x_f becomes Y = XA′, whose least-squares solution is given in Eq. (3.4):
A′ = (X⊤X)^{−1} X⊤Y   (3.4)
Figure 3.6 shows some examples of alignments obtained using this strategy. The disadvantage of this approach is that facial feature localization algorithms are rather slow and are affected by large pose changes, which can lead to wrong alignments. Furthermore, a single canonical pose is not suitable for major changes in viewpoint. Within this work, the target coordinates were obtained by averaging over the set of training examples.
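The least-squares fit of Eqs. (3.3)-(3.4) can be sketched in a few lines of numpy; the feature coordinates here are synthetic:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine fit (Eqs. 3.3-3.4): src and dst are (F, 2)
    arrays of feature coordinates; returns the 2x3 matrix A such that
    dst_f ≈ A @ [src_f, 1]."""
    F = src.shape[0]
    X = np.zeros((2 * F, 6))
    X[0::2, 0:2], X[0::2, 2] = src, 1.0    # rows for the y_0 coordinates
    X[1::2, 3:5], X[1::2, 5] = src, 1.0    # rows for the y_1 coordinates
    Y = dst.reshape(-1)                    # (y^0_0, y^0_1, ..., y^{F-1}_1)
    A_prime, *_ = np.linalg.lstsq(X, Y, rcond=None)   # solves Y = X A'
    return A_prime.reshape(2, 3)

# Recover a known transform from 5 noiseless correspondences.
rng = np.random.default_rng(3)
A_true = np.array([[1.1, 0.2, 5.0],
                   [-0.1, 0.9, -2.0]])
src = rng.normal(size=(5, 2))
dst = src @ A_true[:, :2].T + A_true[:, 2]
A_est = fit_affine(src, dst)
assert np.allclose(A_est, A_true)
```

With noisy detections the same call returns the parameters minimizing the squared localization error, which is the role it plays in the alignment stage.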
3.4 Preprocessing for illumination invariance<br />
In uncontrolled conditions, the illumination setup in which the image was acquired might have a drastic influence on the obtained descriptor. Optionally, a preprocessing stage is desirable, in which the effect of illumination conditions, local shadowing and highlights is removed, while preserving the visual information that is important for recognition.
Tan and Triggs [37] proposed an efficient pipeline to remove the effects of illumination, specifically for face recognition. First, Gamma correction is used, i.e. a transformation of the pixel gray-level values I using the non-linear transform Î = I^γ, with 0 < γ < 1. This enhances the dynamic range by increasing the intensity in dark regions and decreasing it in bright regions. Next, the image is convolved with a Difference of Gaussians (DoG) kernel, a bandpass filter which is intended to remove gradients caused by shadows (low frequency) and to suppress
Figure 3.7: Preprocessing examples to gain illumination invariance (a) before preprocessing (b)<br />
after preprocessing<br />
noise (high frequency), while maintaining the signal useful for recognition (middle frequency). Additionally, a mask can be used to remove regions which are irrelevant for recognition. Finally, contrast equalization is used to obtain a standardized contrast spectrum for the image. This is done carefully, removing the effect of extreme values such as the artificial gradients introduced by the masking. Fig. 3.7 shows examples of the resulting images after the preprocessing is applied. In this thesis we did not consider a preprocessing step for illumination invariance, as the descriptors used are based on gradients, and are therefore invariant to illumination shifts.
3.5 Face descriptor<br />
The objective is to transform an image into a feature vector xi ∈ R^D. This vector must be discriminative, i.e. it must encode information that is relevant to determine the identity of the person. The learning algorithms described in Section 3.6 show strategies to learn which information is relevant and which is not.
In Section 2.2 a facial feature based descriptor was presented, consisting of the pixel intensities surrounding the localized facial features. The intensities are normalized to have zero mean and unit variance to gain robustness to illumination changes. We refer to that descriptor as a facial features patch. In this section, two more descriptors are described: the Histogram of Oriented Gradients and SIFT.
3.5.1 Histogram of Oriented Gradients<br />
The Histogram of Oriented Gradients (HoG) was initially proposed by Dalal and Triggs [8]. It is a global (holistic) descriptor, closely related to SIFT (see Section 3.5.2) and to edge orientation histograms, originally designed for the human detection task. The pipeline used for their application is depicted in Fig. 3.8.
As illustrated in Fig. 3.8, the descriptor is built as follows: for an input image, the derivatives in the x and y directions (Ix and Iy) are computed by convolving the image with the filters h = [−1, 0, 1] and h⊤ respectively. Then the magnitude and direction of the gradient are obtained as M(i, j) = √(Ix(i, j)² + Iy(i, j)²) and Ω(i, j) = arctan(Iy(i, j)/Ix(i, j)), in such way that each pixel has its gradient vector: magnitude and direction. Then, according to a predefined number of cells, the image is split into a grid of cells × cells and, for each cell, a histogram is computed over the occurrence of the gradient angles of the pixels contained in that cell. The vote of each pixel is given by its magnitude, and a soft assignment is used, i.e. linear interpolation shares the vote among neighboring angle bins. The next step is to normalize the histogram using blocks of cells, i.e. groups of cells whose joint energy is used for normalization. Dalal and Triggs used overlapping blocks, in such way that there is redundancy over the cells being used, differing only in the value used for their normalization.
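The core of the computation can be sketched as follows; for brevity this toy version uses hard assignment to angle bins (instead of soft, interpolated votes) and per-cell L2 normalization:

```python
import numpy as np

def hog(image, cells=4, angles=8):
    """Minimal HoG sketch: gradient magnitude/orientation, a cells x cells
    grid, magnitude-weighted orientation histograms per cell, and per-cell
    L2 normalization. Hard assignment is used for brevity."""
    # Derivatives with the centered filters [-1, 0, 1] and its transpose.
    Ix = np.zeros_like(image, dtype=float)
    Iy = np.zeros_like(image, dtype=float)
    Ix[:, 1:-1] = image[:, 2:] - image[:, :-2]
    Iy[1:-1, :] = image[2:, :] - image[:-2, :]
    M = np.sqrt(Ix**2 + Iy**2)
    # Unsigned orientation in [0, 180) degrees.
    Omega = np.degrees(np.arctan2(Iy, Ix)) % 180.0

    h, w = image.shape
    ch, cw = h // cells, w // cells
    descriptor = []
    for i in range(cells):
        for j in range(cells):
            m = M[i*ch:(i+1)*ch, j*cw:(j+1)*cw].ravel()
            o = Omega[i*ch:(i+1)*ch, j*cw:(j+1)*cw].ravel()
            bins = np.minimum((o / (180.0 / angles)).astype(int), angles - 1)
            hist = np.bincount(bins, weights=m, minlength=angles)
            hist /= np.linalg.norm(hist) + 1e-12   # cell normalization
            descriptor.append(hist)
    return np.concatenate(descriptor)
```

The overlap, block normalization and multiscale variants discussed in this section extend this basic loop without changing its structure.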
For this thesis the strategy for normalization is different: we allowed the cells to overlap,<br />
with the amount of overlap as a parameter, and we defined three types of normalization:<br />
• Cell: The normalization value for each cell is computed using only the information within that cell. This approach is highly invariant to non-uniform illumination changes, but the relative changes in gradient magnitude between different cells are lost.
• Global: All the cells are normalized with the same value, which is computed globally. In this case, the relative changes in magnitude between different cells are maintained, but there is poor illumination invariance.
• Block: The objective of block normalization is to provide a local, but coarser, normalization, as a trade-off between illumination invariance and maintaining the changes in magnitude between different cells. The strategy is overlap dependent, to comply with the geometry of the spatial grid, and can be used only for overlaps of 0% or 50%. In the case of 0% overlap, the normalization value is computed by combining the energy of the current cell (the one to be normalized) and 3 of its neighbors, as shown in Fig. 3.9a. In the case of 50%, the current cell is normalized using the neighbors on its diagonal. Considering that a cell is actually 4 small squares in Fig. 3.9b, the neighbors on the diagonal cover the area of the current cell.
Input image → normalize gamma & colour → compute gradients → weighted vote into spatial & orientation cells → contrast normalize over overlapping spatial blocks → collect HoGs over detection window → linear SVM → person/non-person classification
Figure 3.8: Pipeline proposed by Dalal and Triggs for human detection using HoG [8]
In all of the cases the normalization used is L2, i.e. for a vector x = (x0, x1, ..., x_{D−1})⊤ the normalized vector is obtained as x′ = x/|x|, with:

|x| = √( Σ_{i=0}^{D−1} x_i² )   (3.5)
We also considered using a multiscale version of HoG. In this case, a HoG descriptor is computed for each level of a scale pyramid. The number of cells at level l, denoted c_l, depends on the scaling factor k, i.e. c_l = c_0 k^{−l}. As a summary, the parameters involved in the computation of the HoG descriptor are shown in Table 3.1.
Table 3.1: HoG parameters summary

Parameter      Description
Cells          Number of cells for the image grid
Angles         Number of angle bins for each histogram
Overlap        Fraction of overlap between neighboring cells
Sign           Whether the angle range is 0–180° or 0–360°
Normalization  Either cell, global or block normalization
Levels         Number of levels for the multiscale HoG
Scaling (k)    Scaling factor for each level of the multiscale pyramid
3.5.2 Scale invariant feature transform (SIFT)<br />
The SIFT descriptor was proposed by Lowe [23], and it has proven to be very useful for object recognition and matching applications. This descriptor is local in the sense that it describes the region surrounding a keypoint, at a specific scale and orientation. Normally its location, scale
Figure 3.9: HoG block normalization. (a) Zero percent overlap: the highlighted cell is normalized using its energy plus the energy of its 3 immediate neighbors. (b) Fifty percent overlap: the current cell is normalized using the energy of its 4 diagonal neighbors, which cover its area due to the overlap.
and orientation are obtained from an interest point (keypoint) detector. In the case of [23], the keypoints are obtained as scale-space extrema using Difference of Gaussians (DoG) filtering.
The SIFT descriptor has the structure depicted in Fig. 3.10; the idea is similar to that of the HoG descriptor. The gradient is computed for each pixel in the interest region, and the area is divided into subregions (2×2 in Fig. 3.10), from which a histogram of gradients is computed, using the magnitude of the gradient as the vote for the angle bins. It is important to remark, however, that prior to the histogram computation a Gaussian weighting is applied to the magnitudes, centered in the middle of the descriptor and with σ equal to one half of the width of the descriptor. This gives less importance to the pixels at the borders of the region and therefore reduces the effect of misalignments. In this thesis, we used SIFT descriptors with 4×4 subregions, each of 8 angle bins, generating a 128-dimensional descriptor.
Figure 3.10: SIFT descriptor structure: image gradients (left) and keypoint descriptor (right) [23]
3.6 Learning/Classification
Each image i is represented by a descriptor vector xi ∈ R^D. The vector xi is also associated with a categorical label yi corresponding to the person's identity. A classification algorithm for face recognition models the binary decision of whether images xi and xj belong to the same class (yi = yj) or not (yi ≠ yj), as shown in Eq. (3.6):
f(xi, xj) : R^{D×2} → {0, 1}   (3.6)
In the following sections, relevant algorithms for classification are described.<br />
3.6.1 Spectral regression kernel discriminant analysis<br />
Kernel discriminant analysis (KDA) is an extension of linear discriminant analysis (LDA) to handle non-linear data. In the case of LDA, it is assumed that the data for each class follows
a normal distribution with equal covariance. The goal is to solve Eq. (3.7):

W_opt = arg max_W Tr{ (W⊤ S_W W)^{−1} (W⊤ S_B W) }   (3.7)
Eq. (3.7) finds the optimal combination of features which separates the input data according to their classes. The objective function is such that the between-class covariance S_B is maximized while the within-class covariance S_W is minimized. These terms are defined in Eq. (3.8) and Eq. (3.9) respectively.
S_B = Σ_{i=1}^{c} N_i (μ_i − μ)(μ_i − μ)⊤   (3.8)

S_W = Σ_{i=1}^{c} Σ_{x_k ∈ X_i} (x_k − μ_i)(x_k − μ_i)⊤   (3.9)
where N_i and μ_i are the number of points and the mean of class i, μ is the mean of all the data regardless of class, and X_i is the subset of points belonging to class i. LDA can be described as an algorithm that finds an optimal linear projection such that data belonging to the same class is moved closer together, while data belonging to different classes is pushed apart.
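A small numpy sketch of the scatter matrices of Eqs. (3.8)-(3.9) and the resulting projection direction, on synthetic two-class data:

```python
import numpy as np

def lda_direction(X, y):
    """Compute the scatter matrices of Eqs. (3.8)-(3.9) and the leading
    LDA direction by solving the generalized eigenproblem S_B w = λ S_W w
    (here via eig of S_W^{-1} S_B, with a small regularizer on S_W)."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    D = X.shape[1]
    S_B = np.zeros((D, D))
    S_W = np.zeros((D, D))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # Eq. (3.8)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)                # Eq. (3.9)
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W + 1e-6 * np.eye(D), S_B))
    return np.real(evecs[:, np.argmax(np.real(evals))])

# Two Gaussian classes separated along the first axis.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal([0, 0], 0.3, size=(50, 2)),
               rng.normal([3, 0], 0.3, size=(50, 2))])
y = np.repeat([0, 1], 50)
w = lda_direction(X, y)
# The learned direction should be dominated by the separating axis.
assert abs(w[0]) > abs(w[1])
```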
In [2] it is shown that the problem can be reformulated in terms of inner products. Therefore the kernel trick can be used to handle non-linear data, which leads to the KDA algorithm. For this thesis we used an instance of KDA called Spectral Regression Kernel Discriminant Analysis (SR-KDA), from the work of Cai et al. [6]. It is a specific formulation of KDA in which the optimization process is theoretically 27 times faster. The limitation of SR-KDA is that the target space is limited to c − 1 dimensions, where c is the number of classes.
3.6.2 Logistic regression<br />
General logistic Regression<br />
Logistic regression [4] models the probability that a feature vector xi belongs to a class as a logistic sigmoid function whose argument is a linear combination of the entries of the feature vector, as shown in Eq. (3.10):

p(yi = 1|xi) = σ(w⊤xi),   (3.10)

where σ(z) = (1 + exp(−z))^{−1} is the sigmoid function, and xi is given in homogeneous coordinates, i.e. it allows for a bias term to be learned in w. Taking the negative log-likelihood (Eq.
(3.11)) and its gradient (Eq. (3.12)), the optimal weights can be obtained by using a gradient descent algorithm until convergence (finding the minimum of the negative log-likelihood).
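A minimal gradient-descent sketch of this procedure on a toy one-dimensional problem (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, t, lr=0.1, n_iter=500):
    """Batch gradient descent on the negative log-likelihood of logistic
    regression. Rows of X are feature vectors in homogeneous coordinates
    (last entry 1 for the bias); t holds binary targets in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - t)   # gradient of the negative log-likelihood
        w -= lr * grad / len(t)
    return w

# Toy separable 1D problem with a bias column.
x = np.linspace(-2, 2, 40)
X = np.column_stack([x, np.ones_like(x)])
t = (x > 0).astype(float)
w = fit_logistic(X, t)
pred = (sigmoid(X @ w) > 0.5).astype(float)
assert (pred == t).mean() > 0.9
```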
L = − Σ_n [ t_n ln p_n + (1 − t_n) ln(1 − p_n) ]   (3.11)

∇L = Σ_n (p_n − t_n) x_n   (3.12)

Logistic discriminant metric learning
The objective of metric learning algorithms is to find the matrix M ∈ R^{D×D} such that the Mahalanobis distance, Eq. (3.13), is minimized for positive pairs (yi = yj) and maximized for negative pairs (yi ≠ yj):

dM(xi, xj) = (xi − xj)⊤M(xi − xj),   (3.13)

where M is restricted to be positive semidefinite 1 . Logistic Discriminant Metric Learning, proposed by Guillaumin et al. [16], models the probability that two examples depict the same person as given by Eq. (3.14):
pn(yi = yj|xi,xj;M,b) = σ(b − dM(xi,xj)), (3.14)<br />
where σ(z) = (1 + exp(−z))^{−1} is the sigmoid function and b is a bias value. Let n be an index representing the pair ij. From Eq. (3.14), the likelihood of the observed data, taking tn as the target class for the pair xn = (xi, xj), is given in Eq. (3.15):

L = ∏_{n=1}^{N} p_n^{t_n} (1 − p_n)^{1−t_n}   (3.15)
From this, it can be shown that the negative log-likelihood and its gradient are given in Eq. (3.16) and Eq. (3.17) respectively:

L = − Σ_n [ t_n ln p_n + (1 − t_n) ln(1 − p_n) ]   (3.16)

∇L = Σ_n (p_n − t_n) X_n   (3.17)
Xn is defined as the vectorization of (xi − xj)(xi − xj)⊤. Using Eq. (3.16) and Eq. (3.17) it is possible to learn the values of M by minimizing the negative log-likelihood with a gradient descent algorithm. If the matrix is restricted to be positive semidefinite, a Cholesky decomposition can be applied to it, i.e. M = LL⊤. In this case Eq. (3.13) can be reformulated as in Eq. (3.18):

¹A matrix M ∈ R^{D×D} is positive semidefinite if x⊤Mx ≥ 0, ∀x ≠ 0; this is denoted M ⪰ 0.
dL(xi,xj) = (L⊤xi − L⊤xj)⊤(L⊤xi − L⊤xj)    (3.18)
This result can be interpreted as a projection of the data followed by the computation of<br />
the Euclidean distance in the new space. Throughout this thesis, logistic discriminant metric<br />
learning will be used as the main learning algorithm.<br />
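As an illustration of the optimization described above, a minimal numpy sketch of LDML is given below. This is not the Matlab/C implementation used in this work: the learning rate, iteration count and toy data are illustrative assumptions, and a full implementation would additionally keep M positive semidefinite, e.g. through the factorization of Eq. (3.18).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def ldml_fit(X, pairs, targets, lr=0.05, iters=1000):
    """Maximize the log-likelihood of Eq. (3.15) by gradient ascent.

    X       : (N, D) matrix of feature vectors
    pairs   : list of index pairs (i, j)
    targets : t_n = 1 for positive pairs (same person), 0 otherwise
    """
    D = X.shape[1]
    M = np.eye(D)                # start from the Euclidean metric
    b = 0.0
    for _ in range(iters):
        grad_M = np.zeros((D, D))
        grad_b = 0.0
        for (i, j), t in zip(pairs, targets):
            d = X[i] - X[j]
            p = sigmoid(b - d @ M @ d)           # Eq. (3.14)
            grad_M += (t - p) * np.outer(d, d)   # Eq. (3.17): X_n = vec(d d^T)
            grad_b += (t - p)
        # p grows when d^T M d shrinks, hence the opposite update signs
        M -= lr * grad_M
        b += lr * grad_b
    return M, b

def ldml_predict(M, b, x_i, x_j):
    d = x_i - x_j
    return sigmoid(b - d @ M @ d)
```

On a toy set where identity lives in the first dimension and the second dimension is noise, the learned metric downweights the noisy dimension so that positive pairs obtain small distances and negative pairs large ones.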
3.7 Datasets and evaluation<br />
In order to evaluate the performance of our algorithm, two datasets are used: Labeled Faces in the Wild (LFW) and Public Figures (PubFig). In this section, a description of both datasets, together with their evaluation protocols, is presented.
3.7.1 Labeled faces in the wild<br />
The main dataset used for this project is Labeled Faces in the Wild (LFW) [19]. It is an important dataset due to its high variability in pose, expression, illumination conditions, etc., and is therefore considered appropriate for evaluating face recognition approaches in uncontrolled settings [30]. It consists of 13233 images retrieved from Yahoo! News using a Viola-Jones face detector. With a resolution of 250 × 250 pixels, the scale and location of each face is approximately the same, so there is no need to run a face detector again. Each image is labeled with the identity of the person, giving a total of 5749 identities. The number of images per person varies from 1 to 530.
To redirect research efforts towards recognition rather than alignment, three versions of LFW are available:
• Not Aligned: the set of images as taken directly from the face detector.<br />
• Aligned Funneled: aligned using the algorithm described in section 3.3.1.<br />
• Aligned Commercial: aligned using the algorithm introduced in [43].<br />
In order to have a standard evaluation method that properly compares different algorithms, a protocol was established. Ten independent subsets (folds) of images were defined, mutually exclusive in terms of image exemplars and identities. The evaluation protocol allows for two different paradigms: restricted and unrestricted. For the restricted case, a set of 600 pairs is predefined for each of the ten folds; each pair has an associated label indicating whether the two images belong to the same person, with 300 pairs for each case. Here the identities must not be used, i.e. no additional pairs can be created. In the unrestricted paradigm, the identities can be used, so that a large quantity of pairs can be created.
For both cases, performance is reported as the mean over 10-fold cross-validation. This means that one of the 10 folds is held out and training is done on the remaining subsets; the accuracy is then obtained by classifying the "unseen" 600 pairs that were left aside. This is repeated 10 times, rotating over the folds, and the final report is the mean and standard deviation of the accuracy over the 10 folds. In this work we focus on the unrestricted paradigm.
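The reporting scheme can be sketched as follows; `train_fn` and `eval_fn` are placeholders for any learning and scoring routine, not part of the LFW protocol itself.

```python
import numpy as np

def lfw_cross_validation(folds, train_fn, eval_fn):
    """LFW-style k-fold protocol: hold one fold out, train on the
    remaining ones, and report the mean and standard deviation of
    the accuracy over all folds.

    folds    : list of per-fold data (e.g. lists of labeled pairs)
    train_fn : callable(training_folds) -> model
    eval_fn  : callable(model, held_out_fold) -> accuracy in [0, 1]
    """
    accuracies = []
    for k in range(len(folds)):
        training = [f for i, f in enumerate(folds) if i != k]
        model = train_fn(training)
        accuracies.append(eval_fn(model, folds[k]))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```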
3.7.2 Public figures (PubFig)<br />
The Public Figures dataset was compiled by Kumar et al. [22] and is larger than LFW. It consists of 59470 images of 200 people collected from the internet, so there are many more images per person than in LFW. Similarly to LFW, it contains large variability in pose, illumination, expression, etc.
An important difference with LFW is that the images are given as a list of URL addresses pointing to different sources on the internet. This is a problem, as over time some images become unavailable. We confirmed this when retrieving the dataset: 15% of the URLs were invalid and, as a consequence, 25% of the test pairs could not be created.
The evaluation protocol is 10-fold cross-validation using a "restricted" paradigm equivalent to that of LFW; therefore, no additional pairs can be used to train the algorithm. Different benchmarks are provided to measure the performance of an algorithm under specific conditions, e.g. its behavior using only frontal pose images, only neutral expressions, etc.
In our evaluation we use the dataset in an "unrestricted" manner, defining our own pairs for training, but using the benchmark test pairs for evaluation.
3.8 Baseline performance<br />
Our baseline algorithm is the following: facial features are detected (see Section 3.2) and, using the found coordinates, two feature vectors are built. The first is the concatenation of SIFT descriptors, computed at three different scales (16, 32 and 48 pixels width) at the location of each facial feature (following [16]). The second is the concatenation of the facial feature patches from Section 2.2. The implementation was done in Matlab, with computationally expensive sections, such as alignments and feature extraction, implemented in C.
Table 3.2 shows the results obtained for both descriptors on the Aligned Commercial version of LFW. For comparison, two classifiers are used: the Euclidean distance between the feature vectors of the pair of images being classified, and LDML to learn a proper metric.
The significant contribution of metric learning approaches to face recognition can be observed. Additionally, when the Euclidean distance is used for classification, there is no significant gain from using SIFT descriptors rather than facial feature patches; the difference only appears when a proper metric is used.
Table 3.2: Baseline algorithms performance

Classification                          Facial Feature Patches   Multiscale SIFT
Euclidean Distance                      0.6702 ± 0.0031          0.6845 ± 0.0051
Logistic Discriminant Metric Learning   0.7385 ± 0.0042          0.8524 ± 0.0052
Chapter 4<br />
Histogram of Oriented Gradients<br />
for face recognition<br />
4.1 Motivation<br />
Facial feature based approaches have gained popularity in the past years due to their robustness to pose variations, in comparison with holistic approaches. However, face recognition performance is then strongly dependent on the accuracy of the facial feature detection. Facial feature localization algorithms, even though they have improved significantly, are still not able to cope with large pose variations. Moreover, their computation time is high, since the objective function must be maximized over the set of possible locations, Eq. (3.2). For these reasons, it is desirable to have a pipeline without facial feature detection.
There is also the intuition that holistic approaches provide more information to the learning process, which might give a higher discriminative power to the overall algorithm. Therefore, a Histogram of Oriented Gradients (HoG) descriptor, a holistic encoding, was implemented following the description in Section 3.5.1. The implementation language was C, using the OpenCV library [5]. Assuming the input image's resolution is 250x250, the descriptor is created for the 100x100 pixel region in the center of the image. It is important to dismiss the background in order to reduce biases the dataset might have [31]. The objective is to find a set of parameters such that the discriminative power of the descriptor is suitable for face recognition.
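As an illustration of this holistic encoding, a minimal numpy sketch is given below for the non-overlapping case. The crop coordinates, the hard orientation assignment and the cell grid are simplifications of the C/OpenCV implementation (which uses soft assignment by linear interpolation), and normalization is handled separately, as studied in Section 4.3.2.

```python
import numpy as np

def hog_descriptor(image, cells=12, bins=16, signed=True):
    """Illustrative HoG over the central 100x100 region of a
    250x250 face image, with a non-overlapping cell grid."""
    assert image.shape == (250, 250)
    patch = image[75:175, 75:175].astype(float)   # central 100x100 region
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.arctan2(gy, gx)                      # orientation in [-pi, pi]
    if signed:
        ang = np.mod(ang, 2 * np.pi)              # signed range, [0-360) degrees
        bin_width = 2 * np.pi / bins
    else:
        ang = np.mod(ang, np.pi)                  # unsigned range, [0-180)
        bin_width = np.pi / bins
    cell_size = patch.shape[0] // cells           # pixels per cell side;
    hist = np.zeros((cells, cells, bins))         # border remainder is ignored
    for cy in range(cells):
        for cx in range(cells):
            sl = (slice(cy * cell_size, (cy + 1) * cell_size),
                  slice(cx * cell_size, (cx + 1) * cell_size))
            b = np.minimum((ang[sl] / bin_width).astype(int), bins - 1)
            # magnitude-weighted hard assignment into the angle bins
            np.add.at(hist[cy, cx], b.ravel(), mag[sl].ravel())
    return hist.ravel()
```

With the default 12x12 cells and 16 bins the descriptor has 12 x 12 x 16 = 2304 entries.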
4.2 Alignment comparison<br />
It is important to decide whether an alignment is imperative for holistic approaches, more specifically for the use of a HoG descriptor. To answer this question, we compare the three variants of LFW: Not Aligned, Aligned Funneled and Aligned Commercial, using the same parameters for the HoG descriptor.
The first results are shown in Table 4.1. They reveal, in a consistent manner, that an alignment is crucial for face recognition using HoG. Interestingly, the funneled version of LFW did not show any improvement over the not aligned version; in fact there is a decrease. For that reason, we ran a face alignment using the locations of the facial features (cf. Section 3.3.2). It can be seen that this boosts the results significantly, both for the not aligned and the aligned funneled versions, with an increase of over 5%, while there is an insignificant decrease for the aligned commercial version. Though it is not reported here, in our experiments we did not observe a significant difference in accuracy between the LFW versions when using a facial feature based descriptor.
These results bring two conclusions. A face alignment is indeed crucial for the use of HoG descriptors. However, as suggested by the decrease in accuracy of the funneled version with respect to the not aligned version, the alignment should be robust not only in terms of rotation and scale but, more importantly, to translation. We believe that funneling is not as robust to translation as a feature based alignment.
The need for an alignment robust to translation is intuitive, as it is desirable for corresponding features to fall in the same spatial cell. Once the parametric study for HoG was done (Section 4.3), the same experiment was performed using the best set of parameters (Table 4.9). The results, shown in Table 4.2, confirm the previous behavior. A disadvantage of this result is that, even though the descriptor is holistic, a facial feature detector is needed prior to its computation, inheriting the problems caused by the detector.
Table 4.1: Alignment comparison for an initial set of parameters for the HoG descriptor:
12x12 cells, 16 angle bins, range [0-360]°, 50% overlap with block normalization

                          Not Aligned       Aligned Funneled   Aligned Commercial
No further alignment      0.7568 ± 0.0053   0.7408 ± 0.0067    0.8205 ± 0.0063
Feature based alignment   0.8069 ± 0.0066   0.8093 ± 0.0063    0.8171 ± 0.0047
Table 4.2: Alignment comparison for the final set of parameters: 16x16 cells, 16 angle bins,
range [0-360]°, 50% overlap with global normalization

                          Not Aligned       Aligned Funneled   Aligned Commercial
No further alignment      0.7660 ± 0.0061   0.7702 ± 0.0042    0.8432 ± 0.0062
Feature based alignment   0.8276 ± 0.0051   0.8383 ± 0.0054    0.8357 ± 0.0058
4.3 HoG parametric study<br />
We perform a parametric study for Histogram of Oriented Gradients based face recognition. The evaluation follows the protocol established for LFW, i.e. 10-fold cross-validation, and the results are reported as the mean and standard deviation of the accuracy over the 10 folds. Unless specified otherwise, the dataset used is LFW aligned commercial and the learning algorithm is LDML. As a search for the optimal parameters considering all possible combinations is almost intractable, we decided to optimize the parameters one by one.
4.3.1 Angle range<br />
As a first experiment we studied the effect of the angle range on the performance of the algorithm. To do so, we fixed the rest of the parameters: 8 angle bins, as used for the SIFT descriptor [23], 8x8 cells and 50% overlap, with block normalization. The experiment was repeated for the three variants of LFW to compare the results.
It can be observed from Table 4.3 that a range of [0-360]° outperforms the range of [0-180]° when combined with LDML. This is consistent across the three variants of LFW. Therefore, in the following experiments the default is a signed angle, i.e. a range of [0-360]°.
4.3.2 Normalization<br />
The three variants for normalization are described in Section 3.5.1: cell, block and global normalization. Fig. 4.1 shows examples of HoG descriptors plotted over the original image. For cell normalization, as the norm is the same for each spatial bin, the relative changes
Table 4.3: Angle range comparison for HoG. 8x8 cells, 8 angle bins, 50% overlap and block
normalization

Angle Range   Not Aligned       Aligned Funneled   Aligned Commercial
[0-180]°      0.7150 ± 0.0053   0.7077 ± 0.0052    0.7563 ± 0.0082
[0-360]°      0.7523 ± 0.0071   0.7495 ± 0.0054    0.8017 ± 0.0066
Figure 4.1: HoG normalization examples: (a) cell normalization, (b) block normalization, (c) global normalization.
Table 4.4: Normalization comparison for the HoG descriptor. Parameters: 16 angle bins,
range [0-360]°; columns are number of cells / overlap (%)

Normalization   12/0              12/50             16/0              16/50
Cell            0.7933 ± 0.0061   0.8128 ± 0.0077   0.7578 ± 0.0091   0.8178 ± 0.0061
Block           0.8192 ± 0.0064   0.8305 ± 0.0064   0.8291 ± 0.0058   0.8385 ± 0.0074
Global          0.8247 ± 0.0071   0.8283 ± 0.0068   0.8317 ± 0.0056   0.8432 ± 0.0062
in magnitude between different cells are lost, which diminishes the influence of strong gradients. However, it is very robust to non-uniform changes in illumination. In the case of global normalization, the important gradients that appear in regions such as the eyes, mouth and nose are emphasized, at the cost of a weaker resistance to illumination changes. Block normalization is a trade-off between the cell and global paradigms.
An experiment was performed in which all parameters were left unchanged except for the normalization type, overlap and number of cells. The results, found in Table 4.4, show consistently that cell normalization gives the worst performance. Global normalization leads to results similar to block normalization; in most cases global is better, except for 12 cells with 50% overlap. Because of these results, and for its computational simplicity, we take global normalization as the default for further experiments. The exception is the quantity-of-cells experiment, which was computed in parallel.
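The three schemes can be sketched on a (cells, cells, bins) histogram as below; the non-overlapping 2x2 block layout is an assumption made for illustration, not the exact block structure of the implementation.

```python
import numpy as np

def normalize_hog(hist, mode="global", eps=1e-8):
    """Cell, block and global L2 normalization of a HoG histogram
    of shape (cells, cells, bins)."""
    h = hist.astype(float)
    if mode == "cell":
        # each spatial bin on its own: robust to non-uniform illumination,
        # but relative magnitudes between cells are lost
        return h / (np.linalg.norm(h, axis=2, keepdims=True) + eps)
    if mode == "block":
        # trade-off: normalize non-overlapping 2x2 groups of cells
        out = np.empty_like(h)
        for cy in range(0, h.shape[0], 2):
            for cx in range(0, h.shape[1], 2):
                block = h[cy:cy + 2, cx:cx + 2]
                out[cy:cy + 2, cx:cx + 2] = block / (np.linalg.norm(block) + eps)
        return out
    # global: a single norm for the whole descriptor, emphasizing the
    # strong gradients (eyes, mouth, nose) at the cost of a weaker
    # resistance to illumination changes
    return h / (np.linalg.norm(h) + eps)
```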
4.3.3 Quantity of cells<br />
Another important parameter to determine is the quantity of cells. Table 4.5 shows experiments in which only this parameter was changed. Here we used 16 angle bins over a signed range, i.e. [0-360]°, with 0% overlap and global normalization. It can be observed that above 14 cells there is no significant variation, and below that value the results start to degrade. A reason why more than 14 cells brings no improvement might be that LDML starts to combine the information of finer cells as if they were coarser. More cells then bring no improvement, only larger descriptors; e.g. there is no significant difference in performance between 16 × 16 and 20 × 20, yet for 20 cells the descriptor size is almost double that of 16 cells. Therefore, we set 16 cells as our default value.
Table 4.5: Number of cells comparison for the HoG descriptor. 16 angle bins, range [0-360] ◦ ,<br />
0% overlap with block normalization<br />
Number of cells Accuracy<br />
10 0.8198 ± 0.0086<br />
12 0.8305 ± 0.0064<br />
14 0.8327 ± 0.0080<br />
16 0.8385 ± 0.0074<br />
18 0.8348 ± 0.0059<br />
20 0.8412 ± 0.0060<br />
22 0.8380 ± 0.0068<br />
4.3.4 Angle bins<br />
Angle bins refer to the number of partitions into which the angle range is split. Experiments were done to compare how the performance is affected by modifying the quantity of angle bins per cell. The results can be found in Table 4.6; the maximum is found at 16 bins, which is therefore taken as the default for further experiments.
Table 4.6: Accuracy obtained using different numbers of angle bins for the HoG descriptor.
Parameters: 16x16 cells, range [0-360]°, 0% overlap with global normalization

Angle bins   8                 12                16                20
Accuracy     0.8230 ± 0.0049   0.8270 ± 0.0052   0.8317 ± 0.0046   0.8295 ± 0.0077
4.3.5 Overlap<br />
Table 4.7 shows the variation in accuracy as a function of the overlap, with the rest of the parameters left unchanged. The maximum accuracy, 0.8432 ± 0.0062, was obtained with an overlap of 50%. However, accuracy is not highly affected within a range of 10% to 60%.
It is important to remark that the cell size in pixels is a function of the overlap when the image size remains fixed. Therefore, to show that overlap is beneficial, an additional experiment was done: a 9x9 cells descriptor was created with no overlap. In this case the cell size is similar to that of 16x16 cells with 50% overlap (≈ 11 pixels). The accuracy obtained was 0.8207 ± 0.0080, which is lower than with overlap. We argue that overlap is beneficial as it helps to correct misalignments due to problems in face detection or pose variations.
Table 4.7: Overlap comparison. Parameters: 16x16 cells, 16 angle bins in the range [0-360]°,
using global normalization

Overlap (%)   0        12.5     25       37.5     50       62.5     75
Accuracy      0.8317   0.8423   0.8392   0.8412   0.8432   0.8415   0.8333
              ±0.0056  ±0.0045  ±0.0054  ±0.0064  ±0.0062  ±0.0066  ±0.0052
4.3.6 Multiscale HoG<br />
We also studied a multiscale HoG descriptor; in this case two parameters are involved: the number of scales and the rescaling factor. The results in Table 4.8 show that the multiscale approach does not bring any significant contribution to performance. The reason might be that a coarser level of the pyramid is only a linear combination of the finer cells, which causes LDML to ignore coarser levels, as the information at the finest level of the pyramid is sufficient.
Table 4.8: Multiscale HoG performance

Levels/k   12 cells          14 cells          16 cells          18 cells
2/1.15 0.8317 ± 0.0067 0.8407 ± 0.0065 0.8435 ± 0.0074 0.8425 ± 0.0074<br />
2/1.30 0.8287 ± 0.0067 0.8375 ± 0.0059 0.8380 ± 0.0047 0.8453 ± 0.0068<br />
2/1.45 0.8355 ± 0.0056 0.8388 ± 0.0068 0.8413 ± 0.0063 0.8410 ± 0.0074<br />
3/1.15 0.8322 ± 0.0074 0.8423 ± 0.0061 0.8398 ± 0.0062 0.8397 ± 0.0063<br />
3/1.30 0.8312 ± 0.0057 0.8383 ± 0.0065 0.8397 ± 0.0062 0.8435 ± 0.0070<br />
3/1.45 0.8312 ± 0.0059 0.8360 ± 0.0073 0.8440 ± 0.0057 0.8420 ± 0.0058
4.4 Discussion<br />
The conclusion of this study is the identification of appropriate parameters for face recognition. The descriptor to be used has a 16x16 cell spatial grid with an overlap of 50%; the angle histograms are created using 16 bins covering the range from 0° to 360°; voting is done by soft assignment with linear interpolation. There is no need for a multiscale descriptor when LDML is the classification algorithm.
Further improvements could be achieved by reducing large differences in occurrence between certain angle bins. For example, regions around the mouth are expected to always have a high occurrence of horizontal lines. A large fraction of the feature vector energy is therefore distributed over the angle bins corresponding to those gradients, overshadowing bins with fewer occurrences. This problem is one of the main motivations for the work of Cao et al. [7], as this concentration of energy reduces the discriminative power of the descriptor.
A simple way to balance the energy is to define new descriptors x′ by computing the square root of the input descriptors, i.e. x′ = (√x0, √x1, ..., √x_{D−1})⊤. This is similar to computing the Hellinger distance d(x,y) = Σi (√xi − √yi)², but extended to handle inter-feature correlation through the Mahalanobis distance. This test brought the results from 0.8432 ± 0.0062 up to 0.8530 ± 0.0065 for the aligned commercial version of LFW. Notice that with this method the conclusion drawn for multiscale HoG might not hold, as coarser cells are no longer a linear combination of finer cells.
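The re-mapping is a one-liner; the γ generalization discussed next is included too. The descriptor values are assumed non-negative, as HoG entries are.

```python
import numpy as np

def sqrt_features(x):
    """Square-root re-mapping of a non-negative descriptor: the
    Euclidean distance between mapped vectors equals the Hellinger
    distance between the originals."""
    return np.sqrt(x)

def power_features(x, gamma=0.5):
    """Generalization with gamma in [0, 1]; gamma = 0.5 recovers the
    square-root vector and gamma = 1 leaves the descriptor unchanged."""
    return np.power(x, gamma)
```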
This result suggests that it would be interesting to study different strategies to distribute the energy of the descriptor. For example, instead of computing the square root, a parameter γ ∈ [0,1] could be used to create a new feature vector x′ = (x0^γ, x1^γ, ..., x_{D−1}^γ)⊤, a generalization of the square-root vector.
Table 4.9: Best found parameters for HoG based recognition

Parameter      Value
Cells 16<br />
Angle bins 16<br />
Overlap 50%<br />
Sign Signed, i.e. range: [0 − 360] ◦<br />
Normalization global<br />
Additional Square root of features<br />
Levels 1<br />
scaling (k) -
Chapter 5<br />
Facial feature based<br />
representations<br />
5.1 Motivation<br />
Pose and expression represent a major challenge for face recognition. They introduce non-linearities in the data, which might be difficult, or even impossible, to handle using linear algorithms; this includes metric learning approaches such as LDML.
One way to overcome this limitation is to design descriptors invariant to these factors. Feature based approaches have proven useful for building descriptors less sensitive to changes in pose, as they are built at each facial feature, regardless of the features' relative positions. Another alternative is to use non-linear machine learning algorithms, with which the non-linear data lying in a high dimensional space might be separated. In this chapter some experiments using non-linear strategies are presented.
Another challenge is to handle occlusions, a common problem in uncontrolled settings. We propose to separate the metric learning according to spatial regions, i.e. a specific metric is learned to classify a region of the face, deciding whether it belongs to the same person or not independently of the rest of the face; the final classification then combines the results given by each region. In the case of the HoG descriptor, for example, this could be done by grouping neighboring cells. Each region can then be classified as occluded or not using outlier detection algorithms, so that in later stages of classification facial regions can be dismissed or kept.
Our first goal is to separate the training stage according to spatial regions and to achieve results similar to those of a global training (cf. Section 3.8). We will consider each of the 9 detected facial features as a spatial region. These are: left eye left, left eye right, right eye left, right eye right, nose left, nose center, nose right, mouth left and mouth right.
The second goal is to classify, for each feature, whether it is an inlier or an outlier. The output is a confidence value, a measure of "normality". This score represents how well the specific instance being classified fits a model given by the training data.
As a final step, it is desirable to include the confidence values in the classification. The goal is to reduce the influence of the occluded features and correspondingly increase the influence of the observed features on the final decision.
5.2 Feature wise classification<br />
In this case, the multiscale SIFT descriptor xi obtained in Section 3.8 is split into 9 feature vectors; x^f_i denotes the descriptor for feature f of image i. A metric Mf is then learned for each feature separately. To take a joint decision for the classification, we propose the two approaches described below.
Distance sum

A joint distance is obtained by adding the feature-wise distances and their bias terms. Both the metrics and the bias terms are learned using the LDML algorithm. This is shown in Eq. (5.1):

p(yi = yj|xi,xj) = σ( Σf bf − Σf dMf(xi,xj) )    (5.1)

Logistic regression
The problem with the distance sum approach is that it assumes every feature contributes equally to the final decision. However, this is not the case. To confirm this assertion, we refer to the joint learning described in Section 3.8. The Mahalanobis distance can be seen as a weighted combination of the entries of the difference vector, i.e.

(xi − xj)⊤M(xi − xj) = Σu Σv muv (x^u_i − x^u_j)(x^v_i − x^v_j),

so the magnitude of muv describes how significant the pair of entries uv is.
Fig. 5.1 plots the energy of the entries of M, which correlate the facial feature pairs according to a global learning, i.e. the entry at row u and column v shows how correlated facial feature u is with facial feature v. As expected, there is higher energy in the diagonal. However, the energy is not equally distributed over the diagonal, implying that some features are more important than others. Interestingly, the eyes are much more discriminative than the nose and the mouth. We
Figure 5.1: Energy distribution for a joint learning of a facial feature based descriptor
assume this is a consequence of expression variation in the case of the mouth, and of pose affecting the nose. Based on these observations, we use logistic regression to find proper weights for each facial feature, as shown in Eq. (5.2):

p(yi = yj|xi,xj) = σ( w0 + Σf wf dMf(xi,xj) )    (5.2)
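A sketch of this combination is given below. The distances and weights are illustrative values; in practice w0 and the wf are learned by logistic regression on training pairs, and for the "same person" probability to be high at small distances the learned wf come out negative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combine_feature_distances(dists, w0, w):
    """Eq. (5.2): per-feature Mahalanobis distances d_{Mf}(xi, xj)
    combined through a logistic function, one weight per feature."""
    return sigmoid(w0 + np.dot(np.asarray(w), np.asarray(dists)))
```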
Table 5.1 shows the accuracy achieved for each facial feature separately, reported for one fold of the aligned commercial version of LFW, together with the accuracy of both types of combination (also for 1 fold). Notice that a single feature is not highly discriminative; it is nevertheless better than a simple Euclidean distance classification using all the features (see Table 3.2). Furthermore, the performance improves when the features are combined. In this case there was no difference between the distance sum and logistic regression. However, we noticed that logistic regression assigned higher weights to the eyes, followed by the nose, with lower weights for the mouth, which is consistent with the global learning depicted in Fig. 5.1. When running the algorithm for more folds, a difference appeared between the distance sum and logistic regression. The last two lines of Table 5.1 show results for more folds, where logistic regression has an advantage over the distance sum, though the difference is not significant.
Table 5.1: Results for separate facial feature learning

Number of folds   Feature               Accuracy
1                 left eye left         0.7311
                  left eye right        0.7832
                  right eye left        0.7849
                  right eye right       0.7412
                  nose left             0.7597
                  nose center           0.7445
                  nose right            0.7378
                  mouth left            0.6840
                  mouth right           0.7143
                  Distance sum          0.8434
                  Logistic regression   0.8434
4                 Distance sum          0.8170 ± 0.0050
                  Logistic regression   0.8191 ± 0.0055

5.2.1 Occlusion detection
To detect occlusions we adopt a discriminative model for each facial feature, in which the descriptor x^f_i is classified as normal or occluded. We profit from the already implemented appearance model used by the facial feature localization algorithm (cf. Section 3.2).
The confidence value is modeled as p(f^i|I) = σs,b(p(ai|F)/p(ai|F̄)), i.e. the output of the appearance model, a likelihood ratio between the feature and background models, passed through a sigmoid function. This gives a probabilistic estimate of how well the feature fits the appearance model. Notice the sigmoid function has two parameters, s and b, the slope and bias. These parameters could be included in the learning; however, in our experiments we used s = 1 and b = 0 for simplicity. Fig. 5.2a shows some examples of correctly detected abnormalities (p(f^i|I) < 0.5). We found that this method not only detects outliers caused by objects occluding the facial feature, but is also useful for detecting erroneous localizations, as shown in the last two images of Fig. 5.2a. For a pair of images Ii and Ij, a confidence vector qij is created as given in Eq. (5.3):

qij = ( p(f^1|Ii) × p(f^1|Ij), ..., p(f^9|Ii) × p(f^9|Ij) )⊤    (5.3)
To use the confidence values in the classification of an unseen pair of examples, we first tried normalizing qij such that its L1-norm equals 9, and then multiplying the distance of facial feature f by the corresponding entry of the normalized qij. However, this did not affect the performance much for the distance sum, and it decreased the accuracy in the case of logistic regression weighting. Instead, we propose to use the confidence values not only for the distance but also for the bias. In the case of the distance sum, the classification function from
Table 5.2: Feature combination comparison using confidence values. Results reported for 4 folds

                       Distance Sum      Logistic Regression
Normal                 0.8170 ± 0.0050   0.8191 ± 0.0055
Confidence weighting   0.8306 ± 0.0048   0.8272 ± 0.0050
Eq. (5.1) is modified as shown in Eq. (5.5):

p(yi = yj|xi,xj) = σ( Σf q^f_ij bf − Σf q^f_ij dMf(xi,xj) )    (5.4)

                 = σ( Σf q^f_ij (bf − dMf(xi,xj)) )    (5.5)
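The adaptive-threshold behavior of this weighting can be sketched directly; the confidence, bias and distance values below are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def confidence_weighted_decision(q, b, d):
    """Eq. (5.5): each per-feature confidence q_ij^f weights the
    disparity between the feature distance d_{Mf} and its bias b_f.
    A fully occluded feature (q = 0) drops out of the decision."""
    q, b, d = (np.asarray(a, dtype=float) for a in (q, b, d))
    return sigmoid(np.sum(q * (b - d)))
```

For example, a feature with a grossly wrong distance (say, because it is occluded) flips the decision when trusted, but is ignored once its confidence is zero.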
This can be thought of as an adaptive threshold, a function of the confidence of each facial
feature, or as a confidence weighting of the disparity between the feature distance and its
threshold. In the worst case, when a facial feature is entirely occluded, the confidence value
removes its effect completely from the classification. For the logistic regression based
combination, we assumed the learned bias value w_0 can be split according to the learned
weights. The classification function is shown in Eq. (5.6) and the results are given in Table 5.2.
Notice that the weights are learned for the classifier in Eq. (5.2), and the confidence values
are inserted into the classification function only for the evaluation of the test set. However,
it would be desirable to also incorporate the confidence values into the training process, such
that logistic regression finds the optimal weights for this task.
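The adaptive-threshold decision of Eq. (5.5) can be sketched as follows; the per-feature biases b_f and distances d_f are hypothetical values, not the learned ones:

```python
import math

# Sketch of Eq. (5.5): each facial feature f contributes q_f * (b_f - d_f),
# so a fully occluded feature (q_f = 0) is removed from the decision.
def p_same(q, b, d):
    s = sum(qf * (bf - df) for qf, bf, df in zip(q, b, d))
    return 1.0 / (1.0 + math.exp(-s))   # sigmoid

b = [1.0] * 9            # hypothetical per-feature thresholds
d = [0.5] * 9            # small distances: likely the same person
q = [1.0] * 8 + [0.0]    # last feature fully occluded
p = p_same(q, b, d)
```

With q_f = 0 the corresponding distance has no influence at all: replacing the last distance by an arbitrarily large value leaves p unchanged.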
    p(y_i = y_j | x_i, x_j) = \sigma( \sum_f q_{ij}^f w_0 \frac{w_f}{\sum_k w_k} - \sum_f q_{ij}^f w_f d_{M_f}(x_i, x_j) )    (5.6)

5.2.2 Discussion
In this section we have shown that separate learning can be done according to spatial regions,
and the distances can then be combined to make a joint decision. The results show that a single
feature is not very discriminative, but their combination brings a significant improvement. We
found no major difference between the distance sum approach and logistic regression.
Even though it was not possible to reach the same results as with a global learning, this
Figure 5.2: Examples of outlier detections: red for occluded features and yellow for normal
features (we refer the reader to the electronic version of the document). (a) correct outlier
detections; (b) wrong outlier detections.
proves that the separation can be done. A cause for this limitation might be that we cannot
benefit from inter-feature correlation or, more importantly, that the distances are not
comparable. One way to overcome this problem is to do a global learning in which the matrix
M is restricted to be block diagonal. This is equivalent to learning each facial feature metric
separately, but in such a way that the distances are comparable.
The results from Table 5.2 also show that the confidence value for each facial feature can
be integrated effectively into the decision function. This allows us to handle occlusions and/or
wrong localizations. We hope that, if the feature-wise learning becomes comparable to global
learning, using the confidence values will improve the accuracy even further.
A limitation of this algorithm is that it depends on a robust appearance model. As illustrated
in Fig. 5.2b, the outlier detection algorithm might fail, producing false detections. We suggest
that it is necessary to implement an appearance model trained specifically for this task.
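The block-diagonal idea can be checked with a small sketch (the blocks and vectors here are hypothetical): constraining M to be block diagonal makes the global Mahalanobis distance decompose exactly into a sum of per-feature distances.

```python
# Sketch: a block-diagonal Mahalanobis metric decomposes as
# d_M(x, y) = sum_f d_{M_f}(x_f, y_f), so one constrained global learning
# yields per-feature distances that are directly comparable.
# The 2x2 blocks below are hypothetical.

def mahalanobis(M, x, y):
    diff = [a - b for a, b in zip(x, y)]
    n = len(M)
    return sum(diff[r] * M[r][c] * diff[c] for r in range(n) for c in range(n))

def block_diag(blocks):
    n = sum(len(b) for b in blocks)
    M = [[0.0] * n for _ in range(n)]
    off = 0
    for b in blocks:
        for r in range(len(b)):
            for c in range(len(b)):
                M[off + r][off + c] = b[r][c]
        off += len(b)
    return M

blocks = [[[2.0, 0.5], [0.5, 1.0]], [[1.0, 0.0], [0.0, 3.0]]]
x, y = [1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 2.0]
whole = mahalanobis(block_diag(blocks), x, y)
parts = mahalanobis(blocks[0], x[:2], y[:2]) + mahalanobis(blocks[1], x[2:], y[2:])
```

The distance under the full block-diagonal matrix equals the sum of the two per-block distances.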
5.3 Non-linear approaches

In this section we describe some experiments using non-linear algorithms. In every case the
descriptor is the multiscale SIFT computed at the location of the facial features, the same as in
the previous section but without the separation into feature-wise vectors.
5.3.1 Spectral regression kernel discriminant analysis<br />
In this case we used SR-KDA (see Section 3.6.1) to find a non-linear projection of the input
data such that discriminant information between the classes, i.e. the identities, is emphasized.
In the target space, a linear classification algorithm can be used. We compared Euclidean
distance and LDML as classifiers. The results, computed over 10-fold cross-validation, are
shown in Table 5.3.
When comparing these results with the baseline from Table 3.2, also included in Table 5.3,
the contribution of SR-KDA becomes evident. This is especially visible for Euclidean distance
classification, which presents an accuracy increase of 10%. For LDML there is a 1% gain over
the baseline. These results show that if Euclidean distance is used for classification, SR-KDA
significantly improves the accuracy; for LDML the contribution is not as large. A limitation
of SR-KDA is that it is computationally expensive for a large quantity of data and classes,
which is the case for LFW.
Table 5.3: Results using the SR-KDA projection of the input data, obtained over 10-fold cross-validation.

                       Euclidean distance   LDML
    Not using SR-KDA   0.6845 ± 0.0051      0.8524 ± 0.0052
    Using SR-KDA       0.7883 ± 0.0029      0.8622 ± 0.0056
5.3.2 Clustering

In this section we describe another non-linear algorithm we explored, in which the input data
is divided into clusters. This can be done before or after LDML learning. The intuition is that
similar faces are expected to be grouped together in a cluster. Therefore, if learning is done
specifically for that cluster, similar data might be separated in a way that was not possible
with a global training. This is a divide and conquer strategy.
Pose adaptive classifier

Following [7], described in Section 2.4, we build pose adaptive classifiers, where each pose
is considered as a cluster. To assign a pose to an unseen example, a simple approach was
implemented: three images were taken from the IMM database [25], one for each case: left (L),
right (R) and frontal (F) pose. The identity, illumination and expression remained unchanged.
For an unseen image, we assign the pose of the reference image at minimum Euclidean
distance. This is the same approach as in [7], but with a different descriptor.
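This nearest-reference pose assignment can be sketched as follows; the two-dimensional descriptors stand in for the real ones and are hypothetical:

```python
import math

# Sketch of the simple pose assignment: an unseen descriptor receives the
# pose label (L, R or F) of the nearest of three reference descriptors
# (in the thesis, taken from the IMM database). Vectors are hypothetical.

def assign_pose(x, references):
    """references: pose label -> reference descriptor."""
    def euclidean(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    return min(references, key=lambda pose: euclidean(x, references[pose]))

refs = {"L": [1.0, 0.0], "F": [0.0, 0.0], "R": [-1.0, 0.0]}
pose = assign_pose([0.9, 0.1], refs)   # closest to the left reference
```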
Once the images are clustered according to pose, an LDML classifier is trained for each
possible pair of poses, i.e. the six classifiers LL, LR (RL), LF (FL), RR, RF (FR) and FF.
Table 5.4 shows the obtained results for 5 out of 10 folds. For comparison, the accuracy of
the baseline algorithm is also shown as global learning.
Table 5.4: Results for pose adaptive classification, using 5 out of 10 folds.

                      Accuracy
    Global learning   0.8504 ± 0.0049
    Pose adaptive     0.8358 ± 0.0026
Table 5.5 shows the number of pairs from the test set assigned to each of the pose combinations,
together with the accuracy achieved for each pose combination separately. The results show
that frontal-frontal classification (FF) remained similar to that of global learning. However,
the results for the other combinations are not as good, with the worst case for pairs assigned
to the left and right (LR) classifier.
Table 5.5: Obtained pose combination accuracy. Results reported over 5 folds.

                       LL       FF       RR       LF (FL)   LR (RL)   FR (RF)
    Number of pairs    14       2302     16       310       23        290
    Accuracy           0.7833   0.8564   0.6524   0.7979    0.7517    0.8275
Unsupervised clustering

In this case the data is projected using the matrix L learned by the LDML algorithm. In the
new space we explored the different clustering strategies described in Table 5.6. Here the
objective is to train a classifier for each cluster separately, not for each pair of clusters. For
an unseen pair of examples, if the images are assigned to different clusters, they are classified
as having different identities. If they are assigned to the same cluster, the decision is made
by the LDML classifier trained for that specific cluster.
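The resulting decision rule can be sketched as below; `cluster_of` and the per-cluster classifiers are hypothetical stand-ins for the learned assignment and the cluster-specific LDML models.

```python
# Sketch of the cluster-wise decision rule: a pair assigned to different
# clusters is immediately labelled "different identity"; otherwise the
# classifier trained for that cluster decides.

def classify_pair(xi, xj, cluster_of, classifiers):
    ci, cj = cluster_of(xi), cluster_of(xj)
    if ci != cj:
        return False                   # different clusters -> negative pair
    return classifiers[ci](xi, xj)     # cluster-specific decision

cluster_of = lambda x: 0 if x[0] < 0.5 else 1     # toy 1-D assignment
classifiers = {0: lambda a, b: True, 1: lambda a, b: False}
```

The early-exit branch is exactly why split positive pairs are irrecoverable, which motivates the clustering quality measure used below.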
Table 5.6: Clustering algorithms.

    Identifier   Description
    KM           Standard k-means.
    S KM         k-means with added supervision: at the assignment step of the
                 k-means algorithm, points belonging to the same class are
                 assigned to the same cluster.
    GMM          Gaussian Mixture Model.
In terms of computation time, this represents an improvement, as the complexity of LDML is
quadratic in the number of points. Therefore, splitting the data into k clusters and training
k classifiers, each using n/k points, makes the algorithm k times faster.
However, as Table 5.7 shows, this approach did not give good results. The reported accuracy
is the ratio of positive pairs assigned to the same cluster. This value is used to measure the
performance of the clustering because positive pairs assigned to different clusters are labeled
as having different classes without possible correction. It can be considered an upper bound
on performance, and therefore the expected results are lower than those obtained with an
unclustered training. Due to the low clustering accuracy, we did not proceed to compute the
metric for each cluster.
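The clustering quality measure of Table 5.7 can be sketched as follows; the cluster assignments and positive pairs below are hypothetical:

```python
# Sketch of the clustering quality measure reported in Table 5.7: the
# fraction of positive (same-identity) pairs whose images fall in the same
# cluster. Split positive pairs are irrecoverably labelled "different", so
# this ratio upper-bounds the achievable verification accuracy.

def same_cluster_ratio(positive_pairs, assignment):
    """positive_pairs: (i, j) image-index pairs; assignment: index -> cluster."""
    same = sum(1 for i, j in positive_pairs if assignment[i] == assignment[j])
    return same / len(positive_pairs)

assignment = {0: 0, 1: 0, 2: 1, 3: 2}   # hypothetical cluster ids per image
pairs = [(0, 1), (0, 2), (2, 3)]        # hypothetical positive pairs
ratio = same_cluster_ratio(pairs, assignment)
```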
Table 5.7: Ratio of positive pairs assigned to the same cluster.

                 Number of clusters
    Algorithm    3         4         5         10        30
    KM           0.75676   0.69932   0.63851   0.59797   0.34122
    S KM         0.71622   0.67905   0.65203   0.55405   0.32095
    GMM          0.87162   0.79054   0.79392   0.59122   0.31757
Mixture model classification

The final clustering approach is to do a soft assignment using a Gaussian Mixture Model
(GMM), where the covariances are restricted to be diagonal. In this case a classifier is trained
for every combination of clusters. There is no gain in computational efficiency, but there is a
"finer" learning, which might capture information missed by a global learning. The
classification function for an unseen pair is given in Eq. (5.7).
    p(y_i = y_j | x_i, x_j) = \sum_u \sum_v p(u|x_i) p(v|x_j) p(y_i = y_j | x_i, x_j; M_{uv}, b_{uv})    (5.7)
where p(k|x) is the posterior probability for x to belong to cluster k, taken directly from
the GMM. The parameters M_{uv} and b_{uv} are learned using the set of training points that
belong either to cluster u or cluster v after a hard assignment (MAP). Table 5.8 shows the
results; comparing with the baseline, whose accuracy is 0.8672 for the first fold, we can
deduce that there is no gain in trying to make a finer classification.
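The soft-assignment decision of Eq. (5.7) can be sketched as follows; the posteriors, biases and distances below are hypothetical placeholders for the GMM outputs and the learned pairwise classifiers:

```python
import math

# Sketch of Eq. (5.7): the pairwise probability is averaged over every
# cluster combination (u, v), weighted by the GMM posteriors p(u|x_i) and
# p(v|x_j). Each (u, v) classifier is summarized by a bias and a distance.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def p_same_soft(post_i, post_j, bias, dist):
    """bias[u][v], dist[u][v]: parameters of the classifier for clusters (u, v)."""
    return sum(post_i[u] * post_j[v] * sigmoid(bias[u][v] - dist[u][v])
               for u in range(len(post_i)) for v in range(len(post_j)))

post_i, post_j = [0.7, 0.3], [0.4, 0.6]   # hypothetical GMM posteriors
bias = [[1.0, 1.0], [1.0, 1.0]]
dist = [[0.2, 2.0], [2.0, 0.1]]
p = p_same_soft(post_i, post_j, bias, dist)
```

Since the posteriors sum to one, the result stays a valid probability; with posteriors concentrated on a single cluster pair, the rule collapses to that pair's classifier.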
5.3.3 Discussion

In this section we presented some non-linear approaches for face recognition using feature
based descriptors. The experiments with SR-KDA showed that there is indeed a gain from using
Table 5.8: Accuracy obtained over 1 fold using a GMM model.

    Number of clusters   2        3        4
    Accuracy             0.8574   0.8387   0.8454
non-linear algorithms to separate the input data. A simple classification such as Euclidean
distance is improved significantly, by more than 10% in accuracy; for LDML, however, the
gain was only 1%. The computational cost is another factor to take into account, as SR-KDA
is computationally expensive for a large quantity of classes and data.
When using clustering approaches, there is a problem due to the large quantity of positive
pairs assigned to different clusters. This effect can be reduced only by diminishing the number
of clusters being considered. However, the results showed that even with as few as 2 or 3
clusters, the loss of positive pairs in the clustering is too high. The only way to overcome
this limitation is to use a soft assignment through a mixture model. In this case, the results
are similar to those of a global learning; thus, there is no gain in using this approach.
Chapter 6

Combining face representations
In previous chapters, it has been demonstrated that a good recognition rate can be achieved by
learning a proper metric with algorithms such as LDML. The feature vectors representing the
face can be either a HoG encoding or SIFT descriptors at the location of each facial feature.
In this chapter, as a last experiment, we demonstrate that the classification performance can
be improved even further by combining the distances of all the descriptors.
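The per-descriptor comparisons referred to throughout are Mahalanobis distances; a small sketch, where the metric matrix is a hypothetical stand-in for one learned by LDML:

```python
import numpy as np

# Sketch of the Mahalanobis distance d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)
# underlying each descriptor's comparison. M is a hypothetical placeholder
# for a learned metric; any positive semi-definite matrix qualifies.

def d_M(M, xi, xj):
    diff = xi - xj
    return float(diff @ M @ diff)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
M = A @ A.T                               # PSD by construction
xi, xj = rng.normal(size=4), rng.normal(size=4)
dist = d_M(M, xi, xj)                     # non-negative for PSD M
```

With M the identity this reduces to the squared Euclidean distance, which is the baseline classifier used earlier.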
To combine the descriptors, two approaches were explored, both in a logistic framework.
In the first case, a global distance is obtained by adding the distances of all descriptors; the
learned biases are combined as well, and both terms are passed through a sigmoid function.
Let x^f_i denote the feature vector of type f for face i, where f can be the SIFT descriptors
computed at 3 scales at the location of the facial features, a HoG descriptor using the
parameters from Table 4.9, or the facial feature patches described in Section 2.2. Let M_f
denote the learned metric for feature f. This approach is shown in Eq. (6.1):

    p(y_i = y_j | x^1_i, ..., x^F_i, x^1_j, ..., x^F_j) = \sigma( \sum_{f=1}^F b_f - \sum_{f=1}^F d_{M_f}(x^f_i, x^f_j) )    (6.1)
The other way to combine the features is by logistic regression. In this case the sum from
Eq. (6.1) becomes a linear combination of the distances. The weight assigned to each feature
type and the joint bias term are learned with the logistic regression algorithm. From the
training examples, a large set of pairs is created and their distances are computed using the
metrics learned in the previous experiments. This set of distances is then used as the training
data for the logistic regression, see Section 3.6.2. The decision function is given in Eq. (6.2).
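The two combination rules can be sketched side by side; all biases, weights and distances below are hypothetical:

```python
import math

# Sketch of the two combination rules: Eq. (6.1) sums biases and distances,
# Eq. (6.2) is a logistic regression over the per-descriptor distances.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def combine_distance_sum(dists, biases):            # Eq. (6.1)
    return sigmoid(sum(biases) - sum(dists))

def combine_logistic(dists, w0, weights):           # Eq. (6.2)
    return sigmoid(w0 + sum(w * d for w, d in zip(weights, dists)))

dists, biases = [0.4, 0.6], [1.0, 1.0]
p1 = combine_distance_sum(dists, biases)
p2 = combine_logistic(dists, 2.0, [-1.0, -1.0])     # reduces to Eq. (6.1)
```

With w_0 equal to the summed biases and all weights set to -1, the logistic combination reduces exactly to the distance sum, which the example illustrates.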
    p(y_i = y_j | x^1_i, ..., x^F_i, x^1_j, ..., x^F_j) = \sigma( w_0 + \sum_{f=1}^F w_f d_{M_f}(x^f_i, x^f_j) )    (6.2)

6.1 Results for LFW

Table 6.1 shows the results obtained for the LFW-Aligned Commercial dataset. For comparison,
the results for each feature trained separately are shown as well. In most cases, the accuracy
is higher than for the individual features. Based on these results, we can conclude that holistic
and feature based descriptors provide complementary information.
When combining multiscale SIFT and the facial feature patches, there was a decay in accuracy.
Notice that, for the same case, the standard deviation increased significantly. The reason for
this decay is that the weights found by logistic regression differ considerably between the
folds, although their relative proportions are maintained, i.e. as expected, multiscale SIFT is
given a larger weight than the facial feature patches. This causes a problem when a global
threshold is selected for all the folds, as done in the accuracy computation. To correct this
problem, a regularization was added such that the L2-norm of the weight vector (excluding
the bias) is constant. The results in Table 6.1 show that this strategy corrects the problem.
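One plausible reading of this regularization can be sketched as follows (a sketch, not the thesis implementation): rescale the learned weight vector to a fixed L2 norm, scaling the bias by the same factor so that the sign of every decision is preserved. The values below are hypothetical.

```python
import math

# Sketch: renormalize the logistic-regression weights (bias excluded) to a
# constant L2 norm so thresholds are comparable across folds. Scaling the
# bias by the same positive factor leaves all classifications unchanged.

def renormalize(w0, weights, target_norm=1.0):
    norm = math.sqrt(sum(w * w for w in weights))
    scale = target_norm / norm
    return w0 * scale, [w * scale for w in weights]

w0, w = renormalize(3.0, [3.0, 4.0])   # original weight norm is 5.0
```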
Notice that the results are not very different between distance sum and logistic regression.
Especially for the HoG and multiscale SIFT combination, the reason is that logistic regression
assigns the same weights to both descriptors. When the facial feature patches descriptor was
added, the learning process assigned it a low weight, reducing its contribution. This was
confirmed in our experiments.
It is important to remark that the highest gain comes from the combination of the HoG
descriptor (holistic) with the multiscale SIFT (facial feature based). The facial feature

Table 6.1: Results for the combination of descriptors in the LFW benchmark. Descriptors: used (+), not used (−).

    SIFT         HoG         Feature    Distance           LR                 LR
    multiscale   (squared)   patches    sum                                   (regularized)
    +            −           −          0.8524 ± 0.0052
    −            +           −          0.8530 ± 0.0065
    −            −           +          0.7385 ± 0.0046
    −            +           +          0.8607 ± 0.0054    0.7901 ± 0.0119    0.8600 ± 0.0060
    +            −           +          0.8536 ± 0.0053    0.8154 ± 0.0152    0.8515 ± 0.0052
    +            +           −          0.8766 ± 0.0050    0.8749 ± 0.0049    0.8759 ± 0.0052
    +            +           +          0.8719 ± 0.0058    0.8746 ± 0.0047    0.8724 ± 0.0059
[Figure: ROC curves (true positive rate vs. false positive rate) for "SIFT_FF + HoG, aligned", "LDML+MKNN, funneled" and "Multishot combined, aligned".]
Figure 6.1: Receiver operating characteristic curve for the HoG and SIFT combination on the
LFW benchmark. Results for [16] and [36] are also shown.
patches do not bring any contribution, as the information they represent is already encoded
by the multiscale SIFT descriptor. The performance of this algorithm is among the state of
the art for the LFW benchmark. The ROC curve for the HoG and SIFT combination is shown
in Fig. 6.1.
Our results are comparable with the performance of state-of-the-art algorithms reported for
the unrestricted paradigm of LFW. These methods achieved accuracies of 0.8517 ± 0.0061 [36],
0.8750 ± 0.0040 [16] and 0.8950 ± 0.0051 [36].
6.2 Results for PubFig

We tested the algorithm on the PubFig dataset [22]. In this case, however, the pipeline
included face detection and a facial feature based alignment prior to the computation of the
descriptors. The facial feature patches were discarded due to the results obtained for LFW.
A problem is that our results are not directly comparable to those reported in [22], because
some images have been removed from their original locations. For that reason we do not follow
the training protocol, which is defined as a "restricted" paradigm. We train using the label
information, so that many more pairs can be generated for training. However, we keep using
10-fold cross-validation for evaluation.
We take advantage of the separation of sets according to illumination, expression and pose,
which allows us to observe how sensitive our algorithm is to these factors. The training images
are the same, but we test only on the specified benchmark.
Table 6.2 shows the results for all the PubFig benchmarks; the combination algorithm is the
distance sum. Logistic regression for the combination of features gave practically the same
results; again, the learned weights are practically the same.
From the results it can be concluded that our algorithm is sensitive to pose changes, as
there is a difference of almost 5% between the posefront and poseside benchmarks. The same
happens with the light benchmarks, with a difference of almost 4% between lightfront and
lightside. In the case of expression, there was almost no difference. The ROC curves are
illustrated in Fig. 6.2.
Table 6.2: Results for the different variants of the PubFig dataset.

    Dataset               Accuracy
    pubfig full           0.7763 ± 0.0068
    pubfig posefront      0.8111 ± 0.0139
    pubfig poseside       0.7656 ± 0.0108
    pubfig lightfront     0.7875 ± 0.0125
    pubfig lightside      0.7485 ± 0.0080
    pubfig exprneutral    0.7733 ± 0.0128
    pubfig exprexpr       0.7759 ± 0.0072
[Figure: ROC curves (true positive rate vs. false positive rate) for the seven PubFig benchmarks listed in Table 6.2.]
Figure 6.2: Receiver operating characteristic curves for the combination of HoG and SIFT
descriptors on the PubFig benchmarks.
Chapter 7

Conclusions and future work
In this thesis we compared two robust descriptors for face recognition in uncontrolled settings.
The first one is the Histogram of Oriented Gradients, computed over the entire face, and
therefore a holistic approach. For the HoG descriptor we found a suitable set of parameters
which gave good performance on the Labeled Faces in the Wild benchmark. We concluded
that the use of face alignment is crucial when combined with a metric learning algorithm such
as LDML. The alignment must be robust in terms of translations, so that the facial features of
the pair of images being compared are localized in approximately the same spatial cell. The
coordinates of the facial features can be used to obtain a transformation which aligns the face
to the desired pose. However, the alignment algorithm and/or the facial feature point
localization still need improvement.
The second visual feature vector we studied is a multiscale SIFT descriptor computed at the
location of the facial features, and therefore a feature based approach. This strategy gave good
performance when combined with LDML. We concluded that it is possible to train separately
for each facial feature and then combine the distances to make a joint decision. Even though
the results were not as good as those of a global learning, this opened the door to handling
occlusions. We obtained a confidence value for each facial feature from a discriminative
appearance model; it measures how reliable the descriptor information is, i.e. how likely the
feature is neither occluded nor badly localized. The confidence value was successfully
integrated into the decision function, which increased the accuracy.
We also studied non-linear methods, from which we did not obtain good results for the
clustering strategies, whether based on pose, on unsupervised clustering or on a Gaussian
mixture model. However, algorithms such as SR-KDA are able to find non-linear discriminant
information in the data; SR-KDA achieved a slight increase in performance, at the expense of
being more computationally expensive. These results show that it would be interesting to go
further into studying other types of non-linear algorithms.
Finally, we demonstrated that the distances given by different descriptors can be integrated
to boost the performance of the face recognition pipeline. The obtained performance is
comparable with the state of the art on the Labeled Faces in the Wild and Public Figures
benchmarks.
The PubFig benchmark shows that our algorithm is highly sensitive to pose and illumination
changes. In the case of illumination, this means that the normalization being used is not
sufficiently invariant to this factor, and it is therefore necessary to address this issue.
Bibliography<br />
[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face recognition with local binary patterns.
In European Conference on Computer Vision (ECCV), pages 469–481, 2004.
[2] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach,<br />
2000.<br />
[3] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition<br />
using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine<br />
Intelligence, 19:711–720, 1997.<br />
[4] C. Bishop. Pattern recognition and machine learning (Information Science and Statistics).<br />
Springer, 1st ed. 2006. corr. 2nd printing edition, October 2007.<br />
[5] G. Bradski. The OpenCV library. Dr. Dobb’s Journal of Software Tools, 2000.<br />
[6] Deng Cai. Efficient kernel discriminant analysis via spectral regression. Technical report,<br />
2007.<br />
[7] Z. Cao, Q. Yin, X. Tang, and Jian S. Face recognition with learning-based descriptor. In<br />
Proc. Computer Vision and Pattern Recognition, 2010.<br />
[8] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow<br />
and appearance. In European Conference on Computer Vision, 2006.<br />
[9] J. Davis, B. Kulis, S. Sra, and I. Dhillon. Information-theoretic metric learning. In in<br />
NIPS 2006 Workshop on Learning to Compare Examples, 2007.<br />
[10] M. Everingham, J. Sivic, and A. Zisserman. "Hello! My name is... Buffy" – automatic
naming of characters in TV video. In BMVC, 2006.
[11] M. Everingham, J. Sivic, and A. Zisserman. Taking the bite out of automatic naming of<br />
characters in TV video. Image and Vision Computing, 27(5), 2009.<br />
[12] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. Int. J.<br />
Comput. Vision, 61(1):55–79, 2005.<br />
[13] A. Ferencz, E. Learned-Miller, and J. Malik. Learning hyper-features for visual identifica-<br />
tion. In Neural Information Processing Systems, volume 18, 2004.<br />
[14] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an
application to boosting. In European Conference on Computational Learning Theory, pages
23–37, 1995.
[15] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Automatic face naming with
caption-based supervision. In Conference on Computer Vision & Pattern Recognition,
pages 1–8, June 2008.
[16] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for<br />
face identification. In International Conference on Computer Vision, sep 2009.<br />
[17] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant<br />
mapping. In CVPR ’06: Proceedings of the 2006 IEEE Computer Society Conference on<br />
Computer Vision and Pattern Recognition, <strong>page</strong>s 1735–1742, Washington, DC, USA, 2006.<br />
IEEE Computer Society.<br />
[18] G. Huang and V. Jain. Unsupervised joint alignment of complex images. In In ICCV,<br />
2007.<br />
[19] G. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A<br />
database for studying face recognition in unconstrained environments. Technical Report<br />
07-49, University of Massachusetts, Amherst, October 2007.<br />
[20] A. Kläser. Human detection and character recognition in TV-style movies. In Informatiktage,
pages 151–154, 2007.
[21] N. Kumar, P. N. Belhumeur, and S. K. Nayar. FaceTracer: A search engine for large
collections of images with faces. In European Conference on Computer Vision (ECCV),
pages 340–353, October 2008.
[22] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers<br />
for face verification. In IEEE International Conference on Computer Vision (ICCV), Oct<br />
2009.<br />
[23] D. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision,<br />
60(2):91–110, 2004.
[24] B. Moghaddam. Bayesian face recognition. Pattern Recognition, 33(11):1771–1782, Novem-<br />
ber 2000.<br />
[25] M. M. Nordstrøm, M. Larsen, J. Sierakowski, and M. B. Stegmann. The IMM face database<br />
- an annotated dataset of 240 face images. Technical report, Informatics and Mathematical<br />
Modelling, Technical University of Denmark, DTU, Richard Petersens Plads, Building 321,<br />
DK-2800 Kgs. Lyngby, may 2004.<br />
[26] E. Nowak and F. Jurie. Learning visual similarity measures for comparing never seen<br />
objects. In Conference on Computer Vision & Pattern Recognition, jun 2007. see also<br />
http://lear.inrialpes.fr/people/nowak/.<br />
[27] P. J. Phillips, P. Flynn, T. Scruggs, K. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min,
and W. Worek. Overview of the face recognition grand challenge. pages 947–954, 2005.
[28] P. J. Phillips, W. T. Scruggs, A. O'Toole, P. Flynn, K. Bowyer, C. Schott, and M. Sharpe.
FRVT 2006 and ICE 2006 large-scale experimental results. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 32:831–846, 2010.
[29] S. Phimoltares, C. Lursinsap, and K. Chamnongthai. Face detection and facial feature<br />
localization without considering the appearance of image context. Image Vision Comput.,<br />
25(5):741–753, 2007.<br />
[30] N. Pinto, J. J. di Carlo, and D. D. Cox. Establishing good benchmarks and baselines for<br />
face recognition. In Faces in real life images workshop at ECCV08, 2008.<br />
[31] N. Pinto, J.J. DiCarlo, and D.D. Cox. How far can you get with a modern face recognition<br />
test set using only simple features? Computer Vision and Pattern Recognition, IEEE<br />
Computer Society Conference on, 0:2591–2598, 2009.<br />
[32] F. Porikli. Integral histogram: A fast way to extract histograms in Cartesian spaces. In
Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 829–836, 2005.
[33] S. Rizvi, P. J. Phillips, and H. Moon. The FERET verification testing protocol for face<br />
recognition algorithms, 1999.<br />
[34] J. Shi and C. Tomasi. Good features to track, 1994.<br />
[35] J. Sivic, M. Everingham, and A. Zisserman. “Who are you?”: Learning person specific<br />
classifiers from video. In Proceedings of the IEEE Conference on Computer Vision and<br />
Pattern Recognition, 2009.
[36] Y. Taigman, L. Wolf, and T. Hassner. Multiple one-shots for utilizing class label informa-<br />
tion. In The British Machine Vision Conference (BMVC), Sept. 2009.<br />
[37] X. Tan and B. Triggs. Enhanced local texture feature sets for face recognition under<br />
difficult lighting conditions. In Analysis and modelling of faces and gestures, volume 4778<br />
of LNCS, <strong>page</strong>s 168–182. Springer, oct 2007.<br />
[38] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience,<br />
3(1):71–86, 1991.<br />
[39] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features.<br />
Proc. CVPR, 1:511–518, 2001.<br />
[40] X. Wang and X. Tang. A unified framework for subspace face recognition. IEEE Trans.<br />
Pattern Anal. Mach. Intell., 26(9):1222–1228, 2004.<br />
[41] K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor<br />
classification. J. Mach. Learn. Res., 10:207–244, 2009.<br />
[42] L. Wolf, T. Hassner, and Y. Taigman. Descriptor based methods in the wild. In Real-Life<br />
images workshop at the European Conference on Computer Vision (ECCV), October 2008.<br />
[43] L. Wolf, T. Hassner, and Y. Taigman. Similarity scores based on background samples. In<br />
Asian Conference on Computer Vision (ACCV), 2009.<br />
[44] M. Yang. Face recognition using kernel methods, 2001.<br />
[45] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: a literature<br />
survey. ACM Comput. Surv., 35(4):399–458, 2003.<br />
[46] J. Zhu, L. Van Gool, and S. Hoi. Unsupervised face alignment by robust nonrigid mapping.<br />
In IEEE International Conference on Computer Vision, 2009.<br />
[47] Q. Zhu, M. Yeh, K. Cheng, and S. Avidan. Fast human detection using a cascade of<br />
histograms of oriented gradients. In CVPR ’06: Proceedings of the 2006 IEEE Computer<br />
Society Conference on Computer Vision and Pattern Recognition, <strong>page</strong>s 1491–1498, Wash-<br />
ington, DC, USA, 2006. IEEE Computer Society.