
Robust Face Descriptors in Uncontrolled Settings

Kenneth Alberto Funes Mora

LEAR Team
INRIA Rhône-Alpes

Supervisors
Cordelia Schmid, Jakob Verbeek and Matthieu Guillaumin

A Thesis Submitted for the Degree of
MSc Erasmus Mundus in Vision and Robotics (VIBOT)

· 2010 ·


Abstract

Face recognition is known to be a difficult problem for the computer vision community. Factors such as pose, expression, illumination conditions and occlusions, among others, span a very large set of images that can be generated by a single person. Therefore the automatic decision of whether a pair of images depicts the same person or not, in uncontrolled settings, becomes a highly challenging problem.

Due to the large number of potential applications, many algorithms have been proposed over the past years, which can be separated into three categories: holistic, facial feature based and hybrid. Even though some algorithms have achieved a high accuracy, a significant improvement is still needed to achieve robustness in uncontrolled conditions while maintaining a high computational efficiency.

In this thesis we explore the use of a Histogram of Oriented Gradients as a holistic descriptor. The experimental results show that considerable performance is achieved when a proper set of parameters is combined with a prior face alignment. The classification function is given by a metric learning algorithm, i.e. an algorithm which finds the best Mahalanobis distance that separates the input data.

Additionally, a facial feature based descriptor is presented, which is the concatenation of SIFT descriptors computed at the locations of interest points found by a facial feature detection algorithm. More importantly, a method to handle occlusions is proposed, where a confidence is obtained for each facial feature and later combined into the classification function. Non-linear strategies for face recognition are also discussed.

Finally, it is shown that there is complementary information between both descriptors, as their combination improves the performance such that it becomes comparable to the current state of the art algorithms.


Contents

Acknowledgments

1 Introduction
  1.1 Problem definition
  1.2 Outline and contributions

2 Related work
  2.1 Marginalized k-Nearest Neighbors
  2.2 Automatic naming of characters in TV video
  2.3 Attribute and simile descriptor for face identification
  2.4 Face recognition with learning based descriptor
  2.5 Multiple one-shots using label class information

3 The face recognition pipeline
  3.1 Face detection
  3.2 Facial features localization
  3.3 Face alignment
  3.4 Preprocessing for illumination invariance
  3.5 Face descriptor
  3.6 Learning/Classification
  3.7 Datasets and evaluation
  3.8 Baseline performance

4 Histogram of Oriented Gradients for face recognition
  4.1 Motivation
  4.2 Alignment comparison
  4.3 HoG parametric study
  4.4 Discussion

5 Facial feature based representations
  5.1 Motivation
  5.2 Feature wise classification
  5.3 Non-linear approaches

6 Combining face representations
  6.1 Results for LFW
  6.2 Results for PubFig

7 Conclusions and future work

Bibliography


Acknowledgments

First of all I want to thank all the people who brought the Vibot program into existence and who every year work very hard for its improvement: the coordinators Fabrice Meriadeau, David Fofi, Joaquim Salvi, Jordi Freixenet, Robert Martí and Yvan Petillot, and every single one of the lecturers and administrative staff. Without your effort and initiative we would not be here.

To my supervisors: Cordelia Schmid, Jakob Verbeek and Matthieu Guillaumin. I am very thankful for being welcomed into the LEAR team, and for your valuable guidance, which helped me to grow in knowledge and experience. Thanks as well to all the members of the LEAR team, for their friendship and for making these months a very gratifying and enriching experience.

To all my Vibot colleagues: I have learned so many things from every single one of you, from the cultures you were representing, your different world views and your experience. It is one of the things I will never forget about the Vibot program. It helped me to grow in more ways than I can express. The world is a small place, but it contains great people. Your friendship will always be alive and I hope we will meet again in the future.

I want to thank my friends at home, who have been in contact with me all this time. Always willing to listen, always willing to advise, always willing to talk. Definitely, a true friend is not separated by distance. You guys know who you are. . .

I want to thank my family, my parents Carlos Funes and Ruth Mora, and my brother Michael Funes, for their support from afar and encouraging words in moments of need. Dad, Mom, Michael, I love you enormously! Thank you!

I would like to thank my God and saviour Jesus Christ; you are the source of my strength and my motivation, you take me by the hand when I need it the most. Thank you. . .

Last but not least, to the European Commission for funding my studies during these two years.


Chapter 1

Introduction

Face Recognition can be divided into two main applications: Face Identification and Face Verification. The former refers to the association between a set of probe faces and a gallery, in order to determine the identity of each of the exemplars from the probe set. The latter refers to the decision of whether a pair of face instances corresponds to the same person or not. This definition is different from that of visual identification [13], where the term identification is used for the pair matching problem. Note that face verification is the more general problem, in the sense that the face identification task can be formulated by solving face verification subproblems.

Within this thesis we focus on face verification. Therefore, the goal is to design an algorithm to automatically decide whether a pair of unseen face images depicts the same person or not. It is a supervised classification problem, in which the decision function is trained based on a set of example faces labeled with identities, or pairs of face images labeled as similar or dissimilar.

The availability of a solution to this problem is highly attractive for its many applications. It comprises fields such as entertainment, smart cards, information security, law enforcement, surveillance, etc. [45]. Within the context of scene interpretation, we want to be able to automatically determine what is happening in an image or a video [35]. Face recognition is highly valuable as it helps to answer the question of who is in the scene [11,20]. This opens the possibility for applications such as categorization, retrieval and indexing based on identity [15,16]. The use of face recognition technology is becoming more and more visible, e.g. the recent launch of tools for automatic face labeling in sites such as Picasa¹.

More than 35 years of research have generated many algorithms [1,3,7,11,15,16,22,26,35,36,38,40,42,45] and benchmarks [19,22,27,33], which have pushed face recognition to achieve outstanding results; a proof is the current availability of commercial software [28]. In general, this software is designed for the case in which the person cooperates in the image acquisition in a controlled environment, and therefore there are no major changes in illumination, pose, expression, etc. However, face recognition in uncontrolled settings, from still images and videos, is still an unsolved problem. Despite the large amount of research carried out, a significant improvement is still required in order to achieve robustness in such settings.

¹ Picasa Web Albums, http://picasaweb.google.com/

Figure 1.1: Face variations due to: (a) viewpoint changes (b) illumination variation (c) occlusions (d) expression (e) age variations (f) image quality

The main challenge is that a single person can virtually generate an infinite number of images. This is due to the many factors that influence the image acquisition. Among the most important are: major pose or viewpoint changes, including scaling differences, variations in the illumination conditions, the possibility of occlusions due to sunglasses, hats and other objects, differences in expression, aging, changes in hair and facial hair, and image quality. Figure 1.1 shows examples of how these factors affect the resulting image.

1.1 Problem definition

Even though many algorithms can be found in the literature, a general pipeline can be identified, shown in Fig. 1.2. Its steps are intended to overcome the challenges previously mentioned. Face detection is the first step; it defines a bounding box for the location and scale of the face. Then three optional steps can be applied: alignment, facial feature localization and/or preprocessing to gain invariance to illumination. The goal is to build a visual descriptor that can be used as the input for machine learning algorithms. These algorithms are capable of classifying a pair of examples as belonging to the same individual or not. Three categories of algorithms can be identified: holistic, feature based and hybrid approaches.


Figure 1.2: General face recognition pipeline

Holistic face description methods consider the face image as a whole to build the descriptor. Examples of such approaches are the subspace learning algorithms, where a face is represented as a point in a high dimensional space, with the intensity of each pixel as one dimension, followed by the use of techniques such as Principal Component Analysis (Eigenfaces) [38] or Linear Discriminant Analysis (Fisherfaces) [3]. In such cases, the objective is to project the data into a lower dimensional space where, when computing the projection matrix, most of the information is maintained (PCA) or the discriminant information between different classes (people) is emphasized (LDA). Bayesian methods also fall into this category, referring to those methods that generate a Maximum a Posteriori (MAP) estimation of an intrapersonal/extrapersonal classifier [24].

Additionally, proposals have been presented to unify Bayesian approaches with Eigenfaces and Fisherfaces [40]. These algorithms have been shown to provide good results under controlled conditions, using benchmarks such as the FERET database [33]. However, they are not suitable for uncontrolled settings, where high non-linearities are introduced, e.g. as a result of major pose changes, and they are sensitive to the localization given by the face detector.

Proposals have been presented to improve the performance in uncontrolled conditions, by creating more complex descriptors than simply the set of pixel values, e.g. using Local Binary Patterns [1], by extending subspace learning to handle non-linear data using the kernel trick [6,44], or through methods specialized in non-linear dimensionality reduction, e.g. by learning an invariant mapping [17]. In this thesis, a holistic approach based on Histogram of Oriented Gradients (HoG) will be presented in Chapter 4.

Feature based face description algorithms are grounded in the localization of a set of facial features, such as the position of the mouth, the eyes, the nose, etc., after face detection [11,29]. A descriptor is built using the location information. In the past years, algorithms based on facial feature localization have gained growing attention [7,10,11,16,22], as they are less sensitive to pose variations and misalignments introduced by the face detector.

Therefore they are appropriate for face recognition tasks in uncontrolled settings. However, the facial feature localization itself is still problematic and needs further improvements. In this thesis a feature-based algorithm using multiscale SIFT [16,23] will be presented, and compared to the holistic approach based on HoG descriptors.

Hybrid face description methods combine the holistic and feature based paradigms, through either early or late fusion. Early fusion refers to the case in which descriptors are combined into one using aggregation methods, such as concatenation of the feature vectors; in this case, the information is combined prior to classification. Late fusion performs a classification based on each descriptor, and the corresponding scores are combined into one to make a more robust decision. In this thesis we use a late fusion method, which combines the HoG and multiscale SIFT descriptors.
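As an illustration of the late fusion idea, the following minimal sketch merges the outputs of two hypothetical per-descriptor classifiers into a single verification decision; the weights and threshold are placeholders, not the combination actually learned later in this thesis.

```python
def late_fusion(score_hog, score_sift, w_hog=0.5, w_sift=0.5, threshold=0.0):
    """Combine per-descriptor verification scores into one decision.

    Each score is the signed output of a classifier trained on a single
    descriptor; the weighted sum is thresholded to decide same/not-same.
    Weights and threshold here are illustrative placeholders.
    """
    fused = w_hog * score_hog + w_sift * score_sift
    return fused, fused > threshold

# Toy usage: the HoG classifier is confident, the SIFT classifier mildly negative.
fused_score, same_person = late_fusion(1.2, -0.1)
print(fused_score, same_person)
```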

1.2 Outline and contributions

In Chapter 2 different state of the art algorithms are described in detail. These were identified as being the current state of the art for challenging benchmarks such as the Labeled Faces in the Wild [19] dataset, or because they were an important influence on our work. Chapter 3 gives a detailed description of the face recognition pipeline from Fig. 1.2. Each of the stages is described, together with algorithms for its implementation.

The first contribution is given in Chapter 4, where we explore the use of a Histogram of Oriented Gradients descriptor for face recognition. We show in this chapter that an alignment robust to translations is necessary to obtain a good performance. Furthermore, we identify the set of parameters for which the highest accuracy is achieved.

Our second contribution, described in Chapter 5, is related to feature based algorithms. We propose a strategy in which learning is done for each facial feature, after which we combine them through late fusion. Even though this does not help the overall performance, it is useful for handling occlusions. This is done by detecting outliers based on a discriminative appearance model. The occlusion information is later inserted into the classification function.

The third contribution is shown in Chapter 6, where we combine the HoG and multiscale SIFT representations through late fusion. This combination increases the performance of the algorithm such that it is comparable to the state of the art. Finally, in Chapter 7, we give a summary of our work pointing out the main conclusions, from which we define our future work.


Chapter 2

Related work

In Chapter 1 different face recognition algorithms were mentioned. We identified a few methods that have given promising results in uncontrolled settings and are recognized as the state of the art. These algorithms are described in more detail in this chapter.

2.1 Marginalized k-Nearest Neighbors

Guillaumin et al. proposed the use of metric learning approaches for face recognition [16], more specifically Logistic Discriminant Metric Learning (LDML), an algorithm that searches for the best Mahalanobis distance between pairs of feature vectors, explained in more detail in Section 3.6.2.

Even though LDML has proven to be effective, any metric learning algorithm will generate a linear transformation of the input space. However, face recognition data is believed to be highly non-linear, due to major changes in pose and expression. Therefore, metric learning approaches might not be able to effectively separate the classes. To overcome this problem, Guillaumin et al. proposed a modification of k-Nearest Neighbors (k-NN). In k-NN classification, an unseen example is assigned to the most frequent class among its k neighbors, which are defined according to some measure, e.g. minimum Euclidean distance.

Let $n_c^i$ denote the number of neighbors of $x_i$ belonging to class $c$. Then the probability of $x_i$ belonging to class $c$ is estimated as $p(y_i = c \mid x_i) = n_c^i / k$. The proposal is to classify the pair $(x_i, x_j)$ as belonging to the same class by marginalizing over all the possible classes within the training set, as shown in Eq. (2.1):

$$p(y_i = y_j \mid x_i, x_j) = \sum_c p(y_i = c \mid x_i)\, p(y_j = c \mid x_j) = \frac{1}{k^2} \sum_c n_c^i\, n_c^j \qquad (2.1)$$


This result can be thought of as a binary k-Nearest Neighbors classifier in the implicit space of N² pairs. This can be observed in Fig. 2.1, where for each point of the pair to be classified, its k neighbors are selected and the vote is given by all the pairs that can be generated from their neighbors, divided by the number of possible pairs, cf. Eq. (2.1).

The descriptors used in [16] were Local Binary Patterns (LBP) [42] and SIFT [23], computed at 3 scales at the locations given by the facial feature localization algorithm, i.e. the corners of the eyes, nose and mouth. The metric used to define the neighborhood was given by Large Margin Nearest Neighbors [41], an algorithm designed to find a metric specifically optimized for the k-NN problem.

Figure 2.1: Marginalized k-Nearest Neighbors [16]
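To make Eq. (2.1) concrete, the following minimal sketch computes the marginalized k-NN pair probability with a brute-force neighbor search; plain Euclidean distance stands in for the Large Margin Nearest Neighbors metric used in [16].

```python
import numpy as np

def mknn_same_prob(x_i, x_j, X_train, y_train, k=10):
    """Marginalized k-NN pair probability, Eq. (2.1).

    p(y_i = y_j | x_i, x_j) = (1 / k^2) * sum_c n_c^i * n_c^j,
    where n_c^i is the number of neighbors of x_i with class c.
    Plain Euclidean distance is used here; [16] uses an LMNN metric.
    """
    def neighbor_classes(x):
        d = np.linalg.norm(X_train - x, axis=1)
        return y_train[np.argsort(d)[:k]]

    cls_i = neighbor_classes(x_i)
    cls_j = neighbor_classes(x_j)
    classes = np.union1d(cls_i, cls_j)
    n_i = np.array([(cls_i == c).sum() for c in classes])
    n_j = np.array([(cls_j == c).sum() for c in classes])
    return float(n_i @ n_j) / k**2

# Toy usage with random descriptors and identities.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 16))
y_train = rng.integers(0, 5, size=100)
print(mknn_same_prob(X_train[0], X_train[1], X_train, y_train, k=10))
```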

2.2 Automatic naming of characters in TV video

Everingham et al. [11] considered the problem of automatic naming of characters in video. They combined information such as subtitles and scripts to determine which characters are present in the scene and when. Using visual information, they are able to associate a name to each character for certain tracks. These tracks are also used to generate a set of training examples for a face recognition algorithm, which is then used to determine the identity of characters in the remaining unlabeled tracks.

In this case, the problem is simpler in terms of face recognition: tracking can be used to associate faces over a sequence of frames. Moreover, video can easily generate a large amount of training examples, and generally there is a small number of characters to recognize.

The first step is to align the script (dialogue-character) with the subtitles (dialogue-timing) to determine which characters are talking and when. Then they proceed to obtain face tracks, i.e. face detections linked as the same person over a group of not necessarily sequential frames. This is done using a Kanade-Lucas-Tomasi (KLT) tracker [34]: an interest point detector is applied to the first frame and the points are then propagated over the following frames. Based on the tracked interest points, which follow paths intersecting face detections, the face tracks are obtained, as seen in Fig. 2.2a. The face tracking is done separately for each shot of the whole video, where a change of shot is detected by thresholding the difference of color histograms between successive frames. Notice that this simplifies the problem of face matching and no real face recognition is done yet.

Figure 2.2: (a) Example of face tracking to build the training set (b) Feature patch extraction [11]

In order to build a face descriptor, the facial feature detector described in detail in Section 3.2 is used. The pixel values surrounding each localization are extracted, as shown in Fig. 2.2b, and normalized to have zero mean and unit variance in order to acquire photometric invariance. Using the localization of the mouth, speaker detection is performed, simply by computing the variation of the mouth pixels in sequential frames and thresholding. In addition to facial information, clothing information is used, with a color histogram computed for a bounding box below the face detection. Finally, knowing which face track is speaking and associating it with the script and subtitle information, a set of face tracks can be properly labeled with an identity. These tracks can be used as training examples for a classification problem, in order to label the rest of the face tracks that could not be labeled in the previous steps.

To label the rest of the face tracks, a similarity measure comparing two characters combines facial and clothing information, as given in Eq. (2.2):

$$S(p_i, p_j) = \exp\left(-\frac{d_f(p_i, p_j)}{2\sigma_f^2}\right)\exp\left(-\frac{d_c(p_i, p_j)}{2\sigma_c^2}\right) \qquad (2.2)$$

Taking into account this similarity measure, a classification based on Nearest Neighbors or Support Vector Machines can be used to label the rest of the face tracks in the video. More details can be found in [11].
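A minimal sketch of Eq. (2.2), assuming the facial and clothing distances d_f and d_c have already been computed, and with placeholder bandwidths σ_f and σ_c:

```python
import numpy as np

def track_similarity(d_face, d_cloth, sigma_f=1.0, sigma_c=1.0):
    """Similarity of Eq. (2.2): product of Gaussian kernels on the face
    descriptor distance d_f and the clothing histogram distance d_c.
    The bandwidths sigma_f and sigma_c are dataset-dependent choices."""
    return np.exp(-d_face / (2.0 * sigma_f**2)) * np.exp(-d_cloth / (2.0 * sigma_c**2))

print(track_similarity(d_face=0.4, d_cloth=1.3))
```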

Table 2.1: Low-level feature parameters for a single trait classifier

Pixel Value Types       Normalization               Aggregation
RGB (r)                 None (n)                    None (n)
HSV (h)                 Mean-Normalization (m)      Histogram (h)
Image Intensity (i)     Energy-Normalization (e)    Statistics (s)
Edge Magnitude (m)
Edge Orientation (o)

2.3 Attribute and simile descriptor for face identification

The work presented by Kumar et al. [22] achieves some of the best results for the Labeled Faces in the Wild benchmark when using the "restricted" protocol (explained in Section 3.7.1). They presented two separate strategies: the attribute and the simile classifier.

2.3.1 Attribute descriptor

The attribute classifier algorithm is based on the idea that a person's identity can be inferred from a set of high level attributes, such as gender, age, race, etc. The result is a descriptor with one entry per attribute, as shown in Fig. 2.3a. Each trait is determined using the algorithm in [21]: the face image is divided into regions, as shown in Fig. 2.3c. The aim is to have a set of low level features, each created by combining a region with a specific pixel value type, normalization and aggregation. The options are listed in Table 2.1. The selection of which combinations to use is trait dependent.

Kumar et al. proposed to use forward feature selection to determine which low-level features to select for a given trait. Then an SVM classifier with an RBF kernel is trained on the concatenation of the useful low-level features. In [22], the low level descriptor is defined as F(I) = 〈f1(I), f2(I), ..., fk(I)〉, where fi(I) represents feature i of image I, a selection from Table 2.1. The attribute descriptor is built using the output of the trait classifiers as xi = 〈C1(F(Ii)), C2(F(Ii)), ..., Cn(F(Ii))〉. Finally, the recognition function is given in Eq. (2.3):

f(Ii, Ij) = D(xi, xj)    (2.3)

where D(xi, xj) is a classification function, described in Section 2.3.3, such that the output is positive for the same identity and negative for different identities.


Figure 2.3: (a) Descriptor based on high level attributes (b) Training examples for the attributes (c) Face regions for the attribute classifiers [21]

2.3.2 Simile descriptor

A problem with the attribute classifier is that a significant amount of annotation must be done, and only traits that can be described with words, such as gender, can be used. Simile descriptors are based on the intuition of describing a person through similarities with reference individuals, for example: "nose similar to subject 1" and "mouth not similar to subject 2". To create such a description, a set of reference face images was collected. A classifier is trained based on at least 600 positive examples for each feature and at least 10 times more negative examples. The final descriptor is depicted in Fig. 2.4a, while Fig. 2.4b shows some training examples.

Figure 2.4: (a) Descriptor based on similarity of features (b) Training examples for the features

For a pair of unseen examples, their respective simile feature vectors, xi and xj, are computed. Then a classifier is used to make the decision of whether they depict the same person:

f(Ii, Ij) = D(xi, xj)    (2.4)

2.3.3 Verification classifier

Both Eq. (2.3) and Eq. (2.4) use the same algorithm, which is a Support Vector Machine classifier optimized to give higher importance to the sign than to the absolute value of the entries of the descriptor. This is done based on the observation that the trait classifiers are designed to produce binary outputs, in the range [−1, 1].

To do so, they proposed to generate pairs pi = (|ai − bi|, ai·bi) g((ai + bi)/2), where ai = Ci(I1), bi = Ci(I2) and g(z) is a Gaussian weighting. The concatenation of all the pairs generates the feature vector that is used for an SVM classifier with an RBF kernel. Even though these algorithms have both achieved outstanding results for Labeled Faces in the Wild, they do not follow the strict evaluation protocol, as they use training data not available in the Labeled Faces in the Wild dataset. They also have the disadvantage of using a large set of classifiers just to build the descriptor, which is not desirable in terms of computational efficiency.
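The following sketch illustrates the pair representation described above; since the exact form of the Gaussian weighting g is not detailed here, a zero-centered Gaussian with an assumed width is used for illustration.

```python
import numpy as np

def pair_features(a, b, sigma=1.0):
    """Build the pair representation of Section 2.3.3.

    For each trait i, the pair (|a_i - b_i|, a_i * b_i) is weighted by a
    Gaussian g((a_i + b_i) / 2); all pairs are concatenated into one vector.
    The center and width of the Gaussian are assumptions here.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    g = np.exp(-((a + b) / 2.0) ** 2 / (2.0 * sigma**2))   # Gaussian weighting
    feats = np.stack([np.abs(a - b) * g, (a * b) * g], axis=1)
    return feats.ravel()   # this vector would be fed to an RBF-kernel SVM

# Toy usage: trait classifier outputs in [-1, 1] for two images.
a = np.array([0.8, -0.3, 0.1])
b = np.array([0.7, -0.5, -0.9])
print(pair_features(a, b))
```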

2.4 Face recognition with learning based descriptor

Recently, Cao et al. [7] introduced a novel method which is comparable to the best performing algorithms for Labeled Faces in the Wild. It brings two main contributions: the first is that there is no manually defined descriptor; instead, a proper encoding is learned specifically for facial images, in an unsupervised manner. The second contribution consists of a pose-dependent classification.

As illustrated in the top part of Fig. 2.5b, the descriptor is learned as follows: a sampling method is defined in which, for every pixel, its neighbors are retrieved in a predefined pattern to form a low level vector. Examples can be observed in Fig. 2.5a, where different options for patterns are presented. The sampling is done for every pixel in the image, for all the images in the training set, and therefore each pixel will have an associated low level feature vector.

A vector quantization algorithm is then used, which might be K-Means, PCA-tree or random-projection tree; empirically they found that the random-projection tree gives better results.

Figure 2.5: (a) Sampling patterns for the learning-based descriptor: neighboring pixels are sampled in a circular pattern [7] (b) Face recognition with the learning-based descriptor [7]. The top part shows the pipeline used to learn the face encoding (descriptor); the bottom part shows the overall pipeline, including the pose-adaptive recognition.

The quantization transforms the low level features into a single code, as shown in Fig. 2.5b, which defines a code image. Then a spatial grid is defined and, for each cell, a histogram of occurrence of codes is created. All the histograms are then concatenated to form a final vector. However, depending on the size of the grid and the predefined number of codes, this histogram might be very large; therefore PCA is used to reduce its size.
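The following sketch illustrates this encoding pipeline on gray-scale crops: ring sampling around each pixel, vector quantization (K-Means stands in here for the random-projection tree preferred in [7]), per-cell code histograms and PCA compression. The ring pattern, grid size, code-book size and number of PCA components are illustrative choices, not those of the original paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def sample_ring(img, r=1):
    """Low-level vector per pixel: the 8 neighbors on a ring of radius r,
    normalized to unit length (one circular pattern in the spirit of Fig. 2.5a)."""
    offsets = [(-r, -r), (-r, 0), (-r, r), (0, r), (r, r), (r, 0), (r, -r), (0, -r)]
    pad = np.pad(img.astype(float), r, mode='edge')
    h, w = img.shape
    feats = np.stack([pad[r + dy: r + dy + h, r + dx: r + dx + w]
                      for dy, dx in offsets], axis=-1)
    norm = np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8
    return (feats / norm).reshape(-1, len(offsets))

def encode(img, codebook, n_codes, grid=4):
    """Code image -> per-cell histograms of code occurrences -> one vector."""
    h, w = img.shape
    codes = codebook.predict(sample_ring(img)).reshape(h, w)
    hists = []
    for ys in np.array_split(np.arange(h), grid):
        for xs in np.array_split(np.arange(w), grid):
            cell = codes[np.ix_(ys, xs)].ravel()
            hists.append(np.bincount(cell, minlength=n_codes))
    return np.concatenate(hists).astype(float)

def learn_encoder(images, n_codes=32, n_components=16):
    """Learn the code book (K-Means as a stand-in for the random-projection
    tree of [7]) and the PCA that compresses the code histograms."""
    low_level = np.vstack([sample_ring(im) for im in images])
    codebook = KMeans(n_clusters=n_codes, n_init=4, random_state=0).fit(low_level)
    hists = np.stack([encode(im, codebook, n_codes) for im in images])
    pca = PCA(n_components=min(n_components, len(images))).fit(hists)
    return codebook, pca

# Toy usage on random "face" crops.
rng = np.random.default_rng(0)
faces = [rng.integers(0, 256, size=(32, 32)) for _ in range(20)]
codebook, pca = learn_encoder(faces)
desc = pca.transform(encode(faces[0], codebook, 32)[None])[0]
desc /= np.linalg.norm(desc) + 1e-8   # final normalization step noted in [7]
print(desc.shape)
```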

Empirically, and surprisingly, they showed that the discriminative power is even higher after the dimensionality reduction, and improves even further by simply normalizing the projected vector. They also show how different sampling patterns can be combined to boost the performance, as they might retrieve complementary information. It is important to remark that the best results were obtained not with a holistic descriptor but with facial feature localization: the encoding is computed for each facial feature independently, and the alignment is done per component, not as a global alignment.

Besides the encoding, an adaptive matching was used, in which three exemplar images, with left, frontal and right pose, were selected. For an unseen image, the similarity of the descriptors is computed against each of the exemplars, and the assigned pose is that of the exemplar with the highest similarity. A classifier was trained for each combination of poses (left-left, frontal-frontal, right-right, left-right, left-frontal, frontal-right), in such a way that, depending on the inferred combination of poses for the input images, the corresponding classifier is used. They showed with their results that this also brings an improvement in the accuracy of the classification.

2.5 Multiple one-shots using label class information

This method, introduced by Taigman et al. [36], is based on the one-shot similarity score (OSS). The OSS score is computed as follows: a set of face examples A is obtained, which must be disjoint, in terms of identity, from the images to be compared. Then, if a pair of images xi and xj is to be classified, first a discriminative classifier fi is trained, using image xi as a single positive example and the set A as the negative examples. The process is repeated for xj to obtain a classifier fj. The OSS score is the average of the cross classification, i.e. s = (fi(xj) + fj(xi))/2.
The work from Taigman [36] is an extension of this method which benefits from the use of<br />

label information. The proposal is to split the set A according to the identities, such that we<br />

have Ai,i = {1,2,...,n}. Then to create a single OSS score from each of the subsets to build<br />

a multiple one-shot vector. The motivation is to make classifiers which are more discriminative<br />

towards identity than to other factors, such as pose. If a subset of Ai has images of only one<br />

person and there is variety regarding factors such as pose, expression, etc. then the classifier<br />

will be more likely to discriminate identity. In the case a factor such as pose is constant within<br />

the subset Ai, then the OSS score will not be discriminative towards identity, but to pose,<br />

however they argue this information is beneficial when combining a large set of OSS scores into<br />

the multiple one-score vector. In such way that they also created subsets of images sharing the<br />

same pose to create more OSS scores.<br />

The pipeline for this algorithm can be observed in Fig. 2.6, and is described as follows: the two images being compared are aligned, using a similar strategy to that of Section 3.3.2, from which a feature vector is created. They tested SIFT with dense sampling, Local Binary Patterns (LBP), and the three-patch and four-patch LBP [42]. PCA is later used to reduce the dimensionality of the descriptor. Then Information Theoretic Metric Learning (ITML) is used to learn a Mahalanobis distance d(xi, xj) = (xi − xj)⊤ S (xi − xj), which generates a distance above a certain threshold for negative pairs while maintaining the distance below another threshold for positive pairs [9]. The learned matrix can be factorized using a Cholesky decomposition, as S = G⊤G, from which the matrix G is used to project the feature vectors. In the new space, computing the Euclidean distance is equivalent to computing the Mahalanobis distance in the original space. The metric and the PCA projection are obtained from the training set prior to the computation of the OSS scores.

Figure 2.6: The multiple one-shot pipeline
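The following sketch illustrates the projection step: given a learned positive-definite matrix S (here a random stand-in for the ITML output), the Cholesky factor G with S = G⊤G makes Euclidean distances between projected vectors equal to Mahalanobis distances in the original space.

```python
import numpy as np

def mahalanobis_projection(S):
    """Factorize a learned Mahalanobis matrix S = G^T G (Cholesky) and
    return G, so that Euclidean distances between projected vectors equal
    Mahalanobis distances in the original space."""
    L = np.linalg.cholesky(S)   # S = L L^T, with L lower triangular
    return L.T                  # G = L^T  =>  S = G^T G

# Toy check with a random positive-definite S (stand-in for the ITML output).
rng = np.random.default_rng(0)
M = rng.normal(size=(8, 8))
S = M @ M.T + 8 * np.eye(8)
G = mahalanobis_projection(S)
x, y = rng.normal(size=8), rng.normal(size=8)
d_mahal = (x - y) @ S @ (x - y)
d_eucl = np.sum((G @ x - G @ y) ** 2)
print(np.isclose(d_mahal, d_eucl))   # True
```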

Finally, for a pair of face images to be classified, their feature vectors, projected using the matrix G, are used to generate multiple OSS scores using the subsets Ai; these are concatenated to create a vector which is fed into an SVM classifier.

This algorithm currently has the highest accuracy reported for the Labeled Faces in the Wild benchmark in the "unrestricted" protocol, explained in Section 3.7.1. However, notice that the computation of OSS scores is very expensive, as many different discriminative models have to be trained in order to create the multiple OSS score vector.


Chapter 3

The face recognition pipeline

In this chapter the pipeline depicted in Fig. 3.1 is discussed in more detail. The function of each stage is described, and relevant algorithms for their implementation are presented.

3.1 Face detection

Face detection is the search for the location and scale of instances of human faces within an arbitrary image. Again, the difficulty is to perform well in the presence of factors that affect images acquired in uncontrolled conditions (cf. Fig. 1.1). Viola & Jones [39] proposed an efficient algorithm for face detection, based on Haar wavelet features and a cascade of classifiers selected by the Adaboost algorithm.

Adaboost [14] is an algorithm designed to create a "strong classifier" from a set of "weak classifiers" through their linear combination. The algorithm iteratively selects, from the weak classifier space, the one which minimizes a weighted error over the training data. The weight assigned to the selected classifier depends on this error and, at each iteration, the distribution over the training data is updated in such a way that the examples which were misclassified are given higher importance in the following iterations.

Figure 3.1: General face recognition pipeline


Figure 3.2: Viola-Jones object detection based on Haar features (a) Examples of Haar features (b) Feature computation from the integral image. Notice that the area marked as D can be computed using the points 1, 2, 3 and 4 from the integral image: D = 4 + 1 − (2 + 3) [39]

Their algorithm has the advantage of providing a fast way to extract the Haar wavelets by precomputing what is called an integral image, Eq. (3.1). This is possible due to the rectangular geometry of Haar wavelets (Fig. 3.2a), whose responses can later be computed by adding a few terms from the integral image (Fig. 3.2b). This is an important asset for detection, as different Haar filters must be computed at many locations and scales within the probe image.

$$\tilde{I}(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y') \qquad (3.1)$$
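A minimal sketch of the integral image of Eq. (3.1) and of the four-lookup rectangle sum of Fig. 3.2b:

```python
import numpy as np

def integral_image(img):
    """Integral image of Eq. (3.1): cumulative sum over rows and columns."""
    return np.cumsum(np.cumsum(img.astype(np.int64), axis=0), axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] from four integral-image
    lookups (the D = 4 + 1 - (2 + 3) rule of Fig. 3.2b)."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

# Toy check against a direct sum.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(24, 24))
ii = integral_image(img)
print(box_sum(ii, 5, 3, 10, 8) == img[5:11, 3:9].sum())   # True
```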

While this algorithm is widely used because of its accuracy and speed, the implementation used in this thesis is an extension of the Viola-Jones algorithm. Besides Haar features, Histogram of Oriented Gradients (HoG) features (see Section 3.5.1) have been used. The advantage of using HoG features is that the same concept of the integral image can be applied, by creating an integral histogram [32,47]. This strategy boosts the speed of the algorithm, which benefits in terms of robustness from the use of additional features. Fig. 3.3 shows some examples of face detections.

Figure 3.3: (a) Correct face detections (b) Example of a missed detection due to large pose variation (c) Incorrect detections due to a cluttered region

3.2 Facial features localization

Facial feature point localization is the first step in feature based algorithms. Its robustness is crucial for performance. The detector used in this thesis is the one from [11], which is an improvement over the pictorial structure model [12]. The algorithm must maximize the following measure:

$$p(F \mid p_1, \ldots, p_n) \propto p(p_1, \ldots, p_n \mid F) \prod_{i=1}^{n} \frac{p(a_i \mid F)}{p(a_i \mid \bar{F})} \qquad (3.2)$$

Eq. (3.2) expresses the probability of having the set of features F given a localization (p1, ..., pn). This is proportional to the probability of having such a localization (whether the relative positioning of points is possible according to the expected geometry), multiplied, for each feature, by the ratio of the probability of obtaining the appearance ai given that the feature is present, over the probability of having that appearance given that the feature is not present. For the appearance model, there is an assumption of mutual independence between all the facial features, which is also independent of their localization, and therefore it appears as a product. Eq. (3.2) can thus be understood as the combination of two models, one for the relative localization of the features and another for their appearance.

The appearance ratios are modeled using a binary classifier, trained with feature/non-feature examples. It uses Haar wavelets and Adaboost for the combination of the weak classifiers given by the Haar features. It follows exactly the same algorithm as in Section 3.1, and the output is substituted directly into Eq. (3.2). On the other hand, the localization is modeled with a tree-like Gaussian mixture in which there is a covariance dependency in the form of a tree. Each covariance depends on its parent node, as shown in Fig. 3.4, where nodes 2, 3 and 4 are shown with an uncertainty relative to their parent node (1).

Figure 3.4: Tree-like Gaussian Mixture Model for the localization of facial features

The combination of both models provides a highly reliable localization, which is able to cope with large pose variations. It is also able to handle occlusions, as the expected positions compensate for appearance problems.

As discussed in [12], the tree structure of the Gaussian Mixture Model allows for efficient algorithms for maximizing Eq. (3.2), and using the Viola-Jones algorithm for appearance modeling speeds up the algorithm as well.

3.3 Face alignment

Many recognition algorithms rely on the ability of the face detector to give a standard location and scale for the face. However, this is not always the case: standard face detectors, such as Viola-Jones's and the one used for this project, give poorly aligned images. This is the trade-off between the ability to detect faces with large changes in pose and expression, and the quality of alignment and localization. In order to compensate for those misalignments, different algorithms have been proposed to bring an arbitrary facial image to a canonical pose, in which facial features can be more easily compared. Recent algorithms have been proposed for non-rigid transformations, such that the proper positioning of the facial features is inferred despite the pose; see Zhu et al. [46]. In this section, two algorithms restricted to rigid transformations are described.

3.3.1 Funneling

In 2007, Huang et al. [18] introduced a technique called unsupervised joint alignment. This algorithm models an arbitrary set of images (in this case, face images) as a distribution field, i.e. a model in which every pixel in the image is a random variable Xi, with possible values from an alphabet χ, for example the set of pixel intensities for an 8-bit gray-scale image, i.e. χ = {1, 2, ..., 256}. Each pixel Xi is then assigned a distribution over χ.

The first step of the algorithm, which can be considered as training, is called congealing. It computes the empirical distribution for each pixel, based on the stack of images, i.e. the empirical distribution field. Then, for each image, it performs a transformation (e.g. an affine transformation) such that the entropy over the distribution field is minimized. It then recomputes the empirical distribution field for the transformed images and repeats the iterations until convergence.

Figure 3.5: Congealing example [18]

Fig. 3.5 illustrates the idea of congealing. The distribution field is formed by a stack of 1D binary images, i.e. χ = {0, 1}. At each iteration, a horizontal translation is chosen for each image in such a way that the overall entropy is reduced. As a result, at iteration n, the images will be at positions such that they are considered aligned.

Notice that congealing can be used directly to align a set of face images. However, it cannot be applied to an unseen example, unless the new image is inserted into the training set and congealing is run again. Funneling is an efficient way of handling this: the idea is to keep the sequence of distribution fields from each iteration of congealing, and to choose a sequence of transformations for the new image based on the distribution field obtained at each iteration of congealing. In [18], instead of using pixel values, SIFT descriptors were used at each pixel location; k-Means is then used to obtain 12 clusters, which are used as the alphabet χ.
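The following toy sketch reproduces the 1D binary congealing of Fig. 3.5: each signal is repeatedly given the integer translation that most reduces the total entropy of the empirical distribution field. The shift range and number of iterations are arbitrary choices for illustration.

```python
import numpy as np

def field_entropy(stack):
    """Sum over pixels of the entropy of the empirical distribution of
    values at that pixel (binary alphabet here, as in Fig. 3.5)."""
    p = stack.mean(axis=0)                      # empirical P(X_i = 1)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.sum(-p * np.log(p) - (1 - p) * np.log(1 - p)))

def congeal_1d(signals, shifts=range(-3, 4), n_iter=10):
    """Toy congealing on 1D binary signals: at each iteration, give each
    signal the integer translation that most reduces the field entropy."""
    signals = [np.asarray(s) for s in signals]
    for _ in range(n_iter):
        for i, s in enumerate(signals):
            best = min(shifts,
                       key=lambda t: field_entropy(
                           np.stack([np.roll(s, t) if j == i else sj
                                     for j, sj in enumerate(signals)])))
            signals[i] = np.roll(signals[i], best)
    return np.stack(signals)

# Toy usage: the same step edge at different positions becomes better aligned.
base = np.r_[np.zeros(6), np.ones(6)].astype(int)
stack = [np.roll(base, t) for t in (-2, 0, 1, 2)]
aligned = congeal_1d(stack)
print(field_entropy(np.stack(stack)), '->', field_entropy(aligned))
```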

3.3.2 Facial features coordinates based alignment

Another strategy consists of using the output of the facial feature localization, i.e. the coordinates, to infer the affine transformation which will bring the facial feature points to a canonical pose, one that is shared among all the images.

Let $x^f = (x_0^f, x_1^f, 1)^\top$ be the homogeneous coordinates of feature $f$ in a non-aligned image, and $y^f = (y_0^f, y_1^f)^\top$ the desired coordinates for the same feature. We want to obtain the affine transformation $A$ ($2 \times 3$) such that $y^f = A x^f$. To obtain the six parameters of $A$ only three features are needed; however, in order to compensate for wrong localizations, all the features can be used to obtain the set of parameters which minimizes the least squares error in localization.

Let $A'$ be defined as the vector with the entries of $A$, let $Y = (y_0^0, y_1^0, \ldots, y_0^{F-1}, y_1^{F-1})^\top$ be the vector with the target coordinates, and let the matrix $X$, with the input coordinates of all the features, be defined as shown in Eq. (3.3):

$$X = \begin{pmatrix}
x_0^0 & x_1^0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & x_0^0 & x_1^0 & 1\\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\
x_0^{F-1} & x_1^{F-1} & 1 & 0 & 0 & 0\\
0 & 0 & 0 & x_0^{F-1} & x_1^{F-1} & 1
\end{pmatrix} \qquad (3.3)$$

Then, in the new variables, $y^f = A x^f$ becomes $Y = X A'$, whose least squares solution is given in Eq. (3.4):

$$A' = (X^\top X)^{-1} X^\top Y \qquad (3.4)$$

Figure 3.6: Examples of facial features based alignment

Figure 3.6 shows some examples of alignments obtained using this strategy. The disadvantage of this approach is that facial feature localization algorithms are rather slow and are affected by large pose changes, which can lead to wrong alignments. Furthermore, a single canonical pose is not suitable for major changes in viewpoint. Within this work, the target coordinates were obtained by averaging over the set of training examples.
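A minimal sketch of the least squares alignment of Eqs. (3.3)-(3.4): the design matrix X is built from the detected feature coordinates and the affine parameters are recovered with a least squares solver, equivalent to (X⊤X)⁻¹X⊤Y.

```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """Least squares affine transform, Eqs. (3.3)-(3.4).

    src_pts, dst_pts: (F, 2) arrays of detected and canonical facial
    feature coordinates. Returns the 2x3 matrix A with dst ~= A [src; 1].
    """
    F = len(src_pts)
    X = np.zeros((2 * F, 6))
    Y = np.zeros(2 * F)
    for f, ((x0, x1), (y0, y1)) in enumerate(zip(src_pts, dst_pts)):
        X[2 * f] = [x0, x1, 1, 0, 0, 0]
        X[2 * f + 1] = [0, 0, 0, x0, x1, 1]
        Y[2 * f], Y[2 * f + 1] = y0, y1
    A_prime, *_ = np.linalg.lstsq(X, Y, rcond=None)   # = (X^T X)^-1 X^T Y
    return A_prime.reshape(2, 3)

# Toy check: recover a known rotation + translation from 9 feature points.
rng = np.random.default_rng(0)
A_true = np.array([[0.96, -0.26, 12.0], [0.26, 0.96, -5.0]])
src = rng.uniform(0, 100, size=(9, 2))
dst = (A_true @ np.c_[src, np.ones(9)].T).T
print(np.allclose(estimate_affine(src, dst), A_true))   # True
```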

3.4 Preprocessing for illumination invariance

In uncontrolled conditions, the illumination setup in which the image was acquired might have a drastic influence on the obtained descriptor. Optionally, a preprocessing stage is desirable, in which the effect of illumination conditions, local shadowing and highlights is removed, while preserving the visual information that is important for recognition.

Tan and Triggs [37] proposed an efficient pipeline to remove the effects of illumination, specifically for face recognition. First, gamma correction is used, i.e. a transformation of the pixel gray-level values I using the non-linear transform Î = I^γ, with 0 < γ < 1. This enhances the dynamic range by increasing the intensity in dark regions and decreasing it in bright regions. Next, the image is convolved with a Difference of Gaussians (DoG) kernel, a bandpass filter which is intended to remove gradients caused by shadows (low frequencies), to suppress noise (high frequencies), and to maintain the signal useful for recognition (middle frequencies). Additionally, a mask can be used to remove regions which are irrelevant for recognition. Finally, contrast equalization is used to obtain a standardized contrast spectrum for the image. This is done carefully, by removing the effect of extreme values, such as artificial gradients introduced by the masking. Fig. 3.7 shows examples of the resulting images after the preprocessing is applied.

Figure 3.7: Preprocessing examples to gain illumination invariance (a) before preprocessing (b) after preprocessing

In this thesis we did not consider a preprocessing step for illumination invariance, as the descriptors used are based on gradients and are therefore invariant to illumination shifts.
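For illustration, a minimal sketch of the Tan and Triggs chain [37] (gamma correction, DoG filtering and a robust contrast equalization); the parameter values are common defaults, not necessarily those of the original paper, and this step was not used in this thesis.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def illumination_normalize(img, gamma=0.2, sigma0=1.0, sigma1=2.0,
                           alpha=0.1, tau=10.0):
    """Sketch of the Tan & Triggs chain [37]: gamma correction, DoG
    filtering, and a two-stage contrast equalization. Parameter values
    are typical choices, not necessarily those of the original paper."""
    img = img.astype(float) / 255.0
    img = img ** gamma                                                 # gamma correction
    img = gaussian_filter(img, sigma0) - gaussian_filter(img, sigma1)  # DoG bandpass
    # Contrast equalization, made robust to extreme values.
    img /= np.mean(np.abs(img) ** alpha) ** (1.0 / alpha) + 1e-8
    img /= np.mean(np.minimum(np.abs(img), tau) ** alpha) ** (1.0 / alpha) + 1e-8
    return tau * np.tanh(img / tau)                                    # squash outliers

# Toy usage on a synthetic unevenly lit image.
rng = np.random.default_rng(0)
face = rng.uniform(0, 255, size=(64, 64)) * np.linspace(0.2, 1.0, 64)
out = illumination_normalize(face)
print(out.min(), out.max())
```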

3.5 Face descriptor

The objective is to transform an image into a feature vector xi ∈ R^D. This vector must be discriminative, i.e. it must encode information that is relevant to determine the identity of the person. The learning algorithms described in Section 3.6 show strategies to learn which information is relevant and which is not.

In Section 2.2 a facial feature based descriptor was presented, which consists of the pixel intensities surrounding the localized facial features, normalized to have zero mean and unit variance to gain robustness to illumination changes. We refer to that descriptor as a facial feature patch. In this section, two more descriptors are described: the Histogram of Oriented Gradients and SIFT.

3.5.1 Histogram of Oriented Gradients

The Histogram of Oriented Gradients (HoG) was initially proposed by Dalal and Triggs [8]. It is a global (holistic) descriptor, closely related to SIFT (see Section 3.5.2) and edge orientation histograms, and was designed for the human detection task. The pipeline used for their application is depicted in Fig. 3.8.


As illustrated in Fig. 3.8, the descriptor is built as follows: for an input image, the derivatives in the x and y directions (Ix and Iy) are computed by convolving the image with the filters h = [−1, 0, 1] and its transpose h⊤, respectively. Then the magnitude and direction of the gradient are obtained as M(i,j) = √(Ix(i,j)² + Iy(i,j)²) and Ω(i,j) = arctan(Iy(i,j)/Ix(i,j)), in such a way that each pixel has its gradient vector: magnitude and direction. Then, according to a predefined number of cells, the image is split into a grid of cells × cells and, for each cell, a histogram is computed over the occurrence of the gradient angles of the pixels contained in that cell. The vote of each pixel is given by its magnitude, and a soft assignment is used, i.e. linear interpolation to share the vote among neighboring angle bins. The next step is to normalize the histograms by using blocks of cells, i.e. groups of cells whose joint energy is used for normalization. Dalal and Triggs used overlapping blocks, in such a way that there is redundancy over the cells being used, differing only in the value used for their normalization.

For this thesis the strategy for normalization is different: we allowed the cells to overlap, with the amount of overlap as a parameter, and we defined three types of normalization:

• Cell: The normalization value for each cell is computed using only the information within the same cell. This approach is highly invariant to non-uniform illumination changes, but the relative changes in gradient magnitude between different cells are lost.

• Global: All the cells are normalized with the same value, which is computed globally. In this case, the relative changes in magnitude between different cells are maintained, but there is poor illumination invariance.

• Block: The objective of block normalization is to provide a local, but coarser, normalization, so that it is a trade-off between illumination invariance and maintaining changes in magnitude between different cells. The strategy is overlap dependent, to comply with the geometry of the spatial grid, and it can be used only for overlaps of 0% or 50%. In the case of 0% overlap, the normalization value is computed by combining the energy of the current cell (the one to be normalized) and 3 of its neighbors, as shown in Fig. 3.9a. In the case of 50% overlap, the current cell is normalized using the neighbors on its diagonal. Considering that a cell actually consists of 4 small squares in Fig. 3.9b, the neighbors on the diagonal cover the area of the current cell.

Figure 3.8: Pipeline proposed by Dalal and Triggs for human detection using HoG [8]: input image → normalize gamma and colour → compute gradients → weighted vote into spatial and orientation cells → contrast normalization over overlapping spatial blocks → collect HoGs over the detection window → linear SVM person/non-person classification.



In all of the cases the normalization used is L2, i.e. for a vector x = (x0, x1, ..., x_{D−1})⊤ the normalized vector is obtained as x′ = x/|x|, with:

|x| = √( Σ_{i=0}^{D−1} xi² )    (3.5)
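To make the construction above concrete, the following is a minimal NumPy sketch of the descriptor computation with hard spatial binning, soft angle assignment and a single global L2 normalization. It is illustrative only: the thesis implementation is in C/OpenCV, and the cell overlap and cell/block normalization variants described above are omitted here; all function and parameter names are assumptions.

```python
import numpy as np

def hog_descriptor(img, cells=16, bins=16, signed=True, eps=1e-8):
    # Minimal sketch: gradient, per-cell orientation histograms with
    # magnitude-weighted soft voting, then global L2 normalization (Eq. 3.5).
    img = img.astype(float)
    Iy, Ix = np.gradient(img)                 # derivatives along y and x
    mag = np.sqrt(Ix ** 2 + Iy ** 2)
    ang = np.arctan2(Iy, Ix)                  # (-pi, pi]
    period = 2 * np.pi if signed else np.pi   # [0-360] or [0-180] degree range
    ang = np.mod(ang, period)
    h, w = img.shape
    hist = np.zeros((cells, cells, bins))
    cy, cx = h / cells, w / cells
    bin_width = period / bins
    for i in range(h):
        for j in range(w):
            ci = min(int(i // cy), cells - 1)
            cj = min(int(j // cx), cells - 1)
            # soft assignment: share the magnitude vote between the two
            # neighboring angle bins by linear interpolation
            pos = ang[i, j] / bin_width
            b0 = int(np.floor(pos)) % bins
            b1 = (b0 + 1) % bins
            frac = pos - np.floor(pos)
            hist[ci, cj, b0] += (1 - frac) * mag[i, j]
            hist[ci, cj, b1] += frac * mag[i, j]
    vec = hist.ravel()
    return vec / (np.linalg.norm(vec) + eps)  # global L2 normalization
```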

We also considered using a multiscale version of HoG. In this case, a HoG descriptor is computed for each level of the scale pyramid. The quantity of cells at level l, denoted cl, depends on the scaling factor k, i.e. cl = c0 · k^(−l). As a summary, the parameters involved in the HoG descriptor computation are shown in Table 3.1.

Table 3.1: HoG parameters summary

Parameter      Description
Cells          Quantity of cells for the image grid
Angles         Quantity of angle bins for each histogram
Overlap        Fraction of overlap between neighboring cells
Sign           Whether the angle range is 0-180° or 0-360°
Normalization  Either cell, global or block normalization
Levels         Quantity of levels for the multiscale HoG
Scaling (k)    Scaling factor between levels of the multiscale HoG

3.5.2 Scale invariant feature transform (SIFT)<br />

The SIFT descriptor was proposed by Lowe [23], and it has proven to be very useful for object<br />

recognition and matching applications. This descriptor is local in the sense that it describes the<br />

region surrounding a keypoint, in a specific scale and orientation. Normally its location, scale<br />

and orientation are obtained from an interest point (keypoint) detector. In the case of [23], the keypoints are obtained as scale-space extrema using Difference of Gaussians (DoG) filtering.

Figure 3.9: HoG block normalization. (a) Zero percent overlap: the highlighted cell is normalized using its energy plus the energy of its 3 immediate neighbors. (b) Fifty percent overlap: the current cell is normalized using the energy of its 4 diagonal neighbors, which cover its area due to the overlap.

The SIFT descriptor has the structure depicted in Fig. 3.10; the idea is similar to that of the HoG descriptor. The gradient is computed for each pixel in the interest region, and the area is divided into subregions (2x2 in Fig. 3.10), from which a histogram of gradients is computed by using the magnitude of the gradient as the vote for the angle bins. It is important to remark, however, that prior to the histogram computation a Gaussian weighting is applied to the magnitude, centered in the middle of the descriptor with σ equal to one half of the descriptor width. This gives less importance to the pixels at the extremes of the area and therefore reduces the effect of misalignments. In this thesis, we used SIFT descriptors with 4x4 subregions, each of 8 angle bins, generating a 128 dimensional descriptor.

Figure 3.10: SIFT descriptor structure [23]: image gradients (left) and keypoint descriptor (right).

3.6 Learning/Classification

Each image i is represented by a descriptor vector xi ∈ R^D. The vector xi is also associated with a categorical label yi corresponding to the person identity. A classification algorithm for face recognition models the binary decision of whether images xi and xj belong to the same class (yi = yj) or not (yi ≠ yj), as shown in Eq. (3.6):

f(xi, xj) : R^{D×2} → {0, 1}    (3.6)

In the following sections, relevant algorithms for classification are described.<br />

3.6.1 Spectral regression kernel discriminant analysis<br />

Kernel discriminant analysis (KDA) is an extension of the linear discriminant analysis (LDA)<br />

to handle non-linear data. In the case of LDA, it is assumed that the data for each class follows



a normal distribution with equal covariance. The goal is to solve Eq. (3.7):

Wopt = arg max_W Tr{ (W⊤ SW W)^(−1) (W⊤ SB W) }    (3.7)

Eq. (3.7) finds the optimal combination of features which separates the input data according<br />

to their classes. The objective function is such that the between class covariance SB is maxi-<br />

mized and the within class covariance SW is minimized. These terms are defined in Eq. (3.8)<br />

and Eq. (3.9) respectively.<br />

SB = Σ_{i=1}^{c} Ni (µi − µ)(µi − µ)⊤    (3.8)

SW = Σ_{i=1}^{c} Σ_{xk ∈ Xi} (xk − µi)(xk − µi)⊤    (3.9)

where Ni and µi are the number of points and the mean for class i, µ is the mean over all the data independently of the class, and Xi is the subset of points that belong to class i. LDA can be described as an algorithm that finds an optimal linear projection such that data belonging to the same class are moved closer together, while data belonging to different classes are pushed apart.

In [2] it is shown that the problem can be reformulated in terms of inner products. Therefore

the Kernel trick can be used to handle non-linear data, which leads to the KDA algorithm. For<br />

this thesis we used an instance of KDA called Spectral Regression Kernel Discriminant Analysis<br />

(SR-KDA), from the work of Cai et al. [6]. It is a specific formulation of KDA in which the<br />

optimization process is theoretically 27 times faster. The limitation of SR-KDA is that the<br />

target space is limited to be of c − 1 dimensions, where c is the number of classes.<br />

3.6.2 Logistic regression<br />

General logistic regression

Logistic regression [4] models the probability that a feature vector xi belongs to a class as a logistic sigmoid function. Its argument is a linear combination of the entries of the feature vector. This is shown in Eq. (3.10).

p(yi = 1|xi) = σ(w ⊤ xi), (3.10)<br />

where σ(z) = (1 + exp(−z))^(−1) is the sigmoid function, and xi is given in homogeneous coordinates, i.e. it allows a bias term to be learned in w. Taking the negative log-likelihood (Eq. (3.11)) and its gradient (Eq. (3.12)), the optimal weights can be obtained by using a gradient descent algorithm until convergence (finding the minimum of the negative log-likelihood).

L = − Σ_n [ tn ln pn + (1 − tn) ln(1 − pn) ]    (3.11)

∇L = Σ_n (tn − pn) xn    (3.12)

Logistic discriminant metric learning

The objective of metric learning algorithms is to find the matrix M ∈ R^{D×D} such that the Mahalanobis distance, Eq. (3.13), is minimized for positive examples (yi = yj) and maximized for negative pairwise examples (yi ≠ yj):

dM(xi, xj) = (xi − xj)⊤ M (xi − xj),    (3.13)

where M is restricted to be positive semidefinite (a matrix M ∈ R^{D×D} is positive semidefinite, denoted M ⪰ 0, if x⊤Mx ≥ 0 for all x ≠ 0). Logistic Discriminant Metric Learning, proposed by Guillaumin et al. [16], models the probability of two examples depicting the same person as given by Eq. (3.14).

pn(yi = yj|xi,xj;M,b) = σ(b − dM(xi,xj)), (3.14)<br />

where σ(z) = (1 + exp(−z)) −1 is the sigmoid function and b is a bias value. Let n be an index<br />

representing the pair ij. From Eq. (3.14), the likelihood over the seen data, taking tn as the<br />

target class for pair xn = (xi,xj), is given in Eq. (3.15).<br />

L = Π_{n=1}^{N} pn^{tn} (1 − pn)^{1−tn}    (3.15)

From this it can be shown that the negative log-likelihood and its gradient are given in Eq. (3.16) and Eq. (3.17) respectively.

L = − Σ_n [ tn ln pn + (1 − tn) ln(1 − pn) ]    (3.16)

∇L = Σ_n (tn − pn) Xn    (3.17)

where Xn is defined as the vectorization of (xi − xj)(xi − xj)⊤. Using Eq. (3.16) and Eq. (3.17) it is possible to learn the values of M by minimizing the negative log-likelihood with a gradient descent algorithm. If the matrix is restricted to be positive semidefinite, then a Cholesky decomposition can be applied to it, i.e. M = LL⊤. In this case Eq. (3.13) can be reformulated as in Eq. (3.18):

dL(xi,xj) = (L ⊤ xi − L ⊤ xj) ⊤ (L ⊤ xi − L ⊤ xj) (3.18)<br />

This result can be interpreted as a projection of the data followed by the computation of<br />

the Euclidean distance in the new space. Throughout this thesis, logistic discriminant metric<br />

learning will be used as the main learning algorithm.<br />
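As an illustration of how such a metric can be learned in practice, the following is a minimal NumPy sketch of LDML-style gradient descent. It parameterizes M = LL⊤ directly through the projection L (so M stays positive semidefinite); the function names and hyperparameters (projection dimension, learning rate, iteration count) are illustrative assumptions rather than the values used in the thesis, whose implementation was done in Matlab/C.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def ldml_fit(X, pairs, targets, dim=32, lr=1e-3, iters=500, seed=0):
    # X: (N, D) descriptors; pairs: list of (i, j) index pairs;
    # targets: 1 if the pair depicts the same person, 0 otherwise.
    # Learns L (with M = L L^T) and bias b by gradient descent on the
    # negative log-likelihood of Eq. (3.16).
    rng = np.random.default_rng(seed)
    L = 1e-2 * rng.standard_normal((X.shape[1], dim))
    b = 0.0
    I = np.array([i for i, _ in pairs])
    J = np.array([j for _, j in pairs])
    t = np.asarray(targets, dtype=float)
    for _ in range(iters):
        diff = X[I] - X[J]                    # pairwise differences, shape (P, D)
        proj = diff @ L                       # L^T (x_i - x_j), shape (P, dim)
        dist = np.sum(proj ** 2, axis=1)      # Mahalanobis distance as in Eq. (3.18)
        p = sigmoid(b - dist)                 # Eq. (3.14)
        err = t - p
        grad_L = 2.0 * diff.T @ (err[:, None] * proj)   # gradient of the NLL w.r.t. L
        grad_b = -np.sum(err)                            # gradient of the NLL w.r.t. b
        L -= lr * grad_L
        b -= lr * grad_b
    return L, b

def same_person_probability(L, b, xi, xj):
    return sigmoid(b - np.sum(((xi - xj) @ L) ** 2))
```

With the learned L and b, an unseen pair is classified as depicting the same person when `same_person_probability` exceeds 0.5 (or another threshold chosen on the training folds).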

3.7 Datasets and evaluation<br />

In order to evaluate the performance of our algorithm, two datasets are used: Labeled Faces in<br />

the Wild (LFW) and Public Figures (PubFig). In this section a description of both datasets<br />

together with their evaluation protocol is presented.<br />

3.7.1 Labeled faces in the wild<br />

The main dataset used for this project is called Labeled Faces in the Wild (LFW) [19]. It is an important dataset due to its high variability in pose, expression, illumination conditions, etc., and is therefore considered appropriate to evaluate face recognition approaches for uncontrolled settings [30]. It consists of 13233 images retrieved from Yahoo! News using a Viola-Jones face detector. With a resolution of 250 × 250, the scale and location of each face is approximately the same, so there is no need to run a face detector again. Each image is labeled according to the person identity, giving a total of 5749 identities. The quantity of images per person varies from 1 to 530.

To redirect the research efforts towards recognition algorithms rather than alignment, there are three versions of LFW available:

• Not Aligned: the set of images as taken directly from the face detector.<br />

• Aligned Funneled: aligned using the algorithm described in section 3.3.1.<br />

• Aligned Commercial: aligned using the algorithm introduced in [43].<br />

In order to have a standard evaluation method to properly compare different algorithms,<br />

a protocol was established. Ten independent subsets (folds) of images were defined, mutually<br />

exclusive in terms of image exemplars and identity. The evaluation protocol allows for two



different paradigms: restricted and unrestricted. For the restricted case, a set of 600 pairs is predefined for each of the ten folds; each pair has an associated label which indicates whether or not the images belong to the same person, with 300 pairs of each kind. In this case the identity must not be used, i.e. no more pairs can be created. In the unrestricted paradigm, the identities can be used, so that a larger quantity of pairs can be created.

For both cases, performance is reported as the mean over 10-fold cross validation. This<br />

means that one of the 10 folds is held out, and the training is done using the remaining subsets,<br />

then the accuracy is obtained by classifying the “unseen” 600 pairs that were left aside. This<br />

is done 10 times, rotating over the different folds and the final report is the mean and standard<br />

deviation of the accuracy over the 10 folds. In this work we will focus on the unrestricted paradigm.
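For illustration, the evaluation loop can be sketched as follows; `train_fn` and `classify_fn` are hypothetical placeholders for whichever descriptor and learning algorithm is being evaluated, and the fold structure is assumed to follow the LFW protocol described above.

```python
import numpy as np

def lfw_cross_validation(folds, train_fn, classify_fn):
    # folds: list of 10 lists of (xi, xj, label) test pairs, one list per fold.
    # For each fold, train on the other nine and evaluate on its held-out pairs;
    # report the mean and standard deviation of the accuracy.
    accuracies = []
    for k in range(len(folds)):
        model = train_fn([f for i, f in enumerate(folds) if i != k])
        fold = folds[k]
        correct = sum(classify_fn(model, xi, xj) == y for xi, xj, y in fold)
        accuracies.append(correct / len(fold))
    return np.mean(accuracies), np.std(accuracies)
```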

3.7.2 Public figures (PubFig)<br />

The Public Figures dataset was compiled by Kumar et al. [22] and is larger than LFW. It consists of 59470 images of 200 people, collected from the internet; therefore there are many more images per person than in LFW. Similarly to LFW, it contains a large variability in pose, illumination, expression, etc.

An important difference with LFW is that the images are given as a list of URL addresses pointing to different sources on the internet. That represents a problem, as over time some images are lost. This was confirmed when we retrieved the dataset: 15% of the URLs were invalid and, as a consequence, 25% of the test pairs could not be created.

The evaluation protocol is 10 fold cross validation using a “restricted” paradigm equivalent<br />

to that of LFW, and therefore, no additional pairs can be used to train the algorithm. Different<br />

benchmarks to measure the performance of the algorithm under specific conditions are provided,<br />

e.g. the behavior using only frontal pose images, or only using neutral expressions, etc.<br />

In our evaluation, we use the dataset under an "unrestricted" paradigm, defining our own pairs for training, but using the benchmark test pairs for evaluation.

3.8 Baseline performance<br />

Our baseline algorithm is the following: facial features are detected (see Section 3.2) and, using the found coordinates, two feature vectors are built. The first vector is formed by the concatenation of SIFT descriptors, obtained at three different scales (16, 32 and 48 pixels width) at the location of each facial feature (following [16]). The other is the concatenation of the facial feature patches from Section 2.2. The implementation was done in Matlab, and computationally expensive sections such as alignments or feature extractions were implemented in C.
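As a rough illustration of the multiscale SIFT part of this baseline (the actual feature extraction was implemented in Matlab/C), the sketch below uses OpenCV's SIFT to compute descriptors at fixed keypoints. The mapping between the keypoint size and the 16/32/48 pixel patch widths, and the fixed upright orientation, are simplifying assumptions; names are hypothetical.

```python
import cv2
import numpy as np

def multiscale_sift(gray, feature_points, scales=(16, 32, 48)):
    # gray: aligned grayscale face image (uint8);
    # feature_points: the 9 detected facial feature coordinates [(x, y), ...].
    sift = cv2.SIFT_create()
    parts = []
    for (x, y) in feature_points:
        for s in scales:
            # fixed position, size and upright orientation: no keypoint detection
            kp = cv2.KeyPoint(float(x), float(y), float(s), 0.0)
            _, desc = sift.compute(gray, [kp])
            parts.append(desc.ravel())
    return np.concatenate(parts)   # 9 features x 3 scales x 128 = 3456 dimensions
```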



Table 3.2 shows the results obtained for both descriptors on the Aligned Commercial version of LFW. For comparison, two classifiers are used: the Euclidean distance between the feature vectors of the pair of images being classified, and LDML to learn a proper metric.

The significant contribution of metric learning approaches to face recognition can be observed. Additionally, when the Euclidean distance is used for classification, there is no significant gain from using SIFT descriptors rather than facial feature patches. The difference is only observed when a proper metric is used.

Table 3.2: Baseline algorithms performance<br />

Classification Facial Feature Patches Multiscale SIFT<br />

Euclidean Distance 0.6702 ± 0.0031 0.6845 ± 0.0051<br />

Logistic Discriminant Metric Learning 0.7385 ± 0.0042 0.8524 ± 0.0052


Chapter 4<br />

Histogram of Oriented Gradients<br />

for face recognition<br />

4.1 Motivation<br />

Facial feature based approaches have gained popularity in the past years, due to their robustness to pose variations in comparison with holistic approaches. However, the performance of face recognition is strongly dependent on the accuracy of the facial feature detection. Facial feature localization algorithms, even though they have improved significantly, are still not able to cope with large pose variations. Besides, the computation time is high, since the objective function must be maximized over the set of possible locations, Eq. (3.2). For those reasons, it is desirable to have a pipeline without facial feature detection.

There is also the intuition that holistic approaches will provide more information to the learning process, which might give a higher discrimination power to the overall algorithm. Therefore, a Histogram of Oriented Gradients (HoG) descriptor, a holistic encoding, was implemented following the description in Section 3.5.1. The programming language for the implementation was C, using the OpenCV library [5]. Assuming the input image's resolution is 250x250, the descriptor is created for the 100x100 pixel region in the center of the image. It is important to dismiss the background in order to reduce biases the dataset might have [31]. The objective is to find a set of parameters such that the discriminative power of the descriptor is suitable for face recognition.




4.2 Alignment comparison<br />

It is important to decide whether the use of an alignment is imperative for holistic approaches,<br />

more specifically, for the use of a HoG descriptor. To answer this question, we compare the<br />

three variants of LFW: Not Aligned, Aligned Funneled and Aligned Commercial, and using the<br />

same parameters for the HoG descriptor.<br />

The first results are shown in Table 4.1. They reveal, in a consistent manner, that an alignment is crucial for face recognition using HoG. Interestingly, the funneled version of LFW did not show any improvement over the not aligned version; in fact there is a decrease. For that reason, we ran a face alignment using the location of the facial features (c.f. Section 3.3.2). It can be seen that this boosts the results significantly, both for the not aligned and for the aligned funneled versions, with an increase of over 5%, while there is an insignificant decrease in the case of the aligned commercial version. Though it is not reported here, in our experiments we did not observe a significant difference in accuracy between any of the LFW versions when using a facial feature based descriptor.

These results bring two conclusions. First, a face alignment is indeed crucial for the use of HoG descriptors. Second, as suggested by the decrease in accuracy of the funneled version with respect to the not aligned version, the alignment should be robust not only in terms of rotation and scale but, more importantly, in terms of translation. We believe that funneling is not as robust to translation as a feature based alignment.

It is intuitive to have a need for robust alignment regarding translation, as it is desirable for<br />

the corresponding features to fall in the same spatial cell. Once a parametric study was done for<br />

HoG (Section 4.3), the same experiment was performed using the best set of parameters (Table<br />

4.9). The results are shown in Table 4.2 which confirms the previous behavior. A disadvantage<br />

of this result is that, even if the descriptor is holistic, there will be a need for a facial feature<br />

detector prior to its computation. This will inherit the problems caused by the detector.<br />

Table 4.1: Alignment Comparison for an initial set of parameters for a HoG descriptor: 12x12<br />

cells, 16 angle bins, range [0-360] ◦ , 50% overlap with block normalization<br />

LFW variants<br />

Not Aligned Aligned Funneled Aligned Commercial<br />

No further Alignment 0.7568 ± 0.0053 0.7408 ± 0.0067 0.8205 ± 0.0063<br />

Feature based alignment 0.8069 ± 0.0066 0.8093 ± 0.0063 0.8171 ± 0.0047



Table 4.2: Alignment Comparison for the final set of parameters: 16x16 cells, 16 angle bins,<br />

range [0-360] ◦ , 50% overlap with global normalization<br />

LFW variants<br />

Not Aligned Aligned Funneled Aligned Commercial<br />

No further Alignment 0.7660 ± 0.0061 0.7702 ± 0.0042 0.8432 ± 0.0062<br />

Feature based alignment 0.8276 ± 0.0051 0.8383 ± 0.0054 0.8357 ± 0.0058<br />

4.3 HoG parametric study<br />

We perform a parametric study for Histogram of Oriented Gradients based face recognition.

The evaluation follows the protocol established for LFW, i.e. evaluation using 10 fold cross-<br />

validation, and the results are reported as the mean and standard deviation of the accuracy<br />

over the 10 folds. Unless specified, the dataset used is LFW aligned commercial and the<br />

learning algorithm is LDML. As a search for the optimal parameters, considering all possible<br />

combinations, is almost intractable, we decided to optimize parameters one by one.<br />

4.3.1 Angle range<br />

As a first experiment we studied the effect of the angle range over the performance of the<br />

algorithm. To do that we set the rest of the parameters to a fixed value: 8 angle bins, as<br />

used for the SIFT descriptor [23], 8x8 cells and 50% overlap, using a block normalization. The<br />

experiment was repeated for the three variants of LFW to compare the results.<br />

It can be observed, from Table 4.3, that a range of [0-360] ◦ outperforms the range of [0-180] ◦ ,<br />

when combined with LDML. This is consistent for the three variants of LFW. Therefore, in the<br />

following experiments the default is a signed angle, i.e. a range of [0 − 360] ◦ .<br />

4.3.2 Normalization<br />

The three variants for normalization are described in Section 3.5.1. These are cell, block and<br />

global normalization. Fig. 4.1 shows examples of HoG descriptors, plotted over the original image.

Table 4.3: Angle range comparison for HoG. 8x8 cells, 8 angle bins, 50% overlap and block<br />

normalization<br />

LFW variants<br />

Angle Range Not Aligned Aligned Funneled Aligned Commercial<br />

[0 − 180] ◦ 0.7150 ± 0.0053 0.7077 ± 0.0052 0.7563 ± 0.0082<br />

[0 − 360] ◦ 0.7523 ± 0.0071 0.7495 ± 0.0054 0.8017 ± 0.0066



(a) (b) (c)<br />

Figure 4.1: HoG Normalization examples (a) Cell normalization (b) Block normalization and<br />

(c) Global Normalization<br />

Table 4.4: Normalization comparison for the HoG descriptor. Parameters: 16 angle bins, range<br />

[0-360] ◦<br />

Number of cells/Overlap(%)<br />

12/0 12/50 16/0 16/50<br />

Cell 0.7933 ± 0.0061 0.8128 ± 0.0077 0.7578 ± 0.0091 0.8178 ± 0.0061<br />

Block 0.8192 ± 0.0064 0.8305 ± 0.0064 0.8291 ± 0.0058 0.8385 ± 0.0074<br />

Global 0.8247 ± 0.0071 0.8283 ± 0.0068 0.8317 ± 0.0056 0.8432 ± 0.0062<br />

For cell normalization, as the norm is the same for each spatial bin, the relative changes in magnitude between different cells are lost, which diminishes the influence of strong gradients. However, it is very robust to non-uniform changes in illumination. In the case of global normalization, the important gradients that appear in regions such as the eyes, mouth and nose are emphasized, at the cost of a weaker resistance to illumination changes. Block normalization is the trade-off between the cell and global paradigms.

An experiment was performed in which the parameters were left unchanged, except for the normalization type, overlap and number of cells. The results, found in Table 4.4, show consistently that cell normalization gives the worst performance. Global normalization leads to results similar to block normalization; in most of the cases global is better, except for 12 cells with 50% overlap. Because of these results, and for its simplicity of computation, we take global normalization as the default for further experiments. The exception is the quantity of cells experiment, which was computed in parallel.
cells experiment, which was computed in parallel.



4.3.3 Quantity of cells<br />

Another important parameter to determine is the quantity of cells. Table 4.5 shows the experiments we performed changing only this parameter. Here we used 16 angle bins over a signed range, i.e. [0-360]°, with 0% overlap and global normalization. It can be observed that above 14 cells there is no significant variation, and below that value the results start to degrade. A reason why more than 14 cells brings no improvement might be that LDML starts to combine the information of finer cells as if they were coarser. More cells will not bring any improvement, but only generate larger descriptors: e.g. there is no significant difference in performance between 16 × 16 and 20 × 20, yet for 20 cells the descriptor size is almost doubled compared to 16 cells. Therefore, we decided to set 16 cells as our default value.

Table 4.5: Number of cells comparison for the HoG descriptor. 16 angle bins, range [0-360] ◦ ,<br />

0% overlap with block normalization<br />

Number of cells Accuracy<br />

10 0.8198 ± 0.0086<br />

12 0.8305 ± 0.0064<br />

14 0.8327 ± 0.0080<br />

16 0.8385 ± 0.0074<br />

18 0.8348 ± 0.0059<br />

20 0.8412 ± 0.0060<br />

22 0.8380 ± 0.0068<br />

4.3.4 Angle bins<br />

Angle bins refer to the quantity of partitions into which the angle range is split. Experiments were done to compare how the performance is affected by modifying the quantity of angle bins per cell. The results can be found in Table 4.6; it can be noticed that the maximum is found at 16 bins, which is therefore taken as the default for further experiments.

Table 4.6: Accuracy obtained using different angle bins for the HoG descriptor. Parameters<br />

16x16 cells, range [0-360] ◦ , 0% overlap with global normalization<br />

Angle bins<br />

8 12 16 20<br />

0.8230 ± 0.0049 0.8270 ± 0.0052 0.8317 ± 0.0046 0.8295 ± 0.0077



4.3.5 Overlap<br />

Table 4.7 shows the variation in accuracy as a function of the overlap, when the rest of the parameters are left unchanged. The maximum accuracy, 0.8432 ± 0.0062, was obtained for an overlap of 50%. However, the accuracy is not strongly affected for overlaps roughly between 10% and 60%.

It is important to remark that the cell size in pixels is a function of the overlap when the image size remains fixed. Therefore, to show that overlap is beneficial, an additional experiment was done: a 9x9 cells descriptor was created with no overlap. In this case, the cell size is similar to that of 16x16 cells using 50% overlap (≈ 11 pixels). The accuracy obtained was 0.8207 ± 0.0080, which is lower than using overlap. We argue that overlap is beneficial as it helps to correct misalignments due to problems in face detection or pose variations.

Table 4.7: Overlap comparison. Parameters 16x16 cells, 16 angle bins in the range [0-360] ◦ ,<br />

using global normalization<br />

Overlap (%)<br />

0 12.5 25 37.5 50 62.5 75<br />

0.8317 0.8423 0.8392 0.8412 0.8432 0.8415 0.8333<br />

±0.0056 ±0.0045 ±0.0054 ±0.0064 ±0.0062 ±0.0066 ±0.0052<br />

4.3.6 Multiscale HoG<br />

We also studied a multiscale HoG descriptor; in this case there are two parameters involved:

the number of scales and the rescaling factor. The results from Table 4.8 show that the use of a<br />

multiscale approach does not bring any significant contribution to the performance. The reason<br />

might be related to the fact that a coarser level of the pyramid is only a linear combination of<br />

the finer cells. This will cause LDML to ignore coarser levels, as the information of the finest<br />

level of the pyramid is enough.<br />

Table 4.8: Multiscale HoG performance<br />

Levels/k Number of cells<br />

12 14 16 18<br />

2/1.15 0.8317 ± 0.0067 0.8407 ± 0.0065 0.8435 ± 0.0074 0.8425 ± 0.0074<br />

2/1.30 0.8287 ± 0.0067 0.8375 ± 0.0059 0.8380 ± 0.0047 0.8453 ± 0.0068<br />

2/1.45 0.8355 ± 0.0056 0.8388 ± 0.0068 0.8413 ± 0.0063 0.8410 ± 0.0074<br />

3/1.15 0.8322 ± 0.0074 0.8423 ± 0.0061 0.8398 ± 0.0062 0.8397 ± 0.0063<br />

3/1.30 0.8312 ± 0.0057 0.8383 ± 0.0065 0.8397 ± 0.0062 0.8435 ± 0.0070<br />

3/1.45 0.8312 ± 0.0059 0.8360 ± 0.0073 0.8440 ± 0.0057 0.8420 ± 0.0058



4.4 Discussion<br />

The conclusion of this study is the identification of appropriate parameters for face recognition. The descriptor to be used has a 16x16 cell spatial grid with an overlap of 50%; the angle histograms are created using 16 bins covering a range from 0° to 360°, and the voting is done using soft assignment by linear interpolation. There is no need for a multiscale descriptor when using LDML as the classification algorithm.

Further improvements could be achieved by reducing large differences of occurrence between certain angle bins. For example, regions around the mouth are expected to always have a high occurrence of horizontal lines. Therefore a large fraction of the feature vector energy will be distributed over the angle bins corresponding to those gradients, overshadowing other bins with less occurrence. This problem is one of the main motivations for the work of Cao et al. [7], as this concentration of energy reduces the discriminative power of the descriptor.

A simple way to balance the energy is to define new descriptors x′ by computing the square root of the input descriptors, i.e. x′ = (√x0, √x1, ..., √x_{D−1})⊤. This is similar to the computation of the Hellinger distance d(x, y) = Σ_i (√xi − √yi)², but extended to handle inter-feature correlation through the Mahalanobis distance. This test brought the results from 0.8432 ± 0.0062 up to 0.8530 ± 0.0065 for the aligned commercial version of LFW. Notice that by using this method, the conclusion drawn for the multiscale HoG might not hold, as coarser cells would no longer be a linear combination of finer cells.

This result suggests that it would be interesting to study different strategies to distribute the energy of the descriptor. For example, instead of computing the square root, a parameter γ ∈ [0, 1] could be used to create a new feature vector x′ = (x0^γ, x1^γ, ..., x_{D−1}^γ)⊤. This is a generalization of the square root vector.
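A minimal sketch of this transform (the function name and default are illustrative; γ = 0.5 recovers the square-root descriptor discussed above, and the HoG entries are non-negative, so the power is well defined):

```python
import numpy as np

def power_transform(x, gamma=0.5):
    # gamma in [0, 1]; gamma = 0.5 gives x' = sqrt(x), gamma = 1 leaves x unchanged.
    return np.power(np.asarray(x, dtype=float), gamma)
```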

Table 4.9: Best found parameters for HoG based recognition<br />

Parameter Value

Cells 16<br />

Angle bins 16<br />

Overlap 50%<br />

Sign Signed, i.e. range: [0 − 360] ◦<br />

Normalization global<br />

Additional Square root of features<br />

Levels 1<br />

scaling (k) -


Chapter 5<br />

Facial feature based<br />

representations<br />

5.1 Motivation<br />

Pose and expression represent a major challenge for face recognition. They introduce non-linearity in the data, which might be difficult, or even impossible, to handle using linear algorithms. This includes metric learning approaches such as LDML.

A way to overcome this limitation is to design descriptors invariant to these factors. Feature based approaches have proven useful to build descriptors less sensitive to changes in pose, as they are built at each facial feature, regardless of their relative positions. Another alternative is to use non-linear machine learning algorithms; in this case, the non-linear data lying in a high dimensional space might become separable. In this chapter some experiments using non-linear strategies are presented.

Another challenge is to handle occlusions, a common problem in uncontrolled settings. We propose to separate the metric learning according to spatial regions, i.e. a specific metric is learned to classify a region of the face, deciding whether it depicts the same person or not, independently of the rest of the face. A classification is then made by combining the results given by each region. As an example, in the case of the HoG descriptor, this could be done by grouping neighboring cells. Each region can then be classified as occluded or not using outlier detection algorithms, so that in later stages of classification facial regions can be dismissed or kept accordingly.

Our first goal is to separate the training stage according to spatial regions, and to achieve<br />

similar results as using a global training (c.f. Section 3.8). We will consider each of the 9<br />




detected facial features as a spatial region. These are: left eye left, left eye right, right eye left,<br />

right eye right, nose left, nose center, nose right, mouth left and mouth right.<br />

The second goal is to classify, for each feature, whether it is an inlier or an outlier. The output will be a confidence value, a measure of "normality". This score represents how well the specific instance being classified fits a model given by the training data.

As a final step, it is desirable to include the confidence values into the classification. The<br />

goal is to reduce the influence of the occluded features and equally increase the influence of the<br />

observed features into the final decision.<br />

5.2 Feature wise classification<br />

In this case, the multiscale SIFT descriptor xi, obtained in Section 3.8, is split into 9 feature vectors. Let x_i^f denote the descriptor of feature f in image i. Then, a metric Mf is learned for each feature separately. To take a joint decision for the classification, we propose two approaches, described as follows.

Distance sum

A joint distance is obtained by adding the feature-wise distances and their bias terms. Both the metric and the bias term are learned using the LDML algorithm. This is shown in Eq. (5.1).

p(yi = yj | xi, xj) = σ( Σ_f bf − Σ_f dMf(xi, xj) )    (5.1)

Logistic regression

The problem with the distance sum approach is that it assumes every feature has the same contribution to the final decision. However, this is not the case. To confirm that assertion, we refer to the joint learning described in Section 3.8. If we take into account that the Mahalanobis distance can be seen as a weighted combination of the entries of the difference vector, i.e. (xi − xj)⊤ M (xi − xj) = Σ_u Σ_v m_uv (x_i^u − x_j^u)(x_i^v − x_j^v), then the magnitude of m_uv describes how significant the pair of entries uv is.

Fig. 5.1 plots the energy of the entries of M, which correlates the facial feature pairs according to a global learning, i.e. the entry at row u and column v shows how correlated facial feature u is with facial feature v. It can be noticed that there is higher energy on the diagonal, as expected. However, it is important to remark that the energy is not equally distributed over the diagonal, implying that some features are more important than others. Interestingly, the eyes are much more discriminative than the nose and the mouth.
others. Interestingly the eyes are much more discriminative than the nose and the mouth. We



Figure 5.1: Energy distribution for a joint learning of a facial feature based descriptor (rows and columns indexed by the 9 facial features: left eye left/right, right eye left/right, nose left/center/right, mouth left/right).

We assume this is a consequence of expression variation in the case of the mouth, and of pose affecting the nose. Based on these observations, we use logistic regression to find proper weights for each facial feature, as shown in Eq. (5.2).

p(yi = yj | xi, xj) = σ( w0 + Σ_f wf dMf(xi, xj) )    (5.2)
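The combination in Eq. (5.2) can be sketched as follows. The per-feature Mahalanobis distances are assumed to be precomputed (one column per facial feature), and scikit-learn's logistic regression is used purely for illustration; the thesis implementation was done in Matlab, and all names here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def combine_feature_distances(dist_train, y_train, dist_test):
    # dist_*: (num_pairs, 9) matrices of per-feature distances d_Mf(xi, xj);
    # y_train: 1 if the pair depicts the same person, 0 otherwise.
    # Negating the distances makes larger learned weights mean "more discriminative".
    clf = LogisticRegression()
    clf.fit(-np.asarray(dist_train), y_train)
    return clf.predict_proba(-np.asarray(dist_test))[:, 1]   # p(yi = yj)
```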

Table 5.1 shows the accuracy achieved for each facial feature separately, reported for one fold of the aligned commercial version of LFW. Additionally, the accuracy for both types of combination is given (for 1 fold). Notice how a single feature is not highly discriminative; however, it is better than a simple Euclidean distance classification using all the features (see Table 3.2). Furthermore, the performance of the algorithm is improved when the features are combined. In this case, there was no difference between the distance sum and logistic regression. However, in our experiments we noticed that logistic regression assigned higher weights to the eyes, followed by the nose, and lower weights to the mouth, which is consistent with the global learning, as depicted in Fig. 5.1. When running the algorithm for more folds, a difference appeared between the distance sum and logistic regression. The last two lines of Table 5.1 show the results for more folds, in which logistic regression gives an advantage over the distance sum; however, the difference is not significant.



Table 5.1: Results for separate facial feature learning

Number of folds  Feature              Accuracy
1                left eye left        0.7311
                 left eye right       0.7832
                 right eye left       0.7849
                 right eye right      0.7412
                 nose left            0.7597
                 nose center          0.7445
                 nose right           0.7378
                 mouth left           0.6840
                 mouth right          0.7143
                 Distance sum         0.8434
                 Logistic regression  0.8434
4                Distance sum         0.8170 ± 0.0050
                 Logistic regression  0.8191 ± 0.0055

5.2.1 Occlusion detection

To detect occlusions we adopt a discriminative model for each facial feature, in which the descriptor x_i^f is classified as normal or occluded. We profit from the already implemented appearance model used for the facial feature localization algorithm (c.f. Section 3.2).

The confidence value is modeled as p(f_i | I) = σ_{s,b}( p(a_i | F) / p(a_i | F̄) ), i.e. the output of the appearance model passed through a sigmoid function. This gives a probabilistic estimate of how well the feature fits the appearance model. Notice that the sigmoid function has two parameters, s and b, which are the slope and a bias. These parameters could be inserted into the learning; however, in our experiments we used s = 1 and b = 0 for simplicity. Fig. 5.2a shows some examples of correctly detected abnormalities (p(f_i | I) < 0.5). We found that this method not only detects outliers caused by objects occluding the facial feature, but is also useful to detect erroneous localizations, as shown in the last two images of Fig. 5.2a. For a pair of images Ii and Ij, a confidence vector qij is created as given in Eq. (5.3):

qij = ( p(f^1|Ii) × p(f^1|Ij), ..., p(f^9|Ii) × p(f^9|Ij) )⊤    (5.3)

To use the confidence values in the classification of an unseen pair of examples, we first tried to normalize qij such that its L1-norm equals 9, and then to multiply the distance of facial feature f by the corresponding entry of the normalized qij. However, this did not affect the performance much for the distance sum, and decreased the accuracy for the logistic regression weighting. Instead, we propose to use the confidence values not only for the distance, but also for the bias. In the case of the distance sum, the classification function from



Table 5.2: Feature combination comparison using confidence values. Results reported for 4<br />

folds<br />

Distance Sum Logistic Regression<br />

Normal 0.8170 ± 0.0050 0.8191 ± 0.0055<br />

Confidence weighting 0.8306 ± 0.0048 0.8272 ± 0.0050<br />

Eq. (5.1) is modified as shown in Eq. (5.5):

p(yi = yj | xi, xj) = σ( Σ_f q_ij^f bf − Σ_f q_ij^f dMf(xi, xj) )    (5.4)

= σ( Σ_f q_ij^f (bf − dMf(xi, xj)) )    (5.5)
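A small sketch of Eq. (5.5); the per-feature confidences q_ij^f, learned biases b_f and distances are assumed to be precomputed, and the function name is illustrative:

```python
import numpy as np

def confidence_weighted_probability(q, b, d):
    # q, b, d: arrays of length 9 holding the confidences q_ij^f, the learned
    # per-feature biases b_f and the distances d_Mf(xi, xj), respectively.
    z = np.sum(q * (b - d))          # Eq. (5.5)
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid
```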

This can be thought of as an adaptive threshold which is a function of the confidence of each facial feature, or as a confidence weighting of the disparity between the feature distance and its threshold. In the worst case, when a facial feature is entirely occluded, the confidence value removes its effect completely from the classification. For the logistic regression based combination, we assumed the learned bias value w0 can be split according to the learned weights. The classification function is shown in Eq. (5.6) and the results are given in Table 5.2.

Notice that the weights are learned for the classifier in Eq. (5.2), and the confidence values are inserted into the classification function only for the evaluation of the test set. However, it would be desirable to also use the confidence values in the training process, such that logistic regression finds the optimal weights for this task.

p(yi = yj | xi, xj) = σ( Σ_f q_ij^f w0 (wf / Σ_k wk) − Σ_f q_ij^f wf dMf(xi, xj) )    (5.6)

5.2.2 Discussion

In this section we have shown that separate learning can be done according to spatial regions, and that the distances can then be combined to make a joint decision. The results show that a single feature is not very discriminative, but their combination brings a significant improvement. We found there was no major difference between the distance sum approach and logistic regression.




Figure 5.2: Examples of outlier detections: red for occluded features and yellow for normal features (we refer the reader to the electronic version of the document for colors). (a) Correct outlier detections. (b) Wrong outlier detections.

Even though it was not possible to reach the same results as with a global learning, this proves that the separation can be done. A cause for this limitation might be that we cannot benefit from inter-feature correlation or, more importantly, that the distances are not comparable. One way to overcome this problem is to do a global learning in which the matrix M is restricted to be block diagonal. This is equivalent to learning each facial feature metric separately, but in a way in which the distances are comparable.

The results from Table 5.2 also show that the confidence value for each facial feature can be integrated effectively into the decision function. This makes it possible to handle occlusions and/or wrong localizations. We hope that, if the feature-wise learning becomes comparable to the global one, the confidence values will improve the accuracy even further.

A limitation of this algorithm is that it depends on a robust appearance model. As illustrated in Fig. 5.2b, the outlier detection algorithm might fail, giving false detections. We suggest it is necessary to implement an appearance model trained specifically for this task.

5.3 Non-linear approaches<br />

In this section we describe some experiments using non-linear algorithms. In every case the<br />

descriptor is the SIFT multiscale computed at the location of the facial features, same as in the<br />

previous section but without the separation into feature wise vectors.


Chapter 5: Facial feature based representations 42<br />

5.3.1 Spectral regression kernel discriminant analysis<br />

In this case we used SR-KDA (see Section 3.6.1) to find a non-linear projection of the input data such that discriminant information between the classes, i.e. the identities, is emphasized. In the target space, a linear classification algorithm can be used. We compared the Euclidean distance and LDML as classifiers. The results, computed over 10-fold cross-validation, are shown in Table 5.3.

When comparing these results with the baseline from Table 3.2, which was added to Table 5.3 as well, the contribution of SR-KDA becomes evident. This can be observed especially for Euclidean distance classification, which presents an increase in accuracy of 10%. For LDML there is a 1% gain over the baseline. These results show that if the Euclidean distance is used for classification, SR-KDA is effective to improve the accuracy significantly; for LDML the contribution is not that large. A limitation of SR-KDA is that it is computationally expensive for a large quantity of data and classes, which is the case for LFW.

Table 5.3: Results for using SRKDA projection of the input data. Results obtained for 10-fold<br />

cross-validation<br />

Euclidean distance LDML<br />

Not using SR-KDA 0.6845 ± 0.0051 0.8524 ± 0.0052<br />

Using SR-KDA 0.7883 ± 0.0029 0.8622 ± 0.0056<br />

5.3.2 Clustering<br />

In this section we describe another non-linear algorithm we explored, in which the input data is divided into clusters. This can be done before or after LDML learning. The intuition is that similar faces are expected to be grouped together in a cluster. Therefore, if a learning is done specifically for that cluster, similar data might be separated in a way which was not possible when using a global training. This is a divide and conquer strategy.

Pose adaptive classifier<br />

Following [7], described in Section 2.4, we build pose adaptive classifiers, where each pose<br />

is considered as a cluster. To assign a pose to an unseen example, a simple approach was<br />

implemented: three images were taken from the IMM database [25], one for each case: left (L),<br />

right (R) and frontal (F) pose. The identity, illumination and expression remained unchanged.<br />

For an unseen image, we assign the pose of the reference image for which there is a minimum<br />

Euclidean distance. This is the same approach as in [7] but with a different descriptor.



Once the images are clustered according to pose, an LDML classifier is trained for each possible pair of poses, i.e. the six classifiers: LL, LR (RL), LF (FL), RR, RF (FR), FF. Table 5.4 shows the results obtained for 5 out of 10 folds. For comparison, the accuracy of the baseline algorithm is also shown as global learning.

Table 5.4: Results for pose adaptive classification. Using 5 out 10 folds<br />

Accuracy<br />

Global learning 0.8504 ± 0.0049<br />

Pose adaptive 0.8358 ± 0.0026<br />

Table 5.5 shows the quantity of pairs from the test set which were assigned to each of the pose combinations. Additionally, the accuracy achieved for each pose combination separately is also presented. The results show that frontal-frontal (FF) classification remained similar to that of global learning. However, the results for the other combinations are not as good, with the worst case for pairs assigned to the left-right (LR) classifier.

Table 5.5: Obtained pose combination accuracy. Results reported for 5 folds<br />

Pose combination<br />

LL FF RR LF (FL) LR (RL) FR (RF)<br />

Number of pairs 14 2302 16 310 23 290<br />

Accuracy 0.7833 0.8564 0.6524 0.7979 0.7517 0.8275<br />

Unsupervised clustering<br />

In this case the data is projected using the matrix L learned by the LDML algorithm. In the new space, we explored the different clustering strategies described in Table 5.6. Here, the objective is to train a classifier for each cluster separately, not to train classifiers for pairs of clusters. For an unseen pair of examples, if the images are assigned to different clusters, they are classified as having different identities. If they are assigned to the same cluster, the decision is made by the LDML classifier trained for that specific cluster.

Table 5.6: Clustering algorithms<br />

Identifier Description<br />

KM Standard k-Means.<br />

S KM k-Means by adding supervision. At the assign step of the k-Means algorithm,<br />

points belonging to the same class are assigned to the same cluster.<br />

GMM Gaussian Mixture Model.



In terms of computation time, this represents an improvement, as the LDML complexity is quadratic with respect to the quantity of points. Therefore, splitting the data into k clusters and training k classifiers, each using n/k points, makes the algorithm k times faster. However, as Table 5.7 shows, this approach did not give good results. The reported accuracy is the ratio of positive pairs which are assigned to the same cluster. This value is used to measure the performance of the clustering because positive pairs assigned to different clusters are labeled as having different identities without possible correction. It can be considered an upper bound on performance, and therefore the expected results are lower than those obtained when doing an unclustered training. Due to the low clustering accuracy, we did not proceed to compute the metric for each cluster.

Table 5.7: Ratio of positive pairs assigned to the same cluster.<br />

Algorithm Number of clusters<br />

3 4 5 10 30<br />

KM 0.75676 0.69932 0.63851 0.59797 0.34122<br />

S KM 0.71622 0.67905 0.65203 0.55405 0.32095<br />

GMM 0.87162 0.79054 0.79392 0.59122 0.31757<br />

Mixture model classification<br />

The final clustering approach is to do a soft assignment using a Gaussian Mixture Model<br />

(GMM), where the covariance is restricted to be diagonal. In this case a classifier is trained<br />

for every combination of clusters. There is no gain in efficiency of computation, but there<br />

is a “finer” learning, which might capture different information than a global learning. The<br />

classification function for an unseen pair is given in Eq. (5.7).

p(yi = yj | xi, xj) = Σ_u Σ_v p(u|xi) p(v|xj) p(yi = yj | xi, xj; Muv, buv)    (5.7)

where p(k|x) is the posterior probability of x belonging to cluster k, taken directly from the GMM. The parameters Muv and buv are learned using the set of points from the training data that belong either to cluster u or to cluster v after making a hard assignment (MAP). Table 5.8 shows the results; comparing with the baseline, which has an accuracy of 0.8672 for the first fold, we can deduce that there is no gain in trying to make a finer classification.
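For illustration, Eq. (5.7) can be evaluated as sketched below, assuming the GMM posteriors and one LDML classifier per cluster pair are already available; all names are hypothetical placeholders.

```python
import numpy as np

def mixture_pair_probability(post_i, post_j, pair_classifiers, xi, xj):
    # post_i[u] = p(u | xi), post_j[v] = p(v | xj) from the GMM;
    # pair_classifiers[(u, v)](xi, xj) returns p(yi = yj | xi, xj; Muv, buv),
    # e.g. the LDML probability learned for the cluster pair (u, v).
    k = len(post_i)
    prob = 0.0
    for u in range(k):
        for v in range(k):
            prob += post_i[u] * post_j[v] * pair_classifiers[(u, v)](xi, xj)
    return prob
```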

5.3.3 Discussion<br />

In this section, some non-linear approaches for face recognition were presented, using feature-based descriptors.



Table 5.8: Accuracy obtained over 1 fold using a GMM Model<br />

Number of clusters<br />

2 3 4<br />

0.8574 0.8387 0.8454<br />

The experiments with SR-KDA showed that there is indeed a gain from using non-linear algorithms to separate the input data. A simple classification such as the Euclidean distance is improved significantly, by more than 10% in accuracy; however, for LDML the gain was of only 1%. The computational cost is another factor to be taken into account, as SR-KDA is computationally expensive for a large quantity of classes and data.

When using clustering approaches, there is a problem due to the large quantity of positive pairs assigned to different clusters. This effect can be reduced only by diminishing the quantity of clusters being considered. However, the results showed that even if this is reduced to 2 or 3 clusters, the loss of positive pairs in the clustering is too high. The only way to overcome this limitation is to use a soft assignment via a mixture model. In this case, the results show that it is similar to a global learning. Thus, there is no gain in using this approach.


Chapter 6<br />

Combining face representations<br />

In previous chapters, it has been demonstrated that a good recognition rate can be achieved by<br />

learning a proper metric, with algorithms such as LDML. The feature vectors representing the<br />

face can be either a HoG encoding, or SIFT descriptors in the location of each facial feature.<br />

In this chapter, as a last experiment, it is demonstrated that the performance of classification can be improved even further by combining the distances of all the descriptors.

To combine the descriptors, two approaches were explored, both in a logistic framework. In the first case, a global distance is obtained by adding the distances of each descriptor; the learned biases are combined as well, and both terms are passed through a sigmoid function. Let x_i^f denote the feature vector of type f for face i, where f can be either the SIFT descriptors computed at 3 scales at the location of the facial features, a HoG descriptor using the parameters from Table 4.9, or the facial feature patches described in Section 2.2. Mf denotes the learned metric for feature f. This approach is shown in Eq. (6.1).

p(yi = yj | x_i^1, ..., x_i^F, x_j^1, ..., x_j^F) = σ( Σ_{f=1}^{F} bf − Σ_{f=1}^{F} dMf(x_i^f, x_j^f) )    (6.1)

The other way to combine the features is by using logistic regression. In this case the sum from Eq. (6.1) becomes a linear combination of the distances. The weight assigned to each feature type and the joint bias term are learned using the logistic regression algorithm. From the training examples, a large set of pairs is created, from which the distances are computed using the metrics learned in the previous experiments. This set of distances is then used as the training examples for the logistic regression, see Section 3.6.2. The decision function is given in Eq. (6.2).




p(yi = yj | x_i^1, ..., x_i^F, x_j^1, ..., x_j^F) = σ( w0 + Σ_{f=1}^{F} wf dMf(x_i^f, x_j^f) )    (6.2)

6.1 Results for LFW

Table 6.1 shows the results obtained for the LFW Aligned Commercial dataset. For comparison, the results for each feature trained separately are shown as well. In most of the cases, the accuracy is higher than for the individual features. Based on these results we can conclude that there is complementary information given by holistic and facial feature based descriptors.

In the case of combining the multiscale SIFT and the facial feature patches there was a decrease in accuracy. Notice that, for the same case, the standard deviation increased significantly. The reason for this decrease is that the weights found by the logistic regression are quite different between the folds, although their relative proportions are maintained, i.e. as expected the multiscale SIFT is given a larger weight than the facial feature patches. This causes a problem when a global threshold is selected for all the folds, as done in the accuracy computation. To correct this problem, a regularization was added such that the L2-norm of the weight vector (without including the bias) is constant. The results in Table 6.1 show that this strategy corrects the problem.

Notice that the results are not very different between the distance sum and logistic regression. Especially for the HoG and multiscale SIFT combination, the reason is that logistic regression assigns nearly the same weights to both descriptors. When the facial feature patches descriptor was added, the learning process assigned a low weight to it, reducing its contribution. This was confirmed in our experiments.

What is important to remark is that the highest gain comes from the combination of the HoG descriptor (holistic) with the multiscale SIFT (facial feature based).

Table 6.1: Results for the combination of descriptors in the LFW benchmark.
Descriptors (+ used, − not used): SIFT Multiscale / HoG (squared) / Feature Patches.
Combination type columns: Distance sum / LR / LR (Regularized).

+ − −    0.8524 ± 0.0052
− + −    0.8530 ± 0.0065
− − +    0.7385 ± 0.0046
− + +    0.8607 ± 0.0054    0.7901 ± 0.0119    0.8600 ± 0.0060
+ − +    0.8536 ± 0.0053    0.8154 ± 0.0152    0.8515 ± 0.0052
+ + −    0.8766 ± 0.0050    0.8749 ± 0.0049    0.8759 ± 0.0052
+ + +    0.8719 ± 0.0058    0.8746 ± 0.0047    0.8724 ± 0.0059



[ROC plot omitted; axes: false positive rate (x) vs. true positive rate (y); curves: SIFT_FF + HoG (aligned), LDML+MkNN (funneled), Multishot combined (aligned).]

Figure 6.1: Receiver Operating Characteristic curve for the HoG and SIFT combination on the LFW benchmark. Results for [16] and [36] also shown.

The facial feature patches do not bring any additional contribution, as the information they represent is already encoded by the SIFT multiscale descriptor. The performance of this algorithm is among the state of the art for the LFW benchmark. The ROC curve for the HoG and SIFT combination is shown in Fig. 6.1.

Our results are comparable with the performance of the state of the art algorithms reported for the unrestricted paradigm of LFW. These methods reached accuracies of 0.8517 ± 0.0061 [36], 0.8750 ± 0.0040 [16] and 0.8950 ± 0.0051 [36].

6.2 Results for PubFig

We tested the algorithm on the PubFig dataset [22]. In this case, however, the pipeline included face detection and a facial feature based alignment prior to the computation of the descriptors. The facial feature patches were discarded due to the results obtained for LFW.

The problem is that the results are not directly comparable to the ones reported in [22], because some of the images have been removed from their original locations. For that reason we do not follow the training protocol, which is defined as a "restricted" paradigm. Instead we train using the label information, so that many more training pairs can be generated. However, we keep using 10-fold cross-validation for evaluation.
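As an illustration of how such pairs could be generated from the label information, the hypothetical helper below enumerates positive pairs exhaustively and samples negative pairs at random; the names and the number of negatives are assumptions, not the thesis protocol.

```python
import itertools
import random

def make_pairs(images_by_identity, n_negatives=10000, seed=0):
    """Build training pairs from identity labels.

    images_by_identity -- dict: identity -> list of image identifiers
    Positive pairs are all within-identity combinations; negative pairs are
    sampled at random across different identities.
    """
    rng = random.Random(seed)
    positives = [(a, b)
                 for imgs in images_by_identity.values()
                 for a, b in itertools.combinations(imgs, 2)]
    identities = list(images_by_identity)
    negatives = []
    while len(negatives) < n_negatives:
        id_a, id_b = rng.sample(identities, 2)
        negatives.append((rng.choice(images_by_identity[id_a]),
                          rng.choice(images_by_identity[id_b])))
    return positives, negatives
```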

We take advantage of the separation of the test sets according to illumination, expression and pose, which allows us to observe how sensitive our algorithm is to these variations. The training images are the same, but we test only on the specified benchmark.

Table 6.2 shows the results for all the PubFig benchmarks; the combination algorithm is the distance sum. Logistic regression for the combination of features gave practically the same results because, again, the learned weights are practically equal.

From the results it can be concluded that our algorithm is sensitive to pose changes, as there is a difference of almost 5% between the posefront and poseside benchmarks. The same happens with the light benchmarks, with a difference of almost 4% between lightfront and lightside. In the case of expression, there was almost no difference. The ROC curves are illustrated in Fig. 6.2.

Table 6.2: Results for the different variants of the PubFig dataset

  Dataset               Accuracy
  pubfig full           0.7763 ± 0.0068
  pubfig posefront      0.8111 ± 0.0139
  pubfig poseside       0.7656 ± 0.0108
  pubfig lightfront     0.7875 ± 0.0125
  pubfig lightside      0.7485 ± 0.0080
  pubfig exprneutral    0.7733 ± 0.0128
  pubfig exprexpr       0.7759 ± 0.0072

[ROC plot omitted; axes: false positive rate (x) vs. true positive rate (y); one curve per benchmark: Pubfig_full, Pubfig_posefront, Pubfig_poseside, Pubfig_lightfront, Pubfig_lightside, Pubfig_exprneutral, Pubfig_exprexpr.]

Figure 6.2: Receiver Operating Characteristic curves for the combination of HoG and SIFT descriptors on the PubFig benchmarks.


Chapter 7

Conclusions and future work

In this thesis we compared two robust descriptors for face recognition in uncontrolled settings. The first one is a Histogram of Oriented Gradients, computed over the entire face, and therefore a holistic approach. For the HoG descriptor we found a suitable set of parameters which gave good performance on the Labeled Faces in the Wild benchmark. It was concluded that the use of a face alignment is crucial when it is combined with a metric learning algorithm such as LDML. This alignment must be robust in terms of translations, in such a way that the facial features of the pair of images being compared are localized approximately in the same spatial cell. The coordinates of the facial features can be used to obtain a transformation which aligns the face into the desired pose. However, there is still a need to improve the alignment algorithm and/or the facial feature point localization.
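As an illustration of this idea, the sketch below estimates a least-squares 2-D similarity transform from detected facial feature coordinates to a canonical template (the closed-form Procrustes/Umeyama solution); it is only one possible realisation of such an alignment, not the exact algorithm used in the thesis.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares 2-D similarity transform mapping the detected facial
    feature points `src` onto canonical template points `dst` (both arrays of
    shape (n_points, 2)), following the closed-form Umeyama solution."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    s_c, d_c = src - mu_s, dst - mu_d
    cov = d_c.T @ s_c / len(src)           # cross-covariance of centred points
    U, S, Vt = np.linalg.svd(cov)
    d = np.ones(2)
    if np.linalg.det(U @ Vt) < 0:          # avoid an unwanted reflection
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt
    scale = (S * d).sum() * len(src) / (s_c ** 2).sum()
    A = scale * R
    t = mu_d - A @ mu_s
    return A, t                            # maps a point p to A @ p + t
```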

The second visual feature vector we studied is a multiscale SIFT descriptor, computed at the locations of the facial features, and therefore a feature based approach. This strategy gave good performance when combined with LDML. We concluded that it is possible to train a separate metric for each facial feature and then combine their distances to make a joint decision. Even though the results were not as good as for a global learning, it opened the door to handling occlusions. We obtained a confidence value for each facial feature from a discriminative appearance model. This is a measure of how reliable the information of the descriptor is, i.e. how likely it is that the feature is not occluded or badly localized. The confidence level was successfully integrated into the decision function, which increased the accuracy.
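One plausible way such confidences could enter the decision function is sketched below, where each per-feature term is simply weighted by its confidence; the function and weighting scheme are illustrative assumptions, not the exact formulation used in the thesis.

```python
import numpy as np

def occlusion_aware_probability(distances, confidences, biases):
    """Combine per-facial-feature distances into a same-person probability,
    down-weighting features whose appearance model reports low confidence
    (e.g. an occluded or badly localized feature contributes less).

    distances, confidences, biases -- length-F arrays (one entry per feature).
    """
    d = np.asarray(distances, dtype=float)
    c = np.asarray(confidences, dtype=float)
    b = np.asarray(biases, dtype=float)
    score = np.sum(c * (b - d))
    return 1.0 / (1.0 + np.exp(-score))
```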

We also studied non-linear methods. The clustering strategies did not give good results, whether based on pose, on unsupervised clustering or on a Gaussian Mixture Model. However, algorithms such as SR-KDA are able to find non-linear discriminant information in the data; SR-KDA gave a slight increase in performance, at the expense of being more computationally expensive. These results suggest that it would be interesting to study other types of non-linear algorithms further.

Finally, we demonstrated that the distances given by different descriptors can be integrated to boost the performance of the face recognition pipeline. The obtained performance is comparable with the state of the art for the Labeled Faces in the Wild and Public Figures benchmarks.

The PubFig benchmark shows that our algorithm is highly sensitive to pose and illumination changes. In the case of illumination, this means that the normalization being used is not sufficiently invariant to this factor, and it is therefore necessary to address this issue.


Bibliography

[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face recognition with local binary patterns. Pages 469–481, 2004.

[2] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach, 2000.

[3] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:711–720, 1997.

[4] C. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st ed. 2006, corr. 2nd printing edition, October 2007.

[5] G. Bradski. The OpenCV library. Dr. Dobb's Journal of Software Tools, 2000.

[6] D. Cai. Efficient kernel discriminant analysis via spectral regression. Technical report, 2007.

[7] Z. Cao, Q. Yin, X. Tang, and J. Sun. Face recognition with learning-based descriptor. In Proc. Computer Vision and Pattern Recognition, 2010.

[8] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In European Conference on Computer Vision, 2006.

[9] J. Davis, B. Kulis, S. Sra, and I. Dhillon. Information-theoretic metric learning. In NIPS 2006 Workshop on Learning to Compare Examples, 2007.

[10] M. Everingham, J. Sivic, and A. Zisserman. "Hello! My name is... Buffy" - automatic naming of characters in TV video. In BMVC, 2006.

[11] M. Everingham, J. Sivic, and A. Zisserman. Taking the bite out of automatic naming of characters in TV video. Image and Vision Computing, 27(5), 2009.

[12] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. Int. J. Comput. Vision, 61(1):55–79, 2005.

[13] A. Ferencz, E. Learned-Miller, and J. Malik. Learning hyper-features for visual identification. In Neural Information Processing Systems, volume 18, 2004.

[14] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37, 1995.

[15] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Automatic face naming with caption-based supervision. In Conference on Computer Vision & Pattern Recognition, pages 1–8, June 2008.

[16] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In International Conference on Computer Vision, September 2009.

[17] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1735–1742, Washington, DC, USA, 2006. IEEE Computer Society.

[18] G. Huang and V. Jain. Unsupervised joint alignment of complex images. In ICCV, 2007.

[19] G. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

[20] A. Kläser. Human detection and character recognition in TV-style movies. In Informatiktage, pages 151–154, 2007.

[21] N. Kumar, P. N. Belhumeur, and S. K. Nayar. FaceTracer: A search engine for large collections of images with faces. In European Conference on Computer Vision (ECCV), pages 340–353, October 2008.

[22] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In IEEE International Conference on Computer Vision (ICCV), October 2009.

[23] D. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, 2004.

[24] B. Moghaddam. Bayesian face recognition. Pattern Recognition, 33(11):1771–1782, November 2000.

[25] M. M. Nordstrøm, M. Larsen, J. Sierakowski, and M. B. Stegmann. The IMM face database - an annotated dataset of 240 face images. Technical report, Informatics and Mathematical Modelling, Technical University of Denmark, DTU, Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby, May 2004.

[26] E. Nowak and F. Jurie. Learning visual similarity measures for comparing never seen objects. In Conference on Computer Vision & Pattern Recognition, June 2007. See also http://lear.inrialpes.fr/people/nowak/.

[27] P. J. Phillips, P. Flynn, T. Scruggs, K. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek. Overview of the face recognition grand challenge. Pages 947–954, 2005.

[28] P. J. Phillips, W. T. Scruggs, A. O'Toole, P. Flynn, K. Bowyer, C. Schott, and M. Sharpe. FRVT 2006 and ICE 2006 large-scale experimental results. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:831–846, 2010.

[29] S. Phimoltares, C. Lursinsap, and K. Chamnongthai. Face detection and facial feature localization without considering the appearance of image context. Image Vision Comput., 25(5):741–753, 2007.

[30] N. Pinto, J. J. DiCarlo, and D. D. Cox. Establishing good benchmarks and baselines for face recognition. In Faces in Real-Life Images Workshop at ECCV08, 2008.

[31] N. Pinto, J. J. DiCarlo, and D. D. Cox. How far can you get with a modern face recognition test set using only simple features? In Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 0:2591–2598, 2009.

[32] F. Porikli. Integral histogram: A fast way to extract histograms in cartesian spaces. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 829–836, 2005.

[33] S. Rizvi, P. J. Phillips, and H. Moon. The FERET verification testing protocol for face recognition algorithms, 1999.

[34] J. Shi and C. Tomasi. Good features to track, 1994.

[35] J. Sivic, M. Everingham, and A. Zisserman. "Who are you?": Learning person specific classifiers from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[36] Y. Taigman, L. Wolf, and T. Hassner. Multiple one-shots for utilizing class label information. In The British Machine Vision Conference (BMVC), September 2009.

[37] X. Tan and B. Triggs. Enhanced local texture feature sets for face recognition under difficult lighting conditions. In Analysis and Modelling of Faces and Gestures, volume 4778 of LNCS, pages 168–182. Springer, October 2007.

[38] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.

[39] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. Proc. CVPR, 1:511–518, 2001.

[40] X. Wang and X. Tang. A unified framework for subspace face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 26(9):1222–1228, 2004.

[41] K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, 2009.

[42] L. Wolf, T. Hassner, and Y. Taigman. Descriptor based methods in the wild. In Real-Life Images Workshop at the European Conference on Computer Vision (ECCV), October 2008.

[43] L. Wolf, T. Hassner, and Y. Taigman. Similarity scores based on background samples. In Asian Conference on Computer Vision (ACCV), 2009.

[44] M. Yang. Face recognition using kernel methods, 2001.

[45] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: a literature survey. ACM Comput. Surv., 35(4):399–458, 2003.

[46] J. Zhu, L. Van Gool, and S. Hoi. Unsupervised face alignment by robust nonrigid mapping. In IEEE International Conference on Computer Vision, 2009.

[47] Q. Zhu, M. Yeh, K. Cheng, and S. Avidan. Fast human detection using a cascade of histograms of oriented gradients. In CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1491–1498, Washington, DC, USA, 2006. IEEE Computer Society.
