Thesis - VIBOT congrat page
Thesis - VIBOT congrat page
Thesis - VIBOT congrat page
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Robust Face Descriptors in Uncontrolled Settings<br />
Kenneth Alberto Funes Mora<br />
LEAR Team<br />
INRIA Rhône-Alpes<br />
Supervisors<br />
Cordelia Schmid, Jakob Verbeek and Matthieu Guillaumin<br />
A <strong>Thesis</strong> Submitted for the Degree of<br />
MSc Erasmus Mundus in Vision and Robotics (<strong>VIBOT</strong>)<br />
· 2010 ·
Abstract<br />
Face Recognition is known to be a difficult problem for the computer vision community.<br />
Factors such as pose, expression, illumination conditions and occlusions, among others, span<br />
a very large set of images that can be generated by a single person. Therefore the automatic<br />
decision of whether a pair of images depict the same person or not, in uncontrolled settings,<br />
becomes a highly challenging problem.<br />
Due to the large quantity of potential applications, over the past years many algorithms<br />
have been proposed, which can be separated into three categories: holistic, facial feature based<br />
and hybrid. Even though some algorithms have achieved a high accuracy, there is still the need<br />
for a significant improvement to achieve robustness in uncontrolled conditions while achieving<br />
a high computational efficiency.<br />
In this thesis we explore the use of a Histogram of Oriented Gradients as a holistic descriptor.<br />
The experimental results show that a considerable performance is achieved when a proper set<br />
of parameters are combined with a prior face alignment. The classification function is given by<br />
a metric learning algorithm, i.e. an algorithm which finds the best Mahalanobis distance that<br />
separates the input data.<br />
Additionally a facial feature based descriptor is presented, which is the concatenation of<br />
SIFT descriptors, computed in the location of interest points found by a facial feature detection<br />
algorithm. More importantly, a method to handle occlusions is proposed, where a confidence<br />
is obtained from each facial feature and later combined into the classification function. Also,<br />
non-linear strategies for face recognition are discussed.<br />
Finally it is shown that there is complementary information between both descriptors, as<br />
their combination improves the performance such that it becomes comparable to the current<br />
state of the art algorithms.
Contents<br />
Acknowledgments iii<br />
1 Introduction 1<br />
1.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2<br />
1.2 Outline and contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4<br />
2 Related work 5<br />
2.1 Marginalized k-Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . 5<br />
2.2 Automatic naming of characters in TV video . . . . . . . . . . . . . . . . . . . . 6<br />
2.3 Attribute and simile descriptor for face identification . . . . . . . . . . . . . . . . 8<br />
2.4 Face recognition with learning based descriptor . . . . . . . . . . . . . . . . . . . 10<br />
2.5 Multiple one-shots using label class information . . . . . . . . . . . . . . . . . . . 12<br />
3 The face recognition pipeline 14<br />
3.1 Face detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14<br />
3.2 Facial features localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15<br />
3.3 Face alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17<br />
3.4 Preprocessing for illumination invariance . . . . . . . . . . . . . . . . . . . . . . . 19<br />
3.5 Face descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20<br />
3.6 Learning/Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23<br />
3.7 Datasets and evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26<br />
i
3.8 Baseline performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27<br />
4 Histogram of Oriented Gradients for face recognition 29<br />
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29<br />
4.2 Alignment comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30<br />
4.3 HoG parametric study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31<br />
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />
5 Facial feature based representations 36<br />
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36<br />
5.2 Feature wise classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37<br />
5.3 Non-linear approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41<br />
6 Combining face representations 46<br />
6.1 Results for LFW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47<br />
6.2 Results for PubFig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48<br />
7 Conclusions and future work 50<br />
Bibliography 55<br />
ii
Acknowledgments<br />
First of all I want to thank all the people who brought the Vibot program into existence and<br />
that every year work very hard for its improvement. The coordinators Fabrice Meriadeau,<br />
David Fofi, Joaquim Salvi, Jordi Freixenet, Robert Martí and Yvan Petillot and every single<br />
one of the lecturers and administrative staff. Without your effort and initiative we would not<br />
be here.<br />
To my supervisors: Cordelia Schmid, Jakob Verbeek and Matthieu Guillaumin. I feel very<br />
thankful for receiving me in the LEAR team, and for your valuable guidance, which helped me<br />
to grow in knowledge and experience. As well to all the members of the LEAR team, for their<br />
friendship and for making these months a very gratifying and enrichful experience.<br />
To all my Vibot colleages, I have learned too many things from every single one of you. The<br />
cultures you were representing, your different world views and your experience. It is one of the<br />
things that I would never forget of the Vibot program. It helped to grow in so many ways that<br />
I can not express. The world is a small place but it contains great people. Your friendship will<br />
be always alive and I hope we will be meeting in the future.<br />
I want to thank my friends at home, who have been in contact with me all this time. Always<br />
willing to listen, always willing to advice, always willing to talk. Definitely a true friend is not<br />
separated by the distance. You guys know who you are. . .<br />
I want to thank my family, my parents Carlos Funes and Ruth Mora, and my brother<br />
Michael Funes for their support at the distance and encouranging words in moments of need,<br />
¡Papá, Mamá, Michael, los amo enormemente!, ¡Gracias!.<br />
I would like to thank my God and saviour Jesus Christ, you are the source of my strength<br />
and my motivation, you take me by the hand when I need it the most. Thank you. . .<br />
years.<br />
Last but not least, to the European Commission for funding my studies during these two<br />
iii
Chapter 1<br />
Introduction<br />
Face Recognition can be divided in two main applications: Face Identification and Face Verifi-<br />
cation. The former refers to the association between a set of probe faces and a gallery, in order<br />
to determine the identity of each of the exemplars from the probe set. The latter refers to the<br />
decision of whether a pair of face instances correspond or not to the same person. This defini-<br />
tion is different to that of visual identification [13], where the term identification is used for the<br />
pair matching problem. It can be noticed that the face verification problem is wider, in such<br />
way that the face identification task can be formulated by solving face verification subproblems.<br />
Within this thesis we focus on face verification. Therefore, the goal is to design an algorithm<br />
to automatically decide, whether a pair of unseen face images, depict the same person or not.<br />
It is a supervised classification problem, in which the decision function is trained based on a set<br />
of example faces labeled with identities, or pairs of face images labeled as similar or dissimilar.<br />
The availability of a solution to this problem is highly attractive for its many applications.<br />
It comprises fields such as entertainment, smart cards, information security, law inforcement,<br />
surveillance, etc. [45]. Within the context of scene interpretation, we want to be able to auto-<br />
matically determine what is happening in an image or a video [35]. Face recognition is highly<br />
valuable as it helps to determine the question of Who is in the scene [11,20]. This will open<br />
the possibility for applications such as categorization, retrieval and indexing based on iden-<br />
tity [15,16]. The use of face recognition technology is becoming more and more visible, e.g. the<br />
recent launch of tools for automatic face labeling in sites such as Picasa 1 .<br />
More than 35 years in research have generated many algorithms [1,3,7,11,15,16,22,26,35,<br />
36,38,40,42,45] and benchmarks [19,22,27,33], which have pushed face recognition to achieve<br />
outstanding results, a proof is the current availability of commercial software [28]. In general,<br />
this software is designed for the case in which the person cooperates in the image acquisition<br />
1 Picasa Web Albums, http://picasaweb.google.com/<br />
1
Chapter 1: Introduction 2<br />
(a) (b) (c)<br />
(d) (e) (f)<br />
Figure 1.1: Face variations due to: (a) Viewpoint changes (b) Illumination variation (c) Occlusions<br />
(d) Expression (e) Age variations (f) Image quality<br />
in a controlled environment, and therefore there are no major changes in illumination, pose,<br />
expression, etc. However, face recognition in uncontrolled settings, from still images and videos,<br />
is still an unsolved problem. Despite the large amount of research carried out, a significant<br />
improvement is still required in order to achieve robustness in such settings.<br />
The main challenge is that a single person can virtually generate an infinite number of<br />
images. This is due to the many factors that influence the image acquisition. Among the most<br />
important are: major pose or viewpoint changes, including scaling differences, variations in the<br />
illumination conditions, the possibility of occlusions due to sunglasses, hats and other objects,<br />
differences in expression,aging, changes in hair and facial hair and image quality. Figure 1.1<br />
shows examples of how this factors affect the resulting image.<br />
1.1 Problem definition<br />
Even though many algorithms can be found in the literature, a general pipeline can be identified,<br />
shown in Fig. 1.2. Its steps are intended to overcome the challenges previously mentioned. Face<br />
detection is the first step, it defines a bounding box for the location and scale of the face. Then<br />
three optional steps can be applied: alignment, facial feature localization and/or preprocessing<br />
to gain invariance to illumination. The goal is to build a visual descriptor that can be used as<br />
the input for machine learning algorithms. These algorithms are capable of classifying a pair<br />
of examples as belonging to the same individual or not. Three categories of algorithms can be<br />
identified: Holistic, Feature based and Hybrid approaches.
3 1.1 Problem definition<br />
Face<br />
detection<br />
Alignment<br />
(optional)<br />
Facial features<br />
extraction<br />
(optional)<br />
Illumination<br />
normalization<br />
(optional)<br />
Figure 1.2: General face recognition pipeline<br />
Visual feature<br />
extraction<br />
Face<br />
identification<br />
Holistic face description methods consider the face image as a whole to build the descriptor.<br />
Examples of such approaches are the subspace learning algorithms, where a face is represented<br />
as a point in a high dimensional space, with the intensity of each pixel as one dimension,<br />
followed by the use of techniques such as Principal Component Analysis (Eigenfaces) [38] or<br />
Linear Discriminant Analysis (Fisherfaces) [3]. In such cases, the objective is to project the data<br />
into a lower dimensional space where most of the information is maintained (PCA) or the dis-<br />
criminant information between different classes (people) is emphasized (LDA) when computing<br />
the projection matrix. Bayesian methods also fall into this category, refering to those meth-<br />
ods that generate a Maximum a Posteriori (MAP) estimation of a intrapersonal/extrapersonal<br />
classifier [24].<br />
Aditionally, proposals has been presented to unify Bayesian approaches with Eigenfaces<br />
and Fisherfaces [40]. These algorithms have shown to provide good results under controlled<br />
conditions, using benchmarks such as the FERET database [33]. However, they are not suitable<br />
for uncontrolled settings, where high non-linearities are introduced, e.g. as a result of major<br />
pose changes, and are sensitive to the localization given by the face detector.<br />
Proposals have been presented to improve the performance in uncontrolled conditions, by<br />
creating more complex descriptors than simply the set of pixel values, e.g. using Local Binary<br />
Patterns [1] or by extending subspace learning to handle non-linear data, using the kernel<br />
trick [6, 44]. Additionally through methods specialized in non-linear dimension reduction, by<br />
learning an invariant mapping [17]. In this thesis, a holistic approach based on Histogram of<br />
Oriented Gradients (HoG) will be presented in Chapter 4.<br />
Feature based face description algorithms are grounded in the localization of a set of facial<br />
features, such as the position of the mouth, the eyes, the nose, etc, after face detection [11,29].<br />
A descriptor is built using the location information. In the past years, algorithms based on<br />
Facial Features localization have gained a growing attention [7,10,11,16,22], as they are less<br />
sensitive to pose variations and misalignments introduced by the face detector.<br />
Therefore they are appropriate for the face recognition tasks in uncontrolled settings. How-<br />
ever, the facial feature localization itself is still problematic, and needs further improvements.<br />
In this thesis a feature-based algorithm using multiscale SIFT [16, 23] will be presented, and
Chapter 1: Introduction 4<br />
compared to the Holistic approach based on HoG descriptors.<br />
Hybrid face description methods combine holistic and feature based paradigms, through<br />
either early or late fusion. Early fusion refers to the case in which descriptors are combined<br />
into one using aggregation methods, such as concatenation of the feature vectors. In this case,<br />
the information is combined prior to classification. Late fusion makes a classification based on<br />
each descriptor, and their corresponding scores are combined into one, to make a more robust<br />
decision. In this thesis we use a late fusion method, which combines the HoG and multiscale<br />
SIFT descriptors.<br />
1.2 Outline and contributions<br />
In Chapter 2 different state of the art algorithms are described in detail. These were identified<br />
for being the current state of the art for challenging benchmarks such as the Labeled Faces in<br />
the Wild [19] dataset, or because they were an important influence for our work. In Chapter 3<br />
there is a detailed description of the face recognition pipeline from Fig. 1.2. Each of the stages<br />
are described, together with algorithms for their implementation.<br />
The first contribution is given in Chapter 4, where we explore the use of a Histogram of<br />
Oriented Gradients descriptor for face recognition. We show in this chapter that an alignment<br />
robust regarding translations is necessary to obtain a good performance. Furthermore, we<br />
identify set of parameters for which a highest accuracy is achieved.<br />
Our second contribution, described in Chapter 5, is related with feature based algorithms.<br />
We propose a strategy in which learning is done for each facial feature, after which we combine<br />
them through late fusion. Even though this does not help the overall performance, it is good<br />
to handle occlusions. This is done by detecting outliers based on a discriminative appearance<br />
model. The occlusion information is later on inserted into the classification function.<br />
The third contribution is showed in Chapter 6, where we combine the use of HoG and mul-<br />
tiscale SIFT representations through late fusion. This combinations increases the performance<br />
of the algorithm such that it is comparable to the state of the art. Finally, in Chapter 7, we<br />
give a summary of our work pointing out the main conclusions, from which we define our future<br />
work.
Chapter 2<br />
Related work<br />
In Chapter 1 different face recognition algorithms were mentioned. We identified a few methods<br />
that have given promising results in uncontrolled settings, and are recognized as the state of<br />
the art. These algorithms are described in more detail in this chapter.<br />
2.1 Marginalized k-Nearest Neighbors<br />
Guillaumin et al. proposed the use of metric learning approaches for face recognition [16], more<br />
specifically, Logistic Discriminant Metric Learning (LDML), an algorithm that searches for<br />
the best Mahalanobis distance between pairs of feature vectors, explained in more detail in<br />
Section 3.6.2.<br />
Even though LDML has proven to be effective, any Metric Learning algorithm will generate<br />
a linear transformation of the input space. However data, for face recognition, is believed to<br />
be highly non linear, due to major changes in pose and expression. Therefore, metric learning<br />
approaches might not be able to effectively separate the classes. To overcome this problem,<br />
Guillaumin et al. proposed a modification of k-Nearest Neighbors (k-NN). In k-NN classification,<br />
an unseen example is assigned to the class with most occurrence within its k neighbors, that<br />
are defined according to some measure, e.g. minimim Euclidean distance.<br />
If n i c denote the quantity of neighbors of xi belonging to class c. Then the probability of xi<br />
to be of class c is estimated as p(yi = c|xi) = n i c/k. The proposal is to classify the pair (xi,xj)<br />
as belonging to the same class by marginalizing over all the possible classes within the training<br />
set. This is shown in Eq. (2.1).<br />
p(yi = yj|xi,xj) = �<br />
c<br />
p(yi = c|xi)p(yj = c|xj) = 1<br />
k 2<br />
5<br />
�<br />
n c in c j<br />
c<br />
(2.1)
Chapter 2: Related work 6<br />
This result can be thought as a binary k-Nearest Neighbors classifier in the implicit space<br />
of N 2 pairs. This can be observed in Fig. 2.1, where for each point of the pair to be classified,<br />
their k neighbors are selected and then the vote is given by all the pairs that can be generated<br />
from their neighbors, divided by the quantity of possible pairs Eq. (2.1).<br />
The descriptors used in [16] were Local Binary Patterns (LBP) [42] and SIFT [23], computed<br />
at 3 scales in the locations given by the facial feature localization algorithm, i.e. the corner of<br />
the eyes, nose and mouth. The metric used to define the neighborhood was given by a Large<br />
Margin Nearest Neighbors [41]. An algorithm designed to find a metric specifically optimized<br />
for the k-NN problem.<br />
xi<br />
B<br />
A<br />
C<br />
12 pairs<br />
24 pairs<br />
6 pairs<br />
C<br />
6 pairs<br />
Figure 2.1: Marginalized K Nearest Neighbors [16]<br />
2.2 Automatic naming of characters in TV video<br />
Everingham et al. [11] considered the problem of automatic naming of characters in video. They<br />
combined information such as subtitles and scripts to determine which characters are present<br />
in the scene and when. Using visual information are able to associate a name to each character<br />
for certain tracks. These tracks are used as well to generate a set of training examples for<br />
a face recognition algorithm, used to determine the identity of characters from the remaining<br />
unlabeled tracks.<br />
In this case, the problem is simpler in terms of face recognition, tracking can be used to<br />
associate faces in a sequence of frames. Moreover, video can easily generate a large amount of<br />
training examples, and generally, there is a small amount of characters to recognize.<br />
The first step is to align the script (dialogue-character) with the subtitles (dialogue-timing)<br />
to determine which characters are talking and when. Then they proceed to obtain face tracks,<br />
that are face detections linked as the same person over a group of not necessarilly sequential<br />
frames. This is done using a Kanade-Lucas-Tomasi (KLT) tracker [34], this algorithm uses a<br />
interest point detector for the first frame and then propagates the points over the following<br />
A<br />
xj<br />
B
7 2.2 Automatic naming of characters in TV video<br />
(a) (b)<br />
Figure 2.2: (a)Example of face tracking to build the training set (b) Features Patches extraction<br />
[11]<br />
frames. Based on the tracked interest points, which follow a path intersecting face detections,<br />
the creation of the face tracks are obtained as seen in Fig. 2.2a. The face tracking is done<br />
separately for each shot of the whole video, where a change of shot is detected by thresholding<br />
the difference of color histograms between succesive frames. Notice that this simplifies the<br />
problem of face matching and no real face recognition is done yet.<br />
In order to build a face descriptor, the facial feature detector, described in detail in Section<br />
3.2 is used. The pixel values surrounding each localization are extracted, as showed in Fig.2.2b,<br />
normalized to have zero mean and unitary variance, in order to acquire photometric invariance.<br />
Using the localization of the mouth, a speaker detection is used, simply by computing the<br />
variation of the mouth pixels in sequential frames and thresholding. Additionally to facial<br />
information, clothing information is used, with a color histogram for a bounding box below the<br />
face detection. Finally knowing which face track is speaking and associating it with the script<br />
and subtitle information, a set of face tracks can be properly labeled with an identity. These<br />
tracks can be used as training examples for a classification problem, in order to label the rest<br />
of the face tracks that could not be labeled in the previous steps.<br />
To label the rest of face tracks, a similarity measure comparing two characters combines<br />
facial and clothing information, as given in Eq. (2.2)<br />
�<br />
S(pi,pj) = exp − df(pi,pj)<br />
2σ2 � �<br />
exp −<br />
f<br />
dc(pi,pj)<br />
2σ2 �<br />
c<br />
(2.2)<br />
Taking into account this similarity measure, a classification based on Nearest Neighbors or<br />
Support Vector Machines can be used to label the rest of face tracks in the video. More details
Chapter 2: Related work 8<br />
can be found in [11].<br />
Table 2.1: Low level features parameters for a single trait classifier<br />
Pixel Value Types Normalization Aggregation<br />
RGB(r) None(n) None(n)<br />
HSV (h) Mean-Normalization (m) Histogram (h)<br />
Image Intensity (i) Energy-Normalization (e) Statistics (s)<br />
Edge Magnitude (m)<br />
Edge Orientation (o)<br />
2.3 Attribute and simile descriptor for face identification<br />
The work presented by Kumar et al. [22] has presented one of the best results for the Labeled<br />
Faces in the Wild benchmark, when using the “restricted” protocol (explained in Section 3.7.1).<br />
They presented two separate strategies: the attribute and the simile classifier.<br />
2.3.1 Attribute descriptor<br />
The attribute classifier algorithm is based on the idea that a person’s identity can be infered<br />
from a set of high level attributes, such as gender, age, race, etc. The result is a descriptor<br />
with entries according to each of the attributes, as shown in Fig. 2.3a. Each trait is determined<br />
using the algorithm in [21]: the face image is divided into regions, as shown in Fig. 2.3c. The<br />
aim is to have a set of low level features that are created by the combination of a region, using<br />
a specific pixel value type, normalization and aggregation. The options are listed in Table 2.1.<br />
The selection of which combinations to use is trait dependent.<br />
Kumar et al. proposed to use forward feature selection to know which low-level features to se-<br />
lect for a given trait. Then a SVM classifier with RBF Kernel is trained concatenating the useful<br />
low-level features. In [22], the low level descriptor is defined as F(I) = 〈f1(I),f2(I),...,fk(I)〉<br />
where fi(I) represent the feature i of image I, a selection from Table 2.1. The attribute descrip-<br />
tor is build using the output of the trait classifiers as xi = 〈C1(F(Ii)),C2(F(Ii)),...,Cn(F(Ii))〉.<br />
Finally the recognition function is given in Eq. (2.3)<br />
f(Ii,Ij) = D(xi,xj) (2.3)<br />
With D(xi,xj) as a classification function, described in Section 2.3.3, such that the output<br />
is positive for the same identity and negative for different identities.
9 2.3 Attribute and simile descriptor for face identification<br />
(a) (b)<br />
(c)<br />
Figure 2.3: (a) Descriptor based on high level attributes (b) Training examples for the attributes<br />
(c) Face Regions for the attribute classifiers [21]<br />
2.3.2 Simile descriptor<br />
A problem with the attribute classifier is that a significant amount of annotation must be<br />
done, and only features that can be described with words such as gendre must be used. Simile<br />
descriptors are based on the intuition of describing a person based on similarities with reference<br />
individuals. For example: “Nose similar to subject 1” and “Mouth Not similar to subject 2”. To<br />
create such description, a set of reference face images was created. A classifier is trained based<br />
on at least 600 positive examples for each feature and at least 10 times more negative examples.<br />
The final descriptor is depicted in Fig.2.4a, while Fig.2.4b show some training examples.<br />
For a pair of unseen examples, their respective simile feature vectors, xi and xj, are com-<br />
puted. Then a classifier is used to take the decision of whether they depict the same person<br />
(Eq. (2.4)).<br />
2.3.3 Verification classifier<br />
f(Ii,Ij) = D(xi,xj) (2.4)<br />
Both Eq.(2.3) and Eq.(2.4) use the same algorithm, which is a Support Vector Machine classifier<br />
optimized to give higher importance to the sign than to the absolute value of the entries of the
Chapter 2: Related work 10<br />
(a) (b)<br />
Figure 2.4: (a) Descriptor based on similarity of features (b) Training examples for the features<br />
descriptor. This is done based on the observation that the trait classifiers are designed to be<br />
binary outputs, in the range [−1,1].<br />
To do that they proposed to generate pairs pi = (|ai − bi|,ai.bi)g( 1<br />
2 (ai + bi)), where<br />
ai = Ci(I1), bi = Ci(I2) and g(z) is a Gaussian weighting. The concatenation of all the pairs<br />
generate the feature vector that is used for an SVM RBF classifier. Even though these algo-<br />
rithms have both achieved outstanding results for Labeled Faces in the Wild, they do not follow<br />
the strict evaluation protocol as they use training data not available in the Labeled Faces in<br />
the Wild dataset. It also has the disadvantage of using a large set of classifiers just to build the<br />
descriptor. This is not desirable in terms of computational efficiency.<br />
2.4 Face recognition with learning based descriptor<br />
Recently, Cao et.al [7] introduced a novel method which is comparable to the best performing<br />
algorithms for Labeled Faces in the Wild. It brings two main contributions, the first one is<br />
that there is no manually defined descriptor, but a proper encoding is learned specifically for<br />
facial images, in an unsupervised manner. The second contribution consist in a pose dependent<br />
classification.<br />
As illustrated in the top part of Fig. 2.5b, the descriptor is learned as follows: a sampling<br />
method is defined in which, for every pixel, its neighbors are retrieved in a predefined pattern,<br />
to form a low level vector. Examples can be observed in Fig. 2.5a where different options for<br />
patterns are presented. The sampling is done for every pixel in the image, for all the images in<br />
the training set, and therefore each pixel will have an associated low level feature vector.<br />
A vector quantization algorithm is used, which might be K-Means, PCA-tree or random-<br />
projection tree. Empirically they found that random-projection tree gives better results. The
11 2.4 Face recognition with learning based descriptor<br />
(1)<br />
(3)<br />
R 1<br />
R 1<br />
(2)<br />
R 1<br />
R 2<br />
(a)<br />
(4)<br />
R 1<br />
R 2<br />
Preprocessed<br />
image<br />
*<br />
Landmark<br />
detection<br />
R 2<br />
R 1<br />
Sampling and<br />
normalization<br />
left eye<br />
.<br />
.<br />
.<br />
nose<br />
left eye<br />
.<br />
.<br />
.<br />
nose<br />
Component<br />
alignment<br />
d d d 1,1 1,2 1,w<br />
d d d 2,1 2,2 2,w<br />
{ }<br />
d h,1 d h,2<br />
Normalized low-level<br />
feature vectors<br />
DoG<br />
d h,w<br />
LE<br />
descriptor<br />
extractor<br />
LE descriptor<br />
extraction<br />
•<br />
• •<br />
• •• •<br />
(b)<br />
Learning-based<br />
encoding<br />
{<br />
{<br />
left eye<br />
.<br />
.<br />
.<br />
nose<br />
left eye<br />
.<br />
.<br />
.<br />
nose<br />
Component<br />
representaion<br />
Code image<br />
s 2<br />
s 9<br />
... s1<br />
Component<br />
similarity vector<br />
PCA and<br />
normalization<br />
Concatenated<br />
patch histogram<br />
Pose<br />
evaluation<br />
Pose-adaptive<br />
classifier<br />
Pose-adaptive<br />
face similarity<br />
LE descriptor<br />
Face<br />
verificaton<br />
Figure 2.5: (a) Sampling patterns for Learning-based Descriptor. Neighboring pixels are sampled<br />
in a circular pattern [7] (b) Face Recognition with Learning-based Descriptor [7]. The top<br />
part shows the pipeline used to learn the face encoding (descriptor). The bottom section shows<br />
the overall pipeline, showing the pose adaptive recognition.<br />
quantization will transform the low-level features into a single code, as shown in Fig. 2.5b, which defines a code image. Then a spatial grid is defined and, for each cell, a histogram of the occurrence of codes is created. All the histograms are then concatenated to form the final vector. However, depending on the size of the grid and the predefined number of codes, this histogram might be very large; therefore PCA is used to reduce its size.
Surprisingly, they showed empirically that the discriminative power is even higher after the dimensionality reduction, and improves further by simply normalizing the projected vector. They also show how different sampling patterns can be combined to boost performance, as they may retrieve complementary information. It is important to remark that the best results were obtained not with a holistic descriptor but with feature localization: the encoding is computed for each facial feature independently, and the alignment is done per component rather than globally.
Besides the encoding, an adaptive matching was used, in which three exemplar images, with left, frontal and right pose, were selected. For an unseen image, the similarity of its descriptor is computed against each of the exemplars, and the assigned pose is that of the exemplar with the highest similarity. A classifier was trained for each combination of poses (left-left, frontal-frontal, right-right, left-frontal, left-right, frontal-right), such that, depending on the inferred combination of poses of the input images, the corresponding classifier
Chapter 2: Related work 12<br />
is used. Their results showed that this also improves the classification accuracy.
2.5 Multiple one-shots using label class information<br />
This method, introduced by Taigman et al. [36], is based on the one-shot similarity score (OSS). The OSS score is computed as follows: a set of face examples A is obtained, which must be disjoint, in terms of identity, from the images to be compared. Then, if a pair of images xi and xj is to be classified, first a discriminative classifier fi is trained using image xi as the single positive example and the set A as the negative examples. The process is repeated for xj to obtain a classifier fj. The OSS score is the average of the cross classification, i.e. s = (fi(xj) + fj(xi))/2.
The work of Taigman et al. [36] extends this method to benefit from label information. The proposal is to split the set A according to identity, yielding subsets Ai, i = 1, 2, ..., n, and then to create one OSS score from each subset, building a multiple one-shot vector. The motivation is to obtain classifiers which are more discriminative towards identity than towards other factors, such as pose. If a subset Ai contains images of only one person, with variation in factors such as pose and expression, then the classifier is more likely to discriminate identity. If instead a factor such as pose is constant within the subset Ai, the OSS score will be discriminative towards pose rather than identity; however, they argue this information is still beneficial when combining a large set of OSS scores into the multiple one-shot vector. Accordingly, they also created subsets of images sharing the same pose to generate additional OSS scores.
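As an illustration, the OSS computation can be sketched with an LDA-style closed-form one-shot classifier; the following numpy snippet is a simplified toy version (the regularization constant and the exact linear form are illustrative assumptions, not the precise formulation of [36]):

```python
import numpy as np

def one_shot_classifier(x_pos, A):
    """Train a linear one-shot classifier: a single positive example x_pos
    vs. the negative set A (rows are examples). The direction points from
    the negative mean towards the positive example, whitened by the
    (regularized) covariance of the negative set."""
    mu_A = A.mean(axis=0)
    Sw = np.cov(A, rowvar=False) + 1e-3 * np.eye(A.shape[1])
    w = np.linalg.solve(Sw, x_pos - mu_A)
    b = -w @ (x_pos + mu_A) / 2.0       # threshold halfway between the two
    return lambda x: w @ x + b

def oss_score(xi, xj, A):
    """One-shot similarity: average of the two cross-classifications,
    s = (fi(xj) + fj(xi)) / 2."""
    fi = one_shot_classifier(xi, A)
    fj = one_shot_classifier(xj, A)
    return 0.5 * (fi(xj) + fj(xi))
```

Note that the score is symmetric by construction, since swapping xi and xj only swaps the two terms of the average.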
The pipeline for this algorithm can be observed in Fig. 2.6, and it is described as follows: the<br />
two images being compared are aligned, using a similar strategy to that of Section 3.3.2, from<br />
Figure 2.6: The multiple one-shot pipeline
which they create a feature vector. They experimented with densely sampled SIFT, Local Binary Patterns (LBP), and the three-patch and four-patch LBP [42]. PCA is later used to reduce the dimensionality of the descriptor. Then Information Theoretic Metric Learning (ITML) is used to learn a Mahalanobis distance d(xi, xj) = (xi − xj)⊤S(xi − xj), which yields a distance above a certain threshold for negative pairs while keeping the distance below another threshold for positive pairs [9]. The learned matrix can be factorized using a Cholesky decomposition, S = G⊤G, from which the matrix G is used to project the feature vectors. In the new space, computing the Euclidean distance is equivalent to computing the Mahalanobis distance in the original space. The metric and the PCA projection are obtained from the training set prior to the computation of the OSS scores.
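The equivalence between the learned Mahalanobis metric and a Euclidean distance after projecting with G can be verified numerically; in this sketch the positive-definite matrix S is synthetic, standing in for a learned metric:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic positive-definite matrix S standing in for a learned metric.
B = rng.normal(size=(4, 4))
S = B.T @ B + 0.1 * np.eye(4)

# Cholesky factorization S = G.T @ G (numpy returns a lower-triangular L
# with S = L @ L.T, so we take G = L.T).
G = np.linalg.cholesky(S).T

xi, xj = rng.normal(size=4), rng.normal(size=4)

# Mahalanobis distance in the original space...
d_mahalanobis = (xi - xj) @ S @ (xi - xj)
# ...equals the squared Euclidean distance after projecting with G.
d_euclidean = np.sum((G @ xi - G @ xj) ** 2)

assert np.isclose(d_mahalanobis, d_euclidean)
```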
Finally, for a pair of face images to be classified, their feature vectors, projected using the matrix G, are used to generate multiple OSS scores using the subsets Ai; these are concatenated into a vector which is fed into an SVM classifier.
This algorithm currently has the highest accuracy reported for the Labeled Faces in the Wild benchmark under the “unrestricted” protocol, explained in Section 3.7.1. However, notice that the computation of OSS scores is very expensive, as many different discriminative models have to be trained in order to create the multiple OSS score vector.
Chapter 3<br />
The face recognition pipeline<br />
In this chapter the pipeline depicted in Fig. 3.1 is discussed in more detail. The function of<br />
each stage is described, and relevant algorithms for their implementation are presented.<br />
3.1 Face detection<br />
Face detection is the search for the location and scale of instances of human faces within an arbitrary image. Again, the difficulty is to perform well in the presence of the factors that affect images acquired in uncontrolled conditions (c.f. Fig. 1.1). Viola & Jones [39] proposed an efficient algorithm for face detection, based on Haar wavelet features and a cascade of classifiers selected by the Adaboost algorithm.
Adaboost [14] is an algorithm designed to create a “strong classifier” from a set of “weak classifiers” through their linear combination. The algorithm iteratively selects, from the space of weak classifiers, the one which minimizes a weighted error over the training data. The weight assigned to the selected classifier depends on its error and, at each iteration, the distribution over the training examples is updated so that misclassified examples are given higher importance in the following iterations.
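The reweighting loop can be sketched as follows; for brevity this toy version uses one-dimensional threshold stumps as weak classifiers instead of Haar-feature classifiers:

```python
import numpy as np

def adaboost(X, y, n_rounds=10):
    """Minimal AdaBoost with decision stumps. X: (n, d), y: labels in {-1, +1}.
    Returns a list of (feature, threshold, polarity, alpha) stumps."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)            # distribution over training examples
    stumps = []
    for _ in range(n_rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for f in range(d):
            for thr in np.unique(X[:, f]):
                for pol in (+1, -1):
                    pred = pol * np.where(X[:, f] >= thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, pred)
        err, f, thr, pol, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak classifier
        stumps.append((f, thr, pol, alpha))
        # Reweight: misclassified examples gain importance.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
    return stumps

def predict(stumps, X):
    score = sum(a * p * np.where(X[:, f] >= t, 1, -1) for f, t, p, a in stumps)
    return np.sign(score)
```

The exhaustive stump search stands in for the much larger Haar-feature space scanned by the real detector; the reweighting and the classifier weights alpha follow the standard discrete AdaBoost formulation.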
Face detection → Alignment (optional) → Facial feature extraction (optional) → Illumination normalization (optional) → Visual feature extraction → Face identification
Figure 3.1: General face recognition pipeline
Figure 3.2: Viola-Jones object detection based on Haar features. (a) Examples of Haar features. (b) Feature computation from the integral image: notice that the area marked as D can be computed using the points 1, 2, 3 and 4 from the integral image, D = 4 + 1 − (2 + 3) [39]
Their algorithm has the advantage of providing a fast way to evaluate the Haar wavelets by precomputing what is called an integral image, Eq. (3.1). This is possible due to the rectangular geometry of the Haar wavelets (Fig. 3.2a), whose responses can then be computed by adding a few terms from the integral image (Fig. 3.2b). This is an important asset for detection, as different Haar filters must be computed at many locations and scales within the probe image.
Ĩ(x, y) = Σ_{x′≤x, y′≤y} I(x′, y′)   (3.1)
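A minimal numpy sketch of the integral image and the four-lookup rectangle sum of Fig. 3.2b:

```python
import numpy as np

def integral_image(I):
    """Integral image of Eq. (3.1): sum of I over all pixels (x', y') with
    x' <= x and y' <= y, computed with two cumulative sums. Zero-padded on
    the top/left so rectangle sums need no boundary checks."""
    ii = np.zeros((I.shape[0] + 1, I.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = I.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of I over a rectangle using 4 integral-image lookups:
    D = 4 + 1 - (2 + 3) in the notation of Fig. 3.2b."""
    b, r = top + height, left + width
    return ii[b, r] + ii[top, left] - ii[top, r] - ii[b, left]
```

Any rectangular Haar response then reduces to a handful of such lookups, independently of the rectangle size.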
While this algorithm is widely used because of its accuracy and speed, the implementation used in this thesis is an extension of the Viola-Jones algorithm. Besides Haar features, Histogram of Oriented Gradients (HoG) features (see Section 3.5.1) are used. The advantage of using HoG features is that the same concept of the integral image can be applied, by creating the integral histogram [32, 47]. This strategy boosts the speed of the algorithm, which benefits in terms of robustness from the use of additional features. Fig. 3.3 shows some examples of face detections.
3.2 Facial features localization<br />
Facial feature point localization is the first step in feature-based algorithms. Its robustness is crucial for performance. The detector used in this thesis is the one from [11], which is an improvement over the pictorial structure model [12]. The algorithm maximizes the following measure:
Figure 3.3: (a) Correct face detections (b) Example of missed detection due to large pose<br />
variation (c) Incorrect detections due to a cluttered region<br />
p(F | p1, ..., pn) ∝ p(p1, ..., pn | F) ∏_{i=1}^{n} [ p(ai | F) / p(ai | F̄) ]   (3.2)
Eq. (3.2) shows the probability of having the set of features F given a localization (p1, ..., pn). This is proportional to the probability of such a localization (whether the relative positioning of the points is plausible according to the expected geometry), multiplied, for each feature, by the ratio between the probability of observing the appearance ai given that the feature is present and the probability of observing that appearance given that the feature is not present. The appearance model assumes mutual independence between all the facial features, as well as independence from their localization, which is why it appears as a product. Eq. (3.2) can thus be understood as the combination of two models: one for the relative localization of the features and another for their appearance.
The appearance ratios are modeled using a binary classifier trained with feature/non-feature examples. It uses Haar wavelets and Adaboost for the combination of the weak classifiers given by the Haar features, following exactly the same algorithm as in Section 3.1, and its output is substituted directly into Eq. (3.2). The localization, on the other hand, is modeled with a tree-structured Gaussian mixture in which the covariance dependencies form a tree. Each covariance depends on its parent node, as shown in Fig. 3.4, where nodes 2, 3 and 4 are drawn with an uncertainty relative to their parent node (1).
The combination of both models yields highly reliable localization, able to cope with large pose variations. It also handles occlusions, as the expected positions compensate for appearance problems.
As discussed in [12], the tree structure of the Gaussian mixture model allows for efficient algorithms to maximize Eq. (3.2), and using the Viola-Jones algorithm for appearance modeling speeds up the overall method as well.
Figure 3.4: Tree-like Gaussian Mixture Model for the localization of facial features
3.3 Face alignment<br />
Many recognition algorithms rely on the ability of the face detector to give a standard location<br />
and scale for the face. However, this is not always the case, standard face detectors such as<br />
Viola-Jones’s, and the one used for this project, give poorly aligned images. This is the trade-<br />
off between having the ability to detect faces with large changes in pose and expression with<br />
alignment and localization. In order to compesate those misalignments, different algorithms<br />
have been proposed to bring an arbitrary facial image to a canonical pose, in which facial<br />
features can be more easily compared. Recent algorithms have been proposed for non-rigid<br />
transformations, such that proper positioning of the facial features are infered, despite the<br />
pose, see Zhu et al. [46]. In this section, two algorithms restricted to rigid transformations are<br />
described.<br />
3.3.1 Funneling<br />
In 2007, Huang et al. [18] introduced a technique called unsupervised joint alignment. This algorithm models an arbitrary set of images (in this case, face images) as a distribution field, i.e. a model in which every pixel in the image is a random variable Xi with possible values from an alphabet χ, for example the set of pixel intensities of an 8-bit gray-scale image, χ = {0, 1, ..., 255}. Each pixel Xi is then assigned a distribution over χ.
The first step of the algorithm, which can be considered as training, is called congealing. It computes the empirical distribution for each pixel, based on the stack of images, i.e. the empirical distribution field. Then, for each image, it finds a transformation (e.g. an affine transformation) such that the entropy over the distribution field is minimized. It then recomputes the empirical distribution field for the transformed images and iterates until
convergence.<br />
Figure 3.5: Congealing example [18]. The panels show the distribution fields at iterations 1, 2, ..., n.
Fig. 3.5 illustrates the idea of congealing. The distribution field is formed by a stack of 1D binary images, i.e. χ = {0, 1}. At each iteration, a horizontal translation is chosen for each image such that the overall entropy is reduced. As a result, at iteration n, the images are at positions such that they are considered aligned.
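A toy numpy sketch of congealing on a stack of 1D binary images, as in Fig. 3.5 (greedy integer shifts instead of full affine transformations):

```python
import numpy as np

def field_entropy(images):
    """Sum over pixels of the entropy of the empirical distribution of
    binary values at that pixel (the distribution field)."""
    p = images.mean(axis=0)                      # P(X_i = 1) per pixel
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.sum(-p * np.log(p) - (1 - p) * np.log(1 - p)))

def congeal_1d(images, shifts=range(-2, 3), n_iter=5):
    """Greedy congealing on a stack of 1D binary images: repeatedly pick,
    for each image, the horizontal shift that most lowers the entropy of
    the resulting distribution field (shift 0 is a candidate, so entropy
    never increases)."""
    images = images.copy()
    for _ in range(n_iter):
        for k in range(len(images)):
            candidates = [np.roll(images[k], s) for s in shifts]
            scores = []
            for cand in candidates:
                trial = images.copy()
                trial[k] = cand
                scores.append(field_entropy(trial))
            images[k] = candidates[int(np.argmin(scores))]
    return images

# Misaligned "bars": the same 3-pixel pattern at different offsets.
base = np.array([0, 0, 1, 1, 1, 0, 0, 0], dtype=float)
stack = np.stack([np.roll(base, s) for s in (-1, 0, 1, 2)])

aligned = congeal_1d(stack)
assert field_entropy(aligned) <= field_entropy(stack)
```

The real algorithm optimizes continuous affine parameters and, in [18], works on quantized SIFT responses rather than raw intensities; the coordinate-descent structure is the same.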
Notice that congealing can be used directly to align a set of face images. However, it cannot be applied to an unseen example unless the new image is inserted into the training set and congealing is run again. Funneling is an efficient way around this: the idea is to keep the sequence of distribution fields produced at each iteration of congealing, and to choose a sequence of transformations for the new image based on those stored fields. In [18], instead of using pixel values, SIFT descriptors were computed at each pixel location; k-means is then used to obtain 12 clusters, which serve as the alphabet χ.
3.3.2 Facial features coordinates based alignment<br />
Another strategy consists in using the output of the facial feature localization, i.e. the coordinates, to infer the affine transformation which brings the facial feature points to a canonical pose, one that is shared among all the images.
Let x_f = (x^f_0, x^f_1, 1)⊤ be the homogeneous coordinates of feature f in a non-aligned image, and y_f = (y^f_0, y^f_1)⊤ the desired coordinates for the same feature. We want to obtain the affine transformation A (2 × 3) such that y_f = A x_f. To obtain the six parameters of A only three features are needed; however, in order to compensate for wrong localizations, all the features can be used to obtain the set of parameters which minimizes the least-squares error in localization.
Define A′ as the vector with the entries of A, and Y as the vector with the target coordinates, Y = (y^0_0, y^0_1, ..., y^{F−1}_0, y^{F−1}_1)⊤. Finally, the matrix X, with the input coordinates of all the features, is defined as shown in Eq. (3.3).
Figure 3.6: Examples of Facial Features based alignment<br />
X = ⎛ x^0_0      x^0_1      1   0          0          0 ⎞
    ⎜ 0          0          0   x^0_0      x^0_1      1 ⎟
    ⎜ ⋮                                                 ⎟
    ⎜ x^{F−1}_0  x^{F−1}_1  1   0          0          0 ⎟
    ⎝ 0          0          0   x^{F−1}_0  x^{F−1}_1  1 ⎠   (3.3)
Then, with the new variables, y_f = A x_f becomes Y = XA′, whose least-squares solution is given in Eq. (3.4):
A′ = (X⊤X)^{−1} X⊤Y   (3.4)
Figure 3.6 shows some examples of alignments obtained using this strategy. The disadvantage of this approach is that facial feature localization algorithms are rather slow and are affected by large pose changes, which can lead to wrong alignments. Furthermore, a single canonical pose is not suitable for major changes in viewpoint. Within this work, the target coordinates were obtained by averaging over the set of training examples.
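The least-squares fit of Eqs. (3.3)-(3.4) can be sketched in a few lines of numpy; the feature coordinates here are synthetic:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine fit (Eqs. 3.3-3.4): src and dst are (F, 2)
    arrays of feature coordinates; returns the 2x3 matrix A such that
    dst_f ≈ A @ [src_f, 1]."""
    F = src.shape[0]
    X = np.zeros((2 * F, 6))
    X[0::2, 0:2], X[0::2, 2] = src, 1.0    # rows for the y_0 coordinates
    X[1::2, 3:5], X[1::2, 5] = src, 1.0    # rows for the y_1 coordinates
    Y = dst.reshape(-1)                    # (y^0_0, y^0_1, ..., y^{F-1}_1)
    A_prime, *_ = np.linalg.lstsq(X, Y, rcond=None)   # solves Y = X A'
    return A_prime.reshape(2, 3)

# Recover a known transform from 5 noiseless correspondences.
rng = np.random.default_rng(3)
A_true = np.array([[1.1, 0.2, 5.0],
                   [-0.1, 0.9, -2.0]])
src = rng.normal(size=(5, 2))
dst = src @ A_true[:, :2].T + A_true[:, 2]
A_est = fit_affine(src, dst)
assert np.allclose(A_est, A_true)
```

With noisy detections the same call returns the parameters minimizing the squared localization error, which is the role it plays in the alignment stage.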
3.4 Preprocessing for illumination invariance<br />
In uncontrolled conditions, the illumination setup in which the image was acquired might have a drastic influence on the obtained descriptor. Optionally, a preprocessing stage is desirable, in which the effect of illumination conditions, local shadowing and highlights is removed, while preserving the visual information that is important for recognition.
Tan and Triggs [37] proposed an efficient pipeline to remove the effects of illumination, specifically for face recognition. First, Gamma correction is used, i.e. a transformation of the pixel gray-level values I using the non-linear transform Î = I^γ, with 0 < γ < 1. This enhances the dynamic range by increasing the intensity in dark regions and decreasing it in bright regions. Next, the image is convolved with a Difference of Gaussians (DoG) kernel, a bandpass filter which is intended to remove gradients caused by shadows (low frequency) and to suppress
Figure 3.7: Preprocessing examples to gain illumination invariance (a) before preprocessing (b)<br />
after preprocessing<br />
noise (high frequency), while maintaining the signal useful for recognition (middle frequency). Additionally, a mask can be used to remove regions which are irrelevant for recognition. Finally, contrast equalization is used to obtain a standardized contrast spectrum for the image. This is done carefully, removing the effect of extreme values such as the artificial gradients introduced by the masking. Fig. 3.7 shows examples of the resulting images after the preprocessing is applied. In this thesis we did not consider a preprocessing step for illumination invariance, as the descriptors used are based on gradients, and are therefore invariant to illumination shifts.
3.5 Face descriptor<br />
The objective is to transform an image into a feature vector xi ∈ R^D. This vector must be discriminative, i.e. it must encode information that is relevant to determine the identity of the person. The learning algorithms described in Section 3.6 show strategies to learn which information is relevant and which is not.
In Section 2.2 a facial feature based descriptor was presented, consisting of the pixel intensities surrounding the localized facial features. The intensities are normalized to have zero mean and unit variance to gain robustness to illumination changes. We refer to that descriptor as a facial features patch. In this section, two more descriptors are described: the Histogram of Oriented Gradients and SIFT.
3.5.1 Histogram of Oriented Gradients<br />
The Histogram of Oriented Gradients (HoG) was initially proposed by Dalal and Triggs [8]. It is a global (holistic) descriptor, closely related to SIFT (see Section 3.5.2) and to edge orientation histograms, originally designed for the human detection task. The pipeline used for their application is depicted in Fig. 3.8.
As illustrated in Fig. 3.8, the descriptor is built as follows: for an input image, the derivatives in the x and y directions (Ix and Iy) are computed by convolving the image with the filters h = [−1, 0, 1] and h⊤ respectively. Then the magnitude and direction of the gradient are obtained as M(i, j) = √(Ix(i, j)² + Iy(i, j)²) and Ω(i, j) = arctan(Iy(i, j)/Ix(i, j)), in such way that each pixel has its gradient vector: magnitude and direction. Then, according to a predefined number of cells, the image is split into a grid of cells × cells and, for each cell, a histogram is computed over the occurrence of the gradient angles of the pixels contained in that cell. The vote of each pixel is given by its magnitude, and a soft assignment is used, i.e. linear interpolation shares the vote among neighboring angle bins. The next step is to normalize the histogram using blocks of cells, i.e. groups of cells whose joint energy is used for normalization. Dalal and Triggs used overlapping blocks, in such way that there is redundancy over the cells being used, differing only in the value used for their normalization.
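The core of the computation can be sketched as follows; for brevity this toy version uses hard assignment to angle bins (instead of soft, interpolated votes) and per-cell L2 normalization:

```python
import numpy as np

def hog(image, cells=4, angles=8):
    """Minimal HoG sketch: gradient magnitude/orientation, a cells x cells
    grid, magnitude-weighted orientation histograms per cell, and per-cell
    L2 normalization. Hard assignment is used for brevity."""
    # Derivatives with the centered filters [-1, 0, 1] and its transpose.
    Ix = np.zeros_like(image, dtype=float)
    Iy = np.zeros_like(image, dtype=float)
    Ix[:, 1:-1] = image[:, 2:] - image[:, :-2]
    Iy[1:-1, :] = image[2:, :] - image[:-2, :]
    M = np.sqrt(Ix**2 + Iy**2)
    # Unsigned orientation in [0, 180) degrees.
    Omega = np.degrees(np.arctan2(Iy, Ix)) % 180.0

    h, w = image.shape
    ch, cw = h // cells, w // cells
    descriptor = []
    for i in range(cells):
        for j in range(cells):
            m = M[i*ch:(i+1)*ch, j*cw:(j+1)*cw].ravel()
            o = Omega[i*ch:(i+1)*ch, j*cw:(j+1)*cw].ravel()
            bins = np.minimum((o / (180.0 / angles)).astype(int), angles - 1)
            hist = np.bincount(bins, weights=m, minlength=angles)
            hist /= np.linalg.norm(hist) + 1e-12   # cell normalization
            descriptor.append(hist)
    return np.concatenate(descriptor)
```

The overlap, block normalization and multiscale variants discussed in this section extend this basic loop without changing its structure.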
For this thesis the strategy for normalization is different: we allowed the cells to overlap,<br />
with the amount of overlap as a parameter, and we defined three types of normalization:<br />
• Cell: The normalization value for each cell is computed using only the information within that cell. This approach is highly invariant to non-uniform illumination changes, but the relative changes in gradient magnitude between different cells are lost.
• Global: All the cells are normalized with the same value, which is computed globally. In this case, the relative changes in magnitude between different cells are maintained, but there is poor illumination invariance.
• Block: The objective of block normalization is to provide a local, but coarser, normalization, as a trade-off between illumination invariance and maintaining the changes in magnitude between different cells. The strategy is overlap dependent, to comply with the geometry of the spatial grid, and can be used only for overlaps of 0% or 50%. In the case of 0% overlap, the normalization value is computed by combining the energy of the current cell (the one to be normalized) and 3 of its neighbors, as shown in Fig. 3.9a. In the case of 50%, the current cell is normalized using the neighbors on its diagonal. Considering that a cell is actually 4 small squares in Fig. 3.9b, the neighbors on the diagonal cover the area of the current cell.
Input image → normalize gamma & colour → compute gradients → weighted vote into spatial & orientation cells → contrast normalize over overlapping spatial blocks → collect HoGs over detection window → linear SVM → person/non-person classification
Figure 3.8: Pipeline proposed by Dalal and Triggs for human detection using HoG [8]
In all of the cases the normalization used is L2, i.e. for a vector x = (x0, x1, ..., x_{D−1})⊤ the normalized vector is obtained as x′ = x/|x|, with:

|x| = √( Σ_{i=0}^{D−1} x_i² )   (3.5)
We also considered using a multiscale version of HoG. In this case, a HoG descriptor is computed for each level of a scale pyramid. The number of cells at level l, denoted c_l, depends on the scaling factor k, i.e. c_l = c_0 k^{−l}. As a summary, the parameters involved in the computation of the HoG descriptor are shown in Table 3.1.
Table 3.1: HoG parameters summary

Parameter      Description
Cells          Number of cells for the image grid
Angles         Number of angle bins for each histogram
Overlap        Fraction of overlap between neighboring cells
Sign           Whether the angle range is 0–180° or 0–360°
Normalization  Either cell, global or block normalization
Levels         Number of levels for the multiscale HoG
Scaling (k)    Scaling factor for each level of the multiscale pyramid
3.5.2 Scale invariant feature transform (SIFT)<br />
The SIFT descriptor was proposed by Lowe [23], and it has proven to be very useful for object recognition and matching applications. This descriptor is local in the sense that it describes the region surrounding a keypoint, at a specific scale and orientation. Normally its location, scale
Figure 3.9: HoG block normalization. (a) Zero percent overlap: the highlighted cell is normalized using its energy plus the energy of its 3 immediate neighbors. (b) Fifty percent overlap: the current cell is normalized using the energy of its 4 diagonal neighbors, which cover its area due to the overlap.
and orientation are obtained from an interest point (keypoint) detector. In the case of [23], the keypoints are obtained as scale-space extrema using Difference of Gaussians (DoG) filtering.
The SIFT descriptor has the structure depicted in Fig. 3.10; the idea is similar to that of the HoG descriptor. The gradient is computed for each pixel in the interest region, and the area is divided into subregions (2×2 in Fig. 3.10), from which a histogram of gradients is computed, using the magnitude of the gradient as the vote for the angle bins. It is important to remark, however, that prior to the histogram computation a Gaussian weighting is applied to the magnitudes, centered in the middle of the descriptor and with σ equal to one half of the width of the descriptor. This gives less importance to the pixels at the borders of the region and therefore reduces the effect of misalignments. In this thesis, we used SIFT descriptors with 4×4 subregions, each of 8 angle bins, generating a 128-dimensional descriptor.
Figure 3.10: SIFT descriptor structure: image gradients (left) and keypoint descriptor (right) [23]
3.6 Learning/Classification
Each image i is represented by a descriptor vector xi ∈ R^D. The vector xi is also associated with a categorical label yi corresponding to the person's identity. A classification algorithm for face recognition models the binary decision of whether images xi and xj belong to the same class (yi = yj) or not (yi ≠ yj), as shown in Eq. (3.6):
f(xi, xj) : R^{D×2} → {0, 1}   (3.6)
In the following sections, relevant algorithms for classification are described.<br />
3.6.1 Spectral regression kernel discriminant analysis<br />
Kernel discriminant analysis (KDA) is an extension of linear discriminant analysis (LDA) to handle non-linear data. In the case of LDA, it is assumed that the data for each class follows
a normal distribution with equal covariance. The goal is to solve Eq. (3.7):

W_opt = arg max_W Tr{ (W⊤ S_W W)^{−1} (W⊤ S_B W) }   (3.7)
Eq. (3.7) finds the optimal combination of features which separates the input data according to their classes. The objective function is such that the between-class covariance S_B is maximized while the within-class covariance S_W is minimized. These terms are defined in Eq. (3.8) and Eq. (3.9) respectively.
S_B = Σ_{i=1}^{c} N_i (μ_i − μ)(μ_i − μ)⊤   (3.8)

S_W = Σ_{i=1}^{c} Σ_{x_k ∈ X_i} (x_k − μ_i)(x_k − μ_i)⊤   (3.9)
where N_i and μ_i are the number of points and the mean of class i, μ is the mean of all the data regardless of class, and X_i is the subset of points belonging to class i. LDA can be described as an algorithm that finds an optimal linear projection such that data belonging to the same class is moved closer together, while data belonging to different classes is pushed apart.
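A small numpy sketch of the scatter matrices of Eqs. (3.8)-(3.9) and the resulting projection direction, on synthetic two-class data:

```python
import numpy as np

def lda_direction(X, y):
    """Compute the scatter matrices of Eqs. (3.8)-(3.9) and the leading
    LDA direction by solving the generalized eigenproblem S_B w = λ S_W w
    (here via eig of S_W^{-1} S_B, with a small regularizer on S_W)."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    D = X.shape[1]
    S_B = np.zeros((D, D))
    S_W = np.zeros((D, D))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # Eq. (3.8)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)                # Eq. (3.9)
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W + 1e-6 * np.eye(D), S_B))
    return np.real(evecs[:, np.argmax(np.real(evals))])

# Two Gaussian classes separated along the first axis.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal([0, 0], 0.3, size=(50, 2)),
               rng.normal([3, 0], 0.3, size=(50, 2))])
y = np.repeat([0, 1], 50)
w = lda_direction(X, y)
# The learned direction should be dominated by the separating axis.
assert abs(w[0]) > abs(w[1])
```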
In [2] it is shown that the problem can be reformulated in terms of inner products. Therefore the kernel trick can be used to handle non-linear data, which leads to the KDA algorithm. For this thesis we used an instance of KDA called Spectral Regression Kernel Discriminant Analysis (SR-KDA), from the work of Cai et al. [6]. It is a specific formulation of KDA in which the optimization process is theoretically 27 times faster. The limitation of SR-KDA is that the target space is limited to c − 1 dimensions, where c is the number of classes.
3.6.2 Logistic regression<br />
General logistic Regression<br />
Logistic regression [4] models the probability that a feature vector xi belongs to a class as a logistic sigmoid function whose argument is a linear combination of the entries of the feature vector, as shown in Eq. (3.10):

p(yi = 1|xi) = σ(w⊤xi),   (3.10)

where σ(z) = (1 + exp(−z))^{−1} is the sigmoid function, and xi is given in homogeneous coordinates, i.e. it allows for a bias term to be learned in w. Taking the negative log-likelihood (Eq.
(3.11)) and its gradient (Eq. (3.12)), the optimal weights can be obtained by using a gradient descent algorithm until convergence (finding the minimum of the negative log-likelihood).
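A minimal gradient-descent sketch of this procedure on a toy one-dimensional problem (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, t, lr=0.1, n_iter=500):
    """Batch gradient descent on the negative log-likelihood of logistic
    regression. Rows of X are feature vectors in homogeneous coordinates
    (last entry 1 for the bias); t holds binary targets in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - t)   # gradient of the negative log-likelihood
        w -= lr * grad / len(t)
    return w

# Toy separable 1D problem with a bias column.
x = np.linspace(-2, 2, 40)
X = np.column_stack([x, np.ones_like(x)])
t = (x > 0).astype(float)
w = fit_logistic(X, t)
pred = (sigmoid(X @ w) > 0.5).astype(float)
assert (pred == t).mean() > 0.9
```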
L = − Σ_n [ t_n ln p_n + (1 − t_n) ln(1 − p_n) ]   (3.11)

∇L = Σ_n (p_n − t_n) x_n   (3.12)

Logistic discriminant metric learning
The objective of metric learning algorithms is to find the matrix M ∈ R^{D×D} such that the Mahalanobis distance, Eq. (3.13), is minimized for positive pairs (yi = yj) and maximized for negative pairs (yi ≠ yj):

dM(xi, xj) = (xi − xj)⊤M(xi − xj),   (3.13)

where M is restricted to be positive semidefinite 1 . Logistic Discriminant Metric Learning, proposed by Guillaumin et al. [16], models the probability that two examples depict the same person as given by Eq. (3.14):
pn(yi = yj|xi,xj;M,b) = σ(b − dM(xi,xj)), (3.14)<br />
where σ(z) = (1 + exp(−z))^{−1} is the sigmoid function and b is a bias value. Let n be an index representing the pair ij. From Eq. (3.14), the likelihood of the observed data, taking tn as the target class for the pair xn = (xi, xj), is given in Eq. (3.15):

L = ∏_{n=1}^{N} p_n^{t_n} (1 − p_n)^{1−t_n}   (3.15)
From this, it can be shown that the negative log-likelihood and its gradient are given in Eq. (3.16) and Eq. (3.17) respectively:

L = − Σ_n [ t_n ln p_n + (1 − t_n) ln(1 − p_n) ]   (3.16)

∇L = Σ_n (p_n − t_n) X_n   (3.17)
Xn is defined as the vectorization of (xi − xj)(xi − xj)⊤. Using Eq. (3.16) and Eq. (3.17) it is possible to learn the values of M by minimizing the negative log-likelihood with a gradient descent algorithm. If the matrix is restricted to be positive semidefinite, a Cholesky decomposition can be applied to it, i.e. M = LL⊤. In this case Eq. (3.13) can be reformulated as in Eq. (3.18):

¹A matrix M ∈ R^{D×D} is positive semidefinite if x⊤Mx ≥ 0, ∀x ≠ 0; this is denoted M ⪰ 0.
dL(xi,xj) = (L⊤xi − L⊤xj)⊤(L⊤xi − L⊤xj)    (3.18)
This result can be interpreted as a projection of the data followed by the computation of<br />
the Euclidean distance in the new space. Throughout this thesis, logistic discriminant metric<br />
learning will be used as the main learning algorithm.<br />
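As an illustration of the optimization described above, a minimal numpy sketch of LDML is given below. This is not the Matlab/C implementation used in this work: the learning rate, iteration count and toy data are illustrative assumptions, and a full implementation would additionally keep M positive semidefinite, e.g. through the factorization of Eq. (3.18).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def ldml_fit(X, pairs, targets, lr=0.05, iters=1000):
    """Maximize the log-likelihood of Eq. (3.15) by gradient ascent.

    X       : (N, D) matrix of feature vectors
    pairs   : list of index pairs (i, j)
    targets : t_n = 1 for positive pairs (same person), 0 otherwise
    """
    D = X.shape[1]
    M = np.eye(D)                # start from the Euclidean metric
    b = 0.0
    for _ in range(iters):
        grad_M = np.zeros((D, D))
        grad_b = 0.0
        for (i, j), t in zip(pairs, targets):
            d = X[i] - X[j]
            p = sigmoid(b - d @ M @ d)           # Eq. (3.14)
            grad_M += (t - p) * np.outer(d, d)   # Eq. (3.17): X_n = vec(d d^T)
            grad_b += (t - p)
        # p grows when d^T M d shrinks, hence the opposite update signs
        M -= lr * grad_M
        b += lr * grad_b
    return M, b

def ldml_predict(M, b, x_i, x_j):
    d = x_i - x_j
    return sigmoid(b - d @ M @ d)
```

On a toy set where identity lives in the first dimension and the second dimension is noise, the learned metric downweights the noisy dimension so that positive pairs obtain small distances and negative pairs large ones.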
3.7 Datasets and evaluation<br />
In order to evaluate the performance of our algorithm, two datasets are used: Labeled Faces in the Wild (LFW) and Public Figures (PubFig). In this section, a description of both datasets, together with their evaluation protocols, is presented.
3.7.1 Labeled faces in the wild<br />
The main dataset used for this project is Labeled Faces in the Wild (LFW) [19]. It is an important dataset due to its high variability in pose, expression, illumination conditions, etc., and is therefore considered appropriate for evaluating face recognition approaches in uncontrolled settings [30]. It consists of 13233 images retrieved from Yahoo! News using a Viola-Jones face detector. With a resolution of 250 × 250 pixels, the scale and location of each face is approximately the same, so there is no need to run a face detector again. Each image is labeled with the identity of the person, giving a total of 5749 identities. The number of images per person varies from 1 to 530.
To redirect research efforts towards recognition rather than alignment, three versions of LFW are available:
• Not Aligned: the set of images as taken directly from the face detector.<br />
• Aligned Funneled: aligned using the algorithm described in section 3.3.1.<br />
• Aligned Commercial: aligned using the algorithm introduced in [43].<br />
In order to have a standard evaluation method that properly compares different algorithms, a protocol was established. Ten independent subsets (folds) of images were defined, mutually exclusive in terms of image exemplars and identities. The evaluation protocol allows for two different paradigms: restricted and unrestricted. For the restricted case, a set of 600 pairs is predefined for each of the ten folds; each pair has an associated label indicating whether the two images belong to the same person, with 300 pairs for each case. Here the identities must not be used, i.e. no additional pairs can be created. In the unrestricted paradigm, the identities can be used, so that a large quantity of pairs can be created.
For both cases, performance is reported as the mean over 10-fold cross-validation. This means that one of the 10 folds is held out and training is done on the remaining subsets; the accuracy is then obtained by classifying the "unseen" 600 pairs that were left aside. This is repeated 10 times, rotating over the folds, and the final report is the mean and standard deviation of the accuracy over the 10 folds. In this work we focus on the unrestricted paradigm.
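The reporting scheme can be sketched as follows; `train_fn` and `eval_fn` are placeholders for any learning and scoring routine, not part of the LFW protocol itself.

```python
import numpy as np

def lfw_cross_validation(folds, train_fn, eval_fn):
    """LFW-style k-fold protocol: hold one fold out, train on the
    remaining ones, and report the mean and standard deviation of
    the accuracy over all folds.

    folds    : list of per-fold data (e.g. lists of labeled pairs)
    train_fn : callable(training_folds) -> model
    eval_fn  : callable(model, held_out_fold) -> accuracy in [0, 1]
    """
    accuracies = []
    for k in range(len(folds)):
        training = [f for i, f in enumerate(folds) if i != k]
        model = train_fn(training)
        accuracies.append(eval_fn(model, folds[k]))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```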
3.7.2 Public figures (PubFig)<br />
The Public Figures dataset was compiled by Kumar et al. [22] and is larger than LFW. It consists of 59470 images of 200 people collected from the internet, so there are many more images per person than in LFW. Similarly to LFW, it contains large variability in pose, illumination, expression, etc.
An important difference with LFW is that the images are given as a list of URL addresses pointing to different sources on the internet. This is a problem, as over time some images become unavailable. We confirmed this when retrieving the dataset: 15% of the URLs were invalid and, as a consequence, 25% of the test pairs could not be created.
The evaluation protocol is 10-fold cross-validation using a "restricted" paradigm equivalent to that of LFW; therefore, no additional pairs can be used to train the algorithm. Different benchmarks are provided to measure the performance of an algorithm under specific conditions, e.g. its behavior using only frontal pose images, only neutral expressions, etc.
In our evaluation we use the dataset in an "unrestricted" manner, defining our own pairs for training, but using the benchmark test pairs for evaluation.
3.8 Baseline performance<br />
Our baseline algorithm is the following: facial features are detected (see Section 3.2) and, using the found coordinates, two feature vectors are built. The first is the concatenation of SIFT descriptors, computed at three different scales (16, 32 and 48 pixels width) at the location of each facial feature (following [16]). The second is the concatenation of the facial feature patches from Section 2.2. The implementation was done in Matlab, with computationally expensive sections, such as alignments and feature extraction, implemented in C.
Table 3.2 shows the results obtained for both descriptors on the Aligned Commercial version of LFW. For comparison, two classifiers are used: the Euclidean distance between the feature vectors of the pair of images being classified, and LDML to learn a proper metric.
The significant contribution of metric learning approaches to face recognition can be observed. Additionally, when the Euclidean distance is used for classification, there is no significant gain from using SIFT descriptors rather than facial feature patches; the difference only appears when a proper metric is used.
Table 3.2: Baseline algorithms performance

Classification                          Facial Feature Patches   Multiscale SIFT
Euclidean Distance                      0.6702 ± 0.0031          0.6845 ± 0.0051
Logistic Discriminant Metric Learning   0.7385 ± 0.0042          0.8524 ± 0.0052
Chapter 4<br />
Histogram of Oriented Gradients<br />
for face recognition<br />
4.1 Motivation<br />
Facial feature based approaches have gained popularity in the past years due to their robustness to pose variations, in comparison with holistic approaches. However, face recognition performance is then strongly dependent on the accuracy of the facial feature detection. Facial feature localization algorithms, even though they have improved significantly, are still not able to cope with large pose variations. Moreover, their computation time is high, since the objective function must be maximized over the set of possible locations, Eq. (3.2). For these reasons, it is desirable to have a pipeline without facial feature detection.
There is also the intuition that holistic approaches provide more information to the learning process, which might give a higher discriminative power to the overall algorithm. Therefore, a Histogram of Oriented Gradients (HoG) descriptor, a holistic encoding, was implemented following the description in Section 3.5.1. The implementation language was C, using the OpenCV library [5]. Assuming the input image's resolution is 250x250, the descriptor is created for the 100x100 pixel region in the center of the image. It is important to dismiss the background in order to reduce biases the dataset might have [31]. The objective is to find a set of parameters such that the discriminative power of the descriptor is suitable for face recognition.
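As an illustration of this holistic encoding, a minimal numpy sketch is given below for the non-overlapping case. The crop coordinates, the hard orientation assignment and the cell grid are simplifications of the C/OpenCV implementation (which uses soft assignment by linear interpolation), and normalization is handled separately, as studied in Section 4.3.2.

```python
import numpy as np

def hog_descriptor(image, cells=12, bins=16, signed=True):
    """Illustrative HoG over the central 100x100 region of a
    250x250 face image, with a non-overlapping cell grid."""
    assert image.shape == (250, 250)
    patch = image[75:175, 75:175].astype(float)   # central 100x100 region
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.arctan2(gy, gx)                      # orientation in [-pi, pi]
    if signed:
        ang = np.mod(ang, 2 * np.pi)              # signed range, [0-360) degrees
        bin_width = 2 * np.pi / bins
    else:
        ang = np.mod(ang, np.pi)                  # unsigned range, [0-180)
        bin_width = np.pi / bins
    cell_size = patch.shape[0] // cells           # pixels per cell side;
    hist = np.zeros((cells, cells, bins))         # border remainder is ignored
    for cy in range(cells):
        for cx in range(cells):
            sl = (slice(cy * cell_size, (cy + 1) * cell_size),
                  slice(cx * cell_size, (cx + 1) * cell_size))
            b = np.minimum((ang[sl] / bin_width).astype(int), bins - 1)
            # magnitude-weighted hard assignment into the angle bins
            np.add.at(hist[cy, cx], b.ravel(), mag[sl].ravel())
    return hist.ravel()
```

With the default 12x12 cells and 16 bins the descriptor has 12 x 12 x 16 = 2304 entries.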
4.2 Alignment comparison<br />
It is important to decide whether an alignment is imperative for holistic approaches, more specifically for the use of a HoG descriptor. To answer this question, we compare the three variants of LFW: Not Aligned, Aligned Funneled and Aligned Commercial, using the same parameters for the HoG descriptor.
The first results are shown in Table 4.1. They reveal, in a consistent manner, that an alignment is crucial for face recognition using HoG. Interestingly, the funneled version of LFW did not show any improvement over the not aligned version; in fact there is a decrease. For that reason, we ran a face alignment using the locations of the facial features (cf. Section 3.3.2). It can be seen that this boosts the results significantly, both for the not aligned and the aligned funneled versions, with an increase of over 5%, while there is an insignificant decrease for the aligned commercial version. Though it is not reported here, in our experiments we did not observe a significant difference in accuracy between the LFW versions when using a facial feature based descriptor.
These results bring two conclusions. A face alignment is indeed crucial for the use of HoG descriptors. However, as suggested by the decrease in accuracy of the funneled version with respect to the not aligned version, the alignment should be robust not only in terms of rotation and scale but, more importantly, to translation. We believe that funneling is not as robust to translation as a feature based alignment.
The need for an alignment robust to translation is intuitive, as it is desirable for corresponding features to fall in the same spatial cell. Once the parametric study for HoG was done (Section 4.3), the same experiment was performed using the best set of parameters (Table 4.9). The results, shown in Table 4.2, confirm the previous behavior. A disadvantage of this result is that, even though the descriptor is holistic, a facial feature detector is needed prior to its computation, inheriting the problems caused by the detector.
Table 4.1: Alignment comparison for an initial set of parameters for the HoG descriptor:
12x12 cells, 16 angle bins, range [0-360]°, 50% overlap with block normalization

                          Not Aligned       Aligned Funneled   Aligned Commercial
No further alignment      0.7568 ± 0.0053   0.7408 ± 0.0067    0.8205 ± 0.0063
Feature based alignment   0.8069 ± 0.0066   0.8093 ± 0.0063    0.8171 ± 0.0047
Table 4.2: Alignment comparison for the final set of parameters: 16x16 cells, 16 angle bins,
range [0-360]°, 50% overlap with global normalization

                          Not Aligned       Aligned Funneled   Aligned Commercial
No further alignment      0.7660 ± 0.0061   0.7702 ± 0.0042    0.8432 ± 0.0062
Feature based alignment   0.8276 ± 0.0051   0.8383 ± 0.0054    0.8357 ± 0.0058
4.3 HoG parametric study<br />
We perform a parametric study for Histogram of Oriented Gradients based face recognition. The evaluation follows the protocol established for LFW, i.e. 10-fold cross-validation, and the results are reported as the mean and standard deviation of the accuracy over the 10 folds. Unless specified otherwise, the dataset used is LFW aligned commercial and the learning algorithm is LDML. As a search for the optimal parameters considering all possible combinations is almost intractable, we decided to optimize the parameters one by one.
4.3.1 Angle range<br />
As a first experiment we studied the effect of the angle range on the performance of the algorithm. To do so, we fixed the rest of the parameters: 8 angle bins, as used for the SIFT descriptor [23], 8x8 cells and 50% overlap, with block normalization. The experiment was repeated for the three variants of LFW to compare the results.
It can be observed from Table 4.3 that a range of [0-360]° outperforms the range of [0-180]° when combined with LDML. This is consistent across the three variants of LFW. Therefore, in the following experiments the default is a signed angle, i.e. a range of [0-360]°.
4.3.2 Normalization<br />
The three variants for normalization are described in Section 3.5.1: cell, block and global normalization. Fig. 4.1 shows examples of HoG descriptors plotted over the original image. For cell normalization, as the norm is the same for each spatial bin, the relative changes
Table 4.3: Angle range comparison for HoG. 8x8 cells, 8 angle bins, 50% overlap and block
normalization

Angle Range   Not Aligned       Aligned Funneled   Aligned Commercial
[0-180]°      0.7150 ± 0.0053   0.7077 ± 0.0052    0.7563 ± 0.0082
[0-360]°      0.7523 ± 0.0071   0.7495 ± 0.0054    0.8017 ± 0.0066
Figure 4.1: HoG normalization examples: (a) cell normalization, (b) block normalization, (c) global normalization.
Table 4.4: Normalization comparison for the HoG descriptor. Parameters: 16 angle bins,
range [0-360]°; columns are number of cells / overlap (%)

Normalization   12/0              12/50             16/0              16/50
Cell            0.7933 ± 0.0061   0.8128 ± 0.0077   0.7578 ± 0.0091   0.8178 ± 0.0061
Block           0.8192 ± 0.0064   0.8305 ± 0.0064   0.8291 ± 0.0058   0.8385 ± 0.0074
Global          0.8247 ± 0.0071   0.8283 ± 0.0068   0.8317 ± 0.0056   0.8432 ± 0.0062
in magnitude between different cells are lost, which diminishes the influence of strong gradients. However, it is very robust to non-uniform changes in illumination. In the case of global normalization, the important gradients that appear in regions such as the eyes, mouth and nose are emphasized, at the cost of a weaker resistance to illumination changes. Block normalization is a trade-off between the cell and global paradigms.
An experiment was performed in which all parameters were left unchanged except for the normalization type, overlap and number of cells. The results, found in Table 4.4, show consistently that cell normalization gives the worst performance. Global normalization leads to results similar to block normalization; in most cases global is better, except for 12 cells with 50% overlap. Because of these results, and for its computational simplicity, we take global normalization as the default for further experiments. The exception is the quantity-of-cells experiment, which was computed in parallel.
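The three schemes can be sketched on a (cells, cells, bins) histogram as below; the non-overlapping 2x2 block layout is an assumption made for illustration, not the exact block structure of the implementation.

```python
import numpy as np

def normalize_hog(hist, mode="global", eps=1e-8):
    """Cell, block and global L2 normalization of a HoG histogram
    of shape (cells, cells, bins)."""
    h = hist.astype(float)
    if mode == "cell":
        # each spatial bin on its own: robust to non-uniform illumination,
        # but relative magnitudes between cells are lost
        return h / (np.linalg.norm(h, axis=2, keepdims=True) + eps)
    if mode == "block":
        # trade-off: normalize non-overlapping 2x2 groups of cells
        out = np.empty_like(h)
        for cy in range(0, h.shape[0], 2):
            for cx in range(0, h.shape[1], 2):
                block = h[cy:cy + 2, cx:cx + 2]
                out[cy:cy + 2, cx:cx + 2] = block / (np.linalg.norm(block) + eps)
        return out
    # global: a single norm for the whole descriptor, emphasizing the
    # strong gradients (eyes, mouth, nose) at the cost of a weaker
    # resistance to illumination changes
    return h / (np.linalg.norm(h) + eps)
```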
4.3.3 Quantity of cells<br />
Another important parameter to determine is the quantity of cells. Table 4.5 shows experiments in which only this parameter was changed. Here we used 16 angle bins over a signed range, i.e. [0-360]°, with 0% overlap and global normalization. It can be observed that above 14 cells there is no significant variation, and below that value the results start to degrade. A reason why more than 14 cells brings no improvement might be that LDML starts to combine the information of finer cells as if they were coarser. More cells then bring no improvement, only larger descriptors; e.g. there is no significant difference in performance between 16 × 16 and 20 × 20, yet for 20 cells the descriptor size is almost double that of 16 cells. Therefore, we set 16 cells as our default value.
Table 4.5: Number of cells comparison for the HoG descriptor. 16 angle bins, range [0-360] ◦ ,<br />
0% overlap with block normalization<br />
Number of cells Accuracy<br />
10 0.8198 ± 0.0086<br />
12 0.8305 ± 0.0064<br />
14 0.8327 ± 0.0080<br />
16 0.8385 ± 0.0074<br />
18 0.8348 ± 0.0059<br />
20 0.8412 ± 0.0060<br />
22 0.8380 ± 0.0068<br />
4.3.4 Angle bins<br />
Angle bins refer to the number of partitions into which the angle range is split. Experiments were done to compare how the performance is affected by modifying the quantity of angle bins per cell. The results can be found in Table 4.6; the maximum is found at 16 bins, which is therefore taken as the default for further experiments.
Table 4.6: Accuracy obtained using different numbers of angle bins for the HoG descriptor.
Parameters: 16x16 cells, range [0-360]°, 0% overlap with global normalization

Angle bins   8                 12                16                20
Accuracy     0.8230 ± 0.0049   0.8270 ± 0.0052   0.8317 ± 0.0046   0.8295 ± 0.0077
4.3.5 Overlap<br />
Table 4.7 shows the variation in accuracy as a function of the overlap, with the rest of the parameters left unchanged. The maximum accuracy, 0.8432 ± 0.0062, was obtained with an overlap of 50%. However, accuracy is not highly affected within a range of 10% to 60%.
It is important to remark that the cell size in pixels is a function of the overlap when the image size remains fixed. Therefore, to show that overlap is beneficial, an additional experiment was done: a 9x9 cells descriptor was created with no overlap. In this case the cell size is similar to that of 16x16 cells with 50% overlap (≈ 11 pixels). The accuracy obtained was 0.8207 ± 0.0080, which is lower than with overlap. We argue that overlap is beneficial as it helps to correct misalignments due to problems in face detection or pose variations.
Table 4.7: Overlap comparison. Parameters: 16x16 cells, 16 angle bins in the range [0-360]°,
using global normalization

Overlap (%)   0        12.5     25       37.5     50       62.5     75
Accuracy      0.8317   0.8423   0.8392   0.8412   0.8432   0.8415   0.8333
              ±0.0056  ±0.0045  ±0.0054  ±0.0064  ±0.0062  ±0.0066  ±0.0052
4.3.6 Multiscale HoG<br />
We also studied a multiscale HoG descriptor; in this case two parameters are involved: the number of scales and the rescaling factor. The results in Table 4.8 show that the multiscale approach does not bring any significant contribution to performance. The reason might be that a coarser level of the pyramid is only a linear combination of the finer cells, which causes LDML to ignore coarser levels, as the information at the finest level of the pyramid is sufficient.
Table 4.8: Multiscale HoG performance

Levels/k   12 cells          14 cells          16 cells          18 cells
2/1.15 0.8317 ± 0.0067 0.8407 ± 0.0065 0.8435 ± 0.0074 0.8425 ± 0.0074<br />
2/1.30 0.8287 ± 0.0067 0.8375 ± 0.0059 0.8380 ± 0.0047 0.8453 ± 0.0068<br />
2/1.45 0.8355 ± 0.0056 0.8388 ± 0.0068 0.8413 ± 0.0063 0.8410 ± 0.0074<br />
3/1.15 0.8322 ± 0.0074 0.8423 ± 0.0061 0.8398 ± 0.0062 0.8397 ± 0.0063<br />
3/1.30 0.8312 ± 0.0057 0.8383 ± 0.0065 0.8397 ± 0.0062 0.8435 ± 0.0070<br />
3/1.45 0.8312 ± 0.0059 0.8360 ± 0.0073 0.8440 ± 0.0057 0.8420 ± 0.0058
4.4 Discussion<br />
The conclusion of this study is the identification of appropriate parameters for face recognition. The descriptor to be used has a 16x16 cell spatial grid with an overlap of 50%; the angle histograms are created using 16 bins covering the range from 0° to 360°; voting is done by soft assignment with linear interpolation. There is no need for a multiscale descriptor when LDML is the classification algorithm.
Further improvements could be achieved by reducing large differences in occurrence between certain angle bins. For example, regions around the mouth are expected to always have a high occurrence of horizontal lines. A large fraction of the feature vector energy is therefore distributed over the angle bins corresponding to those gradients, overshadowing bins with fewer occurrences. This problem is one of the main motivations for the work of Cao et al. [7], as this concentration of energy reduces the discriminative power of the descriptor.
A simple way to balance the energy is to define new descriptors x′ by computing the square root of the input descriptors, i.e. x′ = (√x0, √x1, ..., √x_{D−1})⊤. This is similar to computing the Hellinger distance d(x,y) = Σi (√xi − √yi)², but extended to handle inter-feature correlation through the Mahalanobis distance. This test brought the results from 0.8432 ± 0.0062 up to 0.8530 ± 0.0065 for the aligned commercial version of LFW. Notice that with this method the conclusion drawn for multiscale HoG might not hold, as coarser cells are no longer a linear combination of finer cells.
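The re-mapping is a one-liner; the γ generalization discussed next is included too. The descriptor values are assumed non-negative, as HoG entries are.

```python
import numpy as np

def sqrt_features(x):
    """Square-root re-mapping of a non-negative descriptor: the
    Euclidean distance between mapped vectors equals the Hellinger
    distance between the originals."""
    return np.sqrt(x)

def power_features(x, gamma=0.5):
    """Generalization with gamma in [0, 1]; gamma = 0.5 recovers the
    square-root vector and gamma = 1 leaves the descriptor unchanged."""
    return np.power(x, gamma)
```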
This result suggests that it would be interesting to study different strategies to distribute the energy of the descriptor. For example, instead of computing the square root, a parameter γ ∈ [0,1] could be used to create a new feature vector x′ = (x0^γ, x1^γ, ..., x_{D−1}^γ)⊤, a generalization of the square-root vector.
Table 4.9: Best found parameters for HoG based recognition

Parameter      Value
Cells 16<br />
Angle bins 16<br />
Overlap 50%<br />
Sign Signed, i.e. range: [0 − 360] ◦<br />
Normalization global<br />
Additional Square root of features<br />
Levels 1<br />
scaling (k) -
Chapter 5<br />
Facial feature based<br />
representations<br />
5.1 Motivation<br />
Pose and expression represent a major challenge for face recognition. They introduce non-linearities in the data, which might be difficult, or even impossible, to handle using linear algorithms; this includes metric learning approaches such as LDML.
One way to overcome this limitation is to design descriptors invariant to these factors. Feature based approaches have proven useful for building descriptors less sensitive to changes in pose, as they are built at each facial feature, regardless of the features' relative positions. Another alternative is to use non-linear machine learning algorithms, with which the non-linear data lying in a high dimensional space might be separated. In this chapter some experiments using non-linear strategies are presented.
Another challenge is to handle occlusions, a common problem in uncontrolled settings. We propose to separate the metric learning according to spatial regions, i.e. a specific metric is learned to classify a region of the face, deciding whether it belongs to the same person or not independently of the rest of the face; the final classification then combines the results given by each region. In the case of the HoG descriptor, for example, this could be done by grouping neighboring cells. Each region can then be classified as occluded or not using outlier detection algorithms, so that in later stages of classification facial regions can be dismissed or kept.
Our first goal is to separate the training stage according to spatial regions and to achieve results similar to those of a global training (cf. Section 3.8). We will consider each of the 9 detected facial features as a spatial region. These are: left eye left, left eye right, right eye left, right eye right, nose left, nose center, nose right, mouth left and mouth right.
The second goal is to classify, for each feature, whether it is an inlier or an outlier. The output is a confidence value, a measure of "normality". This score represents how well the specific instance being classified fits a model given by the training data.
As a final step, it is desirable to include the confidence values in the classification. The goal is to reduce the influence of the occluded features and correspondingly increase the influence of the observed features on the final decision.
5.2 Feature wise classification<br />
In this case, the multiscale SIFT descriptor xi obtained in Section 3.8 is split into 9 feature vectors; x^f_i denotes the descriptor for feature f of image i. A metric Mf is then learned for each feature separately. To take a joint decision for the classification, we propose the two approaches described below.
Distance sum

A joint distance is obtained by adding the feature-wise distances and their bias terms. Both the metrics and the bias terms are learned using the LDML algorithm. This is shown in Eq. (5.1):

p(yi = yj|xi,xj) = σ( Σf bf − Σf dMf(xi,xj) )    (5.1)

Logistic regression
The problem with the distance sum approach is that it assumes every feature contributes equally to the final decision. However, this is not the case. To confirm this assertion, we refer to the joint learning described in Section 3.8. The Mahalanobis distance can be seen as a weighted combination of the entries of the difference vector, i.e.

(xi − xj)⊤M(xi − xj) = Σu Σv muv (x^u_i − x^u_j)(x^v_i − x^v_j),

so the magnitude of muv describes how significant the pair of entries uv is.
Fig. 5.1 plots the energy of the entries of M, which correlate the facial feature pairs according to a global learning, i.e. the entry at row u and column v shows how correlated facial feature u is with facial feature v. As expected, there is higher energy in the diagonal. However, the energy is not equally distributed over the diagonal, implying that some features are more important than others. Interestingly, the eyes are much more discriminative than the nose and the mouth. We
Figure 5.1: Energy distribution for a joint learning of a facial feature based descriptor
assume this is a consequence of expression variation in the case of the mouth, and of pose affecting the nose. Based on these observations, we use logistic regression to find proper weights for each facial feature, as shown in Eq. (5.2):

p(yi = yj|xi,xj) = σ( w0 + Σf wf dMf(xi,xj) )    (5.2)
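A sketch of this combination is given below. The distances and weights are illustrative values; in practice w0 and the wf are learned by logistic regression on training pairs, and for the "same person" probability to be high at small distances the learned wf come out negative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combine_feature_distances(dists, w0, w):
    """Eq. (5.2): per-feature Mahalanobis distances d_{Mf}(xi, xj)
    combined through a logistic function, one weight per feature."""
    return sigmoid(w0 + np.dot(np.asarray(w), np.asarray(dists)))
```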
Table 5.1 shows the accuracy achieved for each facial feature separately, reported for one fold of the aligned commercial version of LFW, together with the accuracy of both types of combination (also for 1 fold). Notice that a single feature is not highly discriminative; it is nevertheless better than a simple Euclidean distance classification using all the features (see Table 3.2). Furthermore, the performance improves when the features are combined. In this case there was no difference between the distance sum and logistic regression. However, we noticed that logistic regression assigned higher weights to the eyes, followed by the nose, with lower weights for the mouth, which is consistent with the global learning depicted in Fig. 5.1. When running the algorithm for more folds, a difference appeared between the distance sum and logistic regression. The last two lines of Table 5.1 show results for more folds, where logistic regression has an advantage over the distance sum, though the difference is not significant.
Table 5.1: Results for separate facial feature learning

Number of folds   Feature               Accuracy
1                 left eye left         0.7311
                  left eye right        0.7832
                  right eye left        0.7849
                  right eye right       0.7412
                  nose left             0.7597
                  nose center           0.7445
                  nose right            0.7378
                  mouth left            0.6840
                  mouth right           0.7143
                  Distance sum          0.8434
                  Logistic regression   0.8434
4                 Distance sum          0.8170 ± 0.0050
                  Logistic regression   0.8191 ± 0.0055

5.2.1 Occlusion detection
To detect occlusions we adopt a discriminative model for each facial feature, in which the descriptor x^f_i is classified as normal or occluded. We profit from the already implemented appearance model used by the facial feature localization algorithm (cf. Section 3.2).
The confidence value is modeled as p(f^i|I) = σs,b(p(ai|F)/p(ai|F̄)), i.e. the output of the appearance model, a likelihood ratio between the feature and background models, passed through a sigmoid function. This gives a probabilistic estimate of how well the feature fits the appearance model. Notice the sigmoid function has two parameters, s and b, the slope and bias. These parameters could be included in the learning; however, in our experiments we used s = 1 and b = 0 for simplicity. Fig. 5.2a shows some examples of correctly detected abnormalities (p(f^i|I) < 0.5). We found that this method not only detects outliers caused by objects occluding the facial feature, but is also useful for detecting erroneous localizations, as shown in the last two images of Fig. 5.2a. For a pair of images Ii and Ij, a confidence vector qij is created as given in Eq. (5.3):

qij = ( p(f^1|Ii) × p(f^1|Ij), ..., p(f^9|Ii) × p(f^9|Ij) )⊤    (5.3)
To use the confidence values in the classification of an unseen pair of examples, we first tried normalizing qij such that its L1-norm equals 9, and then multiplying the distance of facial feature f by the corresponding entry of the normalized qij. However, this did not affect the performance much for the distance sum, and it decreased the accuracy in the case of logistic regression weighting. Instead, we propose to use the confidence values not only for the distance but also for the bias. In the case of the distance sum, the classification function from
Table 5.2: Feature combination comparison using confidence values. Results reported for 4 folds

                       Distance Sum      Logistic Regression
Normal                 0.8170 ± 0.0050   0.8191 ± 0.0055
Confidence weighting   0.8306 ± 0.0048   0.8272 ± 0.0050
Eq. (5.1) is modified as shown in Eq. (5.5):

p(yi = yj|xi,xj) = σ( Σf q^f_ij bf − Σf q^f_ij dMf(xi,xj) )    (5.4)

                 = σ( Σf q^f_ij (bf − dMf(xi,xj)) )    (5.5)
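The adaptive-threshold behavior of this weighting can be sketched directly; the confidence, bias and distance values below are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def confidence_weighted_decision(q, b, d):
    """Eq. (5.5): each per-feature confidence q_ij^f weights the
    disparity between the feature distance d_{Mf} and its bias b_f.
    A fully occluded feature (q = 0) drops out of the decision."""
    q, b, d = (np.asarray(a, dtype=float) for a in (q, b, d))
    return sigmoid(np.sum(q * (b - d)))
```

For example, a feature with a grossly wrong distance (say, because it is occluded) flips the decision when trusted, but is ignored once its confidence is zero.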
This can be thought of as an adaptive threshold, a function of the confidence of each facial
feature, or as a confidence weighting of the disparity between the feature distance and its
threshold. In the worst case, when a facial feature is entirely occluded, the confidence value
removes its effect completely from the classification. For the logistic regression based
combination, we assumed the learned bias value w_0 can be split according to the learned
weights. The classification function is shown in Eq. (5.6) and the results are given in Table 5.2.
Notice that the weights are learned for the classifier in Eq. (5.2), and the confidence values
are inserted into the classification function only for the evaluation of the test set. However,
it would be desirable to also incorporate the confidence values into the training process, such
that logistic regression finds the optimal weights for this task.
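The adaptive-threshold decision of Eq. (5.5) can be sketched as follows; the per-feature biases b_f and distances d_f are hypothetical values, not the learned ones:

```python
import math

# Sketch of Eq. (5.5): each facial feature f contributes q_f * (b_f - d_f),
# so a fully occluded feature (q_f = 0) is removed from the decision.
def p_same(q, b, d):
    s = sum(qf * (bf - df) for qf, bf, df in zip(q, b, d))
    return 1.0 / (1.0 + math.exp(-s))   # sigmoid

b = [1.0] * 9            # hypothetical per-feature thresholds
d = [0.5] * 9            # small distances: likely the same person
q = [1.0] * 8 + [0.0]    # last feature fully occluded
p = p_same(q, b, d)
```

With q_f = 0 the corresponding distance has no influence at all: replacing the last distance by an arbitrarily large value leaves p unchanged.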
    p(y_i = y_j | x_i, x_j) = \sigma( \sum_f q_{ij}^f w_0 \frac{w_f}{\sum_k w_k} - \sum_f q_{ij}^f w_f d_{M_f}(x_i, x_j) )    (5.6)

5.2.2 Discussion
In this section we have shown that separate learning can be done according to spatial regions,
and the distances can then be combined to make a joint decision. The results show that a single
feature is not very discriminative, but their combination brings a significant improvement. We
found no major difference between the distance sum approach and logistic regression.
Even though it was not possible to reach the same results as with a global learning, this
Figure 5.2: Examples of outlier detections: red for occluded features and yellow for normal
features (we refer the reader to the electronic version of the document). (a) correct outlier
detections; (b) wrong outlier detections.
proves that the separation can be done. A cause for this limitation might be that we cannot
benefit from inter-feature correlation or, more importantly, that the distances are not
comparable. One way to overcome this problem is to do a global learning in which the matrix
M is restricted to be block diagonal. This is equivalent to learning each facial feature metric
separately, but in such a way that the distances are comparable.
The results from Table 5.2 also show that the confidence value for each facial feature can
be integrated effectively into the decision function. This allows us to handle occlusions and/or
wrong localizations. We hope that, if the feature-wise learning becomes comparable to global
learning, using the confidence values will improve the accuracy even further.
A limitation of this algorithm is that it depends on a robust appearance model. As illustrated
in Fig. 5.2b, the outlier detection algorithm might fail, producing false detections. We suggest
that it is necessary to implement an appearance model trained specifically for this task.
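The block-diagonal idea can be checked with a small sketch (the blocks and vectors here are hypothetical): constraining M to be block diagonal makes the global Mahalanobis distance decompose exactly into a sum of per-feature distances.

```python
# Sketch: a block-diagonal Mahalanobis metric decomposes as
# d_M(x, y) = sum_f d_{M_f}(x_f, y_f), so one constrained global learning
# yields per-feature distances that are directly comparable.
# The 2x2 blocks below are hypothetical.

def mahalanobis(M, x, y):
    diff = [a - b for a, b in zip(x, y)]
    n = len(M)
    return sum(diff[r] * M[r][c] * diff[c] for r in range(n) for c in range(n))

def block_diag(blocks):
    n = sum(len(b) for b in blocks)
    M = [[0.0] * n for _ in range(n)]
    off = 0
    for b in blocks:
        for r in range(len(b)):
            for c in range(len(b)):
                M[off + r][off + c] = b[r][c]
        off += len(b)
    return M

blocks = [[[2.0, 0.5], [0.5, 1.0]], [[1.0, 0.0], [0.0, 3.0]]]
x, y = [1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 2.0]
whole = mahalanobis(block_diag(blocks), x, y)
parts = mahalanobis(blocks[0], x[:2], y[:2]) + mahalanobis(blocks[1], x[2:], y[2:])
```

The distance under the full block-diagonal matrix equals the sum of the two per-block distances.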
5.3 Non-linear approaches

In this section we describe some experiments using non-linear algorithms. In every case the
descriptor is the multiscale SIFT computed at the location of the facial features, the same as in
the previous section but without the separation into feature-wise vectors.
5.3.1 Spectral regression kernel discriminant analysis<br />
In this case we used SR-KDA (see Section 3.6.1) to find a non-linear projection of the input
data such that discriminant information between the classes, i.e. the identities, is emphasized.
In the target space, a linear classification algorithm can be used. We compared Euclidean
distance and LDML as classifiers. The results, computed over 10-fold cross-validation, are
shown in Table 5.3.
When comparing these results with the baseline from Table 3.2, also included in Table 5.3,
the contribution of SR-KDA becomes evident. This is especially visible for Euclidean distance
classification, which presents an accuracy increase of 10%. For LDML there is a 1% gain over
the baseline. These results show that if Euclidean distance is used for classification, SR-KDA
significantly improves the accuracy; for LDML the contribution is not as large. A limitation
of SR-KDA is that it is computationally expensive for a large quantity of data and classes,
which is the case for LFW.
Table 5.3: Results using the SR-KDA projection of the input data, obtained over 10-fold cross-validation.

                       Euclidean distance   LDML
    Not using SR-KDA   0.6845 ± 0.0051      0.8524 ± 0.0052
    Using SR-KDA       0.7883 ± 0.0029      0.8622 ± 0.0056
5.3.2 Clustering

In this section we describe another non-linear algorithm we explored, in which the input data
is divided into clusters. This can be done before or after LDML learning. The intuition is that
similar faces are expected to be grouped together in a cluster. Therefore, if learning is done
specifically for that cluster, similar data might be separated in a way that was not possible
with a global training. This is a divide and conquer strategy.
Pose adaptive classifier

Following [7], described in Section 2.4, we build pose adaptive classifiers, where each pose
is considered as a cluster. To assign a pose to an unseen example, a simple approach was
implemented: three images were taken from the IMM database [25], one for each case: left (L),
right (R) and frontal (F) pose. The identity, illumination and expression remained unchanged.
For an unseen image, we assign the pose of the reference image at minimum Euclidean
distance. This is the same approach as in [7], but with a different descriptor.
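This nearest-reference pose assignment can be sketched as follows; the two-dimensional descriptors stand in for the real ones and are hypothetical:

```python
import math

# Sketch of the simple pose assignment: an unseen descriptor receives the
# pose label (L, R or F) of the nearest of three reference descriptors
# (in the thesis, taken from the IMM database). Vectors are hypothetical.

def assign_pose(x, references):
    """references: pose label -> reference descriptor."""
    def euclidean(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    return min(references, key=lambda pose: euclidean(x, references[pose]))

refs = {"L": [1.0, 0.0], "F": [0.0, 0.0], "R": [-1.0, 0.0]}
pose = assign_pose([0.9, 0.1], refs)   # closest to the left reference
```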
Once the images are clustered according to pose, an LDML classifier is trained for each
possible pair of poses, i.e. the six classifiers LL, LR (RL), LF (FL), RR, RF (FR) and FF.
Table 5.4 shows the obtained results for 5 out of 10 folds. For comparison, the accuracy of
the baseline algorithm is also shown as global learning.
Table 5.4: Results for pose adaptive classification, using 5 out of 10 folds.

                      Accuracy
    Global learning   0.8504 ± 0.0049
    Pose adaptive     0.8358 ± 0.0026
Table 5.5 shows the number of pairs from the test set assigned to each of the pose combinations,
together with the accuracy achieved for each pose combination separately. The results show
that frontal-frontal classification (FF) remained similar to that of global learning. However,
the results for the other combinations are not as good, with the worst case for pairs assigned
to the left and right (LR) classifier.
Table 5.5: Obtained pose combination accuracy. Results reported over 5 folds.

                       LL       FF       RR       LF (FL)   LR (RL)   FR (RF)
    Number of pairs    14       2302     16       310       23        290
    Accuracy           0.7833   0.8564   0.6524   0.7979    0.7517    0.8275
Unsupervised clustering

In this case the data is projected using the matrix L learned by the LDML algorithm. In the
new space we explored the different clustering strategies described in Table 5.6. Here the
objective is to train a classifier for each cluster separately, not for each pair of clusters. For
an unseen pair of examples, if the images are assigned to different clusters, they are classified
as having different identities. If they are assigned to the same cluster, the decision is made
by the LDML classifier trained for that specific cluster.
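The resulting decision rule can be sketched as below; `cluster_of` and the per-cluster classifiers are hypothetical stand-ins for the learned assignment and the cluster-specific LDML models.

```python
# Sketch of the cluster-wise decision rule: a pair assigned to different
# clusters is immediately labelled "different identity"; otherwise the
# classifier trained for that cluster decides.

def classify_pair(xi, xj, cluster_of, classifiers):
    ci, cj = cluster_of(xi), cluster_of(xj)
    if ci != cj:
        return False                   # different clusters -> negative pair
    return classifiers[ci](xi, xj)     # cluster-specific decision

cluster_of = lambda x: 0 if x[0] < 0.5 else 1     # toy 1-D assignment
classifiers = {0: lambda a, b: True, 1: lambda a, b: False}
```

The early-exit branch is exactly why split positive pairs are irrecoverable, which motivates the clustering quality measure used below.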
Table 5.6: Clustering algorithms.

    Identifier   Description
    KM           Standard k-means.
    S KM         k-means with added supervision: at the assignment step of the
                 k-means algorithm, points belonging to the same class are
                 assigned to the same cluster.
    GMM          Gaussian Mixture Model.
In terms of computation time, this represents an improvement, as the complexity of LDML is
quadratic in the number of points. Therefore, splitting the data into k clusters and training
k classifiers, each using n/k points, makes the algorithm k times faster.
However, as Table 5.7 shows, this approach did not give good results. The reported accuracy
is the ratio of positive pairs assigned to the same cluster. This value is used to measure the
performance of the clustering because positive pairs assigned to different clusters are labeled
as having different classes without possible correction. It can be considered an upper bound
on performance, and therefore the expected results are lower than those obtained with an
unclustered training. Due to the low clustering accuracy, we did not proceed to compute the
metric for each cluster.
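The clustering quality measure of Table 5.7 can be sketched as follows; the cluster assignments and positive pairs below are hypothetical:

```python
# Sketch of the clustering quality measure reported in Table 5.7: the
# fraction of positive (same-identity) pairs whose images fall in the same
# cluster. Split positive pairs are irrecoverably labelled "different", so
# this ratio upper-bounds the achievable verification accuracy.

def same_cluster_ratio(positive_pairs, assignment):
    """positive_pairs: (i, j) image-index pairs; assignment: index -> cluster."""
    same = sum(1 for i, j in positive_pairs if assignment[i] == assignment[j])
    return same / len(positive_pairs)

assignment = {0: 0, 1: 0, 2: 1, 3: 2}   # hypothetical cluster ids per image
pairs = [(0, 1), (0, 2), (2, 3)]        # hypothetical positive pairs
ratio = same_cluster_ratio(pairs, assignment)
```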
Table 5.7: Ratio of positive pairs assigned to the same cluster.

                 Number of clusters
    Algorithm    3         4         5         10        30
    KM           0.75676   0.69932   0.63851   0.59797   0.34122
    S KM         0.71622   0.67905   0.65203   0.55405   0.32095
    GMM          0.87162   0.79054   0.79392   0.59122   0.31757
Mixture model classification

The final clustering approach is to do a soft assignment using a Gaussian Mixture Model
(GMM), where the covariances are restricted to be diagonal. In this case a classifier is trained
for every combination of clusters. There is no gain in computational efficiency, but there is a
"finer" learning, which might capture information missed by a global learning. The
classification function for an unseen pair is given in Eq. (5.7).
    p(y_i = y_j | x_i, x_j) = \sum_u \sum_v p(u|x_i) p(v|x_j) p(y_i = y_j | x_i, x_j; M_{uv}, b_{uv})    (5.7)
where p(k|x) is the posterior probability for x to belong to cluster k, taken directly from
the GMM. The parameters M_{uv} and b_{uv} are learned using the set of training points that
belong either to cluster u or cluster v after a hard assignment (MAP). Table 5.8 shows the
results; comparing with the baseline, whose accuracy is 0.8672 for the first fold, we can
deduce that there is no gain in trying to make a finer classification.
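The soft-assignment decision of Eq. (5.7) can be sketched as follows; the posteriors, biases and distances below are hypothetical placeholders for the GMM outputs and the learned pairwise classifiers:

```python
import math

# Sketch of Eq. (5.7): the pairwise probability is averaged over every
# cluster combination (u, v), weighted by the GMM posteriors p(u|x_i) and
# p(v|x_j). Each (u, v) classifier is summarized by a bias and a distance.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def p_same_soft(post_i, post_j, bias, dist):
    """bias[u][v], dist[u][v]: parameters of the classifier for clusters (u, v)."""
    return sum(post_i[u] * post_j[v] * sigmoid(bias[u][v] - dist[u][v])
               for u in range(len(post_i)) for v in range(len(post_j)))

post_i, post_j = [0.7, 0.3], [0.4, 0.6]   # hypothetical GMM posteriors
bias = [[1.0, 1.0], [1.0, 1.0]]
dist = [[0.2, 2.0], [2.0, 0.1]]
p = p_same_soft(post_i, post_j, bias, dist)
```

Since the posteriors sum to one, the result stays a valid probability; with posteriors concentrated on a single cluster pair, the rule collapses to that pair's classifier.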
5.3.3 Discussion

In this section we presented some non-linear approaches for face recognition using feature
based descriptors. The experiments with SR-KDA showed that there is indeed a gain from using
Table 5.8: Accuracy obtained over 1 fold using a GMM model.

    Number of clusters   2        3        4
    Accuracy             0.8574   0.8387   0.8454
non-linear algorithms to separate the input data. A simple classification such as Euclidean
distance is improved significantly, by more than 10% in accuracy; for LDML, however, the
gain was only 1%. The computational cost is another factor to take into account, as SR-KDA
is computationally expensive for a large quantity of classes and data.
When using clustering approaches, there is a problem due to the large quantity of positive
pairs assigned to different clusters. This effect can be reduced only by diminishing the number
of clusters being considered. However, the results showed that even with as few as 2 or 3
clusters, the loss of positive pairs in the clustering is too high. The only way to overcome
this limitation is to use a soft assignment through a mixture model. In this case, the results
are similar to those of a global learning; thus, there is no gain in using this approach.
Chapter 6

Combining face representations
In previous chapters, it has been demonstrated that a good recognition rate can be achieved by
learning a proper metric with algorithms such as LDML. The feature vectors representing the
face can be either a HoG encoding or SIFT descriptors at the location of each facial feature.
In this chapter, as a last experiment, we demonstrate that the classification performance can
be improved even further by combining the distances of all the descriptors.
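The per-descriptor comparisons referred to throughout are Mahalanobis distances; a small sketch, where the metric matrix is a hypothetical stand-in for one learned by LDML:

```python
import numpy as np

# Sketch of the Mahalanobis distance d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)
# underlying each descriptor's comparison. M is a hypothetical placeholder
# for a learned metric; any positive semi-definite matrix qualifies.

def d_M(M, xi, xj):
    diff = xi - xj
    return float(diff @ M @ diff)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
M = A @ A.T                               # PSD by construction
xi, xj = rng.normal(size=4), rng.normal(size=4)
dist = d_M(M, xi, xj)                     # non-negative for PSD M
```

With M the identity this reduces to the squared Euclidean distance, which is the baseline classifier used earlier.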
To combine the descriptors, two approaches were explored, both in a logistic framework.
In the first case, a global distance is obtained by adding the distances of all descriptors; the
learned biases are combined as well, and both terms are passed through a sigmoid function.
Let x^f_i denote the feature vector of type f for face i, where f can be the SIFT descriptors
computed at 3 scales at the location of the facial features, a HoG descriptor using the
parameters from Table 4.9, or the facial feature patches described in Section 2.2. Let M_f
denote the learned metric for feature f. This approach is shown in Eq. (6.1):

    p(y_i = y_j | x^1_i, ..., x^F_i, x^1_j, ..., x^F_j) = \sigma( \sum_{f=1}^F b_f - \sum_{f=1}^F d_{M_f}(x^f_i, x^f_j) )    (6.1)
The other way to combine the features is by logistic regression. In this case the sum from
Eq. (6.1) becomes a linear combination of the distances. The weight assigned to each feature
type and the joint bias term are learned with the logistic regression algorithm. From the
training examples, a large set of pairs is created and their distances are computed using the
metrics learned in the previous experiments. This set of distances is then used as the training
data for the logistic regression, see Section 3.6.2. The decision function is given in Eq. (6.2).
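The two combination rules can be sketched side by side; all biases, weights and distances below are hypothetical:

```python
import math

# Sketch of the two combination rules: Eq. (6.1) sums biases and distances,
# Eq. (6.2) is a logistic regression over the per-descriptor distances.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def combine_distance_sum(dists, biases):            # Eq. (6.1)
    return sigmoid(sum(biases) - sum(dists))

def combine_logistic(dists, w0, weights):           # Eq. (6.2)
    return sigmoid(w0 + sum(w * d for w, d in zip(weights, dists)))

dists, biases = [0.4, 0.6], [1.0, 1.0]
p1 = combine_distance_sum(dists, biases)
p2 = combine_logistic(dists, 2.0, [-1.0, -1.0])     # reduces to Eq. (6.1)
```

With w_0 equal to the summed biases and all weights set to -1, the logistic combination reduces exactly to the distance sum, which the example illustrates.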
    p(y_i = y_j | x^1_i, ..., x^F_i, x^1_j, ..., x^F_j) = \sigma( w_0 + \sum_{f=1}^F w_f d_{M_f}(x^f_i, x^f_j) )    (6.2)

6.1 Results for LFW

Table 6.1 shows the results obtained for the LFW-Aligned Commercial dataset. For comparison,
the results for each feature trained separately are shown as well. In most cases, the accuracy
is higher than for the individual features. Based on these results, we can conclude that holistic
and feature based descriptors provide complementary information.
When combining multiscale SIFT and the facial feature patches, there was a decay in accuracy.
Notice that, for the same case, the standard deviation increased significantly. The reason for
this decay is that the weights found by logistic regression differ considerably between the
folds, although their relative proportions are maintained, i.e. as expected, multiscale SIFT is
given a larger weight than the facial feature patches. This causes a problem when a global
threshold is selected for all the folds, as done in the accuracy computation. To correct this
problem, a regularization was added such that the L2-norm of the weight vector (excluding
the bias) is constant. The results in Table 6.1 show that this strategy corrects the problem.
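One plausible reading of this regularization can be sketched as follows (a sketch, not the thesis implementation): rescale the learned weight vector to a fixed L2 norm, scaling the bias by the same factor so that the sign of every decision is preserved. The values below are hypothetical.

```python
import math

# Sketch: renormalize the logistic-regression weights (bias excluded) to a
# constant L2 norm so thresholds are comparable across folds. Scaling the
# bias by the same positive factor leaves all classifications unchanged.

def renormalize(w0, weights, target_norm=1.0):
    norm = math.sqrt(sum(w * w for w in weights))
    scale = target_norm / norm
    return w0 * scale, [w * scale for w in weights]

w0, w = renormalize(3.0, [3.0, 4.0])   # original weight norm is 5.0
```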
Notice that the results are not very different between distance sum and logistic regression.
Especially for the HoG and multiscale SIFT combination, the reason is that logistic regression
assigns the same weights to both descriptors. When the facial feature patches descriptor was
added, the learning process assigned it a low weight, reducing its contribution. This was
confirmed in our experiments.
It is important to remark that the highest gain comes from the combination of the HoG
descriptor (holistic) with the multiscale SIFT (facial feature based). The facial feature

Table 6.1: Results for the combination of descriptors in the LFW benchmark. Descriptors: used (+), not used (−).

    SIFT         HoG         Feature    Distance           LR                 LR
    multiscale   (squared)   patches    sum                                   (regularized)
    +            −           −          0.8524 ± 0.0052
    −            +           −          0.8530 ± 0.0065
    −            −           +          0.7385 ± 0.0046
    −            +           +          0.8607 ± 0.0054    0.7901 ± 0.0119    0.8600 ± 0.0060
    +            −           +          0.8536 ± 0.0053    0.8154 ± 0.0152    0.8515 ± 0.0052
    +            +           −          0.8766 ± 0.0050    0.8749 ± 0.0049    0.8759 ± 0.0052
    +            +           +          0.8719 ± 0.0058    0.8746 ± 0.0047    0.8724 ± 0.0059
[Figure: ROC curves (true positive rate vs. false positive rate) for "SIFT_FF + HoG, aligned", "LDML+MKNN, funneled" and "Multishot combined, aligned".]
Figure 6.1: Receiver operating characteristic curve for the HoG and SIFT combination on the
LFW benchmark. Results for [16] and [36] are also shown.
patches do not bring any contribution, as the information they represent is already encoded
by the multiscale SIFT descriptor. The performance of this algorithm is among the state of
the art for the LFW benchmark. The ROC curve for the HoG and SIFT combination is shown
in Fig. 6.1.
Our results are comparable with the performance of state-of-the-art algorithms reported for
the unrestricted paradigm of LFW. These methods achieved accuracies of 0.8517 ± 0.0061 [36],
0.8750 ± 0.0040 [16] and 0.8950 ± 0.0051 [36].
6.2 Results for PubFig

We tested the algorithm on the PubFig dataset [22]. In this case, however, the pipeline
included face detection and a facial feature based alignment prior to the computation of the
descriptors. The facial feature patches were discarded due to the results obtained for LFW.
A problem is that our results are not directly comparable to those reported in [22], because
some images have been removed from their original locations. For that reason we do not follow
the training protocol, which is defined as a "restricted" paradigm. We train using the label
information, so that many more pairs can be generated for training. However, we keep using
10-fold cross-validation for evaluation.
We take advantage of the separation of sets according to illumination, expression and pose,
which allows us to observe how sensitive our algorithm is to these factors. The training images
are the same, but we test only on the specified benchmark.
Table 6.2 shows the results for all the PubFig benchmarks; the combination algorithm is the
distance sum. Logistic regression for the combination of features gave practically the same
results; again, the learned weights are practically the same.
From the results it can be concluded that our algorithm is sensitive to pose changes, as
there is a difference of almost 5% between the posefront and poseside benchmarks. The same
happens with the light benchmarks, with a difference of almost 4% between lightfront and
lightside. In the case of expression, there was almost no difference. The ROC curves are
illustrated in Fig. 6.2.
Table 6.2: Results for the different variants of the PubFig dataset.

    Dataset               Accuracy
    pubfig full           0.7763 ± 0.0068
    pubfig posefront      0.8111 ± 0.0139
    pubfig poseside       0.7656 ± 0.0108
    pubfig lightfront     0.7875 ± 0.0125
    pubfig lightside      0.7485 ± 0.0080
    pubfig exprneutral    0.7733 ± 0.0128
    pubfig exprexpr       0.7759 ± 0.0072
[Figure: ROC curves (true positive rate vs. false positive rate) for the seven PubFig benchmarks listed in Table 6.2.]
Figure 6.2: Receiver operating characteristic curves for the combination of HoG and SIFT
descriptors on the PubFig benchmarks.
Chapter 7

Conclusions and future work
In this thesis we compared two robust descriptors for face recognition in uncontrolled settings.
The first one is the Histogram of Oriented Gradients, computed over the entire face, and
therefore a holistic approach. For the HoG descriptor we found a suitable set of parameters
which gave good performance on the Labeled Faces in the Wild benchmark. We concluded
that the use of face alignment is crucial when combined with a metric learning algorithm such
as LDML. The alignment must be robust in terms of translations, so that the facial features of
the pair of images being compared are localized in approximately the same spatial cell. The
coordinates of the facial features can be used to obtain a transformation which aligns the face
to the desired pose. However, the alignment algorithm and/or the facial feature point
localization still need improvement.
The second visual feature vector we studied is a multiscale SIFT descriptor computed at the
location of the facial features, and therefore a feature based approach. This strategy gave good
performance when combined with LDML. We concluded that it is possible to train separately
for each facial feature and then combine the distances to make a joint decision. Even though
the results were not as good as those of a global learning, this opened the door to handling
occlusions. We obtained a confidence value for each facial feature from a discriminative
appearance model; it measures how reliable the descriptor information is, i.e. how likely the
feature is neither occluded nor badly localized. The confidence value was successfully
integrated into the decision function, which increased the accuracy.
We also studied non-linear methods, from which we did not obtain good results for the
clustering strategies, whether based on pose, on unsupervised clustering or on a Gaussian
mixture model. However, algorithms such as SR-KDA are able to find non-linear discriminant
information in the data; SR-KDA achieved a slight increase in performance, at the expense of
being more computationally expensive. These results show that it would be interesting to go
further into studying other types of non-linear algorithms.
Finally, we demonstrated that the distances given by different descriptors can be integrated
to boost the performance of the face recognition pipeline. The obtained performance is
comparable with the state of the art on the Labeled Faces in the Wild and Public Figures
benchmarks.
The PubFig benchmark shows that our algorithm is highly sensitive to pose and illumination
changes. In the case of illumination, this means that the normalization being used is not
sufficiently invariant to this factor, and it is therefore necessary to address this issue.
Bibliography<br />
[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face recognition with local binary patterns.
In European Conference on Computer Vision (ECCV), pages 469–481, 2004.
[2] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach,<br />
2000.<br />
[3] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition<br />
using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine<br />
Intelligence, 19:711–720, 1997.<br />
[4] C. Bishop. Pattern recognition and machine learning (Information Science and Statistics).<br />
Springer, 1st ed. 2006. corr. 2nd printing edition, October 2007.<br />
[5] G. Bradski. The OpenCV library. Dr. Dobb’s Journal of Software Tools, 2000.<br />
[6] Deng Cai. Efficient kernel discriminant analysis via spectral regression. Technical report,<br />
2007.<br />
[7] Z. Cao, Q. Yin, X. Tang, and Jian S. Face recognition with learning-based descriptor. In<br />
Proc. Computer Vision and Pattern Recognition, 2010.<br />
[8] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow<br />
and appearance. In European Conference on Computer Vision, 2006.<br />
[9] J. Davis, B. Kulis, S. Sra, and I. Dhillon. Information-theoretic metric learning. In in<br />
NIPS 2006 Workshop on Learning to Compare Examples, 2007.<br />
[10] M. Everingham, J. Sivic, and A. Zisserman. "Hello! My name is... Buffy" – automatic
naming of characters in TV video. In BMVC, 2006.
[11] M. Everingham, J. Sivic, and A. Zisserman. Taking the bite out of automatic naming of<br />
characters in TV video. Image and Vision Computing, 27(5), 2009.<br />
[12] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. Int. J.<br />
Comput. Vision, 61(1):55–79, 2005.<br />
[13] A. Ferencz, E. Learned-Miller, and J. Malik. Learning hyper-features for visual identifica-<br />
tion. In Neural Information Processing Systems, volume 18, 2004.<br />
[14] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an
application to boosting. In European Conference on Computational Learning Theory, pages
23–37, 1995.
[15] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Automatic face naming with
caption-based supervision. In Conference on Computer Vision & Pattern Recognition,
pages 1–8, June 2008.
[16] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for<br />
face identification. In International Conference on Computer Vision, sep 2009.<br />
[17] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant<br />
mapping. In CVPR ’06: Proceedings of the 2006 IEEE Computer Society Conference on<br />
Computer Vision and Pattern Recognition, <strong>page</strong>s 1735–1742, Washington, DC, USA, 2006.<br />
IEEE Computer Society.<br />
[18] G. Huang and V. Jain. Unsupervised joint alignment of complex images. In In ICCV,<br />
2007.<br />
[19] G. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A<br />
database for studying face recognition in unconstrained environments. Technical Report<br />
07-49, University of Massachusetts, Amherst, October 2007.<br />
[20] A. Kläser. Human detection and character recognition in TV-style movies. In Informatiktage,
pages 151–154, 2007.
[21] N. Kumar, P. N. Belhumeur, and S. K. Nayar. FaceTracer: A search engine for large
collections of images with faces. In European Conference on Computer Vision (ECCV),
pages 340–353, October 2008.
[22] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers<br />
for face verification. In IEEE International Conference on Computer Vision (ICCV), Oct<br />
2009.<br />
[23] D. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision,<br />
60(2):91–110, 2004.
[24] B. Moghaddam. Bayesian face recognition. Pattern Recognition, 33(11):1771–1782, Novem-<br />
ber 2000.<br />
[25] M. M. Nordstrøm, M. Larsen, J. Sierakowski, and M. B. Stegmann. The IMM face database<br />
- an annotated dataset of 240 face images. Technical report, Informatics and Mathematical<br />
Modelling, Technical University of Denmark, DTU, Richard Petersens Plads, Building 321,<br />
DK-2800 Kgs. Lyngby, may 2004.<br />
[26] E. Nowak and F. Jurie. Learning visual similarity measures for comparing never seen<br />
objects. In Conference on Computer Vision & Pattern Recognition, jun 2007. see also<br />
http://lear.inrialpes.fr/people/nowak/.<br />
[27] P. J. Phillips, P. Flynn, T. Scruggs, K. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min,
and W. Worek. Overview of the face recognition grand challenge. pages 947–954, 2005.
[28] P. J. Phillips, W. T. Scruggs, A. O'Toole, P. Flynn, K. Bowyer, C. Schott, and M. Sharpe.
FRVT 2006 and ICE 2006 large-scale experimental results. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 32:831–846, 2010.
[29] S. Phimoltares, C. Lursinsap, and K. Chamnongthai. Face detection and facial feature<br />
localization without considering the appearance of image context. Image Vision Comput.,<br />
25(5):741–753, 2007.<br />
[30] N. Pinto, J. J. di Carlo, and D. D. Cox. Establishing good benchmarks and baselines for<br />
face recognition. In Faces in real life images workshop at ECCV08, 2008.<br />
[31] N. Pinto, J.J. DiCarlo, and D.D. Cox. How far can you get with a modern face recognition<br />
test set using only simple features? Computer Vision and Pattern Recognition, IEEE<br />
Computer Society Conference on, 0:2591–2598, 2009.<br />
[32] F. Porikli. Integral histogram: A fast way to extract histograms in Cartesian spaces. In
Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 829–836, 2005.
[33] S. Rizvi, P. J. Phillips, and H. Moon. The FERET verification testing protocol for face<br />
recognition algorithms, 1999.<br />
[34] J. Shi and C. Tomasi. Good features to track, 1994.<br />
[35] J. Sivic, M. Everingham, and A. Zisserman. “Who are you?”: Learning person specific<br />
classifiers from video. In Proceedings of the IEEE Conference on Computer Vision and<br />
Pattern Recognition, 2009.
[36] Y. Taigman, L. Wolf, and T. Hassner. Multiple one-shots for utilizing class label informa-<br />
tion. In The British Machine Vision Conference (BMVC), Sept. 2009.<br />
[37] X. Tan and B. Triggs. Enhanced local texture feature sets for face recognition under<br />
difficult lighting conditions. In Analysis and modelling of faces and gestures, volume 4778<br />
of LNCS, <strong>page</strong>s 168–182. Springer, oct 2007.<br />
[38] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience,<br />
3(1):71–86, 1991.<br />
[39] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features.<br />
Proc. CVPR, 1:511–518, 2001.<br />
[40] X. Wang and X. Tang. A unified framework for subspace face recognition. IEEE Trans.<br />
Pattern Anal. Mach. Intell., 26(9):1222–1228, 2004.<br />
[41] K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor<br />
classification. J. Mach. Learn. Res., 10:207–244, 2009.<br />
[42] L. Wolf, T. Hassner, and Y. Taigman. Descriptor based methods in the wild. In Real-Life<br />
images workshop at the European Conference on Computer Vision (ECCV), October 2008.<br />
[43] L. Wolf, T. Hassner, and Y. Taigman. Similarity scores based on background samples. In<br />
Asian Conference on Computer Vision (ACCV), 2009.<br />
[44] M. Yang. Face recognition using kernel methods, 2001.<br />
[45] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: a literature<br />
survey. ACM Comput. Surv., 35(4):399–458, 2003.<br />
[46] J. Zhu, L. Van Gool, and S. Hoi. Unsupervised face alignment by robust nonrigid mapping.<br />
In IEEE International Conference on Computer Vision, 2009.<br />
[47] Q. Zhu, M. Yeh, K. Cheng, and S. Avidan. Fast human detection using a cascade of<br />
histograms of oriented gradients. In CVPR ’06: Proceedings of the 2006 IEEE Computer<br />
Society Conference on Computer Vision and Pattern Recognition, <strong>page</strong>s 1491–1498, Wash-<br />
ington, DC, USA, 2006. IEEE Computer Society.