text classification, showed that even with this relatively simple method, weakly related unlabeled data can help improve the classification accuracy.
Semi-Supervised Learning from Weakly-Related Unlabeled Data. Building on ideas similar to self-taught learning, Yang et al. [Yang et al., 2008] recently presented an improved version of STL called "Semi-Supervised Learning with Weakly-Related Unlabeled Data" (SSLW). In particular, Yang et al. highlight that many SSL approaches are based on the cluster assumption, which, however, is violated if the unlabeled data is only weakly related to the target classes. Like STL, SSLW tries to find a better data representation that is both informative for the target classes and consistent with the feature coherence patterns of the weakly related unlabeled data.
In more detail, from the labeled data $D_L$, SSLW uses a document-word matrix $M_D = (d_1, d_2, \ldots, d_l)$, where $d_i \in \mathbb{N}^V$ is the word-frequency vector of the $i$-th document and $V$ is the size of the vocabulary. Additionally, a second matrix, the word-document matrix $G = (g_1, g_2, \ldots, g_V)$, is built from both the labeled and the unlabeled data; here $g_i = (g_{i,1}, g_{i,2}, \ldots, g_{i,n})$ records the occurrences of the $i$-th word in all $n$ documents.
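To make the two matrices concrete, the following minimal numpy sketch builds $M_D$ and $G$ for a toy corpus. The corpus, the helper word_freq, and all variable names are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

# Toy corpus: l labeled documents plus u unlabeled, weakly related ones
# (all counts are invented for illustration).
labeled_docs = ["svm margin kernel", "kernel trick margin"]
unlabeled_docs = ["cooking recipe kernel", "corn kernel recipe"]

vocab = sorted({w for doc in labeled_docs + unlabeled_docs for w in doc.split()})
V = len(vocab)
index = {w: i for i, w in enumerate(vocab)}

def word_freq(doc):
    """Word-frequency vector d_i in N^V for a single document."""
    v = np.zeros(V, dtype=int)
    for w in doc.split():
        v[index[w]] += 1
    return v

# Document-word matrix M_D: one column per labeled document (V x l).
M_D = np.stack([word_freq(d) for d in labeled_docs], axis=1)

# Word-document matrix G over ALL n documents: row g_i holds the
# occurrences of the i-th word across the n documents (V x n).
G = np.stack([word_freq(d) for d in labeled_docs + unlabeled_docs], axis=1)

print(M_D.shape, G.shape)  # (V, l) and (V, n)
```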
For an SVM formulation, one could now use $M_D$ to build the kernel $K = M_D^T M_D$ for the SVM's dual formulation. However, such a kernel would discard weakly related documents, i.e., set their similarity to zero. Therefore, Yang et al. augment the kernel with a word-correlation matrix $R \in \mathbb{R}^{V \times V}$, yielding $K = M_D^T R M_D$, where $R_{ij}$ represents the correlation between the $i$-th and the $j$-th word. The goal is to find the optimal $R$ that maximizes the categorization margin. This is done by regularizing $R$ according to $G$: an internal representation of words $W = (w_1, w_2, \ldots, w_V)$ is introduced, where $w_i$ is the internal representation of the $i$-th word, so that the word-correlation matrix can be written as $R = W^T W$. The dual formulation of the SVM then becomes a min-max problem that maximizes over $\alpha$ while minimizing over $R$:
$$\min_{R \in \Delta, U, W} \; \max_{\alpha} \;\; \alpha^T e - \frac{1}{2} (\alpha \circ y)^T (M_D^T R M_D)(\alpha \circ y) \qquad (3.16)$$
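To illustrate how the augmented kernel enters the dual objective, the sketch below builds $K = M_D^T R M_D$ from a random internal representation $W$ and evaluates the inner term of Equation 3.16 for a fixed $\alpha$. All dimensions and values are toy assumptions, and the random $W$ merely stands in for the regularized solution that the actual optimization would find.

```python
import numpy as np

rng = np.random.default_rng(0)

V, l, k = 6, 4, 3            # vocabulary size, labeled docs, internal dim (toy values)
M_D = rng.integers(0, 3, size=(V, l)).astype(float)
y = np.array([1., -1., 1., -1.])

# Internal word representations W = (w_1, ..., w_V), here random.
W = rng.normal(size=(k, V))
R = W.T @ W                  # word-correlation matrix, R_ij = <w_i, w_j>

K_plain = M_D.T @ M_D        # ignores correlations between different words
K_sslw = M_D.T @ R @ M_D     # augmented kernel K = M_D^T R M_D

def dual_objective(alpha, K, y):
    """Inner objective of Eq. 3.16: alpha^T e - 1/2 (alpha o y)^T K (alpha o y)."""
    ay = alpha * y           # Hadamard product alpha ∘ y
    return alpha.sum() - 0.5 * ay @ K @ ay

alpha = np.full(l, 0.1)
print(dual_objective(alpha, K_plain, y), dual_objective(alpha, K_sslw, y))
```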
Equation 3.16 can be solved efficiently using Second Order Cone Programming (SOCP) [Boyd and Vandenberghe, 2004]. For text categorization, SSLW has demonstrated that it can leverage both labeled and weakly related unlabeled data to improve the generalization performance, significantly outperforming self-taught learning and state-of-the-art SSL methods such as TSVM [Bennett and Demiriz, 1999] and manifold regularization [Belkin et al., 2006].
EigenTransfer. Self-taught learning and SSLW can both also be regarded as transfer learning problems, albeit ones in which the class labels of the source data are unknown.