
text classification, showed that even with this relatively unsophisticated method, weakly-related unlabeled data can help to improve the classification accuracy.

Semi-Supervised Learning from Weakly-Related Unlabeled Data  Building on ideas similar to self-taught learning, Yang et al. [Yang et al., 2008] recently presented an improved version of STL called "Semi-Supervised Learning with Weakly-Related Unlabeled Data" (SSLW). In particular, Yang et al. highlight that many SSL approaches are based on the cluster assumption, which, however, is violated if the unlabeled data is only weakly related to the target classes. SSLW also tries to find a better data representation, one that is both informative for the target class and consistent with the feature coherence patterns of the weakly-related unlabeled data.

In more detail, from the labeled data D_L, SSLW uses a document-word matrix M_D = (d_1, d_2, ..., d_l), where d_i ∈ ℕ^V is the word-frequency vector of the ith document and V is the size of the vocabulary. Additionally, they make use of a second matrix, the word-document matrix G, built from both the labeled and the unlabeled data: G = (g_1, g_2, ..., g_V), where g_i = (g_{i,1}, g_{i,2}, ..., g_{i,n}) represents the occurrences of the ith word in all n documents.
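To make the two matrices concrete, the following minimal sketch (not taken from the thesis or from Yang et al.) shows one way M_D and G could be assembled using scikit-learn's CountVectorizer; the toy documents and all variable names are invented purely for illustration.

```python
# Illustrative sketch: build the document-word matrix M_D from the labeled
# documents and the word-document matrix G from all (labeled + unlabeled) documents.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

labeled_docs = ["cats purr and sleep", "dogs bark loudly"]          # toy stand-in for D_L
unlabeled_docs = ["wolves howl at night", "kittens sleep all day"]  # weakly-related, unlabeled

# One shared vocabulary of size V over all n = l + u documents
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(labeled_docs + unlabeled_docs)    # shape (n, V)

# M_D: columns d_1, ..., d_l are the word-frequency vectors of the labeled documents (V x l)
M_D = counts[:len(labeled_docs)].T.toarray()

# G: row g_i holds the occurrences of the ith word in all n documents (V x n)
G = counts.T.toarray()

print(M_D.shape, G.shape)  # (V, l) and (V, n)
```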

For an SVM formulation, one could now use M_D to build the kernel K = M_D^T M_D for the SVM's dual formulation. However, such a kernel would discard weakly-related documents, i.e., set their similarity to zero. Therefore, Yang et al. augment the kernel with a word-correlation matrix R ∈ ℝ^{V×V}, yielding K = M_D^T R M_D, where the entry R_{ij} represents the correlation between the ith and the jth word. The goal is now to find the optimal R that maximizes the categorization margin. This is achieved by regularizing R according to G and by introducing an internal representation of words W = (w_1, w_2, ..., w_V), where w_i is the internal representation of the ith word, so that the word-correlation matrix can be written as R = W^T W. The dual formulation of the SVM can now be changed to a min-max problem in order to find both the optimal α and the optimal R:

\min_{R \in \Delta,\, U,\, W} \; \max_{\alpha} \;\; \alpha^T e \;-\; \frac{1}{2} (\alpha \circ y)^T \left( M_D^T R M_D \right) (\alpha \circ y) \qquad (3.16)

Equation 3.16 can be solved efficiently using Second-Order Cone Programming (SOCP) [Boyd and Vandenberghe, 2004]. For text categorization, SSLW demonstrated that it can leverage both labeled and weakly-related unlabeled data to reduce the generalization error, and it significantly outperformed self-taught learning as well as state-of-the-art SSL methods such as TSVM [Bennett and Demiriz, 1999] and manifold regularization [Belkin et al., 2006].
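To illustrate how the augmented kernel K = M_D^T R M_D could be used, the following is a minimal numerical sketch on randomly generated toy data; the matrices W and M_D, the labels, and all dimensions are invented, R is kept fixed, and a standard SVM replaces the joint min-max optimization of Equation 3.16 that Yang et al. solve via SOCP.

```python
# Sketch of the kernel augmentation K = M_D^T R M_D with R = W^T W for a *fixed* W.
# This is not the SSLW optimization itself; it only shows the kernel construction.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
V, l, k = 50, 20, 5                                 # vocabulary size, labeled docs, latent dim

M_D = rng.poisson(1.0, size=(V, l)).astype(float)   # toy document-word matrix (V x l)
y = np.array([1, -1] * (l // 2))                    # toy labels in {-1, +1}

W = rng.standard_normal((k, V))                     # internal word representations w_1, ..., w_V
R = W.T @ W                                         # word-correlation matrix, R_ij = w_i^T w_j

K = M_D.T @ R @ M_D                                 # augmented kernel between labeled documents (l x l)

# Train a standard SVM on the precomputed (augmented) kernel
clf = SVC(kernel="precomputed").fit(K, y)
print(clf.dual_coef_.shape)                         # dual coefficients of the support vectors
```

In SSLW itself, W (and hence R) is of course not fixed but optimized jointly with α under the regularization induced by the word-document matrix G.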

EigenTransfer  Self-taught learning and SSLW can both also be regarded as transfer learning problems, albeit ones in which the class labels of the source data are unknown.
