Rise of the machines

[Figure 44.1: Labeled data.]

X_{n+1}, ..., X_N without the corresponding Y values. We thus have labeled data L = {(X_1, Y_1), ..., (X_n, Y_n)} and unlabeled data U = {X_{n+1}, ..., X_N}. How do we use the unlabeled data in addition to the labeled data to improve prediction? This is the problem of semi-supervised inference.

Consider Figure 44.1. The covariate is x = (x_1, x_2) ∈ R^2. The outcome in this case is binary, as indicated by the circles and squares. Finding the decision boundary using only the labeled data is difficult. Figure 44.2 shows the labeled data together with some unlabeled data. We clearly see two clusters. If we make the additional assumption that Pr(Y = 1 | X = x) is smooth relative to the clusters, then we can use the unlabeled data to nail down the decision boundary accurately.

There are copious papers with heuristic methods for taking advantage of unlabeled data. To see how useful these methods might be, consider the following example. We download one million webpages with images of cats and dogs. We randomly select 100 pages and classify them by hand. Semi-supervised methods allow us to use the other 999,900 webpages to construct a good classifier.

But does semi-supervised inference work? Or, to put it another way, under what conditions does it work? In Azizyan et al. (2012), we showed the following (which I state informally here).

Suppose that X_i ∈ R^d. Let S_n denote the set of supervised estimators; these estimators use only the labeled data. Let SS_N denote the set of semi-supervised estimators; these estimators use the labeled data and unlabeled data. Let m be the number of unlabeled data points and suppose that m ≥ n^{2/(2+ξ)} for some 0
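The cluster idea behind Figure 44.2 can be sketched in code. The following is a minimal, hypothetical illustration (not the method of Azizyan et al.): a greedy 1-nearest-neighbor self-training loop that repeatedly assigns the unlabeled point closest to any labeled point the label of that nearest labeled point. Labels then propagate along dense clusters, which is exactly how unlabeled data helps when Pr(Y = 1 | X = x) is smooth relative to the clusters. The toy data below are invented for illustration.

```python
import math

def self_train_1nn(labeled, unlabeled):
    """Greedy 1-NN self-training (illustrative sketch).

    labeled:   list of ((x1, x2), y) pairs
    unlabeled: list of (x1, x2) points
    Repeatedly labels the pool point nearest to any already-labeled
    point, so labels spread along clusters of unlabeled data.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        best = None  # (distance, pool index, label to copy)
        for i, x in enumerate(pool):
            for z, y in labeled:
                d = math.dist(x, z)
                if best is None or d < best[0]:
                    best = (d, i, y)
        _, i, y = best
        labeled.append((pool.pop(i), y))
    return labeled

# Two clusters in R^2, one labeled point per cluster (toy data).
labeled = [((0.0, 0.0), 0), ((5.0, 5.0), 1)]
unlabeled = [(0.5, 0.1), (0.9, 0.4), (4.6, 4.8), (4.2, 4.5)]
result = {x: y for x, y in self_train_1nn(labeled, unlabeled)}
print(result)
```

With only the two labeled points, a classifier has almost nothing to go on; the chain of unlabeled points inside each cluster carries the labels outward, so every point inherits its own cluster's label.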
