Boosting - ArrestedComputing

Boosting
Ray Buse, Theory Lunch, 3.19.2009

The Big Idea…

• Construct a “Strong” learner from a set of “Weak” ones

[Figure: a features vector feeds several weighted weak classifiers, which combine into a strong classifier]

The Big Questions…

• How can we combine the learners?
• When does it work?
• What are the costs?
• Why do I care?
• What’s 7*6?
• Who is this?

Supervised Learning (Example)

• Training: Here are X movies I like, and Y movies I don’t like. (Alternatively, here are my ratings for X+Y movies.)
• Input: Here are Z movies I haven’t seen.
• Output:
  – Binary (or Categorical): These are the movies you would like
  – Numerical: This is how much you would like each movie
  – Ranking: Here are the movies in order from best to worst

Supervised Learning (Goals)

• In principle we would like our classifier function to be:
  – Accurate (correctly classify the data)
  – Fast (training and evaluation)
  – General (don’t want to over-fit the training data)
  – Simple (so that we can “understand” it)
• … but typically there are tradeoffs

Applications

• Detecting credit card fraud
• Stock market prediction
• Speech and handwriting recognition
• Medical diagnosis
• Market basket analysis
  – Movie preference prediction

NetFlix Example

NetFlix Prize

Leader Board

Progress

NetFlix & Boosting

• Netflix looks like a good candidate for attempting boosting:
  – No single method has met the threshold (yet)
  – There may be many rules that help solve different (independent) aspects of the problem
  – $1M prize … so might as well try
• So how does boosting work? …

Adaptive Boosting (AdaBoost)

• Introduced by Schapire and Freund in 1996
• The idea is that we run a weak learning algorithm several times, each time on a different distribution of instances, to generate several different hypotheses
• Future weak learners focus more on the examples that previous weak learners misclassified
• The classifiers could be entirely different techniques (neural networks, Bayesian, decision trees …); for now let’s assume we just know how to do linear classification (e.g., SVM) …
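The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Schapire and Freund’s exact formulation: the weak learner here is an exhaustive decision stump rather than the linear classifier the slide assumes, and the names `train_adaboost`/`predict` are chosen for this sketch.

```python
import numpy as np

def train_adaboost(X, y, rounds=10):
    """Minimal AdaBoost with decision stumps; labels y are in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)            # instance weights, initially uniform
    ensemble = []                       # list of (alpha, feature, threshold, sign)
    for _ in range(rounds):
        best = None
        # weak learner: exhaustively pick the best single-feature threshold
        for j in range(X.shape[1]):
            for thresh in np.unique(X[:, j]):
                for sign in (+1, -1):
                    pred = np.where(X[:, j] <= thresh, sign, -sign)
                    err = np.sum(D[pred != y])   # weighted training error
                    if best is None or err < best[0]:
                        best = (err, j, thresh, sign, pred)
        err, j, thresh, sign, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)    # this weak learner's vote weight
        # re-weight: misclassified points gain weight, correct ones lose it
        D *= np.exp(-alpha * y * pred)
        D /= D.sum()
        ensemble.append((alpha, j, thresh, sign))
    return ensemble

def predict(ensemble, X):
    """Strong classifier: sign of the weighted sum of weak classifiers."""
    F = np.zeros(len(X))
    for alpha, j, thresh, sign in ensemble:
        F += alpha * np.where(X[:, j] <= thresh, sign, -sign)
    return np.sign(F)
```

On a tiny 1-D dataset that no single stump can separate (e.g. labels −1, +1, +1, −1 at x = 0, 1, 2, 3), a few rounds of this loop drive the training error to zero, which is exactly the point of the toy example that follows.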

A Toy Example

Toy Example

• Each data point has a class label, y_t = +1 or −1, and a weight, D_t = 1
• [Figure: candidate linear separators; this one seems to be the best]
• This is a “weak classifier”: it performs slightly better than chance.

Toy Example (repeated each round)

• Each data point has a class label, y_t = +1 or −1
• We update the weights
• We set a new problem for which the previous weak classifier performs at chance again
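The update on this slide is the standard AdaBoost re-weighting rule, D ← D·exp(−α·y·h(x)) followed by normalization. A small sketch (the helper name `reweight` and the toy numbers are illustrative, not from the slide):

```python
import numpy as np

def reweight(D, y, pred, alpha):
    """One AdaBoost weight update: up-weight mistakes, down-weight correct points."""
    D_new = D * np.exp(-alpha * y * pred)   # y*pred is +1 if correct, -1 if wrong
    return D_new / D_new.sum()              # normalize so the weights sum to 1

# After the update, the previous weak classifier's weighted error is exactly 1/2:
D = np.array([0.25, 0.25, 0.25, 0.25])
y = np.array([-1, 1, 1, -1])
pred = np.array([-1, 1, 1, 1])              # one mistake (the last point)
eps = np.sum(D[pred != y])                  # weighted error = 0.25
alpha = 0.5 * np.log((1 - eps) / eps)
D2 = reweight(D, y, pred, alpha)
print(np.sum(D2[pred != y]))                # ≈ 0.5: the old classifier is now at chance
```

This is what “performs at chance again” means: under the new distribution D2, the round’s classifier carries no remaining information, so the next weak learner is forced to find something new.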

Toy Example

• The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f_1, f_2, f_3, f_4.

Choice of α

• Schapire and Singer proved that the training error is bounded by

    ∏_t Z_t,  where  Z_t = Σ_i D_t(i) exp(−α_t y_i h_t(x_i))

• which is minimized when

    α_t = (1/2) ln((1 + r_t) / (1 − r_t)),  with  r_t = Σ_i D_t(i) y_i h_t(x_i)
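For a binary weak classifier with weighted error ε_t, the normalizer reduces to Z_t = (1 − ε_t)e^(−α) + ε_t e^(α), and the minimizing α has the equivalent closed form (1/2)ln((1 − ε_t)/ε_t). A quick numeric check of that minimum (the grid search is only for illustration):

```python
import numpy as np

def Z(alpha, eps):
    """Normalizer Z_t for a +/-1 weak classifier with weighted error eps."""
    return (1 - eps) * np.exp(-alpha) + eps * np.exp(alpha)

eps = 0.2
alphas = np.linspace(0.01, 2.0, 1000)
best = alphas[np.argmin(Z(alphas, eps))]       # alpha minimizing Z on the grid
closed_form = 0.5 * np.log((1 - eps) / eps)    # the closed-form minimizer
print(best, closed_form)                       # both ≈ 0.69 for eps = 0.2
```

At the optimum, Z_t = 2·sqrt(ε_t(1 − ε_t)) < 1 whenever ε_t < 1/2, which is why “slightly better than chance” is enough to drive the training-error bound down geometrically.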

Real Data

• Test error keeps dropping even after training error goes to zero!
• Test error doesn’t increase even after 1000 rounds

[Figure: training and test error curves vs. number of rounds, from “The Boosting Approach to Machine Learning” by Robert E. Schapire]

Key Idea: An Explanation by Margin

• Training error only measures whether classifications are right or wrong
• Should also consider the “confidence” of classifications
• Define the margin of example (x, y) to be y ∙ F(x)
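Concretely, with F(x) = Σ_t α_t h_t(x), the margin y·F(x) is positive when the ensemble is right and its magnitude measures how decisively the vote went; dividing by Σ_t α_t scales it into [−1, 1]. The ensemble values below are made up for illustration:

```python
import numpy as np

# A toy ensemble of three weak classifiers with vote weights alpha_t
alphas = np.array([0.55, 0.80, 0.69])
H = np.array([[-1,  1, -1],     # rows: examples; columns: h_1(x), h_2(x), h_3(x)
              [ 1,  1, -1],
              [ 1, -1, -1]])
y = np.array([-1, 1, -1])       # true labels

F = H @ alphas                  # weighted vote F(x) for each example
margins = y * F / alphas.sum()  # normalized margins in [-1, 1]
print(margins)                  # all positive: every example is classified correctly
```

Even though all three examples are correct (zero training error), their margins differ, and boosting’s continued progress after round zero error is explained as it pushing these margins further from 0.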

Margin Distribution

• Increasing margins imply higher confidence

Strengths of AdaBoost

• It has no parameters to tune (except for the number of rounds)
• It is fast, simple, and easy to program (??)
• It comes with a set of theoretical guarantees (e.g., on training error and test error)
• It can identify outliers: i.e., examples that are either mislabeled or that are inherently ambiguous and hard to categorize.

Weaknesses of AdaBoost

• The actual performance of boosting depends on the data and the base learner.
• Boosting seems to be especially susceptible to noise (misclassified samples).
• When the number of outliers is very large, the emphasis placed on the hard examples can hurt performance. (Variants such as “Gentle AdaBoost” and “BrownBoost” address this.)

RankBoost

• Movie ranking problem: recommend movies to a user based on:
  1) A ranked list of movies the user has seen and enjoyed
  2) A set of ranked lists of movies other users have seen and enjoyed
• Only the relative order of the rankings matters, not the absolute ratings

How it works

• Just like AdaBoost, except…
  – D (the weight distribution) is over all pairwise combinations of movies (yielding a ranking)
  – Users form the “features” (e.g., our model of the user in question is a weighted sum of other users)
• Minimize disagreement: the fraction of distinct pairs of movies (in the test set) that H mis-orders
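The disagreement measure above can be computed directly: enumerate all distinct pairs from the user’s true ordering and count how many the learned ranker H puts in the wrong order. The movie titles and scores below are made up for illustration, not Netflix data:

```python
from itertools import combinations

def disagreement(true_rank, scores):
    """Fraction of distinct pairs whose relative order the ranker gets wrong.

    true_rank: list of items, best first.
    scores: dict item -> score from the learned ranker H (higher = better).
    """
    pairs = list(combinations(true_rank, 2))   # (a, b) with a truly preferred to b
    wrong = sum(1 for a, b in pairs if scores[a] <= scores[b])
    return wrong / len(pairs)

# Hypothetical example: a user's true order vs. scores from a learned ranker
truth = ["Alien", "Heat", "Up", "Cars"]
scores = {"Alien": 0.9, "Heat": 0.4, "Up": 0.6, "Cars": 0.1}
print(disagreement(truth, scores))   # 1 of 6 pairs mis-ordered
```

RankBoost’s training loop then re-weights exactly these mis-ordered pairs, the same way AdaBoost re-weights misclassified examples.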

Formally…

[Figure: chart labeled “# of features”]


Conclusions

• Big idea: weak learners really can be combined into a strong learner
• Instead of trying to design a learning algorithm that is accurate over the entire space, we can focus on finding base learning algorithms that only need to be better than random.
• AdaBoost is a systematic way to achieve this
• RankBoost applies the same technique to ranked orderings of objects

Questions / Comments?