Bagging, Boosting and RANSAC - LASA
MACHINE LEARNING - 2013

Bagging, Boosting and RANSAC


MACHINE LEARNING - 2013
Bootstrap Aggregation (Bagging)
• The Main Idea
• Some Examples
• Why it works


MACHINE LEARNING - 2013
The Main Idea: Aggregation
• Imagine we have m sets of n independent observations
  S^(1) = {(X_1,Y_1), ..., (X_n,Y_n)}^(1) , ... , S^(m) = {(X_1,Y_1), ..., (X_n,Y_n)}^(m)
  all taken iid from the same underlying distribution P
• Traditional approach: generate some ϕ(x,S) from all the data samples
• Aggregation: learn ϕ(x,S) by averaging ϕ(x,S^(k)) over many k
(Figure: a single fit ϕ(x,S) vs. the aggregated ϕ(x,S^(k)), before and after)


MACHINE LEARNING - 2013
The Main Idea: Bootstrapping
• Unfortunately, we usually have a single observation set S
• Idea: bootstrap S to form the S^(k) observation sets
• (canonical) Draw samples from S with replacement (duplicates allowed) until a new S^(k) of the same size as S is filled
• (practical) Take a subset of the samples of S (i.e. use a smaller set)
• The samples not used by each set serve as validation samples
(Figure: S resampled into bootstrap sets S^(1), S^(2), S^(3))
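A minimal sketch of the canonical bootstrap described above, assuming the observations sit in NumPy arrays; the helper name make_bootstrap_sets and the use of out-of-bag indices as validation samples follow the slide, but the code itself is illustrative, not from the original deck.

```python
import numpy as np

def make_bootstrap_sets(X, y, m, seed=None):
    """Draw m bootstrap sets S^(k) by sampling n indices with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    sets = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)        # duplicates allowed
        oob = np.setdiff1d(np.arange(n), idx)   # samples not drawn: validation set
        sets.append((X[idx], y[idx], oob))
    return sets
```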


MACHINE LEARNING - 2013
The Main Idea: Bagging
• Generate S^(1), ..., S^(m) by bootstrapping
• Compute each ϕ(x,S^(k)) individually
• Compute ϕ(x,S) = E_k[ϕ(x,S^(k))] by aggregation
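The three steps above in code form, as a sketch only: it assumes scikit-learn style regressors (a DecisionTreeRegressor here, my choice rather than the slides'); bagged_fit bootstraps and fits, bagged_predict aggregates by averaging.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_fit(X, y, m=50, seed=None):
    """Fit one model phi(x, S^(k)) per bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)                  # bootstrap S^(k)
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """Aggregation: phi(x, S) = E_k[ phi(x, S^(k)) ], estimated by the mean."""
    return np.mean([phi.predict(X) for phi in models], axis=0)
```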


MACHINE LEARNING - 2013
A concrete example
• We select some input samples
• We learn a regression model
• Very sensitive to the input selection
• m training sets = m different models
  (x_1^(1), y_1^(1)), ..., (x_n^(1), y_n^(1))  →  f̂^(1)(x) = Y^(1)
  (x_1^(2), y_1^(2)), ..., (x_n^(2), y_n^(2))  →  f̂^(2)(x) = Y^(2)
  ...
  giving f̂^(1), ..., f̂^(m) and predictions Y^(1), ..., Y^(m)


MACHINE LEARNING - 2013
Aggregation: combine several models
• Linear combination of simple models
• More examples = better model
• We can stop when we are satisfied
  Z = (1/m) Σ_{i=1..m} Y^(i)
(Figure: the aggregated model for increasing m, e.g. m = 4, 10, 60)


MACHINE LEARNING - 2013
Proof of convergence
Hypothesis: the average will converge to something meaningful
• Assumptions
  • Y^(1), ..., Y^(m) are iid
  • E(Y) = y (Y is an unbiased estimator of y)
• Expected error of a single model
  E((Y − y)^2) = E((Y − E(Y))^2) = σ^2(Y)
• With aggregation
  Z = (1/m) Σ_{i=1..m} Y^(i)
  E(Z) = (1/m) Σ_{i=1..m} E(Y^(i)) = (1/m) Σ_{i=1..m} y = y
  E((Z − y)^2) = E((Z − E(Z))^2) = σ^2(Z) = σ^2( (1/m) Σ_{i=1..m} Y^(i) )
               = (1/m^2) Σ_{i=1..m} σ^2(Y^(i)) = (1/m^2) · m · σ^2(Y) = (1/m) σ^2(Y)
Infinite observations = zero error: we recover the underlying estimator!
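A quick numerical check of the derivation, assuming only NumPy: each Y^(i) is a noisy, unbiased estimate of y, and the variance of the average Z shrinks by a factor of m (the values in the comments are approximate, from synthetic data).

```python
import numpy as np

rng = np.random.default_rng(0)
y_true, sigma = 3.0, 2.0                  # true value and std of a single Y^(i)
m, trials = 25, 100_000

Y = y_true + sigma * rng.standard_normal((trials, m))   # iid, unbiased Y^(i)
Z = Y.mean(axis=1)                                       # aggregated estimator

print(np.var(Y[:, 0]))   # ~ sigma^2     = 4.0
print(np.var(Z))         # ~ sigma^2 / m = 0.16
```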


MACHINE LEARNING - 2013
In layman's terms
• The expected error (variance) of Y is larger than that of Z
• The variance of Z shrinks with m
(Figure: distributions of Y and Z around the true value y)


MACHINE LEARNING - 2013
Relaxing the assumptions
• We DROP the second assumption
  • Y^(1), ..., Y^(m) are iid (kept)
  • E(Y) = y, i.e. unbiasedness (dropped)
• Decompose the expected error:
  E((Y − y)^2) = E((Y − E(Y) + E(Y) − y)^2) = E(((Y − E(Y)) + (E(Y) − y))^2)
               = E((Y − E(Y))^2) + E((E(Y) − y)^2) + E(2(Y − E(Y))(E(Y) − y))
• The cross term vanishes, since E(Y − E(Y)) = 0
• The first term is σ^2(Y) ≥ 0 (the larger it is, the better for us), so
  E((Y − y)^2) ≥ E((E(Y) − y)^2)
  and, because E(Z) = E(Y) while σ^2(Z) = σ^2(Y)/m,
  E((Y − y)^2) ≥ E((Z − y)^2)
• Using Z gives us a smaller error (even if we can't prove convergence to zero)


MACHINE LEARNING - 2013
Peculiarities
• Instability is good
  • The more variable (unstable) the form of ϕ(x,S) is, the more improvement can potentially be obtained
  • Low-variability methods (e.g. PCA, LDA) improve less than high-variability ones (e.g. LWR, Decision Trees)
• Loads of redundancy
  • Most predictors do roughly "the same thing"


MACHINE LEARNING - 2013
From Bagging to Boosting
• Bagging: each model is trained independently
• Boosting: each model is built on top of the previous ones


MACHINE LEARNING - 2013
Adaptive Boosting (AdaBoost)
• The Main Idea
• The Thousand Flavours of Boost
• Weak Learners and Cascades


MACHINE LEARNING - 2013
The Main Idea: Iterative Approach
• Combine several simple models (weak learners)
• Avoid redundancy
  • Each learner complements the previous ones
  • Keep track of the errors of the previous learners


MACHINE LEARNING - 2013
Weak Learners
• A "simple" classifier that can be generated easily
• As long as it is better than random, we can use it
• Better when tailored to the problem at hand
  • e.g. very fast to evaluate (important for images)


MACHINE LEARNING - 2013
AdaBoost: Initialization
• We choose a weak learner model ϕ(x), e.g. f(x,v) = x·v > θ
• Initialization
  • Generate ϕ_1(x), ..., ϕ_N(x) weak learners
    (N can be in the hundreds of thousands)
  • Assign a weight w_i to each training sample
(Figure: candidate projection directions v_1, ..., v_7 with threshold θ)


MACHINE LEARNING - 2013
AdaBoost: Iterations
• Compute the error e_j for each classifier ϕ_j(x):
  e_j = Σ_{i=1..n} w_i · 1[ϕ_j(x_i) ≠ y_i]
• Select the ϕ_j with the smallest classification error:
  argmin_j Σ_{i=1..n} w_i · 1[ϕ_j(x_i) ≠ y_i]
• Update the weights w_i depending on how the samples are classified by ϕ_j
  (here comes the important part)


MACHINE LEARNING - 2013
Updating the weights
• Evaluate how "well" ϕ_j(x) is performing:
  α = (1/2) ln((1 − e_j) / e_j)
  (how far are we from a perfect classification?)
• Update the weight of each sample:
  w_i^(t+1) = w_i^(t) · exp(+α^(t))  if ϕ_j(x_i) ≠ y_i   (make it bigger)
  w_i^(t+1) = w_i^(t) · exp(−α^(t))  if ϕ_j(x_i) = y_i   (make it smaller)


MACHINE LEARNING - 2013
AdaBoost: Rinse and Repeat
• Recompute the error e_j for each classifier ϕ_j(x) using the updated weights:
  e_j = Σ_{i=1..n} w_i · 1[ϕ_j(x_i) ≠ y_i]
• Select the new ϕ_j with the smallest classification error:
  argmin_j Σ_{i=1..n} w_i · 1[ϕ_j(x_i) ≠ y_i]
• Update the weights w_i
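Putting the last three slides together, a minimal sketch of Discrete AdaBoost over a fixed pool of weak learners; labels are assumed to be in {−1, +1}, and the function names (adaboost, boosted_predict) and the numerical clipping of e_j are my additions, not part of the slides.

```python
import numpy as np

def adaboost(X, y, learners, T):
    """Discrete AdaBoost: pick T learners from a pool and weight them.

    X: (n, d) samples, y: (n,) labels in {-1, +1},
    learners: list of callables h(X) -> predictions in {-1, +1}.
    """
    n = len(X)
    w = np.full(n, 1.0 / n)                      # sample weights w_i
    preds = np.array([h(X) for h in learners])   # precompute h_j(x_i)
    chosen, alphas = [], []
    for _ in range(T):
        errors = (preds != y) @ w                # e_j = sum_i w_i 1[h_j(x_i) != y_i]
        j = int(np.argmin(errors))               # best weak learner this round
        e = float(np.clip(errors[j], 1e-12, 1 - 1e-12))
        alpha = 0.5 * np.log((1.0 - e) / e)
        w *= np.exp(np.where(preds[j] != y, alpha, -alpha))   # re-weight the samples
        w /= w.sum()
        chosen.append(learners[j])
        alphas.append(alpha)
    return chosen, alphas

def boosted_predict(chosen, alphas, X):
    """Sign of the weighted vote of the selected weak learners."""
    return np.sign(sum(a * h(X) for a, h in zip(alphas, chosen)))
```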


MACHINE LEARNING - 2013
Boosting In Action: The Checkerboard Problem


MACHINE LEARNING - 2013
Boosting In Action: Initialization
• We choose a simple weak learner: f(x,v) = x·v > θ
• We generate a thousand random vectors v_1, ..., v_1000 and the corresponding learners f_j(x,v_j)
• For each f_j(x,v_j) we compute a good threshold θ_j (see the sketch below)
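One way to build that pool, sketched under stated assumptions: random-projection stumps f(x,v) = x·v > θ, with θ_j picked among the projected training points so as to minimise the (unweighted) training error. The function name and the threshold-selection rule are my choices, not the slides'.

```python
import numpy as np

def make_projection_stumps(X, y, n_learners=1000, seed=None):
    """Weak learners of the form f(x) = +1 if x.v > theta else -1."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_learners):
        v = rng.standard_normal(X.shape[1])       # random direction v_j
        proj = X @ v
        thresholds = np.unique(proj)
        # keep the threshold with the lowest training error for this direction
        errs = [np.mean(np.where(proj > t, 1, -1) != y) for t in thresholds]
        theta = thresholds[int(np.argmin(errs))]
        stumps.append(lambda Z, v=v, theta=theta: np.where(Z @ v > theta, 1, -1))
    return stumps
```

These stumps plug directly into the adaboost sketch given above, e.g. adaboost(X, y, make_projection_stumps(X, y), T=100) for the checkerboard data.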


MACHINE LEARNING - 2013
Boosting In Action: and we keep going...
• We look for the best weak learner
• We adjust the importance (weight) of the errors
• Rinse and repeat
(Figure: classification accuracy climbing from about 50% towards 100% over the first 10 iterations)


MACHINE LEARNING - 2013
Boosting In Action
(Figure: decision boundaries with 1, 20, 40, 80 and 120 weak learners, and classification accuracy rising from about 50% to nearly 100% as the number of weak learners grows from 1 to 120)


MACHINE LEARNING - 2013
Drawbacks of Boosting
• Overfitting!
  • Boosting will always overfit with many weak learners
• Training time
  • Training a face detector takes up to 2 weeks on modern computers


MACHINE LEARNING - 2013
A thousand different flavors
• A couple of new Boost variants every year
  • Reduce overfitting
  • Increase robustness to noise
  • Tailored to specific problems
• They mainly change two things
  • How the error is represented
  • How the weights are updated
Timeline: 1989 Boosting · 1996 AdaBoost · 1999 Real AdaBoost · 2000 Margin Boost, Modest AdaBoost, Gentle AdaBoost, AnyBoost, LogitBoost · 2001 BrownBoost · 2003 KLBoost, Weight Boost · 2004 FloatBoost, ActiveBoost · 2005 JensenShannonBoost, Infomax Boost · 2006 Emphasis Boost · 2007 Entropy Boost, Reweight Boost · ...


MACHINE LEARNING - 2013
An example
• Instead of counting the errors, we compute the probability of correct classification

Discrete AdaBoost:
  e_j = Σ_{i=1..n} w_i · 1[ϕ_j(x_i) ≠ y_i]
  α = (1/2) ln((1 − e_j) / e_j)
  w_i^(t+1) = w_i^(t) · exp(+α^(t))  if ϕ_j(x_i) ≠ y_i
  w_i^(t+1) = w_i^(t) · exp(−α^(t))  if ϕ_j(x_i) = y_i

Real AdaBoost:
  p_j = Σ_{i=1..n} w_i · P(y_i = 1 | x_i)
  α = (1/2) ln((1 − p_j) / p_j)
  w_i^(t+1) = w_i^(t) · exp(−y_i · α^(t))


MACHINE LEARNING - 2013
A celebrated example: Viola-Jones Haar-like wavelets
• I(x): pixel of image I at position x
• A feature uses 2 rectangles of pixels, one positive (A) and one negative (B):
  f(x) = Σ_{x∈A} I(x) − Σ_{x∈B} I(x)
• The weak classifier thresholds the feature:
  ϕ(x) = 1 if f(x) > 0, −1 otherwise
• Millions of possible classifiers ϕ_1(x), ϕ_2(x), ...
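A sketch of how such a two-rectangle feature is evaluated cheaply with an integral image (the trick Viola-Jones rely on); the specific layout (left half positive, right half negative) and the function names are illustrative assumptions.

```python
import numpy as np

def integral_image(img):
    """ii[i, j] = sum of img[:i, :j]; padded so border lookups stay in range."""
    return np.pad(img.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def rect_sum(ii, top, left, h, w):
    """Sum of the h x w rectangle at (top, left) in four lookups."""
    return (ii[top + h, left + w] - ii[top, left + w]
            - ii[top + h, left] + ii[top, left])

def haar_two_rect(ii, top, left, h, w):
    """f(x) = sum over A (left half) minus sum over B (right half)."""
    return (rect_sum(ii, top, left, h, w // 2)
            - rect_sum(ii, top, left + w // 2, h, w // 2))
```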


MACHINE LEARNING - 2013
Real-Time on HD video


MACHINE LEARNING - 2013
Some simpler examples
• Feature: the distance from a point c
  f(x,c) = (x − c)^T (x − c) > θ
(Figure: accuracy vs. number of weak learners, 1 to 120, for Random Circles and Random Projections)


MACHINE LEARNING - 2013
Some simpler examples
• Feature: being inside a rectangle R
  f(x,R) = 1 if x ∈ R
(Figure: accuracy vs. number of weak learners for Random Circles, Random Projections and Random Rectangles)


MACHINE LEARNING - 2013
Some simpler examples
• Feature: a full-covariance Gaussian
  f(x,μ,Σ) = P(x | μ,Σ)
(Figure: accuracy vs. number of weak learners for Random Gaussians, Random Circles, Random Projections and Random Rectangles)
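The three feature families from these slides written as weak-learner functions, under the same {−1, +1} label convention as the earlier stumps; SciPy's multivariate_normal is used for the Gaussian case, and the thresholding of the Gaussian density is an assumption on my part, as are the function names. The parameters (c, θ, R, μ, Σ) would be drawn at random, as in the slides' "random circles / rectangles / Gaussians".

```python
import numpy as np
from scipy.stats import multivariate_normal

def circle_feature(X, c, theta):
    """f(x, c) = (x - c)^T (x - c) > theta."""
    return np.where(((X - c) ** 2).sum(axis=1) > theta, 1, -1)

def rect_feature(X, lo, hi):
    """f(x, R) = 1 if x lies inside the axis-aligned box R = [lo, hi]."""
    return np.where(np.all((X >= lo) & (X <= hi), axis=1), 1, -1)

def gauss_feature(X, mu, cov, theta):
    """f(x, mu, Sigma) = P(x | mu, Sigma), thresholded to get a classifier."""
    return np.where(multivariate_normal(mu, cov).pdf(X) > theta, 1, -1)
```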


MACHINE LEARNING - 2013
Weak learners don't need to be weak!
• 20 boosted SVMs, each with 5 support vectors and an RBF kernel


MACHINE LEARNING - 2013
Cascades of Weak Classifiers
• Split the classification task into stages
• Each stage has an increasing number of weak classifiers
• Each stage only needs to 'learn' to classify what the previous ones let through


MACHINE LEARNING - 2013
Cascades of Weak Classifiers
Stage 1: 10 weak learners, 90% classification
Stage 2: 100 weak learners, 95% classification
Stage 3: 1000 weak learners, 99.9% classification
A sample moves to the next stage only on a 'yes'; a 'no' rejects it immediately.
1000 samples enter Stage 1 (10'000 tests), 100 samples reach Stage 2 (+10'000 tests), 50 samples reach Stage 3 (+50'000 tests) = 70'000 tests in total, i.e. 70 tests per sample instead of 1110.
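A sketch of evaluating such a cascade, assuming each stage is a callable that returns a boolean 'pass' mask (for instance a boosted classifier from the earlier sketch with a threshold); the cheap early stages reject most samples, so the expensive late stages run on very few.

```python
import numpy as np

def cascade_predict(stages, X):
    """stages: list of callables stage(X) -> boolean mask of samples that pass."""
    alive = np.arange(len(X))          # indices of samples still in the running
    for stage in stages:
        if alive.size == 0:
            break
        passed = stage(X[alive])       # run this stage only on the survivors
        alive = alive[passed]
    accepted = np.zeros(len(X), dtype=bool)
    accepted[alive] = True             # only samples that cleared every stage
    return accepted
```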


MACHINE LEARNING - 2013
Cascades of Weak Classifiers
• Advantages
  • Stage splits can be chosen manually
  • Trade off performance and accuracy in the first stages
• Disadvantages
  • Later stages become difficult to train (few samples get through)
  • A very large number of samples is needed for training
  • Even slower to train than standard boosting


MACHINE LEARNING - 2013
The curious case of RANSAC (RANdom SAmple Consensus)
• Created to withstand hordes of outliers
• Has never really been proven to converge to the optimal solution in any reasonable time
• Extremely effective in practice
• Easy to implement
• Widely used in Computer Vision


MACHINE LEARNING - 2013
RANSAC in practice
• Select a random subset of samples
• Estimate the model on these samples
• Sift through all other samples
  • If a sample is close to the model, add it to the consensus set
• If the consensus set is big enough, keep it
• Repeat from the top
Keep the best consensus (e.g. most samples, least error, etc.)
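The loop above written out for the simplest case, fitting a 2-D line y = a·x + b, as a sketch; the least-squares model, the inlier tolerance, the minimum consensus size and the iteration count are illustrative choices, not prescribed by the slide.

```python
import numpy as np

def ransac_line(x, y, n_iters=200, inlier_tol=1.0, min_consensus=20, seed=None):
    """Keep the consensus set with the most inliers and the model refit on it."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.array([], dtype=int)
    for _ in range(n_iters):
        idx = rng.choice(len(x), size=2, replace=False)        # random subset
        a, b = np.polyfit(x[idx], y[idx], deg=1)               # estimate the model
        residuals = np.abs(y - (a * x + b))                    # sift the other samples
        inliers = np.flatnonzero(residuals < inlier_tol)       # the consensus set
        if inliers.size >= min_consensus and inliers.size > best_inliers.size:
            best_model = np.polyfit(x[inliers], y[inliers], deg=1)  # keep the best
            best_inliers = inliers
    return best_model, best_inliers
```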


MACHINE LEARNING - 2013
Some examples of RANSAC
(Figure: reference panoramic image and input video)
B. Noris and A. Billard. "Aggregation of Asynchronous Eye-Tracking Streams from Sets of Uncalibrated Panoramic Images" (2011)


MACHINE LEARNING - 2013
Some examples of RANSAC
Scaramuzza et al. (ETHZ): input image from a 360° camera → compute visual features (KLT) → RANSAC excludes outliers (bad features) → compute visual odometry
Scaramuzza et al. "Real-Time Monocular Visual Odometry for On-Road Vehicles with 1-Point RANSAC", ICRA 2009


MACHINE LEARNING - 2013
RANSAC vs Bagging
• RANSAC: select a random subset of samples; keep the best; needs to be lucky; very light to compute
• Bagging: train many individual models; keep them all; proven to converge; heavy to compute


MACHINE LEARNING - 2013
Summing up
• Bagging: linear combination of multiple learners
  + Very robust to noise
  − A lot of redundant effort
• Boosting: weighted combination of arbitrary learners
  + Very strong learner from very simple ones
  − Sensitive to noise (at least Discrete AdaBoost)
• RANSAC: iterative evaluation of random learners
  + Very robust against outliers, simple to implement
  − Not ensured to converge (although in practice it does)
