Bagging, Boosting and RANSAC - LASA
MACHINE LEARNING - 2013

Bagging, Boosting and RANSAC


MACHINE LEARNING - 2013
Bootstrap Aggregation (Bagging)
• The Main Idea
• Some Examples
• Why it works


MACHINE LEARNING - 2013
The Main Idea: Aggregation
• Imagine we have m sets of n independent observations
  S^(1) = {(X_1,Y_1), ..., (X_n,Y_n)}^(1) , ... , S^(m) = {(X_1,Y_1), ..., (X_n,Y_n)}^(m)
  all taken iid from the same underlying distribution P
• Traditional approach: generate some ϕ(x,S) from all the data samples
• Aggregation: learn ϕ(x,S) by averaging ϕ(x,S^(k)) over many k
(Figure: a single fit ϕ(x,S) vs. the aggregated ϕ(x,S^(k)), before and after)


MACHINE LEARNING - 2013
The Main Idea: Bootstrapping
• Unfortunately, we usually have a single observation set S
• Idea: bootstrap S to form the S^(k) observation sets
• (canonical) Draw samples from S with replacement (duplicates allowed) until a new S^(k) of the same size as S is filled
• (practical) Take a subset of the samples of S (i.e. use a smaller set)
• The samples not used by each set serve as validation samples
(Figure: S resampled into bootstrap sets S^(1), S^(2), S^(3))
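A minimal sketch of the canonical bootstrap described above, assuming the observations sit in NumPy arrays; the helper name make_bootstrap_sets and the use of out-of-bag indices as validation samples follow the slide, but the code itself is illustrative, not from the original deck.

```python
import numpy as np

def make_bootstrap_sets(X, y, m, seed=None):
    """Draw m bootstrap sets S^(k) by sampling n indices with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    sets = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)        # duplicates allowed
        oob = np.setdiff1d(np.arange(n), idx)   # samples not drawn: validation set
        sets.append((X[idx], y[idx], oob))
    return sets
```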


MACHINE LEARNING - 2013
The Main Idea: Bagging
• Generate S^(1), ..., S^(m) by bootstrapping
• Compute each ϕ(x,S^(k)) individually
• Compute ϕ(x,S) = E_k[ϕ(x,S^(k))] by aggregation
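The three steps above in code form, as a sketch only: it assumes scikit-learn style regressors (a DecisionTreeRegressor here, my choice rather than the slides'); bagged_fit bootstraps and fits, bagged_predict aggregates by averaging.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_fit(X, y, m=50, seed=None):
    """Fit one model phi(x, S^(k)) per bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)                  # bootstrap S^(k)
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """Aggregation: phi(x, S) = E_k[ phi(x, S^(k)) ], estimated by the mean."""
    return np.mean([phi.predict(X) for phi in models], axis=0)
```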


MACHINE LEARNING - 2013
A concrete example
• We select some input samples
• We learn a regression model
• Very sensitive to the input selection
• m training sets = m different models
  (x_1^(1), y_1^(1)), ..., (x_n^(1), y_n^(1))  →  f̂^(1)(x) = Y^(1)
  (x_1^(2), y_1^(2)), ..., (x_n^(2), y_n^(2))  →  f̂^(2)(x) = Y^(2)
  ...
  giving f̂^(1), ..., f̂^(m) and predictions Y^(1), ..., Y^(m)


MACHINE LEARNING - 2013
Aggregation: combine several models
• Linear combination of simple models
• More examples = better model
• We can stop when we are satisfied
  Z = (1/m) Σ_{i=1..m} Y^(i)
(Figure: the aggregated model for increasing m, e.g. m = 4, 10, 60)


MACHINE LEARNING - 2013
Proof of convergence
Hypothesis: the average will converge to something meaningful
• Assumptions
  • Y^(1), ..., Y^(m) are iid
  • E(Y) = y (Y is an unbiased estimator of y)
• Expected error of a single model
  E((Y − y)^2) = E((Y − E(Y))^2) = σ^2(Y)
• With aggregation
  Z = (1/m) Σ_{i=1..m} Y^(i)
  E(Z) = (1/m) Σ_{i=1..m} E(Y^(i)) = (1/m) Σ_{i=1..m} y = y
  E((Z − y)^2) = E((Z − E(Z))^2) = σ^2(Z) = σ^2( (1/m) Σ_{i=1..m} Y^(i) )
               = (1/m^2) Σ_{i=1..m} σ^2(Y^(i)) = (1/m^2) · m · σ^2(Y) = (1/m) σ^2(Y)
Infinite observations = zero error: we recover the underlying estimator!
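A quick numerical check of the derivation, assuming only NumPy: each Y^(i) is a noisy, unbiased estimate of y, and the variance of the average Z shrinks by a factor of m (the values in the comments are approximate, from synthetic data).

```python
import numpy as np

rng = np.random.default_rng(0)
y_true, sigma = 3.0, 2.0                  # true value and std of a single Y^(i)
m, trials = 25, 100_000

Y = y_true + sigma * rng.standard_normal((trials, m))   # iid, unbiased Y^(i)
Z = Y.mean(axis=1)                                       # aggregated estimator

print(np.var(Y[:, 0]))   # ~ sigma^2     = 4.0
print(np.var(Z))         # ~ sigma^2 / m = 0.16
```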


MACHINE LEARNING - 2013
In layman's terms
• The expected error (variance) of Y is larger than that of Z
• The variance of Z shrinks with m
(Figure: distributions of Y and Z around the true value y)


MACHINE LEARNING - 2013
Relaxing the assumptions
• We DROP the second assumption
  • Y^(1), ..., Y^(m) are iid (kept)
  • E(Y) = y, i.e. unbiasedness (dropped)
• Decompose the expected error:
  E((Y − y)^2) = E((Y − E(Y) + E(Y) − y)^2) = E(((Y − E(Y)) + (E(Y) − y))^2)
               = E((Y − E(Y))^2) + E((E(Y) − y)^2) + E(2(Y − E(Y))(E(Y) − y))
• The cross term vanishes, since E(Y − E(Y)) = 0
• The first term is σ^2(Y) ≥ 0 (the larger it is, the better for us), so
  E((Y − y)^2) ≥ E((E(Y) − y)^2)
  and, because E(Z) = E(Y) while σ^2(Z) = σ^2(Y)/m,
  E((Y − y)^2) ≥ E((Z − y)^2)
• Using Z gives us a smaller error (even if we can't prove convergence to zero)


MACHINE LEARNING - 2013
Peculiarities
• Instability is good
  • The more variable (unstable) the form of ϕ(x,S) is, the more improvement can potentially be obtained
  • Low-variability methods (e.g. PCA, LDA) improve less than high-variability ones (e.g. LWR, Decision Trees)
• Loads of redundancy
  • Most predictors do roughly "the same thing"


MACHINE LEARNING - 2013
From Bagging to Boosting
• Bagging: each model is trained independently
• Boosting: each model is built on top of the previous ones


MACHINE LEARNING - 2013
Adaptive Boosting (AdaBoost)
• The Main Idea
• The Thousand Flavours of Boost
• Weak Learners and Cascades


MACHINE LEARNING - 2013
The Main Idea: Iterative Approach
• Combine several simple models (weak learners)
• Avoid redundancy
  • Each learner complements the previous ones
  • Keep track of the errors of the previous learners


MACHINE LEARNING - 2013
Weak Learners
• A "simple" classifier that can be generated easily
• As long as it is better than random, we can use it
• Better when tailored to the problem at hand
  • e.g. very fast to evaluate (important for images)


MACHINE LEARNING - 2013
AdaBoost: Initialization
• We choose a weak learner model ϕ(x), e.g. f(x,v) = x·v > θ
• Initialization
  • Generate ϕ_1(x), ..., ϕ_N(x) weak learners
    (N can be in the hundreds of thousands)
  • Assign a weight w_i to each training sample
(Figure: candidate projection directions v_1, ..., v_7 with threshold θ)


MACHINE LEARNING - 2013
AdaBoost: Iterations
• Compute the error e_j for each classifier ϕ_j(x):
  e_j = Σ_{i=1..n} w_i · 1[ϕ_j(x_i) ≠ y_i]
• Select the ϕ_j with the smallest classification error:
  argmin_j Σ_{i=1..n} w_i · 1[ϕ_j(x_i) ≠ y_i]
• Update the weights w_i depending on how the samples are classified by ϕ_j
  (here comes the important part)


MACHINE LEARNING - 2013
Updating the weights
• Evaluate how "well" ϕ_j(x) is performing:
  α = (1/2) ln((1 − e_j) / e_j)
  (how far are we from a perfect classification?)
• Update the weight of each sample:
  w_i^(t+1) = w_i^(t) · exp(+α^(t))  if ϕ_j(x_i) ≠ y_i   (make it bigger)
  w_i^(t+1) = w_i^(t) · exp(−α^(t))  if ϕ_j(x_i) = y_i   (make it smaller)


MACHINE LEARNING - 2013
AdaBoost: Rinse and Repeat
• Recompute the error e_j for each classifier ϕ_j(x) using the updated weights:
  e_j = Σ_{i=1..n} w_i · 1[ϕ_j(x_i) ≠ y_i]
• Select the new ϕ_j with the smallest classification error:
  argmin_j Σ_{i=1..n} w_i · 1[ϕ_j(x_i) ≠ y_i]
• Update the weights w_i
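Putting the last three slides together, a minimal sketch of Discrete AdaBoost over a fixed pool of weak learners; labels are assumed to be in {−1, +1}, and the function names (adaboost, boosted_predict) and the numerical clipping of e_j are my additions, not part of the slides.

```python
import numpy as np

def adaboost(X, y, learners, T):
    """Discrete AdaBoost: pick T learners from a pool and weight them.

    X: (n, d) samples, y: (n,) labels in {-1, +1},
    learners: list of callables h(X) -> predictions in {-1, +1}.
    """
    n = len(X)
    w = np.full(n, 1.0 / n)                      # sample weights w_i
    preds = np.array([h(X) for h in learners])   # precompute h_j(x_i)
    chosen, alphas = [], []
    for _ in range(T):
        errors = (preds != y) @ w                # e_j = sum_i w_i 1[h_j(x_i) != y_i]
        j = int(np.argmin(errors))               # best weak learner this round
        e = float(np.clip(errors[j], 1e-12, 1 - 1e-12))
        alpha = 0.5 * np.log((1.0 - e) / e)
        w *= np.exp(np.where(preds[j] != y, alpha, -alpha))   # re-weight the samples
        w /= w.sum()
        chosen.append(learners[j])
        alphas.append(alpha)
    return chosen, alphas

def boosted_predict(chosen, alphas, X):
    """Sign of the weighted vote of the selected weak learners."""
    return np.sign(sum(a * h(X) for a, h in zip(alphas, chosen)))
```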


MACHINE LEARNING - 2013
Boosting In Action: The Checkerboard Problem


MACHINE LEARNING - 2013
Boosting In Action: Initialization
• We choose a simple weak learner: f(x,v) = x·v > θ
• We generate a thousand random vectors v_1, ..., v_1000 and the corresponding learners f_j(x,v_j)
• For each f_j(x,v_j) we compute a good threshold θ_j (see the sketch below)
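One way to build that pool, sketched under stated assumptions: random-projection stumps f(x,v) = x·v > θ, with θ_j picked among the projected training points so as to minimise the (unweighted) training error. The function name and the threshold-selection rule are my choices, not the slides'.

```python
import numpy as np

def make_projection_stumps(X, y, n_learners=1000, seed=None):
    """Weak learners of the form f(x) = +1 if x.v > theta else -1."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_learners):
        v = rng.standard_normal(X.shape[1])       # random direction v_j
        proj = X @ v
        thresholds = np.unique(proj)
        # keep the threshold with the lowest training error for this direction
        errs = [np.mean(np.where(proj > t, 1, -1) != y) for t in thresholds]
        theta = thresholds[int(np.argmin(errs))]
        stumps.append(lambda Z, v=v, theta=theta: np.where(Z @ v > theta, 1, -1))
    return stumps
```

These stumps plug directly into the adaboost sketch given above, e.g. adaboost(X, y, make_projection_stumps(X, y), T=100) for the checkerboard data.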


MACHINE LEARNING - 2013
Boosting In Action: and we keep going...
• We look for the best weak learner
• We adjust the importance (weight) of the errors
• Rinse and repeat
(Figure: classification accuracy climbing from about 50% towards 100% over the first 10 iterations)


MACHINE LEARNING - 2013
Boosting In Action
(Figure: decision boundaries with 1, 20, 40, 80 and 120 weak learners, and classification accuracy rising from about 50% to nearly 100% as the number of weak learners grows from 1 to 120)


MACHINE LEARNING - 2013
Drawbacks of Boosting
• Overfitting!
  • Boosting will always overfit with many weak learners
• Training time
  • Training a face detector takes up to 2 weeks on modern computers


MACHINE LEARNING - 2013
A thousand different flavors
• A couple of new Boost variants every year
  • Reduce overfitting
  • Increase robustness to noise
  • Tailored to specific problems
• They mainly change two things
  • How the error is represented
  • How the weights are updated
Timeline: 1989 Boosting · 1996 AdaBoost · 1999 Real AdaBoost · 2000 Margin Boost, Modest AdaBoost, Gentle AdaBoost, AnyBoost, LogitBoost · 2001 BrownBoost · 2003 KLBoost, Weight Boost · 2004 FloatBoost, ActiveBoost · 2005 JensenShannonBoost, Infomax Boost · 2006 Emphasis Boost · 2007 Entropy Boost, Reweight Boost · ...


MACHINE LEARNING - 2013
An example
• Instead of counting the errors, we compute the probability of correct classification

Discrete AdaBoost:
  e_j = Σ_{i=1..n} w_i · 1[ϕ_j(x_i) ≠ y_i]
  α = (1/2) ln((1 − e_j) / e_j)
  w_i^(t+1) = w_i^(t) · exp(+α^(t))  if ϕ_j(x_i) ≠ y_i
  w_i^(t+1) = w_i^(t) · exp(−α^(t))  if ϕ_j(x_i) = y_i

Real AdaBoost:
  p_j = Σ_{i=1..n} w_i · P(y_i = 1 | x_i)
  α = (1/2) ln((1 − p_j) / p_j)
  w_i^(t+1) = w_i^(t) · exp(−y_i · α^(t))


MACHINE LEARNING - 2013
A celebrated example: Viola-Jones Haar-like wavelets
• I(x): pixel of image I at position x
• A feature uses 2 rectangles of pixels, one positive (A) and one negative (B):
  f(x) = Σ_{x∈A} I(x) − Σ_{x∈B} I(x)
• The weak classifier thresholds the feature:
  ϕ(x) = 1 if f(x) > 0, −1 otherwise
• Millions of possible classifiers ϕ_1(x), ϕ_2(x), ...
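A sketch of how such a two-rectangle feature is evaluated cheaply with an integral image (the trick Viola-Jones rely on); the specific layout (left half positive, right half negative) and the function names are illustrative assumptions.

```python
import numpy as np

def integral_image(img):
    """ii[i, j] = sum of img[:i, :j]; padded so border lookups stay in range."""
    return np.pad(img.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def rect_sum(ii, top, left, h, w):
    """Sum of the h x w rectangle at (top, left) in four lookups."""
    return (ii[top + h, left + w] - ii[top, left + w]
            - ii[top + h, left] + ii[top, left])

def haar_two_rect(ii, top, left, h, w):
    """f(x) = sum over A (left half) minus sum over B (right half)."""
    return (rect_sum(ii, top, left, h, w // 2)
            - rect_sum(ii, top, left + w // 2, h, w // 2))
```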


MACHINE LEARNING - 2013
Real-Time on HD video


MACHINE LEARNING - 2013
Some simpler examples
• Feature: the distance from a point c
  f(x,c) = (x − c)^T (x − c) > θ
(Figure: accuracy vs. number of weak learners, 1 to 120, for Random Circles and Random Projections)


MACHINE LEARNING - 2013
Some simpler examples
• Feature: being inside a rectangle R
  f(x,R) = 1 if x ∈ R
(Figure: accuracy vs. number of weak learners for Random Circles, Random Projections and Random Rectangles)


MACHINE LEARNING - 2013
Some simpler examples
• Feature: a full-covariance Gaussian
  f(x,μ,Σ) = P(x | μ,Σ)
(Figure: accuracy vs. number of weak learners for Random Gaussians, Random Circles, Random Projections and Random Rectangles)
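The three feature families from these slides written as weak-learner functions, under the same {−1, +1} label convention as the earlier stumps; SciPy's multivariate_normal is used for the Gaussian case, and the thresholding of the Gaussian density is an assumption on my part, as are the function names. The parameters (c, θ, R, μ, Σ) would be drawn at random, as in the slides' "random circles / rectangles / Gaussians".

```python
import numpy as np
from scipy.stats import multivariate_normal

def circle_feature(X, c, theta):
    """f(x, c) = (x - c)^T (x - c) > theta."""
    return np.where(((X - c) ** 2).sum(axis=1) > theta, 1, -1)

def rect_feature(X, lo, hi):
    """f(x, R) = 1 if x lies inside the axis-aligned box R = [lo, hi]."""
    return np.where(np.all((X >= lo) & (X <= hi), axis=1), 1, -1)

def gauss_feature(X, mu, cov, theta):
    """f(x, mu, Sigma) = P(x | mu, Sigma), thresholded to get a classifier."""
    return np.where(multivariate_normal(mu, cov).pdf(X) > theta, 1, -1)
```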


MACHINE LEARNING - 2013
Weak learners don't need to be weak!
• 20 boosted SVMs, each with 5 support vectors and an RBF kernel


MACHINE LEARNING - 2013
Cascades of Weak Classifiers
• Split the classification task into stages
• Each stage has an increasing number of weak classifiers
• Each stage only needs to 'learn' to classify what the previous ones let through


MACHINE LEARNING - 2013
Cascades of Weak Classifiers
Stage 1: 10 weak learners, 90% classification
Stage 2: 100 weak learners, 95% classification
Stage 3: 1000 weak learners, 99.9% classification
A sample moves to the next stage only on a 'yes'; a 'no' rejects it immediately.
1000 samples enter Stage 1 (10'000 tests), 100 samples reach Stage 2 (+10'000 tests), 50 samples reach Stage 3 (+50'000 tests) = 70'000 tests in total, i.e. 70 tests per sample instead of 1110.
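A sketch of evaluating such a cascade, assuming each stage is a callable that returns a boolean 'pass' mask (for instance a boosted classifier from the earlier sketch with a threshold); the cheap early stages reject most samples, so the expensive late stages run on very few.

```python
import numpy as np

def cascade_predict(stages, X):
    """stages: list of callables stage(X) -> boolean mask of samples that pass."""
    alive = np.arange(len(X))          # indices of samples still in the running
    for stage in stages:
        if alive.size == 0:
            break
        passed = stage(X[alive])       # run this stage only on the survivors
        alive = alive[passed]
    accepted = np.zeros(len(X), dtype=bool)
    accepted[alive] = True             # only samples that cleared every stage
    return accepted
```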


MACHINE LEARNING - 2013
Cascades of Weak Classifiers
• Advantages
  • Stage splits can be chosen manually
  • Trade off performance and accuracy in the first stages
• Disadvantages
  • Later stages become difficult to train (few samples get through)
  • A very large number of samples is needed for training
  • Even slower to train than standard boosting


MACHINE LEARNING - 2013
The curious case of RANSAC (RANdom SAmple Consensus)
• Created to withstand hordes of outliers
• Has never really been proven to converge to the optimal solution in any reasonable time
• Extremely effective in practice
• Easy to implement
• Widely used in Computer Vision


MACHINE LEARNING - 2013
RANSAC in practice
• Select a random subset of samples
• Estimate the model on these samples
• Sift through all other samples
  • If a sample is close to the model, add it to the consensus set
• If the consensus set is big enough, keep it
• Repeat from the top
Keep the best consensus (e.g. most samples, least error, etc.)
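The loop above written out for the simplest case, fitting a 2-D line y = a·x + b, as a sketch; the least-squares model, the inlier tolerance, the minimum consensus size and the iteration count are illustrative choices, not prescribed by the slide.

```python
import numpy as np

def ransac_line(x, y, n_iters=200, inlier_tol=1.0, min_consensus=20, seed=None):
    """Keep the consensus set with the most inliers and the model refit on it."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.array([], dtype=int)
    for _ in range(n_iters):
        idx = rng.choice(len(x), size=2, replace=False)        # random subset
        a, b = np.polyfit(x[idx], y[idx], deg=1)               # estimate the model
        residuals = np.abs(y - (a * x + b))                    # sift the other samples
        inliers = np.flatnonzero(residuals < inlier_tol)       # the consensus set
        if inliers.size >= min_consensus and inliers.size > best_inliers.size:
            best_model = np.polyfit(x[inliers], y[inliers], deg=1)  # keep the best
            best_inliers = inliers
    return best_model, best_inliers
```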


MACHINE LEARNING - 2013
Some examples of RANSAC
(Figure: reference panoramic image and input video)
B. Noris and A. Billard. "Aggregation of Asynchronous Eye-Tracking Streams from Sets of Uncalibrated Panoramic Images" (2011)


MACHINE LEARNING - 2013
Some examples of RANSAC
Scaramuzza et al. (ETHZ): input image from a 360° camera → compute visual features (KLT) → RANSAC excludes outliers (bad features) → compute visual odometry
Scaramuzza et al. "Real-Time Monocular Visual Odometry for On-Road Vehicles with 1-Point RANSAC", ICRA 2009


MACHINE LEARNING - 2013
RANSAC vs Bagging
• RANSAC: select a random subset of samples; keep the best; needs to be lucky; very light to compute
• Bagging: train many individual models; keep them all; proven to converge; heavy to compute


MACHINE LEARNING - 2013
Summing up
• Bagging: linear combination of multiple learners
  + Very robust to noise
  − A lot of redundant effort
• Boosting: weighted combination of arbitrary learners
  + Very strong learner from very simple ones
  − Sensitive to noise (at least Discrete AdaBoost)
• RANSAC: iterative evaluation of random learners
  + Very robust against outliers, simple to implement
  − Not ensured to converge (although in practice it does)
