Advanced Topics in Machine Learning - Caltech

Advanced Topics in Machine Learning, Lecture 1: Introduction
CS/CNS/EE 253, Andreas Krause


Learning from massive data
- Many applications require gaining insights from massive, noisy data sets
- Science: physics (LHC, ...), astronomy (sky surveys, ...), neuroscience (fMRI, micro-electrode arrays, ...), biology (high-throughput microarrays, ...), geology (sensor arrays, ...), ...
- Social science, economics, ...
- Commercial / civil applications: consumer data (online advertising, viral marketing, ...), health records (evidence-based medicine, ...)
- Security / defense related applications: spam filtering / intrusion detection, surveillance, ...


Web-scale machine learning
- Predict relevance of search results from click data
- Personalization
- Online advertising
- Machine translation
- Learning to index
- Spam filtering
- Fraud detection
- ...
[Figure: >21 billion indexed web pages; image credits L. Brouwer, T. Riley]


Analyzing fMRI data (Mitchell et al., Science, 2008)
- Predict activation patterns for nouns
- Google's trillion-word corpus used to measure co-occurrence


Monitoring transients in astronomy [Djorgovski]
- Novae, cataclysmic variables
- Supernovae
- Gamma-ray bursts
- Gravitational microlensing
- Accretion onto SMBHs


Data-rich astronomy [Djorgovski]
- A typical digital sky survey now generates ~10-100 TB, plus a comparable amount of derived data products
- PB-scale data sets are on the horizon
- Astronomy today has ~1-2 PB of archived data, and generates a few TB/day
- Both data volumes and data rates grow exponentially, with a doubling time of ~1.5 years
- Even more important is the growth of data complexity
- For comparison: human memory ~ a few hundred MB; human genome < 1 GB; 1 TB ~ 2 million books; Library of Congress (print only) ~ 30 TB


How is data-rich science different? [Djorgovski]
- The information volume grows exponentially. Most data will never be seen by humans, hence the need for data storage, network, and database-related technologies, standards, etc.
- Information complexity is also increasing greatly. Most data (and data constructs) cannot be comprehended by humans directly, hence the need for data mining, KDD, data-understanding technologies, hyperdimensional visualization, AI / machine-assisted discovery, ...
- We need to create a new scientific methodology to do 21st-century, computationally enabled, data-rich science...
- ML and AI will be essential components of the new scientific toolkit


Data volume in scientific and industrial applications [Meiron et al.]


How can we gain insight from massive, noisy data sets?


Key questions
- How can we deal with data sets that don't fit in the main memory of a single machine? Online learning
- Labels are expensive. How can we obtain the most informative labels at minimum cost? Active learning
- How can we adapt the complexity of classifiers for large data sets? Nonparametric learning


Overview
- Research-oriented advanced topics course
- 3 main topics:
  - Online learning (from streaming data)
  - Active learning (for gathering the most useful labels)
  - Nonparametric learning (for model selection)
- Both theory and applications
- Handouts etc. on the course webpage: http://www.cs.caltech.edu/courses/cs253/


Overview
- Instructors: Andreas Krause (krausea@caltech.edu) and Daniel Golovin (dgolovin@caltech.edu)
- Teaching assistant: Deb Ray (dray@caltech.edu)
- Administrative assistant: Sheri Garcia (sheri@cs.caltech.edu)


Background & Prerequisites
- Formal requirement: CS/CNS/EE 156a or instructor's permission


Grading based on coursework:
- 3 homework assignments (one per topic) (50%)
- Course project (40%)
- Scribing (10%)
- 3 late days
- Discussing assignments is allowed, but everybody must turn in their own solutions
- Start early!


Course project
- "Get your hands dirty" with the course material
- Implement an algorithm from the course or a paper you read and apply it to some data set
- Ideas on the course website (soon)
- Application of techniques you learnt to your own research is encouraged
- Must be something new (e.g., not work done last term)


Project: Timeline and grading
- Small groups (2-3 students)
- January 20: Project proposals due (1-2 pages); feedback by instructor and TA
- February 10: Project milestone
- March ~10: Poster session (TBA)
- March 15: Project report due
- Grading based on quality of poster (20%), milestone report (20%) and final report (60%)
- We will have a Best Project Award!


Course overview
- Online learning from massive data sets
- Active learning to gather the most informative labels
- Nonparametric learning to adapt model complexity
This lecture: a quick overview of all these topics


Traditional classification task
[Figure: spam vs. ham examples, positive (+) and negative (-) points separated by a line]
- Input: labeled data set with positive (+) and negative (-) examples
- Output: decision rule (e.g., a linear separator)


Main memory vs. disk access
- Main memory: fast, random access, expensive
- Secondary memory (hard disk): ~10^4 times slower, sequential access, inexpensive
- Massive data -> sequential access
- How can we learn from streaming data?


Online classification task
[Figure: spam/ham points arriving one at a time; X marks a classification error]
- Data arrives sequentially
- Need to classify one data point at a time
- Use a different decision rule (linear separator) each time
- Can't remember all data points!
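As a concrete (if simplified) illustration of classifying one streaming point at a time with a changing linear separator, here is a minimal Python sketch using the classic online perceptron update; the perceptron rule and the synthetic data are illustrative assumptions, not the algorithm developed in this course.

```python
import numpy as np

def online_perceptron(stream, dim, lr=1.0):
    """Classify a stream of (x, y) pairs one at a time, with y in {-1, +1}.

    Keeps only the current weight vector; past data points are never stored.
    """
    w = np.zeros(dim)
    mistakes = 0
    for x, y in stream:                       # data arrives sequentially
        pred = 1 if np.dot(w, x) >= 0 else -1
        if pred != y:                         # classification error ("X" in the figure)
            mistakes += 1
            w += lr * y * x                   # update the decision rule (linear separator)
    return w, mistakes

# Toy stream of 10,000 linearly separable points.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
stream = ((x, 1 if x @ true_w >= 0 else -1)
          for x in rng.normal(size=(10_000, 2)))
w, mistakes = online_perceptron(stream, dim=2)
print(f"mistakes made while streaming: {mistakes}")
```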


Model: Prediction from expert advice
Loss table (rows: experts e_1, ..., e_n; columns: rounds with losses l_1, ..., l_T):
  e_1: 0 1 0 1 1
  e_2: 0 0 1 0 0
  e_3: 1 1 0 1 0
  ...
  e_n: 1 0 0 0 0
Learner's total loss: ∑_t l(t, i_t), to be compared against the best expert min_i ∑_t l(t, i)
- Expert = someone with an opinion (not necessarily someone who knows something)
- Think of an expert as a decision rule (e.g., a linear separator)


Performance metric: Regret
- Best expert in hindsight: i* = argmin_i ∑_t l(t, i)
- Let i_1, ..., i_T be the sequence of experts selected
- Instantaneous regret at time t: r_t = l(t, i_t) - l(t, i*)
- Total regret: R_T = ∑_t r_t = ∑_t l(t, i_t) - ∑_t l(t, i*)
- Typical goal: a selection strategy that guarantees sublinear regret, i.e., R_T / T -> 0 as T grows
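To make total loss and regret concrete, here is a small Python sketch using the first three expert rows from the loss table above; the learner's choice sequence is made up for illustration.

```python
import numpy as np

# Loss table from the slides: rows = experts e1..e3, columns = rounds 1..5.
losses = np.array([
    [0, 1, 0, 1, 1],   # e1
    [0, 0, 1, 0, 0],   # e2
    [1, 1, 0, 1, 0],   # e3
])

# Best expert in hindsight: i* = argmin_i sum_t l(t, i)
totals = losses.sum(axis=1)
i_star = int(totals.argmin())                 # here e2, with total loss 1

# Hypothetical sequence of experts chosen by the learner (one per round).
chosen = [0, 2, 1, 0, 2]
learner_loss = sum(losses[i, t] for t, i in enumerate(chosen))

total_regret = learner_loss - totals[i_star]  # R_T = sum_t l(t, i_t) - sum_t l(t, i*)
print(f"best expert: e{i_star + 1} with loss {totals[i_star]}")
print(f"learner loss: {learner_loss}, total regret: {total_regret}")
```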


Expert selection strategies
- Pick an expert (classifier) uniformly at random?
- Always pick the best expert?


Randomized weighted majority
Input: learning rate η
Initialization: associate weight w_{1,s} = 1 with every expert s
For each round t:
- Choose expert s with probability w_{t,s} / ∑_{s'} w_{t,s'}
- Obtain losses l(t, s)
- Update weights, e.g. with the multiplicative rule w_{t+1,s} = w_{t,s} · exp(-η · l(t, s))
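Below is a minimal sketch of randomized weighted majority with the exponential-weights update shown above; the learning rate choice eta = sqrt(ln(n) / T) and the random binary losses are assumptions made for the demo, not values given on the slide.

```python
import numpy as np

def randomized_weighted_majority(loss_matrix, eta, rng):
    """loss_matrix[t, s] = loss of expert s in round t (shape (T, n))."""
    T, n = loss_matrix.shape
    w = np.ones(n)                             # w_{1,s} = 1 for every expert s
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                        # pick expert s with prob. w_{t,s} / sum_{s'} w_{t,s'}
        s = rng.choice(n, p=p)
        total_loss += loss_matrix[t, s]        # obtain losses
        w *= np.exp(-eta * loss_matrix[t])     # multiplicative weight update
    return total_loss

# Toy run: 1,000 rounds, 10 experts, random binary losses.
rng = np.random.default_rng(0)
T, n = 1000, 10
loss_matrix = rng.integers(0, 2, size=(T, n)).astype(float)
eta = np.sqrt(np.log(n) / T)                   # one "appropriately chosen" learning rate
learner_loss = randomized_weighted_majority(loss_matrix, eta, rng)
best_expert_loss = loss_matrix.sum(axis=0).min()
print(f"learner loss: {learner_loss:.0f}, best expert loss: {best_expert_loss:.0f}")
```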


Guarantees for RWM
Theorem: For an appropriately chosen learning rate, randomized weighted majority obtains sublinear regret, on the order of √(T log n) for n experts.
Note: no assumption about how the loss vectors l are generated!


Practical problems
- In many applications, the number of experts (classifiers) is infinite -> online optimization (e.g., online convex programming)
- Often, only partial feedback is available (e.g., we obtain the loss only for the chosen classifier) -> multi-armed bandits, sequential experimental design
- Many practical problems are high-dimensional -> dimension reduction, sketching


Course overview
- Online learning from massive data sets
- Active learning to gather the most informative labels
- Nonparametric learning to adapt model complexity
This lecture: a quick overview of all these topics


Spam or Ham?
[Figure: a pool of mostly unlabeled points (o) with a few labeled + and - examples]
- Labels are expensive (need to ask an expert)
- Which labels should we obtain to maximize classification accuracy?


Learning binary thresholds
- Input domain: D = [0, 1]
- True concept c: c(x) = +1 if x ≥ t, c(x) = -1 if x < t
[Figure: the interval [0, 1] with - labels below the threshold t and + labels above]
- Samples x_1, ..., x_n ∈ D drawn uniformly at random


Passive learning
- Input domain: D = [0, 1]
- True concept c: c(x) = +1 if x ≥ t, c(x) = -1 if x < t
- Passive learning: acquire all labels y_i ∈ {+, -}


Active learning
- Input domain: D = [0, 1]
- True concept c: c(x) = +1 if x ≥ t, c(x) = -1 if x < t
- Passive learning: acquire all labels y_i ∈ {+, -}
- Active learning: decide which labels to obtain


Classification error
- After obtaining n labels, D_n = {(x_1, y_1), ..., (x_n, y_n)}, the learner outputs a hypothesis h consistent with the labels D_n
- Classification error: R(h) = E_{x~P}[h(x) ≠ c(x)]
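As a quick illustration, the classification error R(h) of a threshold hypothesis can be estimated by Monte Carlo sampling from P; the specific thresholds below are hypothetical.

```python
import numpy as np

T_TRUE, T_HAT = 0.37, 0.40        # illustrative true threshold t and learned threshold

def c(x):                          # true concept: +1 if x >= t, else -1
    return np.where(x >= T_TRUE, 1, -1)

def h(x):                          # learner's hypothesis (also a threshold rule)
    return np.where(x >= T_HAT, 1, -1)

# Monte Carlo estimate of R(h) = E_{x ~ P}[h(x) != c(x)] with P = Uniform[0, 1]
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)
print(f"estimated R(h): {np.mean(h(x) != c(x)):.4f}")   # about |T_HAT - T_TRUE| = 0.03
```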


Statistical active learning protocol
- Data source P (produces inputs x_i)
- The active learner assembles the data set D_n = {(x_1, y_1), ..., (x_n, y_n)} by selectively obtaining labels
- The learner outputs a hypothesis h
- Classification error: R(h) = E_{x~P}[h(x) ≠ c(x)]
- How many labels do we need to ensure that R(h) ≤ ε?


Label complexity for passive learning
- To reach classification error ε on the threshold class, passive learning needs on the order of 1/ε labels (see the comparison below)


Label complexity for active learning
- By binary searching for the threshold, on the order of log(1/ε) labels suffice (see the comparison below)


Comparison: labels needed to learn with classification error ε
- Passive learning: Ω(1/ε)
- Active learning: O(log 1/ε)
Active learning can exponentially reduce the number of required labels!
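The exponential gap can be seen directly on the 1D threshold class. The sketch below labels every point in the passive case but binary-searches the sorted unlabeled pool in the active case, so it issues only about log2(n) label queries; the pool size and the true threshold are illustrative, and the search assumes at least one positive point exists in the pool.

```python
import numpy as np

def label(x, t=0.37):                              # label oracle for the true threshold t
    return 1 if x >= t else -1

rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(0.0, 1.0, size=10_000))   # unlabeled pool, x_i ~ Uniform[0, 1]

# Passive learning: query every label, then pick any consistent threshold.
passive_queries = len([label(x) for x in xs])       # n = 10,000 label queries

# Active learning: binary search for the -/+ boundary among the sorted points.
lo, hi, active_queries = 0, len(xs) - 1, 0
while lo < hi:
    mid = (lo + hi) // 2
    active_queries += 1
    if label(xs[mid]) == -1:
        lo = mid + 1                                # boundary lies to the right of mid
    else:
        hi = mid                                    # boundary is at mid or to its left
t_hat = xs[lo]                                      # smallest pool point labeled +1

print(f"passive label queries: {passive_queries}")
print(f"active label queries:  {active_queries}")   # about log2(10,000) ~ 14
print(f"estimated threshold:   {t_hat:.4f}")
```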


Key questions
- For which classification tasks can we provably reduce the number of labels?
- Can we do worse by active learning?
- Can we implement active learning efficiently?


Course overview
- Online learning from massive data sets
- Active learning to gather the most informative labels
- Nonparametric learning to adapt model complexity
This lecture: a quick overview of all these topics


Nonlinear classification
[Figure sequence: as the data set grows, the pattern of + and - points becomes more complex and a simple linear separator no longer suffices]
- How should we adapt the classifier complexity to growing data set size?


Linear classification
- Objective: loss function + complexity penalty
- e.g., hinge loss ∑_t max(0, 1 - y_t · (w·x_t)), with a complexity penalty such as λ||w||² on the weight vector w
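A minimal sketch of regularized linear classification with the hinge loss, trained by plain subgradient descent; the objective form (average hinge loss plus lam * ||w||^2), the step size, and the toy data are assumptions for illustration.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize (1/n) * sum_t max(0, 1 - y_t * w.x_t) + lam * ||w||^2 by subgradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                                    # points with nonzero hinge loss
        grad = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
        w -= lr * grad
    return w

# Toy linearly separable data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([1.0, -1.0]) >= 0, 1, -1)
w = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```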


From linear to nonlinear classification
- Linear classification: decision rule based on w·x, with a complexity penalty on w
- Nonlinear classification: decision rule based on a general function f(x)
- But what is the complexity penalty for a function f??


1D example of nonlinear classification
[Figure: 1D inputs with labels +1 and -1 plotted against x; the classifier f should be near +1 over the + points and near -1 over the - points]


Representation of function f
- The solution of the regularized problem (loss plus complexity penalty ||f||) can be written as f(x) = ∑_t α_t k(x_t, x) for an appropriate choice of ||f|| (Representer Theorem)
- Here, k(·, ·) is called a kernel function (associated with ||·||)


Examples of kernels
- Squared exponential kernel: k(x, x') = exp(-||x - x'||² / (2h²))
- Exponential kernel: k(x, x') = exp(-||x - x'|| / h)
- Finite-dimensional kernels


Nonparametric solution
- The solution of the regularized problem can be written as f(x) = ∑_t α_t k(x_t, x)
- The function f has one parameter α_t for each data point x_t
- No finite-dimensional representation: "nonparametric"
- A large data set therefore means a huge number of parameters!!
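To make the representer-theorem form f(x) = ∑_t α_t k(x_t, x) concrete, here is a small sketch that fits the coefficients alpha by kernel ridge regression on the ±1 labels of a toy 1D data set, using a squared-exponential kernel; the data layout, bandwidth h, and regularizer lam are all illustrative assumptions.

```python
import numpy as np

def sq_exp_kernel(a, b, h=0.1):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 h^2)) for 1D inputs."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * h ** 2))

# Toy 1D data: + points near the ends, - points in the middle (illustrative layout).
x_train = np.array([0.05, 0.10, 0.15, 0.45, 0.50, 0.55, 0.85, 0.90, 0.95])
y_train = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

# One coefficient alpha_t per data point x_t ("nonparametric").
K = sq_exp_kernel(x_train, x_train)
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)   # kernel ridge fit

def f(x_new):
    """Representer form: f(x) = sum_t alpha_t k(x_t, x)."""
    return sq_exp_kernel(np.atleast_1d(x_new), x_train) @ alpha

print(np.sign(f(np.array([0.1, 0.5, 0.9]))))   # expected: [ 1. -1.  1.]
```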


Key questions
- How can we determine the right tradeoff between function expressiveness (number of parameters) and computational complexity?
- How can we control model complexity in an online fashion?
- How can we quantify uncertainty in nonparametric learning?


Course overview
[Diagram: the three course topics (Online Learning, Active Learning, Nonparametric Learning) and connecting themes: response surface methods, bandit optimization, active set selection]
