Marcus Hutter - 1 - Loss Bounds for Universal Prediction

Optimality of Universal Bayesian Prediction for General Loss and Alphabet

Marcus Hutter
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland
marcus@idsia.ch, http://www.idsia.ch/~marcus
2000 - 2002

Universal Induction = Ockham + Epicurus + Bayes

Loss(Universal Prediction Scheme) / Loss(Any other Prediction Scheme) ≤ 1 + o(1)


Marcus Hutter - 2 - Loss Bounds for Universal Prediction

Table of Contents

• The Philosophical Dilemma of Predicting the Future
• (Conditional) Probabilities and their Interpretation
• Probability that the Sun will rise Tomorrow
• Kolmogorov Complexity
• Universal Probability Distribution
• Universal Sequence Prediction
• Loss Bounds & Optimality
• Application to Games of Chance (Bound on Winning Time)
• Generalization: Continuous Probability Classes
• Generalization: The Universal AIξ Model
• Further Generalizations, Outlook, Conclusions


Marcus Hutter - 3 - Loss Bounds for Universal Prediction

Problem Setup

• Every induction problem can be phrased as a sequence prediction problem.
• Classification is a special case of sequence prediction.
  (With some tricks the other direction is also true.)
• I'm interested in maximizing profit (minimizing loss).
  I'm not (primarily) interested in finding a (true/predictive/causal) model.
• Separating noise from data is not necessary in this setting!

My Position on Occam

• Most of us believe in, or at least use, the axioms of logic, probability, and the
  natural numbers when doing science, without questioning them.
• We should/must add Occam's razor in some quantified form to this list,
  because it is the foundation of machine learning and science.
• There is (yet) no mathematical proof of Occam's razor, and maybe it has to be taken
  as an independent axiom, but there is lots of evidence that this is a useful principle.


Marcus Hutter - 4 - Loss Bounds for Universal Prediction

On the Foundations of Machine Learning

• Example: Algorithm/complexity theory: The goal is to find fast algorithms for
  problems and to show lower bounds on their computation time. Everything is
  rigorously defined: algorithm, Turing machine, problem class, computation time, ...
• Most disciplines start with an informal way of attacking a subject. With time they
  get more and more formalized, often to a point where they are completely rigorous.
  Examples: set theory, logical reasoning, proof theory, probability theory,
  infinitesimal calculus, quantum field theory, ...
• Machine learning: Tries to build and understand systems which learn from past
  data, make good predictions, and are able to generalize, ...
  Many terms are only vaguely defined, or there are many alternative definitions.


Marcus Hutter - 5 - Loss Bounds for Universal Prediction

Occam to the Rescue

• Is it possible to give machine learning a rigorous mathematical framework/definition?
• Yes! Use Occam's razor, quantified in terms of Kolmogorov complexity,
  combine it with Bayes, and possibly with sequential decision theory.
• There is at the moment no alternative suggestion of how to found machine
  learning rigorously.

My View of (Future) Machine Learning

• Application = Solve learning tasks by approximating Kolmogorov complexity
  (MML, MDL, SRM, and much more specific ones, like SVMs, ...).
• Theory = Prove theorems, especially on convergence and approximation quality.
• Non-standard ML = Modify "Occam's axiom" with the goal of performing better.


Marcus Hutter - 6 - Loss Bounds for Universal Prediction

Induction = Predicting the Future

Extrapolate past observations to the future,
but how can we know something about the future?

Epicurus' principle of multiple explanations:
If more than one theory is consistent with the observations, keep all of them.

Ockham's razor (simplicity) principle:
Entities should not be multiplied beyond necessity.

Hume's negation of induction:
The only form of induction possible is deduction,
as the conclusion is already logically contained in the premises.

Bayes' rule for conditional probabilities:
Given the prior belief/probability one can predict all future probabilities.

Solomonoff's universal prior:
Solves the question of how to choose the prior if nothing is known.


Marcus Hutter - 7 - Loss Bounds for Universal Prediction

Strings and Conditional Probabilities

Strings: x = x_1 x_2 ... x_n with x_t ∈ X, x_{1:m} := x_1 x_2 ... x_{m-1} x_m, and x_{<m} := x_1 ... x_{m-1}.

ρ(x_1 ... x_n) is the probability that an (infinite) sequence starts with x_1 ... x_n.

Heavy use of Bayes' rule in the following forms:

  ρ(x_n | x_{<n}) = ρ(x_{1:n}) / ρ(x_{<n})   and   ρ(x_{1:n}) = ρ(x_1) · ρ(x_2 | x_1) · ... · ρ(x_n | x_{<n}).
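
To make the chain rule concrete, here is a minimal Python sketch, assuming a toy computable measure µ = Bernoulli(2/3) (an illustrative choice, not part of the formal setup):

  # Chain rule: rho(x_{1:n}) = prod_t rho(x_t | x_{<t}) for a toy measure.
  # mu = Bernoulli(2/3) is an illustrative assumption.
  def mu(x):                      # probability that a binary sequence starts with x
      p = 2 / 3                   # probability of symbol 1
      prob = 1.0
      for bit in x:
          prob *= p if bit == 1 else 1 - p
      return prob

  def cond(rho, x):               # rho(x_n | x_{<n}) = rho(x_{1:n}) / rho(x_{<n})
      return rho(x) / rho(x[:-1])

  x = (1, 0, 1, 1)
  chain = 1.0
  for t in range(1, len(x) + 1):
      chain *= cond(mu, x[:t])
  assert abs(chain - mu(x)) < 1e-12   # the product of conditionals recovers the joint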


Marcus Hutter - 8 - Loss Bounds for Universal Prediction

Probability of Sunrise Tomorrow

What is the probability that the sun will rise tomorrow? Formally, it is µ(1|1^d), where d is the past lifetime of the sun in days. (1 = sun rises, 0 = sun does not rise.)

• The probability is undefined, because there has never been an experiment that
  tested the existence of the sun tomorrow (reference class problem).
• The probability is 1, because in all experiments that have been performed (many
  days) the sun rose.
• The probability is 1 − ε, where ε is the proportion of stars in the universe that
  explode in a supernova per day.
• The probability is (d + 1)/(d + 2) (Laplace estimate, assuming a Bernoulli
  process with a uniformly distributed prior on the rising probability p).
• The probability can be derived from the type, age, size and temperature of the
  sun, even though we have never observed another star with exactly these properties.

Solomonoff solved the problem of the unknown prior µ by introducing a universal
probability distribution ξ related to Algorithmic Information Theory.
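
The Laplace estimate in the fourth bullet is the posterior predictive under a uniform prior on the Bernoulli parameter; a minimal Python sketch (the value of d is an illustrative assumption):

  # Laplace rule: with a uniform prior on p, P(next = 1 | d ones in d trials) = (d+1)/(d+2).
  def laplace_predict(ones, total):
      return (ones + 1) / (total + 2)

  d = 2_000_000                      # illustrative past lifetime of the sun in days
  print(laplace_predict(d, d))       # ~ 0.9999995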


Marcus Hutter - 9 - Loss Bounds for Universal Prediction

Kolmogorov Complexity

The Kolmogorov complexity of a string x is the length of the shortest program producing x:

  K(x) := min_p {l(p) : U(p) = x},   U = universal Turing machine.

The definition is "nearly" independent of the choice of U:

  |K_U(x) − K_{U'}(x)| < c_{UU'},   i.e.  K_U(x) =+ K_{U'}(x),

where =+ indicates equality up to a constant c_{UU'} independent of x.

K satisfies most properties an information measure should satisfy, e.g. K(xy) ≤+ K(x) + K(y).

K(x) is not computable, but only co-enumerable (semi-computable from above).
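
K(x) itself is incomputable, but any lossless compressor gives a crude computable upper bound, since a fixed decompressor plus the compressed data is a program that outputs x. A small illustration using Python's zlib (a stand-in for a short description, not Kolmogorov complexity itself):

  # Any compressor upper-bounds K(x) up to a constant: decompressor + archive prints x.
  # zlib is only an illustrative stand-in for the incomputable K.
  import os, zlib

  def compressed_length(x: bytes) -> int:
      return len(zlib.compress(x, 9))

  regular = b"ab" * 5000            # very regular string: a short description exists
  random_ = os.urandom(10000)       # typical random string: essentially incompressible
  print(compressed_length(regular), compressed_length(random_))   # e.g. ~40 vs ~10000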


Marcus Hutter - 10 - Loss Bounds for Universal Prediction

Universal Probability Distribution

The universal semimeasure ξ(x) is the probability that the output of U starts with x when the input is provided with fair coin flips:

  ξ(x) = Σ_{µ_i ∈ M} w_{µ_i} · µ_i(x)  =×  Σ_{p : U(p) = x*} 2^{−l(p)},   e.g.  w_{µ_i} = 2^{−K(µ_i)}.     [Solomonoff 64]

Universality property of ξ: ξ dominates every computable probability distribution:

  ξ(x) ≥ w_{µ_i} · µ_i(x)   ∀ µ_i ∈ M.

Furthermore, the µ-expected squared distance sum between ξ and µ is finite for computable µ:

  Σ_{t=1}^∞ Σ_{x_{<t}} µ(x_{<t}) Σ_{x_t} (ξ(x_t|x_{<t}) − µ(x_t|x_{<t}))²  ≤  2·ln(1/w_µ)  <  ∞.
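
Dominance is immediate from the mixture form, since dropping all but one non-negative term of a sum can only decrease it. A minimal sketch with a finite class of Bernoulli measures (the class and weights are illustrative assumptions, not Solomonoff's M):

  # Bayes mixture xi over a finite class M = {Bernoulli(theta_i)} with prior weights w_i,
  # and a check of dominance: xi(x) >= w_i * mu_i(x) for every i.
  thetas = [0.1, 0.5, 0.9]
  weights = [0.5, 0.25, 0.25]

  def mu(theta, x):
      prob = 1.0
      for bit in x:
          prob *= theta if bit == 1 else 1 - theta
      return prob

  def xi(x):
      return sum(w * mu(t, x) for w, t in zip(weights, thetas))

  x = (1, 1, 0, 1, 1, 1, 0, 1)
  assert all(xi(x) >= w * mu(t, x) for w, t in zip(weights, thetas))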


Marcus Hutter - 11 - Loss Bounds for Universal Prediction

Convergence Theorem

The universal conditional probability ξ(x_t|x_{<t}) converges to the true conditional probability µ(x_t|x_{<t}) with µ-probability 1 (and in mean sum) for every computable µ, since the squared distance sum on the previous slide is finite. Hence ξ can be used as a substitute for the unknown true distribution µ.
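
A numerical illustration of this convergence for a finite class of Bernoulli measures (class, weights and the true θ = 0.7 are illustrative assumptions; the computation is done in log space to avoid underflow):

  # xi(x_t | x_{<t}) approaches mu(x_t | x_{<t}) as data drawn from mu accumulate.
  import math, random

  random.seed(0)
  thetas = [0.3, 0.5, 0.7, 0.9]
  log_w = [math.log(0.25)] * 4
  true_theta = 0.7

  def xi_cond_one(ones, zeros):
      # xi(next = 1 | x) = sum_i w_i mu_i(x) theta_i / sum_i w_i mu_i(x)
      log_post = [lw + ones * math.log(t) + zeros * math.log(1 - t)
                  for lw, t in zip(log_w, thetas)]
      m = max(log_post)
      post = [math.exp(lp - m) for lp in log_post]
      return sum(p * t for p, t in zip(post, thetas)) / sum(post)

  ones = zeros = 0
  for n in (10, 100, 1000, 10000):
      while ones + zeros < n:
          ones, zeros = (ones + 1, zeros) if random.random() < true_theta else (ones, zeros + 1)
      print(n, abs(xi_cond_one(ones, zeros) - true_theta))   # gap typically shrinks with n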


Marcus Hutter - 12 - Loss Bounds for Universal Prediction

Universal Sequence Prediction

A prediction is very often the basis for some decision. The decision results in an action, which itself leads to some reward or loss. Let l_{x_t y_t} ∈ [0, 1] be the loss received when taking action y_t ∈ Y and x_t ∈ X is the t-th symbol of the sequence. Example: decision Y = {umbrella, sunglasses} based on weather forecasts X = {sunny, rainy}:

  Loss         sunny   rainy
  umbrella      0.3     0.1
  sunglasses    0.0     1.0

The goal is to minimize the µ-expected loss. More generally we define the Λ_ρ prediction scheme

  y_t^{Λ_ρ} := arg min_{y_t ∈ Y} Σ_{x_t} ρ(x_t|x_{<t}) · l_{x_t y_t},

which minimizes the ρ-expected loss.
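
A minimal sketch of the Λ_ρ decision rule for the umbrella/sunglasses example; the forecast probabilities fed in are illustrative assumptions:

  # y_t = argmin_y sum_x rho(x_t = x | x_{<t}) * loss[x][y] with the loss table above.
  loss = {
      "sunny": {"umbrella": 0.3, "sunglasses": 0.0},
      "rainy": {"umbrella": 0.1, "sunglasses": 1.0},
  }

  def bayes_action(prob):            # prob: predictive distribution over the weather
      return min(("umbrella", "sunglasses"),
                 key=lambda y: sum(prob[x] * loss[x][y] for x in prob))

  print(bayes_action({"sunny": 0.8, "rainy": 0.2}))   # sunglasses (expected loss 0.2 vs 0.26)
  print(bayes_action({"sunny": 0.5, "rainy": 0.5}))   # umbrella   (expected loss 0.2 vs 0.5)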


Marcus Hutter - 13 - Loss Bounds for Universal Prediction

Loss Bounds (Main Theorem)

Let L_n^{Λ_µ} be the total loss made by the informed scheme Λ_µ,
L_n^{Λ_ξ} the total loss made by the universal scheme Λ_ξ, and
L_n^{Λ} the total loss made by any (causal) prediction scheme Λ. Then:

i)   L_n^{Λ_µ} ≤ L_n^{Λ} for any (causal) prediction scheme Λ.
ii)  0 ≤ L_n^{Λ_ξ} − L_n^{Λ_µ} ≤ 2·D_n + 2·√(L_n^{Λ_µ}·D_n)
iii) if L_∞^{Λ_µ} is finite, then L_∞^{Λ_ξ} is finite.
iv)  L_n^{Λ_ξ}/L_n^{Λ_µ} = 1 + O((L_n^{Λ_µ})^{−1/2}) → 1 for L_n^{Λ_µ} → ∞.
v)   Σ_{t=1}^n E[(l_{tΛ_ξ}(x_{<t}) − l_{tΛ_µ}(x_{<t}))²] ≤ 2·D_n,

where D_n := Σ_{x_{1:n}} µ(x_{1:n}) ln(µ(x_{1:n})/ξ(x_{1:n})) ≤ ln(1/w_µ) ≤ K(µ)·ln 2 is the relative entropy between µ and ξ.
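
One way to see how (iv) follows from (ii), using that D_n ≤ ln(1/w_µ) stays bounded (a sketch in LaTeX notation):

  \[
    0 \;\le\; \frac{L_n^{\Lambda_\xi}}{L_n^{\Lambda_\mu}} - 1
      \;\le\; \frac{2D_n}{L_n^{\Lambda_\mu}} + 2\sqrt{\frac{D_n}{L_n^{\Lambda_\mu}}}
      \;=\; O\!\bigl((L_n^{\Lambda_\mu})^{-1/2}\bigr)
      \;\xrightarrow{\,L_n^{\Lambda_\mu}\to\infty\,}\; 0 .
  \]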


Marcus Hutter - 14 - Loss Bounds for Universal Prediction

Example Application

A dealer has two dice, one with 2 white and 4 black faces, the other with 4 white and 2 black faces. He chooses a die according to some deterministic rule. In each round we bet s = $3 on white or black and receive r = $5 for every correct prediction.

If we know µ, i.e. the die the dealer chooses, we should predict the color with the larger number of sides and win money. Expected profit (= −loss): P_n^{Λ_µ}/n = 1/3 $ per round.

If we don't know µ, we can use the Solomonoff prediction scheme Λ_ξ and asymptotically achieve the same profit:

  P_n^{Λ_ξ} / P_n^{Λ_µ} = 1 − O(n^{−1/2}).

Bound on Winning Time

Estimate of the number of rounds before reaching the winning zone:
P_n^{Λ_ξ} > 0 if L_n^{Λ_ξ} is small enough, which the loss bound guarantees once n > 330·ln 2·K(µ) + O(1).

Λ_ξ is asymptotically optimal with rapid convergence.
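
A worked check of the informed scheme's expected profit per round: predict the majority color of the chosen die, so the $5 payout is won with probability 2/3 while the $3 stake is always paid (LaTeX notation):

  \[
    \frac{P_n^{\Lambda_\mu}}{n} \;=\; \tfrac{2}{3}\cdot \$5 \;-\; \$3 \;=\; \$\tfrac{1}{3}
    \quad\text{per round.}
  \]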


Marcus Hutter - 15 - Loss Bounds for Universal Prediction

General Bound for Winning Time

For every (passive) game of chance for which there exists a winning strategy, you can make money by using Λ_ξ, even if you don't know the underlying process/algorithm. Λ_ξ finds and exploits every regularity.

The time n needed to reach the winning zone is

  n ≤ (2·p_Δ / p̄_n^{Λ_µ})² · ln(1/w_µ),   where   p̄_n^{Λ_µ} := (1/n)·Σ_{t=1}^n p_t^{Λ_µ}

is the average profit per round of the informed scheme and p_Δ bounds the possible profit range per round.


Marcus Hutter - 16 - Loss Bounds for Universal Prediction

Generalization: Continuous Probability Classes

In statistical parameter estimation one often has a continuous hypothesis class (e.g. a Bernoulli(θ) process with unknown θ ∈ [0, 1]):

  M := {µ_θ : θ ∈ IR^d},   ξ(x_{1:n}) := ∫_{IR^d} dθ w(θ)·µ_θ(x_{1:n}).

The only property of ξ needed was ξ(x_{1:n}) ≥ w_{µ_i}·µ_i(x_{1:n}), which was obtained by dropping the sum over µ_i. Here, restrict the integral over IR^d to a small vicinity N_{δn} of the true θ. For sufficiently smooth µ_θ and w(θ) we expect

  ξ(x_{1:n}) ≳ |N_{δn}|·w(θ)·µ_θ(x_{1:n})   ⟹   D_n ≲ ln(1/w_µ) + ln(1/|N_{δn}|).

The average Fisher information j̄_n measures the curvature (parametric complexity) of ln µ_θ. Under some weak regularity conditions on j̄_n one can show

  D_n := Σ_{x_{1:n}} µ(x_{1:n}) ln(µ(x_{1:n})/ξ(x_{1:n})) ≤ ln(1/w_µ) + (d/2)·ln(n/2π) + (1/2)·ln det j̄_n,

i.e. D_n grows only logarithmically with n.
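
A numerical check of the logarithmic growth of D_n for the simplest continuous class, Bernoulli(θ) with uniform prior w(θ) = 1 on [0, 1], so that ξ(x_{1:n}) = 1/((n+1)·C(n, k)) for k ones; the true θ = 0.3 is an illustrative assumption:

  # D_n = E_mu[ln(mu(x_{1:n}) / xi(x_{1:n}))] for mu = Bernoulli(0.3), uniform prior on theta.
  from math import comb, log

  theta0 = 0.3

  def D(n):
      total = 0.0
      for k in range(n + 1):
          p_k = comb(n, k) * theta0**k * (1 - theta0)**(n - k)   # mu-probability of k ones
          log_mu = k * log(theta0) + (n - k) * log(1 - theta0)
          log_xi = -log((n + 1) * comb(n, k))
          total += p_k * (log_mu - log_xi)
      return total

  for n in (10, 100, 1000):
      print(n, round(D(n), 3), round(0.5 * log(n), 3))   # D_n stays within a constant of (d/2) ln n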


Marcus Hutter - 17 - Loss Bounds for Universal Prediction

Optimality of the Universal Predictor

• There are M and µ ∈ M and weights w_µ for which the loss bounds are essentially tight.
• The universal prior ξ is Pareto-optimal, in the sense that there is no ρ with
  F(µ_a, ρ) ≤ F(µ_a, ξ) for all µ_a ∈ M and strict inequality for at least one µ_a,
  where F is the instantaneous or total squared distance s_t, S_n, or relative entropy
  d_t, D_n, or error e_t, E_n, or loss l_t, L_n.
• ξ is elastic Pareto-optimal in the sense that by accepting a slight performance
  decrease in some environments one can only achieve a slight performance increase
  in other environments.
• Within the set of enumerable weight functions with short programs, the universal
  weights w_ν = 2^{−K(ν)} lead to the smallest performance bounds within an additive
  (to ln(1/w_µ)) constant in all enumerable environments.

Does all this justify Occam's razor?


Marcus Hutter - 18 - Loss Bounds for Universal Prediction

Larger & Smaller Environmental Classes M

• all finitely computable probability measures
  (ξ ∉ M, in no sense computable)
• all enumerable (approximable from below) semimeasures [Solomonoff]
  (ξ ∈ M, enumerable)
• all cumulatively enumerable semimeasures [Schmidhuber 01]
  (distribution enumerable and ∈ M)
• all approximable (asymptotically computable) measures [Schmidhuber]
  (ξ ∉ M, in no sense computable)
• Speed prior related to Levin complexity and Levin search [Schmidhuber]
  (which distributions are dominated?)
• finite-state automata instead of general Turing machines [Feder et al.],
  related to Lempel-Ziv data compression (ξ ∉ M)


Marcus Hutter - 19 - Loss Bounds for Universal Prediction

Generalization: The Universal AIξ Model

Universal AI = Universal Induction + Sequential Decision Theory.

Replace µ^AI in the decision-theoretic model AIµ by an appropriate generalization of ξ:

  ξ(yx_{1:t}) := Σ_{q : q(y_{1:t}) = x_{1:t}} 2^{−l(q)}

  y_t := arg max_{y_t} Σ_{x_t} max_{y_{t+1}} Σ_{x_{t+1}} ... max_{y_m} Σ_{x_m} (r(x_t) + ... + r(x_m)) · ξ(x_{t:m} | y_{t:m})

Claim: AIξ is the most intelligent environment-independent, i.e. universally optimal, agent possible.

Applications: Strategic Games, Function Minimization, Supervised Learning from Examples, Sequence Prediction, Classification.

[Proceedings of ECML-2001] and [http://www.hutt...]
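
A toy sketch of the expectimax recursion behind AIξ, with a small known environment model standing in for the incomputable ξ; the environment, reward and horizon are illustrative assumptions:

  # Finite-horizon expectimax: value = max_y sum_x env(x|h,y) * (r(x) + value of the future).
  ACTIONS = ("a", "b")
  PERCEPTS = (0, 1)

  def env(x, history, y):
      # toy model standing in for xi: action "a" makes percept 1 (reward 1) more likely
      p_one = 0.8 if y == "a" else 0.4
      return p_one if x == 1 else 1 - p_one

  def reward(x):
      return float(x)

  def value(history, horizon):
      if horizon == 0:
          return 0.0
      return max(sum(env(x, history, y) * (reward(x) + value(history + ((y, x),), horizon - 1))
                     for x in PERCEPTS)
                 for y in ACTIONS)

  def best_action(history, horizon):
      return max(ACTIONS, key=lambda y: sum(
          env(x, history, y) * (reward(x) + value(history + ((y, x),), horizon - 1))
          for x in PERCEPTS))

  print(best_action((), 3))   # "a": the action with the higher expected reward at each step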


Marcus Hutter - 20 - Loss Bounds for Universal Prediction

Further Generalizations:
• Time and history dependent loss functions with values in a general interval.
• Infinite (countable and uncountable) action/decision space.
• Partial Sequence Prediction.
• Independent Experiments & Classification.

Outlook:
• Infinite (prediction) alphabet X.
• Delayed and Probabilistic Sequence Prediction.
• Unification with (loss bounds for) aggregating strategies.
• Determine suitable performance measures for the universal AIξ agent.
• Study the learning aspect of Λ_ξ and AIξ.
• Information-theoretic interpretation of the winning time.
• Implementation and application of Λ_ξ for specific finite M.
• Downscale theory and results to the MDL approach.


Marcus Hutter - 21 - Loss Bounds for Universal Prediction

Conclusions

• Solomonoff's prediction scheme, which is related to Kolmogorov complexity, formally solves the general problem of induction.
• We proved convergence and loss bounds for Solomonoff prediction, showing that it is well suited even for difficult prediction problems.
• We proved several optimality properties for Solomonoff prediction.
• We made no structural assumptions on the probability distribution µ.
• The bounds are valid for any bounded loss function.
• We proved a bound on the time needed to win in games of chance.
• Discrete and continuous probability classes have been considered.
• Generalizations to active agents with reinforcement feedback (AIξ) are possible.
• At least, all this is a lot of evidence that Occam's razor is a useful principle.

See [http://www.idsia.ch/~marcus] for details.
