Marcus Hutter - 1 - Loss Bounds for Universal Prediction

Optimality of Universal Bayesian Prediction for General Loss and Alphabet

Marcus Hutter
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland
marcus@idsia.ch, http://www.idsia.ch/~marcus
2000 - 2002

Universal Induction = Ockham + Epicurus + Bayes

Loss(Universal Prediction Scheme) / Loss(Any other Prediction Scheme) ≤ 1 + o(1)


Marcus Hutter - 2 - Loss Bounds for Universal Prediction

Table of Contents

• The Philosophical Dilemma of Predicting the Future
• (Conditional) Probabilities and their Interpretation
• Probability that the Sun will rise Tomorrow
• Kolmogorov Complexity
• Universal Probability Distribution
• Universal Sequence Prediction
• Loss Bounds & Optimality
• Application to Games of Chance (Bound on Winning Time)
• Generalization: Continuous Probability Classes
• Generalization: The Universal AIξ Model
• Further Generalizations, Outlook, Conclusions


Marcus Hutter - 3 - Loss Bounds for Universal Prediction

Problem Setup

• Every induction problem can be phrased as a sequence prediction problem.
• Classification is a special case of sequence prediction.
  (With some tricks the other direction is also true.)
• I'm interested in maximizing profit (minimizing loss).
  I'm not (primarily) interested in finding a (true/predictive/causal) model.
• Separating noise from data is not necessary in this setting!

My Position on Occam

• Most of us believe in, or at least use, the axioms of logic, probability, and the
  natural numbers when doing science, without questioning them.
• We should/must add Occam's razor in some quantified form to this list,
  because it is the foundation of machine learning and science.
• There is (yet) no mathematical proof of Occam's razor, and maybe it has to be taken
  as an independent axiom, but there is lots of evidence that this is a useful principle.


Marcus Hutter - 4 - Loss Bounds for Universal Prediction

On the Foundations of Machine Learning

• Example: Algorithm/complexity theory: The goal is to find fast algorithms for
  problems and to show lower bounds on their computation time. Everything is
  rigorously defined: algorithm, Turing machine, problem class, computation time, ...
• Most disciplines start with an informal way of attacking a subject. With time they
  get more and more formalized, often to a point where they are completely rigorous.
  Examples: set theory, logical reasoning, proof theory, probability theory,
  infinitesimal calculus, quantum field theory, ...
• Machine learning: Tries to build and understand systems which learn from past
  data, make good predictions, and are able to generalize, ...
  Many terms are only vaguely defined, or there are many alternative definitions.


Marcus Hutter - 5 - Loss Bounds for Universal Prediction

Occam to the Rescue

• Is it possible to give machine learning a rigorous mathematical framework/definition?
• Yes! Use Occam's razor, quantified in terms of Kolmogorov complexity,
  combine it with Bayes, and possibly with sequential decision theory.
• There is at the moment no alternative suggestion of how to found machine
  learning rigorously.

My View of (Future) Machine Learning

• Application = Solve learning tasks by approximating Kolmogorov complexity
  (MML, MDL, SRM, and much more specific ones, like SVMs, ...).
• Theory = Prove theorems, especially on convergence and approximation quality.
• Non-standard ML = Modify "Occam's axiom" with the goal of performing better.


Marcus Hutter - 6 - Loss Bounds for Universal Prediction

Induction = Predicting the Future

Extrapolate past observations to the future,
but how can we know something about the future?

Epicurus' principle of multiple explanations:
If more than one theory is consistent with the observations, keep all of them.

Ockham's razor (simplicity) principle:
Entities should not be multiplied beyond necessity.

Hume's negation of induction:
The only form of induction possible is deduction,
as the conclusion is already logically contained in the premises.

Bayes' rule for conditional probabilities:
Given the prior belief/probability one can predict all future probabilities.

Solomonoff's universal prior:
Solves the question of how to choose the prior if nothing is known.


Marcus Hutter - 7 - Loss Bounds for Universal Prediction

Strings and Conditional Probabilities

Strings: x = x_1 x_2 ... x_n with x_t ∈ X, x_{1:m} := x_1 x_2 ... x_{m-1} x_m, and x_{<m} := x_1 ... x_{m-1}.

ρ(x_1 ... x_n) is the probability that an (infinite) sequence starts with x_1 ... x_n.

Heavy use of Bayes' rule in the following forms:

  ρ(x_n | x_{<n}) = ρ(x_{1:n}) / ρ(x_{<n})   and   ρ(x_{1:n}) = ρ(x_1) · ρ(x_2 | x_1) · ... · ρ(x_n | x_{<n}).
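
To make the chain rule concrete, here is a minimal Python sketch, assuming a toy computable measure µ = Bernoulli(2/3) (an illustrative choice, not part of the formal setup):

  # Chain rule: rho(x_{1:n}) = prod_t rho(x_t | x_{<t}) for a toy measure.
  # mu = Bernoulli(2/3) is an illustrative assumption.
  def mu(x):                      # probability that a binary sequence starts with x
      p = 2 / 3                   # probability of symbol 1
      prob = 1.0
      for bit in x:
          prob *= p if bit == 1 else 1 - p
      return prob

  def cond(rho, x):               # rho(x_n | x_{<n}) = rho(x_{1:n}) / rho(x_{<n})
      return rho(x) / rho(x[:-1])

  x = (1, 0, 1, 1)
  chain = 1.0
  for t in range(1, len(x) + 1):
      chain *= cond(mu, x[:t])
  assert abs(chain - mu(x)) < 1e-12   # the product of conditionals recovers the joint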


Marcus Hutter - 8 - Loss Bounds for Universal Prediction

Probability of Sunrise Tomorrow

What is the probability that the sun will rise tomorrow? Formally, it is µ(1|1^d), where d is the past lifetime of the sun in days. (1 = sun rises, 0 = sun does not rise.)

• The probability is undefined, because there has never been an experiment that
  tested the existence of the sun tomorrow (reference class problem).
• The probability is 1, because in all experiments that have been performed (many
  days) the sun rose.
• The probability is 1 − ε, where ε is the proportion of stars in the universe that
  explode in a supernova per day.
• The probability is (d + 1)/(d + 2) (Laplace estimate, assuming a Bernoulli
  process with a uniformly distributed prior on the rising probability p).
• The probability can be derived from the type, age, size and temperature of the
  sun, even though we have never observed another star with exactly these properties.

Solomonoff solved the problem of the unknown prior µ by introducing a universal
probability distribution ξ related to Algorithmic Information Theory.
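
The Laplace estimate in the fourth bullet is the posterior predictive under a uniform prior on the Bernoulli parameter; a minimal Python sketch (the value of d is an illustrative assumption):

  # Laplace rule: with a uniform prior on p, P(next = 1 | d ones in d trials) = (d+1)/(d+2).
  def laplace_predict(ones, total):
      return (ones + 1) / (total + 2)

  d = 2_000_000                      # illustrative past lifetime of the sun in days
  print(laplace_predict(d, d))       # ~ 0.9999995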


Marcus Hutter - 9 - Loss Bounds for Universal Prediction

Kolmogorov Complexity

The Kolmogorov complexity of a string x is the length of the shortest program producing x:

  K(x) := min_p {l(p) : U(p) = x},   U = universal Turing machine.

The definition is "nearly" independent of the choice of U:

  |K_U(x) − K_{U'}(x)| < c_{UU'},   i.e.  K_U(x) =+ K_{U'}(x),

where =+ indicates equality up to a constant c_{UU'} independent of x.

K satisfies most properties an information measure should satisfy, e.g. K(xy) ≤+ K(x) + K(y).

K(x) is not computable, but only co-enumerable (semi-computable from above).
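
K(x) itself is incomputable, but any lossless compressor gives a crude computable upper bound, since a fixed decompressor plus the compressed data is a program that outputs x. A small illustration using Python's zlib (a stand-in for a short description, not Kolmogorov complexity itself):

  # Any compressor upper-bounds K(x) up to a constant: decompressor + archive prints x.
  # zlib is only an illustrative stand-in for the incomputable K.
  import os, zlib

  def compressed_length(x: bytes) -> int:
      return len(zlib.compress(x, 9))

  regular = b"ab" * 5000            # very regular string: a short description exists
  random_ = os.urandom(10000)       # typical random string: essentially incompressible
  print(compressed_length(regular), compressed_length(random_))   # e.g. ~40 vs ~10000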


Marcus Hutter - 10 - Loss Bounds for Universal Prediction

Universal Probability Distribution

The universal semimeasure ξ(x) is the probability that the output of U starts with x when the input is provided with fair coin flips:

  ξ(x) = Σ_{µ_i ∈ M} w_{µ_i} · µ_i(x)  =×  Σ_{p : U(p) = x*} 2^{−l(p)},   e.g.  w_{µ_i} = 2^{−K(µ_i)}.     [Solomonoff 64]

Universality property of ξ: ξ dominates every computable probability distribution:

  ξ(x) ≥ w_{µ_i} · µ_i(x)   ∀ µ_i ∈ M.

Furthermore, the µ-expected squared distance sum between ξ and µ is finite for computable µ:

  Σ_{t=1}^∞ Σ_{x_{<t}} µ(x_{<t}) Σ_{x_t} (ξ(x_t|x_{<t}) − µ(x_t|x_{<t}))²  ≤  2·ln(1/w_µ)  <  ∞.
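
Dominance is immediate from the mixture form, since dropping all but one non-negative term of a sum can only decrease it. A minimal sketch with a finite class of Bernoulli measures (the class and weights are illustrative assumptions, not Solomonoff's M):

  # Bayes mixture xi over a finite class M = {Bernoulli(theta_i)} with prior weights w_i,
  # and a check of dominance: xi(x) >= w_i * mu_i(x) for every i.
  thetas = [0.1, 0.5, 0.9]
  weights = [0.5, 0.25, 0.25]

  def mu(theta, x):
      prob = 1.0
      for bit in x:
          prob *= theta if bit == 1 else 1 - theta
      return prob

  def xi(x):
      return sum(w * mu(t, x) for w, t in zip(weights, thetas))

  x = (1, 1, 0, 1, 1, 1, 0, 1)
  assert all(xi(x) >= w * mu(t, x) for w, t in zip(weights, thetas))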


Marcus Hutter - 11 - Loss Bounds for Universal Prediction

Convergence Theorem

The universal conditional probability ξ(x_t|x_{<t}) converges to the true conditional probability µ(x_t|x_{<t}) with µ-probability 1 (and in mean sum) for every computable µ, since the squared distance sum on the previous slide is finite. Hence ξ can be used as a substitute for the unknown true distribution µ.
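
A numerical illustration of this convergence for a finite class of Bernoulli measures (class, weights and the true θ = 0.7 are illustrative assumptions; the computation is done in log space to avoid underflow):

  # xi(x_t | x_{<t}) approaches mu(x_t | x_{<t}) as data drawn from mu accumulate.
  import math, random

  random.seed(0)
  thetas = [0.3, 0.5, 0.7, 0.9]
  log_w = [math.log(0.25)] * 4
  true_theta = 0.7

  def xi_cond_one(ones, zeros):
      # xi(next = 1 | x) = sum_i w_i mu_i(x) theta_i / sum_i w_i mu_i(x)
      log_post = [lw + ones * math.log(t) + zeros * math.log(1 - t)
                  for lw, t in zip(log_w, thetas)]
      m = max(log_post)
      post = [math.exp(lp - m) for lp in log_post]
      return sum(p * t for p, t in zip(post, thetas)) / sum(post)

  ones = zeros = 0
  for n in (10, 100, 1000, 10000):
      while ones + zeros < n:
          ones, zeros = (ones + 1, zeros) if random.random() < true_theta else (ones, zeros + 1)
      print(n, abs(xi_cond_one(ones, zeros) - true_theta))   # gap typically shrinks with n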


Marcus Hutter - 12 - Loss Bounds for Universal Prediction

Universal Sequence Prediction

A prediction is very often the basis for some decision. The decision results in an action, which itself leads to some reward or loss. Let l_{x_t y_t} ∈ [0, 1] be the loss received when taking action y_t ∈ Y and x_t ∈ X is the t-th symbol of the sequence. Example: decision Y = {umbrella, sunglasses} based on weather forecasts X = {sunny, rainy}:

  Loss         sunny   rainy
  umbrella      0.3     0.1
  sunglasses    0.0     1.0

The goal is to minimize the µ-expected loss. More generally we define the Λ_ρ prediction scheme

  y_t^{Λ_ρ} := arg min_{y_t ∈ Y} Σ_{x_t} ρ(x_t|x_{<t}) · l_{x_t y_t},

which minimizes the ρ-expected loss.
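
A minimal sketch of the Λ_ρ decision rule for the umbrella/sunglasses example; the forecast probabilities fed in are illustrative assumptions:

  # y_t = argmin_y sum_x rho(x_t = x | x_{<t}) * loss[x][y] with the loss table above.
  loss = {
      "sunny": {"umbrella": 0.3, "sunglasses": 0.0},
      "rainy": {"umbrella": 0.1, "sunglasses": 1.0},
  }

  def bayes_action(prob):            # prob: predictive distribution over the weather
      return min(("umbrella", "sunglasses"),
                 key=lambda y: sum(prob[x] * loss[x][y] for x in prob))

  print(bayes_action({"sunny": 0.8, "rainy": 0.2}))   # sunglasses (expected loss 0.2 vs 0.26)
  print(bayes_action({"sunny": 0.5, "rainy": 0.5}))   # umbrella   (expected loss 0.2 vs 0.5)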


Marcus Hutter - 13 - Loss Bounds for Universal Prediction

Loss Bounds (Main Theorem)

Let L_n^{Λ_µ} be the total loss made by the informed scheme Λ_µ,
L_n^{Λ_ξ} the total loss made by the universal scheme Λ_ξ, and
L_n^{Λ} the total loss made by any (causal) prediction scheme Λ. Then:

i)   L_n^{Λ_µ} ≤ L_n^{Λ} for any (causal) prediction scheme Λ.
ii)  0 ≤ L_n^{Λ_ξ} − L_n^{Λ_µ} ≤ 2·D_n + 2·√(L_n^{Λ_µ}·D_n)
iii) if L_∞^{Λ_µ} is finite, then L_∞^{Λ_ξ} is finite.
iv)  L_n^{Λ_ξ}/L_n^{Λ_µ} = 1 + O((L_n^{Λ_µ})^{−1/2}) → 1 for L_n^{Λ_µ} → ∞.
v)   Σ_{t=1}^n E[(l_{tΛ_ξ}(x_{<t}) − l_{tΛ_µ}(x_{<t}))²] ≤ 2·D_n,

where D_n := Σ_{x_{1:n}} µ(x_{1:n}) ln(µ(x_{1:n})/ξ(x_{1:n})) ≤ ln(1/w_µ) ≤ K(µ)·ln 2 is the relative entropy between µ and ξ.
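
One way to see how (iv) follows from (ii), using that D_n ≤ ln(1/w_µ) stays bounded (a sketch in LaTeX notation):

  \[
    0 \;\le\; \frac{L_n^{\Lambda_\xi}}{L_n^{\Lambda_\mu}} - 1
      \;\le\; \frac{2D_n}{L_n^{\Lambda_\mu}} + 2\sqrt{\frac{D_n}{L_n^{\Lambda_\mu}}}
      \;=\; O\!\bigl((L_n^{\Lambda_\mu})^{-1/2}\bigr)
      \;\xrightarrow{\,L_n^{\Lambda_\mu}\to\infty\,}\; 0 .
  \]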


Marcus Hutter - 14 - Loss Bounds for Universal Prediction

Example Application

A dealer has two dice, one with 2 white and 4 black faces, the other with 4 white and 2 black faces. He chooses a die according to some deterministic rule. In each round we bet s = $3 on white or black and receive r = $5 for every correct prediction.

If we know µ, i.e. the die the dealer chooses, we should predict the color with the larger number of sides and win money. Expected profit (= −loss): P_n^{Λ_µ}/n = 1/3 $ per round.

If we don't know µ, we can use the Solomonoff prediction scheme Λ_ξ and asymptotically achieve the same profit:

  P_n^{Λ_ξ} / P_n^{Λ_µ} = 1 − O(n^{−1/2}).

Bound on Winning Time

Estimate of the number of rounds before reaching the winning zone:
P_n^{Λ_ξ} > 0 if L_n^{Λ_ξ} is small enough, which the loss bound guarantees once n > 330·ln 2·K(µ) + O(1).

Λ_ξ is asymptotically optimal with rapid convergence.
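
A worked check of the informed scheme's expected profit per round: predict the majority color of the chosen die, so the $5 payout is won with probability 2/3 while the $3 stake is always paid (LaTeX notation):

  \[
    \frac{P_n^{\Lambda_\mu}}{n} \;=\; \tfrac{2}{3}\cdot \$5 \;-\; \$3 \;=\; \$\tfrac{1}{3}
    \quad\text{per round.}
  \]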


Marcus Hutter - 15 - Loss Bounds for Universal Prediction

General Bound for Winning Time

For every (passive) game of chance for which there exists a winning strategy, you can make money by using Λ_ξ, even if you don't know the underlying process/algorithm. Λ_ξ finds and exploits every regularity.

The time n needed to reach the winning zone is

  n ≤ (2·p_Δ / p̄_n^{Λ_µ})² · ln(1/w_µ),   where   p̄_n^{Λ_µ} := (1/n)·Σ_{t=1}^n p_t^{Λ_µ}

is the average profit per round of the informed scheme and p_Δ bounds the possible profit range per round.


Marcus Hutter - 16 - Loss Bounds for Universal Prediction

Generalization: Continuous Probability Classes

In statistical parameter estimation one often has a continuous hypothesis class (e.g. a Bernoulli(θ) process with unknown θ ∈ [0, 1]):

  M := {µ_θ : θ ∈ IR^d},   ξ(x_{1:n}) := ∫_{IR^d} dθ w(θ)·µ_θ(x_{1:n}).

The only property of ξ needed was ξ(x_{1:n}) ≥ w_{µ_i}·µ_i(x_{1:n}), which was obtained by dropping the sum over µ_i. Here, restrict the integral over IR^d to a small vicinity N_{δn} of the true θ. For sufficiently smooth µ_θ and w(θ) we expect

  ξ(x_{1:n}) ≳ |N_{δn}|·w(θ)·µ_θ(x_{1:n})   ⟹   D_n ≲ ln(1/w_µ) + ln(1/|N_{δn}|).

The average Fisher information j̄_n measures the curvature (parametric complexity) of ln µ_θ. Under some weak regularity conditions on j̄_n one can show

  D_n := Σ_{x_{1:n}} µ(x_{1:n}) ln(µ(x_{1:n})/ξ(x_{1:n})) ≤ ln(1/w_µ) + (d/2)·ln(n/2π) + (1/2)·ln det j̄_n,

i.e. D_n grows only logarithmically with n.
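
A numerical check of the logarithmic growth of D_n for the simplest continuous class, Bernoulli(θ) with uniform prior w(θ) = 1 on [0, 1], so that ξ(x_{1:n}) = 1/((n+1)·C(n, k)) for k ones; the true θ = 0.3 is an illustrative assumption:

  # D_n = E_mu[ln(mu(x_{1:n}) / xi(x_{1:n}))] for mu = Bernoulli(0.3), uniform prior on theta.
  from math import comb, log

  theta0 = 0.3

  def D(n):
      total = 0.0
      for k in range(n + 1):
          p_k = comb(n, k) * theta0**k * (1 - theta0)**(n - k)   # mu-probability of k ones
          log_mu = k * log(theta0) + (n - k) * log(1 - theta0)
          log_xi = -log((n + 1) * comb(n, k))
          total += p_k * (log_mu - log_xi)
      return total

  for n in (10, 100, 1000):
      print(n, round(D(n), 3), round(0.5 * log(n), 3))   # D_n stays within a constant of (d/2) ln n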


Marcus Hutter - 17 - Loss Bounds for Universal Prediction

Optimality of the Universal Predictor

• There are M and µ ∈ M and weights w_µ for which the loss bounds are essentially tight.
• The universal prior ξ is Pareto-optimal, in the sense that there is no ρ with
  F(µ_a, ρ) ≤ F(µ_a, ξ) for all µ_a ∈ M and strict inequality for at least one µ_a,
  where F is the instantaneous or total squared distance s_t, S_n, or relative entropy
  d_t, D_n, or error e_t, E_n, or loss l_t, L_n.
• ξ is elastic Pareto-optimal in the sense that by accepting a slight performance
  decrease in some environments one can only achieve a slight performance increase
  in other environments.
• Within the set of enumerable weight functions with short programs, the universal
  weights w_ν = 2^{−K(ν)} lead to the smallest performance bounds within an additive
  (to ln(1/w_µ)) constant in all enumerable environments.

Does all this justify Occam's razor?


Marcus Hutter - 18 - Loss Bounds for Universal Prediction

Larger & Smaller Environmental Classes M

• all finitely computable probability measures
  (ξ ∉ M, in no sense computable)
• all enumerable (approximable from below) semimeasures [Solomonoff]
  (ξ ∈ M, enumerable)
• all cumulatively enumerable semimeasures [Schmidhuber 01]
  (distribution enumerable and ∈ M)
• all approximable (asymptotically computable) measures [Schmidhuber]
  (ξ ∉ M, in no sense computable)
• Speed prior related to Levin complexity and Levin search [Schmidhuber]
  (which distributions are dominated?)
• finite-state automata instead of general Turing machines [Feder et al.],
  related to Lempel-Ziv data compression (ξ ∉ M)


Marcus Hutter - 19 - Loss Bounds for Universal Prediction

Generalization: The Universal AIξ Model

Universal AI = Universal Induction + Sequential Decision Theory.

Replace µ^AI in the decision-theoretic model AIµ by an appropriate generalization of ξ:

  ξ(yx_{1:t}) := Σ_{q : q(y_{1:t}) = x_{1:t}} 2^{−l(q)}

  y_t := arg max_{y_t} Σ_{x_t} max_{y_{t+1}} Σ_{x_{t+1}} ... max_{y_m} Σ_{x_m} (r(x_t) + ... + r(x_m)) · ξ(x_{t:m} | y_{t:m})

Claim: AIξ is the most intelligent environment-independent, i.e. universally optimal, agent possible.

Applications: Strategic Games, Function Minimization, Supervised Learning from Examples, Sequence Prediction, Classification.

[Proceedings of ECML-2001] and [http://www.hutt...]
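
A toy sketch of the expectimax recursion behind AIξ, with a small known environment model standing in for the incomputable ξ; the environment, reward and horizon are illustrative assumptions:

  # Finite-horizon expectimax: value = max_y sum_x env(x|h,y) * (r(x) + value of the future).
  ACTIONS = ("a", "b")
  PERCEPTS = (0, 1)

  def env(x, history, y):
      # toy model standing in for xi: action "a" makes percept 1 (reward 1) more likely
      p_one = 0.8 if y == "a" else 0.4
      return p_one if x == 1 else 1 - p_one

  def reward(x):
      return float(x)

  def value(history, horizon):
      if horizon == 0:
          return 0.0
      return max(sum(env(x, history, y) * (reward(x) + value(history + ((y, x),), horizon - 1))
                     for x in PERCEPTS)
                 for y in ACTIONS)

  def best_action(history, horizon):
      return max(ACTIONS, key=lambda y: sum(
          env(x, history, y) * (reward(x) + value(history + ((y, x),), horizon - 1))
          for x in PERCEPTS))

  print(best_action((), 3))   # "a": the action with the higher expected reward at each step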


Marcus Hutter - 20 - Loss Bounds for Universal Prediction

Further Generalizations:
• Time and history dependent loss functions with values in a general interval.
• Infinite (countable and uncountable) action/decision space.
• Partial Sequence Prediction.
• Independent Experiments & Classification.

Outlook:
• Infinite (prediction) alphabet X.
• Delayed and Probabilistic Sequence Prediction.
• Unification with (loss bounds for) aggregating strategies.
• Determine suitable performance measures for the universal AIξ agent.
• Study the learning aspect of Λ_ξ and AIξ.
• Information-theoretic interpretation of the winning time.
• Implementation and application of Λ_ξ for specific finite M.
• Downscale theory and results to the MDL approach.


Marcus Hutter - 21 - Loss Bounds for Universal Prediction

Conclusions

• Solomonoff's prediction scheme, which is related to Kolmogorov complexity, formally solves the general problem of induction.
• We proved convergence and loss bounds for Solomonoff prediction, showing that it is well suited even for difficult prediction problems.
• We proved several optimality properties for Solomonoff prediction.
• We made no structural assumptions on the probability distribution µ.
• The bounds are valid for any bounded loss function.
• We proved a bound on the time needed to win in games of chance.
• Discrete and continuous probability classes have been considered.
• Generalizations to active agents with reinforcement feedback (AIξ) are possible.
• At least, all this is a lot of evidence that Occam's razor is a useful principle.

See [http://www.idsia.ch/~marcus] for details.
