
Slides in PDF - of Marcus Hutter


Foundations of
Universal Induction
and Intelligent Agents

Marcus Hutter

Canberra, ACT, 0200, Australia
http://www.hutter1.net/

ANU RSISE NICTA


Marcus Hutter - 2 - Universal Induction & Intelligence

Abstract

The dream of creating artificial devices that reach or outperform human
intelligence is many centuries old. This lecture series presents the elegant
parameter-free theory, developed in [Hut05], of an optimal reinforcement
learning agent embedded in an arbitrary unknown environment that possesses
essentially all aspects of rational intelligence. The theory reduces all
conceptual AI problems to pure computational questions.

How to perform inductive inference is closely related to the AI problem. The
lecture series covers Solomonoff's theory, elaborated on in [Hut07], which
solves the induction problem, at least from a philosophical and statistical
perspective.

Both theories are based on Occam's razor quantified by Kolmogorov complexity;
Bayesian probability theory; and sequential decision theory.


Relation between ML & RL & (U)AI

Universal Artificial Intelligence covers all Reinforcement Learning problem types:

• Statistical Machine Learning: mostly i.i.d. data (classification, regression, clustering)
• RL Problems & Algorithms: stochastic, unknown, non-i.i.d. environments
• Artificial Intelligence: traditionally deterministic, known world / planning problem


TABLE OF CONTENTS

1. PHILOSOPHICAL ISSUES
2. BAYESIAN SEQUENCE PREDICTION
3. UNIVERSAL INDUCTIVE INFERENCE
4. UNIVERSAL ARTIFICIAL INTELLIGENCE
5. APPROXIMATIONS AND APPLICATIONS
6. FUTURE DIRECTION, WRAP UP, LITERATURE


PHILOSOPHICAL ISSUES

• Philosophical Problems
• What is (Artificial) Intelligence
• How to do Inductive Inference
• How to Predict (Number) Sequences
• How to make Decisions in Unknown Environments
• Occam's Razor to the Rescue
• The Grue Emerald and Confirmation Paradoxes
• What this Lecture Series is (Not) About


Philosophical Issues: Abstract

I start by considering the philosophical problems concerning artificial
intelligence and machine learning in general and induction in particular. I
illustrate the problems and their intuitive solution on various (classical)
induction examples. The common principle to their solution is Occam's
simplicity principle. Based on Occam's and Epicurus' principles, Bayesian
probability theory, and Turing's universal machine, Solomonoff developed a
formal theory of induction. I describe the sequential/online setup considered
in this lecture series and place it into the wider machine learning context.


What is (Artificial) Intelligence?

Intelligence can have many faces ⇒ formal definition difficult:

• reasoning
• creativity
• association
• generalization
• pattern recognition
• problem solving
• memorization
• planning
• achieving goals
• learning
• optimization
• self-preservation
• vision
• language processing
• classification
• induction
• deduction
• ...

What is AI?    Thinking              Acting
humanly        Cognitive Science     Turing test, Behaviorism
rationally     Laws of Thought       Doing the Right Thing

Collection of 70+ Defs of Intelligence:
http://www.vetta.org/definitions-of-intelligence/

Real world is nasty: partially unobservable, uncertain, unknown, non-ergodic,
reactive, vast, but luckily structured, ...


Informal Definition of (Artificial) Intelligence

Intelligence measures an agent's ability to achieve goals
in a wide range of environments. [S. Legg and M. Hutter]

Emergent: Features such as the ability to learn and adapt, or to understand,
are implicit in the above definition, as these capacities enable an agent to
succeed in a wide range of environments.

The science of Artificial Intelligence is concerned with the construction of
intelligent systems/artifacts/agents and their analysis.

What next? Substantiate all terms above: agent, ability, utility, goal,
success, learn, adapt, environment, ...


On the Foundations of Artificial Intelligence

• Example: Algorithm/complexity theory: The goal is to find fast algorithms
solving problems and to show lower bounds on their computation time.
Everything is rigorously defined: algorithm, Turing machine, problem classes,
computation time, ...

• Most disciplines start with an informal way of attacking a subject. With
time they get more and more formalized, often to a point where they are
completely rigorous. Examples: set theory, logical reasoning, proof theory,
probability theory, infinitesimal calculus, energy, temperature, quantum
field theory, ...

• Artificial Intelligence: Tries to build and understand systems that act
intelligently, learn from experience, make good predictions, are able to
generalize, ... Many terms are only vaguely defined, or there are many
alternative definitions.


Induction → Prediction → Decision → Action

Induction infers general models from specific observations/facts/data,
usually exhibiting regularities or properties or relations in the latter.

Having or acquiring or learning or inducing a model of the environment an
agent interacts with allows the agent to make predictions and utilize them in
its decision process of finding a good next action.

Example
Induction: Find a model of the world economy.
Prediction: Use the model for predicting the future stock market.
Decision: Decide whether to invest assets in stocks or bonds.
Action: Trading large quantities of stocks influences the market.


Example 1: Probability of Sunrise Tomorrow

What is the probability p(1|1^d) that the sun will rise tomorrow?
(d = past number of days the sun rose, 1 = sun rises, 0 = sun will not rise)

• p is undefined, because there has never been an experiment that tested the
existence of the sun tomorrow (reference class problem).
• p = 1, because the sun rose in all past experiments.
• p = 1 − ϵ, where ϵ is the proportion of stars that explode per day.
• p = (d+1)/(d+2), which is Laplace's rule derived from Bayes' rule.
• Derive p from the type, age, size and temperature of the sun, even though
we never observed another star with those exact properties.

Conclusion: We predict that the sun will rise tomorrow with high probability,
independent of the justification.
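Laplace's rule from the fourth bullet is easy to check numerically. A minimal
sketch (illustration only, not part of the slides; exact rational arithmetic
via the standard library):

```python
from fractions import Fraction

def laplace_rule(d: int) -> Fraction:
    """Laplace's rule p = (d+1)/(d+2): probability of success after
    d observed successes and no failures, from Bayes' rule with a
    uniform prior."""
    return Fraction(d + 1, d + 2)

print(laplace_rule(0))    # no data yet -> 1/2
print(laplace_rule(100))  # 101/102, approaching 1 as d grows
```

Note how p → 1 as d → ∞, matching the intuition that more sunrises make
tomorrow's sunrise more credible.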


Example 2: Digits of a Computable Number

• Extend 14159265358979323846264338327950288419716939937

• Looks random!

• Frequency estimate: n = length of sequence, k_i = number of times digit i
occurred ⇒ probability of the next digit being i is k_i/n. Asymptotically
k_i/n → 1/10 (seems to be) true.

• But we have the strong feeling that (i.e. with high probability) the next
digit will be 5, because the previous digits were the expansion of π.

• Conclusion: We prefer answer 5, since we see more structure in the sequence
than just random digits.
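The frequency estimate k_i/n can be computed directly on the digit string from
the slide (a minimal sketch, illustration only):

```python
# Digit string from the slide (digits of pi after "3.")
digits = "14159265358979323846264338327950288419716939937"
n = len(digits)  # 47 digits

# k_i = number of times digit i occurred
k = {str(i): digits.count(str(i)) for i in range(10)}

# frequency estimate: probability of next digit being i is k_i / n
est = {i: c / n for i, c in k.items()}
print(n, est["5"])  # '5' occurs 4 times, so est["5"] = 4/47
```

For a short prefix these estimates are far from 1/10, which is exactly why the
structural answer "next digit of π" is more compelling than pure counting.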


Example 3: Number Sequences

Sequence: x_1, x_2, x_3, x_4, x_5, ...
          1, 2, 3, 4, ?, ...

• x_5 = 5, since x_i = i for i = 1..4.
• x_5 = 29, since x_i = i^4 − 10i^3 + 35i^2 − 49i + 24.

Conclusion: We prefer 5, since the linear relation involves fewer arbitrary
parameters than the 4th-order polynomial.

Sequence: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, ?

• 61, since this is the next prime.
• 60, since this is the order of the next simple group.

Conclusion: We prefer answer 61, since primes are a more familiar concept
than simple groups.

On-Line Encyclopedia of Integer Sequences:
http://www.research.att.com/~njas/sequences/
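Both continuations of 1, 2, 3, 4 really are consistent with the data, which a
one-liner confirms (illustrative sketch only):

```python
def poly(i: int) -> int:
    """The 4th-order polynomial from the slide; it agrees with x_i = i
    on i = 1..4 but continues with 29 instead of 5."""
    return i**4 - 10*i**3 + 35*i**2 - 49*i + 24

print([poly(i) for i in range(1, 6)])  # [1, 2, 3, 4, 29]
print([i for i in range(1, 6)])        # [1, 2, 3, 4, 5]
```

Finitely many data points never uniquely determine the rule; Occam's razor
breaks the tie in favor of the simpler one.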


Occam's Razor to the Rescue

• Is there a unique principle which allows us to formally arrive at a
prediction which
  - coincides (always) with our intuitive guess, or, even better,
  - is (in some sense) most likely the best or correct answer?

• Yes! Occam's razor: Use the simplest explanation consistent with past data
(and use it for prediction).

• Works! For the examples presented and for many more.

• Actually, Occam's razor can serve as a foundation of machine learning in
general, and is even a fundamental principle (or maybe even the mere
definition) of science.

• Problem: Not a formal/mathematical objective principle.
What is simple for one may be complicated for another.


Grue Emerald Paradox

Hypothesis 1: All emeralds are green.
Hypothesis 2: All emeralds found till y2020 are green, thereafter all
emeralds are blue.

• Which hypothesis is more plausible? H1! Justification?
• Occam's razor: take the simplest hypothesis consistent with the data.
It is the most important principle in machine learning and science.


Confirmation Paradox

(i) R → B is confirmed by an R-instance with property B.
(ii) ¬B → ¬R is confirmed by a ¬B-instance with property ¬R.
(iii) Since R → B and ¬B → ¬R are logically equivalent, R → B is also
confirmed by a ¬B-instance with property ¬R.

Example: Hypothesis (o): All ravens are black (R = Raven, B = Black).

(i) Observing a Black Raven confirms Hypothesis (o).
(iii) Observing a White Sock also confirms that all Ravens are Black, since a
White Sock is a non-Raven which is non-Black.

This conclusion sounds absurd! What's the problem?


What This Lecture Series is (Not) About

Dichotomies in Artificial Intelligence & Machine Learning

scope of this lecture series ⇔ scope of other lecture series
(machine) learning ⇔ (GOFAI) knowledge-based
statistical ⇔ logic-based
decision ⇔ prediction ⇔ induction ⇔ action
classification ⇔ regression
sequential / non-i.i.d. ⇔ independent identically distributed
online learning ⇔ offline/batch learning
passive prediction ⇔ active learning
Bayes ⇔ MDL ⇔ Expert ⇔ Frequentist
uninformed / universal ⇔ informed / problem-specific
conceptual/mathematical issues ⇔ computational issues
exact/principled ⇔ heuristic
supervised learning ⇔ unsupervised learning ⇔ RL
exploitation ⇔ exploration


BAYESIAN SEQUENCE PREDICTION

• Sequential/Online Prediction – Setup
• Uncertainty and Probability
• Frequency Interpretation: Counting
• Objective Uncertain Events & Subjective Degrees of Belief
• Bayes' and Laplace's Rules
• The Bayes-mixture distribution
• Predictive Convergence
• Sequential Decisions and Loss Bounds
• Generalization: Continuous Probability Classes
• Summary


Bayesian Sequence Prediction: Abstract

The aim of probability theory is to describe uncertainty. There are various
sources and interpretations of uncertainty. I compare the frequency,
objective, and subjective probabilities, show that they all respect the same
rules, and derive Bayes' and Laplace's famous and fundamental rules. Then I
concentrate on general sequence prediction tasks. I define the Bayes mixture
distribution and show that the posterior converges rapidly to the true
posterior by exploiting some bounds on the relative entropy. Finally I show
that the mixture predictor is also optimal in a decision-theoretic sense
w.r.t. any bounded loss function.


Sequential/Online Prediction – Setup

In sequential or online prediction, for times t = 1, 2, 3, ..., our predictor
p makes a prediction y_t^p ∈ Y based on past observations x_1, ..., x_{t−1}.

Thereafter x_t ∈ X is observed and p suffers Loss(x_t, y_t^p).

The goal is to design predictors with small total or cumulative loss
Loss_{1:T}(p) := ∑_{t=1}^T Loss(x_t, y_t^p).

Applications are abundant, e.g. weather or stock market forecasting.

Example: Y = {umbrella, sunglasses}, X = {sunny, rainy}

Loss(x, y)    sunny   rainy
umbrella       0.1     0.3
sunglasses     0.0     1.0

Setup also includes: Classification and Regression problems.
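The cumulative loss Loss_{1:T}(p) for the umbrella/sunglasses example can be
sketched as follows (a minimal illustration; the observation and prediction
sequences are made up):

```python
# Loss table from the slide: keys are (action y, observation x)
Loss = {("umbrella", "sunny"): 0.1, ("umbrella", "rainy"): 0.3,
        ("sunglasses", "sunny"): 0.0, ("sunglasses", "rainy"): 1.0}

def cumulative_loss(predictions, observations):
    """Loss_{1:T}(p) = sum_t Loss(x_t, y_t^p)."""
    return sum(Loss[(y, x)] for y, x in zip(predictions, observations))

obs   = ["sunny", "rainy", "rainy", "sunny"]        # hypothetical weather
preds = ["sunglasses", "umbrella", "umbrella", "sunglasses"]
print(cumulative_loss(preds, obs))  # 0.0 + 0.3 + 0.3 + 0.0 = 0.6
```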


Uncertainty and Probability

The aim of probability theory is to describe uncertainty.

Sources/interpretations of uncertainty:

• Frequentist: probabilities are relative frequencies.
(e.g. the relative frequency of tossing heads)
• Objectivist: probabilities are real aspects of the world.
(e.g. the probability that some atom decays in the next hour)
• Subjectivist: probabilities describe an agent's degree of belief.
(e.g. it is (im)plausible that extraterrestrials exist)


Frequency Interpretation: Counting

• The frequentist interprets probabilities as relative frequencies.
• If in a sequence of n independent identically distributed (i.i.d.)
experiments (trials) an event occurs k(n) times, the relative frequency of
the event is k(n)/n.
• The limit lim_{n→∞} k(n)/n is defined as the probability of the event.
• For instance, the probability of the event "head" in a sequence of
repeatedly tossing a fair coin is 1/2.
• The frequentist position is the easiest to grasp, but it has several
shortcomings.
• Problems: the definition is circular, limited to i.i.d. data, and suffers
from the reference class problem.
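The relative frequency k(n)/n of heads in n fair coin tosses can be simulated
in a few lines (an illustrative sketch; the seed is arbitrary):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 100_000
# count heads in n simulated fair tosses
k = sum(random.random() < 0.5 for _ in range(n))
print(k / n)  # relative frequency, close to 1/2 for large n
```

The convergence k(n)/n → 1/2 is itself a probabilistic statement (the law of
large numbers), which is one sense in which the frequentist definition is
circular.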


Objective Interpretation: Uncertain Events

• For the objectivist, probabilities are real aspects of the world.
• The outcome of an observation or an experiment is not deterministic, but
involves physical random processes.
• The set Ω of all possible outcomes is called the sample space.
• It is said that an event E ⊂ Ω occurred if the outcome is in E.
• In the case of i.i.d. experiments the probabilities p assigned to events E
should be interpretable as limiting frequencies, but the application is not
limited to this case.
• (Some) probability axioms:
p(Ω) = 1 and p({}) = 0 and 0 ≤ p(E) ≤ 1.
p(A ∪ B) = p(A) + p(B) − p(A ∩ B).
p(B|A) = p(A∩B)/p(A) is the probability of B given that event A occurred.


Subjective Interpretation: Degrees of Belief

• The subjectivist uses probabilities to characterize an agent's degree of
belief in something, rather than to characterize physical random processes.
• This is the most relevant interpretation of probabilities in AI.
• We define the plausibility of an event as the degree of belief in the
event, or the subjective probability of the event.
• It is natural to assume that plausibilities/beliefs Bel(·|·) can be
represented by real numbers, that the rules qualitatively correspond to
common sense, and that the rules are mathematically consistent. ⇒
• Cox's theorem: Bel(·|A) is isomorphic to a probability function p(·|·)
that satisfies the axioms of (objective) probabilities.
• Conclusion: Beliefs follow the same rules as probabilities.


Bayes' Famous Rule

Let D be some possible data (i.e. D is an event with p(D) > 0) and {H_i}_{i∈I}
be a countable complete class of mutually exclusive hypotheses (i.e. the H_i
are events with H_i ∩ H_j = {} ∀i ≠ j and ∪_{i∈I} H_i = Ω).

Given: p(H_i) = a priori plausibility of hypothesis H_i (subj. prob.)
Given: p(D|H_i) = likelihood of data D under hypothesis H_i (obj. prob.)
Goal: p(H_i|D) = a posteriori plausibility of hypothesis H_i (subj. prob.)

Solution: p(H_i|D) = p(D|H_i) p(H_i) / ∑_{i∈I} p(D|H_i) p(H_i)

Proof: From the definition of conditional probability and
∑_{i∈I} p(H_i|D) = 1 ⇒ ∑_{i∈I} p(D|H_i) p(H_i) = ∑_{i∈I} p(H_i|D) p(D) = p(D)
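Bayes' rule over a finite hypothesis class is a three-line computation. A
minimal sketch (the fair-vs-double-headed-coin numbers are a hypothetical
example, not from the slides):

```python
def bayes(prior, likelihood):
    """Posterior p(H_i|D) from priors p(H_i) and likelihoods p(D|H_i),
    normalized by the evidence p(D) = sum_i p(D|H_i) p(H_i)."""
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

# Hypothetical example: is the coin fair or double-headed, given one head?
prior = {"fair": 0.99, "double": 0.01}
likelihood = {"fair": 0.5, "double": 1.0}  # p(heads | H)
post = bayes(prior, likelihood)
print(post)  # belief in "double" roughly doubles after one head
```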


Example: Bayes' and Laplace's Rule

Assume data is generated by a biased coin with head probability θ, i.e.
H_θ := Bernoulli(θ) with θ ∈ Θ := [0, 1].

Finite sequence: x = x_1 x_2 ... x_n with n_1 ones and n_0 zeros.
Sample infinite sequence: ω ∈ Ω = {0, 1}^∞
Basic event: Γ_x = {ω : ω_1 = x_1, ..., ω_n = x_n} = set of all sequences
starting with x.

Data likelihood: p_θ(x) := p(Γ_x|H_θ) = θ^{n_1} (1 − θ)^{n_0}.

Bayes (1763): Uniform prior plausibility: p(θ) := p(H_θ) = 1
(∫_0^1 p(θ) dθ = 1 instead of ∑_{i∈I} p(H_i) = 1)

Evidence: p(x) = ∫_0^1 p_θ(x) p(θ) dθ = ∫_0^1 θ^{n_1} (1 − θ)^{n_0} dθ
= n_1! n_0! / (n_0 + n_1 + 1)!


Example: Bayes' and Laplace's Rule

Bayes: Posterior plausibility of θ after seeing x is:
p(θ|x) = p(x|θ) p(θ) / p(x) = (n+1)!/(n_1! n_0!) · θ^{n_1} (1 − θ)^{n_0}

Laplace: What is the probability of seeing 1 after having observed x?
p(x_{n+1} = 1 | x_1 ... x_n) = p(x1)/p(x) = (n_1 + 1)/(n + 2)

Laplace believed that the sun had risen for 5000 years = 1'826'213 days, so
he concluded that the probability of doomsday tomorrow is 1/1'826'215.
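Laplace's doomsday number follows directly from the rule (n_1+1)/(n+2); a
minimal sketch in exact arithmetic (illustration only):

```python
from fractions import Fraction

def laplace_next_one(n1: int, n: int) -> Fraction:
    """Laplace's rule: p(x_{n+1}=1 | x_1..x_n) = (n_1+1)/(n+2)
    for a binary sequence of length n containing n_1 ones."""
    return Fraction(n1 + 1, n + 2)

d = 1826213                       # Laplace's 5000 years of sunrises, in days
p_rise = laplace_next_one(d, d)   # the sun rose every single day so far
print(1 - p_rise)                 # doomsday probability: 1/1826215
```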


Exercise: Envelope Paradox

• I offer you two closed envelopes; one of them contains twice the amount of
money of the other. You are allowed to pick one and open it. Now you have two
options: keep the money, or decide for the other envelope (which could double
or halve your gain).
• Symmetry argument: It doesn't matter whether you switch; the expected gain
is the same.
• Refutation: With probability p = 1/2, the other envelope contains
twice/half the amount, i.e. if you switch, your expected gain increases by a
factor 1.25 = (1/2)·2 + (1/2)·(1/2).
• Present a Bayesian solution.


Notation: Strings & Probabilities

Strings: x = x_1 x_2 ... x_n with x_t ∈ X and x_{1:n} := x_1 x_2 ... x_{n−1} x_n
and x_{<t} := x_1 ... x_{t−1}.


The Bayes-Mixture Distribution ξ

• Assumption: The true (objective) environment µ is unknown.
• Bayesian approach: Replace the true probability distribution µ by a
Bayes-mixture ξ.
• Assumption: We know that the true environment µ is contained in some known
countable (in)finite set M of environments.
• The Bayes-mixture ξ is defined as
ξ(x) := ∑_{ν∈M} w_ν ν(x)  with  ∑_{ν∈M} w_ν = 1,  w_ν > 0 ∀ν
• The weights w_ν may be interpreted as the prior degree of belief that the
true environment is ν, or k_ν = ln w_ν^{−1} as a complexity penalty (prefix
code length) of environment ν.
• Then ξ(x) could be interpreted as the prior subjective belief probability
in observing x.
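The mixture ξ(x) = ∑_ν w_ν ν(x) is straightforward to compute for a small
class. A minimal sketch with a hypothetical class M of three Bernoulli
environments (the parameters 0.2, 0.5, 0.8 and the uniform weights are made
up for illustration):

```python
def mixture(weights, envs, x):
    """Bayes-mixture xi(x) = sum_nu w_nu * nu(x) over a countable class M."""
    return sum(w * nu(x) for w, nu in zip(weights, envs))

def bernoulli(theta):
    """nu_theta(x) = theta^{#1s} * (1-theta)^{#0s} for a binary string x."""
    return lambda x: theta ** x.count("1") * (1 - theta) ** x.count("0")

envs = [bernoulli(t) for t in (0.2, 0.5, 0.8)]  # hypothetical class M
weights = [1/3, 1/3, 1/3]                       # prior beliefs, sum to 1
print(mixture(weights, envs, "111"))            # (0.2^3 + 0.5^3 + 0.8^3)/3
```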


Relative Entropy

Relative entropy: D(p||q) := ∑_i p_i ln(p_i/q_i)

Properties: D(p||q) ≥ 0 and D(p||q) = 0 ⇔ p = q

Instantaneous relative entropy:
d_t(x_{<t}) := ∑_{x_t} µ(x_t|x_{<t}) ln [µ(x_t|x_{<t}) / ξ(x_t|x_{<t})]

Total relative entropy: D_n := ∑_{t=1}^n E[d_t] ≤ ln w_µ^{−1}


Proof of the Entropy Bound

D_n ≡ ∑_{t=1}^n ∑_{x_{<t}} µ(x_{<t}) d_t(x_{<t})
    = ∑_{x_{1:n}} µ(x_{1:n}) ln [µ(x_{1:n}) / ξ(x_{1:n})]
    ≤ ∑_{x_{1:n}} µ(x_{1:n}) ln [µ(x_{1:n}) / (w_µ µ(x_{1:n}))] = ln w_µ^{−1}

The equality uses the chain rule µ(x_{1:n}) = ∏_{t=1}^n µ(x_t|x_{<t}) (and
similarly for ξ); the inequality uses dominance ξ(x) ≥ w_µ µ(x).


Predictive Convergence

Theorem: ξ(x_t|x_{<t}) − µ(x_t|x_{<t}) → 0 for t → ∞ with µ-probability 1,
i.e. the predictive distribution of ξ converges rapidly to the true
predictive distribution µ (a consequence of the entropy bound D_n ≤ ln w_µ^{−1}).


Sequential Decisions

A prediction is very often the basis for some decision. The decision results
in an action, which itself leads to some reward or loss.

Let Loss(x_t, y_t) ∈ [0, 1] be the received loss when taking action y_t ∈ Y
and x_t ∈ X is the t-th symbol of the sequence.

For instance, decision Y = {umbrella, sunglasses} based on weather forecasts
X = {sunny, rainy}:

Loss          sunny   rainy
umbrella       0.1     0.3
sunglasses     0.0     1.0

The goal is to minimize the µ-expected loss. More generally, we define the
Λ_σ prediction scheme, which minimizes the σ-expected loss:

y_t^{Λ_σ} := arg min_{y_t∈Y} ∑_{x_t} σ(x_t|x_{<t}) Loss(x_t, y_t)
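The Λ_σ action-selection rule can be sketched with the umbrella/sunglasses
table (a minimal illustration; the forecast probabilities passed in are made
up):

```python
X = ["sunny", "rainy"]
Y = ["umbrella", "sunglasses"]
Loss = {("umbrella", "sunny"): 0.1, ("umbrella", "rainy"): 0.3,
        ("sunglasses", "sunny"): 0.0, ("sunglasses", "rainy"): 1.0}

def best_action(sigma):
    """Lambda_sigma: pick the y in Y minimizing the sigma-expected loss
    sum_x sigma(x) * Loss(x, y)."""
    return min(Y, key=lambda y: sum(sigma[x] * Loss[(y, x)] for x in X))

print(best_action({"sunny": 0.9, "rainy": 0.1}))  # sunglasses (0.10 < 0.12)
print(best_action({"sunny": 0.5, "rainy": 0.5}))  # umbrella   (0.20 < 0.50)
```

With a confident sunny forecast the cheap risk of sunglasses wins; with an
uncertain forecast the safe umbrella wins, exactly the trade-off the loss
table encodes.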


Loss Bounds

• Definition: µ-expected loss when Λ_σ predicts the t-th symbol:
Loss_t(Λ_σ)(x_{<t}) := ∑_{x_t} µ(x_t|x_{<t}) Loss(x_t, y_t^{Λ_σ})


Proof of Instantaneous Loss Bounds

Abbreviations: X = {1, ..., N}, N = |X|, i = x_t, y_i = µ(x_t = i|x_{<t}).


Generalization: Continuous Classes M

In statistical parameter estimation one often has a continuous hypothesis
class (e.g. a Bernoulli(θ) process with unknown θ ∈ [0, 1]).

M := {ν_θ : θ ∈ IR^d},  ξ(x) := ∫_{IR^d} dθ w(θ) ν_θ(x),  ∫_{IR^d} dθ w(θ) = 1

Under weak regularity conditions [CB90, H'03]:

Theorem: D_n(µ||ξ) ≤ ln w(µ)^{−1} + (d/2) ln(n/(2π)) + O(1)

where O(1) depends on the local curvature (parametric complexity) of ln ν_θ,
and is independent of n for many reasonable classes, including all stationary
(k-th order) finite-state Markov processes (k = 0 is i.i.d.).

D_n ∝ log(n) = o(n) still implies excellent prediction and decision for most n.
[RH'07]


Bayesian Sequence Prediction: Summary

• The aim of probability theory is to describe uncertainty.
• Various sources and interpretations of uncertainty:
frequency, objective, and subjective probabilities.
• They all respect the same rules.
• General sequence prediction: Use the known (subj.) Bayes mixture
ξ = ∑_{ν∈M} w_ν ν in place of the unknown (obj.) true distribution µ.
• Bound on the relative entropy between ξ and µ
⇒ posterior of ξ converges rapidly to the true posterior µ.
• ξ is also optimal in a decision-theoretic sense w.r.t. any bounded loss
function.
• No structural assumptions on M and ν ∈ M.


UNIVERSAL INDUCTIVE INFERENCE

• Foundations of Universal Induction
• Bayesian Sequence Prediction and Confirmation
• Convergence and Decisions
• How to Choose the Prior – Universal
• Kolmogorov Complexity
• How to Choose the Model Class – Universal
• The Problem of Zero Prior
• Reparametrization and Regrouping Invariance
• The Problem of Old Evidence / New Theories
• Universal is Better than Continuous Class
• More Bounds / Stuff / Critique / Problems
• Summary / Outlook / Literature


Marcus Hutter - 40 - Universal Induction & Intelligence

Universal Inductive Inference: Abstract

Solomonoff completed the Bayesian framework by providing a rigorous,
unique, formal, and universal choice for the model class and the prior. I
will discuss in breadth how and in which sense universal (non-i.i.d.)
sequence prediction solves various (philosophical) problems of traditional
Bayesian sequence prediction. I show that Solomonoff's model possesses
many desirable properties: strong total and weak instantaneous bounds,
and, in contrast to most classical continuous prior densities, it has no zero
p(oste)rior problem, i.e. it can confirm universal hypotheses, is
reparametrization and regrouping invariant, and avoids the old-evidence
and updating problem. It even performs well (actually better) in
non-computable environments.


Marcus Hutter - 41 - Universal Induction & Intelligence

Induction Examples

Sequence prediction: Predict weather/stock-quote/... tomorrow, based
on the past sequence. Continue an IQ test sequence like 1,4,9,16,...

Classification: Predict whether an email is spam.
Classification can be reduced to sequence prediction.

Hypothesis testing/identification: Does treatment X cure cancer?
Do observations of white swans confirm that all ravens are black?

These are instances of the important problem of inductive inference or
time-series forecasting or sequence prediction.

Problem: Finding prediction rules for every particular (new) problem is
possible but cumbersome and prone to disagreement or contradiction.

Goal: A single, formal, general, complete theory for prediction.

Beyond induction: active/reward learning, function optimization, game theory.


Marcus Hutter - 42 - Universal Induction & Intelligence

Foundations of Universal Induction

Ockham's razor (simplicity) principle
  Entities should not be multiplied beyond necessity.

Epicurus' principle of multiple explanations
  If more than one theory is consistent with the observations, keep
  all theories.

Bayes' rule for conditional probabilities
  Given the prior belief/probability, one can predict all future probabilities.

Turing's universal machine
  Everything computable by a human using a fixed procedure can
  also be computed by a (universal) Turing machine.

Kolmogorov's complexity
  The complexity or information content of an object is the length
  of its shortest description on a universal Turing machine.

Solomonoff's universal prior = Ockham + Epicurus + Bayes + Turing
  Solves the question of how to choose the prior if nothing is known.
  ⇒ universal induction, formal Occam, AIT, MML, MDL, SRM, ...


Marcus Hutter - 43 - Universal Induction & Intelligence

Bayesian Sequence Prediction and Confirmation

• Assumption: Sequence ω ∈ X^∞ is sampled from the "true"
  probability measure µ, i.e. µ(x) := P[x|µ] is the µ-probability that
  ω starts with x ∈ X^n.

• Model class: We assume that µ is unknown but known to belong to
  a countable class of environments = models = measures
  M = {ν_1, ν_2, ...}.     [no i.i.d./ergodic/stationary assumption]

• Hypothesis class: {H_ν : ν ∈ M} forms a mutually exclusive and
  complete class of hypotheses.

• Prior: w_ν := P[H_ν] is our prior belief in H_ν

⇒ Evidence: ξ(x) := P[x] = ∑_{ν∈M} P[x|H_ν] P[H_ν] = ∑_ν w_ν ν(x)
  must be our (prior) belief in x.

⇒ Posterior: w_ν(x) := P[H_ν|x] = P[x|H_ν] P[H_ν] / P[x] is our
  posterior belief in ν (Bayes' rule).
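The Bayes'-rule update above is easy to simulate. Below is a minimal sketch with a hypothetical three-element Bernoulli class (the names nu1–nu3 and parameter values are made up): the posterior w_ν(x) rapidly concentrates on the true environment.

```python
import random
random.seed(0)

# Hypothetical model class M: three Bernoulli environments
M = {"nu1": 0.2, "nu2": 0.5, "nu3": 0.8}
w = {name: 1 / 3 for name in M}           # prior w_nu = P[H_nu]
mu = 0.8                                   # true environment (= nu3)

for _ in range(100):
    x = 1 if random.random() < mu else 0
    # posterior w_nu(x) = P[x|H_nu] P[H_nu] / P[x]  (Bayes' rule)
    like = {n: (p if x else 1 - p) for n, p in M.items()}
    xi = sum(w[n] * like[n] for n in M)    # evidence xi(x)
    w = {n: w[n] * like[n] / xi for n in M}

print(w)  # posterior mass concentrates on nu3
```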


Marcus Hutter - 44 - Universal Induction & Intelligence

When is a Sequence Random?

a) Is 0110010100101101101001111011 generated by a fair coin flip?
b) Is 1111111111111111111111111111 generated by a fair coin flip?
c) Is 1100100100001111110110101010 generated by a fair coin flip?
d) Is 0101010101010101010101010101 generated by a fair coin flip?

• Intuitively: (a) and (c) look random, but (b) and (d) look unlikely.

• Problem: Formally (a-d) have equal probability (1/2)^length.

• Classical solution: Consider the hypothesis class H := {Bernoulli(p) :
  p ∈ Θ ⊆ [0,1]} and determine the p for which the sequence has maximum
  likelihood ⇒ (a,c,d) are fair Bernoulli(1/2) coins, (b) is not.

• Problem: (d) is non-random; also (c) is the binary expansion of π.

• Solution: Choose H larger, but how large? Overfitting? MDL?

• AIT Solution: A sequence is random iff it is incompressible.


Marcus Hutter - 45 - Universal Induction & Intelligence

What does Probability Mean?

Naive frequency interpretation is circular:

• Probability of event E is p := lim_{n→∞} k_n(E)/n, where
  n = # i.i.d. trials, k_n(E) = # occurrences of event E in n trials.

• Problem: The limit may be anything (or nothing):
  e.g. a fair coin can give: Head, Head, Head, Head, ... ⇒ p = 1.

• Of course, for a fair coin this sequence is "unlikely".
  For a fair coin, p = 1/2 with "high probability".

• But to make this statement rigorous we need to formally know what
  "high probability" means. Circularity!

Also: In complex domains typical for AI, the sample size is often 1
(e.g. a single non-i.i.d. historic weather data sequence is given).

We want to know whether certain properties hold for this particular sequence.


Marcus Hutter - 46 - Universal Induction & Intelligence

How to Choose the Prior

The probability axioms allow relating probabilities and plausibilities of
different events, but they do not uniquely fix a numerical value for each
event, except for the sure event Ω and the empty event {}.

We need new principles for determining values for at least some basis
events from which others can then be computed.

There seem to be only 3 general principles:

• The principle of indifference — the symmetry principle
• The maximum entropy principle
• Occam's razor — the simplicity principle

Concretely: How shall we choose the hypothesis space {H_i} and its
prior p(H_i) — or — M = {ν} and its weights w_ν?


Marcus Hutter - 47 - Universal Induction & Intelligence

Indifference or Symmetry Principle

Assign the same probability to all hypotheses:

p(H_i) = 1/|I| for finite I

p(H_θ) = [Vol(Θ)]^{-1} for compact and measurable Θ.

⇒ p(H_i|D) ∝ p(D|H_i) ≙ classical hypothesis testing (max. likelihood).

Prev. example: H_θ = Bernoulli(θ) with p(θ) = 1 for θ ∈ Θ := [0,1].

Problems: Does not work for "large" hypothesis spaces:

(a) A uniform distribution on infinite I = IN or noncompact Θ is not possible!

(b) Reparametrization: θ ❀ f(θ). Uniform in θ is not uniform in f(θ).

Example: "Uniform" distribution on the space of all (binary) sequences {0,1}^∞:
p(x_1...x_n) = (1/2)^n ∀n ∀x_1...x_n ⇒ p(x_{n+1} = 1|x_1...x_n) = 1/2 always!
Inference is thus not possible (No-Free-Lunch myth).

Predictive setting: All we need is p(x).
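The contrast above can be made concrete: the "uniform" prior directly on sequences predicts 1/2 forever, while a Bernoulli class with uniform prior on θ yields Laplace's rule of succession and does learn. A small sketch (function names are made up):

```python
from fractions import Fraction

def p_uniform_seq(history):
    # "Uniform" prior directly on sequences {0,1}^infinity:
    # p(x_{n+1} = 1 | x_1...x_n) = 1/2, no matter what was observed.
    return Fraction(1, 2)

def p_laplace(history):
    # Bernoulli class with uniform prior on theta gives Laplace's rule of
    # succession: p(x_{n+1} = 1 | x_1...x_n) = (#ones + 1) / (n + 2).
    return Fraction(sum(history) + 1, len(history) + 2)

hist = [1] * 20                 # twenty 1s observed
u, lap = p_uniform_seq(hist), p_laplace(hist)
print(u, lap)                   # 1/2 vs 21/22: only the Bernoulli mixture learns
```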


Marcus Hutter - 48 - Universal Induction & Intelligence

Occam's Razor — The Simplicity Principle

• Only Occam's razor (in combination with Epicurus' principle) is
  general enough to assign prior probabilities in every situation.

• The idea is to assign high (subjective) probability to simple events,
  and low probability to complex events.

• Simple events (strings) are more plausible a priori than complex ones.

• This gives (approximately) justice to both Occam's razor and
  Epicurus' principle.


Marcus Hutter - 49 - Universal Induction & Intelligence

Prefix Sets/Codes

String x is a (proper) prefix of y :⇔ ∃ z(≠ ϵ) such that xz = y.

Set P is prefix-free or a prefix code :⇔ no element is a proper
prefix of another.

Example: A self-delimiting code (e.g. P = {0, 10, 11}) is prefix-free.

Kraft Inequality

For a prefix code P we have ∑_{x∈P} 2^{-l(x)} ≤ 1.

Conversely, let l_1, l_2, ... be a countable sequence of natural numbers
such that Kraft's inequality ∑_k 2^{-l_k} ≤ 1 is satisfied. Then there exists
a prefix code P with these lengths of its binary codewords.


Marcus Hutter - 50 - Universal Induction & Intelligence

Proof of the Kraft Inequality

Proof ⇒: Assign to each x ∈ P the interval Γ_x := [0.x, 0.x + 2^{-l(x)}).
The length of interval Γ_x is 2^{-l(x)}.
The intervals are disjoint, since P is prefix-free, hence

∑_{x∈P} 2^{-l(x)} = ∑_{x∈P} Length(Γ_x) ≤ Length([0,1]) = 1

⇐: Idea: Choose l_1, l_2, ... in increasing order. Successively chop off
intervals of lengths 2^{-l_1}, 2^{-l_2}, ... from left to right from [0,1) and
define the left interval boundary as the codeword.
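The ⇐ construction can be carried out directly. A minimal sketch (the helper name prefix_code is made up) that chops [0,1) from left to right and reads off the left boundary's first l bits as the codeword:

```python
def prefix_code(lengths):
    """Build a binary prefix code with the given codeword lengths via the
    interval-chopping construction: sort lengths increasingly, chop [0,1)
    left to right, and use the left boundary's first l bits as the codeword."""
    assert sum(2 ** -l for l in lengths) <= 1, "Kraft inequality violated"
    code, left = [], 0.0
    for l in sorted(lengths):
        # left boundary of the next interval of length 2^-l, as l binary digits
        word = format(int(round(left * 2 ** l)), "b").zfill(l)
        code.append(word)
        left += 2 ** -l
    return code

code = prefix_code([1, 2, 3, 3])
print(code)
# verify: no codeword is a proper prefix of another
ok = all(not b.startswith(a) for a in code for b in code if a != b)
print(ok)
```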


Marcus Hutter - 51 - Universal Induction & Intelligence

Priors from Prefix Codes

• Let Code(H_ν) be a prefix code of hypothesis H_ν.

• Define complexity Kw(ν) := Length(Code(H_ν))

• Choose prior w_ν = p(H_ν) = 2^{-Kw(ν)}
  ⇒ ∑_{ν∈M} w_ν ≤ 1 is a semi-probability (by Kraft).

• How to choose a code and hypothesis space M?

• Praxis: Choose a code which is reasonable for your problem
  and M large enough to contain the true model.

• Theory: Choose a universal code and consider "all" hypotheses ...


Marcus Hutter - 52 - Universal Induction & Intelligence

Kolmogorov Complexity K(x)

K. of string x is the length of the shortest (prefix) program producing x:

K(x) := min_p {l(p) : U(p) = x},   U = universal TM

For non-string objects o (like numbers and functions) we define
K(o) := K(⟨o⟩), where ⟨o⟩ ∈ X^* is some standard code for o.

+ Simple strings like 000...0 have small K,
  irregular (e.g. random) strings have large K.

• The definition is nearly independent of the choice of U.

+ K satisfies most properties an information measure should satisfy.

+ K shares many properties with Shannon entropy but is superior.

− K(x) is not computable, but only semi-computable from above.

Fazit: K is an excellent universal complexity measure,
suitable for quantifying Occam's razor.
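Since K is only semi-computable from above, any real compressor yields an upper bound on K, which suffices to illustrate "simple strings have small K, random strings have large K". A sketch using zlib as a crude stand-in (this is an approximation, not K itself):

```python
import zlib, random

def K_upper(s: bytes) -> int:
    """Crude upper bound on K(s) in bits: length of a zlib compression.
    K itself is incomputable; a compressor only gives an upper bound."""
    return 8 * len(zlib.compress(s, 9))

random.seed(0)
simple = b"0" * 10_000                                      # highly regular
rand = bytes(random.getrandbits(8) for _ in range(10_000))  # irregular

print(K_upper(simple), K_upper(rand))
# the regular string compresses to a few dozen bytes; the random one does not
```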


Marcus Hutter - 53 - Universal Induction & Intelligence

Schematic Graph of Kolmogorov Complexity

Although K(x) is incomputable, we can draw a schematic graph.


Marcus Hutter - 54 - Universal Induction & Intelligence

The Universal Prior

• Quantify the complexity of an environment ν or hypothesis H_ν by
  its Kolmogorov complexity K(ν).

• Universal prior: w_ν = w^U_ν := 2^{-K(ν)} is a decreasing function of
  the model's complexity, and sums to (less than) one.

⇒ D_n ≤ K(µ) ln 2, i.e. the number of ε-deviations of ξ from µ, or of l^Λ_ξ
  from l^Λ_µ, is proportional to the complexity of the environment.

• No other semi-computable prior leads to better prediction (bounds).

• For continuous M, we can assign a (proper) universal prior (not
  density) w^U_θ = 2^{-K(θ)} > 0 for computable θ, and 0 for incomputable θ.

• This effectively reduces M to a discrete class {ν_θ ∈ M : w^U_θ > 0},
  which is typically dense in M.

• This prior has many advantages over the classical prior (densities).


Marcus Hutter - 55 - Universal Induction & Intelligence

The Problem of Zero Prior

= the problem of confirmation of universal hypotheses

Problem: If the prior is zero, then the posterior is necessarily also zero.

Example: Consider the hypothesis H = H_1 that all balls in some urn or
all ravens are black (=1), or that the sun rises every day.

Starting with a prior density such as w(θ) = 1 implies the prior P[H_θ] = 0
for all θ, hence posterior P[H_θ|1...1] = 0, hence H never gets confirmed.

3 non-solutions: define H = {ω = 1^∞} | use a finite population | abandon
strict/logical/all-quantified/universal hypotheses in favor of soft hypotheses.

Solution: Assign non-zero prior to θ = 1 ⇒ P[H|1^n] → 1.

Generalization: Assign non-zero prior to all "special" θ, like 1/2 and 1/6,
which may naturally appear in a hypothesis, like "is the coin or die fair?".

Universal solution: Assign non-zero prior to all computable θ, e.g. w^U_θ = 2^{-K(θ)}.
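The "assign non-zero prior to θ = 1" solution can be verified with a short computation: with prior mass w on θ = 1 and density 1−w spread uniformly over [0,1], the evidence of 1^n under the density part is ∫₀¹ θ^n dθ = 1/(n+1), so P[H|1^n] = w / (w + (1−w)/(n+1)) → 1. A sketch (the function name posterior_H is made up):

```python
from fractions import Fraction

def posterior_H(w, n):
    """P[H | 1^n] for H: theta = 1, with prior mass w on theta = 1 and the
    remaining 1-w spread uniformly over [0,1].
    Evidence of 1^n under the uniform part: integral of theta^n = 1/(n+1)."""
    w = Fraction(w)
    return w / (w + (1 - w) * Fraction(1, n + 1))

# density-only prior (w = 0): H can never be confirmed, no matter the data
zero = posterior_H(0, 1000)
# non-zero prior mass (w = 1/100): P[H | 1^n] -> 1 as n grows
tiny = [posterior_H(Fraction(1, 100), n) for n in (1, 10, 100, 1000)]
print(zero, [float(p) for p in tiny])
```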


Marcus Hutter - 56 - Universal Induction & Intelligence

Reparametrization Invariance

• New parametrization, e.g. ψ = √θ: then the ψ-density
  w'(ψ) = 2√θ w(θ) is no longer uniform if w(θ) = 1 is uniform
  ⇒ the indifference principle is not reparametrization invariant (RIP).

• Jeffreys' and Bernardo's principles satisfy RIP w.r.t. differentiable
  bijective transformations ψ = f^{-1}(θ).

• The universal prior w^U_θ = 2^{-K(θ)} also satisfies RIP w.r.t. simple
  computable f. (within a multiplicative constant)


Marcus Hutter - 57 - Universal Induction & Intelligence

Regrouping Invariance

• Non-bijective transformations:
  E.g. grouping ball colors into categories black/non-black.

• No classical principle is regrouping invariant.

• Regrouping invariance is regarded as a very important and desirable
  property. [Walley's (1996) solution: sets of priors]

• The universal prior w^U_θ = 2^{-K(θ)} is invariant under regrouping, and
  more generally under all simple [computable with complexity O(1)]
  even non-bijective transformations. (within a multiplicative constant)

• Note: Reparametrization and regrouping invariance hold for
  arbitrary classes and are not limited to the i.i.d. case.


Marcus Hutter - 58 - Universal Induction & Intelligence

Universal Choice of Class M

• The larger M, the less restrictive is the assumption µ ∈ M.

• The class M_U of all (semi)computable (semi)measures, although
  only countable, is pretty large, since it includes all valid physics
  theories. Further, ξ_U is semi-computable [ZL70].

• Solomonoff's universal prior M(x) := probability that the output of
  a universal TM U with random input starts with x.

• Formally: M(x) := ∑_{p : U(p)=x*} 2^{-l(p)}, where the sum is over all
  (minimal) programs p for which U outputs a string starting with x.

• M may be regarded as a 2^{-l(p)}-weighted mixture over all
  deterministic environments ν_p. (ν_p(x) = 1 if U(p) = x* and 0 else)

• M(x) coincides with ξ_U(x) within an irrelevant multiplicative constant.
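The program-weighted mixture can be imitated on a toy scale. The sketch below replaces the universal TM U with a made-up toy machine (each "program" simply outputs itself repeated forever), so it is only an analogy for how M(x) = ∑ 2^{-l(p)} assigns higher probability to strings with short generating programs:

```python
from itertools import product

def toy_output(p: str) -> str:
    """Stand-in for U(p): a toy deterministic machine (NOT a universal TM)
    that 'runs' program p by outputting p repeated forever (truncated here)."""
    return (p * 20)[:20]

def M_toy(x: str, max_len: int = 12) -> float:
    """Approximate M(x) = sum of 2^-l(p) over programs p whose output
    starts with x, enumerating all programs up to length max_len."""
    total = 0.0
    for l in range(1, max_len + 1):
        for bits in product("01", repeat=l):
            p = "".join(bits)
            if toy_output(p).startswith(x):
                total += 2.0 ** -l
    return total

# a periodic string has short generating programs and hence much higher M_toy
print(M_toy("010101"), M_toy("011010"))
```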


Marcus Hutter - 59 - Universal Induction & Intelligence

The Problem of Old Evidence / New Theories

• What if some evidence E ≙ x (e.g. Mercury's perihelion advance) is
  known well before the correct hypothesis/theory/model H ≙ µ
  (Einstein's general relativity theory) is found?

• How shall H be added to the Bayesian machinery a posteriori?

• What should the "prior" of H be?

• Should it be the belief in H in a hypothetical counterfactual world
  in which E is not known?

• Can old evidence E confirm H?

• After all, H could simply be constructed/biased/fitted towards
  "explaining" E.


Marcus Hutter - 60 - Universal Induction & Intelligence

Solution of the Old-Evidence Problem

• The universal class M_U and universal prior w^U_ν formally solve this
  problem.

• The universal prior of H is 2^{-K(H)}, independent of M and of
  whether E is known or not.

• Updating M is unproblematic, and even unnecessary when
  starting with M_U, since it includes all hypotheses (including yet
  unknown or unnamed ones) a priori.


Marcus Hutter - 61 - Universal Induction & Intelligence

Universal is Better than Continuous M

• Although ν_θ() and w_θ are incomputable for continuous classes M for
  most θ, ξ() is typically computable. (exactly as for Laplace, or numerically)

⇒ D_n(µ||M) ≤⁺ D_n(µ||ξ) + K(ξ) ln 2 for all µ

• That is, M is superior to all computable mixture predictors ξ based
  on any (continuous or discrete) model class M and weight w(θ),
  save an additive constant K(ξ) ln 2 = O(1), even if the environment µ
  is not computable.

• While D_n(µ||ξ) ∼ (d/2) ln n for all µ ∈ M,
  D_n(µ||M) ≤ K(µ) ln 2 is even finite for computable µ.

Fazit: Solomonoff prediction works also in non-computable environments.


Marcus Hutter - 62 - Universal Induction & Intelligence

Convergence and Loss Bounds

• Total (loss) bounds: ∑_{n=1}^∞ E[h_n] ≤^× K(µ) ln 2, where
  h_t(ω_{<t}) := ∑_{x_t} (√(µ(x_t|ω_{<t})) − √(ξ(x_t|ω_{<t})))² is the
  instantaneous squared Hellinger distance between µ and ξ.


Marcus Hutter - 63 - Universal Induction & Intelligence

More Stuff / Critique / Problems

• Prior knowledge y can be incorporated by using the "subjective" prior
  w^U_{ν|y} = 2^{-K(ν|y)} or by prefixing observation x by y.

• Additive/multiplicative constant fudges and U-dependence are often
  (but not always) harmless.

• Incomputability: K and M can serve as "gold standards" which
  practitioners should aim at, but they have to be (crudely) approximated
  in practice (MDL [Ris89], MML [Wal05], LZW [LZ76], CTW [WSTT95],
  NCD [CV05]).

• The Minimum Description Length Principle:
  M(x) ≈ 2^{-K_U(x)} ≈ 2^{-K_T(x)}. Predicting the y of highest M(y|x) is
  approximately the same as MDL: predict the y of smallest K_T(xy).


Marcus Hutter - 64 - Universal Induction & Intelligence

Universal Inductive Inference: Summary

Universal Solomonoff prediction solves/avoids/meliorates many problems
of (Bayesian) induction. We discussed:

+ general total bounds for generic class, prior, and loss,
+ i.i.d./universal-specific instantaneous and future bounds,
+ the D_n bound for continuous classes,
+ indifference/symmetry principles,
+ the problem of zero p(oste)rior & confirmation of universal hypotheses,
+ reparametrization and regrouping invariance,
+ the problem of old evidence and updating,
+ that M works even in non-computable environments,
+ how to incorporate prior knowledge,
− the prediction of short sequences,
− the constant fudges in all results and the U-dependence,
− M's incomputability and crude practical approximations.


Marcus Hutter - 65 - Universal Induction & Intelligence

Outlook

• Relation to Prediction with Expert Advice
• Relation to the Minimum Description Length (MDL) Principle
• Generalization to Active/Reinforcement Learning (AIXI)

Open Problems

• Prediction of selected bits
• Convergence of M on Martin-Löf random sequences
• Better instantaneous bounds for M
• Future bounds for Bayes (general ξ)


Marcus Hutter - 66 - Universal Induction & Intelligence

UNIVERSAL RATIONAL AGENTS

• Rational agents
• Sequential decision theory
• Reinforcement learning
• Value function
• Universal Bayes mixture and AIXI model
• Self-optimizing and Pareto-optimal policies
• Environmental Classes
• The horizon problem
• Computational Issues


Marcus Hutter - 67 - Universal Induction & Intelligence

Universal Rational Agents: Abstract

Sequential decision theory formally solves the problem of rational agents
in uncertain worlds if the true environmental prior probability distribution
is known. Solomonoff's theory of universal induction formally solves the
problem of sequence prediction for unknown prior distribution.

Here we combine both ideas and develop an elegant parameter-free
theory of an optimal reinforcement learning agent embedded in an
arbitrary unknown environment that possesses essentially all aspects of
rational intelligence. The theory reduces all conceptual AI problems to
pure computational ones.

There are strong arguments that the resulting AIXI model is the most
intelligent unbiased agent possible. Other discussed topics are relations
between problem classes, the horizon problem, and computational issues.


Marcus Hutter - 68 - Universal Induction & Intelligence

The Agent Model

Most if not all AI problems can be formulated within the agent framework.

[Diagram: Agent (program p, with work tape) and Environment (program q,
with work tape) interact in cycles: the agent outputs actions y_1 y_2 y_3 ...,
the environment returns perceptions r_1|o_1, r_2|o_2, r_3|o_3, ...
(reward|observation).]


Marcus Hutter - 69 - Universal Induction & Intelligence

Rational Agents in Deterministic Environments

- p : X* → Y* is the deterministic policy of the agent, p(x_{<k}) = y_{1:k}.


Marcus Hutter - 70 - Universal Induction & Intelligence

Agents in Probabilistic Environments

Given history y_{1:k} x_{<k}, the environment σ assigns probability
σ(x_k | y_{1:k} x_{<k}) to the next perception x_k.


Marcus Hutter - 71 - Universal Induction & Intelligence

Optimal Policy and Value

The σ-optimal policy p^σ := arg max_p V^p_σ maximizes V^p_σ ≤ V*_σ := V^{p^σ}_σ.

Explicit expressions for the action y_k in cycle k of the σ-optimal policy
p^σ and its value V*_σ are

y_k = arg max_{y_k} ∑_{x_k} max_{y_{k+1}} ∑_{x_{k+1}} ... max_{y_m} ∑_{x_m} (r_k + ... + r_m) · σ(x_{k:m} | y_{1:m} x_{<k})
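The alternating max/∑ expression can be evaluated by a simple recursion once σ is known. Below is a minimal sketch with a hypothetical history-independent σ (two actions "a", "b" with made-up reward distributions), so the expectimax collapses cycle by cycle:

```python
# Hypothetical toy environment sigma: per-action percept distributions,
# independent of history, so the expectimax collapses per cycle.
PERCEPTS = {                        # sigma(x_k | y_k) as (probability, reward)
    "a": [(0.5, 2.0), (0.5, 0.0)],  # risky action: expected reward 1.0
    "b": [(1.0, 0.8)],              # safe action: certain reward 0.8
}
ACTIONS = tuple(PERCEPTS)

def V_star(k, m):
    """V*: max over actions of the expected total reward r_k + ... + r_m."""
    if k > m:
        return 0.0
    future = V_star(k + 1, m)
    return max(sum(p * (r + future) for p, r in PERCEPTS[y]) for y in ACTIONS)

def best_action(k, m):
    """The expectimax-optimal action y_k in cycle k."""
    future = V_star(k + 1, m)
    return max(ACTIONS, key=lambda y: sum(p * (r + future) for p, r in PERCEPTS[y]))

print(V_star(1, 5), best_action(1, 5))  # over 5 cycles the risky action wins
```

With a history-dependent σ the recursion is the same, except that V* and the percept probabilities are conditioned on the full history y_{1:k} x_{<k}.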


Marcus Hutter - 72 - Universal Induction & Intelligence

Expectimax Tree/Algorithm

[Figure: expectimax tree for V*_σ, alternating max-nodes over actions y
and expectation (∑)-nodes over perceptions x.]


Marcus Hutter - 73 - Universal Induction & Intelligence

Known environment µ

• Assumption: µ is the true environment in which the agent operates.

• Then policy p^µ is optimal in the sense that no other policy for an
  agent leads to higher µ^AI-expected reward.

• Special choices of µ: deterministic or adversarial environments,
  Markov decision processes (MDPs).

• There is no problem in principle in computing the optimal action y_k as
  long as µ^AI is known and computable and X, Y and m are finite.

• Things drastically change if µ^AI is unknown ...
• Th<strong>in</strong>gs drastically change if µ AI is unknown ...


<strong>Marcus</strong> <strong>Hutter</strong> - 74 - Universal Induction & Intelligence<br />

The Bayes-mixture distribution ξ<br />

Assumption: The true environment µ is unknown.<br />

Bayesian approach: The true probability distribution µ AI is not learned<br />

directly, but is replaced by a Bayes-mixture ξ AI .<br />

Assumption: We know that the true environment µ is conta<strong>in</strong>ed <strong>in</strong> some<br />

known (f<strong>in</strong>ite or countable) set M <strong>of</strong> environments.<br />

The Bayes-mixture ξ is def<strong>in</strong>ed as<br />

ξ(x 1:m |y 1:m ) := ∑<br />

w ν ν(x 1:m |y 1:m )<br />

ν∈M<br />

with<br />

∑<br />

ν∈M<br />

w ν = 1,<br />

w ν > 0 ∀ν<br />

The weights w ν may be <strong>in</strong>terpreted as the prior degree <strong>of</strong> belief that the<br />

true environment is ν.<br />

Then ξ(x 1:m |y 1:m ) could be <strong>in</strong>terpreted as the prior subjective belief<br />

probability <strong>in</strong> observ<strong>in</strong>g x 1:m , given actions y 1:m .


Marcus Hutter - 75 - Universal Induction & Intelligence

Questions of Interest

• It is natural to follow the policy p^ξ which maximizes V^p_ξ.

• If µ is the true environment, the expected reward when following
  policy p^ξ will be V^{p^ξ}_µ.

• The optimal (but infeasible) policy p^µ yields reward V^{p^µ}_µ ≡ V*_µ.

• Are there policies with uniformly larger value than V^{p^ξ}_µ?

• How close is V^{p^ξ}_µ to V*_µ?

• What is the most general class M and weights w_ν?


Marcus Hutter - 76 - Universal Induction & Intelligence

A universal choice of ξ and M

• We have to assume the existence of some structure on the
  environment to avoid the No-Free-Lunch Theorems [Wolpert 96].

• We can only unravel effective structures which are describable by
  (semi)computable probability distributions.

• So we may include all (semi)computable (semi)distributions in M.

• Occam's razor and Epicurus' principle of multiple explanations tell
  us to assign high prior belief to simple environments.

• Using Kolmogorov's universal complexity measure K(ν) for
  environments ν, one should set w_ν ∼ 2^{-K(ν)}, where K(ν) is the
  length of the shortest program on a universal TM computing ν.

• The resulting AIXI model [Hutter:00] is a unification of (Bellman's)
  sequential decision theory and Solomonoff's universal induction theory.


<strong>Marcus</strong> <strong>Hutter</strong> - 77 - Universal Induction & Intelligence<br />

The AIXI Model <strong>in</strong> one L<strong>in</strong>e<br />

complete & essentially unique & limit-computable<br />

AIXI: a_k := arg max_{a_k} ∑_{o_k r_k} ... max_{a_m} ∑_{o_m r_m} [r_k + ... + r_m] ∑_{p : U(p,a_1..a_m)=o_1r_1..o_mr_m} 2^{−length(p)}<br />

k = now, a = action, o = observation, r = reward, U = universal TM, p = program, m = lifespan<br />

AIXI is an elegant mathematical theory <strong>of</strong> AI<br />

Claim: AIXI is the most intelligent environment-independent, i.e. universally optimal, agent possible.<br />

Pro<strong>of</strong>: For formalizations, quantifications, and pro<strong>of</strong>s, see [Hut05].<br />

Applications: Strategic Games, Function Optimization, Supervised<br />

Learn<strong>in</strong>g, Sequence Prediction, Classification, ...<br />

In the follow<strong>in</strong>g we consider generic M and w ν .


<strong>Marcus</strong> <strong>Hutter</strong> - 78 - Universal Induction & Intelligence<br />

Pareto-Optimality of p^ξ<br />

Policy p^ξ is Pareto-optimal in the sense that there is no other policy p with V^p_ν ≥ V^{p^ξ}_ν for all ν ∈ M and strict inequality for at least one ν.<br />

Self-optimizing Policies<br />

Under which circumstances does the value of the universal policy p^ξ converge to the optimum?<br />

V^{p^ξ}_ν → V*_ν for horizon m → ∞ for all ν ∈ M. (1)<br />

The least we must demand from M to have a chance that (1) is true is that there exists some policy p̃ at all with this property, i.e.<br />

∃p̃ : V^{p̃}_ν → V*_ν for horizon m → ∞ for all ν ∈ M. (2)<br />

Main result: (2) ⇒ (1): The necessary condition of the existence of a self-optimizing policy p̃ is also sufficient for p^ξ to be self-optimizing.


<strong>Marcus</strong> <strong>Hutter</strong> - 79 - Universal Induction & Intelligence<br />

Environments w. (Non)Self-Optimiz<strong>in</strong>g Policies


<strong>Marcus</strong> <strong>Hutter</strong> - 80 - Universal Induction & Intelligence<br />

Particularly Interest<strong>in</strong>g Environments<br />

• Sequence Prediction, e.g. weather or stock-market prediction.<br />

Strong result: V*_µ − V^{p^ξ}_µ = O(√(K(µ)/m)), m = horizon.<br />

• Strategic Games: Learn to play well (m<strong>in</strong>imax) strategic zero-sum<br />

games (like chess) or even exploit limited capabilities <strong>of</strong> opponent.<br />

• Optimization: F<strong>in</strong>d (approximate) m<strong>in</strong>imum <strong>of</strong> function with as few<br />

function calls as possible. Difficult exploration versus exploitation<br />

problem.<br />

• Supervised learn<strong>in</strong>g: Learn functions by present<strong>in</strong>g (z, f(z)) pairs<br />

and ask for function values <strong>of</strong> z ′ by present<strong>in</strong>g (z ′ , ) pairs.<br />

Supervised learn<strong>in</strong>g is much faster than re<strong>in</strong>forcement learn<strong>in</strong>g.<br />

AIξ quickly learns to predict, play games, optimize, and learn supervised.


<strong>Marcus</strong> <strong>Hutter</strong> - 81 - Universal Induction & Intelligence<br />

Future Value and the Right Discount<strong>in</strong>g<br />

• Eliminate the arbitrary horizon parameter m by discounting the rewards r_k ❀ γ_k r_k with Γ_k := ∑_{i=k}^∞ γ_i < ∞ and letting m → ∞:<br />

V^{πσ}_{kγ} := lim_{m→∞} (1/Γ_k) ∑_{x_{k:m}} (γ_k r_k + ... + γ_m r_m) σ(x_{k:m}|y_{1:m} x_{&lt;k})


<strong>Marcus</strong> <strong>Hutter</strong> - 82 - Universal Induction & Intelligence<br />

Universal Rational Agents: Summary<br />

• Setup: Agents act<strong>in</strong>g <strong>in</strong> general probabilistic environments with<br />

re<strong>in</strong>forcement feedback.<br />

• Assumptions: True environment µ belongs to a known class <strong>of</strong><br />

environments M, but is otherwise unknown.<br />

• Results: The Bayes-optimal policy p^ξ based on the Bayes-mixture ξ = ∑_{ν∈M} w_ν ν is Pareto-optimal and self-optimizing if M admits self-optimizing policies.<br />

• Application: The class of ergodic MDPs admits self-optimizing policies.<br />

• New: Policy p^ξ with unbounded effective horizon is the first purely Bayesian self-optimizing consistent policy for ergodic MDPs.<br />

• Learn: The combined conditions Γ_k < ∞ and γ_{k+1}/γ_k → 1 allow a consistent self-optimizing Bayes-optimal policy based on mixtures.
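The two conditions in the last bullet can be checked numerically for a concrete discount sequence; a small sketch (the quadratic discount γ_k = 1/k² is my example choice, not taken from the slides):

```python
# Check Γ_k < ∞ and γ_{k+1}/γ_k → 1 for the quadratic discount γ_k = 1/k².

def gamma(k: int) -> float:
    return 1.0 / k**2

# The partial sum Γ_1 = Σ_{i≥1} γ_i approaches π²/6 ≈ 1.645, so Γ_k is finite.
Gamma1 = sum(gamma(i) for i in range(1, 100001))

# The ratio γ_{k+1}/γ_k = (k/(k+1))² tends to 1, giving an unbounded effective
# horizon, unlike geometric discounting γ_k = γ^k where the ratio stays γ < 1.
ratio = gamma(1001) / gamma(1000)
```

Geometric discounting satisfies the first condition but not the second, which is why it is excluded by the bullet above.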


<strong>Marcus</strong> <strong>Hutter</strong> - 83 - Universal Induction & Intelligence<br />

Universal Rational Agents: Remarks<br />

• We have developed a parameterless AI model based on sequential<br />

decisions and algorithmic probability.<br />

• We have reduced the AI problem to pure computational questions.<br />

• AIξ seems not to lack any important known methodology <strong>of</strong> AI,<br />

apart from computational aspects.<br />

• There is no need for implement<strong>in</strong>g extra knowledge, as this can be<br />

learned by present<strong>in</strong>g it <strong>in</strong> o k <strong>in</strong> any form.<br />

• The learn<strong>in</strong>g process itself is an important aspect <strong>of</strong> AI.<br />

• Noise or irrelevant information in the inputs does not disturb the AIξ system.<br />

• Philosophical questions: relevance <strong>of</strong> non-computational physics<br />

(Penrose), number <strong>of</strong> wisdom Ω (Chait<strong>in</strong>), consciousness, social<br />

consequences.


<strong>Marcus</strong> <strong>Hutter</strong> - 84 - Universal Induction & Intelligence<br />

Universal Rational Agents: Outlook<br />

• Cont<strong>in</strong>uous classes M.<br />

• Restricted policy classes.<br />

• Non-asymptotic bounds.<br />

• Tighter bounds by exploiting extra properties of the environments, like the mixing rate of MDPs.<br />

• Search for other performance criteria [<strong>Hutter</strong>:00].<br />

• Instead <strong>of</strong> convergence <strong>of</strong> the expected reward sum, study<br />

convergence with high probability <strong>of</strong> the actually realized reward<br />

sum.<br />

• Other environmental classes (separability concepts, downscal<strong>in</strong>g).


<strong>Marcus</strong> <strong>Hutter</strong> - 85 - Universal Induction & Intelligence<br />

APPROXIMATIONS & APPLICATIONS<br />

• Universal Similarity Metric<br />

• Universal Search<br />

• Time-Bounded AIXI Model<br />

• Brute-Force Approximation <strong>of</strong> AIXI<br />

• A Monte-Carlo AIXI Approximation<br />

• Feature Re<strong>in</strong>forcement Learn<strong>in</strong>g<br />

• Comparison to other approaches<br />

• Future directions, wrap-up, references.


<strong>Marcus</strong> <strong>Hutter</strong> - 86 - Universal Induction & Intelligence<br />

Approximations & Applications: Abstract<br />

Many fundamental theories have to be approximated for practical use.<br />

S<strong>in</strong>ce the core quantities <strong>of</strong> universal <strong>in</strong>duction and universal <strong>in</strong>telligence<br />

are <strong>in</strong>computable, it is <strong>of</strong>ten hard, but not impossible, to approximate<br />

them. In any case, hav<strong>in</strong>g these “gold standards” to approximate<br />

(top→down) or to aim at (bottom→up) is extremely helpful <strong>in</strong> build<strong>in</strong>g<br />

truly intelligent systems. The most impressive direct approximation of Kolmogorov complexity to date is via the universal similarity metric applied to a variety of real-world clustering problems. A couple of universal search algorithms ((adaptive) Levin search, FastPrg, OOPS, Goedel-machine, ...) that find short programs have been developed and applied to a variety of toy problems. The AIXI model itself has been<br />

approximated <strong>in</strong> a couple <strong>of</strong> ways (AIXItl, Brute Force, Monte Carlo,<br />

Feature RL). Some recent applications will be presented. The Lecture<br />

Series concludes by compar<strong>in</strong>g various learn<strong>in</strong>g algorithms along various<br />

dimensions, po<strong>in</strong>t<strong>in</strong>g to future directions, wrap-up, and references.


<strong>Marcus</strong> <strong>Hutter</strong> - 87 - Universal Induction & Intelligence<br />

Conditional Kolmogorov Complexity<br />

Question: When is object=string x similar to object=string y?<br />

Universal solution: x similar to y ⇔ x can be easily (re)constructed from y<br />

⇔ Kolmogorov complexity K(x|y) := min{l(p) : U(p, y) = x} is small<br />

Examples:<br />

1) x is very similar to itself (K(x|x) = 0 up to an additive constant)<br />

2) A processed x is similar to x (K(f(x)|x) = 0 up to an additive constant if K(f) = O(1)), e.g. doubling, reverting, inverting, encrypting, partially deleting x.<br />

3) A random string is with high probability not similar to any other string (K(random|y) ≈ length(random)).<br />

The problem with K(x|y) as similarity=distance measure is that it is<br />

neither symmetric nor normalized nor computable.


<strong>Marcus</strong> <strong>Hutter</strong> - 88 - Universal Induction & Intelligence<br />

The Universal Similarity Metric [CV’05]<br />

• Symmetrization and normalization leads to a/the universal metric d:<br />

0 ≤ d(x, y) := max{K(x|y), K(y|x)} / max{K(x), K(y)} ≤ 1<br />

• Every effective similarity between x and y is detected by d<br />

• Use K(x|y) ≈ K(xy) − K(y) and K(x) ≡ K_U(x) ≈ K_T(x) (coding T)<br />

=⇒ computable approximation: Normalized compression distance:<br />

d(x, y) ≈ [K_T(xy) − min{K_T(x), K_T(y)}] / max{K_T(x), K_T(y)} ≤ 1<br />

• For T choose Lempel-Ziv or gzip or bzip(2) (de)compressor <strong>in</strong> the<br />

applications below.<br />

• Theory: Lempel-Ziv compresses asymptotically better than any<br />

probabilistic f<strong>in</strong>ite state automaton predictor/compressor.
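The normalized compression distance above is easy to instantiate with an off-the-shelf compressor; a minimal sketch with gzip standing in for the compressor T (function names are mine, not from [CV'05]):

```python
import gzip

def C(s: bytes) -> int:
    """Compressed length: approximates K_T(s) by the gzip output size."""
    return len(gzip.compress(s, compresslevel=9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance d(x, y) from the slide."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Real compressors violate the idealized K inequalities slightly, so ncd can stray a little outside [0, 1]; the relative ordering of distances is what the clustering applications below rely on.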


<strong>Marcus</strong> <strong>Hutter</strong> - 89 - Universal Induction & Intelligence<br />

Tree-Based Cluster<strong>in</strong>g [CV’05]<br />

• If many objects x 1 , ..., x n need to be compared, determ<strong>in</strong>e the<br />

Similarity matrix: M ij = d(x i , x j ) for 1 ≤ i, j ≤ n<br />

• Now cluster similar objects.<br />

• There are various cluster<strong>in</strong>g techniques.<br />

• Tree-based cluster<strong>in</strong>g: Create a tree connect<strong>in</strong>g similar objects,<br />

• e.g. quartet method (for cluster<strong>in</strong>g)<br />

• Applications: Phylogeny <strong>of</strong> 24 Mammal mtDNA,<br />

50 Language Tree (based on declaration <strong>of</strong> human rights),<br />

composers <strong>of</strong> music, authors <strong>of</strong> novels, SARS virus, fungi,<br />

optical characters, galaxies, ...<br />

[Cilibrasi&Vitanyi’05]
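The matrix step can be sketched directly on top of any distance function d; a toy version with a simple nearest-pair lookup (pure Python, not the quartet method of [Cilibrasi&Vitanyi'05]):

```python
def similarity_matrix(objects, d):
    """M[i][j] = d(x_i, x_j) for all pairs, as on the slide."""
    n = len(objects)
    return [[d(objects[i], objects[j]) for j in range(n)] for i in range(n)]

def closest_pair(M):
    """Indices (i, j) with i < j of the most similar pair of objects."""
    n = len(M)
    return min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: M[ij[0]][ij[1]])
```

Repeatedly merging the closest pair is the simplest agglomerative scheme; tree-based methods such as the quartet method refine this idea.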


<strong>Marcus</strong> <strong>Hutter</strong> - 90 - Universal Induction & Intelligence<br />

Genomics & Phylogeny: Mammals [CV’05]<br />

Evolutionary tree built from complete mammalian mtDNA <strong>of</strong> 24 species:<br />

[Tree figure] Species: Carp, Cow, BlueWhale, FinbackWhale, Cat, BrownBear, PolarBear, GreySeal, HarborSeal, Horse, WhiteRhino, Gibbon, Gorilla, Human, Chimpanzee, PygmyChimp, Orangutan, SumatranOrangutan, HouseMouse, Rat, Opossum, Wallaroo, Echidna, Platypus.<br />

Recovered groups: Ferungulates, Eutheria, Primates, Eutheria - Rodents, Metatheria, Prototheria.


Language Tree (Re)construction<br />

based on “The Universal Declaration <strong>of</strong><br />

Human Rights” <strong>in</strong> 50 languages.<br />

[Tree figure] Recovered families: ROMANCE, GERMANIC, SLAVIC, BALTIC, ALTAIC, CELTIC, UGROFINNIC.<br />

Languages: Basque [Spain], Hungarian [Hungary], Polish [Poland], Sorbian [Germany], Slovak [Slovakia], Czech [Czech Rep], Slovenian [Slovenia], Serbian [Serbia], Bosnian [Bosnia], Croatian [Croatia], Romani Balkan [East Europe], Albanian [Albany], Lithuanian [Lithuania], Latvian [Latvia], Turkish [Turkey], Uzbek [Utzbekistan], Breton [France], Maltese [Malta], OccitanAuvergnat [France], Walloon [Belgique], English [UK], French [France], Asturian [Spain], Portuguese [Portugal], Spanish [Spain], Galician [Spain], Catalan [Spain], Occitan [France], Rhaeto Romance [Switzerland], Friulian [Italy], Italian [Italy], Sammarinese [Italy], Corsican [France], Sardinian [Italy], Romanian [Romania], Romani Vlach [Macedonia], Welsh [UK], Scottish Gaelic [UK], Irish Gaelic [UK], German [Germany], Luxembourgish [Luxembourg], Frisian [Netherlands], Dutch [Netherlands], Afrikaans, Swedish [Sweden], Norwegian Nynorsk [Norway], Danish [Denmark], Norwegian Bokmal [Norway], Faroese [Denmark], Icelandic [Iceland], Finnish [Finland], Estonian [Estonia].


<strong>Marcus</strong> <strong>Hutter</strong> - 92 - Universal Induction & Intelligence<br />

Universal Search<br />

• Lev<strong>in</strong> search: Fastest algorithm for<br />

<strong>in</strong>version and optimization problems.<br />

• Theoretical application:<br />

Assume somebody found a non-constructive<br />

pro<strong>of</strong> <strong>of</strong> P=NP, then Lev<strong>in</strong>-search is a polynomial<br />

time algorithm for every NP (complete) problem.<br />

• Practical (OOPS) applications (J. Schmidhuber)<br />

Maze, Towers of Hanoi, robotics, ...<br />

• FastPrg: The asymptotically fastest and shortest algorithm for all<br />

well-def<strong>in</strong>ed problems.<br />

• AIXItl: Computable variant <strong>of</strong> AIXI.<br />

• Human Knowledge Compression Prize: (50’000 €)


<strong>Marcus</strong> <strong>Hutter</strong> - 93 - Universal Induction & Intelligence<br />

The Time-Bounded AIXI Model (AIXItl)<br />

An algorithm p best has been constructed for which the follow<strong>in</strong>g holds:<br />

• Let p be any (extended chronological) policy<br />

• with length l(p) ≤ l̃ and computation time per cycle t(p) ≤ t̃<br />

• for which there exists a proof of length ≤ l_P that p is a valid approximation.<br />

• Then an algorithm p^best can be constructed, depending on l̃, t̃ and l_P but not on knowing p,<br />

• which is effectively more or equally intelligent according to ≽^c than any such p.<br />

• The size of p^best is l(p^best) = O(ln(l̃·t̃·l_P)),<br />

• the setup-time is t_setup(p^best) = O(l_P²·2^{l_P}),<br />

• the computation time per cycle is t_cycle(p^best) = O(2^{l̃}·t̃).


<strong>Marcus</strong> <strong>Hutter</strong> - 94 - Universal Induction & Intelligence<br />

Brute-Force Approximation <strong>of</strong> AIXI<br />

• Truncate expectimax tree depth to a small fixed lookahead h. Optimal action computable in time |Y × X|^h × time to evaluate ξ.<br />

• Consider mixture over Markov Decision Processes (MDP) only, i.e. ξ(x_{1:m}|y_{1:m}) = ∑_{ν∈M} w_ν ∏_{t=1}^m ν(x_t|x_{t−1} y_t). Note: ξ is not MDP.<br />

• Choose uniform prior over w_ν. Then ξ(x_{1:m}|y_{1:m}) can be computed in linear time.<br />

• Consider (approximately) Markov problems with very small action and perception space.<br />

• Example application: 2×2 Matrix Games like Prisoner's Dilemma, Stag Hunt, Chicken, Battle of Sexes, and Matching Pennies. [PH’06]
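The truncated expectimax computation from the first bullet can be sketched recursively; a minimal version (the model interface `xi` and the tiny percept space are illustrative assumptions, not the MDP mixture of the slide):

```python
def expectimax(xi, hist, depth, actions, percepts):
    """Value of the best action under model xi with lookahead `depth`.

    xi(hist, a, o, r) returns the predictive probability of percept (o, r)
    after history `hist` and action a.  Runtime is O(|actions x percepts|^depth),
    matching the |Y x X|^h factor on the slide.
    """
    if depth == 0:
        return 0.0
    return max(
        sum(p * (r + expectimax(xi, hist + [(a, o, r)], depth - 1,
                                actions, percepts))
            for (o, r) in percepts
            if (p := xi(hist, a, o, r)) > 0)
        for a in actions)
```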


<strong>Marcus</strong> <strong>Hutter</strong> - 95 - Universal Induction & Intelligence<br />

AIXI Learns to Play 2×2 Matrix Games<br />

• Repeated prisoner's dilemma. Loss matrix.<br />

• Game unknown to AIXI. Must be learned as well.<br />

• AIXI behaves appropriately.<br />

[Plot: average per-round cooperation ratio over rounds t = 0..100 for AIXI vs. random, tit4tat, 2-tit4tat, 3-tit4tat, AIXI, and AIXI2]


<strong>Marcus</strong> <strong>Hutter</strong> - 96 - Universal Induction & Intelligence<br />

A Monte-Carlo AIXI Approximation<br />

Consider class <strong>of</strong> Variable-Order Markov Decision Processes.<br />

The Context Tree Weight<strong>in</strong>g (CTW) algorithm can efficiently mix<br />

(exactly <strong>in</strong> essentially l<strong>in</strong>ear time) all prediction suffix trees.<br />

Monte-Carlo approximation of expectimax tree [figure: tree over actions a1–a3 and observations o1–o4]:<br />

Upper Confidence Tree (UCT) algorithm:<br />

• Sample observations from CTW distribution.<br />

• Select actions with highest upper confidence bound on the future reward estimate.<br />

• Expand tree by one leaf node (per trajectory).<br />

• Simulate from leaf node further down us<strong>in</strong>g (fixed) playout policy.<br />

• Propagate back the value estimates for each node.<br />

Repeat until timeout.<br />

[VNHS’09]<br />

Guaranteed to converge to exact value.<br />

Extension: Predicate CTW not based on raw obs. but features there<strong>of</strong>.
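The action-selection step above can be sketched with the standard UCB rule; a fragment (the exploration constant C and the statistics dictionaries are illustrative, not MC-AIXI-CTW's exact tuning):

```python
import math

def ucb_action(actions, visits, value, total, C=1.414):
    """Pick the action maximizing value + C*sqrt(ln(N)/n); unvisited first.

    visits[a] = times action a was tried at this node, value[a] = its mean
    future-reward estimate, total = total visit count N of the node.
    """
    for a in actions:                      # try every action at least once
        if visits.get(a, 0) == 0:
            return a
    return max(actions, key=lambda a: value[a]
               + C * math.sqrt(math.log(total) / visits[a]))
```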


<strong>Marcus</strong> <strong>Hutter</strong> - 97 - Universal Induction & Intelligence<br />

Monte-Carlo AIXI Applications<br />

[Plot "Normalized Learning Scalability": normalized average reward per trial (0 to 1, Optimum = 1) vs. experience (100 to 1,000,000 trials) for the domains Tiger, 4x4 Grid, 1d Maze, Extended Tiger, TicTacToe, Cheese Maze, Pocman*.] [Joel Veness et al. 2009]


<strong>Marcus</strong> <strong>Hutter</strong> - 98 - Universal Induction & Intelligence<br />

Feature Re<strong>in</strong>forcement Learn<strong>in</strong>g (FRL)<br />

Goal: Develop efficient general purpose <strong>in</strong>telligent agent.<br />

State-<strong>of</strong>-the-art: (a) AIXI: Incomputable theoretical solution.<br />

(b) MDP: Efficient limited problem class.<br />

(c) POMDP: Notoriously difficult. (d) PSRs: Underdeveloped.<br />

Idea: ΦMDP reduces real problem to MDP automatically by learn<strong>in</strong>g.<br />

Accomplishments so far: (i) Criterion for evaluat<strong>in</strong>g quality <strong>of</strong> reduction.<br />

(ii) Integration <strong>of</strong> the various parts <strong>in</strong>to one learn<strong>in</strong>g algorithm.<br />

(iii) Generalization to structured MDPs (DBNs)<br />

ΦMDP is a promising path towards the grand goal & an alternative to (a)-(d)<br />

Problem: F<strong>in</strong>d reduction Φ efficiently (generic optimization problem)


<strong>Marcus</strong> <strong>Hutter</strong> - 99 - Universal Induction & Intelligence<br />

Markov Decision Processes (MDPs)<br />

a computationally tractable class <strong>of</strong> problems<br />

• MDP Assumption: State s_t := o_t and r_t are probabilistic functions of o_{t−1} and a_{t−1} only.<br />

• Further Assumption: State=observation space S is finite and small.<br />

[Figure: example MDP with four states s1–s4 connected by transitions carrying rewards r1–r4]<br />
• Goal: Maximize long-term expected reward.<br />

• Learn<strong>in</strong>g: Probability distribution is unknown but can be learned.<br />

• Exploration: Optimal exploration is <strong>in</strong>tractable<br />

but there are polynomial approximations.<br />

• Problem: Real problems are not <strong>of</strong> this simple form.
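On a known MDP of this kind, "maximize long-term expected reward" is solved by iterating the Bellman optimality update; a minimal sketch on a hypothetical two-state chain (my example, not the four-state MDP of the slide):

```python
def value_iteration(T, R, gamma=0.9, eps=1e-8):
    """T[s][a] = list of (prob, s'); R[s][a] = expected reward. Returns V*.

    Iterates the Bellman optimality update
        V(s) <- max_a [ R(s,a) + gamma * sum_{s'} T(s'|s,a) * V(s') ]
    until the sup-norm change drops below eps.
    """
    V = {s: 0.0 for s in T}
    while True:
        V_new = {s: max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                        for a in T[s])
                 for s in T}
        if max(abs(V_new[s] - V[s]) for s in T) < eps:
            return V_new
        V = V_new
```

When the transition probabilities are unknown, as in the learning bullet above, they are first estimated from experience and the same update is run on the estimates.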


<strong>Marcus</strong> <strong>Hutter</strong> - 100 - Universal Induction & Intelligence<br />

Map Real Problem to MDP<br />

Map history h_t := o_1 a_1 r_1 ... o_{t−1} to state s_t := Φ(h_t), for example:<br />

Games: Full-information with static opponent: Φ(h_t) = o_t.<br />

Classical physics: Position+velocity of objects = position at two time-slices: s_t = Φ(h_t) = o_t o_{t−1} is (2nd order) Markov.<br />

I.i.d. processes of unknown probability (e.g. clinical trials ≃ Bandits): Frequency of obs. Φ(h_n) = (∑_{t=1}^n δ_{o_t o})_{o∈O} is a sufficient statistic.<br />

Identity: Φ(h) = h is always sufficient, but not learnable.<br />

Find/Learn Map Automatically<br />

Φ_best := arg min_Φ Cost(Φ|h_t)<br />

• What is the best map/MDP? (i.e. what is the right Cost criterion?)<br />

• Is the best MDP good enough? (i.e. is reduction always possible?)<br />

• How to find the map Φ (i.e. minimize Cost) efficiently?
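The first and third example maps are one-liners in code; a small sketch (the history is simplified to a plain observation list, and the names are mine):

```python
from collections import Counter

def phi_last_obs(history):
    """Games with a static opponent: Φ(h_t) = o_t, the latest observation."""
    return history[-1]

def phi_freq(history):
    """I.i.d. case: the observation frequencies are a sufficient statistic."""
    return Counter(history)
```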


<strong>Marcus</strong> <strong>Hutter</strong> - 101 - Universal Induction & Intelligence<br />

ΦMDP: Computational Flow<br />

[Flow diagram] Environment → (observation o, reward r) → History h → frequency estimate → Transition Pr. T̂ and Reward est. R̂ → exploration bonus → T̂^e, R̂^e → Bellman → Value (Q̂) → Best Policy p̂ → action a → Environment. The Feature Vec. Φ̂ is chosen by (implicit) Cost(Φ|h) minimization.


<strong>Marcus</strong> <strong>Hutter</strong> - 102 - Universal Induction & Intelligence<br />

Intelligent Agents <strong>in</strong> Perspective<br />

[Diagram, top to bottom:]<br />

Universal AI (AIXI)<br />

Feature RL (ΦMDP/DBN/..)<br />

Information – Learning – Planning – Complexity<br />

Search – Optimization – Computation – Logic – KR<br />

Agents = General Framework, Interface = Robots, Vision, Language


<strong>Marcus</strong> <strong>Hutter</strong> - 103 - Universal Induction & Intelligence<br />

Properties <strong>of</strong> Learn<strong>in</strong>g Algorithms<br />

Comparison <strong>of</strong> AIXI to Other Approaches<br />

Rows: algorithm; columns (properties, in order): time efficient, global optimum, data efficient, exploration, convergence, generalization, POMDP, learning, active.<br />

Value/Policy iteration yes/no yes – YES YES NO NO NO yes<br />

TD w. func.approx. no/yes NO NO no/yes NO YES NO YES YES<br />

Direct Policy Search no/yes YES NO no/yes NO YES no YES YES<br />

Logic Planners yes/no YES yes YES YES no no YES yes<br />

RL with Split Trees yes YES no YES NO yes YES YES YES<br />

Pred.w. Expert Advice yes/no YES – YES yes/no yes NO YES NO<br />

OOPS yes/no no – yes yes/no YES YES YES YES<br />

Market/Economy RL yes/no no NO no no/yes yes yes/no YES YES<br />

SPXI no YES – YES YES YES NO YES NO<br />

AIXI NO YES YES yes YES YES YES YES YES<br />

AIXItl no/yes YES YES YES yes YES YES YES YES<br />

MC-AIXI-CTW yes/no yes YES YES yes NO yes/no YES YES<br />

Feature RL yes/no YES yes yes yes yes yes YES YES<br />

Human yes yes yes no/yes NO YES YES YES YES


<strong>Marcus</strong> <strong>Hutter</strong> - 104 - Universal Induction & Intelligence<br />

Mach<strong>in</strong>e Intelligence Tests & Def<strong>in</strong>itions<br />

⋆ = yes, · = no, • = debatable, ? = unknown.<br />

Columns (in order): Intelligence Test, Valid, Informative, Wide Range, General, Dynamic, Unbiased, Fundamental, Formal, Objective, Fully Defined, Universal, Practical, Test vs. Def.<br />

Tur<strong>in</strong>g Test • · · · • · · · · • · • T<br />

Total Tur<strong>in</strong>g Test • · · · • · · · · • · · T<br />

Inverted Tur<strong>in</strong>g Test • • · · • · · · · • · • T<br />

Toddler Tur<strong>in</strong>g Test • · · · • · · · · · · • T<br />

L<strong>in</strong>guistic Complexity • ⋆ • · · · · • • · • • T<br />

Text Compression Test • ⋆ ⋆ • · • • ⋆ ⋆ ⋆ • ⋆ T<br />

Tur<strong>in</strong>g Ratio • ⋆ ⋆ ⋆ · T/D<br />

Psychometric AI ⋆ ⋆ • ⋆ • · • • • · • T/D<br />

Smith’s Test • ⋆ ⋆ • · ⋆ ⋆ ⋆ · • T/D<br />

C-Test • ⋆ ⋆ • · ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ T/D<br />

AIXI ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ · D


<strong>Marcus</strong> <strong>Hutter</strong> - 105 - Universal Induction & Intelligence<br />

Common Criticisms<br />

• AIXI is obviously wrong.<br />

(<strong>in</strong>telligence cannot be captured <strong>in</strong> a few simple equations)<br />

• AIXI is obviously correct. (everybody already knows this)<br />

• Assum<strong>in</strong>g that the environment is computable is too strong.<br />

• All standard objections to strong AI also apply to AIXI.<br />

(free will, lookup table, Lucas/Penrose Goedel argument)<br />

• AIXI doesn’t deal with X or cannot do X.<br />

(X = consciousness, creativity, imag<strong>in</strong>ation, emotion, love, soul, etc.)<br />

• AIXI is not <strong>in</strong>telligent because it cannot choose its goals.<br />

• Universal AI is impossible due to the No-Free-Lunch theorem.<br />

See [Legg:08] for refutations <strong>of</strong> these and more criticisms.


<strong>Marcus</strong> <strong>Hutter</strong> - 106 - Universal Induction & Intelligence<br />

General Murky & Quirky AI Questions<br />

• Does current mainstream AI research have anything to do with AI?<br />

• Are sequential decision and algorithmic probability theory all we need to well-define AI?<br />

• What is (Universal) AI theory good for?<br />

• What are robots good for in AI?<br />

• Is intelligence a fundamentally simple concept? (compare with fractals or physics theories)<br />

• What can we (not) expect from super-intelligent agents?<br />

• Is maximizing the expected reward the right criterion?<br />

• Isn't universal learning impossible due to the NFL theorems?


<strong>Marcus</strong> <strong>Hutter</strong> - 107 - Universal Induction & Intelligence<br />

Next Steps<br />

• Address the many open theoretical questions (see <strong>Hutter</strong>:05).<br />

• Bridge the gap between (Universal) AI theory and AI practice.<br />

• Explore what role logical reason<strong>in</strong>g, knowledge representation,<br />

vision, language, etc. play <strong>in</strong> Universal AI.<br />

• Determ<strong>in</strong>e the right discount<strong>in</strong>g <strong>of</strong> future rewards.<br />

• Develop the right nurtur<strong>in</strong>g environment for a learn<strong>in</strong>g agent.<br />

• Consider embodied agents (e.g. <strong>in</strong>ternal↔external reward)<br />

• Analyze AIXI <strong>in</strong> the multi-agent sett<strong>in</strong>g.


<strong>Marcus</strong> <strong>Hutter</strong> - 108 - Universal Induction & Intelligence<br />

The Big Questions<br />

• Is non-computational physics relevant to AI? [Penrose]<br />

• Could something like the number of wisdom Ω prevent a simple solution to AI? [Chaitin]<br />

• Do we need to understand consciousness before being able to understand AI or construct AI systems?<br />

• What if we succeed?


<strong>Marcus</strong> <strong>Hutter</strong> - 109 - Universal Induction & Intelligence<br />

Wrap Up<br />

• Setup: Given (non)iid data D = (x_1, ..., x_n), predict x_{n+1}<br />

• Ultimate goal is to maximize profit or minimize loss<br />

• Consider Models/Hypotheses H_i ∈ M<br />

• Max. Likelihood: H_best = arg max_i p(D|H_i) (overfits if M large)<br />

• Bayes: Posterior probability of H_i is p(H_i|D) ∝ p(D|H_i) p(H_i)<br />

• Bayes needs prior(H_i)<br />

• Occam+Epicurus: High prior for simple models.<br />

• Kolmogorov/Solomonoff: Quantification of simplicity/complexity<br />

• Bayes works if D is sampled from H_true ∈ M<br />

• Bellman equations tell how to optimally act <strong>in</strong> known environments<br />

• Universal AI = Universal Induction + Sequential Decision Theory<br />

• Practice = approximate, restrict, search, optimize, knowledge
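The Bayes and Occam bullets combine into a few lines of code; a toy sketch in which a code length stands in for K (the hypothesis names and lengths are invented for illustration):

```python
def posterior(likelihoods, code_lengths):
    """p(H_i|D) ∝ p(D|H_i) * 2^(-K_i): Bayes with an Occam prior."""
    unnorm = {h: likelihoods[h] * 2.0 ** -code_lengths[h] for h in likelihoods}
    Z = sum(unnorm.values())        # normalizing constant
    return {h: w / Z for h, w in unnorm.items()}
```

With equal likelihoods, the shorter (simpler) hypothesis receives the larger posterior, which is Occam's razor in its Bayesian form.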


<strong>Marcus</strong> <strong>Hutter</strong> - 110 - Universal Induction & Intelligence<br />

Literature<br />

[CV05] R. Cilibrasi and P. M. B. Vitányi. Clustering by compression. IEEE Trans. Information Theory, 51(4):1523–1545, 2005.<br />

[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.<br />

[Hut07] M. Hutter. On universal prediction and Bayesian confirmation. Theoretical Computer Science, 384(1):33–48, 2007.<br />

[LH07] S. Legg and M. Hutter. Universal intelligence: a definition of machine intelligence. Minds & Machines, 17(4):391–444, 2007.<br />

[Hut09] M. Hutter. Feature reinforcement learning: Part I: Unstructured MDPs. Journal of Artificial General Intelligence, 1:3–24, 2009.<br />

[VNHS09] J. Veness, K. S. Ng, M. Hutter, and D. Silver. A Monte Carlo AIXI approximation. Technical report, NICTA, Australia, 2009.


<strong>Marcus</strong> <strong>Hutter</strong> - 111 - Universal Induction & Intelligence<br />

Thanks! Questions? Details:<br />

Jobs: PostDoc and PhD positions at RSISE and NICTA, Australia<br />

Projects at http://www.hutter1.net/<br />

A Unified View of Artificial Intelligence:<br />

Artificial Intelligence = Decision Theory + Universal Induction<br />

Decision Theory = Probability + Utility Theory<br />

Universal Induction = Ockham + Bayes + Turing<br />

Open research problems at www.hutter1.net/ai/uaibook.htm<br />

Compression contest with 50’000 € prize at prize.hutter1.net
