Foundations of
Universal Induction
and Intelligent Agents
Marcus Hutter
Canberra, ACT, 0200, Australia
http://www.hutter1.net/
ANU RSISE NICTA
Marcus Hutter - 2 - Universal Induction & Intelligence
Abstract

The dream of creating artificial devices that reach or outperform human intelligence is many centuries old. This lecture series presents the elegant parameter-free theory, developed in [Hut05], of an optimal reinforcement learning agent embedded in an arbitrary unknown environment that possesses essentially all aspects of rational intelligence. The theory reduces all conceptual AI problems to pure computational questions.

How to perform inductive inference is closely related to the AI problem. The lecture series covers Solomonoff's theory, elaborated on in [Hut07], which solves the induction problem, at least from a philosophical and statistical perspective.

Both theories are based on Occam's razor quantified by Kolmogorov complexity; Bayesian probability theory; and sequential decision theory.
Relation between ML & RL & (U)AI

Universal Artificial Intelligence:
covers all Reinforcement Learning problem types.
• Statistical Machine Learning: mostly i.i.d. data
  (classification, regression, clustering)
• RL Problems & Algorithms: stochastic, unknown,
  non-i.i.d. environments
• Artificial Intelligence: traditionally deterministic,
  known world / planning problem
TABLE OF CONTENTS

1. PHILOSOPHICAL ISSUES
2. BAYESIAN SEQUENCE PREDICTION
3. UNIVERSAL INDUCTIVE INFERENCE
4. UNIVERSAL ARTIFICIAL INTELLIGENCE
5. APPROXIMATIONS AND APPLICATIONS
6. FUTURE DIRECTION, WRAP UP, LITERATURE
PHILOSOPHICAL ISSUES

• Philosophical Problems
• What is (Artificial) Intelligence
• How to do Inductive Inference
• How to Predict (Number) Sequences
• How to make Decisions in Unknown Environments
• Occam's Razor to the Rescue
• The Grue Emerald and Confirmation Paradoxes
• What this Lecture Series is (Not) About
Philosophical Issues: Abstract

I start by considering the philosophical problems concerning artificial intelligence and machine learning in general and induction in particular. I illustrate the problems and their intuitive solution on various (classical) induction examples. The common principle to their solution is Occam's simplicity principle. Based on Occam's and Epicurus' principle, Bayesian probability theory, and Turing's universal machine, Solomonoff developed a formal theory of induction. I describe the sequential/online setup considered in this lecture series and place it into the wider machine learning context.
What is (Artificial) Intelligence?

Intelligence can have many faces ⇒ formal definition difficult:
• reasoning
• creativity
• association
• generalization
• pattern recognition
• problem solving
• memorization
• planning
• achieving goals
• learning
• optimization
• self-preservation
• vision
• language processing
• classification
• induction
• deduction
• ...

What is AI?   Thinking            Acting
humanly       Cognitive Science   Turing test, Behaviorism
rationally    Laws of Thought     Doing the Right Thing

Collection of 70+ Defs of Intelligence:
http://www.vetta.org/definitions-of-intelligence/

Real world is nasty: partially unobservable, uncertain, unknown, non-ergodic, reactive, vast, but luckily structured, ...
Informal Definition of (Artificial) Intelligence

Intelligence measures an agent's ability to achieve goals in a wide range of environments. [S. Legg and M. Hutter]

Emergent: Features such as the ability to learn and adapt, or to understand, are implicit in the above definition, as these capacities enable an agent to succeed in a wide range of environments.

The science of Artificial Intelligence is concerned with the construction of intelligent systems/artifacts/agents and their analysis.

What next? Substantiate all terms above: agent, ability, utility, goal, success, learn, adapt, environment, ...
On the Foundations of Artificial Intelligence

• Example: Algorithm/complexity theory: The goal is to find fast algorithms solving problems and to show lower bounds on their computation time. Everything is rigorously defined: algorithm, Turing machine, problem classes, computation time, ...
• Most disciplines start with an informal way of attacking a subject. With time they become more and more formalized, often to a point where they are completely rigorous. Examples: set theory, logical reasoning, proof theory, probability theory, infinitesimal calculus, energy, temperature, quantum field theory, ...
• Artificial Intelligence: Tries to build and understand systems that act intelligently, learn from experience, make good predictions, are able to generalize, ... Many terms are only vaguely defined, or there are many alternative definitions.
Induction → Prediction → Decision → Action

Induction infers general models from specific observations/facts/data, usually exhibiting regularities or properties or relations in the latter.

Having or acquiring or learning or inducing a model of the environment an agent interacts with allows the agent to make predictions and utilize them in its decision process of finding a good next action.

Example
Induction: Find a model of the world economy.
Prediction: Use the model for predicting the future stock market.
Decision: Decide whether to invest assets in stocks or bonds.
Action: Trading large quantities of stocks influences the market.
Example 1: Probability of Sunrise Tomorrow

What is the probability p(1|1^d) that the sun will rise tomorrow?
(d = past # days sun rose, 1 = sun rises, 0 = sun will not rise)
• p is undefined, because there has never been an experiment that tested the existence of the sun tomorrow (reference class problem).
• p = 1, because the sun rose in all past experiments.
• p = 1 − ϵ, where ϵ is the proportion of stars that explode per day.
• p = (d+1)/(d+2), which is Laplace's rule, derived from Bayes' rule.
• Derive p from the type, age, size and temperature of the sun, even though we never observed another star with those exact properties.

Conclusion: We predict that the sun will rise tomorrow with high probability, independent of the justification.
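Laplace's rule from the fourth bullet is easy to check numerically; a minimal sketch with exact rational arithmetic (the sequential form, assuming a uniform prior over the sunrise probability):

```python
from fractions import Fraction

def laplace_rule(d):
    """Probability that the sun rises tomorrow after d observed sunrises:
    p = (d+1)/(d+2), Laplace's rule of succession."""
    return Fraction(d + 1, d + 2)

# With no observations the rule gives the uniform-prior value 1/2;
# with growing evidence it approaches, but never reaches, 1.
print(laplace_rule(0))    # 1/2
print(laplace_rule(100))  # 101/102
```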
Example 2: Digits of a Computable Number

• Extend 14159265358979323846264338327950288419716939937
• Looks random!
• Frequency estimate: n = length of sequence, k_i = number of occurrences of digit i ⇒ probability of next digit being i is k_i/n. Asymptotically k_i/n → 1/10 (seems to be) true.
• But we have the strong feeling that (i.e. with high probability) the next digit will be 5, because the previous digits were the expansion of π.
• Conclusion: We prefer answer 5, since we see more structure in the sequence than just random digits.
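The frequency estimate in the third bullet can be sketched in a few lines of Python (digit string taken from the slide):

```python
from collections import Counter

digits = "14159265358979323846264338327950288419716939937"
n = len(digits)
counts = Counter(digits)

# Frequency estimate: probability of the next digit being i is k_i / n.
freq = {d: counts.get(d, 0) / n for d in "0123456789"}
print(n, freq)
```

The estimates hover loosely around 1/10, which is all this predictor can ever say; it is blind to the fact that the digits are the expansion of π.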
Example 3: Number Sequences

Sequence: x_1, x_2, x_3, x_4, x_5, ...
          1, 2, 3, 4, ?, ...
• x_5 = 5, since x_i = i for i = 1..4.
• x_5 = 29, since x_i = i^4 − 10i^3 + 35i^2 − 49i + 24.
Conclusion: We prefer 5, since the linear relation involves less arbitrary parameters than the 4th-order polynomial.

Sequence: 2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,?
• 61, since this is the next prime.
• 60, since this is the order of the next simple group.
Conclusion: We prefer answer 61, since primes are a more familiar concept than simple groups.

On-Line Encyclopedia of Integer Sequences:
http://www.research.att.com/~njas/sequences/
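Both continuations of the first sequence really do fit the data, which is easy to verify directly; a small Python check of the polynomial from the second bullet:

```python
def poly(i):
    # 4th-order polynomial from the slide: fits 1,2,3,4, then jumps to 29
    return i**4 - 10*i**3 + 35*i**2 - 49*i + 24

print([poly(i) for i in range(1, 6)])  # [1, 2, 3, 4, 29]
print(list(range(1, 6)))               # [1, 2, 3, 4, 5]
```

Both rules are consistent with the observed 1, 2, 3, 4; they only disagree on the unseen x_5, which is exactly why a principle for preferring one is needed.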
Occam's Razor to the Rescue

• Is there a unique principle which allows us to formally arrive at a prediction which
  - coincides (always) with our intuitive guess, or, even better,
  - is (in some sense) most likely the best or correct answer?
• Yes! Occam's razor: Use the simplest explanation consistent with past data (and use it for prediction).
• Works! For the examples presented and for many more.
• Actually, Occam's razor can serve as a foundation of machine learning in general, and is even a fundamental principle (or maybe even the mere definition) of science.
• Problem: Not a formal/mathematical objective principle. What is simple for one may be complicated for another.
Grue Emerald Paradox

Hypothesis 1: All emeralds are green.
Hypothesis 2: All emeralds found till y2020 are green, thereafter all emeralds are blue.
• Which hypothesis is more plausible? H1! Justification?
• Occam's razor: take the simplest hypothesis consistent with the data. It is the most important principle in machine learning and science.
Confirmation Paradox

(i) R → B is confirmed by an R-instance with property B.
(ii) ¬B → ¬R is confirmed by a ¬B-instance with property ¬R.
(iii) Since R → B and ¬B → ¬R are logically equivalent, R → B is also confirmed by a ¬B-instance with property ¬R.

Example: Hypothesis (o): All ravens are black (R=Raven, B=Black).
(i) Observing a Black Raven confirms Hypothesis (o).
(iii) Observing a White Sock also confirms that all Ravens are Black, since a White Sock is a non-Raven which is non-Black.

This conclusion sounds absurd! What's the problem?
What This Lecture Series is (Not) About

Dichotomies in Artificial Intelligence & Machine Learning

scope of this lecture series ⇔ scope of other lecture series
(machine) learning ⇔ (GOFAI) knowledge-based
statistical ⇔ logic-based
decision ⇔ prediction ⇔ induction ⇔ action
classification ⇔ regression
sequential / non-i.i.d. ⇔ independent identically distributed
online learning ⇔ offline/batch learning
passive prediction ⇔ active learning
Bayes ⇔ MDL ⇔ Expert ⇔ Frequentist
uninformed / universal ⇔ informed / problem-specific
conceptual/mathematical issues ⇔ computational issues
exact/principled ⇔ heuristic
supervised learning ⇔ unsupervised learning ⇔ reinforcement learning
exploitation ⇔ exploration
BAYESIAN SEQUENCE PREDICTION

• Sequential/Online Prediction – Setup
• Uncertainty and Probability
• Frequency Interpretation: Counting
• Objective Uncertain Events & Subjective Degrees of Belief
• Bayes' and Laplace's Rules
• The Bayes-mixture distribution
• Predictive Convergence
• Sequential Decisions and Loss Bounds
• Generalization: Continuous Probability Classes
• Summary
Bayesian Sequence Prediction: Abstract

The aim of probability theory is to describe uncertainty. There are various sources and interpretations of uncertainty. I compare the frequency, objective, and subjective probabilities, show that they all respect the same rules, and derive Bayes' and Laplace's famous and fundamental rules. Then I concentrate on general sequence prediction tasks. I define the Bayes mixture distribution and show that the posterior converges rapidly to the true posterior by exploiting some bounds on the relative entropy. Finally I show that the mixture predictor is also optimal in a decision-theoretic sense w.r.t. any bounded loss function.
Sequential/Online Prediction – Setup

In sequential or online prediction, for times t = 1, 2, 3, ..., our predictor p makes a prediction y_t^p ∈ Y based on past observations x_1, ..., x_{t−1}. Thereafter x_t ∈ X is observed and p suffers Loss(x_t, y_t^p). The goal is to design predictors with small total or cumulative loss Loss_{1:T}(p) := Σ_{t=1}^T Loss(x_t, y_t^p).

Applications are abundant, e.g. weather or stock market forecasting.

Example: X = {sunny, rainy}, Y = {umbrella, sunglasses}

Loss(x, y)   sunny  rainy
umbrella      0.1    0.3
sunglasses    0.0    1.0

The setup also includes classification and regression problems.
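The cumulative loss Loss_{1:T}(p) is straightforward to compute; a minimal Python sketch using the loss table above (the weather sequence and the fixed "always umbrella" predictor are made-up illustrations):

```python
# Loss table from the slide: LOSS[action][observation]
LOSS = {
    "umbrella":   {"sunny": 0.1, "rainy": 0.3},
    "sunglasses": {"sunny": 0.0, "rainy": 1.0},
}

def cumulative_loss(observations, predictions):
    """Loss_{1:T}(p) = sum over t of Loss(x_t, y_t^p)."""
    return sum(LOSS[y][x] for x, y in zip(observations, predictions))

weather = ["sunny", "rainy", "sunny", "rainy"]    # hypothetical x_1..x_4
always_umbrella = ["umbrella"] * 4                # hypothetical predictor
print(cumulative_loss(weather, always_umbrella))  # 0.1+0.3+0.1+0.3
```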
Uncertainty and Probability

The aim of probability theory is to describe uncertainty.

Sources/interpretations of uncertainty:
• Frequentist: probabilities are relative frequencies.
  (e.g. the relative frequency of tossing head)
• Objectivist: probabilities are real aspects of the world.
  (e.g. the probability that some atom decays in the next hour)
• Subjectivist: probabilities describe an agent's degree of belief.
  (e.g. it is (im)plausible that extraterrestrials exist)
Frequency Interpretation: Counting

• The frequentist interprets probabilities as relative frequencies.
• If in a sequence of n independent identically distributed (i.i.d.) experiments (trials) an event occurs k(n) times, the relative frequency of the event is k(n)/n.
• The limit lim_{n→∞} k(n)/n is defined as the probability of the event.
• For instance, the probability of the event head in a sequence of repeatedly tossing a fair coin is 1/2.
• The frequentist position is the easiest to grasp, but it has several shortcomings.
• Problems: circular definition, limited to i.i.d., reference class problem.
Objective Interpretation: Uncertain Events

• For the objectivist, probabilities are real aspects of the world.
• The outcome of an observation or an experiment is not deterministic, but involves physical random processes.
• The set Ω of all possible outcomes is called the sample space.
• It is said that an event E ⊂ Ω occurred if the outcome is in E.
• In the case of i.i.d. experiments the probabilities p assigned to events E should be interpretable as limiting frequencies, but the application is not limited to this case.
• (Some) probability axioms:
  p(Ω) = 1 and p({}) = 0 and 0 ≤ p(E) ≤ 1.
  p(A ∪ B) = p(A) + p(B) − p(A ∩ B).
  p(B|A) = p(A ∩ B)/p(A) is the probability of B given that event A occurred.
Subjective Interpretation: Degrees of Belief

• The subjectivist uses probabilities to characterize an agent's degree of belief in something, rather than to characterize physical random processes.
• This is the most relevant interpretation of probabilities in AI.
• We define the plausibility of an event as the degree of belief in the event, or the subjective probability of the event.
• It is natural to assume that plausibilities/beliefs Bel(·|·) can be represented by real numbers, that the rules qualitatively correspond to common sense, and that the rules are mathematically consistent. ⇒
• Cox's theorem: Bel(·|A) is isomorphic to a probability function p(·|·) that satisfies the axioms of (objective) probabilities.
• Conclusion: Beliefs follow the same rules as probabilities.
Bayes' Famous Rule

Let D be some possible data (i.e. D is an event with p(D) > 0) and {H_i}_{i∈I} be a countable complete class of mutually exclusive hypotheses (i.e. H_i are events with H_i ∩ H_j = {} ∀ i ≠ j and ∪_{i∈I} H_i = Ω).

Given: p(H_i) = a priori plausibility of hypothesis H_i (subj. prob.)
Given: p(D|H_i) = likelihood of data D under hypothesis H_i (obj. prob.)
Goal: p(H_i|D) = a posteriori plausibility of hypothesis H_i (subj. prob.)

Solution: p(H_i|D) = p(D|H_i) p(H_i) / Σ_{i∈I} p(D|H_i) p(H_i)

Proof: From the definition of conditional probability and
Σ_{i∈I} p(H_i|...) = 1 ⇒ Σ_{i∈I} p(D|H_i) p(H_i) = Σ_{i∈I} p(H_i|D) p(D) = p(D)
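Bayes' rule is mechanical to implement over a finite hypothesis class; a minimal sketch (the two-coin setup is a made-up illustration, not from the slides):

```python
def bayes_posterior(prior, likelihood):
    """p(H_i|D) = p(D|H_i) p(H_i) / sum_j p(D|H_j) p(H_j)."""
    joint = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(joint.values())  # p(D)
    return {h: joint[h] / evidence for h in joint}

# Hypothetical example: a coin is either fair or double-headed.
prior = {"fair": 0.5, "double_head": 0.5}
# Likelihood of observing three heads in a row under each hypothesis:
likelihood = {"fair": 0.5**3, "double_head": 1.0}
post = bayes_posterior(prior, likelihood)
print(post)  # double_head becomes 8x more plausible than fair
```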
Example: Bayes' and Laplace's Rule

Assume data is generated by a biased coin with head probability θ, i.e. H_θ := Bernoulli(θ) with θ ∈ Θ := [0, 1].

Finite sequence: x = x_1 x_2 ... x_n with n_1 ones and n_0 zeros.
Sampled infinite sequence: ω ∈ Ω = {0, 1}^∞
Basic event: Γ_x = {ω : ω_1 = x_1, ..., ω_n = x_n} = set of all sequences starting with x.

Data likelihood: p_θ(x) := p(Γ_x | H_θ) = θ^{n_1} (1 − θ)^{n_0}

Bayes (1763): Uniform prior plausibility: p(θ) := p(H_θ) = 1
(∫_0^1 p(θ) dθ = 1 instead of Σ_{i∈I} p(H_i) = 1)

Evidence: p(x) = ∫_0^1 p_θ(x) p(θ) dθ = ∫_0^1 θ^{n_1} (1 − θ)^{n_0} dθ = n_1! n_0! / (n_0 + n_1 + 1)!
Example: Bayes' and Laplace's Rule

Bayes: Posterior plausibility of θ after seeing x is:

p(θ|x) = p(x|θ) p(θ) / p(x) = (n+1)!/(n_1! n_0!) · θ^{n_1} (1 − θ)^{n_0}

Laplace: What is the probability of seeing 1 after having observed x?

p(x_{n+1} = 1 | x_1...x_n) = p(x1)/p(x) = (n_1 + 1)/(n + 2)

Laplace believed that the sun had risen for 5000 years = 1'826'213 days, so he concluded that the probability of doomsday tomorrow is 1/1'826'215.
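The closed forms above are easy to verify with exact rational arithmetic; a small sketch (a numerical check, not part of the slides):

```python
from fractions import Fraction
from math import factorial

def evidence(n1, n0):
    """p(x) = n1! n0! / (n0+n1+1)!  under the uniform prior."""
    return Fraction(factorial(n1) * factorial(n0), factorial(n0 + n1 + 1))

def laplace_predict(n1, n0):
    """p(x_{n+1}=1 | x) = p(x1)/p(x), which should equal (n1+1)/(n+2)."""
    return evidence(n1 + 1, n0) / evidence(n1, n0)

n1, n0 = 3, 1
print(laplace_predict(n1, n0))  # 2/3, matching (n1+1)/(n+2) = 4/6
```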
Exercise: Envelope Paradox<br />
• I <strong>of</strong>fer you two closed envelopes, one <strong>of</strong> them conta<strong>in</strong>s twice the<br />
amount <strong>of</strong> money than the other. You are allowed to pick one and<br />
open it. Now you have two options. Keep the money or decide for<br />
the other envelope (which could double or half your ga<strong>in</strong>).<br />
• Symmetry argument: It doesn’t matter whether you switch, the<br />
expected ga<strong>in</strong> is the same.<br />
• Refutation: With probability p = 1/2, the other envelope conta<strong>in</strong>s<br />
twice/half the amount, i.e. if you switch your expected ga<strong>in</strong><br />
<strong>in</strong>creases by a factor 1.25=(1/2)*2+(1/2)*(1/2).<br />
• Present a Bayesian solution.
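As a sanity check on the symmetry argument (not the Bayesian solution the exercise asks for), one can simulate the game; the base amount is drawn here from an arbitrary made-up range:

```python
import random

random.seed(0)
keep_total = switch_total = 0
trials = 100_000
for _ in range(trials):
    a = random.randint(1, 100)       # hypothetical base amount
    envelopes = [a, 2 * a]
    random.shuffle(envelopes)
    keep_total += envelopes[0]       # strategy 1: keep the picked envelope
    switch_total += envelopes[1]     # strategy 2: always switch
print(keep_total / trials, switch_total / trials)
# Both averages come out essentially equal: blind switching gains nothing,
# contrary to the factor-1.25 argument.
```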
Notation: Strings & Probabilities

Strings: x = x_1 x_2 ... x_n with x_t ∈ X and x_{1:n} := x_1 x_2 ... x_{n−1} x_n and x_{<t} := x_1 x_2 ... x_{t−1}
The Bayes-Mixture Distribution ξ

• Assumption: The true (objective) environment µ is unknown.
• Bayesian approach: Replace the true probability distribution µ by a Bayes-mixture ξ.
• Assumption: We know that the true environment µ is contained in some known countable (in)finite set M of environments.
• The Bayes-mixture ξ is defined as

  ξ(x) := Σ_{ν∈M} w_ν ν(x)   with   Σ_{ν∈M} w_ν = 1,  w_ν > 0 ∀ν

• The weights w_ν may be interpreted as the prior degree of belief that the true environment is ν, or k_ν = ln w_ν^{−1} as a complexity penalty (prefix code length) of environment ν.
• Then ξ(x) can be interpreted as the prior subjective belief probability of observing x.
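A minimal numerical sketch of ξ for a finite class of Bernoulli environments (the class and the uniform weights are made-up; the true µ would be one unknown element of M):

```python
from itertools import product

# Hypothetical model class M: three Bernoulli environments, uniform weights.
M = {0.3: 1/3, 0.5: 1/3, 0.7: 1/3}  # theta -> prior weight w_nu

def nu(theta, x):
    """Probability of binary string x under Bernoulli(theta)."""
    p = 1.0
    for bit in x:
        p *= theta if bit == 1 else 1 - theta
    return p

def xi(x):
    """Bayes mixture: xi(x) = sum over nu in M of w_nu * nu(x)."""
    return sum(w * nu(theta, x) for theta, w in M.items())

# xi is a proper probability over strings of fixed length n:
n = 3
total = sum(xi(x) for x in product([0, 1], repeat=n))
print(total)  # sums to 1 (up to rounding)
```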
Relative Entropy

Relative entropy: D(p||q) := Σ_i p_i ln(p_i/q_i)
Properties: D(p||q) ≥ 0 and D(p||q) = 0 ⇔ p = q

Instantaneous relative entropy: d_t(x_{<t}) := Σ_{x_t} µ(x_t|x_{<t}) ln [µ(x_t|x_{<t}) / ξ(x_t|x_{<t})]

Total relative entropy: D_n := Σ_{t=1}^n E[d_t] ≤ ln w_µ^{−1}
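Relative entropy in code, for concreteness (a generic sketch over finite distributions, not specific to the slides' µ and ξ):

```python
from math import log

def relative_entropy(p, q):
    """D(p||q) = sum_i p_i ln(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(relative_entropy(p, p))  # 0.0, since the distributions coincide
print(relative_entropy(p, q))  # positive, since p != q
```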
Proof of the Entropy Bound

D_n ≡ Σ_{t=1}^n Σ_{x_{<t}} µ(x_{<t}) d_t(x_{<t}) = Σ_{x_{1:n}} µ(x_{1:n}) ln [µ(x_{1:n}) / ξ(x_{1:n})] ≤ ln w_µ^{−1}

The equality uses the chain rule µ(x_{1:n}) = Π_{t=1}^n µ(x_t|x_{<t}) (exchange the t-sum and the product inside the logarithm). The inequality follows from ξ(x) ≥ w_µ µ(x), which is immediate from the definition of the mixture ξ.
Predictive Convergence

Theorem: ξ(x_t|x_{<t}) → µ(x_t|x_{<t}) for t → ∞ with µ-probability 1.

Proof idea: The entropy bound D_∞ ≤ ln w_µ^{−1} < ∞ bounds the total expected squared difference between the predictive probabilities, so the per-step differences must converge to zero, and rapidly so, since the total bound is finite.
Sequential Decisions

A prediction is very often the basis for some decision. The decision results in an action, which itself leads to some reward or loss.

Let Loss(x_t, y_t) ∈ [0, 1] be the received loss when taking action y_t ∈ Y and x_t ∈ X is the t-th symbol of the sequence.

For instance, decision Y = {umbrella, sunglasses} based on weather forecasts X = {sunny, rainy}:

Loss         sunny  rainy
umbrella      0.1    0.3
sunglasses    0.0    1.0

The goal is to minimize the µ-expected loss. More generally, we define the Λ_σ prediction scheme, which minimizes the σ-expected loss:

y_t^{Λ_σ} := arg min_{y_t ∈ Y} Σ_{x_t} σ(x_t|x_{<t}) Loss(x_t, y_t)
Loss Bounds

• Definition: µ-expected loss when Λ_σ predicts the t-th symbol:
  Loss_t(Λ_σ)(x_{<t}) := Σ_{x_t} µ(x_t|x_{<t}) Loss(x_t, y_t^{Λ_σ})
• Theorem: 0 ≤ Loss_{1:n}(Λ_ξ) − Loss_{1:n}(Λ_µ) ≤ 2D_n + 2√(Loss_{1:n}(Λ_µ) D_n),
  i.e. the universal predictor Λ_ξ suffers at most a vanishing extra loss per step relative to the informed predictor Λ_µ.
Proof of Instantaneous Loss Bounds

Abbreviations: X = {1, ..., N}, N = |X|, i = x_t, y_i = µ(x_t = i | x_{<t}), z_i = ξ(x_t = i | x_{<t})
Generalization: Continuous Classes M

In statistical parameter estimation one often has a continuous hypothesis class (e.g. a Bernoulli(θ) process with unknown θ ∈ [0, 1]):

M := {ν_θ : θ ∈ IR^d},  ξ(x) := ∫_{IR^d} dθ w(θ) ν_θ(x),  ∫_{IR^d} dθ w(θ) = 1

Under weak regularity conditions [CB90,H'03]:

Theorem: D_n(µ||ξ) ≤ ln w(µ)^{−1} + (d/2) ln(n/2π) + O(1)

where O(1) depends on the local curvature (parametric complexity) of ln ν_θ, and is independent of n for many reasonable classes, including all stationary (k-th order) finite-state Markov processes (k = 0 is i.i.d.).

D_n ∝ log(n) = o(n) still implies excellent prediction and decision for most n. [RH'07]
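For the Bernoulli class with uniform prior the relative entropy can be computed exactly, since ξ(x) = n_1! n_0!/(n+1)! depends only on the counts; a small sketch showing the logarithmic growth of D_n, consistent with the (d/2) ln n term for d = 1:

```python
from math import comb, lgamma, log

def D_n(theta, n):
    """Exact D_n(mu||xi) for mu = Bernoulli(theta) and xi = the
    uniform-prior Bayes mixture over theta in [0,1]."""
    total = 0.0
    for n1 in range(n + 1):
        n0 = n - n1
        # mu-probability of observing n1 ones in n trials (binomial)
        p = comb(n, n1) * theta**n1 * (1 - theta)**n0
        log_mu = n1 * log(theta) + n0 * log(1 - theta)
        # log xi(x) = log( n1! n0! / (n+1)! )
        log_xi = lgamma(n1 + 1) + lgamma(n0 + 1) - lgamma(n + 2)
        total += p * (log_mu - log_xi)
    return total

for n in [10, 100, 1000]:
    print(n, D_n(0.5, n))  # grows roughly like (1/2) ln n
```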
Bayesian Sequence Prediction: Summary

• The aim of probability theory is to describe uncertainty.
• Various sources and interpretations of uncertainty:
  frequency, objective, and subjective probabilities.
• They all respect the same rules.
• General sequence prediction: Use the known (subjective) Bayes mixture ξ = Σ_{ν∈M} w_ν ν in place of the unknown (objective) true distribution µ.
• Bound on the relative entropy between ξ and µ
  ⇒ the posterior of ξ converges rapidly to the true posterior µ.
• ξ is also optimal in a decision-theoretic sense w.r.t. any bounded loss function.
• No structural assumptions on M and ν ∈ M.
UNIVERSAL INDUCTIVE INFERENCE

• Foundations of Universal Induction
• Bayesian Sequence Prediction and Confirmation
• Convergence and Decisions
• How to Choose the Prior – Universal
• Kolmogorov Complexity
• How to Choose the Model Class – Universal
• The Problem of Zero Prior
• Reparametrization and Regrouping Invariance
• The Problem of Old Evidence / New Theories
• Universal is Better than Continuous Class
• More Bounds / Stuff / Critique / Problems
• Summary / Outlook / Literature
Universal Inductive Inference: Abstract<br />
Solomon<strong>of</strong>f completed the Bayesian framework by provid<strong>in</strong>g a rigorous,<br />
unique, formal, and universal choice for the model class and the prior. I<br />
will discuss <strong>in</strong> breadth how and <strong>in</strong> which sense universal (non-i.i.d.)<br />
sequence prediction solves various (philosophical) problems <strong>of</strong> traditional<br />
Bayesian sequence prediction. I show that Solomon<strong>of</strong>f’s model possesses<br />
many desirable properties: strong total and weak instantaneous bounds;<br />
in contrast to most classical continuous prior densities, it has no zero<br />
p(oste)rior problem, i.e. it can confirm universal hypotheses; it is<br />
reparametrization and regrouping invariant; and it avoids the old-evidence<br />
and updating problems. It even performs well (actually better) in<br />
non-computable environments.
<strong>Marcus</strong> <strong>Hutter</strong> - 41 - Universal Induction & Intelligence<br />
Induction Examples<br />
Sequence prediction: Predict weather/stock-quote/... tomorrow, based<br />
on past sequence. Continue IQ test sequence like 1,4,9,16,...<br />
Classification: Predict whether email is spam.<br />
Classification can be reduced to sequence prediction.<br />
Hypothesis testing/identification: Does treatment X cure cancer?<br />
Do observations of white swans confirm that all ravens are black?<br />
These are <strong>in</strong>stances <strong>of</strong> the important problem <strong>of</strong> <strong>in</strong>ductive <strong>in</strong>ference or<br />
time-series forecast<strong>in</strong>g or sequence prediction.<br />
Problem: F<strong>in</strong>d<strong>in</strong>g prediction rules for every particular (new) problem is<br />
possible but cumbersome and prone to disagreement or contradiction.<br />
Goal: A s<strong>in</strong>gle, formal, general, complete theory for prediction.<br />
Beyond <strong>in</strong>duction: active/reward learn<strong>in</strong>g, fct. optimization, game theory.
<strong>Marcus</strong> <strong>Hutter</strong> - 42 - Universal Induction & Intelligence<br />
Foundations <strong>of</strong> Universal Induction<br />
Ockham’s razor (simplicity) principle<br />
Entities should not be multiplied beyond necessity.<br />
Epicurus’ pr<strong>in</strong>ciple <strong>of</strong> multiple explanations<br />
If more than one theory is consistent with the observations, keep<br />
all theories.<br />
Bayes’ rule for conditional probabilities<br />
Given the prior belief/probability one can predict all future probabilities.<br />
Tur<strong>in</strong>g’s universal mach<strong>in</strong>e<br />
Everyth<strong>in</strong>g computable by a human us<strong>in</strong>g a fixed procedure can<br />
also be computed by a (universal) Tur<strong>in</strong>g mach<strong>in</strong>e.<br />
Kolmogorov’s complexity<br />
The complexity or <strong>in</strong>formation content <strong>of</strong> an object is the length<br />
<strong>of</strong> its shortest description on a universal Tur<strong>in</strong>g mach<strong>in</strong>e.<br />
Solomon<strong>of</strong>f’s universal prior=Ockham+Epicurus+Bayes+Tur<strong>in</strong>g<br />
Solves the question <strong>of</strong> how to choose the prior if noth<strong>in</strong>g is known.<br />
⇒ universal induction, formal Occam, AIT, MML, MDL, SRM, ...
<strong>Marcus</strong> <strong>Hutter</strong> - 43 - Universal Induction & Intelligence<br />
Bayesian Sequence Prediction and Confirmation<br />
• Assumption: Sequence ω ∈ X^∞ is sampled from the “true”<br />
probability measure µ, i.e. µ(x) := P[x|µ] is the µ-probability that<br />
ω starts with x ∈ X^n.<br />
• Model class: We assume that µ is unknown but known to belong to<br />
a countable class <strong>of</strong> environments=models=measures<br />
M = {ν_1, ν_2, ...}.<br />
[no i.i.d./ergodic/stationary assumption]<br />
• Hypothesis class: {H ν : ν ∈ M} forms a mutually exclusive and<br />
complete class <strong>of</strong> hypotheses.<br />
• Prior: w_ν := P[H_ν] is our prior belief in H_ν<br />
⇒ Evidence: ξ(x) := P[x] = ∑_{ν∈M} P[x|H_ν]P[H_ν] = ∑_ν w_ν ν(x)<br />
must be our (prior) belief in x.<br />
⇒ Posterior: w_ν(x) := P[H_ν|x] = P[x|H_ν]P[H_ν] / P[x] is our posterior belief<br />
in ν (Bayes’ rule).
<strong>Marcus</strong> <strong>Hutter</strong> - 44 - Universal Induction & Intelligence<br />
When is a Sequence Random?<br />
a) Is 0110010100101101101001111011 generated by a fair coin flip?<br />
b) Is 1111111111111111111111111111 generated by a fair coin flip?<br />
c) Is 1100100100001111110110101010 generated by a fair coin flip?<br />
d) Is 0101010101010101010101010101 generated by a fair coin flip?<br />
• Intuitively: (a) and (c) look random, but (b) and (d) look unlikely.<br />
• Problem: Formally (a-d) have equal probability (1/2)^length.<br />
• Classical solution: Consider hypothesis class H := {Bernoulli(p) :<br />
p ∈ Θ ⊆ [0, 1]} and determine the p for which the sequence has maximum<br />
likelihood ⇒ (a,c,d) are fair Bernoulli(1/2) coins, (b) is not.<br />
• Problem: (d) is non-random; also (c) is the binary expansion of π.<br />
• Solution: Choose H larger, but how large? Overfitting? MDL?<br />
• AIT Solution: A sequence is random iff it is <strong>in</strong>compressible.
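Since K is incomputable, incompressibility can only be probed with a practical compressor; the sketch below uses zlib as a crude stand-in (an assumption of this illustration, not part of AIT):

```python
import random
import zlib

def compressed_len(bits: str) -> int:
    """Bytes needed by zlib at maximum effort: a rough, computable proxy
    for the incompressibility of the bit string."""
    return len(zlib.compress(bits.encode(), 9))

random.seed(0)
irregular   = "".join(random.choice("01") for _ in range(280))  # like (a)
ones        = "1" * 280                                         # like (b)
alternating = "01" * 140                                        # like (d)

# Regular sequences compress well (low proxy-complexity); irregular ones do not.
assert compressed_len(ones) < compressed_len(irregular)
assert compressed_len(alternating) < compressed_len(irregular)
```

The strings are lengthened to 280 bits so that the compressor's fixed header overhead does not mask the effect.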
<strong>Marcus</strong> <strong>Hutter</strong> - 45 - Universal Induction & Intelligence<br />
What does Probability Mean?<br />
Naive frequency interpretation is circular:<br />
• Probability of event E is p := lim_{n→∞} k_n(E)/n,<br />
where n = # i.i.d. trials and k_n(E) = # occurrences of event E in n trials.<br />
• Problem: Limit may be anything (or nothing):<br />
e.g. a fair coin can give: Head, Head, Head, Head, ... ⇒ p = 1.<br />
• Of course, for a fair co<strong>in</strong> this sequence is “unlikely”.<br />
For a fair coin, p = 1/2 with “high probability”.<br />
• But to make this statement rigorous we need to formally know what<br />
“high probability” means.<br />
Circularity!<br />
Also: In complex domains typical for AI, the sample size is often 1<br />
(e.g. a single non-i.i.d. historic weather data sequence is given).<br />
We want to know whether certa<strong>in</strong> properties hold for this particular seq.
<strong>Marcus</strong> <strong>Hutter</strong> - 46 - Universal Induction & Intelligence<br />
How to Choose the Prior<br />
The probability axioms allow relat<strong>in</strong>g probabilities and plausibilities <strong>of</strong><br />
different events, but they do not uniquely fix a numerical value for each<br />
event, except for the sure event Ω and the empty event {}.<br />
We need new principles for determining values for at least some basic<br />
events from which others can then be computed.<br />
There seem to be only 3 general pr<strong>in</strong>ciples:<br />
• The pr<strong>in</strong>ciple <strong>of</strong> <strong>in</strong>difference — the symmetry pr<strong>in</strong>ciple<br />
• The maximum entropy pr<strong>in</strong>ciple<br />
• Occam’s razor — the simplicity pr<strong>in</strong>ciple<br />
Concrete: How shall we choose the hypothesis space {H_i} and the prior<br />
p(H_i) – or – M = {ν} and the weights w_ν?
<strong>Marcus</strong> <strong>Hutter</strong> - 47 - Universal Induction & Intelligence<br />
Indifference or Symmetry Pr<strong>in</strong>ciple<br />
Assign the same probability to all hypotheses:<br />
p(H_i) = 1/|I| for finite I,<br />
p(H_θ) = [Vol(Θ)]^{-1} for compact and measurable Θ.<br />
⇒ p(H_i|D) ∝ p(D|H_i) ≙ classical hypothesis testing (Max. Likelihood).<br />
Prev. Example: H_θ = Bernoulli(θ) with p(θ) = 1 for θ ∈ Θ := [0, 1].<br />
Problems: Does not work for “large” hypothesis spaces:<br />
(a) A uniform distribution on infinite I = IN or noncompact Θ is not possible!<br />
(b) Reparametrization: θ ❀ f(θ). Uniform in θ is not uniform in f(θ).<br />
Example: “Uniform” distribution on the space of all (binary) sequences {0, 1}^∞:<br />
p(x_1...x_n) = (1/2)^n ∀n ∀x_1...x_n ⇒ p(x_{n+1} = 1|x_1...x_n) = 1/2 always!<br />
Inference is thus not possible (No-Free-Lunch myth).<br />
Predictive setting: All we need is p(x).
<strong>Marcus</strong> <strong>Hutter</strong> - 48 - Universal Induction & Intelligence<br />
Occam’s Razor — The Simplicity Pr<strong>in</strong>ciple<br />
• Only Occam’s razor (<strong>in</strong> comb<strong>in</strong>ation with Epicurus’ pr<strong>in</strong>ciple) is<br />
general enough to assign prior probabilities <strong>in</strong> every situation.<br />
• The idea is to assign high (subjective) probability to simple events,<br />
and low probability to complex events.<br />
• Simple events (str<strong>in</strong>gs) are more plausible a priori than complex<br />
ones.<br />
• This does (approximate) justice to both Occam’s razor and<br />
Epicurus’ principle.
<strong>Marcus</strong> <strong>Hutter</strong> - 49 - Universal Induction & Intelligence<br />
Prefix Sets/Codes<br />
String x is a (proper) prefix of y :⇐⇒ ∃ z(≠ ϵ) such that xz = y.<br />
Set P is prefix-free or a prefix code :⇐⇒ no element is a proper<br />
prefix of another.<br />
Example: A self-delimit<strong>in</strong>g code (e.g. P = {0, 10, 11}) is prefix-free.<br />
Kraft Inequality<br />
For a prefix code P we have ∑_{x∈P} 2^{-l(x)} ≤ 1.<br />
Conversely, let l_1, l_2, ... be a countable sequence of natural numbers<br />
such that Kraft’s inequality ∑_k 2^{-l_k} ≤ 1 is satisfied. Then there exists<br />
a prefix code P with these lengths of its binary code.
<strong>Marcus</strong> <strong>Hutter</strong> - 50 - Universal Induction & Intelligence<br />
Proof of the Kraft Inequality<br />
Proof ⇒: Assign to each x ∈ P the interval Γ_x := [0.x, 0.x + 2^{-l(x)}).<br />
The length of interval Γ_x is 2^{-l(x)}.<br />
The intervals are disjoint, since P is prefix-free, hence<br />
∑_{x∈P} 2^{-l(x)} = ∑_{x∈P} Length(Γ_x) ≤ Length([0, 1]) = 1.<br />
⇐: Idea: Choose l_1, l_2, ... in increasing order. Successively chop off<br />
intervals of lengths 2^{-l_1}, 2^{-l_2}, ... from left to right from [0, 1) and<br />
define the left interval boundary as the code.
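The ⇐ direction is constructive; a minimal sketch of the interval-chopping procedure (function name and floating-point representation are my own; fine for the short code lengths used here):

```python
def kraft_code(lengths):
    """Build a binary prefix code from code lengths satisfying Kraft's
    inequality, by chopping intervals of size 2^-l off [0, 1) from left
    to right and reading off each left boundary as a codeword.
    Codewords are returned in order of increasing length."""
    assert sum(2.0 ** -l for l in lengths) <= 1.0, "Kraft inequality violated"
    codes, left = [], 0.0
    for l in sorted(lengths):              # increasing order, as in the proof
        # binary expansion of the left interval boundary, truncated to l bits
        frac, bits = left, []
        for _ in range(l):
            frac *= 2
            bit = int(frac)
            frac -= bit
            bits.append(str(bit))
        codes.append("".join(bits))
        left += 2.0 ** -l                  # chop the interval off
    return codes

code = kraft_code([1, 2, 2])
```

For the lengths 1, 2, 2 this reproduces ['0', '10', '11'], the self-delimiting code of the previous slide.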
<strong>Marcus</strong> <strong>Hutter</strong> - 51 - Universal Induction & Intelligence<br />
Priors from Prefix Codes<br />
• Let Code(H ν ) be a prefix code <strong>of</strong> hypothesis H ν .<br />
• Define complexity Kw(ν) := Length(Code(H_ν)).<br />
• Choose prior w_ν = p(H_ν) = 2^{-Kw(ν)}<br />
⇒ w is a semi-probability: ∑_{ν∈M} w_ν ≤ 1 (by Kraft).<br />
• How to choose a code and hypothesis space M?<br />
• Praxis: Choose a code which is reasonable for your problem<br />
and M large enough to conta<strong>in</strong> the true model.<br />
• Theory: Choose a universal code and consider “all” hypotheses ...
<strong>Marcus</strong> <strong>Hutter</strong> - 52 - Universal Induction & Intelligence<br />
Kolmogorov Complexity K(x)<br />
K. of string x is the length of the shortest (prefix) program producing x:<br />
K(x) := min_p {l(p) : U(p) = x}, where U = universal TM.<br />
For non-string objects o (like numbers and functions) we define<br />
K(o) := K(⟨o⟩), where ⟨o⟩ ∈ X^* is some standard code for o.<br />
+ Simple str<strong>in</strong>gs like 000...0 have small K,<br />
irregular (e.g. random) str<strong>in</strong>gs have large K.<br />
• The def<strong>in</strong>ition is nearly <strong>in</strong>dependent <strong>of</strong> the choice <strong>of</strong> U.<br />
+ K satisfies most properties an <strong>in</strong>formation measure should satisfy.<br />
+ K shares many properties with Shannon entropy but is superior.<br />
− K(x) is not computable, but only semi-computable from above.<br />
Conclusion: K is an excellent universal complexity measure,<br />
suitable for quantifying Occam’s razor.
<strong>Marcus</strong> <strong>Hutter</strong> - 53 - Universal Induction & Intelligence<br />
Schematic Graph <strong>of</strong> Kolmogorov Complexity<br />
Although K(x) is <strong>in</strong>computable, we can draw a schematic graph
<strong>Marcus</strong> <strong>Hutter</strong> - 54 - Universal Induction & Intelligence<br />
The Universal Prior<br />
• Quantify the complexity <strong>of</strong> an environment ν or hypothesis H ν by<br />
its Kolmogorov complexity K(ν).<br />
• Universal prior: w_ν = w_ν^U := 2^{-K(ν)} is a decreasing function in<br />
the model’s complexity, and sums to (less than) one.<br />
⇒ D_n ≤ K(µ) ln 2, i.e. the number of ε-deviations of ξ from µ, or of l^Λ_ξ<br />
from l^Λ_µ, is proportional to the complexity of the environment.<br />
• No other semi-computable prior leads to better prediction (bounds).<br />
• For continuous M, we can assign a (proper) universal prior (not<br />
density) w_θ^U = 2^{-K(θ)} > 0 for computable θ, and 0 for uncomputable θ.<br />
• This effectively reduces M to a discrete class {ν_θ ∈ M : w_θ^U > 0},<br />
which is typically dense in M.<br />
• This prior has many advantages over the classical prior (densities).
<strong>Marcus</strong> <strong>Hutter</strong> - 55 - Universal Induction & Intelligence<br />
The Problem <strong>of</strong> Zero Prior<br />
= the problem <strong>of</strong> confirmation <strong>of</strong> universal hypotheses<br />
Problem: If the prior is zero, then the posterior is necessarily also zero.<br />
Example: Consider the hypothesis H = H_1 that all balls in some urn or<br />
all ravens are black (=1), or that the sun rises every day.<br />
Starting with a prior density w(θ) = 1 implies prior P[H_θ] = 0<br />
for all θ, hence posterior P[H_θ|1..1] = 0, hence H never gets confirmed.<br />
3 non-solutions: def<strong>in</strong>e H = {ω = 1 ∞ } | use f<strong>in</strong>ite population | abandon<br />
strict/logical/all-quantified/universal hypotheses <strong>in</strong> favor <strong>of</strong> s<strong>of</strong>t hyp.<br />
Solution: Assign non-zero prior to θ = 1 ⇒ P[H|1^n] → 1.<br />
Generalization: Assign non-zero prior to all “special” θ, like 1/2 and 1/6,<br />
which may naturally appear in a hypothesis, like “is the coin or die fair?”.<br />
Universal solution: Assign non-zero prior to all computable θ, e.g. w_θ^U = 2^{-K(θ)}.
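The effect of a prior atom can be checked numerically; a toy sketch (the 50/50 mixture weight is illustrative) for the universal hypothesis H: θ = 1:

```python
def posterior_universal_hyp(n: int, w_atom: float = 0.5) -> float:
    """Posterior belief in H: 'theta = 1' (the sun always rises) after
    observing n ones, when the prior mixes a point mass w_atom at theta = 1
    with a uniform density on [0, 1] (weight 1 - w_atom).
    Under the uniform density, P[1^n] = integral of theta^n dtheta = 1/(n+1)."""
    p_atom = w_atom * 1.0                  # P[1^n | theta = 1] = 1
    p_density = (1.0 - w_atom) / (n + 1)   # evidence from the density part
    return p_atom / (p_atom + p_density)

# w_atom = 0 (pure density prior): the posterior of H stays 0 forever;
# w_atom > 0: the posterior tends to 1 as the run of ones grows.
```

For example, after 100 sunrises the posterior is already 101/102 ≈ 0.99, while the pure density prior remains stuck at 0.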
<strong>Marcus</strong> <strong>Hutter</strong> - 56 - Universal Induction & Intelligence<br />
Reparametrization Invariance<br />
• New parametrization, e.g. ψ = √θ; then the ψ-density<br />
w′(ψ) = 2√θ w(θ) is no longer uniform if w(θ) = 1 is uniform<br />
⇒ the indifference principle is not reparametrization invariant (RIP).<br />
• Jeffreys’ and Bernardo’s principles satisfy RIP w.r.t. differentiable<br />
bijective transformations ψ = f^{-1}(θ).<br />
• The universal prior w_θ^U = 2^{-K(θ)} also satisfies RIP w.r.t. simple<br />
computable f (within a multiplicative constant).
<strong>Marcus</strong> <strong>Hutter</strong> - 57 - Universal Induction & Intelligence<br />
Regroup<strong>in</strong>g Invariance<br />
• Non-bijective transformations:<br />
E.g. group<strong>in</strong>g ball colors <strong>in</strong>to categories black/non-black.<br />
• No classical pr<strong>in</strong>ciple is regroup<strong>in</strong>g <strong>in</strong>variant.<br />
• Regroup<strong>in</strong>g <strong>in</strong>variance is regarded as a very important and desirable<br />
property. [Walley’s (1996) solution: sets <strong>of</strong> priors]<br />
• The universal prior w_θ^U = 2^{-K(θ)} is invariant under regrouping, and<br />
more generally under all simple [computable with complexity O(1)],<br />
even non-bijective, transformations (within a multiplicative constant).<br />
• Note: Reparametrization and regroup<strong>in</strong>g <strong>in</strong>variance hold for<br />
arbitrary classes and are not limited to the i.i.d. case.
<strong>Marcus</strong> <strong>Hutter</strong> - 58 - Universal Induction & Intelligence<br />
Universal Choice <strong>of</strong> Class M<br />
• The larger M, the less restrictive is the assumption µ ∈ M.<br />
• The class M_U of all (semi)computable (semi)measures, although<br />
only countable, is pretty large, since it includes all valid physics<br />
theories. Further, ξ_U is semi-computable [ZL70].<br />
• Solomonoff’s universal prior M(x) := probability that the output of<br />
a universal TM U with random input starts with x.<br />
• Formally: M(x) := ∑_{p : U(p)=x*} 2^{-l(p)}, where the sum is over all<br />
(minimal) programs p for which U outputs a string starting with x.<br />
• M may be regarded as a 2^{-l(p)}-weighted mixture over all<br />
deterministic environments ν_p. (ν_p(x) = 1 if U(p) = x* and 0 else)<br />
• M(x) coincides with ξ_U(x) within an irrelevant multiplicative constant.
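The sum defining M(x) can be imitated on a toy machine. The "pattern-repeater" machine below is entirely hypothetical (it is not universal); it only illustrates the 2^{-l(p)}-weighted sum over programs:

```python
from itertools import product

def run_toy_machine(p: str, n: int):
    """Hypothetical toy machine: a valid program p = 1^k 0 <k pattern bits>
    outputs its k-bit pattern repeated forever. Returns the first n output
    bits, or None if p is malformed. (A stand-in for a real universal
    prefix TM, which this machine is not.)"""
    k = 0
    while k < len(p) and p[k] == "1":
        k += 1                              # leading 1s encode pattern length
    body = p[k + 1:] if k < len(p) else None
    if body is None or k == 0 or len(body) != k:
        return None
    return "".join(body[i % k] for i in range(n))

def M_approx(x: str, max_len: int = 12) -> float:
    """Approximate M(x) = sum over programs p with U(p) starting with x
    of 2^-l(p), by enumerating all programs up to length max_len."""
    total = 0.0
    for l in range(1, max_len + 1):
        for bits in product("01", repeat=l):
            p = "".join(bits)
            if run_toy_machine(p, len(x)) == x:
                total += 2.0 ** -l
    return total
```

Even on this toy machine, simple strings receive more weight: "0000" is already produced by the 3-bit program "100" (pattern "0"), while "0110" needs at least the 7-bit program "1110011", so M_approx("0000") > M_approx("0110").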
<strong>Marcus</strong> <strong>Hutter</strong> - 59 - Universal Induction & Intelligence<br />
The Problem <strong>of</strong> Old Evidence / New Theories<br />
• What if some evidence E ≙ x (e.g. Mercury’s perihelion advance) is<br />
known well before the correct hypothesis/theory/model H ≙ µ<br />
(Einstein’s general relativity theory) is found?<br />
• How shall H be added to the Bayesian machinery a posteriori?<br />
• What should the “prior” of H be?<br />
• Should it be the belief in H in a hypothetical counterfactual world<br />
in which E is not known?<br />
• Can old evidence E confirm H?<br />
• After all, H could simply be constructed/biased/fitted towards<br />
“explaining” E.
<strong>Marcus</strong> <strong>Hutter</strong> - 60 - Universal Induction & Intelligence<br />
Solution <strong>of</strong> the Old-Evidence Problem<br />
• The universal class M_U and universal prior w_ν^U formally solve this<br />
problem.<br />
• The universal prior of H is 2^{-K(H)}, independent of M and of<br />
whether E is known or not.<br />
• Updating M is unproblematic, and even unnecessary when<br />
starting with M_U, since it includes all hypotheses (including yet<br />
unknown or unnamed ones) a priori.
<strong>Marcus</strong> <strong>Hutter</strong> - 61 - Universal Induction & Intelligence<br />
Universal is Better than Cont<strong>in</strong>uous M<br />
• Although ν_θ() and w_θ are incomputable for continuous classes M for most θ,<br />
ξ() is typically computable. (exactly as for Laplace, or numerically)<br />
⇒ D_n(µ||M) ≤ D_n(µ||ξ) + K(ξ) ln 2 for all µ<br />
• That is, M is superior to all computable mixture predictors ξ based<br />
on any (continuous or discrete) model class M and weight w(θ),<br />
save an additive constant K(ξ) ln 2 = O(1), even if the environment µ<br />
is not computable.<br />
• While D_n(µ||ξ) ∼ (d/2) ln n for all µ ∈ M,<br />
D_n(µ||M) ≤ K(µ) ln 2 is even finite for computable µ.<br />
Conclusion:<br />
Solomonoff prediction works also in non-computable environments
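The contrast between the (d/2) ln n growth and the constant K(µ) ln 2 bound can be verified numerically for the Bernoulli/Laplace case (a sketch; D_n is computed exactly by summing over the number of ones):

```python
import math

def D_n(theta: float, n: int) -> float:
    """Relative entropy D_n(mu||xi) = E_mu[ln(mu(x_1:n)/xi(x_1:n))] between
    Bernoulli(theta) and the Laplace mixture xi, for which any sequence with
    k ones has xi-probability k!(n-k)!/(n+1)!  (uniform prior density)."""
    total = 0.0
    for k in range(n + 1):
        log_mu = k * math.log(theta) + (n - k) * math.log(1.0 - theta)
        log_xi = (math.lgamma(k + 1) + math.lgamma(n - k + 1)
                  - math.lgamma(n + 2))
        # binomial(n, k) times the mu-probability of one sequence with k ones
        log_weight = (math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1) + log_mu)
        total += math.exp(log_weight) * (log_mu - log_xi)
    return total

# D_n grows like (d/2) ln n with d = 1 parameter here, i.e. unboundedly in n,
# whereas the universal bound K(mu) ln 2 is a constant for computable mu.
```

For θ = 1/2, the increase from n = 100 to n = 1000 is close to (1/2) ln 10 ≈ 1.15 nats, matching the (d/2) ln n asymptotics.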
<strong>Marcus</strong> <strong>Hutter</strong> - 62 - Universal Induction & Intelligence<br />
Convergence and Loss Bounds<br />
• Total (loss) bounds: ∑_{n=1}^∞ E[h_n] ≤ K(µ) ln 2 (up to a multiplicative<br />
constant), where h_t(ω_{<t}) is the instantaneous (Hellinger) distance<br />
between the µ- and ξ-predictions at time t.
<strong>Marcus</strong> <strong>Hutter</strong> - 63 - Universal Induction & Intelligence<br />
More Stuff / Critique / Problems<br />
• Prior knowledge y can be incorporated by using the “subjective” prior<br />
w_{ν|y}^U = 2^{-K(ν|y)} or by prefixing observation x by y.<br />
• Additive/multiplicative constant fudges and U-dependence is <strong>of</strong>ten<br />
(but not always) harmless.<br />
• Incomputability: K and M can serve as “gold standards” which<br />
practitioners should aim at, but have to be (crudely) approximated<br />
<strong>in</strong> practice (MDL [Ris89], MML [Wal05], LZW [LZ76], CTW [WSTT95],<br />
NCD [CV05]).<br />
• The Minimum Description Length Principle:<br />
M(x) ≈ 2^{-K_U(x)} ≈ 2^{-K_T(x)}. Predicting the y with highest M(y|x) is<br />
approximately the same as MDL: predict the y with smallest K_T(xy).
<strong>Marcus</strong> <strong>Hutter</strong> - 64 - Universal Induction & Intelligence<br />
Universal Inductive Inference: Summary<br />
Universal Solomon<strong>of</strong>f prediction solves/avoids/meliorates many problems<br />
<strong>of</strong> (Bayesian) <strong>in</strong>duction. We discussed:<br />
+ general total bounds for generic class, prior, and loss,<br />
+ i.i.d./universal-specific <strong>in</strong>stantaneous and future bounds,<br />
+ the D_n bound for continuous classes,<br />
+ <strong>in</strong>difference/symmetry pr<strong>in</strong>ciples,<br />
+ the problem <strong>of</strong> zero p(oste)rior & confirm. <strong>of</strong> universal hypotheses,<br />
+ reparametrization and regroup<strong>in</strong>g <strong>in</strong>variance,<br />
+ the problem <strong>of</strong> old evidence and updat<strong>in</strong>g,<br />
+ that M works even <strong>in</strong> non-computable environments,<br />
+ how to <strong>in</strong>corporate prior knowledge,<br />
− the prediction <strong>of</strong> short sequences,<br />
− the constant fudges <strong>in</strong> all results and the U-dependence,<br />
− M’s <strong>in</strong>computability and crude practical approximations.
<strong>Marcus</strong> <strong>Hutter</strong> - 65 - Universal Induction & Intelligence<br />
Outlook<br />
• Relation to Prediction with Expert Advice<br />
• Relation to the Minimum Description Length (MDL) Principle<br />
• Generalization to Active/Re<strong>in</strong>forcement learn<strong>in</strong>g (AIXI)<br />
Open Problems<br />
• Prediction <strong>of</strong> selected bits<br />
• Convergence of M on Martin-Löf random sequences<br />
• Better <strong>in</strong>stantaneous bounds for M<br />
• Future bounds for Bayes (general ξ)
<strong>Marcus</strong> <strong>Hutter</strong> - 66 - Universal Induction & Intelligence<br />
UNIVERSAL RATIONAL AGENTS<br />
• Rational agents<br />
• Sequential decision theory<br />
• Re<strong>in</strong>forcement learn<strong>in</strong>g<br />
• Value function<br />
• Universal Bayes mixture and AIXI model<br />
• Self-optimiz<strong>in</strong>g and Pareto-optimal policies<br />
• Environmental Classes<br />
• The horizon problem<br />
• Computational Issues
<strong>Marcus</strong> <strong>Hutter</strong> - 67 - Universal Induction & Intelligence<br />
Universal Rational Agents: Abstract<br />
Sequential decision theory formally solves the problem <strong>of</strong> rational agents<br />
<strong>in</strong> uncerta<strong>in</strong> worlds if the true environmental prior probability distribution<br />
is known. Solomon<strong>of</strong>f’s theory <strong>of</strong> universal <strong>in</strong>duction formally solves the<br />
problem <strong>of</strong> sequence prediction for unknown prior distribution.<br />
Here we comb<strong>in</strong>e both ideas and develop an elegant parameter-free<br />
theory <strong>of</strong> an optimal re<strong>in</strong>forcement learn<strong>in</strong>g agent embedded <strong>in</strong> an<br />
arbitrary unknown environment that possesses essentially all aspects <strong>of</strong><br />
rational <strong>in</strong>telligence. The theory reduces all conceptual AI problems to<br />
pure computational ones.<br />
There are strong arguments that the result<strong>in</strong>g AIXI model is the most<br />
<strong>in</strong>telligent unbiased agent possible. Other discussed topics are relations<br />
between problem classes, the horizon problem, and computational issues.
<strong>Marcus</strong> <strong>Hutter</strong> - 68 - Universal Induction & Intelligence<br />
The Agent Model<br />
Most if not all AI problems can be formulated within the agent framework:<br />
[Diagram: agent p and environment q, each with a work tape, interact in cycles:<br />
the agent outputs actions y_1 y_2 y_3 ..., and the environment responds with<br />
perceptions r_1|o_1, r_2|o_2, r_3|o_3, ... (reward and observation).]
<strong>Marcus</strong> <strong>Hutter</strong> - 69 - Universal Induction & Intelligence<br />
Rational Agents <strong>in</strong> Determ<strong>in</strong>istic Environments<br />
- p : X^* → Y^* is the deterministic policy of the agent,<br />
p(x_{<k}) = y_{1:k}.
<strong>Marcus</strong> <strong>Hutter</strong> - 70 - Universal Induction & Intelligence<br />
Agents <strong>in</strong> Probabilistic Environments<br />
Given history y 1:k x
<strong>Marcus</strong> <strong>Hutter</strong> - 71 - Universal Induction & Intelligence<br />
Optimal Policy and Value<br />
The σ-optimal policy p^σ := arg max_p V_σ^p maximizes V_σ^p ≤ V_σ^* := V_σ^{p^σ}.<br />
Explicit expressions for the action y_k in cycle k of the σ-optimal policy<br />
p^σ and its value V_σ^* are<br />
y_k = arg max_{y_k} ∑_{x_k} max_{y_{k+1}} ∑_{x_{k+1}} ... max_{y_m} ∑_{x_m} (r_k + ... + r_m) · σ(x_{k:m}|y_{1:m} x_{<k})
<strong>Marcus</strong> <strong>Hutter</strong> - 72 - Universal Induction & Intelligence<br />
Expectimax Tree/Algorithm<br />
[Diagram: expectimax tree computing V_σ^*, alternating max nodes over actions y<br />
and expectation (chance) nodes over percepts x.]
<strong>Marcus</strong> <strong>Hutter</strong> - 73 - Universal Induction & Intelligence<br />
Known environment µ<br />
• Assumption: µ is the true environment <strong>in</strong> which the agent operates<br />
• Then, policy p^µ is optimal in the sense that no other policy for an<br />
agent leads to higher µ^AI-expected reward.<br />
• Special choices of µ: deterministic environments, adversarial environments,<br />
Markov decision processes (MDPs).<br />
• There is no problem in principle in computing the optimal action y_k as<br />
long as µ^AI is known and computable and X, Y and m are finite.<br />
• Th<strong>in</strong>gs drastically change if µ AI is unknown ...
<strong>Marcus</strong> <strong>Hutter</strong> - 74 - Universal Induction & Intelligence<br />
The Bayes-mixture distribution ξ<br />
Assumption: The true environment µ is unknown.<br />
Bayesian approach: The true probability distribution µ^AI is not learned<br />
directly, but is replaced by a Bayes-mixture ξ^AI.<br />
Assumption: We know that the true environment µ is conta<strong>in</strong>ed <strong>in</strong> some<br />
known (f<strong>in</strong>ite or countable) set M <strong>of</strong> environments.<br />
The Bayes-mixture ξ is defined as<br />
ξ(x_{1:m}|y_{1:m}) := ∑_{ν∈M} w_ν ν(x_{1:m}|y_{1:m}) with ∑_{ν∈M} w_ν = 1, w_ν > 0 ∀ν<br />
The weights w ν may be <strong>in</strong>terpreted as the prior degree <strong>of</strong> belief that the<br />
true environment is ν.<br />
Then ξ(x 1:m |y 1:m ) could be <strong>in</strong>terpreted as the prior subjective belief<br />
probability <strong>in</strong> observ<strong>in</strong>g x 1:m , given actions y 1:m .
<strong>Marcus</strong> <strong>Hutter</strong> - 75 - Universal Induction & Intelligence<br />
Questions <strong>of</strong> Interest<br />
• It is natural to follow the policy p^ξ which maximizes V_ξ^p.<br />
• If µ is the true environment, the expected reward when following<br />
policy p^ξ will be V_µ^{p^ξ}.<br />
• The optimal (but infeasible) policy p^µ yields reward V_µ^{p^µ} ≡ V_µ^*.<br />
• Are there policies with uniformly larger value than V_µ^{p^ξ}?<br />
• How close is V_µ^{p^ξ} to V_µ^*?<br />
• What are the most general class M and weights w_ν?
<strong>Marcus</strong> <strong>Hutter</strong> - 76 - Universal Induction & Intelligence<br />
A universal choice <strong>of</strong> ξ and M<br />
• We have to assume the existence <strong>of</strong> some structure on the<br />
environment to avoid the No-Free-Lunch Theorems [Wolpert 96].<br />
• We can only unravel effective structures which are describable by<br />
(semi)computable probability distributions.<br />
• So we may <strong>in</strong>clude all (semi)computable (semi)distributions <strong>in</strong> M.<br />
• Occam’s razor and Epicurus’ pr<strong>in</strong>ciple <strong>of</strong> multiple explanations tell<br />
us to assign high prior belief to simple environments.<br />
• Us<strong>in</strong>g Kolmogorov’s universal complexity measure K(ν) for<br />
environments ν one should set w ν ∼ 2 −K(ν) , where K(ν) is the<br />
length <strong>of</strong> the shortest program on a universal TM comput<strong>in</strong>g ν.<br />
• The result<strong>in</strong>g AIXI model [<strong>Hutter</strong>:00] is a unification <strong>of</strong> (Bellman’s)<br />
sequential decision and Solomon<strong>of</strong>f’s universal <strong>in</strong>duction theory.
<strong>Marcus</strong> <strong>Hutter</strong> - 77 - Universal Induction & Intelligence<br />
The AIXI Model <strong>in</strong> one L<strong>in</strong>e<br />
complete & essentially unique & limit-computable<br />
AIXI: a_k := arg max_{a_k} ∑_{o_k r_k} ... max_{a_m} ∑_{o_m r_m} [r_k + ... + r_m] ∑_{p : U(p,a_1..a_m)=o_1 r_1..o_m r_m} 2^{-length(p)}<br />
k = now, a_k = action, o_k = observation, r_k = reward, U = universal TM, p = program, m = lifespan<br />
AIXI is an elegant mathematical theory <strong>of</strong> AI<br />
Claim: AIXI is the most intelligent environment-independent, i.e.<br />
universally optimal, agent possible.<br />
Pro<strong>of</strong>: For formalizations, quantifications, and pro<strong>of</strong>s, see [Hut05].<br />
Applications: Strategic Games, Function Optimization, Supervised<br />
Learn<strong>in</strong>g, Sequence Prediction, Classification, ...<br />
In the follow<strong>in</strong>g we consider generic M and w ν .
<strong>Marcus</strong> <strong>Hutter</strong> - 78 - Universal Induction & Intelligence<br />
Pareto-Optimality <strong>of</strong> p ξ<br />
Policy p^ξ is Pareto-optimal in the sense that there is no other policy p<br />
with V_ν^p ≥ V_ν^{p^ξ} for all ν ∈ M and strict inequality for at least one ν.<br />
Self-optimiz<strong>in</strong>g Policies<br />
Under which circumstances does the value of the universal policy p^ξ<br />
converge to the optimum?<br />
V^{p^ξ}_ν → V^*_ν for horizon m → ∞ for all ν ∈ M. (1)<br />
The least we must demand from M to have a chance that (1) is true is<br />
that there exists some policy p̃ at all with this property, i.e.<br />
∃p̃ : V^{p̃}_ν → V^*_ν for horizon m → ∞ for all ν ∈ M. (2)<br />
Ma<strong>in</strong> result: (2) ⇒ (1): The necessary condition <strong>of</strong> the existence <strong>of</strong> a<br />
self-optimiz<strong>in</strong>g policy ˜p is also sufficient for p ξ to be self-optimiz<strong>in</strong>g.
Environments w. (Non)Self-Optimiz<strong>in</strong>g Policies
Particularly Interest<strong>in</strong>g Environments<br />
• Sequence Prediction, e.g. weather<br />
or stock-market prediction.<br />
Strong result: V^*_µ − V^{p^ξ}_µ = O(√(K(µ)/m)), m = horizon.<br />
• Strategic Games: Learn to play well (m<strong>in</strong>imax) strategic zero-sum<br />
games (like chess) or even exploit limited capabilities <strong>of</strong> opponent.<br />
• Optimization: F<strong>in</strong>d (approximate) m<strong>in</strong>imum <strong>of</strong> function with as few<br />
function calls as possible. Difficult exploration versus exploitation<br />
problem.<br />
• Supervised learn<strong>in</strong>g: Learn functions by present<strong>in</strong>g (z, f(z)) pairs<br />
and ask for function values <strong>of</strong> z ′ by present<strong>in</strong>g (z ′ , ) pairs.<br />
Supervised learn<strong>in</strong>g is much faster than re<strong>in</strong>forcement learn<strong>in</strong>g.<br />
AIξ quickly learns to predict, play games, optimize, and learn supervised.
Future Value and the Right Discount<strong>in</strong>g<br />
• Elim<strong>in</strong>ate the arbitrary horizon parameter m by discount<strong>in</strong>g the<br />
rewards r k ❀ γ k r k with Γ k := ∑ ∞<br />
i=k γ i < ∞ and lett<strong>in</strong>g m → ∞:<br />
Vkγ πσ :=<br />
1 ∑<br />
lim (γ k r k +...+γ m r m )σ(x k:m |y 1:m x
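The normalized discounted value above can be illustrated numerically. A minimal sketch, assuming quadratic discounting γ_k = 1/k², one standard choice satisfying both Γ_k < ∞ and γ_{k+1}/γ_k → 1; the constant-reward sequence is an illustrative assumption:

```python
# Quadratic discounting gamma_k = 1/k**2 satisfies both Gamma_k < infinity
# and gamma_{k+1}/gamma_k -> 1.
k, m = 3, 100_000  # current cycle and a large horizon approximating m -> infinity
gammas = [0.0] + [1.0 / i**2 for i in range(1, m + 1)]  # gammas[i] = gamma_i

Gamma_k = sum(gammas[k:])  # normalizer: the (truncated) tail sum

# Constant reward r_i = 1: the normalized discounted value is 1, independent of k.
V = sum(gammas[i] * 1.0 for i in range(k, m + 1)) / Gamma_k
print(V)  # 1.0
```

For a constant reward sequence the normalization by Γ_k makes the value exactly the reward level, which is the point of dividing by Γ_k.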
Universal Rational Agents: Summary<br />
• Setup: Agents act<strong>in</strong>g <strong>in</strong> general probabilistic environments with<br />
re<strong>in</strong>forcement feedback.<br />
• Assumptions: True environment µ belongs to a known class <strong>of</strong><br />
environments M, but is otherwise unknown.<br />
• Results: The Bayes-optimal policy p^ξ based on the Bayes-mixture<br />
ξ = ∑_{ν∈M} w_ν ν is Pareto-optimal and self-optimizing if M admits<br />
self-optimizing policies.<br />
• Application: The class <strong>of</strong> ergodic mdps admits self-optimiz<strong>in</strong>g<br />
policies.<br />
• New: Policy p ξ with unbounded effective horizon is the first purely<br />
Bayesian self-optimiz<strong>in</strong>g consistent policy for ergodic mdps.<br />
• Learn: The combined conditions Γ_k < ∞ and γ_{k+1}/γ_k → 1 allow a<br />
consistent self-optimizing Bayes-optimal policy based on mixtures.
Universal Rational Agents: Remarks<br />
• We have developed a parameterless AI model based on sequential<br />
decisions and algorithmic probability.<br />
• We have reduced the AI problem to pure computational questions.<br />
• AIξ seems not to lack any important known methodology <strong>of</strong> AI,<br />
apart from computational aspects.<br />
• There is no need for implement<strong>in</strong>g extra knowledge, as this can be<br />
learned by present<strong>in</strong>g it <strong>in</strong> o k <strong>in</strong> any form.<br />
• The learn<strong>in</strong>g process itself is an important aspect <strong>of</strong> AI.<br />
• Noise or irrelevant information in the inputs does not disturb the AIξ<br />
system.<br />
• Philosophical questions: relevance <strong>of</strong> non-computational physics<br />
(Penrose), number <strong>of</strong> wisdom Ω (Chait<strong>in</strong>), consciousness, social<br />
consequences.
Universal Rational Agents: Outlook<br />
• Cont<strong>in</strong>uous classes M.<br />
• Restricted policy classes.<br />
• Non-asymptotic bounds.<br />
• Tighter bounds by exploit<strong>in</strong>g extra properties <strong>of</strong> the environments,<br />
like the mix<strong>in</strong>g rate <strong>of</strong> mdps.<br />
• Search for other performance criteria [<strong>Hutter</strong>:00].<br />
• Instead <strong>of</strong> convergence <strong>of</strong> the expected reward sum, study<br />
convergence with high probability <strong>of</strong> the actually realized reward<br />
sum.<br />
• Other environmental classes (separability concepts, downscal<strong>in</strong>g).
APPROXIMATIONS & APPLICATIONS<br />
• Universal Similarity Metric<br />
• Universal Search<br />
• Time-Bounded AIXI Model<br />
• Brute-Force Approximation <strong>of</strong> AIXI<br />
• A Monte-Carlo AIXI Approximation<br />
• Feature Re<strong>in</strong>forcement Learn<strong>in</strong>g<br />
• Comparison to other approaches<br />
• Future directions, wrap-up, references.
Approximations & Applications: Abstract<br />
Many fundamental theories have to be approximated for practical use.<br />
S<strong>in</strong>ce the core quantities <strong>of</strong> universal <strong>in</strong>duction and universal <strong>in</strong>telligence<br />
are <strong>in</strong>computable, it is <strong>of</strong>ten hard, but not impossible, to approximate<br />
them. In any case, hav<strong>in</strong>g these “gold standards” to approximate<br />
(top→down) or to aim at (bottom→up) is extremely helpful <strong>in</strong> build<strong>in</strong>g<br />
truly <strong>in</strong>telligent systems. The most impressive direct approximation <strong>of</strong><br />
Kolmogorov complexity to-date is via the universal similarity metric<br />
applied to a variety <strong>of</strong> real-world cluster<strong>in</strong>g problems. A couple <strong>of</strong><br />
universal search algorithms ((adaptive) Lev<strong>in</strong> search, FastPrg, OOPS,<br />
Goedel-mach<strong>in</strong>e, ...) that f<strong>in</strong>d short programs have been developed and<br />
applied to a variety of toy problems. The AIXI model itself has been<br />
approximated <strong>in</strong> a couple <strong>of</strong> ways (AIXItl, Brute Force, Monte Carlo,<br />
Feature RL). Some recent applications will be presented. The Lecture<br />
Series concludes by compar<strong>in</strong>g various learn<strong>in</strong>g algorithms along various<br />
dimensions, po<strong>in</strong>t<strong>in</strong>g to future directions, wrap-up, and references.
Conditional Kolmogorov Complexity<br />
Question: When is object=string x similar to object=string y?<br />
Universal solution: x is similar to y ⇔ x can be easily (re)constructed from y<br />
⇔ Kolmogorov complexity K(x|y) := min{l(p) : U(p, y) = x} is small.<br />
Examples:<br />
1) x is very similar to itself (K(x|x) =⁺ 0).<br />
2) A processed x is similar to x (K(f(x)|x) =⁺ 0 if K(f) = O(1)),<br />
e.g. doubling, reverting, inverting, encrypting, partially deleting x.<br />
3) A random string is with high probability not similar to any other<br />
string (K(random|y) = length(random)).<br />
The problem with K(x|y) as similarity=distance measure is that it is<br />
neither symmetric nor normalized nor computable.
The Universal Similarity Metric [CV’05]<br />
• Symmetrization and normalization leads to a/the universal metric d:<br />
0 ≤ d(x, y) := max{K(x|y), K(y|x)} / max{K(x), K(y)} ≤ 1<br />
• Every effective similarity between x and y is detected by d<br />
• Use K(x|y) ≈ K(xy) − K(y) and K(x) ≡ K_U(x) ≈ K_T(x) for a real compressor T (coding)<br />
=⇒ computable approximation: Normalized compression distance:<br />
d(x, y) ≈ [K_T(xy) − min{K_T(x), K_T(y)}] / max{K_T(x), K_T(y)} ≤ 1<br />
• For T choose Lempel-Ziv or gzip or bzip(2) (de)compressor <strong>in</strong> the<br />
applications below.<br />
• Theory: Lempel-Ziv compresses asymptotically better than any<br />
probabilistic f<strong>in</strong>ite state automaton predictor/compressor.
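A minimal sketch of the normalized compression distance, with zlib standing in for the compressor T; the test strings are illustrative assumptions:

```python
import zlib

def C(s: bytes) -> int:
    """Approximate K_T(s) by the compressed length under zlib (level 9)."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x),C(y))) / max(C(x),C(y))."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy cat " * 20
c = bytes(range(256)) * 4  # unrelated, less compressible data

print(ncd(a, b), ncd(a, c))  # the similar pair should score lower than the dissimilar one
```

Real compressors are not ideal Kolmogorov compressors, so NCD values can slightly exceed 1, but the ordering of similar versus dissimilar pairs is typically preserved.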
Tree-Based Cluster<strong>in</strong>g [CV’05]<br />
• If many objects x_1, ..., x_n need to be compared, determine the<br />
similarity matrix: M_ij = d(x_i, x_j) for 1 ≤ i, j ≤ n<br />
• Now cluster similar objects.<br />
• There are various cluster<strong>in</strong>g techniques.<br />
• Tree-based cluster<strong>in</strong>g: Create a tree connect<strong>in</strong>g similar objects,<br />
• e.g. quartet method (for cluster<strong>in</strong>g)<br />
• Applications: Phylogeny <strong>of</strong> 24 Mammal mtDNA,<br />
50 Language Tree (based on declaration <strong>of</strong> human rights),<br />
composers <strong>of</strong> music, authors <strong>of</strong> novels, SARS virus, fungi,<br />
optical characters, galaxies, ...<br />
[Cilibrasi&Vitanyi’05]
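The similarity-matrix step can be sketched as follows; the toy bilingual corpus and the nearest-neighbour grouping are illustrative assumptions, far simpler than the quartet method used in [CV'05]:

```python
import zlib
from itertools import combinations

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with zlib as the compressor."""
    c = lambda s: len(zlib.compress(s, 9))
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

# Hypothetical toy corpus: two "English" and two "German" samples.
docs = {
    "en1": b"the rights of all human beings are equal " * 15,
    "en2": b"all human beings have equal rights and dignity " * 15,
    "de1": b"alle menschen sind frei und gleich an rechten geboren " * 15,
    "de2": b"alle menschen haben gleiche rechte und wuerde " * 15,
}

# Similarity matrix M_ij = d(x_i, x_j), then group each item with its nearest neighbour.
names = list(docs)
M = {(i, j): ncd(docs[i], docs[j]) for i, j in combinations(names, 2)}
nearest = {i: min((j for j in names if j != i),
                  key=lambda j: M.get((i, j), M.get((j, i))))
           for i in names}
print(nearest)  # each text's nearest neighbour should be its same-language partner
```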
Genomics & Phylogeny: Mammals [CV’05]<br />
Evolutionary tree built from complete mammalian mtDNA <strong>of</strong> 24 species:<br />
Carp<br />
Cow<br />
BlueWhale<br />
F<strong>in</strong>backWhale<br />
Cat<br />
BrownBear<br />
PolarBear<br />
GreySeal<br />
HarborSeal<br />
Horse<br />
WhiteRh<strong>in</strong>o<br />
Gibbon<br />
Gorilla<br />
Human<br />
Chimpanzee<br />
PygmyChimp<br />
Orangutan<br />
SumatranOrangutan<br />
HouseMouse<br />
Rat<br />
Opossum<br />
Wallaroo<br />
Echidna<br />
Platypus<br />
Ferungulates<br />
Eutheria<br />
Primates<br />
Eutheria - Rodents<br />
Metatheria<br />
Prototheria
Language Tree (Re)construction<br />
based on “The Universal Declaration <strong>of</strong><br />
Human Rights” <strong>in</strong> 50 languages.<br />
ROMANCE<br />
GERMANIC<br />
SLAVIC<br />
BALTIC<br />
ALTAIC<br />
CELTIC<br />
UGROFINNIC<br />
Basque [Spa<strong>in</strong>]<br />
Hungarian [Hungary]<br />
Polish [Poland]<br />
Sorbian [Germany]<br />
Slovak [Slovakia]<br />
Czech [Czech Rep]<br />
Slovenian [Slovenia]<br />
Serbian [Serbia]<br />
Bosnian [Bosnia]<br />
Croatian [Croatia]<br />
Romani Balkan [East Europe]<br />
Albanian [Albania]<br />
Lithuanian [Lithuania]<br />
Latvian [Latvia]<br />
Turkish [Turkey]<br />
Uzbek [Uzbekistan]<br />
Breton [France]<br />
Maltese [Malta]<br />
OccitanAuvergnat [France]<br />
Walloon [Belgique]<br />
English [UK]<br />
French [France]<br />
Asturian [Spa<strong>in</strong>]<br />
Portuguese [Portugal]<br />
Spanish [Spa<strong>in</strong>]<br />
Galician [Spa<strong>in</strong>]<br />
Catalan [Spa<strong>in</strong>]<br />
Occitan [France]<br />
Rhaeto Romance [Switzerland]<br />
Friulian [Italy]<br />
Italian [Italy]<br />
Sammar<strong>in</strong>ese [Italy]<br />
Corsican [France]<br />
Sard<strong>in</strong>ian [Italy]<br />
Romanian [Romania]<br />
Romani Vlach [Macedonia]<br />
Welsh [UK]<br />
Scottish Gaelic [UK]<br />
Irish Gaelic [UK]<br />
German [Germany]<br />
Luxembourgish [Luxembourg]<br />
Frisian [Netherlands]<br />
Dutch [Netherlands]<br />
Afrikaans<br />
Swedish [Sweden]<br />
Norwegian Nynorsk [Norway]<br />
Danish [Denmark]<br />
Norwegian Bokmal [Norway]<br />
Faroese [Denmark]<br />
Icelandic [Iceland]<br />
F<strong>in</strong>nish [F<strong>in</strong>land]<br />
Estonian [Estonia]
Universal Search<br />
• Lev<strong>in</strong> search: Fastest algorithm for<br />
<strong>in</strong>version and optimization problems.<br />
• Theoretical application:<br />
Assume somebody found a non-constructive<br />
pro<strong>of</strong> <strong>of</strong> P=NP, then Lev<strong>in</strong>-search is a polynomial<br />
time algorithm for every NP (complete) problem.<br />
• Practical (OOPS) applications (J. Schmidhuber):<br />
Maze, Towers of Hanoi, robotics, ...<br />
• FastPrg: The asymptotically fastest and shortest algorithm for all<br />
well-def<strong>in</strong>ed problems.<br />
• AIXItl: Computable variant <strong>of</strong> AIXI.<br />
• Human Knowledge Compression Prize: (50,000 €)
The Time-Bounded AIXI Model (AIXItl)<br />
An algorithm p best has been constructed for which the follow<strong>in</strong>g holds:<br />
• Let p be any (extended chronological) policy<br />
• with length l(p)≤˜l and computation time per cycle t(p)≤˜t<br />
• for which there exists a pro<strong>of</strong> <strong>of</strong> length ≤l P that p is a valid<br />
approximation.<br />
• Then an algorithm p best can be constructed, depend<strong>in</strong>g on ˜l,˜t and<br />
l P but not on know<strong>in</strong>g p<br />
• which is effectively more or equally <strong>in</strong>telligent accord<strong>in</strong>g to ≽ c than<br />
any such p.<br />
• The size of p^best is l(p^best) = O(ln(l̃·t̃·l_P)),<br />
• the setup-time is t_setup(p^best) = O(l_P² · 2^{l_P}),<br />
• the computation time per cycle is t_cycle(p^best) = O(2^{l̃} · t̃).
Brute-Force Approximation <strong>of</strong> AIXI<br />
• Truncate expectimax tree depth to a small fixed lookahead h.<br />
Optimal action computable <strong>in</strong> time |Y ×X | h × time to evaluate ξ.<br />
• Consider mixtures over Markov Decision Processes (MDPs) only, i.e.<br />
ξ(x_{1:m}|y_{1:m}) = ∑_{ν∈M} w_ν ∏_{t=1}^m ν(x_t|x_{t−1} y_t). Note: ξ itself is not an MDP.<br />
• Choose a uniform prior over the w_ν.<br />
Then ξ(x_{1:m}|y_{1:m}) can be computed in linear time.<br />
• Consider (approximately) Markov problems<br />
with very small action and perception space.<br />
• Example application: 2×2 matrix games like Prisoner's Dilemma,<br />
Stag Hunt, Chicken, Battle of the Sexes, and Matching Pennies. [PH'06]
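A sketch of the truncated expectimax over a Bayes mixture. The two Bernoulli toy environments and the percept-equals-reward convention are illustrative assumptions, much simpler than the MDP mixture above, but the structure (maximize over actions, mix and Bayes-update over environments, recurse to depth h) is the same:

```python
# Hypothetical toy setup: percept x in {0,1} is also the reward. Environment nu
# emits percept 1 with probability p, and action 1 doubles that chance (capped at 1).
def make_env(p):
    def nu(x, y):  # probability of percept x given action y
        q = min(1.0, p * (2 if y == 1 else 1))
        return q if x == 1 else 1 - q
    return nu

envs = {make_env(0.2): 0.5, make_env(0.45): 0.5}  # prior weights w_nu

def xi(x, y, weights):
    """Bayes-mixture predictive probability of percept x after action y."""
    return sum(w * nu(x, y) for nu, w in weights.items())

def expectimax(weights, h):
    """Value of the best action over lookahead h (reward = percept)."""
    if h == 0:
        return 0.0
    best = float("-inf")
    for y in (0, 1):
        val = 0.0
        for x in (0, 1):
            p = xi(x, y, weights)
            if p == 0:
                continue
            # Bayes-update the weights on the hypothetical percept x.
            post = {nu: w * nu(x, y) / p for nu, w in weights.items()}
            val += p * (x + expectimax(post, h - 1))
        best = max(best, val)
    return best

print(expectimax(envs, 3))  # expected reward of acting Bayes-optimally for 3 steps
```

Here action 1 dominates for every posterior, so the 3-step value equals the prior mixture expectation 3·(0.5·0.4 + 0.5·0.9) = 1.95; in general the tree has |Y×X|^h leaves, as stated above.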
AIXI Learns to Play 2×2 Matrix Games<br />
• Repeated Prisoner's Dilemma (loss matrix).<br />
• Game unknown to AIXI; must be learned as well.<br />
• AIXI behaves appropriately.<br />
Figure (omitted): average per-round cooperation ratio over 100 rounds for AIXI vs. random, tit4tat, 2-tit4tat, 3-tit4tat, AIXI, and AIXI2.
A Monte-Carlo AIXI Approximation<br />
Consider class <strong>of</strong> Variable-Order Markov Decision Processes.<br />
The Context Tree Weight<strong>in</strong>g (CTW) algorithm can efficiently mix<br />
(exactly <strong>in</strong> essentially l<strong>in</strong>ear time) all prediction suffix trees.<br />
Monte-Carlo approximation of the expectimax tree (action nodes a_i, observation nodes o_j)<br />
with the Upper Confidence Tree (UCT) algorithm:<br />
• Sample observations from the CTW distribution.<br />
• Select actions with the highest upper confidence bound on the future reward estimate.<br />
• Expand tree by one leaf node (per trajectory).<br />
• Simulate from leaf node further down using (fixed) playout policy.<br />
• Propagate back the value estimates for each node.<br />
Repeat until timeout.<br />
[VNHS’09]<br />
Guaranteed to converge to exact value.<br />
Extension: Predicate CTW based not on raw observations but on features thereof.
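The action-selection step can be sketched with a UCB1-style rule. The function name, statistics format, and constants below are illustrative assumptions; the full UCT algorithm of [VNHS'09] embeds this rule at every tree node:

```python
import math

def ucb_action(stats, c=1.0, reward_range=1.0):
    """Pick the action with the highest upper confidence bound.

    stats: {action: (visit_count, mean_reward_estimate)}
    """
    total = sum(n for n, _ in stats.values())
    def ucb(a):
        n, mean = stats[a]
        if n == 0:
            return float("inf")  # try each unvisited action first
        return mean + c * reward_range * math.sqrt(2 * math.log(total) / n)
    return max(stats, key=ucb)

# After some simulated trajectories, a rarely tried action keeps a large bonus:
stats = {"left": (90, 0.55), "right": (10, 0.50)}
print(ucb_action(stats))  # "right": its exploration bonus outweighs the lower mean
```

The bonus term shrinks as an action's visit count grows, which is what balances exploration against exploitation during the tree search.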
Monte-Carlo AIXI Applications<br />
[Joel Veness et al. 2009]<br />
Figure (omitted): normalized learning scalability, i.e. normalized average reward per trial versus experience (100 to 1,000,000 cycles), approaching the optimum on Tiger, Extended Tiger, 1d Maze, 4x4 Grid, TicTacToe, Cheese Maze, and Pocman*.
Feature Re<strong>in</strong>forcement Learn<strong>in</strong>g (FRL)<br />
Goal: Develop efficient general purpose <strong>in</strong>telligent agent.<br />
State-<strong>of</strong>-the-art: (a) AIXI: Incomputable theoretical solution.<br />
(b) MDP: Efficient limited problem class.<br />
(c) POMDP: Notoriously difficult. (d) PSRs: Underdeveloped.<br />
Idea: ΦMDP reduces real problem to MDP automatically by learn<strong>in</strong>g.<br />
Accomplishments so far: (i) Criterion for evaluat<strong>in</strong>g quality <strong>of</strong> reduction.<br />
(ii) Integration <strong>of</strong> the various parts <strong>in</strong>to one learn<strong>in</strong>g algorithm.<br />
(iii) Generalization to structured MDPs (DBNs)<br />
ΦMDP is a promising path towards the grand goal and an alternative to (a)-(d).<br />
Problem: Find the reduction Φ efficiently (a generic optimization problem).
Markov Decision Processes (MDPs)<br />
a computationally tractable class <strong>of</strong> problems<br />
• MDP Assumption: State s_t := o_t and r_t are<br />
probabilistic functions of o_{t−1} and a_{t−1} only.<br />
• Further Assumption:<br />
State = observation space S is finite and small.<br />
Example MDP (figure omitted): four states s_1–s_4 with rewards r_1–r_4 attached to the transitions between them.<br />
• Goal: Maximize long-term expected reward.<br />
• Learn<strong>in</strong>g: Probability distribution is unknown but can be learned.<br />
• Exploration: Optimal exploration is <strong>in</strong>tractable<br />
but there are polynomial approximations.<br />
• Problem: Real problems are not <strong>of</strong> this simple form.
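Maximizing long-term expected reward in a known MDP is solved by iterating the Bellman optimality equation. A minimal value-iteration sketch on a hypothetical two-state MDP (transition table and discount factor are illustrative assumptions):

```python
# Hypothetical 2-state, 2-action MDP: T[s][a] = [(prob, next_state, reward), ...]
T = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 1.0)], 1: [(1.0, 0, 0.0)]},
}

def value_iteration(T, gamma=0.9, eps=1e-8):
    """Iterate V(s) <- max_a sum_{s'} P(s'|s,a) [r + gamma V(s')] to convergence."""
    V = {s: 0.0 for s in T}
    while True:
        V_new = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                        for outcomes in T[s].values())
                 for s in T}
        if max(abs(V_new[s] - V[s]) for s in T) < eps:
            return V_new
        V = V_new

V = value_iteration(T)
print(V)  # state 1 can loop on itself for reward 1, so V(1) = 1/(1-gamma) = 10
```

With γ = 0.9 the optimal policy stays in state 1 (V(1) = 10) and moves there from state 0 (V(0) = 7.2/0.82 ≈ 8.78).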
Map Real Problem to MDP<br />
Map the history h_t := o_1 a_1 r_1 ... o_{t−1} to a state s_t := Φ(h_t), for example:<br />
Games: Full-information with static opponent: Φ(h_t) = o_t.<br />
Classical physics: Position+velocity of objects = position at two<br />
time-slices: s_t = Φ(h_t) = o_t o_{t−1} is (2nd-order) Markov.<br />
I.i.d. processes of unknown probability (e.g. clinical trials ≃ Bandits):<br />
the frequency of observations Φ(h_n) = (∑_{t=1}^n δ_{o_t,o})_{o∈O} is a sufficient statistic.<br />
Identity: Φ(h) = h is always sufficient, but not learnable.<br />
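Two of the example maps above, sketched in code; the toy observation history is an illustrative assumption:

```python
from collections import Counter

# History h_t = o_1 a_1 r_1 ... o_{t-1}; here just a toy list of observations.
history = ["rain", "sun", "sun", "rain", "sun"]

# Phi for a full-information game with static opponent: keep only the last observation.
phi_last = lambda h: h[-1]

# Phi for an i.i.d. process: observation frequencies are a sufficient statistic.
phi_freq = lambda h: Counter(h)

print(phi_last(history))   # 'sun'
print(phi_freq(history))   # Counter({'sun': 3, 'rain': 2})
```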
F<strong>in</strong>d/Learn Map Automatically<br />
Φ^best := arg min_Φ Cost(Φ|h_t)<br />
• What is the best map/MDP (i.e. what is the right Cost criterion)?<br />
• Is the best MDP good enough (i.e. is reduction always possible)?<br />
• How to find the map Φ (i.e. minimize Cost) efficiently?
ΦMDP: Computational Flow<br />
Computational flow (diagram omitted): from the history h, a feature vector Φ̂ is selected by (implicitly) minimizing Cost(Φ|h); frequency estimates give the transition probabilities T̂ and reward estimates R̂; an exploration bonus turns these into T̂^e and R̂^e; the Bellman equations yield the value estimate (Q̂) and the best policy p̂; the policy emits an action a to the environment, whose reward r and observation o extend the history h.
Intelligent Agents <strong>in</strong> Perspective<br />
Diagram (omitted): Universal AI (AIXI) at the top, built on Feature RL (ΦMDP/DBN/...), which in turn rests on Information, Learning, Planning, and Complexity, and below these on Search – Optimization – Computation – Logic – KR.<br />
Agents = General Framework; Interface = Robots, Vision, Language
Properties <strong>of</strong> Learn<strong>in</strong>g Algorithms<br />
Comparison <strong>of</strong> AIXI to Other Approaches<br />
Algorithm | time efficient | global optimum | data efficient | exploration | convergence | generalization | pomdp | learning | active<br />
Value/Policy iteration yes/no yes – YES YES NO NO NO yes<br />
TD w. func.approx. no/yes NO NO no/yes NO YES NO YES YES<br />
Direct Policy Search no/yes YES NO no/yes NO YES no YES YES<br />
Logic Planners yes/no YES yes YES YES no no YES yes<br />
RL with Split Trees yes YES no YES NO yes YES YES YES<br />
Pred.w. Expert Advice yes/no YES – YES yes/no yes NO YES NO<br />
OOPS yes/no no – yes yes/no YES YES YES YES<br />
Market/Economy RL yes/no no NO no no/yes yes yes/no YES YES<br />
SPXI no YES – YES YES YES NO YES NO<br />
AIXI NO YES YES yes YES YES YES YES YES<br />
AIXItl no/yes YES YES YES yes YES YES YES YES<br />
MC-AIXI-CTW yes/no yes YES YES yes NO yes/no YES YES<br />
Feature RL yes/no YES yes yes yes yes yes YES YES<br />
Human yes yes yes no/yes NO YES YES YES YES
Mach<strong>in</strong>e Intelligence Tests & Def<strong>in</strong>itions<br />
⋆ = yes, · = no, • = debatable, blank = unknown.<br />
Intelligence Test | Valid | Informative | Wide Range | General | Dynamic | Unbiased | Fundamental | Formal | Objective | Fully Defined | Universal | Practical | Test vs. Def.<br />
Tur<strong>in</strong>g Test • · · · • · · · · • · • T<br />
Total Tur<strong>in</strong>g Test • · · · • · · · · • · · T<br />
Inverted Tur<strong>in</strong>g Test • • · · • · · · · • · • T<br />
Toddler Tur<strong>in</strong>g Test • · · · • · · · · · · • T<br />
L<strong>in</strong>guistic Complexity • ⋆ • · · · · • • · • • T<br />
Text Compression Test • ⋆ ⋆ • · • • ⋆ ⋆ ⋆ • ⋆ T<br />
Tur<strong>in</strong>g Ratio • ⋆ ⋆ ⋆ · T/D<br />
Psychometric AI ⋆ ⋆ • ⋆ • · • • • · • T/D<br />
Smith’s Test • ⋆ ⋆ • · ⋆ ⋆ ⋆ · • T/D<br />
C-Test • ⋆ ⋆ • · ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ T/D<br />
AIXI ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ · D
Common Criticisms<br />
• AIXI is obviously wrong.<br />
(<strong>in</strong>telligence cannot be captured <strong>in</strong> a few simple equations)<br />
• AIXI is obviously correct. (everybody already knows this)<br />
• Assum<strong>in</strong>g that the environment is computable is too strong.<br />
• All standard objections to strong AI also apply to AIXI.<br />
(free will, lookup table, Lucas/Penrose Goedel argument)<br />
• AIXI doesn’t deal with X or cannot do X.<br />
(X = consciousness, creativity, imag<strong>in</strong>ation, emotion, love, soul, etc.)<br />
• AIXI is not <strong>in</strong>telligent because it cannot choose its goals.<br />
• Universal AI is impossible due to the No-Free-Lunch theorem.<br />
See [Legg:08] for refutations <strong>of</strong> these and more criticisms.
General Murky & Quirky AI Questions<br />
• Does current mainstream AI research have anything to do with AI?<br />
• Are sequential decision and algorithmic probability theory<br />
all we need to well-define AI?<br />
• What is (Universal) AI theory good for?<br />
• What are robots good for in AI?<br />
• Is intelligence a fundamentally simple concept?<br />
(compare with fractals or physics theories)<br />
• What can we (not) expect from super-intelligent agents?<br />
• Is maximizing the expected reward the right criterion?<br />
• Isn't universal learning impossible due to the NFL theorems?
Next Steps<br />
• Address the many open theoretical questions (see <strong>Hutter</strong>:05).<br />
• Bridge the gap between (Universal) AI theory and AI practice.<br />
• Explore what role logical reason<strong>in</strong>g, knowledge representation,<br />
vision, language, etc. play <strong>in</strong> Universal AI.<br />
• Determ<strong>in</strong>e the right discount<strong>in</strong>g <strong>of</strong> future rewards.<br />
• Develop the right nurtur<strong>in</strong>g environment for a learn<strong>in</strong>g agent.<br />
• Consider embodied agents (e.g. <strong>in</strong>ternal↔external reward)<br />
• Analyze AIXI <strong>in</strong> the multi-agent sett<strong>in</strong>g.
The Big Questions<br />
• Is non-computational physics relevant to AI? [Penrose]<br />
• Could something like the number of wisdom Ω prevent a simple<br />
solution to AI? [Chaitin]<br />
• Do we need to understand consciousness before being able to<br />
understand AI or construct AI systems?<br />
• What if we succeed?
Wrap Up<br />
• Setup: Given (non-)iid data D = (x_1, ..., x_n), predict x_{n+1}<br />
• Ultimate goal is to maximize pr<strong>of</strong>it or m<strong>in</strong>imize loss<br />
• Consider Models/Hypothesis H i ∈ M<br />
• Max. Likelihood: H_best = arg max_i p(D|H_i) (overfits if M is large)<br />
• Bayes: Posterior probability of H_i is p(H_i|D) ∝ p(D|H_i) p(H_i)<br />
• Bayes needs prior(H i )<br />
• Occam+Epicurus: High prior for simple models.<br />
• Kolmogorov/Solomon<strong>of</strong>f: Quantification <strong>of</strong> simplicity/complexity<br />
• Bayes works if D is sampled from H true ∈ M<br />
• Bellman equations tell how to optimally act <strong>in</strong> known environments<br />
• Universal AI = Universal Induction + Sequential Decision Theory<br />
• Practice = approximate, restrict, search, optimize, knowledge
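The Bayes and Occam+Epicurus bullets above in a toy numerical sketch; the hypotheses, prior values, and data are illustrative assumptions:

```python
# Hypothetical coin example: hypotheses H_i = "heads probability p_i", with an
# Occam-style prior favouring the simpler "fair coin" hypothesis.
hyps = {"fair p=0.5": (0.5, 0.7), "biased p=0.9": (0.9, 0.3)}  # (p_i, prior)

D = [1, 1, 1, 0, 1, 1, 1, 1]  # observed flips (1 = heads)

def posterior(hyps, D):
    """Bayes: p(H_i|D) is proportional to p(D|H_i) * p(H_i)."""
    joint = {}
    for name, (p, prior) in hyps.items():
        lik = 1.0
        for x in D:
            lik *= p if x == 1 else 1 - p
        joint[name] = lik * prior
    Z = sum(joint.values())  # normalizer
    return {name: v / Z for name, v in joint.items()}

post = posterior(hyps, D)
print(post)  # 7 heads out of 8 shift belief toward the biased coin despite its lower prior
```

The prior encodes Occam's preference for simple hypotheses; the likelihood lets the data override it once the evidence is strong enough.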
Literature<br />
[CV05] R. Cilibrasi and P. M. B. Vitányi. Clustering by compression. IEEE Trans. Information Theory, 51(4):1523–1545, 2005.<br />
[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.<br />
[Hut07] M. Hutter. On universal prediction and Bayesian confirmation. Theoretical Computer Science, 384(1):33–48, 2007.<br />
[LH07] S. Legg and M. Hutter. Universal intelligence: a definition of machine intelligence. Minds & Machines, 17(4):391–444, 2007.<br />
[Hut09] M. Hutter. Feature reinforcement learning: Part I: Unstructured MDPs. Journal of Artificial General Intelligence, 1:3–24, 2009.<br />
[VNHS09] J. Veness, K. S. Ng, M. Hutter, and D. Silver. A Monte Carlo AIXI approximation. Technical report, NICTA, Australia, 2009.
Thanks! Questions? Details:<br />
Jobs: PostDoc and PhD positions at RSISE and NICTA, Australia<br />
Projects at http://www.hutter1.net/<br />
A Unified View of Artificial Intelligence:<br />
Artificial Intelligence = Decision Theory + Universal Induction<br />
Decision Theory = Probability + Utility Theory<br />
Universal Induction = Ockham + Bayes + Turing<br />
Open research problems at www.hutter1.net/ai/uaibook.htm<br />
Compression contest with 50,000 € prize at prize.hutter1.net