Slides of Marcus Hutter

Sample Complexity

An algorithm A is (ε, δ)-correct with sample complexity N if for all
M ∈ 𝓜 := {(S, A, p, r, γ) : p transition probabilities},

\[
\Pr\left[ \sum_{t=1}^{\infty} \big[\, V^*_M(s_t) - V^A_M(s_{1:t}) > \varepsilon \,\big] > N \right] < \delta
\]

where [·] is the Iverson bracket, so the sum counts the number of time-steps
at which A is not ε-optimal. Informally: "The probability that I am 'badly'
suboptimal for more than N time-steps is at most δ!"
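
To make the definition concrete, here is a minimal Python sketch of how one
might check it empirically over simulated runs; the function names and the
gap-sequence input format are illustrative choices, not anything from the
slides.

def count_suboptimal_steps(gaps, eps):
    # Number of time-steps at which the value gap V*(s_t) - V^A(s_{1:t})
    # exceeds eps, i.e. the inner sum in the definition.
    return sum(1 for g in gaps if g > eps)

def is_empirically_correct(runs, eps, delta, N):
    # runs: one gap sequence per simulated run of the algorithm.
    # Empirical analogue of the (eps, delta) criterion: the fraction of
    # runs with more than N eps-suboptimal steps should be below delta.
    bad = sum(1 for gaps in runs if count_suboptimal_steps(gaps, eps) > N)
    return bad / len(runs) < delta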


Algorithm Sketch

1: loop
2:     Compute empirical estimate p̂ of transition matrix p
3:     Compute confidence interval about p̂
4:     Act according to the most optimistic plausible MDP
5: end loop
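
A minimal runnable sketch of this loop in Python, assuming a finite MDP with
known rewards and an L1-ball confidence region around the empirical
transition estimates; the interval construction and the planner are
simplified UCRL-style stand-ins, not the exact algorithm analysed on these
slides.

import numpy as np

def optimistic_values(counts, rewards, gamma, delta, iters=200):
    # Value iteration in the most optimistic plausible MDP: at each backup,
    # each empirical distribution p̂(·|s,a) may be tilted toward high-value
    # states within an L1 confidence ball of radius radius[s, a].
    S, A = rewards.shape
    n = counts.sum(axis=2)                                   # visits to (s,a)
    p_hat = np.where(n[:, :, None] > 0,
                     counts / np.maximum(n, 1)[:, :, None],
                     1.0 / S)                                # uniform if unvisited
    radius = np.sqrt(2 * np.log(1 / delta) / np.maximum(n, 1))
    V = np.zeros(S)
    for _ in range(iters):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                p = optimistic_shift(p_hat[s, a], radius[s, a], V)
                Q[s, a] = rewards[s, a] + gamma * p @ V
        V = Q.max(axis=1)
    return Q

def optimistic_shift(p, eps, V):
    # Standard L1-ball trick: move up to eps/2 probability mass onto the
    # highest-value state, draining it from the lowest-value states.
    p = p.copy()
    best = int(np.argmax(V))
    add = min(eps / 2, 1 - p[best])
    p[best] += add
    for s in np.argsort(V):
        if add <= 0:
            break
        take = min(p[s] if s != best else 0.0, add)
        p[s] -= take
        add -= take
    return p

def run(env, rewards, gamma=0.95, delta=0.05, steps=10_000):
    # The loop from the slide; `env` is a hypothetical interface with
    # reset() -> state and step(action) -> next state.
    S, A = rewards.shape
    counts = np.zeros((S, A, S))
    s = env.reset()
    for _ in range(steps):
        Q = optimistic_values(counts, rewards, gamma, delta)
        a = int(np.argmax(Q[s]))      # act greedily in the optimistic MDP
        s2 = env.step(a)
        counts[s, a, s2] += 1
        s = s2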


Analysis

◮ Key component is bounding |V^A_{M̂}(s_{1:t}) − V^A_M(s_{1:t})| < ε
◮ Require each state to be visited sufficiently often
◮ States that are expected to be visited often need better estimates

With w(s) the discounted future state distribution:

\[
\begin{aligned}
V^A_{\hat M}(s_t) - V^A_M(s_t)
  &= \gamma \sum_s w(s)\,(p_s - \hat p_s) \cdot \hat V \\
  &\le \sum_s w(s) \sqrt{\frac{|S|\,\sigma^2(s)\,L}{n(s)}}
      &&\text{(Bernstein's inequality)} \\
  &= \sum_s \sqrt{\frac{L\,|S|\,w(s)\,\sigma^2(s)}{m}}
      &&\text{(substitute } m := n(s)/w(s)\text{)} \\
  &\le \sqrt{\frac{L\,|S|^2 \sum_s w(s)\,\sigma^2(s)}{m}}
      &&\text{(Cauchy--Schwarz)} \\
  &\le \sqrt{\frac{L\,|S|^2}{m\,(1-\gamma)^2}}
      &&\text{(variance bound below)}
\end{aligned}
\]

where σ²(s) := Var V(s′|s), L := log 1/δ, and the last step uses
∑_s w(s)σ²(s) ≤ Var ∑_{k=t}^∞ γ^{k−t} r_k ≤ 1/(1−γ)².

Setting the error to ε and solving for m gives m ≈ L|S|²/(ε²(1−γ)²); therefore

\[
n(s) \approx \frac{w(s)\,L\,|S|^2}{\varepsilon^2 (1-\gamma)^2}
\]

visits to state s are needed.
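
A quick numerical illustration of the Bernstein step (a sanity check with
arbitrary numbers of my own, not from the slides): the deviation
|(p_s − p̂_s)·V̂| should concentrate at roughly √(2σ²(s)L/n(s)).

import numpy as np

rng = np.random.default_rng(0)
S, n, delta = 20, 500, 0.05
p = rng.dirichlet(np.ones(S))              # a true next-state distribution p_s
V = rng.uniform(0, 10, size=S)             # a fixed value vector V̂
L = np.log(1 / delta)
sigma2 = p @ (V - p @ V) ** 2              # sigma^2(s) = Var V(s'|s)

errors = [abs((p - rng.multinomial(n, p) / n) @ V) for _ in range(2000)]

# Leading Bernstein term (ignoring the lower-order range/n correction):
bound = np.sqrt(2 * sigma2 * L / n)
print(f"empirical 1-delta quantile: {np.quantile(errors, 1 - delta):.4f}")
print(f"Bernstein-style bound:      {bound:.4f}")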


Analysis

\[
n(s) \approx \frac{w(s)\,L\,|S|^2}{\varepsilon^2 (1-\gamma)^2}, \qquad
H := \frac{1}{1-\gamma} \log \frac{1}{\varepsilon(1-\gamma)}
\]

◮ w(s) is the discounted future state distribution
◮ Expect to visit s at least w(s) times within H time-steps
◮ Expect to "know" a state after L|S|²H / (ε²(1−γ)²) time-steps
◮ Once we "know" all states we are optimal
◮ Expect sample complexity bounded by (|S|²|A|H / (ε²(1−γ)²)) log(1/δ)
◮ The analysis is harder since w(s) changes over time and we want results
with high probability
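
Plugging illustrative numbers (my own choice, not from the slides) into
these expressions shows how large the bound is in absolute terms:

from math import log

S, A = 10, 4                     # hypothetical small MDP
eps, gamma, delta = 0.1, 0.9, 0.05

L = log(1 / delta)
H = (1 / (1 - gamma)) * log(1 / (eps * (1 - gamma)))     # effective horizon
know_one = L * S**2 * H / (eps**2 * (1 - gamma)**2)      # "know" one state
bound = S**2 * A * H / (eps**2 * (1 - gamma)**2) * log(1 / delta)

print(f"H ≈ {H:.1f}")
print(f"time-steps to 'know' a state ≈ {know_one:.2e}")
print(f"sample-complexity bound ≈ {bound:.2e}")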


Questions
