Relax and Randomize: From Value to Algorithms

Proof of Proposition 5. We would like to show that, with the distribution $q^*_t$ defined in (15),
\[
\max_{y_t \in \{\pm1\}} \Big\{ \mathbb{E}_{\hat{y}_t \sim q^*_t} |\hat{y}_t - y_t| + \mathrm{Rel}_T(\mathcal{F} \mid (x^t, y^t)) \Big\} \le \mathrm{Rel}_T(\mathcal{F} \mid (x^{t-1}, y^{t-1}))
\]
for any $x_t \in \mathcal{X}$. Let $\sigma \in \{\pm1\}^{t-1}$ and $\sigma_t \in \{\pm1\}$. We have
\begin{align*}
&\mathrm{Rel}_T(\mathcal{F} \mid (x^t, y^t)) - 2\lambda(T - t) \\
&= \frac{1}{\lambda} \log \Big( \sum_{(\sigma,\sigma_t) \in \mathcal{F}|_{x^t}} g(\mathrm{Ldim}(\mathcal{F}_t(\sigma,\sigma_t)), T - t) \exp\{-\lambda L_{t-1}(\sigma)\} \exp\{-\lambda |\sigma_t - y_t|\} \Big) \\
&\le \frac{1}{\lambda} \log \Big( \sum_{\sigma_t \in \{\pm1\}} \exp\{-\lambda |\sigma_t - y_t|\} \sum_{\sigma : (\sigma,\sigma_t) \in \mathcal{F}|_{x^t}} g(\mathrm{Ldim}(\mathcal{F}_t(\sigma,\sigma_t)), T - t) \exp\{-\lambda L_{t-1}(\sigma)\} \Big)
\end{align*}

Just as in the proof of Proposition 3, we may think of the two choices of $\sigma_t$ as the two experts whose weighting $q^*_t$ is given by the sum involving the Littlestone's dimension of subsets of $\mathcal{F}$. Introducing the normalization term, we arrive at the upper bound
\begin{align*}
&\frac{1}{\lambda} \log \big( \mathbb{E}_{\sigma_t \sim q^*_t} \exp\{-\lambda |\sigma_t - y_t|\} \big) + \frac{1}{\lambda} \log \Big( \sum_{\sigma_t \in \{\pm1\}} \sum_{\sigma : (\sigma,\sigma_t) \in \mathcal{F}|_{x^t}} g(\mathrm{Ldim}(\mathcal{F}_t(\sigma,\sigma_t)), T - t) \exp\{-\lambda L_{t-1}(\sigma)\} \Big) \\
&\le -\mathbb{E}_{\sigma_t \sim q^*_t} |\sigma_t - y_t| + 2\lambda + \frac{1}{\lambda} \log \Big( \sum_{\sigma_t \in \{\pm1\}} \sum_{\sigma : (\sigma,\sigma_t) \in \mathcal{F}|_{x^t}} g(\mathrm{Ldim}(\mathcal{F}_t(\sigma,\sigma_t)), T - t) \exp\{-\lambda L_{t-1}(\sigma)\} \Big)
\end{align*}
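As a concrete illustration of this two-expert weighting, $q^*_t(\sigma_t)$ is proportional to the total weight of the pairs $(\sigma, \sigma_t)$ sharing that value of $\sigma_t$. A minimal numeric sketch (the dictionary entries are made-up toy stand-ins for the terms $g(\mathrm{Ldim}(\mathcal{F}_t(\sigma,\sigma_t)), T-t)\exp\{-\lambda L_{t-1}(\sigma)\}$, not values from the paper):

```python
# Toy weights for pairs (sigma, sigma_t); each value stands in for
# g(Ldim(F_t(sigma, sigma_t)), T - t) * exp(-lam * L_{t-1}(sigma)).
weights = {
    ((+1, +1), +1): 0.5,
    ((+1, +1), -1): 0.25,
    ((+1, -1), +1): 0.125,
    ((-1, +1), -1): 0.125,
}

def q_star(weights):
    """q*_t(sigma_t) is proportional to the total weight of the
    pairs (sigma, sigma_t) with that value of sigma_t."""
    totals = {+1: 0.0, -1: 0.0}
    for (_sigma, sigma_t), w in weights.items():
        totals[sigma_t] += w
    z = totals[+1] + totals[-1]
    return {s: v / z for s, v in totals.items()}

q = q_star(weights)
# q[+1] = (0.5 + 0.125) / 1.0 = 0.625 and q[-1] = 0.375
```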

The last step is due to Lemma A.1 in [6]. It remains to show that the log normalization term is upper bounded by the relaxation at the previous step; together with the terms $2\lambda(T - t) + 2\lambda = 2\lambda(T - t + 1)$ collected above, this yields the claim:
\begin{align*}
&\frac{1}{\lambda} \log \Big( \sum_{\sigma_t \in \{\pm1\}} \sum_{\sigma : (\sigma,\sigma_t) \in \mathcal{F}|_{x^t}} g(\mathrm{Ldim}(\mathcal{F}_t(\sigma,\sigma_t)), T - t) \exp\{-\lambda L_{t-1}(\sigma)\} \Big) \\
&\le \frac{1}{\lambda} \log \Big( \sum_{\sigma \in \mathcal{F}|_{x^{t-1}}} \exp\{-\lambda L_{t-1}(\sigma)\} \sum_{\sigma_t \in \{\pm1\}} g(\mathrm{Ldim}(\mathcal{F}_t(\sigma,\sigma_t)), T - t) \Big) \\
&\le \frac{1}{\lambda} \log \Big( \sum_{\sigma \in \mathcal{F}|_{x^{t-1}}} \exp\{-\lambda L_{t-1}(\sigma)\} \, g(\mathrm{Ldim}(\mathcal{F}_{t-1}(\sigma)), T - t + 1) \Big) \\
&= \mathrm{Rel}_T(\mathcal{F} \mid (x^{t-1}, y^{t-1})) - 2\lambda(T - t + 1)
\end{align*}

To justify the last inequality, note that $\mathcal{F}_{t-1}(\sigma) = \mathcal{F}_t(\sigma, +1) \cup \mathcal{F}_t(\sigma, -1)$ and at most one of $\mathcal{F}_t(\sigma, +1)$ or $\mathcal{F}_t(\sigma, -1)$ can have Littlestone's dimension $\mathrm{Ldim}(\mathcal{F}_{t-1}(\sigma))$. We now appeal to the recursion
\[
g(d, T - t) + g(d - 1, T - t) \le g(d, T - t + 1),
\]
where $g(d, T - t)$ is the size of the zero cover for a class with Littlestone's dimension $d$ on the worst-case tree of depth $T - t$ (see [14]). This completes the proof of admissibility.
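If one takes the closed form $g(d, n) = \sum_{i=0}^{d} \binom{n}{i}$ for the zero-cover size (an assumption imported from [14], not restated here), the recursion can be checked numerically; Pascal's rule in fact gives equality:

```python
from math import comb

def g(d, n):
    # Zero-cover size bound for Littlestone dimension d at depth n:
    # g(d, n) = sum_{i=0}^{d} C(n, i).
    return sum(comb(n, i) for i in range(d + 1))

# Pascal's rule C(n, i) + C(n, i-1) = C(n+1, i) makes the recursion
# g(d, n) + g(d-1, n) <= g(d, n+1) hold with equality for this form.
for d in range(1, 6):
    for n in range(d, 12):
        assert g(d, n) + g(d - 1, n) == g(d, n + 1)
```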

Alternative Method

Let us now derive the algorithm. Once again, consider the optimization problem
\[
\max_{y_t \in \{\pm1\}} \Big\{ \mathbb{E}_{\hat{y}_t \sim q^*_t} |\hat{y}_t - y_t| + \mathrm{Rel}_T(\mathcal{F} \mid (x^t, y^t)) \Big\}
\]
with the relaxation
\[
\mathrm{Rel}_T(\mathcal{F} \mid (x^t, y^t)) = \frac{1}{\lambda} \log \Big( \sum_{\sigma \in \mathcal{F}|_{x^t}} g(\mathrm{Ldim}(\mathcal{F}_t(\sigma)), T - t) \exp\{-\lambda L_t(\sigma)\} \Big) + 2\lambda(T - t)
\]
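For a finite projected class, this relaxation is straightforward to evaluate once the weights are available. A hedged sketch (the function and argument names are mine; `weights` stands in for the precomputed terms $g(\mathrm{Ldim}(\mathcal{F}_t(\sigma)), T-t)\exp\{-\lambda L_t(\sigma)\}$):

```python
import math

def relaxation(weights, lam, steps_left):
    """Evaluate the relaxation for a finite projected class:
    (1/lam) * log(sum of weights) + 2 * lam * (T - t).
    `weights` maps each sigma in F|_{x^t} to the precomputed term
    g(Ldim(F_t(sigma)), T - t) * exp(-lam * L_t(sigma))."""
    return math.log(sum(weights.values())) / lam + 2.0 * lam * steps_left
```

For instance, two surviving patterns of equal weight at the last step ($T - t = 0$) give $\frac{1}{\lambda}\log 2$.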

The maximum can be written explicitly, as in Section 3:
\begin{align*}
\max \Big\{ \, & 1 - q^*_t + \frac{1}{\lambda} \log \Big( \sum_{(\sigma,\sigma_t) \in \mathcal{F}|_{x^t}} g(\mathrm{Ldim}(\mathcal{F}_t(\sigma,\sigma_t)), T - t) \exp\{-\lambda L_{t-1}(\sigma)\} \exp\{-\lambda(1 - \sigma_t)\} \Big), \\
& 1 + q^*_t + \frac{1}{\lambda} \log \Big( \sum_{(\sigma,\sigma_t) \in \mathcal{F}|_{x^t}} g(\mathrm{Ldim}(\mathcal{F}_t(\sigma,\sigma_t)), T - t) \exp\{-\lambda L_{t-1}(\sigma)\} \exp\{-\lambda(1 + \sigma_t)\} \Big) \Big\}
\end{align*}
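Following the equalization argument of Section 3, the mean prediction $q^*_t$ can be chosen to balance the two branches: writing $A$ and $B$ for the two $\frac{1}{\lambda}\log(\cdot)$ terms, $1 - q + A = 1 + q + B$ gives $q = (A - B)/2$, clipped to $[-1, 1]$. A sketch under that assumption (the helper name is hypothetical):

```python
def predict_mean(log_sum_y_plus, log_sum_y_minus, lam):
    """Mean prediction q*_t in [-1, 1] equalizing the two branches:
    1 - q + A = 1 + q + B  =>  q = (A - B) / 2, then clip to [-1, 1].
    The arguments are the raw log-sums, so A = log_sum_y_plus / lam
    and B = log_sum_y_minus / lam."""
    a = log_sum_y_plus / lam
    b = log_sum_y_minus / lam
    return max(-1.0, min(1.0, (a - b) / 2.0))
```

Since $\hat{y}_t \in \{\pm1\}$, one then predicts $+1$ with probability $(1 + q^*_t)/2$.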
