Relax and Randomize: From Value to Algorithms
[14] for the proof). We arrive at the upper bound
$$
\frac{1}{\lambda}\log\left(\sum_{f\in\mathcal{F}}\exp\left(-\lambda L_t(f)\right)\times\exp\left(2\lambda^2\max_{\epsilon_1,\ldots,\epsilon_{T-t}\in\{\pm1\}}\sum_{i=1}^{T-t}\ell(f,\mathbf{x}_i(\epsilon))^2\right)\right)
$$
$$
\leq \frac{1}{\lambda}\log\left(\sum_{f\in\mathcal{F}}\exp\left(-\lambda L_t(f)+2\lambda^2\max_{f\in\mathcal{F}}\max_{\epsilon_1,\ldots,\epsilon_{T-t}\in\{\pm1\}}\sum_{i=1}^{T-t}\ell(f,\mathbf{x}_i(\epsilon))^2\right)\right)
$$
$$
\leq \frac{1}{\lambda}\log\left(\sum_{f\in\mathcal{F}}\exp\left(-\lambda L_t(f)\right)\right)+2\lambda\sup_{\mathbf{x}}\max_{f\in\mathcal{F}}\max_{\epsilon_1,\ldots,\epsilon_{T-t}\in\{\pm1\}}\sum_{i=1}^{T-t}\ell(f,\mathbf{x}_i(\epsilon))^2
$$
The last term, representing the “worst future”, is upper bounded by $2\lambda(T-t)$, assuming that the losses are bounded by $1$. This removes the $\mathbf{x}$ tree and leads to the relaxation (9) and a computationally tractable algorithm.
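For concreteness, the resulting upper bound $(1/\lambda)\log\big(\sum_{f\in\mathcal{F}}\exp(-\lambda L_t(f))\big) + 2\lambda(T-t)$ can be evaluated directly for a small finite class. The sketch below is purely illustrative; the class size, horizon, and loss sequence are hypothetical choices, not from the paper.

```python
import numpy as np

# Toy computation (illustrative) of the relaxation value
#   (1/lam) * log(sum_f exp(-lam * L_t(f))) + 2 * lam * (T - t)
# for a finite class F with losses bounded in [0, 1].
rng = np.random.default_rng(0)
n_experts, T, t, lam = 5, 100, 40, 0.1

# cumulative losses L_t(f) of each f in F after t rounds
L_t = rng.uniform(0, 1, size=(t, n_experts)).sum(axis=0)

# with losses bounded by 1, each squared loss is at most 1, so the
# "worst future" sum over the remaining T - t rounds is at most T - t
worst_future = T - t

relaxation = np.log(np.sum(np.exp(-lam * L_t))) / lam + 2 * lam * worst_future

# the relaxation dominates the comparator term -min_f L_t(f)
assert relaxation >= -L_t.min()
```

Note that the softmin term alone already dominates $-\min_f L_t(f)$, and the bounded “worst future” adds only the constant $2\lambda(T-t)$.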
Proof of Proposition 4. The argument can be seen as a generalization of the Euclidean proof in [2] to general smooth norms. Let $\tilde{x}_{t-1} = \sum_{i=1}^{t-1} x_i$. The optimal algorithm for the relaxation (10) is
$$
f^*_t = \operatorname*{argmin}_{f\in\mathcal{F}}\left\{\sup_{x_t\in\mathcal{X}}\left\{\langle f, x_t\rangle+\sqrt{\|\tilde{x}_{t-1}\|^2+\langle\nabla\tfrac{1}{2}\|\tilde{x}_{t-1}\|^2,\, x_t\rangle+C(T-t+1)}\right\}\right\} \tag{27}
$$
Instead, let
$$
f_t = -\frac{\nabla\tfrac{1}{2}\|\tilde{x}_{t-1}\|^2}{2\sqrt{\|\tilde{x}_{t-1}\|^2+C(T-t+1)}}. \tag{28}
$$
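For intuition, in the Euclidean case we have $\nabla\tfrac{1}{2}\|x\|^2 = x$, so (28) reduces to $f_t = -\tilde{x}_{t-1}\big/\big(2\sqrt{\|\tilde{x}_{t-1}\|^2 + C(T-t+1)}\big)$. A minimal sketch of this rule on a random stream of linear losses follows; the constant $C = 1$, the dimension, and the data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Sketch (illustrative) of the prediction rule (28) in the Euclidean case,
# where grad(0.5 * ||x||^2) = x.
def predict(x_tilde, C, T, t):
    """f_t = -x_tilde / (2 * sqrt(||x_tilde||^2 + C * (T - t + 1)))."""
    A = np.dot(x_tilde, x_tilde) + C * (T - t + 1)
    return -x_tilde / (2.0 * np.sqrt(A))

rng = np.random.default_rng(0)
d, T, C = 3, 50, 1.0
x_tilde = np.zeros(d)       # running sum of the observed x_1, ..., x_{t-1}
total_loss = 0.0
for t in range(1, T + 1):
    f_t = predict(x_tilde, C, T, t)
    assert np.linalg.norm(f_t) <= 0.5              # f_t stays in the unit ball
    x_t = rng.uniform(-1, 1, size=d) / np.sqrt(d)  # adversary's move, ||x_t|| <= 1
    total_loss += float(np.dot(f_t, x_t))          # linear loss <f_t, x_t>
    x_tilde += x_t
```

Since $\|f_t\| = \|\tilde{x}_{t-1}\|/(2\sqrt{\|\tilde{x}_{t-1}\|^2 + C(T-t+1)}) < 1/2$ whenever $C > 0$, the prediction automatically stays inside the unit ball.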
Plugging this choice into the admissibility condition (4), we get
$$
\sup_{x_t\in\mathcal{X}}\left\{-\frac{\langle\nabla\tfrac{1}{2}\|\tilde{x}_{t-1}\|^2,\, x_t\rangle}{2\sqrt{A}}+\sqrt{A+\langle\nabla\tfrac{1}{2}\|\tilde{x}_{t-1}\|^2,\, x_t\rangle}\right\}
$$
where $A = \|\tilde{x}_{t-1}\|^2 + C(T-t+1)$. It can be easily verified that an expression of the form $-x/(2\sqrt{y})+\sqrt{y+x}$ is maximized at $x = 0$ for a positive $y$. With these values,
$$
\inf_{f_t\in\mathcal{F}}\left\{\sup_{x_t\in\mathcal{X}}\left\{\langle f_t, x_t\rangle+\left(\|\tilde{x}_{t-1}\|^2+\langle\nabla\tfrac{1}{2}\|\tilde{x}_{t-1}\|^2,\, x_t\rangle+C(T-t+1)\right)^{1/2}\right\}\right\}=\left(\|\tilde{x}_{t-1}\|^2+C(T-t+1)\right)^{1/2}
$$
$$
\leq\left(\|\tilde{x}_{t-2}\|^2+\langle\nabla\tfrac{1}{2}\|\tilde{x}_{t-2}\|^2,\, x_{t-1}\rangle+C(T-t+2)\right)^{1/2}=\mathbf{Rel}_T\left(\mathcal{F}\mid x_1,\ldots,x_{t-1}\right)
$$
Hence, the choice (28) is an admissible algorithm for the relaxation (10).
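The elementary maximization fact used in the proof, namely that $-x/(2\sqrt{y}) + \sqrt{y+x}$ attains its maximum $\sqrt{y}$ at $x = 0$ for positive $y$, can also be checked numerically; the value $y = 3$ and the grid below are arbitrary illustrative choices.

```python
import numpy as np

# Numerical check (illustrative) that g(x) = -x/(2*sqrt(y)) + sqrt(y + x)
# is maximized at x = 0 for a fixed y > 0, where g(0) = sqrt(y).
def g(x, y):
    return -x / (2.0 * np.sqrt(y)) + np.sqrt(y + x)

y = 3.0                                       # arbitrary positive value
xs = np.linspace(-y + 1e-9, 5.0 * y, 100001)  # grid over the domain y + x >= 0
vals = g(xs, y)

assert np.all(vals <= g(0.0, y) + 1e-12)      # no grid point beats x = 0
assert abs(g(0.0, y) - np.sqrt(y)) < 1e-12
```

The first-order condition confirms this: $g'(x) = -1/(2\sqrt{y}) + 1/(2\sqrt{y+x})$ vanishes exactly at $x = 0$, and $g$ is concave in $x$.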
Evidently, the above proof of admissibility is very simple, but it might seem that we pulled the algorithm (28) out of a hat. We now show that in fact one can derive this algorithm as a solution $f^*_t$ in (27). The proof below is not required for admissibility, and we only include it for completeness. The proof uses the fact that for any norm $\|\cdot\|$,
$$
\langle\nabla\tfrac{1}{2}\|x\|^2,\, x\rangle = \|x\|^2. \tag{29}
$$
To prove this, observe that by convexity $\|0\| \geq \|x\| + \langle\nabla\|x\|,\, 0-x\rangle$ and $\|2x\| \geq \|x\| + \langle\nabla\|x\|,\, 2x-x\rangle$, implying $\langle\nabla\|x\|,\, x\rangle = \|x\|$. On the other hand, by the chain rule, $\nabla\tfrac{1}{2}\|x\|^2 = \|x\|\cdot\nabla\|x\|$, thus implying (29).
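Identity (29) can be confirmed numerically for a concrete smooth norm. The sketch below uses the $\ell_3$ norm, for which the gradient has the closed form $\nabla\tfrac{1}{2}\|x\|_p^2 = \|x\|_p^{2-p}\,\mathrm{sign}(x)\,|x|^{p-1}$; the test vector is an arbitrary choice.

```python
import numpy as np

# Numerical check (illustrative) of identity (29), <grad(0.5*||x||^2), x> = ||x||^2,
# for the l_p norm with p = 3.  The gradient has the closed form
#   grad(0.5 * ||x||_p^2) = ||x||_p^(2 - p) * sign(x) * |x|^(p - 1).
p = 3.0
rng = np.random.default_rng(0)
x = rng.normal(size=4)                  # arbitrary nonzero test vector

norm_p = np.sum(np.abs(x) ** p) ** (1.0 / p)
grad = norm_p ** (2.0 - p) * np.sign(x) * np.abs(x) ** (p - 1.0)

assert abs(np.dot(grad, x) - norm_p ** 2) < 1e-10
```

Pairing the gradient with $x$ gives $\|x\|_p^{2-p}\sum_j |x_j|^p = \|x\|_p^{2-p}\cdot\|x\|_p^p = \|x\|_p^2$, exactly as (29) states.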
Let
$$
K \triangleq \operatorname{Kernel}(\nabla\|\tilde{x}_{t-1}\|^2) = \{h : \langle\nabla\|\tilde{x}_{t-1}\|^2,\, h\rangle = 0\}, \qquad K' \triangleq \operatorname{Kernel}(\tilde{x}_{t-1}) = \{h : \langle h,\, \tilde{x}_{t-1}\rangle = 0\}.
$$
We first claim that $x_t$ can always be written as $x_t = \beta\tilde{x}_{t-1} + \gamma y$ for some $y \in K$ and for some scalars $\beta, \gamma$. Indeed, suppose that $x_t = \beta\tilde{x}_{t-1} + \gamma y + z$ for some $y \in K$ and $z \notin K$. Then we can rewrite $x_t$ as
$$
x_t = (\beta + \delta)\,\tilde{x}_{t-1} + (\gamma y - \delta\tilde{x}_{t-1} + z)
$$