Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

John C. Duchi (1,2), Elad Hazan (3), Yoram Singer (2)

1: University of California, Berkeley
2: Google Research
3: Technion

International Symposium on Mathematical Programming 2012

Duchi et al. (UC Berkeley), Adaptive Subgradient Methods, ISMP 2012


Setting: Online Convex Optimization

Online learning task; repeat:

• Learner plays point x_t
• Receives function f_t
• Suffers loss f_t(x_t) + ϕ(x_t)

Running example (a parameter vector for features):

• Receive label y_t and features φ_t
• Suffer the regularized logistic loss

\[
\log\bigl[1 + \exp(-y_t \langle \phi_t, x_t\rangle)\bigr] + \lambda \|x_t\|_1
\]

Goal: attain small regret

\[
\sum_{t=1}^{T} \bigl(f_t(x_t) + \varphi(x_t)\bigr) \;-\; \inf_{x \in \mathcal{X}} \sum_{t=1}^{T} \bigl(f_t(x) + \varphi(x)\bigr)
\]
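The loss and regret above can be made concrete in a few lines. A minimal NumPy sketch; the function names are illustrative, not from the paper:

```python
import numpy as np

def regularized_logistic_loss(x, feats, y, lam):
    """The slide's example loss: log(1 + exp(-y <feats, x>)) + lam * ||x||_1.
    `feats` plays the role of phi_t and `y` of y_t."""
    return np.log1p(np.exp(-y * (feats @ x))) + lam * np.abs(x).sum()

def regret(played_losses, best_fixed_loss):
    """Regret after T rounds: the learner's cumulative loss minus the
    cumulative loss of the best fixed point in hindsight."""
    return float(np.sum(played_losses)) - best_fixed_loss
```

At x = 0 the logistic term is log 2 regardless of the features, which is a handy sanity check.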


Motivation

Text data:

    "The most unsung birthday in American business and technological
    history this year may be the 50th anniversary of the Xerox 914
    photocopier." (The Atlantic, July/August 2010)

High-dimensional image features

Other motivations: selecting advertisements in online advertising,
document ranking, and problems whose parameters vary over many orders
of magnitude...


Goal?

Flipping around the usual sparsity game:

\[
\min_x \; \|Ax - b\|, \qquad A = [a_1\; a_2\; \cdots\; a_n]^\top \in \mathbb{R}^{n \times d}
\]

Bounds in sparsity-focused settings usually depend on

\[
\underbrace{\|a_i\|_\infty}_{\text{dense}} \cdot \underbrace{\|x\|_1}_{\text{sparse}}
\]

What we would like:

\[
\underbrace{\|a_i\|_1}_{\text{sparse}} \cdot \underbrace{\|x\|_\infty}_{\text{dense}}
\]

(In general, impossible.)


Approaches: Gradient Descent and Dual Averaging

Let g_t ∈ ∂f_t(x_t). Gradient descent linearizes f at x_t, i.e. replaces f(x) by f(x_t) + ⟨g_t, x − x_t⟩, and steps:

\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \tfrac{1}{2}\|x - x_t\|^2 + \eta_t \langle g_t, x\rangle \Bigr\}
\]

or dual averaging:

\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \frac{1}{t}\sum_{\tau=1}^{t} \langle g_\tau, x\rangle + \frac{1}{2t}\|x\|^2 \Bigr\}
\]
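In the unconstrained case (X = R^d, an assumption not made on the slide) both argmins have closed forms, sketched below; for gradient descent the minimizer is x_t − η_t g_t, and for dual averaging the 1/t factors cancel, leaving the negated gradient sum:

```python
import numpy as np

def gd_step(x_t, g_t, eta_t):
    """Unconstrained gradient-descent step: the argmin of
    (1/2)||x - x_t||^2 + eta_t <g_t, x> is x_t - eta_t * g_t."""
    return np.asarray(x_t, float) - eta_t * np.asarray(g_t, float)

def dual_averaging_step(grad_sum):
    """Unconstrained dual-averaging step: minimizing
    (1/t)<grad_sum, x> + (1/(2t))||x||^2 over all x gives
    x = -grad_sum (setting the gradient grad_sum/t + x/t to zero)."""
    return -np.asarray(grad_sum, float)
```

With a constraint set X, both would additionally need a projection onto X.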


What is the problem?

• Gradient steps treat all features as equal
• They are not!


Adapting to the Geometry of the Space


Why adapt to geometry?

 y_t   φ_t,1   φ_t,2   φ_t,3
  1     1       0       0
 −1     .5      0       1
  1    −.5      1       0
 −1     0       0       0
  1     .5      0       0
 −1     1       0       0
  1    −1       1       0
 −1    −.5      0       1

Feature 1: frequent, irrelevant (the "hard" direction)
Features 2, 3: infrequent, predictive (the "nice" directions)


Adapting to the Geometry of the Space

• Receive g_t ∈ ∂f_t(x_t)
• Earlier:

\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \tfrac{1}{2}\|x - x_t\|^2 + \eta \langle g_t, x\rangle \Bigr\}
\]

• Now: let ‖x‖_A^2 = ⟨x, Ax⟩ for A ≻ 0. Use

\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \tfrac{1}{2}\|x - x_t\|_A^2 + \eta \langle g_t, x\rangle \Bigr\}
\]
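For unconstrained X (an assumption), the metric-weighted argmin also has a closed form: setting the gradient A(x − x_t) + η g_t to zero gives x_t − η A⁻¹ g_t. A small sketch:

```python
import numpy as np

def metric_gd_step(x_t, g_t, A, eta):
    """Unconstrained step under the norm ||x||_A^2 = <x, A x>:
    the argmin of (1/2)||x - x_t||_A^2 + eta <g_t, x>
    is x_t - eta * A^{-1} g_t (solve rather than invert A)."""
    return np.asarray(x_t, float) - eta * np.linalg.solve(A, np.asarray(g_t, float))
```

With A equal to the identity this reduces to the plain gradient step.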


Regret Bounds

What does adaptation buy?

• Standard regret bound:

\[
\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \;\le\; \frac{1}{2\eta}\|x_1 - x^*\|_2^2 + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_2^2
\]

• Regret bound with a matrix:

\[
\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \;\le\; \frac{1}{2\eta}\|x_1 - x^*\|_A^2 + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{A^{-1}}^2
\]


Meta Learning Problem

• Have regret:

\[
\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \;\le\; \frac{1}{2\eta}\|x_1 - x^*\|_A^2 + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{A^{-1}}^2
\]

• What happens if we minimize over A in hindsight?

\[
\min_A \; \sum_{t=1}^{T} \bigl\langle g_t, A^{-1} g_t \bigr\rangle
\quad \text{subject to } A \succeq 0,\; \operatorname{tr}(A) \le C
\]

• The solution is of the form

\[
A = c \,\operatorname{diag}\Bigl(\sum_{t=1}^{T} g_t g_t^\top\Bigr)^{1/2} \;\text{(diagonal)}
\qquad\text{or}\qquad
A = c \Bigl(\sum_{t=1}^{T} g_t g_t^\top\Bigr)^{1/2} \;\text{(full)}
\]

(where c is chosen to satisfy the trace constraint)

• Let g_{1:T,j} be the vector of jth gradient components. In the diagonal case the optimum is

\[
A_{jj} \propto \|g_{1:T,j}\|_2
\]
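The diagonal-case solution above is easy to compute directly; this sketch (function name is illustrative) builds A with A_jj proportional to ‖g_{1:T,j}‖₂ and c set so the trace constraint holds with equality:

```python
import numpy as np

def optimal_diag_metric(G, trace_budget):
    """Hindsight-optimal diagonal metric for
    min_A sum_t <g_t, A^{-1} g_t>  s.t.  A >= 0, tr(A) <= C.
    Rows of G are the gradients g_1, ..., g_T.  The diagonal of
    diag(sum_t g_t g_t^T)^{1/2} is exactly the per-coordinate norms
    ||g_{1:T,j}||_2, rescaled so that tr(A) = C."""
    col_norms = np.linalg.norm(np.asarray(G, float), axis=0)  # ||g_{1:T,j}||_2
    c = trace_budget / col_norms.sum()                        # meet tr(A) = C
    return np.diag(c * col_norms)
```

Coordinates whose gradients are always zero correctly receive zero weight.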


Low regret to the best A

• Let g_{1:t,j} be the vector of jth gradient components. At time t, use

\[
s_t = \bigl(\|g_{1:t,j}\|_2\bigr)_{j=1}^{d}, \qquad A_t = \operatorname{diag}(s_t),
\]
\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \tfrac{1}{2}\|x - x_t\|_{A_t}^2 + \eta \langle g_t, x\rangle \Bigr\}
\]

• Example:

 y_t   g_t,1   g_t,2   g_t,3
  1     1       0       0
 −1     .5      0       1
  1    −.5      1       0
 −1     0       0       0
  1     1       1       0
 −1     1       0       0

  s_1 = √3.5,  s_2 = √2,  s_3 = 1
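For unconstrained X the update above separates per coordinate into x_{t+1,j} = x_{t,j} − η g_{t,j}/s_{t,j}; a minimal sketch, which also reproduces the s-values from the example table:

```python
import numpy as np

def adagrad(grads, eta=1.0):
    """Diagonal AdaGrad sketch (unconstrained X): with
    s_{t,j} = ||g_{1:t,j}||_2 and A_t = diag(s_t), the update
    x_{t+1} = argmin_x (1/2)||x - x_t||_{A_t}^2 + eta <g_t, x>
    reduces to the coordinate-wise step x_{t+1} = x_t - eta * g_t / s_t.
    Rows of `grads` are the subgradients g_1, ..., g_T.
    Returns the final iterate and the final scaling vector s_T."""
    grads = np.asarray(grads, dtype=float)
    x = np.zeros(grads.shape[1])
    sq_sum = np.zeros_like(x)
    for g in grads:
        sq_sum += g * g
        s = np.sqrt(sq_sum)
        # coordinates never touched (s_j = 0) take no step
        x = x - np.divide(eta * g, s, out=np.zeros_like(g), where=s > 0)
    return x, s
```

Running it on the table's gradients yields s = (√3.5, √2, 1), matching the slide.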


Final Convergence Guarantee

Algorithm: at time t, set

\[
s_t = \bigl(\|g_{1:t,j}\|_2\bigr)_{j=1}^{d}, \qquad A_t = \operatorname{diag}(s_t),
\]
\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \tfrac{1}{2}\|x - x_t\|_{A_t}^2 + \eta \langle g_t, x\rangle \Bigr\}
\]

Define the radius

\[
R_\infty := \max_t \|x_t - x^*\|_\infty \le \sup_{x \in \mathcal{X}} \|x - x^*\|_\infty .
\]

Theorem
The final regret bound of AdaGrad:

\[
\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \;\le\; 2 R_\infty \sum_{j=1}^{d} \|g_{1:T,j}\|_2 .
\]


Understanding the convergence guarantees I

• Stochastic convex optimization:

\[
f(x) := \mathbb{E}_P\bigl[f(x;\xi)\bigr]
\]

Sample ξ_t according to P and define f_t(x) := f(x; ξ_t). Then

\[
\mathbb{E}\Bigl[f\Bigl(\frac{1}{T}\sum_{t=1}^{T} x_t\Bigr)\Bigr] - f(x^*)
\;\le\; \frac{2R_\infty}{T} \sum_{j=1}^{d} \mathbb{E}\bigl[\|g_{1:T,j}\|_2\bigr]
\]


Understanding the convergence guarantees II

Support vector machine example: define

\[
f(x;\xi) = [1 - \langle x, \xi\rangle]_+ , \quad \text{where } \xi \in \{-1,0,1\}^d .
\]

• If ξ_j ≠ 0 with probability ∝ j^{−α} for α > 1:

\[
\mathbb{E}\Bigl[f\Bigl(\frac{1}{T}\sum_{t=1}^{T} x_t\Bigr)\Bigr] - f(x^*)
= O\Bigl(\frac{\|x^*\|_\infty}{\sqrt{T}} \cdot \max\bigl\{\log d,\; d^{\,1-\alpha/2}\bigr\}\Bigr)
\]

• Previously best-known method:

\[
\mathbb{E}\Bigl[f\Bigl(\frac{1}{T}\sum_{t=1}^{T} x_t\Bigr)\Bigr] - f(x^*)
= O\Bigl(\frac{\|x^*\|_\infty}{\sqrt{T}} \cdot \sqrt{d}\Bigr).
\]
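The hinge loss in this example is non-smooth at the margin boundary, so a subgradient is used; a small sketch of one valid choice (function name is illustrative):

```python
import numpy as np

def hinge_subgradient(x, xi):
    """A subgradient of f(x; xi) = max(0, 1 - <x, xi>):
    -xi when the margin is violated (1 - <x, xi> > 0), else 0.
    At the kink (margin exactly 0) any convex combination is valid;
    this sketch picks 0."""
    return -xi if 1.0 - float(x @ xi) > 0 else np.zeros_like(xi)
```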


Understanding the convergence guarantees III

Back to regret minimization

• Convergence almost as good as that of the best geometry matrix:

\[
\sum_{t=1}^{T} f_t(x_t) - f_t(x^*)
\;\le\; 2\sqrt{d}\,\|x^*\|_\infty
\sqrt{\inf_s \Bigl\{ \sum_{t=1}^{T} \|g_t\|^2_{\operatorname{diag}(s)^{-1}} \;:\; s \succeq 0,\; \langle \mathbf{1}, s\rangle \le d \Bigr\}}
\]

• This bound (and others) is minimax optimal


The AdaGrad Algorithms

The analysis applies to several algorithms. With

\[
s_t = \bigl(\|g_{1:t,j}\|_2\bigr)_{j=1}^{d}, \qquad A_t = \operatorname{diag}(s_t):
\]

• Forward-backward splitting (Lions and Mercier 1979, Nesterov 2007, Duchi and Singer 2009, others):

\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \tfrac{1}{2}\|x - x_t\|_{A_t}^2 + \langle g_t, x\rangle + \varphi(x) \Bigr\}
\]

• Regularized dual averaging (Nesterov 2007, Xiao 2010):

\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \frac{1}{t}\sum_{\tau=1}^{t} \langle g_\tau, x\rangle + \varphi(x) + \frac{1}{2t}\|x\|_{A_t}^2 \Bigr\}
\]


An Example and Experimental Results

• ℓ1-regularization
• Text classification
• Image ranking
• Neural network learning


AdaGrad with composite updates

Recall the more general problem:

\[
\sum_{t=1}^{T} \bigl(f_t(x_t) + \varphi(x_t)\bigr) - \inf_{x \in \mathcal{X}} \sum_{t=1}^{T} \bigl(f_t(x) + \varphi(x)\bigr)
\]

• Must solve updates of the form

\[
\operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \langle g, x\rangle + \varphi(x) + \tfrac{1}{2}\|x\|_A^2 \Bigr\}
\]

• Luckily, these are often still simple


AdaGrad with ℓ1 regularization

Set \(\bar{g}_t = \frac{1}{t}\sum_{\tau=1}^{t} g_\tau\). Need to solve

\[
\min_x \; \langle \bar{g}_t, x\rangle + \lambda \|x\|_1 + \frac{1}{2t}\bigl\langle x, \operatorname{diag}(s_t)\, x\bigr\rangle
\]

• The coordinate-wise update yields both sparsity and adaptivity:

\[
x_{t+1,j} = \operatorname{sign}(-\bar{g}_{t,j})\, \frac{t}{\|g_{1:t,j}\|_2}\, \bigl[\,|\bar{g}_{t,j}| - \lambda\,\bigr]_+
\]

[Figure: trajectory of the running average ḡ_t,j over t = 0..300, with the λ truncation threshold marked]
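The closed-form update above is a soft-threshold followed by an adaptive rescaling; a direct vectorized sketch (function name is illustrative):

```python
import numpy as np

def adagrad_l1_update(gbar_t, s_t, t, lam):
    """Coordinate-wise solution of the slide's problem
    min_x <gbar_t, x> + lam*||x||_1 + (1/(2t)) <x, diag(s_t) x>:
    x_j = sign(-gbar_{t,j}) * (t / s_{t,j}) * [|gbar_{t,j}| - lam]_+ .
    Coordinates with |gbar_{t,j}| <= lam are truncated to exactly zero,
    which is where the sparsity comes from."""
    gbar_t = np.asarray(gbar_t, float)
    s_t = np.asarray(s_t, float)
    shrunk = np.maximum(np.abs(gbar_t) - lam, 0.0)
    scale = np.divide(float(t), s_t, out=np.zeros_like(s_t), where=s_t > 0)
    return np.sign(-gbar_t) * scale * shrunk
```

Note that coordinates with small ‖g_{1:t,j}‖₂ get a larger step, the adaptivity half of the claim.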


Text Classification

Reuters RCV1 document classification task: d = 2·10^6 features,
approximately 4000 non-zero features per document,

\[
f_t(x) := [1 - \langle x, \xi_t\rangle]_+
\]

where ξ_t ∈ {−1, 0, 1}^d is the data sample.

              FOBOS        AdaGrad      PA [1]   AROW [2]
Economics     .058 (.194)  .044 (.086)  .059     .049
Corporate     .111 (.226)  .053 (.105)  .107     .061
Government    .056 (.183)  .040 (.080)  .066     .044
Medicine      .056 (.146)  .035 (.063)  .053     .039

Test set classification error rate
(sparsity of the final predictor in parentheses)

[1] Crammer et al., 2006
[2] Crammer et al., 2009


Image Ranking

ImageNet (Deng et al., 2009), a large-scale hierarchical image database.

[Figure: fragment of the noun hierarchy: Animal splits into Vertebrate and Invertebrate; Vertebrate into Fish and Mammal; Mammal into Cow]

Train 15,000 rankers/classifiers to rank images for each noun (as in
Grangier and Bengio, 2008).

Data: ξ = (z^1, z^2) ∈ {0,1}^d × {0,1}^d is a pair of images, with loss

\[
f(x; z^1, z^2) = \bigl[1 - \langle x, z^1 - z^2\rangle\bigr]_+
\]


Image Ranking Results

Precision at k: proportion of examples in the top k that belong to the
category. Average precision summarizes the placement of all positive
examples.

Algorithm   Avg. Prec.  P@1     P@5     P@10    Nonzero
AdaGrad     0.6022      0.8502  0.8130  0.7811  0.7267
AROW        0.5813      0.8597  0.8165  0.7816  1.0000
PA          0.5581      0.8455  0.7957  0.7576  1.0000
FOBOS       0.5042      0.7496  0.6950  0.6545  0.8996
Duchi et al. (UC Berkeley) <strong>Adaptive</strong> <strong>Subgradient</strong> <strong>Methods</strong> ISMP 2012 24 / 32


Neural Network <strong>Learning</strong><br />

Wildly non-convex problem:<br />

where<br />

f(x; ξ) = log (1 + exp (〈[p(〈x1, ξ1〉) · · · p(〈xk, ξk〉)], ξ0〉))<br />

p(α) =<br />

1<br />

1 + exp(α)<br />

p(hx1, ⇠1i)<br />

x1 x2 x3 x4 x5<br />

⇠1 ⇠2 ⇠3 ⇠5 ⇠4<br />

Duchi et al. (UC Berkeley) <strong>Adaptive</strong> <strong>Subgradient</strong> <strong>Methods</strong> ISMP 2012 25 / 32


Neural Network <strong>Learning</strong><br />

Wildly non-convex problem:<br />

where<br />

f(x; ξ) = log (1 + exp (〈[p(〈x1, ξ1〉) · · · p(〈xk, ξk〉)], ξ0〉))<br />

p(α) =<br />

1<br />

1 + exp(α)<br />

p(hx1, ⇠1i)<br />

x1 x2 x3 x4 x5<br />

⇠1 ⇠2 ⇠3 ⇠5 ⇠4<br />

Idea: Use stochastic gradient methods to solve it anyway<br />

Duchi et al. (UC Berkeley) <strong>Adaptive</strong> <strong>Subgradient</strong> <strong>Methods</strong> ISMP 2012 25 / 32


Neural Network Learning

[Figure: average frame accuracy (%) on the test set versus training time
(hours) for SGD, GPU, Downpour SGD, Downpour SGD with AdaGrad, and
Sandblaster L-BFGS (Dean et al., 2012)]

Distributed, d = 1.7·10^9 parameters. SGD and AdaGrad use 80 machines
(1000 cores); L-BFGS uses 800 (10,000 cores).


Conclusions and Discussion

• A family of algorithms that adapt to the geometry of the data
• Extendable to the full-matrix case to handle feature correlation
• Can derive many efficient algorithms for high-dimensional problems, especially with sparsity
• Future: efficient full-matrix adaptivity, other types of adaptation


Thanks!


OGD Sketch: "Almost" Contraction

• Have g_t ∈ ∂f_t(x_t) (ignore ϕ and X for simplicity)
• Before: x_{t+1} = x_t − η g_t, giving

\[
\tfrac{1}{2}\|x_{t+1} - x^*\|_2^2 \;\le\; \tfrac{1}{2}\|x_t - x^*\|_2^2 + \eta\bigl(f_t(x^*) - f_t(x_t)\bigr) + \frac{\eta^2}{2}\|g_t\|_2^2
\]

• Now: x_{t+1} = x_t − η A^{-1} g_t, giving

\[
\begin{aligned}
\tfrac{1}{2}\|x_{t+1} - x^*\|_A^2
&= \tfrac{1}{2}\|x_t - x^*\|_A^2 + \eta \langle g_t, x^* - x_t\rangle + \frac{\eta^2}{2}\|g_t\|_{A^{-1}}^2 \\
&\le \tfrac{1}{2}\|x_t - x^*\|_A^2 + \eta\bigl(f_t(x^*) - f_t(x_t)\bigr) + \frac{\eta^2}{2}\underbrace{\|g_t\|_{A^{-1}}^2}_{\text{dual norm to } \|\cdot\|_A}
\end{aligned}
\]
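The first line of the "Now" display is an exact algebraic identity (the inequality only enters via convexity of f_t), so it can be checked numerically; a small sketch, with illustrative names:

```python
import numpy as np

def expand_step(x_t, x_star, g_t, A, eta):
    """Numerically check the slide's identity: for
    x_{t+1} = x_t - eta * A^{-1} g_t,
    (1/2)||x_{t+1} - x*||_A^2 equals
    (1/2)||x_t - x*||_A^2 + eta <g_t, x* - x_t> + (eta^2/2) ||g_t||^2_{A^{-1}}.
    Returns (lhs, rhs), which should agree to numerical precision."""
    Ainv_g = np.linalg.solve(A, g_t)
    x_next = x_t - eta * Ainv_g
    lhs = 0.5 * (x_next - x_star) @ A @ (x_next - x_star)
    rhs = (0.5 * (x_t - x_star) @ A @ (x_t - x_star)
           + eta * g_t @ (x_star - x_t)
           + 0.5 * eta**2 * g_t @ Ainv_g)   # ||g||^2_{A^{-1}} = <g, A^{-1} g>
    return lhs, rhs
```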


Hindsight minimization

• Focus on the diagonal case (the full-matrix case is similar):

\[
\min_s \; \sum_{t=1}^{T} \bigl\langle g_t, \operatorname{diag}(s)^{-1} g_t \bigr\rangle
\quad \text{subject to } s \succeq 0,\; \langle \mathbf{1}, s\rangle \le C
\]

• Let g_{1:T,j} be the vector of jth components. The solution is of the form

\[
s_j \propto \|g_{1:T,j}\|_2
\]


Low regret to the best A

\[
\sum_{t=1}^{T} f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)
\;\le\;
\underbrace{\frac{1}{2\eta}\sum_{t=1}^{T}\Bigl(\|x_t - x^*\|_{A_t}^2 - \|x_{t+1} - x^*\|_{A_t}^2\Bigr)}_{\text{Term I}}
\;+\;
\underbrace{\frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{A_t^{-1}}^2}_{\text{Term II}}
\]


Bounding Terms

Define \(D_\infty = \max_t \|x_t - x^*\|_\infty \le \sup_{x \in \mathcal{X}} \|x - x^*\|_\infty\).

• Term I:

\[
\sum_{t=1}^{T}\Bigl(\|x_t - x^*\|_{A_t}^2 - \|x_{t+1} - x^*\|_{A_t}^2\Bigr)
\;\le\; D_\infty^2 \sum_{j=1}^{d} \|g_{1:T,j}\|_2
\]

• Term II:

\[
\sum_{t=1}^{T}\|g_t\|_{A_t^{-1}}^2
\;\le\; 2 \sum_{t=1}^{T}\|g_t\|_{A_T^{-1}}^2
\;=\; 2 \sum_{j=1}^{d} \|g_{1:T,j}\|_2 ,
\]

which is twice the value of the hindsight problem

\[
\inf_s \Bigl\{ \sum_{t=1}^{T} \bigl\langle g_t, \operatorname{diag}(s)^{-1} g_t \bigr\rangle \;:\; s \succeq 0,\; \langle \mathbf{1}, s\rangle \le \sum_{j=1}^{d}\|g_{1:T,j}\|_2 \Bigr\}
\;=\; \sum_{j=1}^{d}\|g_{1:T,j}\|_2 .
\]
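The Term II bound holds deterministically for any gradient sequence, so it can be spot-checked numerically; a sketch with an illustrative function name:

```python
import numpy as np

def term_ii_check(G):
    """Numerically check the Term II lemma for diagonal AdaGrad:
    sum_t ||g_t||^2_{A_t^{-1}} <= 2 * sum_j ||g_{1:T,j}||_2,
    where A_t = diag(s_t) and s_{t,j} = ||g_{1:t,j}||_2.
    Rows of G are the gradients g_1, ..., g_T.  Returns (lhs, rhs)."""
    G = np.asarray(G, dtype=float)
    sq_sum = np.zeros(G.shape[1])
    lhs = 0.0
    for g in G:
        sq_sum += g * g
        s = np.sqrt(sq_sum)
        # coordinates with s_j = 0 contribute nothing (g_j = 0 there too)
        lhs += float(np.sum(np.divide(g * g, s, out=np.zeros_like(g), where=s > 0)))
    rhs = 2.0 * np.linalg.norm(G, axis=0).sum()
    return lhs, rhs
```

Per coordinate, this is the scalar inequality ∑_t a_t / √(a_1 + ... + a_t) ≤ 2√(∑_t a_t) with a_t = g_{t,j}².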
