Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

John C. Duchi (1,2), Elad Hazan (3), Yoram Singer (2)

1: University of California, Berkeley
2: Google Research
3: Technion

International Symposium on Mathematical Programming 2012

Duchi et al. (UC Berkeley), Adaptive Subgradient Methods, ISMP 2012


Setting: Online Convex Optimization

Online learning task; repeat:

• Learner plays point x_t
• Receives function f_t
• Suffers loss f_t(x_t) + ϕ(x_t)

Running example (a parameter vector for features):

• Receive label y_t and features φ_t
• Suffer the regularized logistic loss

\[
\log\bigl[1 + \exp(-y_t \langle \phi_t, x_t\rangle)\bigr] + \lambda \|x_t\|_1
\]

Goal: attain small regret

\[
\sum_{t=1}^{T} \bigl(f_t(x_t) + \varphi(x_t)\bigr) \;-\; \inf_{x \in \mathcal{X}} \sum_{t=1}^{T} \bigl(f_t(x) + \varphi(x)\bigr)
\]
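The loss and regret above can be made concrete in a few lines. A minimal NumPy sketch; the function names are illustrative, not from the paper:

```python
import numpy as np

def regularized_logistic_loss(x, feats, y, lam):
    """The slide's example loss: log(1 + exp(-y <feats, x>)) + lam * ||x||_1.
    `feats` plays the role of phi_t and `y` of y_t."""
    return np.log1p(np.exp(-y * (feats @ x))) + lam * np.abs(x).sum()

def regret(played_losses, best_fixed_loss):
    """Regret after T rounds: the learner's cumulative loss minus the
    cumulative loss of the best fixed point in hindsight."""
    return float(np.sum(played_losses)) - best_fixed_loss
```

At x = 0 the logistic term is log 2 regardless of the features, which is a handy sanity check.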


Motivation

Text data:

    "The most unsung birthday in American business and technological
    history this year may be the 50th anniversary of the Xerox 914
    photocopier." (The Atlantic, July/August 2010)

High-dimensional image features

Other motivations: selecting advertisements in online advertising,
document ranking, and problems whose parameters vary over many orders
of magnitude...


Goal?

Flipping around the usual sparsity game:

\[
\min_x \; \|Ax - b\|, \qquad A = [a_1\; a_2\; \cdots\; a_n]^\top \in \mathbb{R}^{n \times d}
\]

Bounds in sparsity-focused settings usually depend on

\[
\underbrace{\|a_i\|_\infty}_{\text{dense}} \cdot \underbrace{\|x\|_1}_{\text{sparse}}
\]

What we would like:

\[
\underbrace{\|a_i\|_1}_{\text{sparse}} \cdot \underbrace{\|x\|_\infty}_{\text{dense}}
\]

(In general, impossible.)


Approaches: Gradient Descent and Dual Averaging

Let g_t ∈ ∂f_t(x_t). Gradient descent linearizes f at x_t, i.e. replaces f(x) by f(x_t) + ⟨g_t, x − x_t⟩, and steps:

\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \tfrac{1}{2}\|x - x_t\|^2 + \eta_t \langle g_t, x\rangle \Bigr\}
\]

or dual averaging:

\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \frac{1}{t}\sum_{\tau=1}^{t} \langle g_\tau, x\rangle + \frac{1}{2t}\|x\|^2 \Bigr\}
\]
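In the unconstrained case (X = R^d, an assumption not made on the slide) both argmins have closed forms, sketched below; for gradient descent the minimizer is x_t − η_t g_t, and for dual averaging the 1/t factors cancel, leaving the negated gradient sum:

```python
import numpy as np

def gd_step(x_t, g_t, eta_t):
    """Unconstrained gradient-descent step: the argmin of
    (1/2)||x - x_t||^2 + eta_t <g_t, x> is x_t - eta_t * g_t."""
    return np.asarray(x_t, float) - eta_t * np.asarray(g_t, float)

def dual_averaging_step(grad_sum):
    """Unconstrained dual-averaging step: minimizing
    (1/t)<grad_sum, x> + (1/(2t))||x||^2 over all x gives
    x = -grad_sum (setting the gradient grad_sum/t + x/t to zero)."""
    return -np.asarray(grad_sum, float)
```

With a constraint set X, both would additionally need a projection onto X.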


What is the problem?

• Gradient steps treat all features as equal
• They are not!


Adapting to the Geometry of the Space


Why adapt to geometry?

 y_t   φ_t,1   φ_t,2   φ_t,3
  1     1       0       0
 −1     .5      0       1
  1    −.5      1       0
 −1     0       0       0
  1     .5      0       0
 −1     1       0       0
  1    −1       1       0
 −1    −.5      0       1

Feature 1: frequent, irrelevant (the "hard" direction)
Features 2, 3: infrequent, predictive (the "nice" directions)


Adapting to the Geometry of the Space

• Receive g_t ∈ ∂f_t(x_t)
• Earlier:

\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \tfrac{1}{2}\|x - x_t\|^2 + \eta \langle g_t, x\rangle \Bigr\}
\]

• Now: let ‖x‖_A^2 = ⟨x, Ax⟩ for A ≻ 0. Use

\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \tfrac{1}{2}\|x - x_t\|_A^2 + \eta \langle g_t, x\rangle \Bigr\}
\]
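For unconstrained X (an assumption), the metric-weighted argmin also has a closed form: setting the gradient A(x − x_t) + η g_t to zero gives x_t − η A⁻¹ g_t. A small sketch:

```python
import numpy as np

def metric_gd_step(x_t, g_t, A, eta):
    """Unconstrained step under the norm ||x||_A^2 = <x, A x>:
    the argmin of (1/2)||x - x_t||_A^2 + eta <g_t, x>
    is x_t - eta * A^{-1} g_t (solve rather than invert A)."""
    return np.asarray(x_t, float) - eta * np.linalg.solve(A, np.asarray(g_t, float))
```

With A equal to the identity this reduces to the plain gradient step.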


Regret Bounds

What does adaptation buy?

• Standard regret bound:

\[
\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \;\le\; \frac{1}{2\eta}\|x_1 - x^*\|_2^2 + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_2^2
\]

• Regret bound with a matrix:

\[
\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \;\le\; \frac{1}{2\eta}\|x_1 - x^*\|_A^2 + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{A^{-1}}^2
\]


Meta Learning Problem

• Have regret:

\[
\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \;\le\; \frac{1}{2\eta}\|x_1 - x^*\|_A^2 + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{A^{-1}}^2
\]

• What happens if we minimize over A in hindsight?

\[
\min_A \; \sum_{t=1}^{T} \bigl\langle g_t, A^{-1} g_t \bigr\rangle
\quad \text{subject to } A \succeq 0,\; \operatorname{tr}(A) \le C
\]

• The solution is of the form

\[
A = c \,\operatorname{diag}\Bigl(\sum_{t=1}^{T} g_t g_t^\top\Bigr)^{1/2} \;\text{(diagonal)}
\qquad\text{or}\qquad
A = c \Bigl(\sum_{t=1}^{T} g_t g_t^\top\Bigr)^{1/2} \;\text{(full)}
\]

(where c is chosen to satisfy the trace constraint)

• Let g_{1:T,j} be the vector of jth gradient components. In the diagonal case the optimum is

\[
A_{jj} \propto \|g_{1:T,j}\|_2
\]
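The diagonal-case solution above is easy to compute directly; this sketch (function name is illustrative) builds A with A_jj proportional to ‖g_{1:T,j}‖₂ and c set so the trace constraint holds with equality:

```python
import numpy as np

def optimal_diag_metric(G, trace_budget):
    """Hindsight-optimal diagonal metric for
    min_A sum_t <g_t, A^{-1} g_t>  s.t.  A >= 0, tr(A) <= C.
    Rows of G are the gradients g_1, ..., g_T.  The diagonal of
    diag(sum_t g_t g_t^T)^{1/2} is exactly the per-coordinate norms
    ||g_{1:T,j}||_2, rescaled so that tr(A) = C."""
    col_norms = np.linalg.norm(np.asarray(G, float), axis=0)  # ||g_{1:T,j}||_2
    c = trace_budget / col_norms.sum()                        # meet tr(A) = C
    return np.diag(c * col_norms)
```

Coordinates whose gradients are always zero correctly receive zero weight.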


Low regret to the best A

• Let g_{1:t,j} be the vector of jth gradient components. At time t, use

\[
s_t = \bigl(\|g_{1:t,j}\|_2\bigr)_{j=1}^{d}, \qquad A_t = \operatorname{diag}(s_t),
\]
\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \tfrac{1}{2}\|x - x_t\|_{A_t}^2 + \eta \langle g_t, x\rangle \Bigr\}
\]

• Example:

 y_t   g_t,1   g_t,2   g_t,3
  1     1       0       0
 −1     .5      0       1
  1    −.5      1       0
 −1     0       0       0
  1     1       1       0
 −1     1       0       0

  s_1 = √3.5,  s_2 = √2,  s_3 = 1
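For unconstrained X the update above separates per coordinate into x_{t+1,j} = x_{t,j} − η g_{t,j}/s_{t,j}; a minimal sketch, which also reproduces the s-values from the example table:

```python
import numpy as np

def adagrad(grads, eta=1.0):
    """Diagonal AdaGrad sketch (unconstrained X): with
    s_{t,j} = ||g_{1:t,j}||_2 and A_t = diag(s_t), the update
    x_{t+1} = argmin_x (1/2)||x - x_t||_{A_t}^2 + eta <g_t, x>
    reduces to the coordinate-wise step x_{t+1} = x_t - eta * g_t / s_t.
    Rows of `grads` are the subgradients g_1, ..., g_T.
    Returns the final iterate and the final scaling vector s_T."""
    grads = np.asarray(grads, dtype=float)
    x = np.zeros(grads.shape[1])
    sq_sum = np.zeros_like(x)
    for g in grads:
        sq_sum += g * g
        s = np.sqrt(sq_sum)
        # coordinates never touched (s_j = 0) take no step
        x = x - np.divide(eta * g, s, out=np.zeros_like(g), where=s > 0)
    return x, s
```

Running it on the table's gradients yields s = (√3.5, √2, 1), matching the slide.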


Final Convergence Guarantee

Algorithm: at time t, set

\[
s_t = \bigl(\|g_{1:t,j}\|_2\bigr)_{j=1}^{d}, \qquad A_t = \operatorname{diag}(s_t),
\]
\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \tfrac{1}{2}\|x - x_t\|_{A_t}^2 + \eta \langle g_t, x\rangle \Bigr\}
\]

Define the radius

\[
R_\infty := \max_t \|x_t - x^*\|_\infty \le \sup_{x \in \mathcal{X}} \|x - x^*\|_\infty .
\]

Theorem
The final regret bound of AdaGrad:

\[
\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \;\le\; 2 R_\infty \sum_{j=1}^{d} \|g_{1:T,j}\|_2 .
\]


Understanding the convergence guarantees I

• Stochastic convex optimization:

\[
f(x) := \mathbb{E}_P\bigl[f(x;\xi)\bigr]
\]

Sample ξ_t according to P and define f_t(x) := f(x; ξ_t). Then

\[
\mathbb{E}\Bigl[f\Bigl(\frac{1}{T}\sum_{t=1}^{T} x_t\Bigr)\Bigr] - f(x^*)
\;\le\; \frac{2R_\infty}{T} \sum_{j=1}^{d} \mathbb{E}\bigl[\|g_{1:T,j}\|_2\bigr]
\]


Understanding the convergence guarantees II

Support vector machine example: define

\[
f(x;\xi) = [1 - \langle x, \xi\rangle]_+ , \quad \text{where } \xi \in \{-1,0,1\}^d .
\]

• If ξ_j ≠ 0 with probability ∝ j^{−α} for α > 1:

\[
\mathbb{E}\Bigl[f\Bigl(\frac{1}{T}\sum_{t=1}^{T} x_t\Bigr)\Bigr] - f(x^*)
= O\Bigl(\frac{\|x^*\|_\infty}{\sqrt{T}} \cdot \max\bigl\{\log d,\; d^{\,1-\alpha/2}\bigr\}\Bigr)
\]

• Previously best-known method:

\[
\mathbb{E}\Bigl[f\Bigl(\frac{1}{T}\sum_{t=1}^{T} x_t\Bigr)\Bigr] - f(x^*)
= O\Bigl(\frac{\|x^*\|_\infty}{\sqrt{T}} \cdot \sqrt{d}\Bigr).
\]
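The hinge loss in this example is non-smooth at the margin boundary, so a subgradient is used; a small sketch of one valid choice (function name is illustrative):

```python
import numpy as np

def hinge_subgradient(x, xi):
    """A subgradient of f(x; xi) = max(0, 1 - <x, xi>):
    -xi when the margin is violated (1 - <x, xi> > 0), else 0.
    At the kink (margin exactly 0) any convex combination is valid;
    this sketch picks 0."""
    return -xi if 1.0 - float(x @ xi) > 0 else np.zeros_like(xi)
```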


Understanding the convergence guarantees III

Back to regret minimization

• Convergence almost as good as that of the best geometry matrix:

\[
\sum_{t=1}^{T} f_t(x_t) - f_t(x^*)
\;\le\; 2\sqrt{d}\,\|x^*\|_\infty
\sqrt{\inf_s \Bigl\{ \sum_{t=1}^{T} \|g_t\|^2_{\operatorname{diag}(s)^{-1}} \;:\; s \succeq 0,\; \langle \mathbf{1}, s\rangle \le d \Bigr\}}
\]

• This bound (and others) is minimax optimal


The AdaGrad Algorithms

The analysis applies to several algorithms. With

\[
s_t = \bigl(\|g_{1:t,j}\|_2\bigr)_{j=1}^{d}, \qquad A_t = \operatorname{diag}(s_t):
\]

• Forward-backward splitting (Lions and Mercier 1979, Nesterov 2007, Duchi and Singer 2009, others):

\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \tfrac{1}{2}\|x - x_t\|_{A_t}^2 + \langle g_t, x\rangle + \varphi(x) \Bigr\}
\]

• Regularized dual averaging (Nesterov 2007, Xiao 2010):

\[
x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \frac{1}{t}\sum_{\tau=1}^{t} \langle g_\tau, x\rangle + \varphi(x) + \frac{1}{2t}\|x\|_{A_t}^2 \Bigr\}
\]


An Example and Experimental Results

• ℓ1-regularization
• Text classification
• Image ranking
• Neural network learning


AdaGrad with composite updates

Recall the more general problem:

\[
\sum_{t=1}^{T} \bigl(f_t(x_t) + \varphi(x_t)\bigr) - \inf_{x \in \mathcal{X}} \sum_{t=1}^{T} \bigl(f_t(x) + \varphi(x)\bigr)
\]

• Must solve updates of the form

\[
\operatorname*{argmin}_{x \in \mathcal{X}} \Bigl\{ \langle g, x\rangle + \varphi(x) + \tfrac{1}{2}\|x\|_A^2 \Bigr\}
\]

• Luckily, these are often still simple


AdaGrad with ℓ1 regularization

Set \(\bar{g}_t = \frac{1}{t}\sum_{\tau=1}^{t} g_\tau\). Need to solve

\[
\min_x \; \langle \bar{g}_t, x\rangle + \lambda \|x\|_1 + \frac{1}{2t}\bigl\langle x, \operatorname{diag}(s_t)\, x\bigr\rangle
\]

• The coordinate-wise update yields both sparsity and adaptivity:

\[
x_{t+1,j} = \operatorname{sign}(-\bar{g}_{t,j})\, \frac{t}{\|g_{1:t,j}\|_2}\, \bigl[\,|\bar{g}_{t,j}| - \lambda\,\bigr]_+
\]

[Figure: trajectory of the running average ḡ_t,j over t = 0..300, with the λ truncation threshold marked]
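The closed-form update above is a soft-threshold followed by an adaptive rescaling; a direct vectorized sketch (function name is illustrative):

```python
import numpy as np

def adagrad_l1_update(gbar_t, s_t, t, lam):
    """Coordinate-wise solution of the slide's problem
    min_x <gbar_t, x> + lam*||x||_1 + (1/(2t)) <x, diag(s_t) x>:
    x_j = sign(-gbar_{t,j}) * (t / s_{t,j}) * [|gbar_{t,j}| - lam]_+ .
    Coordinates with |gbar_{t,j}| <= lam are truncated to exactly zero,
    which is where the sparsity comes from."""
    gbar_t = np.asarray(gbar_t, float)
    s_t = np.asarray(s_t, float)
    shrunk = np.maximum(np.abs(gbar_t) - lam, 0.0)
    scale = np.divide(float(t), s_t, out=np.zeros_like(s_t), where=s_t > 0)
    return np.sign(-gbar_t) * scale * shrunk
```

Note that coordinates with small ‖g_{1:t,j}‖₂ get a larger step, the adaptivity half of the claim.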


Text Classification

Reuters RCV1 document classification task: d = 2·10^6 features,
approximately 4000 non-zero features per document,

\[
f_t(x) := [1 - \langle x, \xi_t\rangle]_+
\]

where ξ_t ∈ {−1, 0, 1}^d is the data sample.

              FOBOS        AdaGrad      PA [1]   AROW [2]
Economics     .058 (.194)  .044 (.086)  .059     .049
Corporate     .111 (.226)  .053 (.105)  .107     .061
Government    .056 (.183)  .040 (.080)  .066     .044
Medicine      .056 (.146)  .035 (.063)  .053     .039

Test set classification error rate
(sparsity of the final predictor in parentheses)

[1] Crammer et al., 2006
[2] Crammer et al., 2009


Image Ranking

ImageNet (Deng et al., 2009), a large-scale hierarchical image database.

[Figure: fragment of the noun hierarchy: Animal splits into Vertebrate and Invertebrate; Vertebrate into Fish and Mammal; Mammal into Cow]

Train 15,000 rankers/classifiers to rank images for each noun (as in
Grangier and Bengio, 2008).

Data: ξ = (z^1, z^2) ∈ {0,1}^d × {0,1}^d is a pair of images, with loss

\[
f(x; z^1, z^2) = \bigl[1 - \langle x, z^1 - z^2\rangle\bigr]_+
\]


Image Ranking Results

Precision at k: proportion of examples in the top k that belong to the
category. Average precision summarizes the placement of all positive
examples.

Algorithm   Avg. Prec.  P@1     P@5     P@10    Nonzero
AdaGrad     0.6022      0.8502  0.8130  0.7811  0.7267
AROW        0.5813      0.8597  0.8165  0.7816  1.0000
PA          0.5581      0.8455  0.7957  0.7576  1.0000
FOBOS       0.5042      0.7496  0.6950  0.6545  0.8996
Duchi et al. (UC Berkeley) <strong>Adaptive</strong> <strong>Subgradient</strong> <strong>Methods</strong> ISMP 2012 24 / 32


Neural Network <strong>Learning</strong><br />

Wildly non-convex problem:<br />

where<br />

f(x; ξ) = log (1 + exp (〈[p(〈x1, ξ1〉) · · · p(〈xk, ξk〉)], ξ0〉))<br />

p(α) =<br />

1<br />

1 + exp(α)<br />

p(hx1, ⇠1i)<br />

x1 x2 x3 x4 x5<br />

⇠1 ⇠2 ⇠3 ⇠5 ⇠4<br />

Duchi et al. (UC Berkeley) <strong>Adaptive</strong> <strong>Subgradient</strong> <strong>Methods</strong> ISMP 2012 25 / 32


Neural Network <strong>Learning</strong><br />

Wildly non-convex problem:<br />

where<br />

f(x; ξ) = log (1 + exp (〈[p(〈x1, ξ1〉) · · · p(〈xk, ξk〉)], ξ0〉))<br />

p(α) =<br />

1<br />

1 + exp(α)<br />

p(hx1, ⇠1i)<br />

x1 x2 x3 x4 x5<br />

⇠1 ⇠2 ⇠3 ⇠5 ⇠4<br />

Idea: Use stochastic gradient methods to solve it anyway<br />

Duchi et al. (UC Berkeley) <strong>Adaptive</strong> <strong>Subgradient</strong> <strong>Methods</strong> ISMP 2012 25 / 32


Neural Network Learning

[Figure: average frame accuracy (%) on the test set versus training time
(hours) for SGD, GPU, Downpour SGD, Downpour SGD with AdaGrad, and
Sandblaster L-BFGS (Dean et al., 2012)]

Distributed, d = 1.7·10^9 parameters. SGD and AdaGrad use 80 machines
(1000 cores); L-BFGS uses 800 (10,000 cores).


Conclusions and Discussion

• A family of algorithms that adapt to the geometry of the data
• Extendable to the full-matrix case to handle feature correlation
• Can derive many efficient algorithms for high-dimensional problems, especially with sparsity
• Future: efficient full-matrix adaptivity, other types of adaptation


Thanks!


OGD Sketch: "Almost" Contraction

• Have g_t ∈ ∂f_t(x_t) (ignore ϕ and X for simplicity)
• Before: x_{t+1} = x_t − η g_t, giving

\[
\tfrac{1}{2}\|x_{t+1} - x^*\|_2^2 \;\le\; \tfrac{1}{2}\|x_t - x^*\|_2^2 + \eta\bigl(f_t(x^*) - f_t(x_t)\bigr) + \frac{\eta^2}{2}\|g_t\|_2^2
\]

• Now: x_{t+1} = x_t − η A^{-1} g_t, giving

\[
\begin{aligned}
\tfrac{1}{2}\|x_{t+1} - x^*\|_A^2
&= \tfrac{1}{2}\|x_t - x^*\|_A^2 + \eta \langle g_t, x^* - x_t\rangle + \frac{\eta^2}{2}\|g_t\|_{A^{-1}}^2 \\
&\le \tfrac{1}{2}\|x_t - x^*\|_A^2 + \eta\bigl(f_t(x^*) - f_t(x_t)\bigr) + \frac{\eta^2}{2}\underbrace{\|g_t\|_{A^{-1}}^2}_{\text{dual norm to } \|\cdot\|_A}
\end{aligned}
\]
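The first line of the "Now" display is an exact algebraic identity (the inequality only enters via convexity of f_t), so it can be checked numerically; a small sketch, with illustrative names:

```python
import numpy as np

def expand_step(x_t, x_star, g_t, A, eta):
    """Numerically check the slide's identity: for
    x_{t+1} = x_t - eta * A^{-1} g_t,
    (1/2)||x_{t+1} - x*||_A^2 equals
    (1/2)||x_t - x*||_A^2 + eta <g_t, x* - x_t> + (eta^2/2) ||g_t||^2_{A^{-1}}.
    Returns (lhs, rhs), which should agree to numerical precision."""
    Ainv_g = np.linalg.solve(A, g_t)
    x_next = x_t - eta * Ainv_g
    lhs = 0.5 * (x_next - x_star) @ A @ (x_next - x_star)
    rhs = (0.5 * (x_t - x_star) @ A @ (x_t - x_star)
           + eta * g_t @ (x_star - x_t)
           + 0.5 * eta**2 * g_t @ Ainv_g)   # ||g||^2_{A^{-1}} = <g, A^{-1} g>
    return lhs, rhs
```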


Hindsight minimization

• Focus on the diagonal case (the full-matrix case is similar):

\[
\min_s \; \sum_{t=1}^{T} \bigl\langle g_t, \operatorname{diag}(s)^{-1} g_t \bigr\rangle
\quad \text{subject to } s \succeq 0,\; \langle \mathbf{1}, s\rangle \le C
\]

• Let g_{1:T,j} be the vector of jth components. The solution is of the form

\[
s_j \propto \|g_{1:T,j}\|_2
\]


Low regret to the best A

\[
\sum_{t=1}^{T} f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)
\;\le\;
\underbrace{\frac{1}{2\eta}\sum_{t=1}^{T}\Bigl(\|x_t - x^*\|_{A_t}^2 - \|x_{t+1} - x^*\|_{A_t}^2\Bigr)}_{\text{Term I}}
\;+\;
\underbrace{\frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{A_t^{-1}}^2}_{\text{Term II}}
\]


Bounding Terms

Define \(D_\infty = \max_t \|x_t - x^*\|_\infty \le \sup_{x \in \mathcal{X}} \|x - x^*\|_\infty\).

• Term I:

\[
\sum_{t=1}^{T}\Bigl(\|x_t - x^*\|_{A_t}^2 - \|x_{t+1} - x^*\|_{A_t}^2\Bigr)
\;\le\; D_\infty^2 \sum_{j=1}^{d} \|g_{1:T,j}\|_2
\]

• Term II:

\[
\sum_{t=1}^{T}\|g_t\|_{A_t^{-1}}^2
\;\le\; 2 \sum_{t=1}^{T}\|g_t\|_{A_T^{-1}}^2
\;=\; 2 \sum_{j=1}^{d} \|g_{1:T,j}\|_2 ,
\]

which is twice the value of the hindsight problem

\[
\inf_s \Bigl\{ \sum_{t=1}^{T} \bigl\langle g_t, \operatorname{diag}(s)^{-1} g_t \bigr\rangle \;:\; s \succeq 0,\; \langle \mathbf{1}, s\rangle \le \sum_{j=1}^{d}\|g_{1:T,j}\|_2 \Bigr\}
\;=\; \sum_{j=1}^{d}\|g_{1:T,j}\|_2 .
\]
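The Term II bound holds deterministically for any gradient sequence, so it can be spot-checked numerically; a sketch with an illustrative function name:

```python
import numpy as np

def term_ii_check(G):
    """Numerically check the Term II lemma for diagonal AdaGrad:
    sum_t ||g_t||^2_{A_t^{-1}} <= 2 * sum_j ||g_{1:T,j}||_2,
    where A_t = diag(s_t) and s_{t,j} = ||g_{1:t,j}||_2.
    Rows of G are the gradients g_1, ..., g_T.  Returns (lhs, rhs)."""
    G = np.asarray(G, dtype=float)
    sq_sum = np.zeros(G.shape[1])
    lhs = 0.0
    for g in G:
        sq_sum += g * g
        s = np.sqrt(sq_sum)
        # coordinates with s_j = 0 contribute nothing (g_j = 0 there too)
        lhs += float(np.sum(np.divide(g * g, s, out=np.zeros_like(g), where=s > 0)))
    rhs = 2.0 * np.linalg.norm(G, axis=0).sum()
    return lhs, rhs
```

Per coordinate, this is the scalar inequality ∑_t a_t / √(a_1 + ... + a_t) ≤ 2√(∑_t a_t) with a_t = g_{t,j}².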
