
Foundations of Data Science


Stochastic Gradient Descent:

Given: a starting point $w = w_{\rm init}$ and learning rates $\lambda_1, \lambda_2, \lambda_3, \ldots$ (e.g., $w_{\rm init} = 0$ and $\lambda_t = 1$ for all $t$, or $\lambda_t = 1/\sqrt{t}$). Consider a sequence of random examples $(x_1, c^*(x_1)), (x_2, c^*(x_2)), \ldots$.

1. Given example $(x_t, c^*(x_t))$, compute the gradient $\nabla L(f_w(x_t), c^*(x_t))$ of the loss of $f_w(x_t)$ with respect to the weights $w$. This is a vector in $\mathbb{R}^n$ whose $i$th component is $\frac{\partial L(f_w(x_t),\, c^*(x_t))}{\partial w_i}$.

2. Update: $w \leftarrow w - \lambda_t \nabla L(f_w(x_t), c^*(x_t))$.
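The loop above can be sketched as follows; the function names and the quadratic loss used in the usage example are my own illustrations, not from the text:

```python
import numpy as np

def sgd(grad_loss, examples, w_init, lr):
    """Stochastic gradient descent as described above.

    grad_loss(w, x, y): gradient of L(f_w(x), y) with respect to w.
    examples: sequence of pairs (x_t, c*(x_t)).
    lr(t): learning rate lambda_t, with t starting at 1.
    """
    w = np.array(w_init, dtype=float)
    for t, (x, y) in enumerate(examples, start=1):
        w = w - lr(t) * grad_loss(w, x, y)  # step 2: the update rule
    return w
```

For instance, with squared loss $L = \frac{1}{2}(w \cdot x - y)^2$ the gradient is $(w \cdot x - y)\,x$, and one step from $w = 0$ on the example $(x, y) = (1, 1)$ with $\lambda_1 = 1$ moves $w$ to $1$.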

Let’s now try to understand the algorithm better by seeing a few examples of instantiating the class of functions $F$ and the loss function $L$.

First, consider $n = d$ and $f_w(x) = w^T x$, so $F$ is the class of linear predictors. Consider the loss function $L(f_w(x), c^*(x)) = \max(0, -c^*(x) f_w(x))$, and recall that $c^*(x) \in \{-1, 1\}$. In other words, if $f_w(x)$ has the correct sign, then we have a loss of 0; otherwise we have a loss equal to the magnitude of $f_w(x)$. In this case, if $f_w(x)$ has the correct sign and is non-zero, then the gradient will be zero, since an infinitesimal change in any of the weights will not change the sign. So, when $f_w(x)$ is correct, the algorithm will leave $w$ alone. On the other hand, if $f_w(x)$ has the wrong sign, then $\frac{\partial L}{\partial w_i} = -c^*(x) \frac{\partial\, (w \cdot x)}{\partial w_i} = -c^*(x)\, x_i$. So, using $\lambda_t = 1$, the algorithm will update $w \leftarrow w + c^*(x) x$. Note that this is exactly the Perceptron algorithm. (Technically we must address the case that $f_w(x) = 0$; in this case, we should view $f_w$ as having the wrong sign just barely.)
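For concreteness, one SGD step of this instantiation can be sketched as follows (the function name is mine); with $\lambda_t = 1$ it is exactly a Perceptron update:

```python
import numpy as np

def perceptron_sgd_step(w, x, y):
    """One SGD step on L(f_w(x), y) = max(0, -y * (w . x)), lambda_t = 1.

    y is the label c*(x) in {-1, +1}.  If w . x has the correct sign
    (treating f_w(x) = 0 as wrong 'just barely'), the gradient is zero
    and w is left alone; otherwise the step is w <- w + y * x.
    """
    if y * (w @ x) <= 0:      # wrong sign, or exactly zero
        return w + y * x      # gradient step = Perceptron update
    return w                  # correct nonzero sign: zero gradient
```

Starting from $w = 0$, every example looks "wrong just barely", so the first step is always $w \leftarrow c^*(x) x$; once an example is classified with the correct sign, further steps on it leave $w$ unchanged.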

As a small modification to the above example, consider the same class of linear predictors $F$, but now modify the loss function to the hinge loss $L(f_w(x), c^*(x)) = \max(0, 1 - c^*(x) f_w(x))$. This loss function requires $f_w(x)$ to have the correct sign and magnitude at least 1 in order to be zero. Hinge loss has the useful property that it is an upper bound on the error rate: each misclassified example incurs hinge loss at least 1, so for any sample $S$, the number of training errors is at most $\sum_{x \in S} L(f_w(x), c^*(x))$. With this loss function, stochastic gradient descent is called the margin perceptron algorithm.
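A sketch of the margin perceptron step, together with the upper-bound property, might look like this (helper names are my own):

```python
import numpy as np

def hinge(w, x, y):
    """Hinge loss L(f_w(x), y) = max(0, 1 - y * (w . x))."""
    return max(0.0, 1.0 - y * (w @ x))

def margin_perceptron_step(w, x, y, lam=1.0):
    """One SGD step on the hinge loss.

    The loss (hence the gradient) is zero only when f_w(x) has the
    correct sign AND magnitude at least 1; otherwise the gradient in
    w_i is -y * x_i, so the step is w <- w + lam * y * x.
    """
    if y * (w @ x) < 1:       # inside the margin, or misclassified
        return w + lam * y * x
    return w
```

Note that, unlike the Perceptron loss above, this step still updates $w$ on examples that are classified correctly but with margin less than 1.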

More generally, we could have a much more complex class $F$. For example, consider a layered circuit of soft threshold gates. Each node in the circuit computes a linear function of its inputs and then passes this value through an "activation function" such as $a(z) = \tanh(z) = (e^z - e^{-z})/(e^z + e^{-z})$. This circuit could have multiple layers, with the output of layer $i$ being used as the input to layer $i + 1$. The vector $w$ would be the concatenation of all the weight vectors in the network. This is the idea of deep neural networks, discussed further in Section 6.13.
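A minimal forward pass for such a layered circuit might look like the following sketch (layer sizes and weights here are arbitrary illustrations, not from the text):

```python
import numpy as np

def circuit_forward(x, layers):
    """Evaluate a layered circuit of soft threshold gates.

    layers: list of weight matrices, one per layer.  Each node computes
    a linear function of its inputs and passes it through the activation
    a(z) = tanh(z); the output of layer i is the input to layer i + 1.
    The parameter vector w is the concatenation of the entries of all
    these matrices.
    """
    h = np.asarray(x, dtype=float)
    for W in layers:
        h = np.tanh(W @ h)    # linear function, then soft threshold
    return h
```

Running SGD on such a circuit only requires the gradient of the loss with respect to this concatenated $w$, which is what backpropagation computes.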

While it is difficult to give general guarantees on when stochastic gradient descent will succeed in finding a hypothesis of low error on its training set $S$, Theorems 6.5 and 6.3

