
gradients (the momentum). Finally, there is Nesterov, which combines both (with a small tweak).

How different are the updates? Let’s check it out! The plots below show the update term (before multiplying it by the learning rate) for the weight parameter of our linear regression.

Figure 6.21 - Update terms corresponding to SGD flavors
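In case you want to reproduce curves like these yourself, here is a minimal sketch of how the three update terms (before being multiplied by the learning rate) can be computed, following PyTorch's conventions for SGD. It assumes grads is a list containing the gradient of the weight parameter at each step, and beta is the momentum factor; the names and the 0.9 value are illustrative only.

def update_terms(grads, beta=0.9):
    # Returns the update terms (before the learning rate) for the three flavors
    buf = 0.0  # momentum buffer, starts at zero
    sgd, momentum, nesterov = [], [], []
    for grad in grads:
        sgd.append(grad)                    # vanilla SGD: the gradient itself
        buf = beta * buf + grad             # momentum buffer (dampening = 0)
        momentum.append(buf)                # SGD with momentum: the buffer
        nesterov.append(grad + beta * buf)  # Nesterov: gradient + beta * buffer
    return sgd, momentum, nesterov

# e.g., update_terms([2.0, 1.5, -0.5]) returns the three lists of update terms, one per step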

Does the shape of the update term for SGD with momentum ring a bell? That oscillating pattern was already depicted in the path taken by SGD with momentum while optimizing the two parameters: when it overshoots, it has to reverse direction, and repeatedly doing that produces these oscillations.

Nesterov momentum seems to do a better job: the look-ahead has the effect of dampening the oscillations (please do not confuse this effect with the actual dampening argument). Sure, the idea is to look ahead to avoid going too far, but could you have told the difference between the two plots beforehand? Me neither! Well, I am assuming you replied "no" to that question, and that’s why I thought it was a good idea to illustrate the patterns above.
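For reference, the dampening argument of torch.optim.SGD only scales the gradient term that gets added to the momentum buffer. Illustrative pseudocode of the update rule (not the actual library source), using the same conventions as the sketch above:

buf = momentum * buf + (1 - dampening) * grad   # dampening scales the incoming gradient only
update = grad + momentum * buf if nesterov else buf
param = param - lr * update

PyTorch requires dampening to be zero when nesterov=True, so the dampening of oscillations we just observed cannot come from that argument.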

"How come the black lines are different in these plots? Isn’t the

underlying gradient supposed to be the same?"

The gradient is indeed computed the same way in all three flavors, but, since the update terms are different, the gradients end up being computed at different locations of the loss surface. This becomes clear when we look at the paths taken by each of the flavors.
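All three flavors come from the same optimizer class in PyTorch. A minimal sketch, assuming model is the chapter's linear regression model and using illustrative values for the learning rate and momentum:

import torch.optim as optim

# Same class, three flavors (values are illustrative)
sgd = optim.SGD(model.parameters(), lr=0.1)                     # vanilla SGD
momentum = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # SGD with momentum
nesterov = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)  # Nesterov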

