
Equation 6.16 - Looking ahead
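
The equation itself is not reproduced here, but, in the chapter’s notation (grad for the current gradient, mom for the momentum, beta for the momentum factor) and matching PyTorch’s formulation, it presumably reads:

\text{nesterov}_t = \text{grad}_t + \beta \cdot \text{mom}_t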

Once Nesterov’s momentum is computed, it replaces the gradient in the parameter update, just like regular momentum does:

Equation 6.17 - Parameter update
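
Again, the typeset equation is missing from this extract; with eta as the learning rate, the update should take the form:

\text{param}_t = \text{param}_{t-1} - \eta \cdot \text{nesterov}_t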

But, Nesterov actually uses momentum, so we can expand its expression like this:

Equation 6.18 - Parameter update (expanded)
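
Substituting the looking-ahead expression into the update, the expanded form presumably becomes:

\text{param}_t = \text{param}_{t-1} - \eta \cdot \left(\text{grad}_t + \beta \cdot \text{mom}_t\right)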

"Why did you do this? What’s the purpose of making the formula more

complicated?"

You’ll understand why in a minute :-)

Flavors of SGD

Let’s compare the three flavors of SGD, vanilla (regular), momentum, and Nesterov, when it comes to the way they perform the parameter update:

Equation 6.19 - Flavors of parameter update
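
The original side-by-side equation is not preserved either; reconstructed in the same notation, the three updates compare roughly like this:

\begin{aligned}
\text{SGD:} \quad \text{param}_t &= \text{param}_{t-1} - \eta \cdot \text{grad}_t \\
\text{Momentum:} \quad \text{param}_t &= \text{param}_{t-1} - \eta \cdot \text{mom}_t \\
\text{Nesterov:} \quad \text{param}_t &= \text{param}_{t-1} - \eta \cdot \left(\text{grad}_t + \beta \cdot \text{mom}_t\right)
\end{aligned}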

That’s why I expanded Nesterov’s expression in the last section: it is easier to compare the updates this way! First, there is regular SGD, which uses the gradient and nothing else. Then, there is momentum, which uses a "discounted" sum of past gradients instead. Finally, there is Nesterov, which uses the current gradient together with the discounted momentum.