The dampening factor is a way to, well, dampen the effect of the latest gradient. Instead of having its full value added, the latest gradient gets its contribution to momentum reduced by the dampening factor. So, if the dampening factor is 0.3, only 70% of the latest gradient gets added to momentum. Its formula is given by:

Equation 6.13 - Momentum with dampening factor
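In symbols (a reconstruction based on the description above, with beta as the momentum factor and damp as the dampening factor):

$$
mo_t = \beta \cdot mo_{t-1} + (1 - damp) \cdot grad_t
$$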

If the dampening factor equals the momentum factor (beta), it becomes a true EWMA!
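As a quick sanity check, here is a minimal pure-Python sketch of the recurrence itself (not the optimizer), using an illustrative sequence of identical gradients:

beta = 0.9
gradients = [1.0] * 50  # a long run of identical gradients

mo_no_damp, mo_ewma = 0.0, 0.0
for grad in gradients:
    # no dampening: the latest gradient is added at its full value
    mo_no_damp = beta * mo_no_damp + grad
    # dampening equal to beta: a true EWMA of the gradients
    mo_ewma = beta * mo_ewma + (1 - beta) * grad

print(f'{mo_no_damp:.2f}')  # approaches 1 / (1 - beta) = 10
print(f'{mo_ewma:.2f}')     # approaches 1.0

Without dampening, the buffer approaches grad / (1 - beta), that is, ten times the gradient for beta = 0.9, while with dampening equal to beta it stays at the scale of the gradients themselves.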

Similar to Adam, SGD with momentum keeps the value of momentum for each parameter. The beta parameter is stored there as well (momentum). We can take a peek at it using the optimizer’s state_dict():

{'state': {139863047119488: {'momentum_buffer': tensor([[-0.0053]])},
           139863047119168: {'momentum_buffer': tensor([-0.1568])}},
 'param_groups': [{'lr': 0.1,
                   'momentum': 0.9,
                   'dampening': 0,
                   'weight_decay': 0,
                   'nesterov': False,
                   'params': [139863047119488, 139863047119168]}]}
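A minimal sketch of how an optimizer like this could be set up and inspected; the tiny model, the data, and the learning rate below are illustrative, not the book's actual setup:

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# a tiny linear model, just to give the optimizer parameters to track
model = nn.Linear(1, 1)

# SGD with momentum (beta = 0.9); dampening defaults to zero
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# the momentum buffers only appear in the state after a first step
x, y = torch.randn(16, 1), torch.randn(16, 1)
loss = ((model(x) - y) ** 2).mean()
loss.backward()
optimizer.step()

print(optimizer.state_dict())

Depending on the PyTorch version, the keys under 'state' may show up as simple indices (0, 1) instead of the id-like integers above.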

Even though old gradients slowly fade away, contributing less and less to the sum, very recent gradients are taken almost at their face value (assuming a typical value of 0.9 for beta and no dampening). This means that, given a sequence of all positive (or all negative) gradients, their sum, that is, the momentum, goes up really fast (in absolute value). A large momentum gets translated into a large update, since the momentum replaces the gradient in the parameter update:

Equation 6.14 - Parameter update
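In symbols (a reconstruction based on the description above, with eta standing for the learning rate):

$$
param_t = param_{t-1} - \eta \cdot mo_t
$$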

This behavior can be easily visualized in the path taken by SGD with momentum.
