
Figure 6.19 - Losses (SGD and Adam)

Remember, the losses are computed at the end of each epoch by averaging the losses of the mini-batches. On the left plot, even if SGD wiggles a bit, we can see that every epoch shows a lower loss than the previous one. On the right plot, the overshooting becomes clearly visible as an increase in the training loss. But it is also clear that Adam achieves a lower loss because it got closer to the optimal value (the red dot in the previous plot).
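
As a rough sketch of that bookkeeping (not the book's own helper code), the per-epoch averaging might look like the function below; `model`, `loss_fn`, `optimizer`, and `train_loader` are assumed to already exist:

    import numpy as np

    def train_one_epoch(model, loss_fn, optimizer, train_loader):
        model.train()  # make sure the model is in training mode
        # Collect one loss value per mini-batch...
        mini_batch_losses = []
        for x_batch, y_batch in train_loader:
            yhat = model(x_batch)
            loss = loss_fn(yhat, y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            mini_batch_losses.append(loss.item())
        # ...and report their average as the epoch's training loss
        return np.mean(mini_batch_losses)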

In real problems, where it is virtually impossible to plot the loss surface, we can look at the losses as an "executive summary" of what's going on. Training losses will sometimes go up before they go down again, and this is expected.

Stochastic Gradient Descent (SGD)

Adaptive learning rates are cool, indeed, but good old stochastic gradient descent (SGD) also has a couple of tricks up its sleeve. Let's take a closer look at PyTorch's SGD optimizer and its arguments:

• params: model’s parameters

• lr: learning rate

• weight_decay: L2 penalty

The three arguments above are already known. But there are three new arguments:

• momentum: momentum factor, SGD’s own beta argument, is the topic of the next section (it also appears in the sketch after this list)
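
To make the argument list concrete, here is a minimal sketch (not the chapter's own code) of creating the optimizer with these arguments; `model` is assumed to be an existing nn.Module, and the learning rate, momentum, and weight decay values are purely illustrative:

    import torch.optim as optim

    # Illustrative values only, not recommendations for any particular model
    optimizer = optim.SGD(
        model.parameters(),   # params: the model's parameters
        lr=0.1,               # lr: learning rate
        momentum=0.9,         # momentum: SGD's own "beta" argument
        weight_decay=0.0,     # weight_decay: L2 penalty
    )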

