

computing the loss, as shown in the figure below.

Figure 0.19 - The paths of gradient descent (note: the random start is different from that of Figure 0.4)

You can see that the resulting parameters at the end of Epoch 1 differ greatly from one another. This is a direct consequence of the number of updates happening during one epoch, which is determined by the batch size. In our example, for 100 epochs (see the sketch after this list):

• 80 data points (batch): 1 update / epoch, totaling 100 updates

• 16 data points (mini-batch): 5 updates / epoch, totaling 500 updates

• 1 data point (stochastic): 80 updates / epoch, totaling 8,000 updates
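These counts follow directly from the fact that each mini-batch produces exactly one parameter update, so the number of updates per epoch equals the number of mini-batches. Below is a minimal sketch (not code from the book) that verifies the figures above, assuming a synthetic dataset of 80 points similar to the one used in this chapter:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical dataset of 80 points (stand-in for the chapter's data)
x = torch.randn(80, 1)
y = 1 + 2 * x + 0.1 * torch.randn(80, 1)
dataset = TensorDataset(x, y)

n_epochs = 100
for batch_size, name in [(80, 'batch'), (16, 'mini-batch'), (1, 'stochastic')]:
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    updates_per_epoch = len(loader)  # one update per mini-batch
    print(f"{name}: {updates_per_epoch} updates/epoch, "
          f"{updates_per_epoch * n_epochs} total updates in {n_epochs} epochs")

# Output:
# batch: 1 updates/epoch, 100 total updates in 100 epochs
# mini-batch: 5 updates/epoch, 500 total updates in 100 epochs
# stochastic: 80 updates/epoch, 8000 total updates in 100 epochs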

So, for both the center and right plots, the path between the random start and Epoch 1 contains multiple updates, which are not depicted in the plot (otherwise it would be very cluttered). That's why the line connecting two epochs is dashed instead of solid: in reality, there would be zig-zagging lines connecting every two epochs.

There are two things to notice:

• It should be no surprise that mini-batch gradient descent is able to get closer to the minimum point (using the same number of epochs), since it benefits from a larger number of updates than batch gradient descent.

• The stochastic gradient descent path is somewhat weird: It gets quite close to the minimum point at the end of Epoch 1 already, but then it seems to fail to actually reach it. But this is expected, since it uses a single data point for each update; it will never stabilize, forever hovering in the neighborhood of the minimum point.

Clearly, there is a trade-off here: Either we have a stable and smooth trajectory, or we move faster toward the minimum.
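To make the trade-off concrete, here is a minimal sketch of a mini-batch training loop for a simple linear regression; the dataset, model, and hyperparameters are assumptions for illustration, not the book's exact code. The batch size enters only through the data loader, yet it controls how many updates (and how much noise) each epoch contributes to the path:

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(42)
x = torch.randn(80, 1)
y = 1 + 2 * x + 0.1 * torch.randn(80, 1)
# batch_size=16 -> 5 updates/epoch; 80 -> 1 (smooth path); 1 -> 80 (noisy path)
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(100):
    for x_batch, y_batch in loader:
        loss = loss_fn(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()       # one parameter update per mini-batch
        optimizer.zero_grad()

With batch_size=80 the inner loop runs once per epoch and the trajectory is smooth; with batch_size=1 it runs 80 times per epoch, moving toward the minimum faster but hovering around it instead of settling.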

