
parms, gradients, activations = get_plot_data(
    train_loader=ball_loader, model=model
)

Figure E.1 - Vanishing gradients

On the left-most plot, we can see that the initial weights in each layer are uniformly distributed, but the first hidden layer has a much wider range. This is a consequence of the default initialization scheme used by PyTorch's linear layer, but we're not delving into these details here.

The activation values are clearly shrinking as data moves from one layer to the next. Conversely, the gradients are larger in the last layer and shrink as the gradient descent algorithm works its way back to the first layer. That's a simple and straightforward example of vanishing gradients.
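Statistics like these can be collected with PyTorch hooks; get_plot_data is the book's helper for that. Below is a minimal, hypothetical sketch, not the chapter's actual code: deep_model and the dummy batch are assumptions, used only to show how one might record the spread of each linear layer's outputs and weight gradients.

import torch
import torch.nn as nn

# Hypothetical deep network with sigmoid activations (a setup prone to
# vanishing gradients); not the model used in the chapter
deep_model = nn.Sequential(
    nn.Linear(2, 16), nn.Sigmoid(),
    nn.Linear(16, 16), nn.Sigmoid(),
    nn.Linear(16, 1),
)

activation_std = {}

def make_hook(name):
    # Forward hook: records the spread of the layer's output values
    def hook(module, inputs, output):
        activation_std[name] = output.detach().std().item()
    return hook

for name, layer in deep_model.named_children():
    if isinstance(layer, nn.Linear):
        layer.register_forward_hook(make_hook(name))

x = torch.randn(64, 2)  # dummy batch, just to trigger the hooks
deep_model(x).mean().backward()

# Standard deviation of the weight gradients, layer by layer
gradient_std = {name: param.grad.std().item()
                for name, param in deep_model.named_parameters()
                if name.endswith('weight')}
print(activation_std)
print(gradient_std)

Comparing activation_std and gradient_std layer by layer makes the shrinking pattern described above easy to spot.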

Gradients can also be exploding instead of vanishing. In this case, activation values grow larger and larger as data moves from one layer to the next, and gradients are smaller in the last layer, growing as we move back up to the first layer.

This phenomenon is less common, though, and can be more easily handled using a technique called gradient clipping, which simply caps the absolute value of the gradients. We'll get back to that later.
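As a preview of gradient clipping (covered in detail later), here is a minimal sketch of value clipping with PyTorch's clip_grad_value_. The training-step arguments (model, optimizer, loss_fn, x_batch, y_batch) are placeholders, not objects defined in the text.

from torch.nn.utils import clip_grad_value_

def train_step_with_clipping(model, optimizer, loss_fn,
                             x_batch, y_batch, clip_value=1.0):
    optimizer.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()
    # Caps the absolute value of every gradient element at clip_value
    clip_grad_value_(model.parameters(), clip_value)
    optimizer.step()
    return loss.item()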

"How can we prevent the vanishing gradients problem?"

If we manage to get the distribution of activation values similar across all layers, we may have a shot at it. But, to achieve that, we need to tweak the variance of the weights used to initialize the layers.
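One common way of adjusting that variance is Xavier (Glorot) initialization, available in torch.nn.init. The sketch below is only an illustration of the idea; model is a placeholder for any network built from nn.Linear layers.

import torch.nn as nn

def init_weights(layer):
    # Xavier (Glorot) initialization scales the variance of the weights by the
    # layer's fan-in and fan-out, keeping activation variances roughly similar
    if isinstance(layer, nn.Linear):
        nn.init.xavier_uniform_(layer.weight)
        nn.init.zeros_(layer.bias)

# model.apply(init_weights)  # `model` is a placeholder, not defined in the text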

