
fragile in convergence. This is why we used more-detailed initializations and trained our NetRes with a learning rate of 3e-3 instead of the 1e-2 we used for the other networks. We trained none of the networks to convergence, but we would not have gotten anywhere without these tweaks.
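To make that concrete, here is a minimal, hypothetical sketch of the learning-rate adjustment with the SGD optimizer used throughout this chapter. The tiny models are placeholders standing in for NetRes and the other networks, not the chapter's actual definitions.

from torch import nn, optim

# Stand-in models so the snippet runs on its own; they are not the
# chapter's actual network definitions.
net_res = nn.Sequential(nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU())
net_other = nn.Sequential(nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU())

optimizer_res = optim.SGD(net_res.parameters(), lr=3e-3)      # lower lr for the deeper, more fragile network
optimizer_other = optim.SGD(net_other.parameters(), lr=1e-2)  # lr used for the other networks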

All this shouldn’t encourage us to seek depth on a dataset of 32 × 32 images, but it clearly demonstrates how this can be achieved on more challenging datasets like ImageNet. It also provides the key elements for understanding existing implementations for models like ResNet, for instance, in torchvision.

INITIALIZATION

Let’s comment briefly on the earlier initialization. Initialization is one of the important tricks in training neural networks. Unfortunately, for historical reasons, PyTorch has default weight initializations that are not ideal. People are looking at fixing the situation; if progress is made, it can be tracked on GitHub (https://github.com/pytorch/pytorch/issues/18182). In the meantime, we need to fix the weight initialization ourselves. We found that our model did not converge, so we looked at what people commonly choose as initialization (a smaller variance in the weights; zero mean and unit variance outputs for batch norm), and then we halved the output variance of the batch norm when the network still would not converge.
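As a concrete illustration, here is a minimal sketch of this kind of manual initialization: Kaiming initialization for the convolution weights, and a reduced scale (0.5 here, an illustrative value) with zero shift for the batch norm, which shrinks its output relative to the default. The block below is a simplified stand-in, not the exact module defined earlier in the chapter.

import torch
from torch import nn

class ResBlockSketch(nn.Module):
    # Simplified stand-in for the chapter's residual block, showing only
    # the initialization choices discussed above.
    def __init__(self, n_chans):
        super().__init__()
        self.conv = nn.Conv2d(n_chans, n_chans, kernel_size=3,
                              padding=1, bias=False)
        self.batch_norm = nn.BatchNorm2d(num_features=n_chans)
        # Kaiming (He) initialization, suited to the ReLU that follows
        torch.nn.init.kaiming_normal_(self.conv.weight, nonlinearity='relu')
        # Start the batch-norm scale below the default 1.0 and its shift at zero,
        # so the block's initial output is smaller
        torch.nn.init.constant_(self.batch_norm.weight, 0.5)
        torch.nn.init.zeros_(self.batch_norm.bias)

    def forward(self, x):
        out = self.conv(x)
        out = self.batch_norm(out)
        return torch.relu(out) + x  # skip connection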

Weight initialization could fill an entire chapter on its own, but we think that would be excessive. In chapter 11, we’ll bump into initialization again and use what arguably could be PyTorch defaults without much explanation. Once you’ve progressed to the point where the details of weight initialization are of specific interest to you (probably not before finishing this book), you might revisit this topic.¹⁰

8.5.4 Comparing the designs from this section

We summarize the effect of each of our design modifications in isolation in figure 8.13. We should not overinterpret any of the specific numbers: our problem setup and experiments are simplistic, and repeating the experiment with different random seeds will probably generate variation at least as large as the differences in validation accuracy. For this demonstration, we left all other things equal, from the learning rate to the number of epochs to train; in practice, we would try to get the best results by varying those. Also, we would likely want to combine some of the additional design elements. But a qualitative observation may be in order: as we saw in section 5.5.3 when discussing validation and overfitting, the weight decay and dropout regularizations, which have a more rigorous interpretation as regularization in terms of statistical estimation than batch norm, have a much narrower gap between the two accuracies. Batch norm, which

¹⁰ The seminal paper on the topic is by X. Glorot and Y. Bengio: “Understanding the Difficulty of Training Deep Feedforward Neural Networks” (2010), which introduces PyTorch’s Xavier initializations (http://mng.bz/vxz7). The ResNet paper we mentioned expands on the topic, too, giving us the Kaiming initializations used earlier. More recently, H. Zhang et al. have tweaked initialization to the point that they do not need batch norm in their experiments with very deep residual networks (https://arxiv.org/abs/1901.09321).
