
What next? Additional sources of inspiration (and data)


CLASSIC REGULARIZATION AND AUGMENTATION

You might have noticed that we did not even use all the regularization techniques from chapter 8. For example, dropout would be an easy thing to try.
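
As a concrete illustration, here is a minimal sketch of what adding channel-wise dropout to a convolutional block could look like; the channel counts and the probability p=0.3 are illustrative choices, not values from our model.

import torch.nn as nn

block = nn.Sequential(
    nn.Conv3d(8, 8, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Dropout3d(p=0.3),  # randomly zeroes whole channels during training
    nn.MaxPool3d(2, 2),
)

Calling model.eval() at validation time switches the dropout off automatically.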

While we have some augmentation in place, we could go further. One relatively powerful augmentation method we did not attempt to employ is elastic deformations, where we put “digital crumples” into the inputs.¹² This makes for much more variability than rotation and flipping alone and would seem to be applicable to our tasks as well.
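
To make the idea concrete, here is a minimal 2D sketch in which a smoothed random displacement field perturbs an identity sampling grid; the alpha and sigma values are illustrative, and for our 3D CT crops one would build a 3D displacement field and sample 5D tensors with grid_sample instead.

import torch
import torch.nn.functional as F

def elastic_deform_2d(img, alpha=0.05, sigma=4):
    # img: float tensor of shape (N, C, H, W).
    # alpha scales the displacement (in normalized [-1, 1] coordinates);
    # sigma controls how smooth the "crumples" are.
    n, c, h, w = img.shape
    # Random per-pixel displacements, blurred so that neighboring pixels
    # move together (a cheap stand-in for a Gaussian filter).
    disp = torch.randn(n, 2, h, w, device=img.device)
    disp = F.avg_pool2d(disp, kernel_size=2 * sigma + 1, stride=1, padding=sigma)
    disp = disp.permute(0, 2, 3, 1) * alpha        # (N, H, W, 2)
    # Identity sampling grid, perturbed by the displacement field.
    theta = torch.eye(2, 3, device=img.device).unsqueeze(0).repeat(n, 1, 1)
    grid = F.affine_grid(theta, list(img.shape), align_corners=False)
    return F.grid_sample(img, grid + disp, padding_mode='border',
                         align_corners=False)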

MORE ABSTRACT AUGMENTATION

So far, our augmentation has been geometrically inspired: we transformed our input to more or less look like something plausible we might see. It turns out that we need not limit ourselves to that type of augmentation.

Recall from chapter 8 that mathematically, the cross-entropy loss we have been using is a measure of the discrepancy between two probability distributions: that of the predictions, and the distribution that puts all probability mass on the label (which can be represented by the one-hot vector for the label). If overconfidence is a problem for our network, one simple thing we could try is not using the one-hot distribution but rather putting a small probability mass on the “wrong” classes.¹³ This is called label smoothing.
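
A minimal sketch of what that could look like with nn.KLDivLoss (see footnote 13); the smoothing amount eps=0.1 and the toy tensors are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

def smooth_labels(labels, num_classes, eps=0.1):
    # Put 1 - eps on the true class and spread eps over the remaining classes.
    targets = torch.full((labels.size(0), num_classes),
                         eps / (num_classes - 1), device=labels.device)
    targets.scatter_(1, labels.unsqueeze(1), 1.0 - eps)
    return targets

loss_fn = nn.KLDivLoss(reduction='batchmean')

logits = torch.randn(8, 2)            # stand-ins for model output ...
labels = torch.randint(0, 2, (8,))    # ... and ground-truth labels
loss = loss_fn(F.log_softmax(logits, dim=1), smooth_labels(labels, 2))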

We can also mess with inputs and labels at the same time. A very general and also easy-to-apply augmentation technique for doing this has been proposed under the name of mixup:¹⁴ the authors propose to randomly interpolate both inputs and labels. Interestingly, with a linearity assumption for the loss (which is satisfied by binary cross-entropy), this is equivalent to just manipulating the inputs with a weight drawn from an appropriately adapted distribution.¹⁵ Clearly, we don’t expect blended inputs to occur when working on real data, but it seems that this mixing encourages stability of the predictions and is very effective.
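
The interpolation step itself is short. Here is a minimal sketch, assuming the targets are floating-point (one-hot or smoothed) so they can be mixed; alpha=0.2 is an illustrative choice.

import torch

def mixup(inputs, targets, alpha=0.2):
    # Draw the mixing weight from a Beta distribution, as in the mixup paper,
    # and blend each sample with a randomly chosen partner from the batch.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(inputs.size(0), device=inputs.device)
    mixed_inputs = lam * inputs + (1 - lam) * inputs[perm]
    mixed_targets = lam * targets + (1 - lam) * targets[perm]
    return mixed_inputs, mixed_targets

The mixed batch is then fed through the model and scored with the usual loss, using the mixed targets in place of the originals.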

BEYOND A SINGLE BEST MODEL: ENSEMBLING

One perspective we could have on the problem of overfitting is that our model is capable of working the way we want if we knew the right parameters, but we don’t actually know them.¹⁶ If we followed this intuition, we might try to come up with several sets of parameters (that is, several models), hoping that the weaknesses of each might compensate for the others. This technique of evaluating several models and combining their outputs is called ensembling. Simply put, we train several models and then, in order to predict, run all of them and average the predictions. When each individual model overfits (or we have taken a snapshot of the model just before we started to see the overfitting), it seems plausible that the models might start to make bad predictions on different inputs, rather than always overfitting the same samples first.
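
The averaging step is straightforward. A minimal sketch, assuming models is a list of trained classifiers with the same output shape:

import torch

def ensemble_predict(models, batch):
    # Average the softmax probabilities of several trained models.
    for model in models:
        model.eval()
    with torch.no_grad():
        probs = [torch.softmax(model(batch), dim=1) for model in models]
    return torch.stack(probs).mean(dim=0)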

12. You can find a recipe (albeit aimed at TensorFlow) at http://mng.bz/Md5Q.
13. You can use nn.KLDivLoss for this.
14. Hongyi Zhang et al., “mixup: Beyond Empirical Risk Minimization,” https://arxiv.org/abs/1710.09412.
15. See Ferenc Huszár’s post at http://mng.bz/aRJj/; he also provides PyTorch code.
16. We might expand that to be outright Bayesian, but we’ll just go with this bit of intuition.
