
Model design


        out = out.view(-1, 8 * 8 * self.n_chans1 // 2)  # flatten to (batch, 8*8*n_chans1//2)
        out = torch.tanh(self.fc1(out))                  # first fully connected layer
        out = self.fc2(out)                              # output layer (raw scores)
        return out

The numbers specifying channels and features for each layer are directly related to the number of parameters in a model; all other things being equal, they increase the capacity of the model. As we did previously, we can look at how many parameters our model has now:

# In[44]:
sum(p.numel() for p in model.parameters())

# Out[44]:
38386
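
To see where those parameters live, we can also break the count down per layer. The snippet below is a quick sketch (not one of the numbered cells above); it assumes model is the network defined in the listing, and the names it reports depend on how the submodules were declared:

# Per-layer breakdown: named_parameters() pairs each tensor with its attribute
# path, so we can see which layers account for most of the parameters.
[(name, p.numel()) for name, p in model.named_parameters()]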

The greater the capacity, the more variability in the inputs the model will be able to manage; but at the same time, the more likely it is to overfit, since the model can use a greater number of parameters to memorize unessential aspects of the input. We already went over ways to combat overfitting, the best being increasing the sample size or, in the absence of new data, augmenting existing data through artificial modifications.
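
To make the augmentation idea concrete, here is a minimal sketch using torchvision transforms; the dataset (CIFAR-10) and the particular transforms are illustrative assumptions, not a prescription:

from torchvision import datasets, transforms

# Random flips and small crops produce artificially modified copies of each
# image every epoch, effectively enlarging the training set.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),       # mirror left/right at random
    transforms.RandomCrop(32, padding=4),    # jitter the framing by a few pixels
    transforms.ToTensor(),
])

augmented_cifar = datasets.CIFAR10('data/', train=True, download=True,
                                   transform=train_transforms)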

There are a few more tricks we can play at the model level (without acting on the data) to control overfitting. Let's review the most common ones.

8.5.2 Helping our model to converge and generalize: Regularization

Training a model involves two critical steps: optimization, when we need the loss to decrease on the training set; and generalization, when the model has to work not only on the training set but also on data it has not seen before, like the validation set. The mathematical tools aimed at easing these two steps are sometimes subsumed under the label regularization.

KEEPING THE PARAMETERS IN CHECK: WEIGHT PENALTIES

The first way to stabilize generalization is to add a regularization term to the loss. This term is crafted so that the weights of the model tend to be small on their own, limiting how much training makes them grow. In other words, it is a penalty on larger weight values. This makes the loss have a smoother topography, and there's relatively less to gain from fitting individual samples.

The most popular regularization terms of this kind are L2 regularization, which is the sum of squares of all weights in the model, and L1 regularization, which is the sum of the absolute values of all weights in the model.⁹ Both of them are scaled by a (small) factor, which is a hyperparameter we set prior to training.

⁹ We'll focus on L2 regularization here. L1 regularization—popularized in the more general statistics literature by its use in Lasso—has the attractive property of resulting in sparse trained weights.
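
To make the weight-penalty idea concrete, here is a minimal sketch of adding an L2 term to the loss by hand inside a training step. The tiny stand-in model, the fake batch, and the 0.001 scaling factor are placeholders chosen only for illustration:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                     # stand-in model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

imgs = torch.randn(8, 10)                    # stand-in batch of inputs
labels = torch.randint(0, 2, (8,))           # stand-in targets

l2_lambda = 0.001                            # small scaling factor (hyperparameter)
l2_norm = sum(p.pow(2.0).sum() for p in model.parameters())
loss = loss_fn(model(imgs), labels) + l2_lambda * l2_norm   # penalty added to the loss

optimizer.zero_grad()
loss.backward()
optimizer.step()

For L2 specifically, the SGD optimizer in PyTorch also exposes a weight_decay argument that applies an equivalent decay directly to the weights, so the penalty doesn't have to be written out by hand.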
