Down along the gradient


# In[18]:
training_loop(
    n_epochs = 100,
    learning_rate = 1e-4,
    params = torch.tensor([1.0, 0.0]),
    t_u = t_u,
    t_c = t_c)

# Out[18]:
Epoch 1, Loss 1763.884644
    Params: tensor([ 0.5483, -0.0083])
    Grad:   tensor([4517.2969,   82.6000])
Epoch 2, Loss 323.090546
    Params: tensor([ 0.3623, -0.0118])
    Grad:   tensor([1859.5493,   35.7843])
Epoch 3, Loss 78.929634
    Params: tensor([ 0.2858, -0.0135])
    Grad:   tensor([765.4667,  16.5122])
...
Epoch 10, Loss 29.105242
    Params: tensor([ 0.2324, -0.0166])
    Grad:   tensor([1.4803, 3.0544])
Epoch 11, Loss 29.104168
    Params: tensor([ 0.2323, -0.0169])
    Grad:   tensor([0.5781, 3.0384])
...
Epoch 99, Loss 29.023582
    Params: tensor([ 0.2327, -0.0435])
    Grad:   tensor([-0.0533, 3.0226])
Epoch 100, Loss 29.022669
    Params: tensor([ 0.2327, -0.0438])
    Grad:   tensor([-0.0532, 3.0226])

tensor([ 0.2327, -0.0438])
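For reference, training_loop is the plain gradient-descent loop built earlier in the chapter. The following is a minimal, self-contained sketch of roughly what it does, not the author's exact listing; the model, loss_fn, and grad_fn helpers are condensed versions of the ones defined earlier, and only the t_u and t_c tensors are assumed to exist:

import torch

def model(t_u, w, b):                       # linear model from earlier in the chapter
    return w * t_u + b

def loss_fn(t_p, t_c):                      # mean squared error
    return ((t_p - t_c) ** 2).mean()

def grad_fn(t_u, t_c, t_p, w, b):           # analytic gradient of the loss w.r.t. (w, b)
    dloss_dtp = 2 * (t_p - t_c) / t_p.size(0)
    return torch.stack([(dloss_dtp * t_u).sum(), dloss_dtp.sum()])

def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        w, b = params
        t_p = model(t_u, w, b)                          # forward pass
        loss = loss_fn(t_p, t_c)
        grad = grad_fn(t_u, t_c, t_p, w, b)
        params = params - learning_rate * grad          # vanilla gradient-descent update
        print('Epoch %d, Loss %f' % (epoch, float(loss)))
        print('    Params:', params)                    # printed every epoch here; the chapter's
        print('    Grad:  ', grad)                      # version prints only selected epochs, hence the ... above
    return params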

Nice: the behavior is now stable. But there's another problem: the updates to the parameters are very small, so the loss decreases very slowly and eventually stalls. We could obviate this issue by making learning_rate adaptive, that is, by letting it change according to the magnitude of the updates. There are optimization schemes that do that, and we'll see one toward the end of this chapter, in section 5.5.2.
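As a hypothetical preview of such a scheme, here is a minimal sketch using torch.optim.Adam. It jumps ahead to autograd and the optim module, reuses the model and loss_fn helpers sketched above together with the t_u and t_c tensors from the chapter, and the learning-rate value is purely illustrative:

import torch
import torch.optim as optim

params = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = optim.Adam([params], lr=1e-1)   # Adam adapts the effective step size per parameter

for epoch in range(1, 101):
    t_p = model(t_u, *params)               # same linear model, parameters unpacked as (w, b)
    loss = loss_fn(t_p, t_c)
    optimizer.zero_grad()                   # clear gradients accumulated in the previous step
    loss.backward()                         # autograd fills params.grad
    optimizer.step()                        # adaptive update; no manual per-parameter tuning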

However, there's another potential troublemaker in the update term: the gradient itself. Let's go back and look at grad at epoch 1 during optimization.
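From the output above, the epoch-1 gradient was tensor([4517.2969, 82.6000]); a quick, illustrative check of the mismatch:

grad_epoch1 = torch.tensor([4517.2969, 82.6000])   # epoch-1 gradient, copied from the output above
print(grad_epoch1[0] / grad_epoch1[1])              # ~54.7: the weight's gradient dwarfs the bias's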

5.4.4 Normalizing inputs

We can see that the first-epoch gradient for the weight is about 50 times larger than the gradient for the bias. This means the weight and bias live in differently scaled spaces. If this is the case, a learning rate that's large enough to meaningfully update one will be so large as to be unstable for the other; and a rate that's appropriate for the other won't be large enough to meaningfully change the first. That means we're not going to be able to update our parameters unless we change something about our formulation of the problem. We could have individual learning rates for each parameter, but for models with many parameters, this would be too much to bother with; it's babysitting of the kind we don't like.
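A simpler route, and the subject of this section, is to rescale the input so that both parameters see gradients of similar magnitude. A minimal sketch, assuming the t_u and t_c tensors and the training_loop above; the 0.1 factor and the learning-rate value are illustrative choices:

t_un = 0.1 * t_u                            # bring the input into roughly the same range as the output

training_loop(
    n_epochs = 100,
    learning_rate = 1e-2,                   # a single learning rate can now serve both parameters
    params = torch.tensor([1.0, 0.0]),
    t_u = t_un,                             # note: we pass the rescaled input
    t_c = t_c)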
