layer_norm = nn.LayerNorm(d_model)  # normalizes over the last dimension (d_model = 4 here)
normalized = layer_norm(inputs)
normalized[0][0].mean(), normalized[0][0].std(unbiased=False)

Output

(tensor(-1.4901e-08, grad_fn=<MeanBackward0>),

tensor(1.0000, grad_fn=<StdBackward0>))

Zero mean and unit standard deviation, as expected.
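If you'd like to check every sequence position at once, not only the first one, a quick sketch like the one below (not part of the original code) aggregates the statistics over the last dimension:

normalized.mean(dim=-1), normalized.std(dim=-1, unbiased=False)  # one mean / std per position, all ~0 and ~1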

"Why do they have a grad_fn attribute?"

Like batch normalization, layer normalization can learn affine transformations. Yes, plural: each feature has its own affine transformation. Since we're using layer normalization on d_model, and its dimensionality is four, there will be four weights and four biases in the state_dict():

layer_norm.state_dict()

Output

OrderedDict([('weight', tensor([1., 1., 1., 1.])),

('bias', tensor([0., 0., 0., 0.]))])
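By the way, this affine transformation is optional: the elementwise_affine argument of nn.LayerNorm controls whether these parameters exist at all. A quick sketch (not part of the book's code) to illustrate:

plain_norm = nn.LayerNorm(d_model, elementwise_affine=False)
plain_norm.state_dict()  # empty OrderedDict - no weights, no biases to learn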

The weights and biases are used to scale and translate, respectively, the standardized values:

$$\text{layer normed } x_i = b_i + w_i \, \text{standardized } x_i$$

Equation 10.10 - Layer normalization (with affine transformation)

In PyTorch's documentation, though, you'll find gamma and beta instead:

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$$

Equation 10.11 - Layer Normalization (with affine transformation)
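As a sanity check, here is a small sketch (assuming the layer_norm and inputs used above, and LayerNorm's default eps) that reproduces the layer's output by standardizing manually and then applying the weights and biases:

import torch

w, b = layer_norm.weight, layer_norm.bias  # gamma and beta, initialized to ones and zeros
mean = inputs.mean(dim=-1, keepdim=True)
var = inputs.var(dim=-1, unbiased=False, keepdim=True)
standardized = (inputs - mean) / torch.sqrt(var + layer_norm.eps)
manual = w * standardized + b
torch.allclose(manual, normalized)  # True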

