
Batch and layer normalization look quite similar to one another, but there are some important differences between them that we need to point out.

Batch vs Layer

Although both normalizations compute statistics, namely, mean and biased standard deviation, to standardize the inputs, only batch norm needs to keep track of running statistics.

Moreover, since layer normalization considers data points individually, it exhibits the same behavior whether the model is in training or in evaluation mode.
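We can verify both points directly. The sketch below is not from the book's code; it is a minimal check using PyTorch's built-in nn.BatchNorm1d and nn.LayerNorm on a made-up mini-batch of shape (16, 256):

import torch
import torch.nn as nn

torch.manual_seed(23)
dummy = torch.randn(16, 256)  # a mini-batch of 16 data points, 256 features each

batch_norm = nn.BatchNorm1d(num_features=256)
layer_norm = nn.LayerNorm(normalized_shape=256)

# only batch norm registers running statistics as buffers
print([name for name, _ in batch_norm.named_buffers()])
# ['running_mean', 'running_var', 'num_batches_tracked']
print([name for name, _ in layer_norm.named_buffers()])
# []

# layer norm produces the same output in training and evaluation modes...
train_out = layer_norm(dummy)
layer_norm.eval()
eval_out = layer_norm(dummy)
print(torch.allclose(train_out, eval_out))  # True

# ...while batch norm switches from batch statistics to running statistics
train_out = batch_norm(dummy)
batch_norm.eval()
eval_out = batch_norm(dummy)
print(torch.allclose(train_out, eval_out))  # False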

To illustrate the difference between the two types of normalization, let's generate yet another dummy example (again adding positional encoding to it):

torch.manual_seed(23)
# four sequences of length one, each data point with 256 features
dummy_points = torch.randn(4, 1, 256)
# the PositionalEncoding class was defined earlier in the chapter
dummy_pe = PositionalEncoding(1, 256)
dummy_enc = dummy_pe(dummy_points)
dummy_enc

Output

tensor([[[-14.4193,  10.0495,  -7.8116,  ..., -18.0732,  -3.9566]],
        [[  2.6628,  -3.5462, -23.6461,  ..., -18.4375, -37.4197]],
        [[-24.6397,  -1.9127, -16.4244,  ..., -26.0550, -14.0706]],
        [[ 13.7988,  21.4612,  10.4125,  ..., -17.0188,   3.9237]]])

There are four sequences, so let's pretend there are two mini-batches of two sequences each (N=2). Each sequence has a length of one (L=1 is not quite a sequence, I know), and their sole data points have 256 features (D=256). The figure below illustrates the difference between applying batch norm (over features / columns) and layer norm (over data points / rows).
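We can also confirm the axes in code. The sketch below is my own check, not the book's next step: for simplicity, it treats all four data points as a single mini-batch (rather than the two mini-batches of two we pretended above) and uses PyTorch's built-in normalization layers:

import torch.nn as nn

# flatten to (4, 256): data points as rows, features as columns
points = dummy_enc.squeeze(1)

batch_normed = nn.BatchNorm1d(num_features=256)(points)    # standardizes columns
layer_normed = nn.LayerNorm(normalized_shape=256)(points)  # standardizes rows

# batch norm: each feature (column) ends up with mean zero and
# unit (biased) standard deviation across the data points
print(batch_normed.mean(axis=0).abs().max())            # close to zero
print(batch_normed.std(axis=0, unbiased=False).mean())  # close to one

# layer norm: each data point (row) ends up with mean zero and
# unit (biased) standard deviation across its features
print(layer_normed.mean(axis=1).abs().max())            # close to zero
print(layer_normed.std(axis=1, unbiased=False).mean())  # close to one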
