
Output

tensor([[[-1.0000,  0.0000],
         [-0.1585,  1.5403]]])

"What am I looking at?"

It turns out, the original coordinates were somewhat crowded out by the addition of the positional encoding (especially the first row). This may happen if the data points have values roughly in the same range as the positional encoding. Unfortunately, this is fairly common: Both standardized inputs and word embeddings (we’ll get back to them in Chapter 11) are likely to have most of their values inside the [-1, 1] range of the positional encoding.
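To make the crowding concrete, here is a quick sketch (not the book’s code) that computes the sinusoidal encoding for our two positions and two dimensions. Every value it produces is a sine or a cosine, so it necessarily lies inside [-1, 1], the same range as standardized inputs:

import torch

# sinusoidal positional encoding for two positions and two dimensions
position = torch.arange(2).float().unsqueeze(1)    # positions 0 and 1
# with d_model=2 there is a single frequency, and it equals 1.0
angular_speed = torch.exp(torch.arange(0, 2, 2).float()
                          * (-torch.log(torch.tensor(10000.0)) / 2))
pe = torch.zeros(2, 2)
pe[:, 0::2] = torch.sin(position * angular_speed)  # even dimensions
pe[:, 1::2] = torch.cos(position * angular_speed)  # odd dimensions
print(pe)
# tensor([[0.0000, 1.0000],
#         [0.8415, 0.5403]])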

"How can we handle it then?"

That’s what the scaling in the forward() method is for: It’s as if we were "reversing the standardization" of the inputs (using a standard deviation equal to the square root of their dimensionality) to retrieve the hypothetical "raw" inputs.

Equation 9.22 - "Reversing" the standardization

"raw" inputs = sqrt(dimensions) * standardized inputs
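To see where this scaling lives in code, here is a minimal sketch of a positional encoding module with the scaling in its forward() method (the class layout and names are illustrative assumptions, not necessarily identical to the book’s implementation):

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.d_model = d_model
        position = torch.arange(max_len).float().unsqueeze(1)
        angular_speed = torch.exp(torch.arange(0, d_model, 2).float()
                                  * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * angular_speed)  # even dimensions
        pe[:, 1::2] = torch.cos(position * angular_speed)  # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))  # shape: (1, max_len, d_model)

    def forward(self, x):
        # "reverse the standardization": scale the inputs up by sqrt(d_model)
        # so they are not crowded out by the [-1, 1] positional encoding
        scaled_x = x * (self.d_model ** 0.5)
        return scaled_x + self.pe[:, :x.size(1), :]

The only difference from a plain positional encoding module is the multiplication by the square root of d_model right before the encoding is added.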

By the way, previously, we scaled the dot product using the inverse of the square root of its dimensionality, which was its standard deviation.

Even though this is not the same thing, the analogy might help you remember that the inputs are also scaled by the square root of their number of dimensions before the positional encoding gets added to them.

In our example, the dimensionality is two (coordinates), so the inputs are going to be scaled by the square root of two:

posenc(source_seq)
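The output of that call isn’t reproduced here, but we can check the arithmetic by hand. The snippet below is a sketch that assumes source_seq is the two-point sequence from the earlier example, [[-1, -1], [-1, 1]], and reuses the two encoding rows shown above:

import torch

source_seq = torch.tensor([[[-1., -1.], [-1., 1.]]])  # assumed input from the earlier example
pe = torch.tensor([[0.0000, 1.0000],
                   [0.8415, 0.5403]])                  # encoding rows for two positions, d_model=2
scaled = source_seq * (2 ** 0.5)                       # multiply by sqrt(2) ~= 1.4142
print(scaled + pe)
# tensor([[[-1.4142, -0.4142],
#          [-0.5727,  1.9545]]])

After scaling, the inputs sit around ±1.41 while the encoding stays within [-1, 1], so the original coordinates are no longer crowded out.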
