1. First step: Alignment scores (scaled dot product)
2. Second step: Attention scores (alphas)
3. Third step: Context vector
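Written out, with Q as the "query", K as the "keys", V as the "values", and d_k hidden dimensions (this is the standard scaled dot-product notation, not symbols taken from the listing above), the three steps amount to:

\[
\text{scores} = \frac{Q\,K^\top}{\sqrt{d_k}}, \qquad
\alpha = \operatorname{softmax}(\text{scores}), \qquad
\text{context} = \alpha\,V
\]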

Let’s go over each of the methods (a full sketch of the class follows the list):

• In the constructor method, there are the following:

◦ three linear layers corresponding to the affine transformations for "keys" and "query" (and for the future transformation of "values" too)

◦ one attribute for the number of hidden dimensions (to scale the dot product)

◦ a placeholder for the attention scores (alphas)

• There is an init_keys() method to receive a batch-first sequence of hidden states from the encoder.

◦ These are computed once at the beginning and will be used over and over again with every new "query" that is presented to the attention mechanism.

◦ Therefore, it is better to initialize "keys" and "values" once than to pass them as arguments to the forward() method every time.

• The score_function() is simply the scaled dot product, but using an affine transformation on the "query" this time.

• The forward() method takes a batch-first hidden state as "query" and performs the three steps of the attention mechanism:

◦ Using "keys" and "query" to compute alignment scores

◦ Using alignment scores to compute attention scores (alphas)

◦ Using "values" and attention scores to generate the context vector
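Putting it all together, the class looks something like this. This is a minimal sketch: the method names (init_keys(), score_function(), forward()) and the three steps follow the description above, but the tensor shapes, assumed batch-first as (N, L, H) for the encoder states and (N, 1, H) for the query, and the exact bodies are a reconstruction, not the book's verbatim listing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # number of hidden dimensions, used to scale the dot product
        self.d_k = hidden_dim
        # affine transformations for "keys" and "query"
        # (and for the future transformation of "values" too)
        self.linear_query = nn.Linear(hidden_dim, hidden_dim)
        self.linear_key = nn.Linear(hidden_dim, hidden_dim)
        self.linear_value = nn.Linear(hidden_dim, hidden_dim)
        # placeholder for the attention scores (alphas)
        self.alphas = None

    def init_keys(self, keys):
        # batch-first hidden states from the encoder: (N, L, H)
        # projected once here and reused with every new "query"
        self.keys = keys
        self.proj_keys = self.linear_key(keys)
        self.values = self.linear_value(keys)

    def score_function(self, query):
        # scaled dot product, with an affine transformation on the "query"
        proj_query = self.linear_query(query)  # (N, 1, H)
        # (N, 1, H) x (N, H, L) -> (N, 1, L)
        dot_products = torch.bmm(proj_query, self.proj_keys.permute(0, 2, 1))
        return dot_products / (self.d_k ** 0.5)

    def forward(self, query, mask=None):
        # 1) "keys" and "query" -> alignment scores
        scores = self.score_function(query)  # (N, 1, L)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        # 2) alignment scores -> attention scores (alphas)
        alphas = F.softmax(scores, dim=-1)  # (N, 1, L)
        self.alphas = alphas.detach()
        # 3) "values" and attention scores -> context vector
        context = torch.bmm(alphas, self.values)  # (N, 1, H)
        return context
```

Projecting the "values" inside init_keys() is what makes the trade-off in the list above pay off: everything that depends only on the encoder states is computed a single time, and forward() only ever touches the new "query".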

"There is that unexplained mask again!"

I’m on it!

Source Mask

The mask can be used to, well, mask some of the "values" to force the attention mechanism to ignore them.
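A typical use is ignoring the padded positions of the source sequence: setting their alignment scores to a large negative number before the softmax makes their alphas collapse to (practically) zero. A minimal sketch, assuming the (N, 1, L) mask shape and the masked_fill() trick from the forward() method sketched above:

```python
import torch

# alignment scores for one query over a source of length 4: (N, 1, L)
scores = torch.tensor([[[0.9, 0.3, 0.5, 0.1]]])
# source mask: True for real tokens, False for padding (last position is padded)
mask = torch.tensor([[[True, True, True, False]]])

masked_scores = scores.masked_fill(mask == 0, -1e9)
alphas = torch.softmax(masked_scores, dim=-1)
print(alphas)  # the last alpha is ~0: that "value" is effectively ignored
```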
