
hidden states, the decoder will recruit the attention mechanism to help it decide which parts of the source sequence it must pay attention to.

In our made-up example, the attention mechanism informed the decoder it should pay 80% of its attention to the encoder’s hidden state corresponding to the word "the," and the remaining 20% to the word "zone." The diagram below illustrates this.

Figure 9.11 - Paying attention to words

"Values"

From now on, we’ll be referring to the encoder’s hidden states (or their affine transformations) as "values" (V). Multiplying a "value" by its corresponding attention score produces an alignment vector. And, as you can see in the diagram, the sum of all alignment vectors (that is, the weighted average of the hidden states) is called a context vector.

$$\text{context vector} = \sum_i \alpha_i h_i = \alpha_0 h_0 + \alpha_1 h_1$$

Equation 9.1 - Context vector
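To make Equation 9.1 concrete, here is a minimal sketch in PyTorch. The hidden states and the 80/20 attention scores below are made-up numbers for illustration, not the output of a trained model:

import torch

# Encoder hidden states ("values") for the two source words, "the" and "zone"
# (hidden dimension of two; made-up numbers)
values = torch.tensor([[0.5, 1.0],    # hidden state for "the"
                       [-1.0, 0.3]])  # hidden state for "zone"
# Attention scores from the made-up example: 80% to "the", 20% to "zone"
alphas = torch.tensor([0.8, 0.2])
# Each alignment vector is a "value" scaled by its attention score
alignment_vectors = alphas.unsqueeze(1) * values
# The context vector is the sum of the alignment vectors, that is,
# the weighted average of the hidden states
context_vector = alignment_vectors.sum(dim=0)
print(context_vector)  # tensor([0.2000, 0.8600])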

"OK, but where do the attention scores come from?"

"Keys" and "Queries"

The attention scores are based on matching each hidden state of the decoder (h_2) to every hidden state of the encoder (h_0 and h_1). Some of them will be good matches (high attention scores) while some others will be poor matches (low attention scores).
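As a minimal sketch of this matching, assuming plain dot-product scoring (one common choice; the numbers are again made up), we can compare the decoder’s hidden state to each encoder hidden state and normalize the raw scores with a softmax so they add up to one:

import torch
import torch.nn.functional as F

# Encoder hidden states (h_0 and h_1) and decoder hidden state (h_2);
# made-up numbers, hidden dimension of two
encoder_states = torch.tensor([[0.5, 1.0],    # h_0 ("the")
                               [-1.0, 0.3]])  # h_1 ("zone")
decoder_state = torch.tensor([0.9, 0.7])      # h_2
# The dot product measures how well each pair matches: similar vectors
# yield high raw scores, dissimilar ones low raw scores
raw_scores = encoder_states @ decoder_state
# Softmax turns the raw scores into attention scores that add up to one
attention_scores = F.softmax(raw_scores, dim=0)
print(attention_scores)  # roughly [0.86, 0.14] for these numbers

Since the attention scores add up to one, they can be plugged straight into the weighted average from Equation 9.1.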
