Chapter 9 — Part II

Sequence-to-Sequence

Spoilers

In the second half of this chapter, we will:

• use self-attention mechanisms to replace recurrent layers in both the encoder and the decoder

• understand the importance of the target mask to avoid data leakage

• learn how to use positional encoding

Self-Attention

Here is a radical notion: What if we replaced the recurrent layer with an attention mechanism?

That’s the main proposition of the famous "Attention Is All You Need" [142] paper by Vaswani, A., et al. It introduced the Transformer architecture, based on a self-attention mechanism, which would soon completely dominate the NLP landscape.

"I pity the fool using recurrent layers."

Mr. T

The recurrent layer in the encoder took the source sequence in and, one by one, generated hidden states. But we don’t have to generate hidden states like that. We can use another, separate, attention mechanism to replace the encoder (and, wait for it, the decoder too!).
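To make this concrete, here is a minimal sketch (not the book's implementation) of scaled dot-product self-attention producing "hidden states" for a toy source sequence. The tensor names and sizes are made up for illustration, and the usual linear projections are skipped for brevity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)

# A toy "source sequence": batch of 1, length 4, hidden size 8 (illustrative sizes)
source_seq = torch.randn(1, 4, 8)

# In self-attention, "query", "keys", and "values" all come from the same sequence
q, k, v = source_seq, source_seq, source_seq

# Scaled dot-product attention: every position attends to every position
scores = torch.matmul(q, k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
alphas = F.softmax(scores, dim=-1)   # (1, 4, 4) attention weights
states = torch.matmul(alphas, v)     # (1, 4, 8) "hidden states"

print(states.shape)  # torch.Size([1, 4, 8])
```

Notice that all four "hidden states" are computed in parallel, which is exactly what a recurrent layer cannot do, since it has to step through the sequence one element at a time.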

These separate attention mechanisms are called self-attention mechanisms since all of their inputs—"keys," "values," and "query"—are internal to either an encoder or a decoder.

The attention mechanism we discussed in the previous section, where "keys" and "values" come from the encoder, but the "query" comes from the decoder, is going to be referred to as cross-attention from now on.
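As a quick illustration (using PyTorch's nn.MultiheadAttention rather than the book's own classes), the only difference between the two flavors is where the arguments come from. The tensors below are random stand-ins for encoder and decoder states.

```python
import torch
import torch.nn as nn

torch.manual_seed(13)

# In a real model, self- and cross-attention would be separate layers;
# one module is reused here just to show the calling convention
attn = nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)

encoder_states = torch.randn(1, 4, 8)  # stand-in for source representations
decoder_states = torch.randn(1, 2, 8)  # stand-in for target representations

# Self-attention: "query", "keys", and "values" all come from the encoder
self_out, self_alphas = attn(encoder_states, encoder_states, encoder_states)

# Cross-attention: "query" from the decoder, "keys"/"values" from the encoder
cross_out, cross_alphas = attn(decoder_states, encoder_states, encoder_states)

print(self_alphas.shape)   # torch.Size([1, 4, 4])
print(cross_alphas.shape)  # torch.Size([1, 2, 4])
```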
