
shifted_seq = torch.cat([source_seq[:, -1:],
                         target_seq[:, :-1]], dim=1)
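To make the shift concrete, here is a quick sketch using made-up tensors (the coordinates below are placeholders, not our actual dataset): the shifted sequence starts with the last source point and is followed by every target point except the last one.

import torch

# Hypothetical source and target sequences, two points each,
# every point with two coordinates - shape (1, 2, 2)
source_seq = torch.tensor([[[-1., -1.], [-1., 1.]]])
target_seq = torch.tensor([[[ 1.,  1.], [ 1., -1.]]])

# Last source point followed by all target points but the last one
shifted_seq = torch.cat([source_seq[:, -1:],
                         target_seq[:, :-1]], dim=1)
print(shifted_seq)
# tensor([[[-1.,  1.],
#          [ 1.,  1.]]])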

The shifted target sequence was already used (even though we didn’t have a name for it) when we discussed teacher forcing. There, at every step (after the first one), we randomly chose, as the input to the subsequent step, either an actual element from that sequence or the model’s own prediction. That worked very well with recurrent layers, which are sequential in nature. But this isn’t the case anymore.

One of the advantages of self-attention over recurrent layers is that operations can be parallelized. No need to do anything sequentially anymore, teacher forcing included. This means we’re using the whole shifted target sequence at once as the "query" argument of the decoder.
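Here is a minimal sketch of what that means in practice, assuming a single attention head and hypothetical projection layers (our real model is more elaborate): the scores for every "query" position come out of a single matrix multiplication, no loop required.

import torch
import torch.nn.functional as F

torch.manual_seed(42)

d_model = 2
# Shifted target sequence from the sketch above - shape (1, 2, 2)
shifted_seq = torch.tensor([[[-1., 1.], [1., 1.]]])

# Hypothetical projections for "query", "key", and "value"
proj_q = torch.nn.Linear(d_model, d_model)
proj_k = torch.nn.Linear(d_model, d_model)
proj_v = torch.nn.Linear(d_model, d_model)

q, k, v = proj_q(shifted_seq), proj_k(shifted_seq), proj_v(shifted_seq)

# All attention scores, for every query position, in one matrix product
scores = torch.bmm(q, k.transpose(1, 2)) / (d_model ** 0.5)  # (1, 2, 2)
alphas = F.softmax(scores, dim=-1)
context = torch.bmm(alphas, v)  # one context vector per query position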

That’s very nice and cool, sure, but it raises one big problem involving the…

Attention Scores

To understand what the problem is, let’s look at the context vector that will result in the first "hidden state" produced by the decoder, which, in turn, will lead to the first prediction:

Equation 9.14 - Context vector for the first target
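A sketch of what that context vector looks like, assuming the scaled dot-product attention and softmax-based alignment scores (alphas) we’ve been using, with the indexing following the discussion below:

\[
\text{context vector}_1 = \alpha_{1,1} V_1 + \alpha_{1,2} V_2
\qquad
\alpha_{1,j} = \mathrm{softmax}_j\left(\frac{Q_1 \cdot K_j}{\sqrt{d_k}}\right)
\]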

"What’s the problem with it?"

The problem is that it is using a "key" (K₂) and a "value" (V₂) that are transformations of the data point it is trying to predict.

In other words, the model is being allowed to cheat by peeking into the future because we’re giving it all data points in the target sequence except the very last one.
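To put a (made-up) number on that peeking, here is what the softmaxed attention scores could look like for the first "query" when nothing is masked:

import torch
import torch.nn.functional as F

# Made-up attention scores for a length-two shifted target sequence
scores = torch.tensor([[[0.3, 0.8],
                        [0.1, 0.5]]])
alphas = F.softmax(scores, dim=-1)
print(alphas[0, 0])  # tensor([0.3775, 0.6225])
# Without any masking, the first "query" assigns most of its weight to
# the second position - the very data point it is supposed to predict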

If we look at the context vector corresponding to the last prediction, it should be
