Figure 10.2 - Chunking: the wrong and the right way

On the left, the correct approach: It computes the projections first and chunks them later. It is clear that each value in the projection (from v00 to v03) is a linear combination of all features in the data point.

Since each head is working with a subset of the projected dimensions, these projected dimensions may end up representing different aspects of the underlying data. For natural language processing tasks, for example, some attention heads may correspond to linguistic notions of syntax and coherence. A particular head may attend to the direct objects of verbs, while another head may attend to objects of prepositions, and so on. [148]

Now, compare it to the wrong approach, on the right: By chunking it first, each value in the projection is a linear combination of a subset of the features only.

"Why is it so bad?"

First, it is a simpler model (the wrong approach has only eight weights while the correct one has sixteen), so its learning capacity is limited. Second, since each head can only look at a subset of the features, they simply cannot learn about long-range dependencies in the inputs.
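
To make the weight counts concrete, here is a minimal PyTorch sketch (not the book's own listing) that builds both versions for a single four-feature data point; the layer names and the use of chunk() are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# one data point with four features: shape (N, L, F) = (1, 1, 4)
x = torch.randn(1, 1, 4)

# RIGHT way: project the full four features first, THEN chunk into two heads
# of two dimensions each; the projection has 4 x 4 = 16 weights, so every
# projected value is a linear combination of all four features
proj = nn.Linear(4, 4, bias=False)
v = proj(x)                   # shape: (1, 1, 4)
v_heads = v.chunk(2, dim=-1)  # two chunks of shape (1, 1, 2)

# WRONG way: chunk the features first, THEN project each chunk independently;
# each projection has 2 x 2 = 4 weights (8 in total), so every projected value
# only sees half of the original features
x_heads = x.chunk(2, dim=-1)  # two chunks of shape (1, 1, 2)
proj_h0 = nn.Linear(2, 2, bias=False)
proj_h1 = nn.Linear(2, 2, bias=False)
v_wrong = [proj_h0(x_heads[0]), proj_h1(x_heads[1])]

n_right = sum(p.numel() for p in proj.parameters())
n_wrong = sum(p.numel() for m in (proj_h0, proj_h1) for p in m.parameters())
print(n_right, n_wrong)  # 16 8
```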

Now, let’s use a source sequence of length two as input, with each data point having four features like the chunking example above, to illustrate our new self-attention mechanism.
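
One possible way to build such an input is sketched below; the tensor name (source_seq), the seed, and the random values are assumptions for illustration, not the actual values used in the chunking example.

```python
import torch

torch.manual_seed(11)  # assumed seed, for reproducibility only

# a mini-batch containing a single source sequence of length two,
# each data point having four features: shape (N, L, F) = (1, 2, 4)
source_seq = torch.randn(1, 2, 4)
print(source_seq.shape)  # torch.Size([1, 2, 4])
```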
