
• pooler_output returns the output of the pooler given the last hidden state as its input:

(out['pooler_output'] ==
 bert_model.pooler(out['last_hidden_state'])).all()

Output
tensor(True)
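For context, here is a minimal sketch of how bert_model and out could have been set up, assuming HuggingFace's transformers library and the bert-base-uncased checkpoint (the input sentence below is a placeholder, not the book's actual thirty-token input):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# tokenize a placeholder sentence; the tokenizer adds the special
# [CLS] and [SEP] tokens on its own
tokens = tokenizer('A placeholder sentence.', return_tensors='pt')

# request the self-attention scores alongside the hidden states
bert_model.eval()
with torch.no_grad():
    out = bert_model(**tokens, output_attentions=True)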

• attentions returns the self-attention scores for each attention head in each "layer" of BERT’s encoder:

print(len(out['attentions']))
print(out['attentions'][0].shape)

Output
12
torch.Size([1, 12, 30, 30])

The returned tuple has twelve elements, one for each "layer," and each element has a tensor containing the scores for the sentences in the mini-batch (only one in our case). Those scores include each of BERT’s twelve self-attention heads, with each head indicating how much attention each of the thirty tokens is paying to all thirty tokens in the input. Are you still with me? That’s 129,600 attention scores in total! No, I’m not even trying to visualize that :-)
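As a quick sanity check, that count can be confirmed programmatically. The snippet below is a sketch that reuses the out from the setup sketch above; with the book's thirty-token input it comes out to exactly 129,600:

n_layers = len(out['attentions'])                # twelve "layers"
batch, n_heads, n_tokens, _ = out['attentions'][0].shape
print(n_layers * n_heads * n_tokens * n_tokens)  # 12 * 12 * 30 * 30 = 129,600

# each head holds an n_tokens x n_tokens matrix of scores whose rows
# sum to one, since they come out of a softmax over the input tokens
print(out['attentions'][0][0, 0].sum(dim=-1))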

Model IV — Classifying Using BERT

We’ll use a Transformer encoder as a classifier once again (like in "Model II"), but it will be much easier now because BERT will be our encoder, and it already handles the special classifier token by itself:
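To make the idea concrete, here is a minimal sketch of such a classifier, assuming the bert-base-uncased checkpoint and a single linear layer on top of the [CLS] token's final hidden state (the names and choices are illustrative, not necessarily the book's exact Model IV):

import torch.nn as nn
from transformers import BertModel

class BERTClassifier(nn.Module):
    def __init__(self, bert_model, n_outputs):
        super().__init__()
        self.bert = bert_model
        self.classifier = nn.Linear(bert_model.config.hidden_size, n_outputs)

    def forward(self, input_ids, attention_mask=None):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # BERT prepends the special [CLS] token by itself, so the first
        # position of the last hidden state summarizes the whole input
        cls_state = out['last_hidden_state'][:, 0]
        return self.classifier(cls_state)

# usage sketch: a single logit pairs with nn.BCEWithLogitsLoss for
# binary classification
bert = BertModel.from_pretrained('bert-base-uncased')
model = BERTClassifier(bert, n_outputs=1)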
