
If these two sentences were the input of the NSP task, that's what BERT's inputs and outputs would look like.

Figure 11.27 - Pre-training task—next sentence prediction (NSP)
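To make the input side concrete, here is a minimal sketch of how a sentence pair can be encoded for BERT using a HuggingFace tokenizer. The two sentences below are placeholders (not the ones from the figure), and the tokenizer is assumed to be `bert-base-uncased`:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# placeholder sentence pair (not the actual example from the figure)
first_sentence = 'Alice followed the white rabbit.'
second_sentence = 'She fell down the rabbit hole.'

# the tokenizer builds the NSP-style input: [CLS] A [SEP] B [SEP]
encoded = tokenizer(first_sentence, second_sentence, return_tensors='pt')

print(encoded['input_ids'].shape)   # (1, sequence length)
print(encoded['token_type_ids'])    # zeros for sentence A, ones for sentence B

The token type IDs are what tell the model where the first sentence ends and the second one begins.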

The final hidden state is actually further processed by a pooler (composed of a linear layer and a hyperbolic tangent activation function) before being fed to the classifier (FFN, feed-forward network, in the figure above):

bert_model.pooler

Output

BertPooler(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (activation): Tanh()
)
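The snippet below is a minimal sketch of what that pooler computes, assuming `bert_model` was loaded with `BertModel.from_pretrained('bert-base-uncased')` and `encoded` is the tokenized sentence pair from the sketch above:

import torch

with torch.no_grad():
    outputs = bert_model(**encoded)

# final hidden state of the first token ([CLS]) for every sequence in the batch
cls_hidden = outputs.last_hidden_state[:, 0]               # (batch size, 768)

# the pooler is just a linear layer followed by a tanh
manual_pooled = torch.tanh(bert_model.pooler.dense(cls_hidden))

# it should match the pooler_output returned by the model
print(torch.allclose(manual_pooled, outputs.pooler_output))  # True

In other words, the pooled output is simply a transformed version of the [CLS] token's final hidden state.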

Outputs

We’ve seen the many embeddings BERT uses as inputs, but we’re more interested in its outputs, like the contextual word embeddings, right?

By the way, BERT’s outputs are always batch-first; that is, their shape is (mini-batch size, sequence length, hidden_dimensions).
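As a quick sanity check of that shape, here is a short sketch, again assuming `bert_model` and `tokenizer` were loaded from `bert-base-uncased` (hidden dimension of 768); the sentence is just an arbitrary example:

import torch

sentence = 'The quick brown fox jumps over the lazy dog.'
tokens = tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    outputs = bert_model(**tokens)

# batch-first: (mini-batch size, sequence length, hidden_dimensions)
print(outputs.last_hidden_state.shape)
# torch.Size([1, sequence length, 768]) for this single sentence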
