
Figure 11.26 - Pre-training task—masked language model (MLM)

Also, notice that BERT computes logits for the randomly masked inputs only. The remaining inputs are not even considered for computing the loss.
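To make this concrete, here is a minimal sketch (not the book's code) of how positions can be excluded from the loss, assuming the usual convention of labeling them with -100, which happens to be the default ignore_index of PyTorch's cross-entropy:

import torch
import torch.nn.functional as F

vocab_size = 30522                        # size of BERT's WordPiece vocabulary
logits = torch.randn(1, 13, vocab_size)   # made-up logits: (batch, seq_len, vocab)
labels = torch.full((1, 13), -100)        # -100 = position ignored by the loss
labels[0, 5] = 24759                      # only the masked position keeps its true token id

# cross_entropy skips every position labeled with its ignore_index (-100 by
# default), so only the masked token contributes to the loss
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))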

"OK, but how can we randomly replace tokens like that?"

One alternative, similar to the way we do data augmentation for images, would be to implement a custom dataset that performs the replacements on the fly in the __getitem__() method, as in the sketch below.
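Just to illustrate the idea, a bare-bones version of such a dataset could look like this (an assumption, not the book's code; it also skips BERT's refinements, such as never masking special tokens and the 80/10/10 mask/random/keep split):

import torch
from torch.utils.data import Dataset

class MaskedLMDataset(Dataset):
    # Hypothetical sketch: replaces tokens on the fly, much like random
    # image augmentations are applied at loading time
    def __init__(self, token_ids, mask_token_id, mlm_probability=0.15):
        self.token_ids = token_ids            # list of lists of input_ids
        self.mask_token_id = mask_token_id
        self.mlm_probability = mlm_probability

    def __len__(self):
        return len(self.token_ids)

    def __getitem__(self, idx):
        input_ids = torch.as_tensor(self.token_ids[idx])
        labels = input_ids.clone()
        # randomly picks roughly 15% of the positions to be masked
        masked = torch.rand(input_ids.shape) < self.mlm_probability
        input_ids[masked] = self.mask_token_id
        labels[~masked] = -100                # every other position is ignored by the loss
        return {'input_ids': input_ids, 'labels': labels}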

There is a better alternative, though: using a collate function or, better yet, a data collator. There's a data collator that performs the replacement procedure prescribed by BERT: DataCollatorForLanguageModeling.

Let's see an example of it in action, starting with an input sentence:

sentence = 'Alice is inexplicably following the white rabbit'
tokens = bert_tokenizer(sentence)
tokens['input_ids']

Output

[101, 5650, 2003, 1999, 10288, 24759, 5555, 6321, 2206, 1996, 2317, 10442, 102]
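As a rough preview of what comes next (the variable name mlm_collator is an assumption), the collator can be instantiated and applied to the tokenized sentence like this:

from transformers import DataCollatorForLanguageModeling

# 15% is both BERT's prescribed masking probability and the collator's default
mlm_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.15
)
# the collator takes a list of tokenized examples and returns a padded batch:
# roughly 15% of the input_ids get replaced (mostly by [MASK]), and the labels
# keep the original ids at those positions while every other position gets -100
batch = mlm_collator([tokens])
batch['input_ids'], batch['labels']

Each call produces a different random masking, which is exactly why performing the replacement at collation time works as on-the-fly augmentation.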
