
Output

[1219, 5, 229, 200, 1]

"OK, fine, but that doesn’t seem very practical."

You’re absolutely right. We can use the encode() method to perform two steps at once:

new_ids = tokenizer.encode(new_sentence)

new_ids

Output

[3, 1219, 5, 229, 200, 1, 2]

There we go, from sentence to token IDs in one call!

"Nice try! There are more IDs than tokens in this output! Something

must be wrong…"

Yes, there are more IDs than tokens. No, there’s nothing wrong; it’s actually meant to be like that. These extra tokens are special tokens too. We could look them up in the vocabulary using their indices (three and two), but it’s nicer to use the tokenizer’s convert_ids_to_tokens() method:

tokenizer.convert_ids_to_tokens(new_ids)

Output

['[CLS]', 'follow', 'the', 'white', 'rabbit', '[UNK]', '[SEP]']
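If you’d rather take the first route and look the IDs up in the vocabulary yourself, a minimal sketch could look like the one below. It assumes the tokenizer follows the Hugging Face API and exposes its vocabulary through get_vocab(); the inverted dictionary (id_to_token) is built here just for illustration.

# get_vocab() maps tokens to IDs; we invert it to map IDs back to tokens
vocab = tokenizer.get_vocab()
id_to_token = {idx: token for token, idx in vocab.items()}
# IDs three and two are the extra tokens added by encode()
id_to_token[3], id_to_token[2]

This should return the same two special tokens that appear at both ends of the output above.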

The tokenizer not only appended a special separation token ([SEP]) to the output, but also prepended a special classifier token ([CLS]) to it. We’ve already added a classifier token to the inputs of a Vision Transformer to use its corresponding output in a classification task. We can do the same here to classify text using BERT.
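If you ever need the token IDs without those special tokens, encode() takes an add_special_tokens argument (assuming the tokenizer follows the Hugging Face API). The short sketch below also shows that, without them, encode() boils down to tokenize() followed by convert_tokens_to_ids():

# skip the automatic [CLS] / [SEP] tokens
plain_ids = tokenizer.encode(new_sentence, add_special_tokens=False)
# the two-step approach from before should produce exactly the same IDs
tokens = tokenizer.tokenize(new_sentence)
assert plain_ids == tokenizer.convert_tokens_to_ids(tokens)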
