Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub

from transformers import BertTokenizer
tokenizer = BertTokenizer('our_vocab/vocab.txt')

The purpose of this is to illustrate how the tokenizer works using simple word tokenization only! The (pre-trained) tokenizer you’ll use for real with a (pre-trained) BERT model does not require you to supply a vocabulary file, since it already ships with its own.

The tokenizer class is very rich and offers a plethora of methods and arguments. We’re just using some basic methods that barely scratch the surface. For more details, please refer to HuggingFace’s documentation on the tokenizer [175] and BertTokenizer [176] classes.

Then, let’s tokenize a new sentence using its tokenize() method:

new_sentence = 'follow the white rabbit neo'
new_tokens = tokenizer.tokenize(new_sentence)
new_tokens

Output

['follow', 'the', 'white', 'rabbit', '[UNK]']

Since Neo (from The Matrix) isn’t part of the original Alice’s Adventures in Wonderland, it couldn’t possibly be in our vocabulary, and thus it is treated as an unknown word and mapped to its corresponding special token, [UNK].
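The lookup behind this behavior can be sketched in plain Python. The tiny vocabulary below is made up for illustration (the real one is built from the corpus and stored in our_vocab/vocab.txt), but the fallback logic is the same: any word not found in the vocabulary is replaced by the [UNK] token.

```python
# Toy vocabulary (an assumption for illustration; not the actual file contents)
vocab = ['[PAD]', '[UNK]', 'follow', 'the', 'white', 'rabbit']

def simple_tokenize(sentence, vocab):
    # Split on whitespace and replace out-of-vocabulary words with [UNK]
    return [w if w in vocab else '[UNK]' for w in sentence.lower().split()]

simple_tokenize('follow the white rabbit neo', vocab)
# ['follow', 'the', 'white', 'rabbit', '[UNK]']
```

Of course, the real tokenizer does much more than whitespace splitting, but the out-of-vocabulary handling follows this pattern.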

"There is nothing new here—wasn’t it supposed to return indices and more?"

Wait for it… First, we actually can get the indices (the token IDs) using the convert_tokens_to_ids() method:

new_ids = tokenizer.convert_tokens_to_ids(new_tokens)
new_ids
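Conceptually, a token ID is just the token’s position in the vocabulary. A minimal stand-alone sketch, using a made-up toy vocabulary (the real IDs come from the actual vocab.txt, so the numbers below are illustrative only):

```python
# Toy vocabulary (an assumption for illustration; not the actual file contents)
vocab = ['[PAD]', '[UNK]', 'follow', 'the', 'white', 'rabbit']
# Map each token to its index in the vocabulary
token_to_id = {token: idx for idx, token in enumerate(vocab)}

tokens = ['follow', 'the', 'white', 'rabbit', '[UNK]']
ids = [token_to_id[t] for t in tokens]
ids  # [2, 3, 4, 5, 1]
```

Notice that the unknown word gets the ID of the [UNK] special token, whatever that position happens to be in the vocabulary file.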
