
Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)


We can take the sentences from our training set, add special tokens to the vocabulary, filter out any words appearing only once, and save the vocabulary file to the our_vocab folder:

make_vocab(train_dataset['sentence'],
           'our_vocab/',
           special_tokens=['[PAD]', '[UNK]'],
           min_freq=2)
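make_vocab() is the book's own helper, so its internals aren't shown here. As a rough sketch of what those arguments imply (the function name with the _sketch suffix, the vocab.txt filename, and the plain whitespace tokenization are assumptions for illustration, not the book's implementation):

```python
import os
from collections import Counter

def make_vocab_sketch(sentences, folder, special_tokens=None, min_freq=1):
    # Count word occurrences across all sentences
    # (plain whitespace splitting here; a real helper would use a word tokenizer)
    counts = Counter(word.lower()
                     for sentence in sentences
                     for word in sentence.split())
    # Keep only words appearing at least min_freq times (min_freq=2 drops
    # any word that shows up only once)
    kept = sorted(word for word, count in counts.items() if count >= min_freq)
    # Special tokens go first, so [PAD] ends up at index 0
    vocab = (special_tokens or []) + kept
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, 'vocab.txt')
    with open(path, 'w') as f:
        f.write('\n'.join(vocab))
    return path

# "the" and "glass" appear twice, so they survive min_freq=2; the rest don't
path = make_vocab_sketch(
    ['the glass slipper', 'the glass house', 'a dog'],
    'our_vocab/',
    special_tokens=['[PAD]', '[UNK]'],
    min_freq=2,
)
```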

And now we can use this vocabulary file with a tokenizer.

"But I thought we were already using tokenizers … aren’t we?"

Yes, we are. First, we used a sentence tokenizer to split the texts into sentences. Then, we used a word tokenizer to split each sentence into words. But there is yet another tokenizer…

HuggingFace’s Tokenizer

Since we’re using HF’s datasets, it is only logical that we use HF’s tokenizers as well, right? Besides, in order to use a pre-trained BERT model, we need to use the model’s corresponding pre-trained tokenizer.

"Why?"

Just like pre-trained computer vision models require that the input images be standardized using ImageNet statistics, pre-trained language models like BERT require that the inputs be properly tokenized. The tokenization used in BERT is different from the simple word tokenization we’ve just discussed. We’ll get back to that in due time, but let’s stick with the simple tokenization for now.

So, before loading a pre-trained tokenizer, let’s create our own tokenizer using our own vocabulary. HuggingFace’s tokenizers also expect a sentence as input, and they also proceed to perform some sort of word tokenization. But, instead of simply returning the tokens themselves, these tokenizers return the indices in the vocabulary corresponding to the tokens, along with lots of additional information. It’s like Gensim’s doc2idx(), but on steroids! Let’s see it in action!
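To make that comparison concrete, here is a plain-Python sketch (not HF's actual implementation) of the difference: a doc2idx()-style lookup returns only indices, while an HF-style tokenizer returns indices plus extra fields, such as an attention mask marking which positions are real tokens and which are padding:

```python
# A toy vocabulary: special tokens first, as in our vocab file
vocab = ['[PAD]', '[UNK]', 'glass', 'slipper', 'the']
token2idx = {token: i for i, token in enumerate(vocab)}
unk_idx = token2idx['[UNK]']

def doc2idx_style(tokens):
    # Gensim-like behavior: just token -> index, unknowns mapped to [UNK]
    return [token2idx.get(t, unk_idx) for t in tokens]

def tokenizer_style(sentence, max_length=6):
    # HF-like behavior (sketch): word-tokenize, map to indices,
    # pad to max_length, and return extra info alongside the ids
    tokens = sentence.lower().split()
    ids = doc2idx_style(tokens)
    attention_mask = [1] * len(ids)
    while len(ids) < max_length:
        ids.append(token2idx['[PAD]'])
        attention_mask.append(0)
    return {'input_ids': ids, 'attention_mask': attention_mask}

encoded = tokenizer_style('The glass slipper')
# input_ids: [4, 2, 3, 0, 0, 0]; attention_mask: [1, 1, 1, 0, 0, 0]
```

A real HF tokenizer does much more (subword splitting, special tokens, truncation), but the shape of the output, indices plus side information, is the point here.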

We’ll be using the BertTokenizer class to create a tokenizer based on our own vocabulary:

906 | Chapter 11: Down the Yellow Brick Rabbit Hole
