
Now we can use its encode() method to get the indices for the tokens in a

sentence:

glove_tokenizer.encode('alice followed the white rabbit',
                       add_special_tokens=False)

Output

[7101, 930, 2, 300, 12427]
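For reference, encode() with add_special_tokens=False is roughly equivalent to tokenizing the sentence and then converting each token to its id in the vocabulary. A minimal sketch, assuming glove_tokenizer is a HuggingFace tokenizer (so it exposes tokenize() and convert_tokens_to_ids()):

tokens = glove_tokenizer.tokenize('alice followed the white rabbit')
# look up each token's id in the vocabulary
glove_tokenizer.convert_tokens_to_ids(tokens)

This should return the same list of ids as the call to encode() above.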

These are the indices we’ll use to retrieve the corresponding word embeddings.

There is one small detail we need to take care of first, though…

Special Tokens' Embeddings

Our vocabulary has 400,002 tokens now, but the original pre-trained word
embeddings have only 400,000 entries:

len(glove_tokenizer.vocab), len(glove.vectors)

Output

(400002, 400000)

The difference is due to the two special tokens, [PAD] and [UNK], that were

prepended to the vocabulary when we saved it to disk. Therefore, we need to

prepend their corresponding embeddings too.

"How would I know the embeddings for these tokens?"
