
Equation 11.1 - Embedding arithmetic

This arithmetic is cool and all, but you won’t actually be using it much; the whole point was to show you that the word embeddings indeed capture the relationship between different words. We can use them to train other models, though.
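
As a quick sanity check of this kind of arithmetic, here is a minimal sketch using Gensim’s downloader API and the classic "king - man + woman" analogy (the variable names and the print call are illustrative, not taken from the book’s own code):

from gensim import downloader

# load the same pre-trained vectors used throughout this section
glove = downloader.load('glove-wiki-gigaword-50')

# king - man + woman should land close to the vector for "queen"
result = glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(result)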

Using Word Embeddings

It seems easy enough: get the text corpora tokenized, look the tokens up in the table of pre-trained word embeddings, and then use the embeddings as inputs to another model. But what if the vocabulary of your corpora is not quite properly represented in the embeddings? Even worse, what if the preprocessing steps you used resulted in a lot of tokens that do not exist in the embeddings?
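
In code, that lookup pipeline could look something like the sketch below. It assumes the glove-wiki-gigaword-50 vectors from Gensim’s downloader and falls back to a zero vector for unknown tokens; the sentence and the variable names are just for illustration:

import numpy as np
import torch
from gensim import downloader
from gensim.utils import simple_preprocess

glove = downloader.load('glove-wiki-gigaword-50')

sentence = 'the knight chose his words wisely'
tokens = simple_preprocess(sentence)  # basic lowercasing and tokenization

# look each token up; unknown tokens fall back to a vector of zeros
vectors = np.stack([glove[token] if token in glove.key_to_index
                    else np.zeros(glove.vector_size, dtype=np.float32)
                    for token in tokens])

embeddings = torch.as_tensor(vectors)  # shape: (n_tokens, 50), ready as model input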

"Choose your word embeddings wisely."

Grail Knight

Vocabulary Coverage

Once again, the Grail Knight has a point—the chosen word embeddings must provide good vocabulary coverage. First and foremost, most of the usual preprocessing steps do not apply when you’re using pre-trained word embeddings like GloVe: no lemmatization, no stemming, no stop-word removal. These steps would likely end up producing a lot of [UNK] tokens.
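
To make that concrete, here is a small, hypothetical illustration using NLTK’s PorterStemmer (neither the stemmer nor the example word comes from the book’s pipeline); stems like these may simply not exist among GloVe’s keys:

from gensim import downloader
from nltk.stem import PorterStemmer

glove = downloader.load('glove-wiki-gigaword-50')
stemmer = PorterStemmer()

word = 'happiness'
stem = stemmer.stem(word)  # the Porter stemmer turns 'happiness' into 'happi'

# the original word has a pre-trained vector; the stem may not
print(word, word in glove.key_to_index)
print(stem, stem in glove.key_to_index)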

Second, even without those preprocessing steps, maybe the words used in the given text corpora are simply not a good match for a particular pre-trained set of word embeddings.

Let’s see how good a match the glove-wiki-gigaword-50 embeddings are to our own vocabulary. Our vocabulary has 3,706 words (3,704 from our text corpora plus the padding and unknown special tokens):

vocab = list(dictionary.token2id.keys())  # tokens from our Gensim Dictionary
len(vocab)                                # 3706
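
The natural next step is to check how many of those words actually have a pre-trained vector. A minimal sketch of that check is shown below (the variable names, other than vocab, are assumptions rather than the book’s own code):

from gensim import downloader

glove = downloader.load('glove-wiki-gigaword-50')

# words in our vocabulary that have no pre-trained GloVe vector
unknown_words = sorted(set(vocab) - set(glove.index_to_key))
coverage = 1 - len(unknown_words) / len(vocab)
print(f'{len(unknown_words)} unknown words, coverage: {coverage:.2%}')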

