
Figure 4.6 Three ways to encode a word. The word "impossible" can be split into characters, each mapped to its character code and one-hot encoded (shape 10 x 128), or treated as a single word and looked up in a 7,261 x 300 embedding matrix: row 3394 is the 1 x 300 word embedding, which is conceptually the product of a 1 x 7,261 word one-hot vector and the embedding matrix.
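The figure's note that the lookup is "conceptually" a multiplication with the embedding matrix can be checked directly in PyTorch. The following is only an illustrative sketch: the sizes (7,261 words, 300 dimensions) and the index 3394 for "impossible" come from the figure, and the matrix here is random rather than learned.

import torch

# Sizes and index taken from figure 4.6; a random stand-in for a learned matrix.
vocab_size, embed_dim = 7261, 300
word_index = 3394
embedding_matrix = torch.randn(vocab_size, embed_dim)

# Direct lookup: take row 3394 of the embedding matrix (shape: 300).
by_lookup = embedding_matrix[word_index]

# Conceptual equivalent: a 1 x 7261 one-hot row times the 7261 x 300 matrix
# (shape: 1 x 300).
one_hot = torch.zeros(1, vocab_size)
one_hot[0, word_index] = 1.0
by_matmul = one_hot @ embedding_matrix

print(torch.allclose(by_lookup, by_matmul[0]))  # True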

For most things, our mapping is just splitting by words. But the rarer parts, the capitalized Impossible and the name Bennet, are composed of subunits.

4.5.4 Text embeddings

One-hot encoding is a very useful technique for representing categorical data in tensors. However, as we have anticipated, one-hot encoding starts to break down when the number of items to encode is effectively unbounded, as with words in a corpus. In just one book, we had over 7,000 items!
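That count can be reproduced roughly with a few lines. The file path and the cleanup rule below are placeholders for the Jane Austen text and preprocessing used earlier in the chapter, not necessarily the exact code from the book.

# Rough sketch of where the 7,000+ figure comes from; path and cleanup
# are placeholders, not the chapter's exact preprocessing.
def clean_words(input_str):
    punctuation = '.,;:"!?_-'
    return [w.strip(punctuation) for w in input_str.lower().replace('\n', ' ').split()]

with open('jane_austen.txt', encoding='utf8') as f:  # placeholder file name
    text = f.read()

vocabulary = sorted(set(clean_words(text)))
print(len(vocabulary))  # on the order of 7,000+ distinct words for a single novel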

We certainly could do some work to deduplicate words, condense alternate spellings, collapse past and future tenses into a single token, and that kind of thing. Still, a general-purpose English-language encoding would be huge. Even worse, every time we encountered a new word, we would have to add a new column to the vector, which would mean adding a new set of weights to the model to account for that new vocabulary entry; that would be painful from a training perspective.

How can we compress our encoding down to a more manageable size and put a cap on the size growth? Well, instead of vectors of many zeros and a single one, we can use vectors of floating-point numbers.
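As a minimal sketch of what such a representation looks like in practice, assuming (as in figure 4.6) a 7,261-word vocabulary mapped to 300-dimensional vectors, PyTorch's nn.Embedding provides exactly this kind of trainable lookup table:

import torch
import torch.nn as nn

# A trainable lookup table: one 300-dimensional float vector per vocabulary
# word, initialized randomly and refined during training.
embedding = nn.Embedding(num_embeddings=7261, embedding_dim=300)

# Looking up a word is just indexing with its integer id;
# 3394 is the index shown for "impossible" in figure 4.6.
word_index = torch.tensor([3394])
vector = embedding(word_index)
print(vector.shape)  # torch.Size([1, 300])

Unlike the one-hot scheme, growing the vocabulary here means adding a 300-float row to the table rather than widening every input vector the model sees.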
