vector, right? And our vocabulary is not even that large! If we were to use a typical English vocabulary, we would need vectors of 100,000 dimensions. Clearly, this isn’t very practical. Nonetheless, the sparse vectors produced by the one-hot encoding are the basis of a fairly basic NLP model: the bag-of-words (BoW).

Bag-of-Words (BoW)

The bag-of-words model is literally a bag of words: It simply sums up the corresponding OHE vectors, completely disregarding any underlying structure or relationships between the words. The resulting vector has only the counts of the words appearing in the text.
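
To make the summing-up step a bit more concrete, here is a minimal sketch using a tiny made-up vocabulary (not the dictionary built in the previous sections), just for illustration:

import numpy as np

# hypothetical toy vocabulary: word -> index
vocab = {'the': 0, 'white': 1, 'rabbit': 2, 'is': 3}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1
    return vec

tokens = ['the', 'white', 'rabbit', 'is', 'rabbit']
# the BoW vector is simply the sum of the one-hot vectors
bow_vector = sum(one_hot(word) for word in tokens)
bow_vector  # array([1., 1., 2., 1.]) -- "rabbit" counted twice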

We don’t have to do the counting manually, though, since Gensim’s Dictionary has a doc2bow() method that does the job for us:

from gensim.utils import simple_preprocess

sentence = 'the white rabbit is a rabbit'
bow_tokens = simple_preprocess(sentence)
bow_tokens

Output

['the', 'white', 'rabbit', 'is', 'rabbit']

# "dictionary" is the Gensim Dictionary built earlier from the corpus
bow = dictionary.doc2bow(bow_tokens)
bow

Output

[(20, 1), (69, 1), (333, 2), (497, 1)]

The word "rabbit" appears twice in the sentence, so its index (333) shows the

corresponding count (2). Also, notice that the fifth word in the original sentence ("

a") did not qualify as a valid token, because it was filtered out by the

simple_preprocess() function for being too short.
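
More precisely, simple_preprocess() only keeps tokens between two and fifteen characters long by default (min_len=2, max_len=15). As a quick sanity check, lowering min_len keeps the article (a sketch, using the same sentence as above):

# tokens of length one are kept once min_len is lowered
simple_preprocess(sentence, min_len=1)
# ['the', 'white', 'rabbit', 'is', 'a', 'rabbit']

And, if you would rather see tokens instead of indices in the BoW pairs, you can look them up in the dictionary itself (a sketch, assuming the same dictionary object used above; the exact ordering depends on which id each token was assigned):

# map each (index, count) pair back to its token
[(dictionary[idx], count) for idx, count in bow]
# e.g., [('is', 1), ('rabbit', 2), ('the', 1), ('white', 1)] -- order follows the ids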

The BoW model is obviously very limited since it represents the frequencies of each word in a piece of text and nothing else. Moreover, representing words using one-hot-encoded vectors also presents severe limitations: Not only do the vectors
