22.02.2024 Views

Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Another option is to use Gensim’s simple_preprocess(), which converts the text

into a list of lowercase tokens, discarding tokens that are either too short (less than

three characters) or too long (more than fifteen characters):

from gensim.utils import simple_preprocess

tokens = simple_preprocess(sentence)

tokens

Output

['following', 'the', 'white', 'rabbit']

"Why are we using Gensim? Can’t we use NLTK to perform word

tokenization?"

Fair enough. NLTK can be used to tokenize words as well, but Gensim cannot be

used to tokenize sentences. Besides, since Gensim has many other interesting tools

for building vocabularies, bag-of-words (BoW) models, and Word2Vec models (we’ll

get to that soon), it makes sense to introduce it as soon as possible.

898 | Chapter 11: Down the Yellow Brick Rabbit Hole

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!