Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)


Word Tokenization

The naïve word tokenization, as we’ve already seen, simply splits a sentence into words using white space as a separator:

sentence = "I'm following the white rabbit"

tokens = sentence.split(' ')

tokens

Output

["I'm", 'following', 'the', 'white', 'rabbit']
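To make the limitation concrete before moving on, here is a small additional illustration (the sentence below is ours, not from the book's code): splitting on white space alone leaves punctuation glued to words and contractions untouched.

```python
# Naive white-space tokenization of a punctuated sentence.
sentence = "Don't panic! The white rabbit is late."
tokens = sentence.split(' ')
print(tokens)
# → ["Don't", 'panic!', 'The', 'white', 'rabbit', 'is', 'late.']
# "!" and "." stay attached to their words, and "Don't" is one token.
```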

But, as we’ve also seen, there are issues with the naïve approach (how to handle contractions, for example). Let’s try using Gensim,[170] a popular library for topic modeling, which offers some out-of-the-box tools for performing word tokenization:

from gensim.parsing.preprocessing import *

preprocess_string(sentence)

Output

['follow', 'white', 'rabbit']

"That doesn’t look right … some words are simply gone!"

Welcome to the world of tokenization :-) It turns out, Gensim’s preprocess_string() applies many filters by default, namely:

• strip_tags() (for removing HTML-like tags between angle brackets)

• strip_punctuation()

• strip_multiple_whitespaces()

• strip_numeric()
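Conceptually, preprocess_string() just runs the string through each filter in sequence and splits the result into tokens. Here is a minimal pure-Python sketch of that pipeline idea (our own simplified stand-ins, not Gensim's actual implementation):

```python
import re

def strip_punctuation(s):
    # Replace punctuation characters with spaces (so "I'm" becomes "I m").
    return re.sub(r"[^\w\s]", " ", s)

def strip_numeric(s):
    # Drop runs of digits.
    return re.sub(r"\d+", "", s)

def strip_multiple_whitespaces(s):
    # Collapse runs of white space into a single space.
    return re.sub(r"\s+", " ", s).strip()

def apply_filters(s, filters):
    # Apply each filter in order, then split the cleaned string into tokens.
    for f in filters:
        s = f(s)
    return s.split()

sentence = "I'm following   the white rabbit 42"
tokens = apply_filters(
    sentence, [strip_punctuation, strip_numeric, strip_multiple_whitespaces]
)
print(tokens)
# → ['I', 'm', 'following', 'the', 'white', 'rabbit']
```

Note that the order of the filters matters: collapsing white space last cleans up the gaps that the earlier filters leave behind.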

The filters above are pretty straightforward, and they are used to remove typical
896 | Chapter 11: Down the Yellow Brick Rabbit Hole
