22.02.2024 Views

Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub


Data Augmentation

Let’s briefly address the topic of augmentation for text data. Although we’re

not actually including it in our pipeline here, it’s worth knowing about some

possibilities and techniques regarding data augmentation.

The most basic technique is called word dropout and, as you probably
guessed, it randomly replaces words with either some other random word
or a special [UNK] token that stands for an unknown word.
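
To make the idea concrete, here is a minimal sketch of word dropout in plain Python. The function name `word_dropout` and its parameters (`p`, `unk_token`, `vocab`, `seed`) are assumptions made for illustration; they are not part of any library, and splitting on whitespace is a stand-in for a real tokenizer.

```python
import random

def word_dropout(text, p=0.2, unk_token='[UNK]', vocab=None, seed=None):
    """Replace each whitespace-separated word with probability p.

    If a vocabulary (list of words) is given, the replacement is a random
    word drawn from it; otherwise the unknown token is used.
    """
    rng = random.Random(seed)  # local generator, so results can be reproduced
    augmented = []
    for word in text.split():
        if rng.random() < p:
            # Drop the original word and put a replacement in its place
            augmented.append(rng.choice(vocab) if vocab else unk_token)
        else:
            augmented.append(word)
    return ' '.join(augmented)

feynman = 'What I cannot create, I do not understand.'
print(word_dropout(feynman, p=0.3, seed=42))
```

Setting `p=0.0` leaves the sentence untouched, while `p=1.0` replaces every word, so typical values would sit somewhere in between.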

It is also possible to replace words with their synonyms, so the meaning of
the text is preserved. One can use WordNet, [173] a lexical database for the
English language, to look up synonyms. Finding good synonyms is not so easy,
though, and this approach is limited to the English language.
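
The mechanics of synonym replacement can be sketched as follows. Note that the `SYNONYMS` dictionary below is a tiny, hand-written stand-in for a real lookup in WordNet, and the function name `replace_synonyms` is also hypothetical; this is only meant to show the shape of the technique.

```python
import random

# Hypothetical, hand-written synonym table standing in for a WordNet lookup
SYNONYMS = {
    'create': ['make', 'build'],
    'understand': ['comprehend', 'grasp'],
}

def replace_synonyms(text, synonyms, p=0.5, seed=None):
    """Replace words that have known synonyms with probability p."""
    rng = random.Random(seed)
    augmented = []
    for token in text.split():
        # Split off trailing punctuation so 'create,' still matches 'create'
        core = token.rstrip('.,;:!?')
        tail = token[len(core):]
        candidates = synonyms.get(core.lower())
        if candidates and rng.random() < p:
            augmented.append(rng.choice(candidates) + tail)
        else:
            augmented.append(token)
    return ' '.join(augmented)

feynman = 'What I cannot create, I do not understand.'
print(replace_synonyms(feynman, SYNONYMS, p=1.0, seed=13))
```

Words without an entry in the table are left alone, which hints at the limitation mentioned above: the quality of the augmentation depends entirely on the coverage of the synonym source.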

To circumvent the limitations of the synonyms approach, it is also possible to

replace words with similar words, numerically speaking. We haven’t yet

talked about word embeddings—numerical representations of words—but

they can be used to identify words that may have a similar meaning. For now,

it suffices to say that there are packages that perform data augmentation on

text data using embeddings, like TextAttack. [174]
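
To give a rough idea of what "similar, numerically speaking" means, here is a minimal sketch that picks the closest word by cosine similarity. The three-dimensional vectors in `TOY_EMBEDDINGS` are made up for illustration (real embeddings are learned and have many more dimensions), and the helper names are assumptions; this is not how any particular package implements it.

```python
import math

# Made-up 3-dimensional vectors, for illustration only
TOY_EMBEDDINGS = {
    'understand': [0.9, 0.1, 0.0],
    'fathom':     [0.8, 0.2, 0.1],
    'create':     [0.1, 0.9, 0.0],
    'build':      [0.2, 0.8, 0.1],
    'banana':     [0.0, 0.1, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, embeddings):
    """Return the other word whose vector is closest to the given word's."""
    target = embeddings[word]
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(target, embeddings[w]))

print(most_similar('understand', TOY_EMBEDDINGS))  # fathom
print(most_similar('create', TOY_EMBEDDINGS))      # build
```

An augmenter built on this idea would swap a word for one of its nearest neighbors, which works for any language that has embeddings available.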

Let’s try augmenting a quote from Richard P. Feynman, as an example:

# !pip install textattack

from textattack.augmentation import EmbeddingAugmenter

augmenter = EmbeddingAugmenter()

feynman = 'What I cannot create, I do not understand.'

for i in range(4):
    print(augmenter.augment(feynman))

Output

['What I cannot create, I do not fathom.']

['What I cannot create, I do not understood.']

['What I notable create, I do not understand.']

['What I significant create, I do not understand.']

