
Recap

In this chapter, we took a deep dive into the natural language processing world. We built our own dataset from scratch using two books, Alice's Adventures in Wonderland and The Wonderful Wizard of Oz, and performed sentence and word tokenization. Then, we built a vocabulary and used it with a tokenizer to generate the primary input of our models: sequences of token IDs. Next, we created numerical representations for our tokens, starting with a basic one-hot encoding and working our way to using word embeddings to train a model for classifying the source of a sentence. We also learned about the limitations of classical embeddings, and the need for contextual word embeddings produced by language models like ELMo and BERT. We got to know our Muppet friend in detail: input embeddings, pre-training tasks, and hidden states (the actual embeddings). We leveraged the HuggingFace library to fine-tune a pre-trained model using a Trainer and to deliver predictions using a pipeline. Lastly, we used the famous GPT-2 model to generate text that, hopefully, looks like it was written by Lewis Carroll. This is what we've covered:

• using NLTK to perform sentence tokenization on our text corpora (sketched after this list)

• converting each book into a CSV file containing one sentence per line

• building a dataset using HuggingFace's Dataset to load the CSV files

• creating new columns in the dataset using map()

• learning about data augmentation for text data

• using Gensim to perform word tokenization

• building a vocabulary and using it to get a token ID for each word

• adding special tokens to the vocabulary, like [UNK] and [PAD]

• loading our own vocabulary into HuggingFace's tokenizer (sketched after this list)

• understanding the output of a tokenizer: input_ids, token_type_ids, and attention_mask

• using the tokenizer to tokenize two sentences as a single input

• creating numerical representations for each token, starting with one-hot encoding (sketched after this list)

• learning about the simplicity and limitations of the bag-of-words (BoW) approach
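
The data preparation steps in the first few items can be summarized in a short sketch. This is a minimal illustration, not the chapter's exact code: the file names (alice.txt, alice.csv), the column names, and the label mapping are assumptions, and only one of the two books is shown.

import csv
import nltk
from nltk.tokenize import sent_tokenize
from datasets import load_dataset

nltk.download('punkt', quiet=True)

# Split one book's raw text into sentences (hypothetical file name; repeat for the other book)
text = open('alice.txt', 'r').read()
sentences = sent_tokenize(text)

# Write one sentence per line to a CSV, tagging each with its source book
with open('alice.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['sentence', 'source'])
    writer.writerows([[s, 'alice'] for s in sentences])

# Load the CSV file(s) into a HuggingFace Dataset
dataset = load_dataset('csv', data_files=['alice.csv'], split='train')

# Create a new column with map(), e.g. a numeric label per source
dataset = dataset.map(lambda row: {'labels': 1 if row['source'] == 'alice' else 0})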
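
The vocabulary and tokenizer steps might look like the sketch below. It assumes Gensim's simple_preprocess and Dictionary for word tokenization and vocabulary building, and a BERT-style (WordPiece) tokenizer to load the vocabulary file; the example sentences and the vocab.txt file name are made up.

from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from transformers import BertTokenizer

sentences = ['Alice was beginning to get very tired.',
             'The Scarecrow sat upon the big throne.']

# Word-tokenize each sentence with Gensim
tokens = [simple_preprocess(s) for s in sentences]

# Build a vocabulary that maps every word to a token ID
vocab = Dictionary(tokens)

# Add special tokens and save the vocabulary to disk, one token per line
special_tokens = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']
with open('vocab.txt', 'w') as f:
    for token in special_tokens + list(vocab.token2id.keys()):
        f.write(f'{token}\n')

# Load our own vocabulary into a HuggingFace (BERT-style) tokenizer
tokenizer = BertTokenizer('vocab.txt')

# Tokenizing two sentences as a single input returns input_ids,
# token_type_ids, and attention_mask
joint = tokenizer(sentences[0], sentences[1])
print(joint.keys())  # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])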
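
A tiny sketch of the numerical representations mentioned in the last two items: one-hot vectors the size of the vocabulary, and the bag-of-words counts obtained by summing them (word order is lost). The token IDs and the vocabulary size of ten are made up.

import torch
import torch.nn.functional as F

# Token IDs for a short sentence, assuming a tiny vocabulary of size 10
token_ids = torch.tensor([3, 7, 1, 7])
vocab_size = 10

# One-hot encoding: each token becomes a vector the size of the vocabulary
one_hot = F.one_hot(token_ids, num_classes=vocab_size)  # shape: (4, 10)

# Bag of words: summing one-hot vectors keeps counts but discards word order
bow = one_hot.sum(dim=0)
print(bow)  # tensor([0, 1, 0, 1, 0, 0, 0, 2, 0, 0])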
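
Finally, a sketch of the fine-tuning and text-generation steps described in the recap paragraph above. It assumes the labeled dataset built earlier and uses DistilBERT and GPT-2 as stand-ins for whichever checkpoints the chapter actually uses; the output directory, batch size, and prompts are arbitrary.

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments, pipeline)

# Fine-tune a pre-trained model on the sentence-classification dataset
# ('dataset' is the labeled HuggingFace Dataset built earlier)
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

tokenized = dataset.map(lambda row: tokenizer(row['sentence'], truncation=True),
                        batched=True)
args = TrainingArguments(output_dir='output', num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  tokenizer=tokenizer)  # passing the tokenizer enables dynamic padding
trainer.train()

# Deliver predictions using a pipeline
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)
print(classifier('Off with their heads!'))

# Generate (hopefully) Lewis-Carroll-like text with GPT-2
generator = pipeline('text-generation', model='gpt2')
print(generator('Alice was beginning to get very tired', max_length=40))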

