
Data Preparation

from transformers import AutoTokenizer

auto_tokenizer = AutoTokenizer.from_pretrained('gpt2')

def tokenize(row):
    return auto_tokenizer(row['sentence'])

tokenized_train_dataset = train_dataset.map(
    tokenize, remove_columns=['source', 'sentence'], batched=True
)
tokenized_test_dataset = test_dataset.map(
    tokenize, remove_columns=['source', 'sentence'], batched=True
)
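In case you want to peek at what a single tokenized sentence looks like, the quick check below (an illustration of the transformers API, not from the book; the sample sentence is made up) shows that the tokenizer returns a dictionary with input_ids and attention_mask:

encoded = auto_tokenizer("Alice was beginning to get very tired.")
print(list(encoded.keys()))  # ['input_ids', 'attention_mask']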

Maybe you’ve already realized that, without padding, the sentences have varied lengths:

list(map(len, tokenized_train_dataset[0:6]['input_ids']))

Output

[9, 28, 20, 9, 34, 29]

These are the first six sentences, and their lengths range from nine to thirty-four tokens.

"Can’t we just pack the sequences using

rnn_utils.pack_sequence() like in Chapter 8?"

You get the gist of it: the general idea is indeed to "pack" sequences together, but in a different way!

"Packed" Dataset

The "packing" is actually simpler now; it is simply concatenating the inputs together

and then chunking them into blocks.
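Just to make the concatenate-and-chunk idea concrete, here is a minimal sketch of such a step using the datasets library's map(); group_texts, the block size of 128, and blocked_train_dataset are illustrative names and values, not necessarily the ones used in the book:

block_size = 128

def group_texts(examples):
    # Concatenate each tokenized column (input_ids, attention_mask) end to end
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated['input_ids'])
    # Drop the leftover tokens so every block has exactly block_size tokens
    total_length = (total_length // block_size) * block_size
    # Slice each concatenated column into consecutive, non-overlapping blocks
    return {
        k: [seq[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, seq in concatenated.items()
    }

blocked_train_dataset = tokenized_train_dataset.map(group_texts, batched=True)

Since every block ends up with the same length, no padding is needed, and the resulting dataset can be batched directly.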

