
Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub


Output

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'input_ids': [101, 1998, 1010, 2061, 2521, 2004, 2027, 2354, 1010,
               2027, 2020, 3243, 2157, 1012, 102, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0],
 'labels': 0,
 'sentence': 'And, so far as they knew, they were quite right.',
 'source': 'wizoz10-1740.txt'}

See? Regular Python lists, not PyTorch tensors. It created new columns (attention_mask and input_ids) and kept the old ones (labels, sentence, and source).
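To make the relationship between the two new columns concrete: the attention_mask is 1 for real tokens and 0 for padding tokens (BERT's [PAD] token has ID 0). A minimal, dependency-free sketch of that padding logic, using a hypothetical pad_and_mask() helper and a shortened list of token IDs:

```python
# Hypothetical helper: pad a list of token IDs to a fixed length and
# build the matching attention mask (1 = real token, 0 = padding).
def pad_and_mask(token_ids, max_length, pad_id=0):
    n_pad = max_length - len(token_ids)
    padded = token_ids + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad
    return padded, mask

# A shortened example: [CLS] (101) ... [SEP] (102), padded to length 8
ids = [101, 1998, 1010, 2061, 102]
padded, mask = pad_and_mask(ids, max_length=8)
# padded -> [101, 1998, 1010, 2061, 102, 0, 0, 0]
# mask   -> [1, 1, 1, 1, 1, 0, 0, 0]
```

This mirrors the pattern in the output above: the mask turns to zeros exactly where the padding IDs begin.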

But we don’t need all these columns for training; we need only the first three. So, let’s tidy it up by using the set_format() method of Dataset:

Data Preparation

tokenized_train_dataset.set_format(
    type='torch',
    columns=['input_ids', 'attention_mask', 'labels']
)
tokenized_test_dataset.set_format(
    type='torch',
    columns=['input_ids', 'attention_mask', 'labels']
)

Not only are we specifying the columns we’re actually interested in, but we’re also telling it to return PyTorch tensors:

tokenized_train_dataset[0]
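Conceptually, set_format() doesn't delete anything: it only changes what a dataset returns when you index into it. A pure-Python sketch of that column-filtering behavior (the conversion of list values to torch tensors is elided here to keep the sketch dependency-free; formatted_getitem() is a hypothetical stand-in, not the library's internal method):

```python
# Sketch of set_format()'s effect on retrieval: restrict each example
# to the requested columns (tensor conversion omitted).
def formatted_getitem(example, columns):
    return {k: v for k, v in example.items() if k in columns}

example = {'input_ids': [101, 102], 'attention_mask': [1, 1],
           'labels': 0,
           'sentence': 'And, so far as they knew, they were quite right.',
           'source': 'wizoz10-1740.txt'}
out = formatted_getitem(
    example, ['input_ids', 'attention_mask', 'labels']
)
# -> only the three training columns remain
```

Calling reset_format() on the real Dataset would bring back all five columns, since the underlying data is untouched.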

Fine-Tuning with HuggingFace | 993
