
Both sequences of token IDs and labels are regular PyTorch tensors now, so we can use the familiar TensorDataset:

Data Preparation

import torch
from torch.utils.data import DataLoader, TensorDataset

# Wrap token IDs and labels so indexing yields a (features, label) tuple
train_tensor_dataset = TensorDataset(train_ids, train_labels)
# Dedicated RNG for shuffling; call generator.manual_seed() for reproducibility
generator = torch.Generator()
train_loader = DataLoader(
    train_tensor_dataset, batch_size=32,
    shuffle=True, generator=generator
)
test_tensor_dataset = TensorDataset(test_ids, test_labels)
test_loader = DataLoader(test_tensor_dataset, batch_size=32)
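If you want to confirm the loaders behave as expected, a quick (hypothetical) peek at the first batch shows the tuple structure:

# Hypothetical sanity check: fetch one batch from the training loader
batch_ids, batch_labels = next(iter(train_loader))
print(batch_ids.shape, batch_labels.shape)
# e.g. torch.Size([32, seq_len]) torch.Size([32])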

"Hold on! Why are we going back to TensorDataset instead of using

HF’s Dataset?"

Well, even though HF’s Dataset was extremely useful for loading and manipulating all the files from our text corpora, and it will surely work seamlessly with HF’s pretrained models, it’s not ideal for our regular, pure PyTorch training routine.

"Why not?"

It boils down to the fact that, while a TensorDataset returns a typical (features, label) tuple, HF’s Dataset always returns a dictionary. So, instead of jumping through hoops to accommodate this difference in their outputs, it’s easier to fall back to the familiar TensorDataset for now.
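To make the difference concrete, here is a minimal sketch (using toy tensors rather than our actual data) contrasting what each dataset returns when indexed:

import torch
from torch.utils.data import TensorDataset
from datasets import Dataset  # HF's datasets library

# Toy data, just to contrast the two return types
ids = torch.tensor([[1, 2, 3], [4, 5, 6]])
labels = torch.tensor([0, 1])

tuple_ds = TensorDataset(ids, labels)
print(tuple_ds[0])  # (tensor([1, 2, 3]), tensor(0)): a (features, label) tuple

dict_ds = Dataset.from_dict({'ids': ids.tolist(), 'labels': labels.tolist()})
print(dict_ds[0])   # {'ids': [1, 2, 3], 'labels': 0}: a dictionary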

We already have token IDs and labels. But we also have to load the pre-trained embeddings that match the IDs produced by the tokenizer.
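As a preview of what that entails, loading pre-trained embeddings usually boils down to building a weight matrix whose row order matches the tokenizer’s vocabulary and handing it to PyTorch’s nn.Embedding.from_pretrained. A minimal sketch, assuming a hypothetical pretrained_weights tensor already aligned with the token IDs:

import torch
import torch.nn as nn

# Hypothetical (vocab_size, embedding_dim) matrix of pre-trained vectors,
# where row i holds the vector for the token mapped to ID i
pretrained_weights = torch.randn(3000, 50)  # placeholder values

# freeze=True keeps the pre-trained vectors fixed during training
embedding_layer = nn.Embedding.from_pretrained(pretrained_weights, freeze=True)

token_ids = torch.tensor([[12, 7, 431]])  # a (1, 3) batch of token IDs
embedded = embedding_layer(token_ids)     # shape: (1, 3, 50)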

