22.02.2024 Views

Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub


The first few sentences all come from Alice’s Adventures in Wonderland because we haven’t shuffled the dataset yet.

Methods

The Dataset also has many methods, like unique(), map(), filter(), shuffle(), and train_test_split() (for a comprehensive list of operations, check HF’s "Processing data in a Dataset" [169]).

We can easily check the unique sources:

dataset.unique('source')

Output

['alice28-1476.txt', 'wizoz10-1740.txt']

We can use map() to create new columns by using a function that returns a dictionary with the new column as key:

Data Preparation

def is_alice_label(row):
    is_alice = int(row['source'] == 'alice28-1476.txt')
    return {'labels': is_alice}

dataset = dataset.map(is_alice_label)

Each element in the dataset is a row corresponding to a dictionary ({'sentence': ..., 'source': ...}, in our case), so the function has access to all columns in a given row. Our is_alice_label() function tests the source column and creates a labels column. There is no need to return the original columns since this is automatically handled by the dataset.
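To make the merging behavior concrete, here is a plain-Python sketch that mimics it on a list of dictionaries. It is only an illustration of the idea, not how the library implements map(). The map_rows() helper and the two example sentences are hypothetical.

```python
def is_alice_label(row):
    # Same function as in the listing above: 1 for Alice, 0 otherwise
    is_alice = int(row['source'] == 'alice28-1476.txt')
    return {'labels': is_alice}

def map_rows(rows, function):
    # Hypothetical helper: merge each original row with the
    # dictionary returned by the function, keeping the original columns
    return [{**row, **function(row)} for row in rows]

# Made-up rows with the same columns as the dataset in the text
rows = [
    {'sentence': 'Down the rabbit hole.', 'source': 'alice28-1476.txt'},
    {'sentence': 'Follow the yellow brick road.', 'source': 'wizoz10-1740.txt'},
]

mapped = map_rows(rows, is_alice_label)
```

Every row in mapped still has its sentence and source keys, plus the new labels key, even though is_alice_label() returns only the new column.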

If we retrieve the third sentence from our dataset once again, the new column will already be there:

dataset[2]

894 | Chapter 11: Down the Yellow Brick Rabbit Hole
