
The last output, attention_mask, works as the source mask we used in the Transformer encoder and indicates the padded positions. In a batch of sentences, for example, we may pad the sequences so they all have the same length:

separate_sentences = tokenizer([sentence1, sentence2], padding=True)
separate_sentences

Output

{'input_ids': [[3, 1219, 5, 229, 200, 1, 2, 0, 0, 0, 0],
               [3, 51, 42, 78, 32, 307, 41, 5, 1, 30, 2]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

The tokenizer received a list of two sentences and took them as two independent inputs, thus producing two sequences of IDs. Moreover, since the padding argument was True, it padded the shortest sequence (five tokens, not counting the special tokens) to match the longest one (nine tokens). Let's convert the IDs back to tokens again:

print(
    tokenizer.convert_ids_to_tokens(
        separate_sentences['input_ids'][0]
    )
)

print(separate_sentences['attention_mask'][0])

Output

['[CLS]', 'follow', 'the', 'white', 'rabbit', '[UNK]', '[SEP]',
 '[PAD]', '[PAD]', '[PAD]', '[PAD]']
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

Each padded element in the sequence has a corresponding zero in the attention mask.
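Since the attention mask flags real tokens with ones and padded positions with zeros, it can be turned directly into the padding mask a Transformer encoder expects. The snippet below is a minimal sketch using PyTorch's built-in nn.TransformerEncoder rather than our own encoder; the vocabulary size, embedding dimension, and layer hyper-parameters are placeholder assumptions.

import torch
import torch.nn as nn

# Minimal sketch: using the attention mask as a padding mask for
# PyTorch's nn.TransformerEncoder (hyper-parameters are placeholders)
input_ids = torch.as_tensor(separate_sentences['input_ids'])
attention_mask = torch.as_tensor(separate_sentences['attention_mask'])
# nn.TransformerEncoder expects True at the positions to be IGNORED
src_key_padding_mask = (attention_mask == 0)

embed = nn.Embedding(num_embeddings=4000, embedding_dim=32, padding_idx=0)
enc_layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=1)

out = encoder(embed(input_ids), src_key_padding_mask=src_key_padding_mask)
print(out.shape)  # torch.Size([2, 11, 32])

If the tokenizer is called with return_tensors='pt', it returns PyTorch tensors directly and the torch.as_tensor() calls above become unnecessary.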

"Then how can I have a batch where each input has two separate

sentences?"

