
Model III — Preprocessed Embeddings

Data Preparation

Before, the features were a sequence of token IDs, which were used to look embeddings up in the embedding layer and return the corresponding bag-of-embeddings (that was a document embedding too, although less sophisticated). Now, we're outsourcing these steps to BERT and getting document embeddings directly from it. It turns out, using a pre-trained BERT model to retrieve document embeddings is a preprocessing step in this setup. Consequently, our model is going to be nothing other than a simple classifier.
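
Since the model boils down to a classifier on top of fixed document embeddings, here is a minimal sketch of what it could look like, assuming BERT base's 768-dimensional embeddings and a single logit for binary classification (the layer sizes are illustrative assumptions, not necessarily the configuration used later in the chapter):

import torch.nn as nn

# hypothetical classifier head over 768-dim document embeddings
# (768 = BERT base hidden size; the hidden width is an assumption)
clf = nn.Sequential(
    nn.Linear(768, 128),
    nn.ReLU(),
    nn.Linear(128, 1),  # single logit for binary classification
)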

"Let the preprocessing begin!"

Maximus Decimus Meridius

The idea is to use get_embeddings() for each and every sentence in our datasets in order to retrieve their corresponding document embeddings. HuggingFace's Dataset allows us to easily do that by using its map() method to generate a new column:

Data Preparation

train_dataset_doc = train_dataset.map(
    lambda row: {'embeddings': get_embeddings(bert_doc,
                                              row['sentence'])}
)
test_dataset_doc = test_dataset.map(
    lambda row: {'embeddings': get_embeddings(bert_doc,
                                              row['sentence'])}
)
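
get_embeddings() was defined earlier in the book and isn't reproduced in this excerpt; a minimal sketch of what it could look like, assuming it mean-pools BERT's last hidden states over the non-padded tokens (the body below is an illustrative assumption, not the book's exact code):

import torch

def get_embeddings(bert_model, sentence):
    # `tokenizer` is assumed to be the matching pre-trained BERT tokenizer
    tokens = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        states = bert_model(**tokens)[0]  # last hidden states
    # zero out padding positions, then average over the sequence dimension
    mask = tokens['attention_mask'].unsqueeze(-1).float()
    return ((states * mask).sum(dim=1) / mask.sum(dim=1)).squeeze()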

Moreover, we need the embeddings to be returned as PyTorch tensors:

Data Preparation

train_dataset_doc.set_format(type='torch',
                             columns=['embeddings', 'labels'])
test_dataset_doc.set_format(type='torch',
                            columns=['embeddings', 'labels'])
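
A quick sanity check: after set_format(), the column comes back as a tensor (the 768-dimensional size assumes BERT base):

# hypothetical check; N is the number of training sentences
print(type(train_dataset_doc['embeddings']))   # <class 'torch.Tensor'>
print(train_dataset_doc['embeddings'].shape)   # e.g. torch.Size([N, 768])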

